5 Practical Techniques for Efficiently Updating Child-Table Data in Pandas: From merge to numpy.where

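In Pandas, updating selected columns of a child table (child_df) from a master table (master_df) is a common task, especially during data cleaning and merging. Below are several efficient techniques covering different scenarios.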

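1. Using `merge` + conditional update

Applies when the master and child tables share a common key and you need to update specific columns of the child table.

Scenario

The master table master_df holds the latest data.
The child table child_df needs some of its columns (such as price and status) updated from the master table.

Method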

import pandas as pd

# Sample data
master_df = pd.DataFrame({
    'id': [1, 2, 3],
    'price': [100, 200, 300],
    'status': ['active', 'inactive', 'active']
})

child_df = pd.DataFrame({
    'id': [1, 2, 4],  # Note: id=4 does not exist in the master table
    'price': [50, 150, 250],
    'status': ['pending', 'pending', 'pending']
})

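# Update only the rows of child_df whose id also appears in master_df
update_mask = child_df['id'].isin(master_df['id'])

# Look up master values for the matching ids. Merging only the key column avoids
# price_x/price_y suffix clashes; a left merge keeps the child's row order.
# Assumes id is unique in master_df.
matched = child_df.loc[update_mask, ['id']].merge(
    master_df[['id', 'price', 'status']], on='id', how='left'
)

# merge resets the index, so assign by position via to_numpy() rather than by label
for col in ['price', 'status']:
    child_df.loc[update_mask, col] = matched[col].to_numpy()

print(child_df)

Output:

   id  price    status
0   1    100    active
1   2    200  inactive
2   4    250   pending

Notes:

The isin mask restricts the update to rows of child_df whose id matches master_df, and merge supplies the new values for those rows.
id=4 is not in the master table, so its row keeps the original values.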
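
A merge-based update quietly relies on each id appearing at most once in the master table; duplicate keys would make the merge return more rows than the child has. The sketch below is one possible way to make that assumption explicit, using merge's `validate` and `indicator` arguments. The sample data is trimmed down for illustration.

```python
import pandas as pd

master_df = pd.DataFrame({'id': [1, 2, 3], 'price': [100, 200, 300]})
child_df = pd.DataFrame({'id': [1, 2, 4], 'price': [50, 150, 250]})

# validate='many_to_one' raises pandas.errors.MergeError if master_df has duplicate ids
merged = child_df[['id']].merge(
    master_df, on='id', how='left', validate='many_to_one', indicator=True
)

# '_merge' is 'both' for rows found in master_df and 'left_only' otherwise
matched = (merged['_merge'] == 'both').to_numpy()

# The merged price column is float because of the NaN for id=4; the matched rows
# contain no NaN, so they can be cast back to the child's dtype before assigning
new_prices = merged.loc[matched, 'price'].astype(child_df['price'].dtype).to_numpy()
child_df.loc[matched, 'price'] = new_prices

print(child_df)
```

The `indicator` column doubles as the update mask here, so a separate `isin` call is not needed.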

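2. Using `map` to update a single column

Applies when the child table needs a single column (such as status) updated based on the master table's key.

Method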

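# Build a lookup dict (master id -> status)
status_map = dict(zip(master_df['id'], master_df['status']))

# Update the status column of the child table
# (assumes child_df has been re-created from the original sample data)
child_df['status'] = child_df['id'].map(status_map).combine_first(child_df['status'])

print(child_df)

Output:

   id  price    status
0   1     50    active
1   2    150  inactive
2   4    250   pending

Notes:

map returns the master status for every id found in master_df and NaN for every other id.
combine_first fills those NaN entries from the original column, so unmatched rows keep their values.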
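
As a variation on the same idea, `map` also accepts a Series, so the intermediate dict can be skipped. This is a minimal sketch, assuming ids are unique in the master table; the name `status_lookup` is made up for illustration.

```python
import pandas as pd

master_df = pd.DataFrame({'id': [1, 2, 3], 'status': ['active', 'inactive', 'active']})
child_df = pd.DataFrame({'id': [1, 2, 4], 'status': ['pending', 'pending', 'pending']})

# A Series indexed by id acts as the lookup table (its index must be unique)
status_lookup = master_df.set_index('id')['status']

# map yields NaN for ids missing from the lookup; fillna restores the child's own value
child_df['status'] = child_df['id'].map(status_lookup).fillna(child_df['status'])

print(child_df)
```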

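3. Using the `update` method (modifies the DataFrame in place)

Applies when the child table's columns should be overwritten by the master table's non-NaN values, with both tables aligned on the same index.

Method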

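# Use the key column (id) as the index of the child table
child_df = child_df.set_index('id')

# Update price and status in place. master_df is indexed by id only for this call,
# so it keeps its id column for later steps (ids must be unique in master_df).
child_df.update(master_df.set_index('id')[['price', 'status']])

# Restore the original index (optional)
child_df = child_df.reset_index()

print(child_df)

Output:

   id  price    status
0   1  100.0    active
1   2  200.0  inactive
2   4  250.0   pending

Notes:

update modifies only the rows of child_df whose id matches master_df, and skips NaN values in the master table.
Both tables must be aligned on the same index, which is what set_index does here.
Depending on the pandas version, price may stay integer and print as 100 rather than 100.0.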
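
`update` also takes an `overwrite` flag. With `overwrite=False`, only cells that are NaN in the child table are filled from the master table, and existing child values are left alone. A small sketch, with a NaN placed in the child data purely for illustration:

```python
import numpy as np
import pandas as pd

master_df = pd.DataFrame({'id': [1, 2, 3], 'price': [100.0, 200.0, 300.0]})
child_df = pd.DataFrame({'id': [1, 2, 4], 'price': [50.0, np.nan, 250.0]})

child_df = child_df.set_index('id')

# overwrite=False: fill only the NaN cells of child_df, keep everything else
child_df.update(master_df.set_index('id'), overwrite=False)

child_df = child_df.reset_index()
print(child_df)
```

Here id=1 keeps its own price of 50.0, id=2 is filled with 200.0 from the master table, and id=4 has no match and stays at 250.0.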

4. Using `combine_first` or `where` for conditional updates

Applies to more complex conditional logic, such as updating only when the master table's value is not NaN.

Method 1: `combine_first`

# Merge the two tables first
merged_df = child_df.merge(master_df, on='id', how='left', suffixes=('_child', '_master'))

# Update price: take the master value unless it is NaN, else keep the child value
merged_df['price'] = merged_df['price_master'].combine_first(merged_df['price_child'])

# Keep only the needed columns (status is left as the child's value)
child_df = merged_df[['id', 'price', 'status_child']].rename(columns={'status_child': 'status'})

print(child_df)
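
To see what "only when the master value is not NaN" means in practice, here is a small sketch with a NaN deliberately placed in the master table. The data is made up for illustration.

```python
import numpy as np
import pandas as pd

master_df = pd.DataFrame({'id': [1, 2, 3], 'price': [100.0, np.nan, 300.0]})
child_df = pd.DataFrame({'id': [1, 2, 4], 'price': [50.0, 150.0, 250.0]})

merged = child_df.merge(master_df, on='id', how='left', suffixes=('_child', '_master'))

# id=1: the master value wins
# id=2: the master value is NaN, so the child value is kept
# id=4: no match in master, so the child value is kept
merged['price'] = merged['price_master'].combine_first(merged['price_child'])

result = merged[['id', 'price']]
print(result)
```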

Method 2: `where`

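# Index the master table by id once (ids must be unique)
master_by_id = master_df.set_index('id')

# Build the updated DataFrame
updated_df = child_df.copy()
for col in ['price', 'status']:
    # Master values in child row order, re-labelled with child_df's index
    # so that where() and the assignment align row by row
    new_vals = pd.Series(
        master_by_id[col].reindex(child_df['id']).to_numpy(),
        index=child_df.index,
    )
    # Take the master value where it exists, otherwise keep the child value
    updated_df[col] = new_vals.where(new_vals.notna(), child_df[col])

print(updated_df)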
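
One thing to watch with the `reindex` + `where` pattern is index alignment: a Series produced by `reindex(child_df['id'])` is labelled by id values, not by the child table's row labels, and pandas aligns on labels when combining or assigning Series. The sketch below shows one way to handle this, using `set_axis` to re-label the looked-up values. The variable names are illustrative.

```python
import pandas as pd

master_df = pd.DataFrame({'id': [1, 2, 3], 'price': [100, 200, 300]})
child_df = pd.DataFrame({'id': [1, 2, 4], 'price': [50, 150, 250]})

# Labelled by id values (1, 2, 4), not by child_df's row labels (0, 1, 2)
by_id = master_df.set_index('id')['price'].reindex(child_df['id'])

# Re-label with child_df's index so each value sits on the row it belongs to
aligned = by_id.set_axis(child_df.index)

# Master value where present, child value otherwise
updated = aligned.where(aligned.notna(), child_df['price'])
print(updated)
```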

5. Using `numpy.where` for conditional updates

Applies when more flexible conditions are needed, such as conditions that combine several columns.

Method

import numpy as np

# Build the update mask
update_mask = child_df['id'].isin(master_df['id'])

# Update price and status
child_df['price'] = np.where(
    update_mask,
    master_df.set_index('id')['price'].reindex(child_df['id']).values,
    child_df['price']
)

child_df['status'] = np.where(
    update_mask,
    master_df.set_index('id')['status'].reindex(child_df['id']).values,
    child_df['status']
)

print(child_df)
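
The mask passed to `np.where` can combine several columns. As a sketch of that idea, the example below updates price only when the id exists in the master table and the child row is still pending. The business rule and the sample data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

master_df = pd.DataFrame({'id': [1, 2, 3], 'price': [100, 200, 300]})
child_df = pd.DataFrame({
    'id': [1, 2, 4],
    'price': [50, 150, 250],
    'status': ['pending', 'active', 'pending'],
})

# Master prices lined up with child rows (NaN where the id has no match)
master_price = master_df.set_index('id')['price'].reindex(child_df['id']).to_numpy()

# Joint condition: id exists in master_df AND the child row is still pending
should_update = (
    child_df['id'].isin(master_df['id']) & (child_df['status'] == 'pending')
).to_numpy()

child_df['price'] = np.where(should_update, master_price, child_df['price'].to_numpy())
print(child_df)
```

Only id=1 satisfies both conditions, so it is the only row whose price changes. The result column is float because the looked-up master prices contain a NaN.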

Summary

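Method	Use case	Characteristics
`merge` + conditional assignment	Updating multiple columns	Flexible, but slightly longer code
`map`	Updating a single column	Concise, well suited to key-value lookups
`update`	Modifying the DataFrame in place	Efficient, requires aligned indexes
`combine_first` / `where`	Conditional updates	Suited to complex logic
`numpy.where`	Multi-condition checks	Flexible, but works on raw arrays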
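
If the same kind of update is needed in several places, the pieces above can be wrapped in a small helper. The function below is a hypothetical sketch (the name `update_from_master` and its signature are not from any library) and assumes the key is unique in the master table.

```python
import pandas as pd

def update_from_master(child, master, key, cols):
    """Return a copy of child with cols replaced by master values where key matches."""
    lookup = master.set_index(key)[cols]
    # Master rows in child row order, re-labelled with the child's index
    aligned = lookup.reindex(child[key]).set_axis(child.index)
    # True for child rows whose key exists in the master table
    matched = child[key].isin(lookup.index)
    out = child.copy()
    for col in cols:
        # Master value on matched rows, original child value elsewhere
        out[col] = aligned[col].where(matched, child[col])
    return out

master_df = pd.DataFrame({
    'id': [1, 2, 3],
    'price': [100, 200, 300],
    'status': ['active', 'inactive', 'active'],
})
child_df = pd.DataFrame({
    'id': [1, 2, 4],
    'price': [50, 150, 250],
    'status': ['pending', 'pending', 'pending'],
})

result = update_from_master(child_df, master_df, 'id', ['price', 'status'])
print(result)
```

Unlike `update`, this sketch returns a new DataFrame and copies master values even when they are NaN, since it keys purely on whether the id matched.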