Detecting and merging cross-page tables in PDFs with Python

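Detecting and merging tables that span several pages of a PDF is a common requirement in data extraction and processing. Because PDF is a presentation format, a table that crosses a page boundary is stored as separate fragments, one per page, and naive extraction loses the link between them. Below I outline a practical approach that uses pdfplumber (good at table detection) as the main tool and camelot as an alternative, combined with custom logic to detect and merge the fragments.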

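1. Core idea

Extract tables page by page: use pdfplumber or camelot to extract the tables on each page.
Detect cross-page tables:
- Compare the column structure (column count, column names, data types) of tables on adjacent pages to decide whether they belong to the same table.
- Use the table's position on the page to infer a break: a table whose last row sits close to the bottom of the page was probably cut by the page boundary.
Merge tables:
- Append the body of the next page's table to the body of the previous page's table.
- Keep a single header (normally the one on the first page; later pages either have no header or repeat it, and a repeated header must be removed).
Handle special cases:
- The header is repeated on each page (deduplicate it).
- The table spans more than two pages (keep merging page after page).
- A page break or title row appears in the middle of the table.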
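
The decision and merge steps above can be sketched without any PDF library, which makes the heuristic easy to test in isolation. This is a minimal sketch under stated assumptions: `Fragment`, `is_continuation` and `merge_fragments` are hypothetical names introduced here, each page contributes at most one fragment, fragments arrive in page order, and coordinates are measured from the top of the page (the convention pdfplumber uses). The additional "starts near the top of the next page" check and the 50-point margin are assumptions to tune per document.

```python
from dataclasses import dataclass
from typing import List, Optional

Row = List[Optional[str]]

@dataclass
class Fragment:
    """One table fragment found on one page (hypothetical container)."""
    rows: List[Row]      # all extracted rows, including a possible header row
    top: float           # distance from the page top to the table top
    bottom: float        # distance from the page top to the table bottom
    page_height: float

def is_continuation(prev: Fragment, curr: Fragment, margin: float = 50.0) -> bool:
    """Heuristic: same column count, prev ends near the page bottom,
    curr starts near the page top."""
    if not prev.rows or not curr.rows:
        return False
    same_width = len(prev.rows[0]) == len(curr.rows[0])
    prev_at_bottom = (prev.page_height - prev.bottom) < margin
    curr_at_top = curr.top < margin
    return same_width and prev_at_bottom and curr_at_top

def merge_fragments(fragments: List[Fragment], margin: float = 50.0) -> List[List[Row]]:
    """Chain consecutive fragments into logical tables (row 0 is the header)."""
    merged: List[List[Row]] = []
    prev: Optional[Fragment] = None
    for frag in fragments:
        if not frag.rows:
            continue
        if prev is not None and is_continuation(prev, frag, margin):
            header = merged[-1][0]
            # Drop the first row only when it repeats the header
            rows = frag.rows[1:] if frag.rows[0] == header else frag.rows
            merged[-1].extend(rows)
        else:
            merged.append(list(frag.rows))
        prev = frag
    return merged
```

Because each fragment is compared only with the one before it, a table spanning three or more pages is merged by the same loop, with no recursion needed.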

2. Environment setup

pip install pdfplumber "camelot-py[cv]" pandas opencv-python

Depending on the version, camelot also needs Ghostscript and Tkinter installed; if installation fails, start with pdfplumber.

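3. Option 1: pdfplumber (recommended, pure Python)

pdfplumber keeps the position of every table on the page, which is what we need to judge whether a table continues across a page break.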

import pdfplumber
import pandas as pd

def extract_and_merge_tables(pdf_path, margin=50):
    all_tables = []
    current_table = None
    prev_columns = None
    prev_near_bottom = False

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # find_tables() returns Table objects that carry both the cells and
            # the bbox, so data and position always refer to the same table
            found = page.find_tables()
            if not found:
                # A page without tables ends any table in progress
                prev_near_bottom = False
                continue

            # Assume the first table on each page is the one we want (adjust as needed)
            table_obj = found[0]
            rows = table_obj.extract()
            if not rows:
                prev_near_bottom = False
                continue

            # bbox is (x0, top, x1, bottom), measured from the top of the page
            _, table_top, _, table_bottom = table_obj.bbox
            # Thresholds are in PDF points; tune them to the document's margins
            is_near_top = table_top < margin
            is_near_bottom = (page.height - table_bottom) < margin

            # Column structure check (simple example: same number of columns)
            first_row = tuple(rows[0])
            columns_match = prev_columns is not None and len(first_row) == len(prev_columns)

            if current_table is not None and columns_match and prev_near_bottom and is_near_top:
                # Same structure, the previous fragment ended near the page bottom
                # and this one starts near the page top: treat it as a continuation.
                # Drop the first row only if it repeats the header.
                body = rows[1:] if first_row == prev_columns else rows
                df = pd.DataFrame(body, columns=list(prev_columns))
                current_table = pd.concat([current_table, df], ignore_index=True)
            else:
                # Not a continuation: save the table in progress and start a new one
                if current_table is not None:
                    all_tables.append(current_table)
                current_table = pd.DataFrame(rows[1:], columns=list(first_row))
                prev_columns = first_row

            # Remember where this fragment ended for the next page's decision
            prev_near_bottom = is_near_bottom

        # Append the last table
        if current_table is not None:
            all_tables.append(current_table)

    return all_tables

# Usage example
tables = extract_and_merge_tables("example.pdf")
for i, tbl in enumerate(tables):
    print(f"Table {i+1}:")
    print(tbl.head())
    print("\n")

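Key points:

**page.find_tables()**: returns the table boundaries, used to check whether a table reaches the bottom of the page.
**columns_match**: compares the column structure of two fragments to decide whether they belong to the same table.
**is_near_bottom**: uses the distance between the table bottom and the page bottom to judge whether the table was cut by the page break.
Header dedup: if the first row of the next page's table is a header (the same values as current_table.columns), that row is skipped instead of being appended as data.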
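
In practice a repeated header rarely matches character for character, because extracted cells can contain line breaks, extra spaces, or `None` for empty cells. The sketch below shows one way to make the comparison tolerant of that. `normalize_cell` and `strip_repeated_header` are hypothetical helper names, and the input is assumed to be the list-of-rows structure that table extraction returns.

```python
from typing import List, Optional

Row = List[Optional[str]]

def normalize_cell(cell: Optional[str]) -> str:
    """Collapse whitespace and treat None as empty, so cells that look
    the same on the page compare equal."""
    return " ".join((cell or "").split())

def strip_repeated_header(rows: List[Row], header: Row) -> List[Row]:
    """Return rows without their first row if that row repeats the header."""
    if rows and [normalize_cell(c) for c in rows[0]] == [normalize_cell(c) for c in header]:
        return rows[1:]
    return rows

# Usage: the continuation page wraps "Unit price" onto two lines
header = ["Item", "Unit price"]
page2_rows = [["Item", "Unit\nprice"], ["Widget", "9.50"]]
body = strip_repeated_header(page2_rows, header)
```

Only the first row is ever checked, so a data row that happens to equal the header further down the page is left untouched.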

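4. Option 2: Camelot (for regular, well-formed tables)

Camelot's stream mode has weaker support for cross-page tables: the fragments come back as separate tables, but they can be merged manually, optionally with the help of page coordinates.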

import camelot
import pandas as pd

def camelot_merge_cross_page(pdf_path, pages='all'):
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor='stream')
    merged_tables = []
    current = None

    for table in tables:
        # table.df uses default integer column labels; a header, if any, is row 0
        df = table.df
        if current is None:
            current = df
        elif len(df.columns) == len(current.columns):
            # Simple column-count match.
            # Drop the first row if it repeats the header row of the table in progress.
            if not df.empty and list(df.iloc[0]) == list(current.iloc[0]):
                df = df.iloc[1:]
            current = pd.concat([current, df], ignore_index=True)
        else:
            merged_tables.append(current)
            current = df
        # A position check could be added here to make the match stricter
    if current is not None:
        merged_tables.append(current)
    return merged_tables

# Usage
tables = camelot_merge_cross_page("example.pdf")

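5. Refinement: position-based cross-page detection

A more robust approach is to check the continuity of the tables' y coordinates across the page break: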

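# pdfplumber exposes each table's top and bottom through its bbox (x0, top, x1, bottom)
import pdfplumber

margin = 50  # in PDF points, tune per document
prev_ends_at_bottom = False
with pdfplumber.open("example.pdf") as pdf:
    for page in pdf.pages:
        page_tables = page.find_tables()
        if page_tables:
            first_top = min(t.bbox[1] for t in page_tables)
            if prev_ends_at_bottom and first_top < margin:
                # Previous page ended with a table at the bottom and this page
                # starts with one at the top: probably the same table continuing
                ...
            last_bottom = max(t.bbox[3] for t in page_tables)
            prev_ends_at_bottom = (page.height - last_bottom) < margin
        else:
            prev_ends_at_bottom = False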

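6. Recommendations for complex multi-page tables

Preprocessing: extract all text and tables with pdfplumber first; if you need full automation, hand-label cross-page samples and train a classifier.
Header dedup: cross-page tables often repeat the header, so check whether the first row matches the existing column names before merging.
Tolerance: allow small differences in column count (for example, a last column split in two by a line wrap) and compare headers with fuzzy matching.
Output: save the merged table as CSV or Excel (writing .xlsx requires openpyxl): merged_df.to_excel("merged_tables.xlsx", index=False)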
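
The tolerance point can be illustrated with the standard library's `difflib`. This is a rough sketch, not a tuned matcher: `headers_similar` is a hypothetical helper, and the `min_ratio=0.85` and `max_col_diff=1` defaults are assumptions. It compares the headers as joined text, so a column that was split in two by a line wrap still produces the same string.

```python
from difflib import SequenceMatcher
from typing import List, Optional

def _header_text(header: List[Optional[str]]) -> str:
    """Join header cells into one lowercase, whitespace-normalized string."""
    return " ".join(" ".join(c or "" for c in header).lower().split())

def headers_similar(a: List[Optional[str]], b: List[Optional[str]],
                    min_ratio: float = 0.85, max_col_diff: int = 1) -> bool:
    """True if two headers look like the same table header, tolerating
    a small difference in column count and minor text differences."""
    if not a or not b:
        return False
    if abs(len(a) - len(b)) > max_col_diff:
        return False
    ratio = SequenceMatcher(None, _header_text(a), _header_text(b)).ratio()
    return ratio >= min_ratio

# Usage: "Unit price" was split into two columns on the second page
page1 = ["Name", "Unit price", "Qty"]
page2 = ["Name", "Unit", "price", "Qty"]
same_table = headers_similar(page1, page2)
```

A match like this only says the fragments probably belong together; the split column still has to be rejoined before the rows can be concatenated.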

7. Complete helper function (enhanced pdfplumber version)

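import pdfplumber
import pandas as pd

def merge_cross_page_tables(pdf_path, threshold=50):
    all_tables = []
    current_table = None
    prev_cols = None
    prev_near_bottom = False

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Table objects give both the cells and the bbox of the same table
            page_tables = page.find_tables()
            if not page_tables:
                # A page without tables ends any table in progress
                prev_near_bottom = False
                continue

            # Take the first table on the page
            table_obj = page_tables[0]
            tbl_data = table_obj.extract()
            if not tbl_data:
                prev_near_bottom = False
                continue

            # Table position: bbox is (x0, top, x1, bottom), measured from the page top
            _, table_top, _, table_bottom = table_obj.bbox
            near_top = table_top < threshold
            near_bottom = (page.height - table_bottom) < threshold

            curr_cols = tuple(tbl_data[0])
            same_width = prev_cols is not None and len(curr_cols) == len(prev_cols)

            if current_table is not None and same_width and prev_near_bottom and near_top:
                # Continuation: drop the first row only when it repeats the header
                body = tbl_data[1:] if curr_cols == prev_cols else tbl_data
                df = pd.DataFrame(body, columns=list(prev_cols))
                current_table = pd.concat([current_table, df], ignore_index=True)
            else:
                # New table: save the one in progress and start over
                if current_table is not None:
                    all_tables.append(current_table)
                current_table = pd.DataFrame(tbl_data[1:], columns=list(curr_cols))
                prev_cols = curr_cols

            # Remember where this fragment ended for the next page's decision
            prev_near_bottom = near_bottom

        if current_table is not None:
            all_tables.append(current_table)

    return all_tables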