Precise Extraction and Structured Output of PDF Information with Python

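PDF is the dominant format for document exchange. Extracting information from it involves four core problems: text extraction, table parsing, key-information extraction, and structured output. This article walks through a complete approach built on the Python ecosystem, from basic extraction to higher-level structuring, with worked examples.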

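I. Core challenges of PDF extraction

PDF is essentially a page description language: content may be stored as text streams, vector graphics, or scanned bitmaps, which creates the following difficulties:

Text extraction: text-based PDFs can be read directly, but scanned documents require OCR;
Layout disorder: multi-column layouts and mixed text and figures scramble the text order;
Table parsing: tables without solid borders (for example, separated by dashed lines) make rows and columns hard to identify;
Key-information location: fields such as "甲方" (Party A) and "乙方" (Party B) in a contract, or "金额" (amount) on an invoice, require semantic understanding.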
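
The first difficulty above implies a routing decision: try direct extraction and fall back to OCR only for pages that have no usable text layer. A minimal sketch of that check follows. The function name `pages_needing_ocr` and the `min_chars` threshold are illustrative assumptions, not part of any library.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based numbers of pages whose extracted text is too short to trust.

    page_texts: one entry per page, e.g. the result of pdfplumber's
    page.extract_text() (may be None or "" for image-only pages).
    min_chars: hypothetical threshold; tune it on your own documents.
    """
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        # Count only non-whitespace characters so blank lines do not count as text
        visible = "".join((text or "").split())
        if len(visible) < min_chars:
            flagged.append(number)
    return flagged


# Synthetic page texts: page 2 has no text layer, page 3 holds only whitespace
sample = ["甲方：XX科技有限公司，乙方：YY贸易有限公司，合同金额见下文。", None, "  \n "]
print(pages_needing_ocr(sample))
```

Pages returned by this check would go through the OCR path; the rest can use direct text extraction.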

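II. Tool selection and core libraries

Choose tools according to the PDF type (text-based or scanned) and the task (extraction, parsing, or structuring):

Scenario	Recommended tools	Notes
Basic extraction from text-based PDFs	`PyPDF2`/`pdfplumber`	Lightweight and easy to use; both extract text streams, and `pdfplumber` adds basic layout analysis (`PyPDF2` has since been succeeded by `pypdf`)
High-precision text and layout extraction	`pdfplumber`	Built on `pdfminer.six`; exposes character coordinates and detects table borders
OCR for scanned documents	`pytesseract` + `pdf2image`	Renders pages to images and recognizes text with Tesseract OCR
Table parsing (complex structures)	`camelot`/`tabula-py`	Built on OpenCV and the Java Tabula library respectively; both can detect borderless tables
Key-information extraction (NLP)	`spaCy`/`transformers` + custom rules	Combines regular expressions with pretrained models for semantic extraction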

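III. Step by step: from extraction to structured output

Step 1: Basic extraction from text-based PDFs (`pdfplumber`)

pdfplumber is a good first choice for text-based PDFs. It extracts text by page or by region and keeps character coordinates, which are useful for layout analysis.

Installation:

pip install pdfplumber

Example: extract text while keeping layout information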

import pdfplumber

def extract_text_with_layout(pdf_path):
    structured_data = {"pages": []}
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            # Extract the page text (multi-column pages may still need re-sorting)
            # Fall back to "" because some pdfplumber versions return None for pages without text
            text = page.extract_text() or ""
            # Character-level records with coordinates (used for layout analysis)
            chars = page.chars
            # Extract tables (automatic detection)
            tables = page.extract_tables()

            structured_data["pages"].append({
                "page_number": page_num,
                "text": text,
                "char_coordinates": [(char["x0"], char["top"], char["text"]) for char in chars],
                "tables": tables
            })
    return structured_data

# Usage example
data = extract_text_with_layout("sample.pdf")
print(f"Page 1 text: {data['pages'][0]['text'][:200]}...")  # print the first 200 characters
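
Region-based extraction can be illustrated on the character records themselves. The sketch below filters records shaped like pdfplumber's `page.chars` entries (dicts with `x0`, `x1`, `top`, `bottom`, `text`) by a bounding box given as `(x0, top, x1, bottom)`. `chars_in_bbox` and `region_text` are hypothetical helpers written for illustration; on a real page, pdfplumber's own `page.crop(bbox)` performs the cropping.

```python
def chars_in_bbox(chars, bbox):
    """Keep character records that lie fully inside bbox = (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    return [
        c for c in chars
        if c["x0"] >= x0 and c["x1"] <= x1 and c["top"] >= top and c["bottom"] <= bottom
    ]


def region_text(chars, bbox):
    """Join the characters inside bbox, ordered top-to-bottom, then left-to-right."""
    inside = sorted(chars_in_bbox(chars, bbox), key=lambda c: (round(c["top"]), c["x0"]))
    return "".join(c["text"] for c in inside)


# Synthetic records standing in for page.chars: a header line and a body line
chars = [
    {"text": "H", "x0": 10, "x1": 18, "top": 10, "bottom": 22},
    {"text": "i", "x0": 18, "x1": 22, "top": 10, "bottom": 22},
    {"text": "o", "x0": 10, "x1": 18, "top": 100, "bottom": 112},
    {"text": "k", "x0": 18, "x1": 26, "top": 100, "bottom": 112},
]
print(region_text(chars, (0, 0, 200, 50)))    # header region only
print(region_text(chars, (0, 90, 200, 120)))  # body region only
```

Working on plain dicts keeps the filtering logic easy to test without a PDF at hand.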

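Step 2: OCR extraction from scanned PDFs (`pytesseract`)

If the PDF is a scan (its pages are images), first convert the pages to images with pdf2image, then recognize the text with Tesseract OCR through pytesseract.

Installation:

pip install pdf2image pytesseract
# The Tesseract engine must be installed separately: https://github.com/tesseract-ocr/tesseract
# pdf2image also needs the Poppler utilities available on the system

Example: OCR extraction from a scan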

from pdf2image import convert_from_path
import pytesseract

def ocr_pdf_to_text(pdf_path, dpi=300):
    # Convert the PDF into a list of images (one per page)
    images = convert_from_path(pdf_path, dpi=dpi)
    full_text = ""
    for i, image in enumerate(images):
        # Run OCR (Chinese requires the chi_sim language data to be installed)
        text = pytesseract.image_to_string(image, lang="chi_sim+eng")
        full_text += f"\n--- Page {i+1} ---\n{text}"
    return full_text

# Usage example (Tesseract must be installed and on the PATH)
text = ocr_pdf_to_text("scanned_sample.pdf")
print(text[:500])  # print the first 500 characters
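
OCR quality varies from page to page, so it can help to keep only confidently recognized words. The sketch below works on a dictionary shaped like the result of `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)`, which contains parallel `text` and `conf` lists. The helper name `confident_words` and the threshold of 60 are assumptions for illustration.

```python
def confident_words(ocr_data, min_conf=60.0):
    """Return the recognized words whose confidence is at least min_conf."""
    words = []
    for text, conf in zip(ocr_data["text"], ocr_data["conf"]):
        # conf may arrive as int, float, or str depending on the pytesseract version;
        # -1 marks layout rows (blocks, lines) that carry no word
        if text.strip() and float(conf) >= min_conf:
            words.append(text)
    return words


# Synthetic data in the same shape as the image_to_data dictionary
sample = {"text": ["", "Invoice", "N0.", "12345"], "conf": ["-1", 96, "41.5", 88.0]}
print(confident_words(sample))
```

Low-confidence words can be logged for manual review rather than silently dropped, depending on how the output is used.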

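Step 3: Table parsing (`camelot`)

For complex tables, camelot offers two modes: lattice detects ruling lines, while stream infers rows and columns from the coordinates of text blocks and therefore also handles borderless tables. Each table is returned per page as a DataFrame and can be exported to JSON; tables that span pages have to be merged afterwards.

Installation:

pip install "camelot-py[cv]"  # pulls in the OpenCV dependency; the quotes keep shells such as zsh from expanding the brackets

Example: extract tables and structure them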

import camelot

def extract_tables_from_pdf(pdf_path, pages="all"):
    # Extract tables (lattice: based on ruling lines; stream: based on text positions)
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor="lattice")  # lattice first, for bordered tables
    structured_tables = []
    for table in tables:
        df = table.df  # pandas DataFrame
        structured_tables.append({
            "page": table.page,
            "rows": df.shape[0],
            "columns": df.shape[1],
            "data": df.to_dict(orient="records"),  # list of row dicts keyed by column index
            "accuracy": table.accuracy  # camelot's accuracy score for the parse (0-100)
        })
    return structured_tables

# Usage example
tables = extract_tables_from_pdf("table_sample.pdf")
for t in tables:
    print(f"Table on page {t['page']} (accuracy {t['accuracy']}%):")
    print(t["data"][:3])  # print the first 3 rows
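
Because lattice finds nothing on borderless tables, a common pattern is to retry with stream. The sketch below shows only that fallback logic. The reader is passed in as the `read_pdf` argument (an assumption made so the example runs without a PDF); in real use you would pass `camelot.read_pdf`, whose `flavor` parameter accepts `"lattice"` and `"stream"`.

```python
def read_tables_with_fallback(pdf_path, read_pdf, pages="all"):
    """Try lattice first; if it finds no tables, retry with stream.

    read_pdf is injected so the logic can be exercised without a PDF;
    in real use pass camelot.read_pdf.
    """
    for flavor in ("lattice", "stream"):
        tables = read_pdf(pdf_path, pages=pages, flavor=flavor)
        if len(tables) > 0:
            return flavor, tables
    return None, []


# Stand-in reader for demonstration: pretends the file only has borderless tables
def fake_read_pdf(path, pages="all", flavor="lattice"):
    return ["table-1"] if flavor == "stream" else []


flavor, tables = read_tables_with_fallback("table_sample.pdf", fake_read_pdf)
print(flavor, len(tables))
```

Returning the flavor alongside the tables makes it easy to record which mode produced each result.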

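Step 4: Key-information extraction and structured output

Once text and tables are extracted, key fields are pulled out with rule matching or NLP models, for example "甲方" (Party A) and "金额" (amount) in a contract, or "税号" (tax ID) on an invoice.

Method 1: regular expressions plus rule matching (suited to documents with a fixed format)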

import re

def extract_contract_info(text):
    # Regex rules (example: Party A, Party B, amount); the patterns match the Chinese field labels
    patterns = {
        "party_a": r"甲方[：:]\s*([^\n]+)",
        "party_b": r"乙方[：:]\s*([^\n]+)",
        "contract_amount": r"金额[：:]\s*([¥￥]?\s*\d+(?:,\d{3})*(?:\.\d{2})?)"
    }
    info = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, text)
        info[key] = match.group(1).strip() if match else "not found"
    return info

# Usage example (text is assumed to be the contract text)
contract_text = """
甲方：XX科技有限公司
乙方：YY贸易有限公司
合同金额：¥1,234,567.89
"""
info = extract_contract_info(contract_text)
print(info)
# Output: {'party_a': 'XX科技有限公司', 'party_b': 'YY贸易有限公司', 'contract_amount': '¥1,234,567.89'}
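
The amount captured by the regular expression is still a display string. For structured output it is usually more useful as a number, so here is a minimal sketch that converts it to `Decimal`. The helper name `parse_amount` is an assumption for illustration.

```python
from decimal import Decimal, InvalidOperation


def parse_amount(raw):
    """Convert a string such as '¥1,234,567.89' to Decimal; return None if it is not numeric."""
    # Drop a leading currency sign, thousands separators, and surrounding spaces
    cleaned = raw.strip().lstrip("¥￥").replace(",", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


print(parse_amount("¥1,234,567.89"))
print(parse_amount("not found"))
```

`Decimal` avoids the rounding surprises of binary floats, which matters for monetary values.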

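Method 2: NLP models (suited to documents without a fixed format)
Use spaCy or a pretrained transformers model such as BERT for named entity recognition (NER) to extract semantic items such as company names and amounts.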

import spacy

# Load the pretrained Chinese model (download first: python -m spacy download zh_core_web_sm)
nlp = spacy.load("zh_core_web_sm")

def extract_entities_with_nlp(text):
    doc = nlp(text)
    entities = {"ORG": [], "MONEY": []}  # ORG: organizations; MONEY: monetary amounts
    for ent in doc.ents:
        if ent.label_ in entities:
            entities[ent.label_].append(ent.text)
    return entities

# Usage example
text = "本合同由ABC科技（甲方）与123贸易（乙方）签订，总金额500万元。"
entities = extract_entities_with_nlp(text)
print(entities)
# Illustrative output; the entities actually found depend on the model and its version:
# {'ORG': ['ABC科技', '123贸易'], 'MONEY': ['500万元']}
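
Amount entities such as "500万元" still need to be normalized before they can be compared or summed. The following is a minimal sketch that handles only plain digits followed by an optional 万 or 亿 unit; the name `normalize_cn_amount` and the supported formats are assumptions for illustration, and amounts written in Chinese numerals are out of scope.

```python
import re
from decimal import Decimal

# Multipliers for the two unit characters this sketch understands
UNITS = {"万": Decimal(10_000), "亿": Decimal(100_000_000)}


def normalize_cn_amount(text):
    """Convert strings like '500万元' or '1200元' to a Decimal number of yuan, else None."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([万亿]?)元?\s*", text)
    if match is None:
        return None
    number, unit = match.groups()
    return Decimal(number) * UNITS.get(unit, Decimal(1))


print(normalize_cn_amount("500万元"))
print(normalize_cn_amount("总金额不详"))
```

Returning `None` for anything the pattern does not cover keeps unparsed amounts visible instead of guessing.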

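Step 5: Structured output (JSON/CSV/database)

Consolidate the extracted information into one structure and write it out as JSON or CSV, or store it directly in a database.

Example: combine everything and write JSON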

import json

def pdf_to_structured_json(pdf_path, output_path):
    # Step 1: extract text and tables
    text_data = extract_text_with_layout(pdf_path)
    table_data = extract_tables_from_pdf(pdf_path)

    # Step 2: extract key information (assuming the document is a contract)
    # 'or ""' guards against pages for which no text was extracted
    contract_text = "\n".join(page["text"] or "" for page in text_data["pages"])
    contract_info = extract_contract_info(contract_text)

    # Step 3: assemble the structured data
    structured_output = {
        "metadata": {
            "source_file": pdf_path,
            "total_pages": len(text_data["pages"])
        },
        "contract_info": contract_info,
        "tables": table_data,
        "full_text": contract_text
    }

    # Write JSON
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(structured_output, f, ensure_ascii=False, indent=2)
    return structured_output

# Usage example
result = pdf_to_structured_json("contract.pdf", "output.json")
print("Structured data saved to output.json")
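
The CSV and database targets mentioned for this step can be sketched with the standard library alone. The example below writes table rows (a list of dicts, as `DataFrame.to_dict(orient="records")` produces) to CSV and stores extracted fields in SQLite. The helper names and the `extracted_fields` table layout are hypothetical choices for illustration.

```python
import csv
import io
import sqlite3


def tables_to_csv(table_rows, fileobj):
    """Write a list of row dicts to fileobj as CSV; return the number of rows written."""
    if not table_rows:
        return 0
    writer = csv.DictWriter(fileobj, fieldnames=list(table_rows[0].keys()))
    writer.writeheader()
    writer.writerows(table_rows)
    return len(table_rows)


def save_fields_to_sqlite(conn, source_file, fields):
    """Store extracted key/value fields in a hypothetical extracted_fields table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extracted_fields (source_file TEXT, name TEXT, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO extracted_fields VALUES (?, ?, ?)",
        [(source_file, name, value) for name, value in fields.items()],
    )
    conn.commit()


# Demonstration with in-memory targets and synthetic data
buffer = io.StringIO()
rows = [{"item": "Service fee", "amount": "1000.00"}, {"item": "Tax", "amount": "60.00"}]
tables_to_csv(rows, buffer)
print(buffer.getvalue())

conn = sqlite3.connect(":memory:")
save_fields_to_sqlite(conn, "contract.pdf", {"amount": "¥1,234,567.89"})
print(conn.execute("SELECT * FROM extracted_fields").fetchall())
```

When writing to a real file rather than a `StringIO`, open it with `newline=""` and `encoding="utf-8"` so the csv module controls line endings.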

IV. Advanced optimization: handling complex scenarios

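1. Correcting reading order in multi-column layouts

Sorting pdfplumber's character coordinates by top and then x0 alone would interleave the columns line by line. Instead, assign each character to a column by its x0, then sort lines by top and characters by x0 within each column. The version below assumes equal-width columns; the added n_columns parameter defaults to 2:

def sort_text_by_reading_order(chars, page_width, n_columns=2):
    # Assumption: columns have equal width, so x0 alone determines the column
    col_width = page_width / n_columns
    columns = [{} for _ in range(n_columns)]
    for char in chars:
        col = max(0, min(int(char["x0"] // col_width), n_columns - 1))
        top = round(char["top"])  # round to absorb small vertical jitter within a line
        columns[col].setdefault(top, []).append((char["x0"], char["text"]))
    # Read columns left to right; within a column, lines top to bottom and characters left to right
    sorted_lines = []
    for lines in columns:
        for top in sorted(lines):
            line_chars = sorted(lines[top], key=lambda c: c[0])
            sorted_lines.append("".join(c[1] for c in line_chars))
    return "\n".join(sorted_lines)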

2. Preprocessing scans before OCR (to improve accuracy)

Use OpenCV to denoise and binarize the image to improve OCR results:

import cv2
import numpy as np
import pytesseract
from pdf2image import convert_from_path

def preprocess_image(image):
    # Convert the PIL image to a grayscale array
    gray = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2GRAY)
    # Denoise first, so that thresholding does not turn noise into hard speckles
    denoised = cv2.fastNlMeansDenoising(gray, None, h=10)
    # Binarize with an adaptive threshold (block size 11, constant 2)
    binary = cv2.adaptiveThreshold(denoised, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)
    return binary

# Preprocess before OCR
images = convert_from_path("scanned_sample.pdf", dpi=300)
processed_images = [preprocess_image(img) for img in images]
text = pytesseract.image_to_string(processed_images[0], lang="chi_sim+eng")

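3. Merging tables that span pages

Decide whether tables on consecutive pages belong together by checking their headers/footers or the consistency of their column names, then merge the data: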

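def merge_cross_page_tables(tables):
    merged_tables = []
    current_table = None
    for table in tables:
        if current_table is not None and is_same_table(current_table, table):
            # Continuation page: skip its repeated header row, then append the remaining rows
            current_table["data"].extend(table["data"][1:])
            current_table["rows"] = len(current_table["data"])
        else:
            if current_table is not None:
                merged_tables.append(current_table)
            # Copy so that the caller's table dicts are not mutated
            current_table = {**table, "data": list(table["data"])}
    if current_table is not None:
        merged_tables.append(current_table)
    return merged_tables

def is_same_table(table1, table2):
    # Simple heuristic: same column count and an identical header row.
    # The dict keys are only column indices, so the header text (e.g. "姓名", "金额")
    # is read from the values of the first row; this assumes the header repeats on each page.
    if table1["columns"] != table2["columns"]:
        return False
    if not table1["data"] or not table2["data"]:
        return False
    return list(table1["data"][0].values()) == list(table2["data"][0].values())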