Python provides a range of tools and techniques for scraping web data quickly and accurately. The following is a systematic walk-through of the options:
I. Choosing the Basic Tools
1. Request libraries
- Requests (recommended): a simple, easy-to-use HTTP library
import requests
response = requests.get('https://example.com')
print(response.text)
- aiohttp: asynchronous HTTP requests, well suited to large-scale scraping (note that async with must run inside a coroutine)
import asyncio
import aiohttp

async def fetch_page():
    async with aiohttp.ClientSession() as session:
        async with session.get('https://example.com') as response:
            print(await response.text())

asyncio.run(fetch_page())
2. Parsing libraries
- BeautifulSoup (bs4): structured parsing of HTML/XML
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
titles = soup.find_all('h2', class_='title')
- lxml: a high-performance parser with XPath support
from lxml import etree
tree = etree.HTML(response.text)
titles = tree.xpath('//h2[@class="title"]/text()')
- PyQuery: jQuery-like syntax
from pyquery import PyQuery as pq
doc = pq(response.text)
titles = doc('.title').text()
II. Efficient Scraping Strategies
1. Concurrent/asynchronous fetching
import asyncio
import aiohttp

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()

async def main():
    urls = ['https://example.com/page1', 'https://example.com/page2']
    tasks = [fetch(url) for url in urls]
    results = await asyncio.gather(*tasks)
    # process results...

asyncio.run(main())
2. Distributed crawler frameworks
- Scrapy: a full-featured crawling framework (a minimal spider sketch follows this list)
scrapy startproject myproject
- Feapder: a newer distributed crawler framework
pip install feapder
feapder create -p my_spider
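To show what a Scrapy project actually runs, here is a minimal spider sketch; the spider name, start URL, and the h2.title selector are illustrative assumptions rather than a real site's markup.
import scrapy

class TitleSpider(scrapy.Spider):
    name = "titles"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # yield one item per matched heading
        for title in response.css("h2.title::text").getall():
            yield {"title": title}
Dropped into a project created with scrapy startproject, it runs with scrapy crawl titles, and Scrapy takes care of scheduling, retries, and export (e.g. -o titles.json).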
3. Incremental crawling
- Use Redis to record URLs that have already been crawled (see the deduplication sketch after this list)
- Set a reasonable crawl interval (roughly 1-5 seconds per page)
- Support resuming a crawl from where it left off
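A minimal sketch of Redis-based URL deduplication, assuming a local Redis server and the redis-py package; the key name crawled_urls is an illustrative choice.
import redis

r = redis.Redis(host='localhost', port=6379, db=0)

def should_crawl(url: str) -> bool:
    # sadd returns 1 if the URL was newly added, 0 if it was already in the set
    return r.sadd('crawled_urls', url) == 1

for url in ['https://example.com/page1', 'https://example.com/page1']:
    if should_crawl(url):
        print('crawl:', url)  # fetch and parse here
    else:
        print('skip, already crawled:', url)
Because the set survives restarts, the same check also gives you basic resume-after-interruption behaviour.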
III. Dealing with Anti-Scraping Measures
1. Request header spoofing
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...',
    'Referer': 'https://www.google.com/',
    'Accept-Language': 'en-US,en;q=0.9',
}
response = requests.get(url, headers=headers)
2. IP proxy pool (a rotation sketch follows the snippet below)
proxies = {
    'http': 'http://your_proxy_ip:port',
    'https': 'https://your_proxy_ip:port',
}
response = requests.get(url, proxies=proxies)
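The snippet above uses one fixed proxy; a rough sketch of rotating through a small pool might look like this (the proxy addresses are placeholders for your own pool).
import random
import requests

PROXY_POOL = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
]

def get_with_proxy(url):
    # pick a random proxy from the pool for each request
    proxy = random.choice(PROXY_POOL)
    proxies = {'http': proxy, 'https': proxy}
    return requests.get(url, proxies=proxies, timeout=10)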
3. Request rate control
import time
import random

for url in urls:
    # random delay of 1-3 seconds between requests
    time.sleep(random.uniform(1, 3))
    # fetching logic...
4. CAPTCHA handling
- Use an OCR library such as Tesseract (see the sketch after this list)
- Use a third-party CAPTCHA-solving service (e.g. 2Captcha)
- Enter them manually (workable for small-scale crawls)
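As a rough illustration of the OCR route, the following sketch uses pytesseract; it assumes Tesseract is installed locally, only handles simple undistorted image CAPTCHAs, and captcha.png is a placeholder file name.
from PIL import Image
import pytesseract

image = Image.open('captcha.png')
# convert to grayscale to reduce noise before recognition
text = pytesseract.image_to_string(image.convert('L')).strip()
print('recognized text:', text)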
IV. Data Storage Options
1. Structured data
import pandas as pd
data = {'title': [...], 'content': [...]}
df = pd.DataFrame(data)
df.to_csv('output.csv', index=False)
# or write to a database instead
import sqlite3
conn = sqlite3.connect('data.db')
df.to_sql('table_name', conn, if_exists='replace', index=False)
2. Unstructured data
import json

with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4)
V. Advanced Techniques
1. Scraping dynamic pages
- Selenium: drive a real browser
from selenium import webdriver
driver = webdriver.Chrome()
driver.get(url)
html = driver.page_source
- Playwright: a more modern headless-browser option
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url)
    html = page.content()
- API reverse engineering: call the site's underlying API directly (recommended; see the sketch below)
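As a sketch of the API route: once the browser's network panel reveals a JSON endpoint, it can usually be called directly with requests. The endpoint, query parameters, and response fields below are hypothetical and must be replaced with what you find for the real site.
import requests

params = {'page': 1, 'page_size': 20}
headers = {'User-Agent': 'Mozilla/5.0 ...'}
resp = requests.get('https://example.com/api/articles', params=params,
                    headers=headers, timeout=10)
resp.raise_for_status()
# iterate over whatever list the hypothetical endpoint returns
for item in resp.json().get('results', []):
    print(item.get('title'))
This avoids rendering the page at all, which is usually faster and more stable than browser automation.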
2. Data cleaning
import re
from datetime import datetime

def clean_text(text):
    # strip HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # strip special characters
    text = re.sub(r'[^\w\s]', '', text)
    return text.strip()

def parse_date(date_str):
    # try several common date formats
    for fmt in ('%Y-%m-%d', '%d/%m/%Y', '%B %d, %Y'):
        try:
            return datetime.strptime(date_str, fmt).date()
        except ValueError:
            continue
    return None
VI. Monitoring and Maintenance
- Logging
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    filename='spider.log'
)
logging.info('Spider started')
- Exception handling
try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise on HTTP error status codes
except requests.exceptions.RequestException as e:
    logging.error(f"Request failed: {e}")
    # retry logic or record the failed URL (a retry sketch follows this list)
- Monitoring
  - Use Prometheus + Grafana to track crawler health
  - Set up alerting (e.g. when the failure rate exceeds a threshold)
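As flagged in the exception-handling snippet, a minimal retry sketch with exponential backoff and logging might look like this; the retry count and delays are illustrative defaults.
import logging
import time
import requests

def fetch_with_retry(url, max_retries=3):
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException as e:
            logging.warning("Attempt %d failed for %s: %s", attempt, url, e)
            time.sleep(2 ** attempt)  # back off: 2s, 4s, 8s, ...
    logging.error("Giving up on %s after %d attempts", url, max_retries)
    return None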
VII. Legal and Ethical Considerations
- Respect the robots.txt protocol (see the sketch after this list)
- Keep the crawl rate reasonable
- Respect each site's copyright notices
- Do not use scraped data for unfair commercial competition
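A small sketch of checking robots.txt with the standard library before fetching a page; the URL and user-agent string are placeholders.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser('https://example.com/robots.txt')
rp.read()
if rp.can_fetch('MyCrawler/1.0', 'https://example.com/some/page'):
    print('allowed to fetch')
else:
    print('disallowed by robots.txt')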
With the methods above you can build an efficient, accurate, and compliant web scraping system. Choose the combination of tools and techniques that fits your specific needs, and keep an eye on the balance between speed and stability.