Scraping HTML Page Data and Visualizing It with Python

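Scraping data from an HTML page with Python and analyzing it visually is a common task. It usually involves four steps: fetching the data, cleaning it, analyzing it, and visualizing it. The walkthrough below shows a basic version of that workflow.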

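Required libraries

requests or httpx: sends HTTP requests and retrieves the HTML page.
BeautifulSoup: parses the HTML page (installed as beautifulsoup4, imported as bs4).
pandas: cleans and processes the data.
matplotlib or seaborn: visualizes the data.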
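
Before running the steps below, it can help to confirm that these libraries are importable. The following is a minimal sketch using only the standard library; the helper name missing_modules and the REQUIRED_MODULES list are illustrative choices, not part of any of the libraries above.

```python
from importlib.util import find_spec

# Import names used in this walkthrough. BeautifulSoup is imported as "bs4".
REQUIRED_MODULES = ["requests", "bs4", "pandas", "matplotlib", "seaborn"]


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

Any module reported as missing can usually be installed with pip before continuing.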

Example workflow

1. Fetching the data

Suppose we want to scrape a simple HTML page, for example one that contains a table.

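import sys

import requests
from bs4 import BeautifulSoup

# Send the HTTP request (the timeout keeps the script from hanging indefinitely)
url = 'http://example.com/some-page-with-table.html'
response = requests.get(url, timeout=10)

# Check whether the request succeeded
if response.status_code == 200:
    html_content = response.text
else:
    print(f"Failed to retrieve the page (status {response.status_code})")
    sys.exit(1)

# Parse the HTML page
soup = BeautifulSoup(html_content, 'html.parser')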
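
The fetch-and-parse step can also be wrapped in a reusable function. The sketch below is one possible shape, assuming a hypothetical helper name fetch_soup and a default timeout of 10 seconds; it raises an exception on HTTP errors instead of printing and exiting, which tends to be easier to handle in larger scripts.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, session=None, timeout=10):
    """Download a page and return it parsed as a BeautifulSoup object.

    `session` is optional; passing one reuses connections across calls.
    """
    http = session or requests.Session()
    response = http.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return BeautifulSoup(response.text, "html.parser")


# Example call (performs a real network request, so it is left commented out):
# soup = fetch_soup("http://example.com/some-page-with-table.html")
```

Accepting the session as a parameter also makes the function straightforward to exercise without network access, since a stand-in object with a get method can be passed instead.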

2. Extracting the data

Suppose we want to extract the data from a table.

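import pandas as pd

# Find the first table on the page
table = soup.find('table')
if table is None:
    raise ValueError("No <table> element found on the page")

# Extract the column headers
headers = []
for th in table.find_all('th'):
    headers.append(th.text.strip())

# Extract the table rows
rows = []
for tr in table.find_all('tr')[1:]:  # skip the header row
    cells = tr.find_all('td')
    if not cells:  # skip rows without data cells
        continue
    row = [cell.text.strip() for cell in cells]
    rows.append(row)

# Create the DataFrame
df = pd.DataFrame(rows, columns=headers)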
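
To try the extraction logic without a network connection, the same idea can be run against an inline HTML string. The markup in SAMPLE_HTML is made up for illustration, and table_to_dataframe is a hypothetical helper name; the sketch assumes the first row of the table holds the headers.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>Value</th></tr>
  <tr><td>2024-01-01</td><td>10</td></tr>
  <tr><td>2024-01-02</td><td>n/a</td></tr>
  <tr><td>2024-01-03</td><td>12.5</td></tr>
</table>
"""


def table_to_dataframe(table):
    """Convert a <table> tag into a DataFrame of strings.

    Assumes the first <tr> contains the <th> header cells.
    """
    rows = table.find_all("tr")
    headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = []
    for tr in rows[1:]:
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == len(headers):  # skip empty or malformed rows
            records.append(cells)
    return pd.DataFrame(records, columns=headers)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
df = table_to_dataframe(soup.find("table"))
print(df)
```

Every cell is still a string at this point, including the "n/a" entry, which is why a cleaning step follows.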

3. Cleaning the data

Clean the data as needed, for example by converting data types and handling missing values.

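# Assume the table has a 'Date' column holding dates and a 'Value' column holding numbers
df['Date'] = pd.to_datetime(df['Date'])  # convert to datetime
df['Value'] = pd.to_numeric(df['Value'], errors='coerce')  # convert to numeric; unparseable values become NaN

# Handle missing values (for example, fill with 0 or drop the rows)
df['Value'] = df['Value'].fillna(0)  # or: df = df.dropna(subset=['Value'])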
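
The effect of coercion is easier to see on a small, made-up dataset. The sketch below assumes the same 'Date' and 'Value' column names and deliberately includes one unparseable entry in each column; whether to fill or drop missing values depends on the data, so both are shown.

```python
import pandas as pd

# Made-up raw values, all strings, as they would come out of an HTML table.
raw = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-02", "not a date", "2024-01-04"],
    "Value": ["10", "n/a", "7", "12.5"],
})

clean = raw.copy()
clean["Date"] = pd.to_datetime(clean["Date"], errors="coerce")   # bad dates become NaT
clean["Value"] = pd.to_numeric(clean["Value"], errors="coerce")  # bad numbers become NaN

# Count how many values failed to parse in each column.
print(clean.isna().sum())

# Fill missing numbers with 0 (one option among several).
clean["Value"] = clean["Value"].fillna(0)

# Drop rows without a usable date, since they cannot be placed on a time axis.
clean = clean.dropna(subset=["Date"]).reset_index(drop=True)
print(clean)
```

Filling with 0 can distort averages and plots, so dropping the affected rows is often the safer choice when a missing value does not really mean zero.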

4. Analyzing and visualizing the data

Use matplotlib or seaborn to visualize the data.

import matplotlib.pyplot as plt
import seaborn as sns

# Apply the seaborn theme
sns.set_theme(style="whitegrid")

# Plot the time series
plt.figure(figsize=(10, 6))
sns.lineplot(data=df, x='Date', y='Value')
plt.title('Value Over Time')
plt.xlabel('Date')
plt.ylabel('Value')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
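
For the analysis half of this step, a few pandas operations often go a long way before any plotting. The sketch below uses made-up values and assumes a cleaned DataFrame with a datetime 'Date' column and a numeric 'Value' column; it computes summary statistics, a per-month mean, and the row with the largest value.

```python
import pandas as pd

# Made-up cleaned data with a datetime "Date" column and a numeric "Value" column.
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-17", "2024-03-01"]
    ),
    "Value": [10.0, 14.0, 9.0, 11.0, 20.0],
})

# Overall summary statistics (count, mean, std, min, quartiles, max).
summary = df["Value"].describe()

# Mean value per calendar month.
monthly_mean = df.groupby(df["Date"].dt.to_period("M"))["Value"].mean()

# Row containing the largest value.
peak = df.loc[df["Value"].idxmax()]

print(summary)
print(monthly_mean)
print("Peak:", peak["Date"].date(), peak["Value"])
```

Aggregates such as monthly_mean can be passed to the same plotting calls as the raw series when the original data is too noisy to read.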