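A Practical Guide to Real-Time Nginx Request Monitoring

In modern web architectures, real-time visibility into Nginx request traffic is essential for performance tuning, troubleshooting, and security. This guide walks through a complete monitoring setup covering collection, transport, storage, visualization, and alerting, with code and configuration examples that can be adapted for deployment.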

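I. Core Metrics and Approach Selection

1.1 Key Metrics

Category	Core metrics	Description
Connection state	Active connections, Reading, Writing, Waiting	Reflects how much concurrency Nginx is handling
Request throughput	Requests/sec, Total requests	Baseline for QPS/TPS
Request latency	Request time, Upstream response time	Measures response performance
Status code distribution	2xx, 3xx, 4xx, 5xx counts	Indicates the error rate and whether requests are handled correctly
Upstream services	Upstream health, Response time, Failures	Tracks backend health
Bandwidth	Bytes sent/received per sec	Network resource usage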

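1.2 Comparison of Approaches

Approach	How it works	Pros	Cons	Best suited for
Stub Status	Built-in Nginx module that exposes a basic status page	Simple, no dependencies, low overhead	Few metrics, no per-dimension breakdown	Quick view of connection counts and QPS
Access Log	Parses the Nginx access log	Most complete data (IP, URL, UA, latency, etc.)	Requires log shipping; parsing is costly	In-depth business analysis and auditing
OpenTelemetry	Injects tracing and collects metrics	Modern, standardized, rich ecosystem	Requires changes to the application or an Nginx plugin	End-to-end observability for microservices
Commercial APM	New Relic, Datadog, etc.	Works out of the box, feature-rich	High cost, data privacy concerns	Paid enterprise monitoring

Recommended combination:
Stub Status (real-time monitoring) + Access Log (in-depth analysis) + Prometheus/Grafana (visualization), which balances immediacy, depth, and cost.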

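II. Data Collection Layer

2.1 Option 1: Nginx Stub Status Module (Required)

Step 1: Enable the Stub Status module

Confirm that Nginx was built with --with-http_stub_status_module (check the output of nginx -V).
Add a dedicated server block to the http context of nginx.conf:

server {
    listen 8080; # Dedicated monitoring port, kept separate from production traffic
    location /nginx_status {
        stub_status on;
        access_log off; # Do not log status-page requests, to keep the access log clean
        allow 127.0.0.1; # Allow only local access; add internal IPs or subnets as needed
        deny all;
    }
}

Test the configuration and reload Nginx: nginx -t && nginx -s reload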

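Step 2: Verify the status page

Request http://127.0.0.1:8080/nginx_status from the Nginx host itself, for example with curl (the allow rule above admits only 127.0.0.1; other clients need their own allow entry). Sample output: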

Active connections: 291 
server accepts handled requests
 16630948 16630948 31070465 
Reading: 6 Writing: 179 Waiting: 106

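Active connections: current number of active client connections, including those in the Waiting state
accepts/handled/requests: total accepted connections / total handled connections / total client requests (handled normally equals accepts unless a resource limit such as worker_connections was reached)
Reading/Writing/Waiting: connections where Nginx is reading the request header / writing the response / idle and waiting for the next request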
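The counters on the status page also yield two derived values that are often worth watching: dropped connections (accepts minus handled) and the average number of requests served per connection (requests divided by handled), which hints at how much keep-alive reuse is happening. The following is a minimal sketch of that arithmetic; the function names parse_stub_status and derived_values are illustrative and not part of any library.

```python
def parse_stub_status(text):
    """Parse the four-line stub_status output into a dict of integers."""
    lines = [line.strip() for line in text.strip().splitlines()]
    # Line 1: "Active connections: 291"
    active = int(lines[0].split(":")[1])
    # Line 2 is the header "server accepts handled requests"; line 3 holds the values
    accepts, handled, requests_total = (int(v) for v in lines[2].split())
    # Line 4: "Reading: 6 Writing: 179 Waiting: 106"
    fields = lines[3].split()
    return {
        "active": active,
        "accepts": accepts,
        "handled": handled,
        "requests": requests_total,
        "reading": int(fields[1]),
        "writing": int(fields[3]),
        "waiting": int(fields[5]),
    }


def derived_values(stats):
    """Return (dropped connections, average requests per handled connection)."""
    dropped = stats["accepts"] - stats["handled"]
    per_conn = stats["requests"] / stats["handled"] if stats["handled"] else 0.0
    return dropped, per_conn


# The sample output shown above
SAMPLE = """Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
"""

if __name__ == "__main__":
    dropped, per_conn = derived_values(parse_stub_status(SAMPLE))
    print(f"dropped={dropped} requests_per_connection={per_conn:.2f}")
```

For the sample output, no connections were dropped and each connection served about 1.87 requests on average.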

Step 3: Exporter script for Prometheus

Create nginx_status_exporter.py (a Python example) that periodically scrapes the status page and exposes the values as Prometheus metrics:

from prometheus_client import start_http_server, Gauge
import requests
from time import sleep

STATUS_URL = 'http://localhost:8080/nginx_status'

# Prometheus metric definitions
NGINX_ACTIVE_CONN = Gauge('nginx_active_connections', 'Active connections')
NGINX_READING = Gauge('nginx_reading', 'Reading connections')
NGINX_WRITING = Gauge('nginx_writing', 'Writing connections')
NGINX_WAITING = Gauge('nginx_waiting', 'Waiting connections')
# The three totals are cumulative in Nginx; they are Gauges here because the
# script sets the absolute value read from the status page on every scrape
NGINX_ACCEPTS = Gauge('nginx_accepts_total', 'Total accepted connections')
NGINX_HANDLED = Gauge('nginx_handled_total', 'Total handled connections')
NGINX_REQUESTS = Gauge('nginx_requests_total', 'Total requests')

def scrape_nginx_status():
    try:
        resp = requests.get(STATUS_URL, timeout=5)
        resp.raise_for_status()
        lines = resp.text.splitlines()
        # Line 1: "Active connections: 291"
        active = int(lines[0].split(':')[1])
        # Line 2 is the header "server accepts handled requests"; line 3 holds the values.
        # The local name must not be "requests", or it would shadow the imported module.
        accepts, handled, total_requests = map(int, lines[2].split())
        # Line 4: "Reading: 6 Writing: 179 Waiting: 106"
        fields = lines[3].split()
        reading, writing, waiting = int(fields[1]), int(fields[3]), int(fields[5])

        # Update the metrics
        NGINX_ACTIVE_CONN.set(active)
        NGINX_READING.set(reading)
        NGINX_WRITING.set(writing)
        NGINX_WAITING.set(waiting)
        NGINX_ACCEPTS.set(accepts)
        NGINX_HANDLED.set(handled)
        NGINX_REQUESTS.set(total_requests)
    except Exception as e:
        print(f"Scrape error: {e}")

if __name__ == '__main__':
    start_http_server(9113)  # Port on which the metrics are exposed
    while True:
        scrape_nginx_status()
        sleep(10)  # Scrape every 10 seconds

Once the script is running, configure Prometheus to scrape http://exporter_host:9113/metrics.
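Because the status page exposes only cumulative totals, a per-second figure such as QPS has to be derived from the difference between two samples. The sketch below shows the basic idea behind that calculation, including the case where the total goes backwards after an Nginx restart. It is a simplification: Prometheus's own rate() works over all samples in a window and also extrapolates to the window boundaries. The function name per_second_rate is hypothetical.

```python
def per_second_rate(prev_value, curr_value, interval_seconds):
    """Average per-second increase of a cumulative counter between two samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_value - prev_value
    if delta < 0:
        # The counter went backwards, so it was reset (for example by a restart).
        # Count only what has accumulated since the reset.
        delta = curr_value
    return delta / interval_seconds


if __name__ == "__main__":
    # Two samples of the total request count taken 10 seconds apart
    print(per_second_rate(1000, 1600, 10))
    # A sample pair spanning a counter reset
    print(per_second_rate(5000, 200, 10))
```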

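2.2 Option 2: In-Depth Access Log Parsing (Recommended)

Step 1: Extend the Nginx access log format

Define a custom log format that records the key measurements (request time, status code, upstream response time, and so on):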

http {
    log_format timed_combined '$remote_addr - $remote_user [$time_local] '
                            '"$request" $status $body_bytes_sent '
                            '"$http_referer" "$http_user_agent" '
                            '$request_time $upstream_response_time $upstream_addr';
    access_log /var/log/nginx/access.log timed_combined;
}

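$request_time: total request processing time (seconds, millisecond resolution)
$upstream_response_time: time spent receiving the response from the upstream server (seconds, millisecond resolution); logged as "-" when no upstream was contacted
$upstream_addr: address of the upstream server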
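To make the field layout concrete, here is a minimal Python sketch that parses one line written in the timed_combined format. It assumes at most one upstream per request (with retries, Nginx writes several comma-separated values, which this sketch does not split), and the sample log line is made up for illustration.

```python
import re

# One named group per field of the timed_combined format, in log order
LOG_PATTERN = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)" '
    r'(?P<request_time>\S+) (?P<upstream_response_time>\S+) (?P<upstream_addr>.+)'
)


def _to_float(value):
    """Convert a numeric field to float; Nginx logs '-' when the value is empty."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["body_bytes_sent"] = int(record["body_bytes_sent"])
    record["request_time"] = _to_float(record["request_time"])
    record["upstream_response_time"] = _to_float(record["upstream_response_time"])
    return record


# Made-up sample line, not taken from a real log
SAMPLE_LINE = (
    '203.0.113.7 - - [10/Oct/2024:13:55:36 +0800] "GET /api/users HTTP/1.1" '
    '200 512 "-" "curl/8.0.1" 0.123 0.120 10.0.0.5:8080'
)

if __name__ == "__main__":
    print(parse_line(SAMPLE_LINE))
```

Subtracting upstream_response_time from request_time gives a rough view of the time spent outside the backend, such as reading the request from a slow client or sending the response to it.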

Step 2: Collect logs with Filebeat

Filebeat is a lightweight log shipper well suited to this job. Configure filebeat.yml:

filebeat.inputs:
- type: log
  paths:
    - /var/log/nginx/access.log
  fields:
    service: nginx
output.logstash:
  hosts: ["logstash:5044"]  # Ship to Logstash; to write to Elasticsearch directly, use output.elasticsearch instead

Step 3: Parse logs with Logstash (optional)

For more involved processing (GeoIP lookup, field extraction), add Logstash filters:

input { beats { port => 5044 } }
filter {
  grok {
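    # COMBINEDAPACHELOG covers the standard combined fields; the suffix captures the
    # three extra fields that timed_combined appends (single upstream per request assumed)
    match => { "message" => "%{COMBINEDAPACHELOG} %{NUMBER:request_time:float} (?:%{NUMBER:upstream_response_time:float}|-) %{GREEDYDATA:upstream_addr}" }
  }
  # The field names timestamp and clientip assume the legacy (non-ECS) grok patterns
  date { match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ] }
  geoip { source => "clientip" } # Resolve the client IP to a geographic location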
}
output { elasticsearch { hosts => ["elasticsearch:9200"] } }

Step 4: Lightweight alternative with Prometheus + Loki

If you prefer the Prometheus ecosystem, Promtail (Loki's agent) can ship the logs and derive metrics from them:

# promtail-config.yaml
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: nginx_access
  static_configs:
  - targets: [localhost]
    labels:
      job: nginx
      __path__: /var/log/nginx/access.log
  pipeline_stages:
    - regex:
        expression: '^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)" (?P<request_time>\S+) (?P<upstream_time>\S+) (?P<upstream_addr>\S+)'
    # Promote the extracted values to labels so the metrics below carry them
    - labels:
        status:
        method:
    # Metrics are served on Promtail's own /metrics endpoint (port 9080 here) and,
    # by default, are exposed with the promtail_custom_ prefix
    - metrics:
        nginx_requests_total:
          type: Counter
          description: "Total Nginx Requests"
          config:
            match_all: true
            action: inc
        nginx_request_duration_seconds:
          type: Histogram
          description: "Nginx Request Duration"
          source: request_time
          config:
            buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

III. Storage and Visualization

3.1 Prometheus + Grafana (the classic pairing)

Step 1: Configure Prometheus scrape jobs

Add to prometheus.yml:

scrape_configs:
  - job_name: 'nginx-stub-status'
    static_configs:
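      - targets: ['exporter_host:9113']  # Stub Status exporter
  - job_name: 'nginx-logs'  # Log-derived metrics from Promtail; omit this job if you only query Loki from Grafana
    static_configs:
      - targets: ['promtail_host:9080']  # promtail_host is a placeholder; 9080 is http_listen_port in promtail-config.yaml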

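Step 2: Configure Grafana dashboards

Import a ready-made Nginx dashboard (ID: 12708) or build custom panels. Core charts include:

QPS trend: rate(nginx_requests_total[1m])
Live connection count: nginx_active_connections
Status code distribution: sum by (status) (rate(promtail_custom_nginx_requests_total[1m]))
P99 request latency: histogram_quantile(0.99, sum by (le) (rate(promtail_custom_nginx_request_duration_seconds_bucket[1m])))

The first two queries use the Stub Status exporter. The last two use the log-derived metrics, to which Promtail adds the promtail_custom_ prefix by default.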
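A percentile computed from a histogram is an estimate, not an exact value: the query finds the bucket that contains the requested rank and interpolates linearly inside it. The sketch below reproduces that estimation in plain Python so the effect of bucket boundaries is easier to see. It assumes cumulative bucket counts sorted by upper bound with a final +Inf bucket, and the bucket values in the example are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
             ending with (math.inf, total_count).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The rank falls beyond the largest finite bound; report that bound
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    prev_count = buckets[index - 1][1] if index > 0 else 0
    # Linear interpolation inside the bucket
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# Made-up example: 100 requests spread over three finite buckets
EXAMPLE_BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]

if __name__ == "__main__":
    for q in (0.5, 0.95, 0.99):
        print(q, histogram_quantile(q, EXAMPLE_BUCKETS))
```

In this example the 0.99 quantile lands in the +Inf bucket, so the estimate is capped at the largest finite bound. This is why the bucket list should extend past the latencies you expect to alert on.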

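3.2 ELK Stack (in-depth log analysis)

Elasticsearch: stores the parsed log data;
Kibana: builds log queries and dashboards (such as top URLs or a heat map of failed requests);
Strengths: full-text search and complex aggregations, which suit auditing and troubleshooting.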

IV. Real-Time Alerting and Automated Response

4.1 Prometheus Alertmanager alert rules

Example alert.rules.yml:

groups:
- name: nginx_alerts
  rules:
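  # Both rules use the log-derived metrics, which Promtail exposes with the
  # promtail_custom_ prefix by default
  - alert: HighRequestLatency
    expr: histogram_quantile(0.95, sum by (le) (rate(promtail_custom_nginx_request_duration_seconds_bucket[1m]))) > 2
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Nginx request latency is high"
      description: "P95 request latency is above 2 seconds, current value: {{ $value }}s"

  - alert: High5xxRate
    # sum() on both sides: without it, each 5xx series is divided by itself and the ratio is always 1
    expr: sum(rate(promtail_custom_nginx_requests_total{status=~"5.."}[1m])) / sum(rate(promtail_custom_nginx_requests_total[1m])) > 0.05
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Nginx 5xx error rate is high"
      description: "5xx error rate is above 5%, current value: {{ $value | humanizePercentage }}"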
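An error-ratio alert is meaningful only when the numerator and the denominator are each aggregated across every label combination before dividing. The sketch below works through that arithmetic on made-up per-series request rates keyed by (status, method); the function name error_ratio and the 5% threshold constant are illustrative.

```python
THRESHOLD = 0.05  # Alert when more than 5% of requests return 5xx


def error_ratio(rates):
    """Share of 5xx requests, given per-series rates keyed by (status, method)."""
    total = sum(rates.values())
    if total == 0:
        return 0.0  # No traffic, so no error ratio to report
    errors = sum(rate for (status, _), rate in rates.items() if status.startswith("5"))
    return errors / total


# Made-up per-second request rates
EXAMPLE_RATES = {
    ("200", "GET"): 90.0,
    ("404", "GET"): 4.0,
    ("500", "GET"): 4.0,
    ("502", "POST"): 2.0,
}

if __name__ == "__main__":
    ratio = error_ratio(EXAMPLE_RATES)
    print(f"5xx ratio: {ratio:.2%}, alert: {ratio > THRESHOLD}")
```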

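4.2 Alert notification integrations

Alertmanager has built-in receivers for email, Slack, PagerDuty, and others, and a generic webhook receiver for services such as DingTalk. Configure alertmanager.yml: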

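receivers:
- name: 'dingtalk-webhook'
  webhook_configs:
  # DingTalk robots do not accept Alertmanager's webhook JSON as-is, so the URL should
  # point at an adapter that reformats it (prometheus-webhook-dingtalk or similar).
  # The host and path below are placeholders for such an adapter, which keeps the
  # https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN URL in its own config.
  - url: 'http://dingtalk-adapter:8060/dingtalk/webhook1/send'
    send_resolved: true
route:
  group_by: ['alertname']
  receiver: 'dingtalk-webhook'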
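Alertmanager's generic webhook posts its own JSON structure, while a DingTalk robot expects a message object with a msgtype field, so a small translation step normally sits between the two. The following is a minimal sketch of that translation as a pure function. It assumes the standard Alertmanager webhook fields (alerts, status, labels, annotations) and DingTalk's plain text message type. The function name to_dingtalk_text and the sample payload are made up, and a real adapter would also need an HTTP server and error handling.

```python
import json


def to_dingtalk_text(payload):
    """Convert an Alertmanager webhook payload into a DingTalk text message."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            "[{status}] {name} ({severity}): {summary}".format(
                status=alert.get("status", "unknown").upper(),
                name=labels.get("alertname", "unnamed"),
                severity=labels.get("severity", "none"),
                summary=annotations.get("summary", ""),
            )
        )
    content = "\n".join(lines) if lines else "No alerts in payload"
    return {"msgtype": "text", "text": {"content": content}}


# Made-up payload containing only the fields this sketch reads
SAMPLE_PAYLOAD = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "HighRequestLatency", "severity": "warning"},
            "annotations": {"summary": "Nginx request latency is high"},
        }
    ],
}

if __name__ == "__main__":
    print(json.dumps(to_dingtalk_text(SAMPLE_PAYLOAD), ensure_ascii=False, indent=2))
```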