Runbook: Alerting

Duration: ~15 minutes
Role: DevOps, SRE
Prerequisites: Prometheus, Alertmanager

Automatic notifications for Gateway issues.


Workflow

flowchart TD
    A[Start] --> B[Define alert rules]
    B --> C[Configure Alertmanager]
    C --> D[Configure receiver]
    D --> E[Trigger test alert]
    E --> F{Notification received?}
    F -->|Yes| G[Done]
    F -->|No| H[Check config]
    style G fill:#e8f5e9
    style H fill:#ffebee


1. Alert Rules (Prometheus)

/etc/prometheus/rules/gateway-alerts.yml:

groups:
  - name: data-gateway
    interval: 30s
    rules:
      # Gateway unreachable
      - alert: GatewayDown
        expr: up{job="data-gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Data Gateway unreachable"
          description: "{{ $labels.instance }} has been unreachable for 1 minute."
 
      # High error rate
      - alert: GatewayHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="data-gateway",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="data-gateway"}[5m]))
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on the Gateway"
          description: "Error rate is {{ $value | humanizePercentage }} (> 5%)."
 
      # Slow response times
      - alert: GatewaySlowResponses
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{job="data-gateway"}[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Gateway responding slowly"
          description: "P95 response time is {{ $value | humanizeDuration }}."
 
      # High memory usage
      - alert: GatewayHighMemory
        expr: |
          process_resident_memory_bytes{job="data-gateway"}
          / 1024 / 1024 > 450
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Gateway memory usage is high"
          description: "Memory usage is {{ $value | humanize }}MB."
 
      # Certificate expiring soon
      - alert: GatewayCertExpiringSoon
        expr: |
          (probe_ssl_earliest_cert_expiry{job="gateway-tls"} - time())
          / 86400 < 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Gateway TLS certificate expiring soon"
          description: "Certificate expires in {{ $value | humanize }} days."
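
The `for:` clause in each rule means the expression must stay true for the whole duration before the alert fires; any healthy sample resets the timer. A minimal Python sketch of that state machine (a toy model for illustration, not Prometheus's actual implementation):

```python
def evaluate(samples, for_seconds=60):
    """Toy model of the Prometheus 'for:' clause: the alert only
    moves from pending to firing once the expression has been
    continuously true for at least for_seconds."""
    pending_since = None
    state = "inactive"
    for t, breached in samples:
        if breached:
            if pending_since is None:
                pending_since = t  # breach just started
            state = "firing" if t - pending_since >= for_seconds else "pending"
        else:
            pending_since = None   # a healthy sample resets the timer
            state = "inactive"
    return state

# Continuously down for 60s -> firing; a recovery in between resets the timer.
print(evaluate([(0, True), (30, True), (60, True)]))   # firing
print(evaluate([(0, True), (30, False), (60, True)]))  # pending
```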

2. Update the Prometheus Config

/etc/prometheus/prometheus.yml:

rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Reload Prometheus
curl -X POST http://localhost:9090/-/reload

3. Alertmanager Configuration

/etc/alertmanager/alertmanager.yml:

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'secret'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    # Critical alerts routed immediately
    - match:
        severity: critical
      receiver: 'critical'
      group_wait: 10s
      repeat_interval: 1h

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@example.com'
        send_resolved: true

  - name: 'critical'
    email_configs:
      - to: 'oncall@example.com'
        send_resolved: true
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#alerts-critical'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
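
The route tree above sends `severity: critical` alerts to the 'critical' receiver and everything else to 'default'. A simplified Python sketch of that matching logic (illustrative only — it ignores nested routes, `match_re`, and `continue`):

```python
def pick_receiver(labels, routes, default="default"):
    # First child route whose match labels all equal the alert's labels wins;
    # otherwise the top-level default receiver is used.
    for route in routes:
        if all(labels.get(k) == v for k, v in route["match"].items()):
            return route["receiver"]
    return default

routes = [{"match": {"severity": "critical"}, "receiver": "critical"}]
print(pick_receiver({"alertname": "GatewayDown", "severity": "critical"}, routes))
print(pick_receiver({"alertname": "GatewayHighMemory", "severity": "warning"}, routes))
```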

4. Slack Integration

# Slack receiver only
receivers:
  - name: 'slack-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#gateway-alerts'
        send_resolved: true
        title: '{{ if eq .Status "firing" }}ALERT{{ else }}OK{{ end }} {{ .CommonAnnotations.summary }}'
        text: |
          *Alert:* {{ .CommonLabels.alertname }}
          *Severity:* {{ .CommonLabels.severity }}
          *Description:* {{ .CommonAnnotations.description }}
          *Instance:* {{ .CommonLabels.instance }}
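
The `title` template picks a prefix from the alert status. The same logic in Python, for reference (a sketch of what the Go template evaluates to, not Alertmanager code):

```python
def slack_title(status, summary):
    # Mirrors '{{ if eq .Status "firing" }}ALERT{{ else }}OK{{ end }} {{ ...summary }}'
    prefix = "ALERT" if status == "firing" else "OK"
    return f"{prefix} {summary}"

print(slack_title("firing", "High error rate on the Gateway"))
print(slack_title("resolved", "High error rate on the Gateway"))
```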

5. Microsoft Teams

# Via the prometheus-msteams webhook bridge
receivers:
  - name: 'teams-alerts'
    webhook_configs:
      - url: 'http://prometheus-msteams:2000/gateway'
        send_resolved: true

6. Test Alert

# Check alert rules
curl http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | {alert: .name, state: .state}'
 
# Active alerts
curl http://localhost:9090/api/v1/alerts | jq '.data.alerts[]'
 
# Alertmanager status
curl http://localhost:9093/api/v2/status | jq
 
# Send a test alert
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {"alertname": "TestAlert", "severity": "warning"},
    "annotations": {"summary": "Test alert", "description": "This is a test."}
  }]'
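
The same test alert can be sent from Python instead of curl. The sketch below builds the v2 payload with the standard library and prepares the POST (it assumes Alertmanager on localhost:9093; the actual request is left commented out so the snippet also runs offline):

```python
import json
from urllib import request

# Same payload as the curl example above
payload = [{
    "labels": {"alertname": "TestAlert", "severity": "warning"},
    "annotations": {"summary": "Test alert", "description": "This is a test."},
}]
body = json.dumps(payload).encode()

req = request.Request(
    "http://localhost:9093/api/v2/alerts",
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# request.urlopen(req)  # uncomment when Alertmanager is reachable
print(json.loads(body)[0]["labels"]["alertname"])  # TestAlert
```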

7. Checklist

#   Check                           ✓
--  ------------------------------  --
1   Alert rules created
2   Prometheus config updated
3   Alertmanager configured
4   Receiver tested (email/Slack)
5   Test alert received

Troubleshooting

Problem             Cause                  Solution
------------------  ---------------------  --------------------------------
No alerts           Rule syntax error      promtool check rules rules.yml
Alert not firing    Condition not met      Test the query manually
No notification     Wrong receiver         Check the Alertmanager logs
Duplicate alerts    Wrong grouping         Adjust group_by

Recommended Thresholds

Alert           Threshold    Duration
--------------  -----------  --------
GatewayDown     up == 0      1m
HighErrorRate   > 5%         5m
SlowResponses   p95 > 2s     5m
HighMemory      > 450 MB     10m
CertExpiring    < 14 days    1h
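
The two less obvious thresholds are plain arithmetic on the underlying metrics. With hypothetical sample values (illustrative only, not from a live gateway):

```python
import time

# GatewayHighErrorRate: 5xx rate divided by total rate, compared to 5%
errors_per_s = 12.0   # hypothetical sum(rate(...{status=~"5.."}[5m]))
total_per_s = 200.0   # hypothetical sum(rate(...[5m]))
error_ratio = errors_per_s / total_per_s
print(error_ratio, error_ratio > 0.05)  # 0.06 True -> alert would fire

# GatewayCertExpiringSoon: remaining lifetime in days, compared to 14
expiry_ts = time.time() + 10 * 86400  # hypothetical cert expiring in 10 days
days_left = (expiry_ts - time.time()) / 86400
print(days_left < 14)  # True -> alert would fire
```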

Related Runbooks


« Grafana Dashboard | Operator Overview »


Wolfgang van der Stille @ EMSR DATA d.o.o. - Data Gateway Professional

Last modified: 29/01/2026 at 23:36