====== Runbook: Alerting ======
**Duration:** ~15 minutes \\
**Role:** DevOps, SRE \\
**Prerequisites:** Prometheus, Alertmanager

Automated notifications for Gateway problems.
----
===== Workflow =====
<code>
flowchart TD
    A[Start] --> B[Define alert rules]
    B --> C[Configure Alertmanager]
    C --> D[Configure receiver]
    D --> E[Trigger test alert]
    E --> F{Notification received?}
    F -->|Yes| G[Done]
    F -->|No| H[Check config]
    style G fill:#e8f5e9
    style H fill:#ffebee
</code>
----
===== 1. Alert Rules (Prometheus) =====
**/etc/prometheus/rules/gateway-alerts.yml:**
<code yaml>
groups:
  - name: data-gateway
    interval: 30s
    rules:
      # Gateway unreachable
      - alert: GatewayDown
        expr: up{job="data-gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Data Gateway unreachable"
          description: "{{ $labels.instance }} has been unreachable for 1 minute."

      # High error rate
      - alert: GatewayHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="data-gateway",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="data-gateway"}[5m]))
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate in the Gateway"
          description: "Error rate is {{ $value | humanizePercentage }} (> 5%)."

      # Slow response times
      - alert: GatewaySlowResponses
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{job="data-gateway"}[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Gateway responding slowly"
          description: "P95 response time is {{ $value | humanizeDuration }}."

      # High memory usage
      - alert: GatewayHighMemory
        expr: |
          process_resident_memory_bytes{job="data-gateway"}
            / 1024 / 1024 > 450
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Gateway is using a lot of memory"
          description: "Memory usage is {{ $value | humanize }}MB."

      # Certificate expiring soon
      - alert: GatewayCertExpiringSoon
        expr: |
          (probe_ssl_earliest_cert_expiry{job="gateway-tls"} - time())
            / 86400 < 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Gateway TLS certificate expiring soon"
          description: "Certificate expires in {{ $value | humanize }} days."
</code>
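The ''GatewayHighErrorRate'' expression is simply the ratio of the 5xx request rate to the total request rate, compared against 5%. A minimal Python sketch of the same arithmetic, using hypothetical counter increments over the 5-minute window:

```python
# Sketch of the GatewayHighErrorRate arithmetic: 5xx rate / total rate > 5%.
# The counter increments below are hypothetical sample values.
WINDOW_SECONDS = 5 * 60

def error_ratio(increase_5xx: float, increase_total: float) -> float:
    """Ratio of the 5xx request rate to the total request rate over the window."""
    rate_5xx = increase_5xx / WINDOW_SECONDS
    rate_total = increase_total / WINDOW_SECONDS
    return rate_5xx / rate_total

# 40 errors out of 600 requests in 5 minutes -> ~6.7%, above the 5% threshold.
ratio = error_ratio(40, 600)
print(f"error rate: {ratio:.1%}, alert fires: {ratio > 0.05}")
```

Note that the window length cancels out: only the ratio of the two increases matters, which is why the alert is robust to scrape-interval changes.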
----
===== 2. Update the Prometheus Config =====
**/etc/prometheus/prometheus.yml:**
<code yaml>
rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
</code>

Then reload Prometheus (requires it to be running with ''--web.enable-lifecycle''):
<code bash>
curl -X POST http://localhost:9090/-/reload
</code>
----
===== 3. Alertmanager Configuration =====
**/etc/alertmanager/alertmanager.yml:**
<code yaml>
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'secret'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    # Route critical alerts immediately
    - match:
        severity: critical
      receiver: 'critical'
      group_wait: 10s
      repeat_interval: 1h

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@example.com'
        send_resolved: true
  - name: 'critical'
    email_configs:
      - to: 'oncall@example.com'
        send_resolved: true
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#alerts-critical'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
</code>
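The ''inhibit_rules'' block suppresses warning-level alerts while a critical alert that agrees on every label in ''equal'' is firing. A rough Python sketch of that matching logic (an illustration only, not the actual Alertmanager implementation; the alert dicts are hypothetical):

```python
# Sketch of the inhibit rule above: a firing critical alert suppresses
# warnings that agree on every label listed in `equal`.
EQUAL_LABELS = ["alertname", "instance"]

def is_inhibited(target: dict, sources: list[dict]) -> bool:
    """True if a warning-level target alert is suppressed by a firing critical source."""
    if target.get("severity") != "warning":
        return False
    return any(
        src.get("severity") == "critical"
        and all(src.get(k) == target.get(k) for k in EQUAL_LABELS)
        for src in sources
    )

critical = {"alertname": "GatewayDown", "instance": "gw-1:8080", "severity": "critical"}
warning = {"alertname": "GatewayDown", "instance": "gw-1:8080", "severity": "warning"}
print(is_inhibited(warning, [critical]))  # True: same alertname and instance
```

Without this rule, a single outage would page the on-call receiver and spam the default receiver with redundant warnings at the same time.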
----
===== 4. Slack Integration =====
<code yaml>
# Slack part only (merge into the receivers list above)
receivers:
  - name: 'slack-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#gateway-alerts'
        send_resolved: true
        title: '{{ if eq .Status "firing" }}ALERT{{ else }}OK{{ end }} {{ .CommonAnnotations.summary }}'
        text: |
          *Alert:* {{ .CommonLabels.alertname }}
          *Severity:* {{ .CommonLabels.severity }}
          *Description:* {{ .CommonAnnotations.description }}
          *Instance:* {{ .CommonLabels.instance }}
</code>
----
===== 5. Microsoft Teams =====
<code yaml>
# Via the prometheus-msteams webhook bridge
receivers:
  - name: 'teams-alerts'
    webhook_configs:
      - url: 'http://prometheus-msteams:2000/gateway'
        send_resolved: true
</code>
----
===== 6. Test Alerts =====
<code bash>
# Check the alert rules
curl http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | {alert: .name, state: .state}'

# Active alerts
curl http://localhost:9090/api/v1/alerts | jq '.data.alerts[]'

# Alertmanager status
curl http://localhost:9093/api/v2/status | jq

# Send a test alert
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {"alertname": "TestAlert", "severity": "warning"},
    "annotations": {"summary": "Test alert", "description": "This is a test."}
  }]'
</code>
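The test-alert body is a JSON array of alert objects, each with ''labels'' and ''annotations''. A small standard-library Python helper (hypothetical, shown for illustration) that builds the same payload for posting with ''curl'' or ''urllib'':

```python
import json

def build_test_alert(alertname: str, severity: str,
                     summary: str, description: str) -> str:
    """Build the JSON array body expected by POST /api/v2/alerts."""
    payload = [{
        "labels": {"alertname": alertname, "severity": severity},
        "annotations": {"summary": summary, "description": description},
    }]
    return json.dumps(payload)

body = build_test_alert("TestAlert", "warning", "Test alert", "This is a test.")
print(body)
```

Alertmanager deduplicates and routes this exactly like a Prometheus-generated alert, so it exercises the full grouping and receiver path.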
----
===== 7. Checklist =====
| # | Checkpoint | ✓ |
|---|-----------|---|
| 1 | Alert rules created | ☐ |
| 2 | Prometheus config updated | ☐ |
| 3 | Alertmanager configured | ☐ |
| 4 | Receiver tested (e-mail/Slack) | ☐ |
| 5 | Test alert received | ☐ |
----
===== Troubleshooting =====
| Problem | Cause | Solution |
|---------|-------|----------|
| ''No alerts'' | Rule syntax error | ''promtool check rules rules.yml'' |
| ''Alert not firing'' | Condition not met | Test the query manually |
| ''No notification'' | Wrong receiver | Check the Alertmanager logs |
| ''Duplicate alerts'' | Wrong grouping | Adjust ''group_by'' |
----
===== Recommended Thresholds =====
| Alert | Threshold | Duration |
|-------|-----------|----------|
| GatewayDown | up == 0 | 1m |
| HighErrorRate | > 5% | 5m |
| SlowResponses | p95 > 2s | 5m |
| HighMemory | > 450 MB | 10m |
| CertExpiring | < 14 days | 1h |
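The ''CertExpiring'' row maps to the rule's expression ''(expiry - now) / 86400 < 14'': a Unix timestamp converted into days remaining. A quick Python sketch of the same check, with a fixed timestamp so the example is deterministic:

```python
CERT_EXPIRY_DAYS = 14  # threshold from the table above

def days_until_expiry(expiry_ts: float, now: float) -> float:
    """Days remaining before the certificate expires (mirrors the PromQL expression)."""
    return (expiry_ts - now) / 86400

now = 1_700_000_000            # fixed "current" timestamp for a deterministic example
expiry = now + 10 * 86400      # certificate valid for 10 more days
remaining = days_until_expiry(expiry, now)
print(f"{remaining:.0f} days left, alert fires: {remaining < CERT_EXPIRY_DAYS}")
# -> 10 days left, alert fires: True
```

The ''for: 1h'' duration keeps a single flapping blackbox probe from paging anyone; the condition must hold for a full hour before the alert fires.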
----
===== Related Runbooks =====
* [[.:prometheus|Prometheus]] - Collecting metrics
* [[.:grafana-dashboard|Grafana]] - Visualization
* [[..:sicherheit:tls-einrichten|Set up TLS]] - Certificate alerting
----
<< [[.:grafana-dashboard|<- Grafana Dashboard]] | [[..:start|-> Operator Overview]] >>
----
//Wolfgang van der Stille @ EMSR DATA d.o.o. - Data Gateway Professional//
{{tag>operator runbook alerting prometheus alertmanager}}