====== Runbook: Alerting ======
**Duration:** ~15 minutes \\
**Role:** DevOps, SRE \\
**Prerequisites:** Prometheus, Alertmanager
Automatic notifications when the gateway has problems.
----
===== Workflow =====
  flowchart TD
    A[Start] --> B[Define alert rules]
    B --> C[Configure Alertmanager]
    C --> D[Set up receivers]
    D --> E[Trigger test alert]
    E --> F{Notified?}
    F -->|Yes| G[Done]
    F -->|No| H[Check config]
    style G fill:#e8f5e9
    style H fill:#ffebee
----
===== 1. Alert rules (Prometheus) =====
**/etc/prometheus/rules/gateway-alerts.yml:**
<code yaml>
groups:
  - name: data-gateway
    interval: 30s
    rules:
      # Gateway unreachable
      - alert: GatewayDown
        expr: up{job="data-gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Data Gateway is unreachable"
          description: "{{ $labels.instance }} has been unreachable for 1 minute."

      # High error rate
      - alert: GatewayHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="data-gateway",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="data-gateway"}[5m]))
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate in the gateway"
          description: "Error rate is {{ $value | humanizePercentage }} (> 5%)."

      # Slow response times
      - alert: GatewaySlowResponses
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{job="data-gateway"}[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Gateway is responding slowly"
          description: "P95 response time is {{ $value | humanizeDuration }}."

      # High memory usage
      - alert: GatewayHighMemory
        expr: |
          process_resident_memory_bytes{job="data-gateway"}
            / 1024 / 1024 > 450
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Gateway is using a lot of memory"
          description: "Memory usage is {{ $value | humanize }}MB."

      # Certificate expiring soon
      - alert: GatewayCertExpiringSoon
        expr: |
          (probe_ssl_earliest_cert_expiry{job="gateway-tls"} - time())
            / 86400 < 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Gateway TLS certificate expires soon"
          description: "Certificate expires in {{ $value | humanize }} days."
</code>
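The threshold arithmetic in the rules above can be sanity-checked outside Prometheus. A minimal Python sketch of the error-rate and certificate-expiry conditions (the function names are illustrative, not part of any Prometheus API):

```python
import time

def error_rate_firing(errors_per_s: float, total_per_s: float) -> bool:
    """Mirrors GatewayHighErrorRate: 5xx rate divided by total rate above 5%."""
    if total_per_s == 0:
        return False
    return errors_per_s / total_per_s > 0.05

def cert_expiry_firing(expiry_ts: float, now: float) -> bool:
    """Mirrors GatewayCertExpiringSoon: fewer than 14 days of validity left."""
    return (expiry_ts - now) / 86400 < 14

now = time.time()
print(error_rate_firing(6.0, 100.0))              # 6% error rate -> True
print(cert_expiry_firing(now + 10 * 86400, now))  # 10 days left  -> True
```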
----
===== 2. Update the Prometheus config =====
**/etc/prometheus/prometheus.yml:**
<code yaml>
rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
</code>
<code bash>
# Reload Prometheus
curl -X POST http://localhost:9090/-/reload
</code>
----
===== 3. Alertmanager configuration =====
**/etc/alertmanager/alertmanager.yml:**
<code yaml>
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'secret'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    # Critical alerts immediately
    - match:
        severity: critical
      receiver: 'critical'
      group_wait: 10s
      repeat_interval: 1h

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@example.com'
        send_resolved: true
  - name: 'critical'
    email_configs:
      - to: 'oncall@example.com'
        send_resolved: true
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#alerts-critical'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
</code>
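The routing tree above is small enough to model: an alert whose labels match `severity: critical` goes to the `critical` receiver, everything else falls through to the root receiver `default`. A simplified Python sketch of that matching logic (a teaching model, not Alertmanager's actual implementation):

```python
def pick_receiver(labels: dict) -> str:
    """Simplified model of the route tree: the first child route whose
    match labels all appear in the alert's labels wins; otherwise the
    root receiver applies."""
    child_routes = [({"severity": "critical"}, "critical")]
    for match, receiver in child_routes:
        if all(labels.get(k) == v for k, v in match.items()):
            return receiver
    return "default"

print(pick_receiver({"alertname": "GatewayDown", "severity": "critical"}))      # critical
print(pick_receiver({"alertname": "GatewayHighMemory", "severity": "warning"})) # default
```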
----
===== 4. Slack integration =====
<code yaml>
# Slack receiver only
receivers:
  - name: 'slack-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#gateway-alerts'
        send_resolved: true
        title: '{{ if eq .Status "firing" }}🔥{{ else }}✅{{ end }} {{ .CommonAnnotations.summary }}'
        text: |
          *Alert:* {{ .CommonLabels.alertname }}
          *Severity:* {{ .CommonLabels.severity }}
          *Description:* {{ .CommonAnnotations.description }}
          *Instance:* {{ .CommonLabels.instance }}
</code>
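The `title` template above switches the emoji on the alert status. The same branch, sketched in Python for clarity (Alertmanager itself evaluates this as a Go template):

```python
def slack_title(status: str, summary: str) -> str:
    """Python equivalent of the Go template:
    {{ if eq .Status "firing" }}🔥{{ else }}✅{{ end }} <summary>"""
    emoji = "🔥" if status == "firing" else "✅"
    return f"{emoji} {summary}"

print(slack_title("firing", "High error rate in the gateway"))
print(slack_title("resolved", "High error rate in the gateway"))
```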
----
===== 5. Microsoft Teams =====
<code yaml>
# Via the prometheus-msteams webhook proxy
receivers:
  - name: 'teams-alerts'
    webhook_configs:
      - url: 'http://prometheus-msteams:2000/gateway'
        send_resolved: true
</code>
----
===== 6. Test alert =====
<code bash>
# Check alert rules
curl http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | {alert: .name, state: .state}'

# Active alerts
curl http://localhost:9090/api/v1/alerts | jq '.data.alerts[]'

# Alertmanager status
curl http://localhost:9093/api/v2/status | jq

# Send a test alert
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {"alertname": "TestAlert", "severity": "warning"},
    "annotations": {"summary": "Test alert", "description": "This is a test."}
  }]'
</code>
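The same test alert can be built and sent from Python. A standard-library sketch that constructs the JSON body expected by `POST /api/v2/alerts` (the sending part is commented out because it assumes an Alertmanager reachable on `localhost:9093`):

```python
import json

def build_test_alert(summary: str, description: str) -> bytes:
    """Builds the JSON body for POST /api/v2/alerts: a list of alert objects."""
    payload = [{
        "labels": {"alertname": "TestAlert", "severity": "warning"},
        "annotations": {"summary": summary, "description": description},
    }]
    return json.dumps(payload).encode()

body = build_test_alert("Test alert", "This is a test.")
print(body.decode())

# Sending it (requires a reachable Alertmanager):
# import urllib.request
# req = urllib.request.Request("http://localhost:9093/api/v2/alerts", data=body,
#                              headers={"Content-Type": "application/json"})
# urllib.request.urlopen(req)
```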
----
===== 7. Checklist =====
| # | Check | ✓ |
|---|-------|---|
| 1 | Alert rules created | ☐ |
| 2 | Prometheus config updated | ☐ |
| 3 | Alertmanager configured | ☐ |
| 4 | Receivers tested (e-mail/Slack) | ☐ |
| 5 | Test alert received | ☐ |
----
===== Troubleshooting =====
| Problem | Cause | Solution |
|---------|-------|----------|
| ''No alerts'' | Rule syntax wrong | ''promtool check rules rules.yml'' |
| ''Alert not firing'' | Condition not met | Test the query manually |
| ''No notification'' | Receiver misconfigured | Check the Alertmanager logs |
| ''Duplicate alerts'' | Wrong grouping | Adjust ''group_by'' |
----
===== Recommended thresholds =====
| Alert | Threshold | Duration |
|-------|-----------|----------|
| GatewayDown | up == 0 | 1m |
| HighErrorRate | > 5% | 5m |
| SlowResponses | p95 > 2s | 5m |
| HighMemory | > 450MB | 10m |
| CertExpiring | < 14 days | 1h |
----
===== Related runbooks =====
* [[.:prometheus|Prometheus]] – collecting metrics
* [[.:grafana-dashboard|Grafana]] – visualization
* [[..:sicherheit:tls-einrichten|Setting up TLS]] – certificate alerting
----
<< [[.:grafana-dashboard|← Grafana Dashboard]] | [[..:start|→ Operator overview]] >>
----
//Wolfgang van der Stille @ EMSR DATA d.o.o. - Data Gateway Professional//
{{tag>operator runbook alerting prometheus alertmanager}}