Runbook: Alerting

Duration: ~15 minutes
Role: DevOps, SRE
Prerequisites: Prometheus, Alertmanager

Automatic notifications for gateway problems.
Workflow

```mermaid
flowchart TD
    A[Start] --> B[Define alert rules]
    B --> C[Configure Alertmanager]
    C --> D[Set up receivers]
    D --> E[Trigger test alert]
    E --> F{Notified?}
    F -->|Yes| G[Done]
    F -->|No| H[Check config]
    style G fill:#e8f5e9
    style H fill:#ffebee
```
1. Alert Rules (Prometheus)
/etc/prometheus/rules/gateway-alerts.yml:
```yaml
groups:
  - name: data-gateway
    interval: 30s
    rules:
      # Gateway unreachable
      - alert: GatewayDown
        expr: up{job="data-gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Data Gateway is unreachable"
          description: "{{ $labels.instance }} has been unreachable for 1 minute."

      # High error rate
      - alert: GatewayHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="data-gateway",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="data-gateway"}[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate in the gateway"
          description: "Error rate is {{ $value | humanizePercentage }} (> 5%)."

      # Slow response times
      - alert: GatewaySlowResponses
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{job="data-gateway"}[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Gateway is responding slowly"
          description: "P95 response time is {{ $value | humanizeDuration }}."

      # High memory usage
      - alert: GatewayHighMemory
        expr: |
          process_resident_memory_bytes{job="data-gateway"} / 1024 / 1024 > 450
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Gateway is using a lot of memory"
          description: "Memory usage is {{ $value | humanize }}MB."

      # Certificate expiring soon
      - alert: GatewayCertExpiringSoon
        expr: |
          (probe_ssl_earliest_cert_expiry{job="gateway-tls"} - time()) / 86400 < 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Gateway TLS certificate expires soon"
          description: "Certificate expires in {{ $value | humanize }} days."
```
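The GatewayHighErrorRate expression divides the rate of 5xx responses by the total request rate and compares the ratio against 5%. A minimal Python sketch of that arithmetic (function name and sample rates are illustrative, not real metrics):

```python
# Sketch of the GatewayHighErrorRate check: 5xx request rate divided by
# the total request rate, compared against the 5% threshold.
# The sample values below are made up for illustration.

def error_rate_exceeded(rate_5xx: float, rate_total: float,
                        threshold: float = 0.05) -> bool:
    """Return True when the share of 5xx responses exceeds the threshold."""
    if rate_total == 0:
        return False  # no traffic, nothing to alert on
    return rate_5xx / rate_total > threshold

print(error_rate_exceeded(6.0, 100.0))  # 6% error rate -> True
print(error_rate_exceeded(2.0, 100.0))  # 2% error rate -> False
```

Note that Prometheus evaluates this over `rate(...[5m])` windows, so short error spikes must persist for the full `for: 5m` before the alert fires.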
2. Update the Prometheus Config
/etc/prometheus/prometheus.yml:
```yaml
rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
```
```bash
# Reload Prometheus
curl -X POST http://localhost:9090/-/reload
```
3. Alertmanager Configuration
/etc/alertmanager/alertmanager.yml:
```yaml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'secret'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    # Critical alerts immediately
    - match:
        severity: critical
      receiver: 'critical'
      group_wait: 10s
      repeat_interval: 1h

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@example.com'
        send_resolved: true
  - name: 'critical'
    email_configs:
      - to: 'oncall@example.com'
        send_resolved: true
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#alerts-critical'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
```
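The routing tree above sends `severity: critical` alerts to the 'critical' receiver and everything else to 'default', grouping by `alertname` and `severity`. A simplified Python model of that decision (the functions are hypothetical helpers, not Alertmanager code):

```python
# Simplified model of the Alertmanager routing tree above: alerts with
# severity=critical match the child route, everything else falls through
# to the default receiver. The grouping key mirrors
# group_by: ['alertname', 'severity'].

def route(labels: dict) -> str:
    """Pick a receiver name for an alert's label set."""
    if labels.get("severity") == "critical":
        return "critical"
    return "default"

def group_key(labels: dict) -> tuple:
    """Alerts with the same key are batched into one notification."""
    return (labels.get("alertname"), labels.get("severity"))

print(route({"alertname": "GatewayDown", "severity": "critical"}))      # critical
print(route({"alertname": "GatewayHighMemory", "severity": "warning"})) # default
```

The inhibit rule adds one more step on top of this: a firing critical alert suppresses warning alerts that share the same `alertname` and `instance` labels.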
4. Slack Integration
```yaml
# Slack receiver only
receivers:
  - name: 'slack-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#gateway-alerts'
        send_resolved: true
        title: '{{ if eq .Status "firing" }}🔥{{ else }}✅{{ end }} {{ .CommonAnnotations.summary }}'
        text: |
          *Alert:* {{ .CommonLabels.alertname }}
          *Severity:* {{ .CommonLabels.severity }}
          *Description:* {{ .CommonAnnotations.description }}
          *Instance:* {{ .CommonLabels.instance }}
```
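The `title` template uses a Go-template conditional on `.Status` to pick an emoji. A small Python sketch of the same logic (the function is illustrative, not part of any library):

```python
# Sketch of the Slack title template above: the
# {{ if eq .Status "firing" }} conditional chooses the emoji based on
# whether the alert group is firing or resolved.

def slack_title(status: str, summary: str) -> str:
    icon = "🔥" if status == "firing" else "✅"
    return f"{icon} {summary}"

print(slack_title("firing", "Data Gateway is unreachable"))
print(slack_title("resolved", "Data Gateway is unreachable"))
```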
5. Microsoft Teams
```yaml
# Via the prometheus-msteams webhook bridge
receivers:
  - name: 'teams-alerts'
    webhook_configs:
      - url: 'http://prometheus-msteams:2000/gateway'
        send_resolved: true
```
6. Test Alert
```bash
# Check alert rules
curl http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | {alert: .name, state: .state}'

# Active alerts
curl http://localhost:9090/api/v1/alerts | jq '.data.alerts[]'

# Alertmanager status
curl http://localhost:9093/api/v2/status | jq

# Send a test alert
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {"alertname": "TestAlert", "severity": "warning"},
    "annotations": {"summary": "Test alert", "description": "This is a test."}
  }]'
```
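The test alert is a JSON array of alert objects. A self-contained Python sketch that builds the same body as the `curl` call above (the actual POST is left out so the example needs no running Alertmanager):

```python
import json

# Builds the same JSON body the curl test above sends to the
# Alertmanager v2 alerts endpoint (POST /api/v2/alerts).
# Posting is omitted so the sketch stays self-contained.

payload = [{
    "labels": {"alertname": "TestAlert", "severity": "warning"},
    "annotations": {"summary": "Test alert", "description": "This is a test."},
}]

body = json.dumps(payload)
print(body)
```

Since `resolve_timeout` is 5m and the alert carries no `endsAt`, Alertmanager treats a test alert like this as resolved a few minutes after the last POST.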
7. Checklist

| # | Check | ✓ |
| --- | --- | --- |
| 1 | Alert rules created | ☐ |
| 2 | Prometheus config updated | ☐ |
| 3 | Alertmanager configured | ☐ |
| 4 | Receivers tested (email/Slack) | ☐ |
| 5 | Test alert received | ☐ |
Troubleshooting
| Problem | Cause | Solution |
| --- | --- | --- |
| No alerts | Rule syntax wrong | `promtool check rules rules.yml` |
| Alert not firing | Condition not met | Test the query manually |
| No notification | Receiver misconfigured | Check the Alertmanager logs |
| Duplicate alerts | Wrong grouping | Adjust `group_by` |
Recommended Thresholds

| Alert | Threshold | Duration |
| --- | --- | --- |
| GatewayDown | up == 0 | 1m |
| HighErrorRate | > 5% | 5m |
| SlowResponses | p95 > 2s | 5m |
| HighMemory | > 450 MB | 10m |
| CertExpiring | < 14 days | 1h |
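The CertExpiring threshold comes from the GatewayCertExpiringSoon rule, which converts the remaining certificate lifetime from seconds to days: `(expiry - now) / 86400 < 14`. A minimal sketch of that arithmetic (function names are illustrative):

```python
# Sketch of the GatewayCertExpiringSoon arithmetic: remaining certificate
# lifetime in seconds, converted to days and compared against 14.

def days_until_expiry(expiry_ts: float, now_ts: float) -> float:
    """Remaining lifetime in days from two Unix timestamps."""
    return (expiry_ts - now_ts) / 86400

def cert_expiring_soon(expiry_ts: float, now_ts: float,
                       threshold_days: float = 14) -> bool:
    return days_until_expiry(expiry_ts, now_ts) < threshold_days

# 10 days of lifetime left -> below the 14-day threshold, alert fires
print(cert_expiring_soon(expiry_ts=10 * 86400, now_ts=0))  # True
```

The `for: 1h` on the rule keeps a briefly flapping blackbox probe from paging anyone.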
Related Runbooks

- Prometheus – collecting metrics
- Grafana – visualization
- TLS setup – certificate alerting
Wolfgang van der Stille @ EMSR DATA d.o.o. - Data Gateway Professional
Last modified: 2026-01-29 at 15:12