====== Runbook: Alerting ======
**Duration:** ~15 minutes \\
**Role:** DevOps, SRE \\
**Prerequisites:** Prometheus, Alertmanager
Automatic notifications when the gateway has problems.
----
===== Workflow =====
  flowchart TD
    A[Start] --> B[Define alert rules]
    B --> C[Configure Alertmanager]
    C --> D[Set up receivers]
    D --> E[Trigger test alert]
    E --> F{Notified?}
    F -->|Yes| G[Done]
    F -->|No| H[Check config]
    style G fill:#e8f5e9
    style H fill:#ffebee
----
===== 1. Alert rules (Prometheus) =====
**/etc/prometheus/rules/gateway-alerts.yml:**
<code yaml>
groups:
  - name: data-gateway
    interval: 30s
    rules:
      # Gateway unreachable
      - alert: GatewayDown
        expr: up{job="data-gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Data Gateway is unreachable"
          description: "{{ $labels.instance }} has been unreachable for 1 minute."

      # High error rate
      - alert: GatewayHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="data-gateway",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="data-gateway"}[5m]))
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate in the gateway"
          description: "Error rate is {{ $value | humanizePercentage }} (> 5%)."

      # Slow response times
      - alert: GatewaySlowResponses
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{job="data-gateway"}[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Gateway is responding slowly"
          description: "P95 response time is {{ $value | humanizeDuration }}."

      # High memory usage
      - alert: GatewayHighMemory
        expr: |
          process_resident_memory_bytes{job="data-gateway"}
            / 1024 / 1024 > 450
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Gateway is using a lot of memory"
          description: "Memory usage is {{ $value | humanize }}MB."

      # Certificate expiring soon
      - alert: GatewayCertExpiringSoon
        expr: |
          (probe_ssl_earliest_cert_expiry{job="gateway-tls"} - time())
            / 86400 < 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Gateway TLS certificate expires soon"
          description: "Certificate expires in {{ $value | humanize }} days."
</code>
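The threshold arithmetic in the rules above can be sanity-checked outside Prometheus. A minimal Python sketch of the error-rate and certificate-expiry conditions (the function names are illustrative, not part of any Prometheus API):

```python
import time

def error_rate_firing(errors_per_s: float, total_per_s: float) -> bool:
    """Mirrors GatewayHighErrorRate: 5xx rate divided by total rate above 5%."""
    if total_per_s == 0:
        return False
    return errors_per_s / total_per_s > 0.05

def cert_expiry_firing(expiry_ts: float, now: float) -> bool:
    """Mirrors GatewayCertExpiringSoon: fewer than 14 days of validity left."""
    return (expiry_ts - now) / 86400 < 14

now = time.time()
print(error_rate_firing(6.0, 100.0))              # 6% error rate -> True
print(cert_expiry_firing(now + 10 * 86400, now))  # 10 days left  -> True
```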
----
===== 2. Update the Prometheus config =====
**/etc/prometheus/prometheus.yml:**
<code yaml>
rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
</code>
<code bash>
# Reload Prometheus
curl -X POST http://localhost:9090/-/reload
</code>
----
===== 3. Alertmanager configuration =====
**/etc/alertmanager/alertmanager.yml:**
<code yaml>
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'secret'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    # Critical alerts immediately
    - match:
        severity: critical
      receiver: 'critical'
      group_wait: 10s
      repeat_interval: 1h

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@example.com'
        send_resolved: true
  - name: 'critical'
    email_configs:
      - to: 'oncall@example.com'
        send_resolved: true
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#alerts-critical'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
</code>
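The routing tree above is small enough to model: an alert whose labels match `severity: critical` goes to the `critical` receiver, everything else falls through to the root receiver `default`. A simplified Python sketch of that matching logic (a teaching model, not Alertmanager's actual implementation):

```python
def pick_receiver(labels: dict) -> str:
    """Simplified model of the route tree: the first child route whose
    match labels all appear in the alert's labels wins; otherwise the
    root receiver applies."""
    child_routes = [({"severity": "critical"}, "critical")]
    for match, receiver in child_routes:
        if all(labels.get(k) == v for k, v in match.items()):
            return receiver
    return "default"

print(pick_receiver({"alertname": "GatewayDown", "severity": "critical"}))      # critical
print(pick_receiver({"alertname": "GatewayHighMemory", "severity": "warning"})) # default
```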
----
===== 4. Slack integration =====
<code yaml>
# Slack receiver only
receivers:
  - name: 'slack-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#gateway-alerts'
        send_resolved: true
        title: '{{ if eq .Status "firing" }}🔥{{ else }}✅{{ end }} {{ .CommonAnnotations.summary }}'
        text: |
          *Alert:* {{ .CommonLabels.alertname }}
          *Severity:* {{ .CommonLabels.severity }}
          *Description:* {{ .CommonAnnotations.description }}
          *Instance:* {{ .CommonLabels.instance }}
</code>
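The `title` template above switches the emoji on the alert status. The same branch, sketched in Python for clarity (Alertmanager itself evaluates this as a Go template):

```python
def slack_title(status: str, summary: str) -> str:
    """Python equivalent of the Go template:
    {{ if eq .Status "firing" }}🔥{{ else }}✅{{ end }} <summary>"""
    emoji = "🔥" if status == "firing" else "✅"
    return f"{emoji} {summary}"

print(slack_title("firing", "High error rate in the gateway"))
print(slack_title("resolved", "High error rate in the gateway"))
```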
----
===== 5. Microsoft Teams =====
<code yaml>
# Via the prometheus-msteams webhook proxy
receivers:
  - name: 'teams-alerts'
    webhook_configs:
      - url: 'http://prometheus-msteams:2000/gateway'
        send_resolved: true
</code>
----
===== 6. Test alert =====
<code bash>
# Check alert rules
curl http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | {alert: .name, state: .state}'

# Active alerts
curl http://localhost:9090/api/v1/alerts | jq '.data.alerts[]'

# Alertmanager status
curl http://localhost:9093/api/v2/status | jq

# Send a test alert
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {"alertname": "TestAlert", "severity": "warning"},
    "annotations": {"summary": "Test alert", "description": "This is a test."}
  }]'
</code>
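The same test alert can be built and sent from Python. A standard-library sketch that constructs the JSON body expected by `POST /api/v2/alerts` (the sending part is commented out because it assumes an Alertmanager reachable on `localhost:9093`):

```python
import json

def build_test_alert(summary: str, description: str) -> bytes:
    """Builds the JSON body for POST /api/v2/alerts: a list of alert objects."""
    payload = [{
        "labels": {"alertname": "TestAlert", "severity": "warning"},
        "annotations": {"summary": summary, "description": description},
    }]
    return json.dumps(payload).encode()

body = build_test_alert("Test alert", "This is a test.")
print(body.decode())

# Sending it (requires a reachable Alertmanager):
# import urllib.request
# req = urllib.request.Request("http://localhost:9093/api/v2/alerts", data=body,
#                              headers={"Content-Type": "application/json"})
# urllib.request.urlopen(req)
```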
----
===== 7. Checklist =====
| # | Check | ✓ |
|---|-------|---|
| 1 | Alert rules created | ☐ |
| 2 | Prometheus config updated | ☐ |
| 3 | Alertmanager configured | ☐ |
| 4 | Receivers tested (e-mail/Slack) | ☐ |
| 5 | Test alert received | ☐ |
----
===== Troubleshooting =====
| Problem | Cause | Solution |
|---------|-------|----------|
| ''No alerts'' | Rule syntax wrong | ''promtool check rules rules.yml'' |
| ''Alert not firing'' | Condition not met | Test the query manually |
| ''No notification'' | Receiver misconfigured | Check the Alertmanager logs |
| ''Duplicate alerts'' | Wrong grouping | Adjust ''group_by'' |
----
===== Recommended thresholds =====
| Alert | Threshold | Duration |
|-------|-----------|----------|
| GatewayDown | up == 0 | 1m |
| HighErrorRate | > 5% | 5m |
| SlowResponses | p95 > 2s | 5m |
| HighMemory | > 450MB | 10m |
| CertExpiring | < 14 days | 1h |
----
===== Related runbooks =====
* [[.:prometheus|Prometheus]] – collecting metrics
* [[.:grafana-dashboard|Grafana]] – visualization
* [[..:sicherheit:tls-einrichten|Setting up TLS]] – certificate alerting
----
<< [[.:grafana-dashboard|← Grafana Dashboard]] | [[..:start|→ Operator overview]] >>
----
//Wolfgang van der Stille @ EMSR DATA d.o.o. - Data Gateway Professional//
{{tag>operator runbook alerting prometheus alertmanager}}