====== Runbook: Alerting ======

**Duration:** ~15 minutes \\
**Role:** DevOps, SRE \\
**Prerequisite:** Prometheus, Alertmanager

Automatic notifications when the Gateway has problems.

----

===== Workflow =====

<mermaid>
flowchart TD
    A[Start] --> B[Define alert rules]
    B --> C[Configure Alertmanager]
    C --> D[Set up receiver]
    D --> E[Trigger test alert]
    E --> F{Notified?}
    F -->|Yes| G[Done]
    F -->|No| H[Check config]

    style G fill:#e8f5e9
    style H fill:#ffebee
</mermaid>

----

===== 1. Alert rules (Prometheus) =====

**/etc/prometheus/rules/gateway-alerts.yml:**

<code yaml>
groups:
  - name: data-gateway
    interval: 30s
    rules:
      # Gateway is unreachable
      - alert: GatewayDown
        expr: up{job="data-gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Data Gateway is unreachable"
          description: "{{ $labels.instance }} has been unreachable for 1 minute."

      # High error rate
      - alert: GatewayHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="data-gateway",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="data-gateway"}[5m]))
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate in the Gateway"
          description: "Error rate is {{ $value | humanizePercentage }} (> 5%)."

      # Slow responses
      - alert: GatewaySlowResponses
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{job="data-gateway"}[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Gateway is responding slowly"
          description: "P95 response time is {{ $value | humanizeDuration }}."

      # High memory usage
      - alert: GatewayHighMemory
        expr: |
          process_resident_memory_bytes{job="data-gateway"}
          / 1024 / 1024 > 450
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Gateway is using a lot of memory"
          description: "Memory usage is {{ $value | humanize }}MB."

      # Certificate expires soon
      - alert: GatewayCertExpiringSoon
        expr: |
          (probe_ssl_earliest_cert_expiry{job="gateway-tls"} - time()) / 86400 < 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Gateway TLS certificate expires soon"
          description: "Certificate expires in {{ $value | humanize }} days."
</code>

----

===== 2. Update the Prometheus config =====

**/etc/prometheus/prometheus.yml:**

<code yaml>
rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
</code>

<code bash>
# Reload Prometheus (the reload endpoint only works if Prometheus
# was started with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
</code>

----

===== 3. Alertmanager configuration =====

**/etc/alertmanager/alertmanager.yml:**

<code yaml>
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'secret'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'

  routes:
    # Critical alerts are sent immediately
    - match:
        severity: critical
      receiver: 'critical'
      group_wait: 10s
      repeat_interval: 1h

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@example.com'
        send_resolved: true

  - name: 'critical'
    email_configs:
      - to: 'oncall@example.com'
        send_resolved: true
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#alerts-critical'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
</code>

----

===== 4. Slack integration =====

<code yaml>
# Slack part only
receivers:
  - name: 'slack-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#gateway-alerts'
        send_resolved: true
        title: '{{ if eq .Status "firing" }}WARNING{{ else }}RESOLVED{{ end }} {{ .CommonAnnotations.summary }}'
        text: |
          *Alert:* {{ .CommonLabels.alertname }}
          *Severity:* {{ .CommonLabels.severity }}
          *Description:* {{ .CommonAnnotations.description }}
          *Instance:* {{ .CommonLabels.instance }}
</code>

----

===== 5. Microsoft Teams =====

<code yaml>
# Via the prometheus-msteams webhook
receivers:
  - name: 'teams-alerts'
    webhook_configs:
      - url: 'http://prometheus-msteams:2000/gateway'
        send_resolved: true
</code>

----

===== 6. Test alert =====

<code bash>
# Check the alert rules
curl http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | {alert: .name, state: .state}'

# Active alerts
curl http://localhost:9090/api/v1/alerts | jq '.data.alerts[]'

# Alertmanager status
curl http://localhost:9093/api/v2/status | jq

# Send a test alert
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {"alertname": "TestAlert", "severity": "warning"},
    "annotations": {"summary": "Test-Alert", "description": "This is a test."}
  }]'
</code>

----

===== 7. Checklist =====

| # | Check | Yes/No |
|---|-------|--------|
| 1 | Alert rules created | - |
| 2 | Prometheus config updated | - |
| 3 | Alertmanager configured | - |
| 4 | Receiver tested (e-mail/Slack) | - |
| 5 | Test alert received | - |

----

===== Troubleshooting =====

| Problem | Cause | Solution |
|---------|-------|----------|
| ''No alerts'' | Rule syntax is wrong | ''promtool check rules rules.yml'' |
| ''Alert not firing'' | Condition is not met | Test the query manually |
| ''No notification'' | Receiver is misconfigured | Check the Alertmanager logs |
| ''Duplicate alerts'' | Wrong grouping | Adjust ''group_by'' |

----

===== Recommended thresholds =====

| Alert | Threshold | Duration (''for'') |
|-------|-----------|--------------------|
| GatewayDown | up == 0 | 1m |
| GatewayHighErrorRate | > 5% | 5m |
| GatewaySlowResponses | p95 > 2s | 5m |
| GatewayHighMemory | > 450MB | 10m |
| GatewayCertExpiringSoon | < 14 days | 1h |

----

===== Related runbooks =====

  * [[.:prometheus|Prometheus]] - Metrics collection
  * [[.:grafana-dashboard|Grafana]] - Visualization
  * [[..:sicherheit:tls-einrichten|TLS setup]] - Certificate alerting

----

<< [[.:grafana-dashboard|<- Grafana Dashboard]] | [[..:start|-> Operator overview]] >>

----

//Wolfgang van der Stille @ EMSR DATA d.o.o. - Data Gateway Professional//

{{tag>operator runbook alerting prometheus alertmanager}}
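
The test alert from step 6 stays active until Alertmanager's ''resolve_timeout'' expires, because the curl payload carries no end time. The following is a minimal sketch, not part of the original runbook, of how the same payload could be built with explicit ''startsAt'' and ''endsAt'' timestamps so that the test alert resolves on its own. The helper names ''build_test_alert'' and ''send_test_alert'' and the 5-minute lifetime are assumptions chosen for illustration; the endpoint and labels are the ones used in step 6.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# Same endpoint as the curl call in step 6.
ALERTMANAGER_URL = "http://localhost:9093/api/v2/alerts"


def build_test_alert(now, ttl_minutes=5, severity="warning"):
    """Build the payload for one test alert that ends ttl_minutes after `now`.

    Hypothetical helper: the name and the default lifetime are assumptions.
    `now` must be a timezone-aware datetime.
    """
    ends_at = now + timedelta(minutes=ttl_minutes)
    return [{
        "labels": {"alertname": "TestAlert", "severity": severity},
        "annotations": {"summary": "Test-Alert", "description": "This is a test."},
        "startsAt": now.isoformat(),
        "endsAt": ends_at.isoformat(),
    }]


def send_test_alert(payload, url=ALERTMANAGER_URL):
    """POST the payload to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    payload = build_test_alert(datetime.now(timezone.utc))
    print(json.dumps(payload, indent=2))
    # Uncomment to send it (requires a reachable Alertmanager):
    # print(send_test_alert(payload))
```

With the routing from step 3, passing ''severity="critical"'' should exercise the ''critical'' receiver (e-mail and Slack), while the default ''warning'' goes to the ''default'' receiver.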