====== Runbook: Alerting ======
**Duration:** ~15 minutes \\
**Role:** DevOps, SRE \\
**Prerequisites:** Prometheus, Alertmanager

Automatic notifications when the Gateway has problems.
----
===== Workflow =====
flowchart TD
A[Start] --> B[Define alert rules]
B --> C[Configure Alertmanager]
C --> D[Set up receiver]
D --> E[Trigger test alert]
E --> F{Notified?}
F -->|Yes| G[Done]
F -->|No| H[Check config]
style G fill:#e8f5e9
style H fill:#ffebee
----
===== 1. Alert rules (Prometheus) =====
**/etc/prometheus/rules/gateway-alerts.yml:**
  groups:
    - name: data-gateway
      interval: 30s
      rules:
        # Gateway is unreachable
        - alert: GatewayDown
          expr: up{job="data-gateway"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Data Gateway is unreachable"
            description: "{{ $labels.instance }} has been unreachable for more than 1 minute."
        # High error rate (share of 5xx responses)
        - alert: GatewayHighErrorRate
          expr: |
            sum(rate(http_requests_total{job="data-gateway",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="data-gateway"}[5m]))
            > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High error rate in the Gateway"
            description: "Error rate is {{ $value | humanizePercentage }} (> 5%)."
        # Slow responses
        - alert: GatewaySlowResponses
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket{job="data-gateway"}[5m])) by (le)
            ) > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Gateway is responding slowly"
            description: "P95 response time is {{ $value | humanizeDuration }}."
        # High memory usage (bytes converted to MiB)
        - alert: GatewayHighMemory
          expr: |
            process_resident_memory_bytes{job="data-gateway"}
            / 1024 / 1024 > 450
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Gateway is using a lot of memory"
            description: "Memory usage is {{ $value | humanize }} MiB."
        # Certificate expires soon
        - alert: GatewayCertExpiringSoon
          expr: |
            (probe_ssl_earliest_cert_expiry{job="gateway-tls"} - time())
            / 86400 < 14
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Gateway TLS certificate expires soon"
            description: "Certificate expires in {{ $value | humanize }} days."
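
To make the thresholds above concrete, the following minimal sketch reproduces the arithmetic of two of the expressions outside Prometheus. The function names and the sample numbers are illustrative assumptions; only the formulas (5xx rate divided by total rate compared with 0.05, and seconds until expiry divided by 86400 compared with 14) are taken from the rules.

```shell
#!/bin/sh
# Hypothetical helpers that mirror the arithmetic of two alert expressions.

# error_rate_fires ERRORS_PER_SEC TOTAL_PER_SEC
# Succeeds (exit 0) when errors/total > 0.05, like GatewayHighErrorRate.
# With no traffic at all the ratio is undefined and nothing fires.
error_rate_fires() {
    awk -v e="$1" -v t="$2" 'BEGIN { exit !(t > 0 && e / t > 0.05) }'
}

# cert_days_left EXPIRY_EPOCH NOW_EPOCH
# Prints the remaining days, like
# (probe_ssl_earliest_cert_expiry - time()) / 86400.
cert_days_left() {
    awk -v x="$1" -v n="$2" 'BEGIN { printf "%.1f\n", (x - n) / 86400 }'
}

# cert_fires EXPIRY_EPOCH NOW_EPOCH
# Succeeds when fewer than 14 days remain, like GatewayCertExpiringSoon.
cert_fires() {
    awk -v x="$1" -v n="$2" 'BEGIN { exit !((x - n) / 86400 < 14) }'
}
```

For example, 12 failed requests per second out of 200 is 6% and satisfies the condition, while 8 out of 200 is 4% and does not. The rule still waits for its ''for'' duration before the alert actually fires.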
----
===== 2. Update the Prometheus config =====
**/etc/prometheus/prometheus.yml:**
  rule_files:
    - "rules/*.yml"
  alerting:
    alertmanagers:
      - static_configs:
          - targets:
              - alertmanager:9093

  # Reload Prometheus (the reload endpoint requires --web.enable-lifecycle)
  curl -X POST http://localhost:9090/-/reload
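
A reload with a broken file is rejected by Prometheus, but checking the file first gives a clearer error message. The sketch below wraps both steps. The function name ''safe_reload'' is a hypothetical helper, and it assumes ''promtool'' and ''curl'' are on the ''PATH''.

```shell
#!/bin/sh
# safe_reload CONFIG_FILE [PROMETHEUS_URL]
# Validates the config (promtool also checks the rule files it references)
# and only then asks Prometheus to reload it.
safe_reload() {
    config="$1"
    url="${2:-http://localhost:9090}"
    if ! promtool check config "$config"; then
        echo "config check failed, not reloading" >&2
        return 1
    fi
    # -f makes curl return non-zero on HTTP errors, e.g. when the
    # lifecycle API is not enabled.
    curl -fsS -X POST "$url/-/reload"
}

# Example call (not executed here):
#   safe_reload /etc/prometheus/prometheus.yml
```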
----
===== 3. Alertmanager configuration =====
**/etc/alertmanager/alertmanager.yml:**
  global:
    resolve_timeout: 5m
    smtp_smarthost: 'smtp.example.com:587'
    smtp_from: 'alertmanager@example.com'
    smtp_auth_username: 'alertmanager@example.com'
    smtp_auth_password: 'secret'
  route:
    group_by: ['alertname', 'severity']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
    receiver: 'default'
    routes:
      # Critical alerts are sent right away
      - match:
          severity: critical
        receiver: 'critical'
        group_wait: 10s
        repeat_interval: 1h
  receivers:
    - name: 'default'
      email_configs:
        - to: 'ops@example.com'
          send_resolved: true
    - name: 'critical'
      email_configs:
        - to: 'oncall@example.com'
          send_resolved: true
      slack_configs:
        - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
          channel: '#alerts-critical'
          send_resolved: true
          title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
          text: '{{ .CommonAnnotations.description }}'
  inhibit_rules:
    # A firing critical alert mutes warnings that share its alertname and instance
    - source_match:
        severity: 'critical'
      target_match:
        severity: 'warning'
      equal: ['alertname', 'instance']
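
Before restarting Alertmanager, the routing tree can be checked offline with ''amtool''. The sketch below validates the file and then verifies that a given label set reaches the expected receiver. ''check_routing'' is a hypothetical helper name, and it assumes ''amtool'' is installed.

```shell
#!/bin/sh
# check_routing CONFIG_FILE EXPECTED_RECEIVER LABEL=VALUE...
# Fails if the config is invalid or if the label set is routed to a
# receiver other than EXPECTED_RECEIVER.
check_routing() {
    config="$1"
    expected="$2"
    shift 2
    amtool check-config "$config" || return 1
    amtool config routes test --config.file="$config" \
        --verify.receivers="$expected" "$@"
}

# Example calls (not executed here):
#   check_routing /etc/alertmanager/alertmanager.yml critical severity=critical
#   check_routing /etc/alertmanager/alertmanager.yml default  severity=warning
```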
----
===== 4. Slack integration =====
  # Slack part only
  receivers:
    - name: 'slack-alerts'
      slack_configs:
        - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
          channel: '#gateway-alerts'
          send_resolved: true
          title: '{{ if eq .Status "firing" }}ALERT{{ else }}RESOLVED{{ end }} {{ .CommonAnnotations.summary }}'
          text: |
            *Alert:* {{ .CommonLabels.alertname }}
            *Severity:* {{ .CommonLabels.severity }}
            *Description:* {{ .CommonAnnotations.description }}
            *Instance:* {{ .CommonLabels.instance }}
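
When no Slack message arrives, it helps to test the webhook on its own, without Alertmanager in between. The following sketch posts a plain text message to a Slack incoming webhook. ''slack_smoke_test'' is a hypothetical helper name, and the webhook URL is whatever is configured as ''api_url''.

```shell
#!/bin/sh
# slack_smoke_test WEBHOOK_URL
# Sends a fixed text message to a Slack incoming webhook.
# -f makes curl fail on HTTP errors such as a revoked webhook.
slack_smoke_test() {
    webhook_url="$1"
    curl -fsS -X POST -H 'Content-Type: application/json' \
        --data '{"text": "Alertmanager webhook smoke test"}' "$webhook_url"
}

# Example call (not executed here):
#   slack_smoke_test 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
```

If this message shows up in the channel but alerts do not, the problem is more likely in the routing or the receiver name than in the webhook itself.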
----
===== 5. Microsoft Teams =====
  # Via the prometheus-msteams webhook proxy
  receivers:
    - name: 'teams-alerts'
      webhook_configs:
        - url: 'http://prometheus-msteams:2000/gateway'
          send_resolved: true
----
===== 6. Test alert =====
  # Check the alert rules
  curl http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | {alert: .name, state: .state}'

  # Active alerts
  curl http://localhost:9090/api/v1/alerts | jq '.data.alerts[]'

  # Alertmanager status
  curl http://localhost:9093/api/v2/status | jq .

  # Send a test alert
  curl -X POST http://localhost:9093/api/v2/alerts \
    -H "Content-Type: application/json" \
    -d '[{
      "labels": {"alertname": "TestAlert", "severity": "warning"},
      "annotations": {"summary": "Test alert", "description": "This is a test."}
    }]'
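
A test alert posted without an end time stays active until ''resolve_timeout'' has passed. The sketch below builds the same payload with an explicit ''endsAt'' field, so the alert resolves at a known time and the resolved notification can be checked as well. ''test_alert_payload'' is a hypothetical helper name.

```shell
#!/bin/sh
# test_alert_payload ENDS_AT
# Prints a one-element JSON array for POST /api/v2/alerts. ENDS_AT is an
# RFC 3339 timestamp after which Alertmanager treats the alert as resolved.
test_alert_payload() {
    cat <<EOF
[{
  "labels": {"alertname": "TestAlert", "severity": "warning"},
  "annotations": {"summary": "Test alert", "description": "This is a test."},
  "endsAt": "$1"
}]
EOF
}

# Example call (not executed here; `date -d` is GNU-specific):
#   test_alert_payload "$(date -u -d '+5 minutes' +%Y-%m-%dT%H:%M:%SZ)" |
#     curl -fsS -X POST http://localhost:9093/api/v2/alerts \
#          -H 'Content-Type: application/json' -d @-
```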
----
===== 7. Checklist =====
| # | Check | Yes/No |
|---|-----------|---|
| 1 | Alert rules created | - |
| 2 | Prometheus config updated | - |
| 3 | Alertmanager configured | - |
| 4 | Receiver tested (email/Slack) | - |
| 5 | Test alert received | - |
----
===== Troubleshooting =====
| Problem | Cause | Solution |
|---------|---------|--------|
| ''No alerts'' | Rule syntax is invalid | ''promtool check rules /etc/prometheus/rules/gateway-alerts.yml'' |
| ''Alert not firing'' | Condition is not met | Test the query manually |
| ''No notification'' | Receiver is misconfigured | Check the Alertmanager logs |
| ''Duplicate alerts'' | Incorrect grouping | Adjust ''group_by'' |
----
===== Recommended thresholds =====
| Alert | Threshold | Duration |
|-------|-------------|-------|
| GatewayDown | up == 0 | 1m |
| GatewayHighErrorRate | > 5% | 5m |
| GatewaySlowResponses | p95 > 2s | 5m |
| GatewayHighMemory | > 450 MiB | 10m |
| GatewayCertExpiringSoon | < 14 days | 1h |
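
The duration column is not the whole time until a notification arrives. As a rough sketch that ignores scrape and network latency, the delay is the ''for'' duration plus the route's ''group_wait'', plus up to one rule evaluation interval until the condition is first seen. The helper name below is hypothetical; the sample values are the ''for'', ''interval'' and ''group_wait'' settings from this runbook.

```shell
#!/bin/sh
# notify_delay_bounds FOR_SECONDS EVAL_INTERVAL_SECONDS GROUP_WAIT_SECONDS
# Prints an approximate lower and upper bound (in seconds) for the time
# between the start of a problem and the first notification.
notify_delay_bounds() {
    for_s="$1"
    eval_s="$2"
    wait_s="$3"
    echo "$(( $for_s + $wait_s )) $(( $for_s + $eval_s + $wait_s ))"
}

# GatewayDown: for 1m, evaluation interval 30s, group_wait 10s (critical route)
#   notify_delay_bounds 60 30 10
# Warning alerts with for 5m: evaluation interval 30s, group_wait 30s (default route)
#   notify_delay_bounds 300 30 30
```

With these settings a critical ''GatewayDown'' notification can be expected roughly 70 to 100 seconds after the Gateway stops responding to scrapes.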
----
===== Related runbooks =====
  * [[.:prometheus|Prometheus]] - Metrics collection
  * [[.:grafana-dashboard|Grafana]] - Visualization
  * [[..:sicherheit:tls-einrichten|TLS setup]] - Certificate alerting
----
<< [[.:grafana-dashboard|<- Grafana Dashboard]] | [[..:start|-> Operator overview]] >>
----
//Wolfgang van der Stille @ EMSR DATA d.o.o. - Data Gateway Professional//
{{tag>operator runbook alerting prometheus alertmanager}}