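====== Runbook: Alerting ======
**Duration:** ~15 minutes \\
**Role:** DevOps, SRE \\
**Prerequisites:** Prometheus, Alertmanager

Set up automatic notifications for Data Gateway problems: Prometheus evaluates the alert rules, and Alertmanager routes the resulting alerts to e-mail, Slack, or Microsoft Teams.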
----
===== Workflow =====
<mermaid>
flowchart TD
    A[Start] --> B[Define alert rules]
    B --> C[Configure Alertmanager]
    C --> D[Set up receivers]
    D --> E[Trigger a test alert]
    E --> F{Notified?}
    F -->|Yes| G[Done]
    F -->|No| H[Check the configuration]
    style G fill:#e8f5e9
    style H fill:#ffebee
</mermaid>
----
===== 1. Alert Rules (Prometheus) =====
**/etc/prometheus/rules/gateway-alerts.yml:**
<code yaml>
groups:
  - name: data-gateway
    interval: 30s
    rules:
      # Gateway is unreachable
      - alert: GatewayDown
        expr: up{job="data-gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Data Gateway is unreachable"
          description: "{{ $labels.instance }} has been unreachable for 1 minute."

      # High error rate (share of 5xx responses)
      - alert: GatewayHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="data-gateway",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="data-gateway"}[5m]))
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate in Gateway"
          description: "The error rate is {{ $value | humanizePercentage }} (> 5%)."

      # Slow response times (95th percentile)
      - alert: GatewaySlowResponses
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{job="data-gateway"}[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Gateway is responding slowly"
          description: "The P95 response time is {{ $value | humanizeDuration }}."

      # High memory usage (resident memory converted from bytes to MB)
      - alert: GatewayHighMemory
        expr: |
          process_resident_memory_bytes{job="data-gateway"}
          / 1024 / 1024 > 450
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Gateway is using a lot of memory"
          description: "Memory usage is {{ $value | humanize }}MB."

      # Certificate expires soon (remaining seconds converted to days)
      - alert: GatewayCertExpiringSoon
        expr: |
          (probe_ssl_earliest_cert_expiry{job="gateway-tls"} - time())
          / 86400 < 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Gateway TLS certificate expires soon"
          description: "The certificate expires in {{ $value | humanize }} days."
</code>
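
The rule file can be syntax-checked offline before Prometheus loads it. The following is a minimal sketch, assuming ''promtool'' (shipped with Prometheus) is on the ''PATH''; the function name ''check_gateway_rules'' is hypothetical and not part of any tool.

```shell
# Validate a rule file with promtool; returns non-zero when the file is invalid.
# The default path is the rule file used in this runbook.
check_gateway_rules() {
  rules_file="${1:-/etc/prometheus/rules/gateway-alerts.yml}"
  if promtool check rules "$rules_file"; then
    echo "OK: $rules_file"
  else
    echo "INVALID: $rules_file" >&2
    return 1
  fi
}
```

Calling ''check_gateway_rules'' without arguments checks the default path; a different file can be passed as the first argument.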
----
===== 2. Update the Prometheus Configuration =====
**/etc/prometheus/prometheus.yml:**
<code yaml>
rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
</code>
<code bash>
# Reload Prometheus (the endpoint is only available when
# Prometheus runs with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
</code>
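
It can help to validate the configuration first and reload only when the check passes. The sketch below assumes ''promtool'' and ''curl'' are installed and that Prometheus was started with ''--web.enable-lifecycle''; ''safe_reload'' is a hypothetical helper name.

```shell
# Reload Prometheus only if the configuration (including rule_files) validates.
safe_reload() {
  config="${1:-/etc/prometheus/prometheus.yml}"
  prom_url="${2:-http://localhost:9090}"
  promtool check config "$config" || {
    echo "config check failed, not reloading" >&2
    return 1
  }
  # -f makes curl exit non-zero on an HTTP error status
  curl -fsS -X POST "$prom_url/-/reload" && echo "reloaded"
}
```

Both arguments are optional; the defaults match the paths and port used in this runbook.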
----
===== 3. Alertmanager Configuration =====
**/etc/alertmanager/alertmanager.yml:**
<code yaml>
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'secret'  # placeholder, replace with the real credential

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    # Critical alerts: shorter wait, more frequent repeats
    - match:
        severity: critical
      receiver: 'critical'
      group_wait: 10s
      repeat_interval: 1h

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@example.com'
        send_resolved: true
  - name: 'critical'
    email_configs:
      - to: 'oncall@example.com'
        send_resolved: true
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#alerts-critical'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

# Mute a warning while a critical alert with the same
# alertname and instance is firing
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
</code>
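
The routing tree can be checked without sending any notification. This is a sketch assuming ''amtool'' (shipped with Alertmanager) is installed; ''check_routing'' and the ''AM_CONFIG'' variable are hypothetical names chosen for illustration.

```shell
# Validate alertmanager.yml, then print which receiver a label set would reach.
check_routing() {
  am_config="${AM_CONFIG:-/etc/alertmanager/alertmanager.yml}"
  amtool check-config "$am_config" || return 1
  # Remaining arguments are label=value pairs, e.g. severity=critical
  amtool config routes test --config.file="$am_config" "$@"
}
```

With the routing shown in this section, ''check_routing severity=critical'' is expected to name the ''critical'' receiver, and ''check_routing severity=warning'' the ''default'' receiver.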
----
===== 4. Slack Integration =====
<code yaml>
# Slack receiver only; reference it from a route with receiver: 'slack-alerts'
receivers:
  - name: 'slack-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#gateway-alerts'
        send_resolved: true
        title: '{{ if eq .Status "firing" }}ALERT{{ else }}RESOLVED{{ end }} {{ .CommonAnnotations.summary }}'
        text: |
          *Alert:* {{ .CommonLabels.alertname }}
          *Severity:* {{ .CommonLabels.severity }}
          *Description:* {{ .CommonAnnotations.description }}
          *Instance:* {{ .CommonLabels.instance }}
</code>
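
The webhook itself can be verified with a plain message before Alertmanager is involved. This is a sketch, assuming the Slack incoming webhook URL is exported as ''SLACK_WEBHOOK_URL'' (a hypothetical variable name; the URL in the configuration is a placeholder) and that ''curl'' is installed.

```shell
# Post a plain text message to a Slack incoming webhook.
slack_ping() {
  if [ -z "${SLACK_WEBHOOK_URL:-}" ]; then
    echo "SLACK_WEBHOOK_URL is not set" >&2
    return 1
  fi
  # -f makes curl exit non-zero if Slack rejects the request
  curl -fsS -X POST -H 'Content-Type: application/json' \
    --data '{"text": "Alertmanager webhook test"}' \
    "$SLACK_WEBHOOK_URL"
}
```

If the message does not show up in the channel, the webhook URL or the channel assignment is the first thing to check, independent of the Alertmanager configuration.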
----
===== 5. Microsoft Teams =====
<code yaml>
# Via the prometheus-msteams webhook proxy
receivers:
  - name: 'teams-alerts'
    webhook_configs:
      - url: 'http://prometheus-msteams:2000/gateway'
        send_resolved: true
</code>
----
===== 6. Test Alert =====
<code bash>
# List the loaded alert rules and their state
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | {alert: .name, state: .state}'

# Active alerts
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[]'

# Alertmanager status
curl -s http://localhost:9093/api/v2/status | jq .

# Send a test alert
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {"alertname": "TestAlert", "severity": "warning"},
    "annotations": {"summary": "Test alert", "description": "This is a test."}
  }]'
</code>
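
A test alert sent without an end time stays active until Alertmanager times it out. To clear it right away, the same label set can be sent again with ''endsAt'' set to the current time. The sketch below assumes ''curl'' and a ''date'' that supports ''-u''; the function names are hypothetical.

```shell
# Build the JSON body for the test alert. An optional endsAt timestamp
# (RFC 3339) marks the alert as resolved once that time has passed.
test_alert_payload() {
  ends_at="${1:-}"
  if [ -n "$ends_at" ]; then
    ends_field=", \"endsAt\": \"$ends_at\""
  else
    ends_field=""
  fi
  printf '[{"labels": {"alertname": "TestAlert", "severity": "warning"}, "annotations": {"summary": "Test alert", "description": "This is a test."}%s}]\n' "$ends_field"
}

# Resolve the test alert by re-sending it with endsAt set to now (UTC).
resolve_test_alert() {
  am_url="${1:-http://localhost:9093}"
  now="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  test_alert_payload "$now" | curl -fsS -X POST "$am_url/api/v2/alerts" \
    -H "Content-Type: application/json" -d @-
}
```

Because the receivers set ''send_resolved: true'', resolving the test alert is also a way to confirm that resolved notifications arrive.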
----
===== 7. Checklist =====
| # | Checkpoint | Done |
|---|-----------|---|
| 1 | Alert rules created | |
| 2 | Prometheus configuration updated | |
| 3 | Alertmanager configured | |
| 4 | Receivers tested (e-mail/Slack) | |
| 5 | Test alert received | |
----
===== Troubleshooting =====
| Problem | Cause | Solution |
|---------|---------|--------|
| ''No alerts'' | Invalid rule syntax | ''promtool check rules /etc/prometheus/rules/gateway-alerts.yml'' |
| ''Alert not firing'' | Condition not met | Run the rule expression manually and compare the result with the threshold |
| ''No notification'' | Wrong receiver | Check the Alertmanager logs |
| ''Duplicate alerts'' | Incorrect grouping | Adjust ''group_by'' |
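
For the ''Alert not firing'' case, the rule expression can be evaluated directly against the Prometheus HTTP API. This is a minimal sketch, assuming ''curl'' and ''jq'' are installed; ''promql'' and the ''PROM_URL'' variable are hypothetical names.

```shell
# Evaluate a PromQL expression via the Prometheus HTTP API and print the result.
promql() {
  prom_url="${PROM_URL:-http://localhost:9090}"
  # --data-urlencode takes care of the quotes and braces in the expression
  curl -fsS -G "$prom_url/api/v1/query" --data-urlencode "query=$1" | jq '.data.result'
}
```

For example, ''promql 'up{job="data-gateway"}' '' shows the current scrape state per instance. An empty result (''[]'') means no series match the selector, which usually points to a wrong ''job'' label rather than a healthy Gateway.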
----
===== Recommended Thresholds =====
| Alert | Threshold | Duration (''for'') |
|-------|-------------|-------|
| GatewayDown | up == 0 | 1m |
| GatewayHighErrorRate | > 5% | 5m |
| GatewaySlowResponses | p95 > 2s | 5m |
| GatewayHighMemory | > 450MB | 10m |
| GatewayCertExpiringSoon | < 14 days | 1h |
----
===== Related Runbooks =====
  * [[.:prometheus|Prometheus]] - Metrics collection
  * [[.:grafana-dashboard|Grafana]] - Visualization
  * [[..:sicherheit:tls-einrichten|TLS Setup]] - Certificate alerting
----
<< [[.:grafana-dashboard|<- Grafana Dashboard]] | [[..:start|-> Operator Overview]] >>
----
//Wolfgang van der Stille @ EMSR DATA d.o.o. - Data Gateway Professional//
{{tag>operator runbook alerting prometheus alertmanager}}