Runbook: Alerting

Duration: ~15 minutes
Role: DevOps, SRE
Prerequisites: Prometheus, Alertmanager

Automatic notifications when the Data Gateway has problems.


Workflow

flowchart TD
    A[Start] --> B[Define alert rules]
    B --> C[Configure Alertmanager]
    C --> D[Set up receivers]
    D --> E[Trigger test alert]
    E --> F{Notified?}
    F -->|Yes| G[Done]
    F -->|No| H[Check configuration]
    style G fill:#e8f5e9
    style H fill:#ffebee


1. Alert rules (Prometheus)

/etc/prometheus/rules/gateway-alerts.yml:

groups:
  - name: data-gateway
    interval: 30s
    rules:
      # Gateway is unreachable
      - alert: GatewayDown
        expr: up{job="data-gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Data Gateway is unreachable"
          description: "{{ $labels.instance }} has been unreachable for at least 1 minute."
 
      # High error rate
      - alert: GatewayHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="data-gateway",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="data-gateway"}[5m]))
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate in Gateway"
          description: "The error rate is {{ $value | humanizePercentage }} (> 5%)."
 
      # Slow response times
      - alert: GatewaySlowResponses
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{job="data-gateway"}[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Gateway is responding slowly"
          description: "P95 response time is {{ $value | humanizeDuration }}."
 
      # High memory usage
      - alert: GatewayHighMemory
        expr: |
          process_resident_memory_bytes{job="data-gateway"}
          / 1024 / 1024 > 450
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Gateway memory usage is high"
          description: "Memory usage is {{ $value | humanize }} MiB."
 
      # Certificate expires soon
      - alert: GatewayCertExpiringSoon
        expr: |
          (probe_ssl_earliest_cert_expiry{job="gateway-tls"} - time())
          / 86400 < 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Gateway TLS certificate expires soon"
          description: "The certificate expires in {{ $value | humanize }} days."

2. Updating the Prometheus configuration

/etc/prometheus/prometheus.yml:

rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Reload Prometheus (the endpoint is only available when Prometheus runs with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
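
The rules in section 1 select series by job="data-gateway" and job="gateway-tls", and probe_ssl_earliest_cert_expiry is a metric exposed by the Prometheus blackbox exporter. A minimal sketch of matching scrape jobs in prometheus.yml could look like the following. The target addresses (gateway.example.com:8080, blackbox-exporter:9115) and the http_2xx module name are assumptions for illustration, not values taken from this runbook.

```yaml
scrape_configs:
  # Gateway metrics; provides up{job="data-gateway"} and the http_* series.
  # The host and port are hypothetical.
  - job_name: data-gateway
    static_configs:
      - targets: ['gateway.example.com:8080']

  # TLS probe through the blackbox exporter; provides
  # probe_ssl_earliest_cert_expiry{job="gateway-tls"}.
  - job_name: gateway-tls
    metrics_path: /probe
    params:
      module: [http_2xx]          # assumed module name in blackbox.yml
    static_configs:
      - targets: ['https://gateway.example.com']   # hypothetical URL to probe
    relabel_configs:
      # Pass the probed URL as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Send the scrape itself to the blackbox exporter (hypothetical address)
      - target_label: __address__
        replacement: blackbox-exporter:9115
```

If the job names differ in an existing setup, the job selectors in the alert rules need to be adjusted to match.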

3. Alertmanager configuration

/etc/alertmanager/alertmanager.yml:

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'secret'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    # Critical alerts: notify sooner and repeat more often
    - match:
        severity: critical
      receiver: 'critical'
      group_wait: 10s
      repeat_interval: 1h

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@example.com'
        send_resolved: true

  - name: 'critical'
    email_configs:
      - to: 'oncall@example.com'
        send_resolved: true
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#alerts-critical'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
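
Alertmanager 0.22 and later also accept the matchers syntax, and the upstream documentation marks match, source_match and target_match as deprecated. Assuming such a version is in use, the route and inhibit rule above could be sketched as follows, with the same intended behavior.

```yaml
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    # Same critical route, written with the newer matchers list
    - matchers:
        - severity = "critical"
      receiver: 'critical'
      group_wait: 10s
      repeat_interval: 1h

inhibit_rules:
  # A firing critical alert mutes warnings that share the labels in `equal`
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ['alertname', 'instance']
```

Because equal lists alertname, a warning is muted only by a critical alert that has the same alertname and instance. If the intent is for GatewayDown to mute the other Gateway warnings, the equal list may need adjusting.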

4. Slack integration

# Slack part only
receivers:
  - name: 'slack-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#gateway-alerts'
        send_resolved: true
        title: '{{ if eq .Status "firing" }}ALERT{{ else }}RESOLVED{{ end }} {{ .CommonAnnotations.summary }}'
        text: |
          *Alert:* {{ .CommonLabels.alertname }}
          *Severity:* {{ .CommonLabels.severity }}
          *Description:* {{ .CommonAnnotations.description }}
          *Instance:* {{ .CommonLabels.instance }}
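
A receiver only gets notifications when a route points to it. As a sketch, the slack-alerts receiver above could be wired into the routing tree like this; the alertname pattern is an assumption chosen to cover the Gateway* rules from section 1.

```yaml
route:
  receiver: 'default'
  routes:
    # Send every alert whose name starts with "Gateway" to Slack.
    # Regex matchers are anchored, so this matches the whole alert name.
    - matchers:
        - alertname =~ "Gateway.+"
      receiver: 'slack-alerts'
```

Child routes are evaluated in order and the first match wins unless continue: true is set on it, so the position of this entry relative to the critical route decides which receiver handles critical Gateway alerts.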

5. Microsoft Teams

# Via the prometheus-msteams webhook proxy
receivers:
  - name: 'teams-alerts'
    webhook_configs:
      - url: 'http://prometheus-msteams:2000/gateway'
        send_resolved: true
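
The /gateway path in the URL above is assumed to name a connector defined in the prometheus-msteams configuration, which maps it to a Teams incoming webhook. A sketch of that tool's config file, with a placeholder webhook URL:

```yaml
# prometheus-msteams config file (location depends on the deployment)
connectors:
  # Connector name "gateway" becomes the request path /gateway.
  # The webhook URL is a placeholder; use the one generated in Teams.
  - gateway: "https://outlook.office.com/webhook/XXX/YYY/ZZZ"
```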

6. Test alert

# Check the loaded alert rules
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | {alert: .name, state: .state}'

# Active alerts
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[]'

# Alertmanager status
curl -s http://localhost:9093/api/v2/status | jq .

# Send a test alert
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {"alertname": "TestAlert", "severity": "warning"},
    "annotations": {"summary": "Test alert", "description": "This is a test."}
  }]'

7. Checklist

#   Check                                  Done
--  -------------------------------------  ----
1   Alert rules created
2   Prometheus configuration updated
3   Alertmanager configured
4   Receivers tested (email/Slack)
5   Test alert received

Troubleshooting

Problem           Cause                Solution
----------------  -------------------  -------------------------------------------------------------
No alerts         Invalid rule syntax  promtool check rules /etc/prometheus/rules/gateway-alerts.yml
Alert not firing  Condition not met    Run the query manually in Prometheus
No notification   Wrong receiver       Check the Alertmanager logs
Duplicate alerts  Incorrect grouping   Adjust group_by

Recommended thresholds

Alert                    Threshold   Duration
-----------------------  ----------  --------
GatewayDown              up == 0     1m
GatewayHighErrorRate     > 5%        5m
GatewaySlowResponses     p95 > 2s    5m
GatewayHighMemory        > 450 MiB   10m
GatewayCertExpiringSoon  < 14 days   1h

Related runbooks

Previous: Grafana dashboard
Next: Operator overview


Wolfgang van der Stille @ EMSR DATA d.o.o. - Data Gateway Professional

Last modified: 29.01.2026 at 23:37