Runbook: Alerting

Duration: ~15 minutes
Role: DevOps, SRE
Prerequisites: Prometheus, Alertmanager

This runbook sets up automatic notifications for Data Gateway problems.


Workflow

flowchart TD
    A[Start] --> B[Define alert rules]
    B --> C[Configure Alertmanager]
    C --> D[Set up receiver]
    D --> E[Trigger test alert]
    E --> F{Notified?}
    F -->|Yes| G[Done]
    F -->|No| H[Check config]
    style G fill:#e8f5e9
    style H fill:#ffebee


1. Alert rules (Prometheus)

/etc/prometheus/rules/gateway-alerts.yml:

groups:
  - name: data-gateway
    interval: 30s
    rules:
      # Gateway is down
      - alert: GatewayDown
        expr: up{job="data-gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Data Gateway is down"
          description: "{{ $labels.instance }} has been down for 1 minute."
 
      # High error rate
      - alert: GatewayHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="data-gateway",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="data-gateway"}[5m]))
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate in the Gateway"
          description: "Error rate is {{ $value | humanizePercentage }} (> 5%)."
 
      # Slow responses
      - alert: GatewaySlowResponses
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{job="data-gateway"}[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Gateway is responding slowly"
          description: "P95 response time is {{ $value | humanizeDuration }}."
 
      # High memory usage
      - alert: GatewayHighMemory
        expr: |
          process_resident_memory_bytes{job="data-gateway"}
          / 1024 / 1024 > 450
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Gateway memory usage is high"
          description: "Memory usage is {{ $value | humanize }}MB."
 
      # Certificate expiring soon
      - alert: GatewayCertExpiringSoon
        expr: |
          (probe_ssl_earliest_cert_expiry{job="gateway-tls"} - time())
          / 86400 < 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Gateway TLS certificate expires soon"
          description: "Certificate expires in {{ $value | humanize }} days."
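The GatewayHighErrorRate rule is the one most often misread: it fires on a ratio, not an absolute count. A minimal sketch of the same arithmetic, using hypothetical per-second rates (the function name and sample numbers are illustrative, not part of the config):

```python
# Illustrative sketch of the GatewayHighErrorRate ratio:
# sum(rate(...5xx...)) / sum(rate(...total...)) > 0.05

def error_rate(rate_5xx: float, rate_total: float) -> float:
    """Fraction of requests answered with a 5xx status."""
    if rate_total == 0:
        return 0.0  # no traffic -> no error rate (the PromQL result would be absent)
    return rate_5xx / rate_total

# Hypothetical sample: 12 errors/s out of 180 requests/s
ratio = error_rate(12.0, 180.0)
print(round(ratio, 4))   # 0.0667
print(ratio > 0.05)      # True -> the alert fires once this holds for 5m
```

Note the `for: 5m` clause: the ratio must stay above 5% for the full five minutes before the alert transitions from pending to firing.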

2. Update the Prometheus config

/etc/prometheus/prometheus.yml:

rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Reload Prometheus
curl -X POST http://localhost:9090/-/reload

3. Alertmanager configuration

/etc/alertmanager/alertmanager.yml:

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'secret'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    # Route critical alerts immediately
    - match:
        severity: critical
      receiver: 'critical'
      group_wait: 10s
      repeat_interval: 1h

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@example.com'
        send_resolved: true

  - name: 'critical'
    email_configs:
      - to: 'oncall@example.com'
        send_resolved: true
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#alerts-critical'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
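The inhibit_rules entry above keeps a critical alert from also paging as its warning-level shadow. A simplified sketch of that suppression logic (illustrative only, not Alertmanager's actual implementation):

```python
# Sketch of the inhibit_rules behaviour: a firing critical alert suppresses
# warning alerts that carry the same alertname and instance labels.

def is_inhibited(target: dict, firing: list,
                 equal=("alertname", "instance")) -> bool:
    if target.get("severity") != "warning":     # target_match: severity=warning
        return False
    return any(
        src.get("severity") == "critical"       # source_match: severity=critical
        and all(src.get(k) == target.get(k) for k in equal)
        for src in firing
    )

firing = [{"alertname": "GatewayDown", "severity": "critical", "instance": "gw-1"}]
warn   = {"alertname": "GatewayDown", "severity": "warning",  "instance": "gw-1"}
print(is_inhibited(warn, firing))   # True -> the warning is suppressed
```

Without the `equal` labels, one critical alert anywhere would silence every warning in the fleet, so keep them in place.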

4. Slack integration

# Slack receiver section only
receivers:
  - name: 'slack-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#gateway-alerts'
        send_resolved: true
        title: '{{ if eq .Status "firing" }}ALERT{{ else }}RESOLVED{{ end }} {{ .CommonAnnotations.summary }}'
        text: |
          *Alert:* {{ .CommonLabels.alertname }}
          *Severity:* {{ .CommonLabels.severity }}
          *Description:* {{ .CommonAnnotations.description }}
          *Instance:* {{ .CommonLabels.instance }}

5. Microsoft Teams

# Via the prometheus-msteams webhook bridge
receivers:
  - name: 'teams-alerts'
    webhook_configs:
      - url: 'http://prometheus-msteams:2000/gateway'
        send_resolved: true
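Teams has no native Alertmanager receiver, so the prometheus-msteams bridge translates Alertmanager's webhook payload into a Teams card. A simplified, illustrative sketch of that translation (the field names follow the legacy MessageCard format; this is not the bridge's actual code):

```python
# Illustrative sketch: turn an Alertmanager webhook payload into a Teams
# MessageCard. The real prometheus-msteams bridge does this with templates.

def to_teams_card(am_payload: dict) -> dict:
    alert = am_payload["alerts"][0]             # sketch: first alert only
    status = am_payload.get("status", "firing").upper()
    return {
        "@type": "MessageCard",
        "@context": "http://schema.org/extensions",
        "title": f"{status}: {alert['annotations'].get('summary', '')}",
        "text": alert["annotations"].get("description", ""),
    }

payload = {
    "status": "firing",
    "alerts": [{"annotations": {"summary": "Data Gateway is down",
                                "description": "gw-1 has been down for 1 minute."}}],
}
card = to_teams_card(payload)
print(card["title"])   # FIRING: Data Gateway is down
```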

6. Test alert

# Check alert rules
curl http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | {alert: .name, state: .state}'
 
# Active alerts
curl http://localhost:9090/api/v1/alerts | jq '.data.alerts[]'
 
# Alertmanager status
curl http://localhost:9093/api/v2/status | jq
 
# Send a test alert
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {"alertname": "TestAlert", "severity": "warning"},
    "annotations": {"summary": "Test-Alert", "description": "Ovo je test."}
  }]'

7. Checklist

#  Check                          Yes/No
1  Alert rules created            -
2  Prometheus config updated      -
3  Alertmanager configured        -
4  Receiver tested (email/Slack)  -
5  Test alert received            -

Troubleshooting

Problem            Cause               Solution
No alerts          Rule syntax error   promtool check rules rules.yml
Alert not firing   Condition not met   Test the query manually
No notification    Wrong receiver      Check the Alertmanager logs
Duplicate alerts   Wrong grouping      Adjust group_by

Recommended thresholds

Alert           Threshold   Duration
GatewayDown     up == 0     1m
HighErrorRate   > 5%        5m
SlowResponses   p95 > 2s    5m
HighMemory      > 450MB     10m
CertExpiring    < 14 days   1h
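The CertExpiring row can be checked with the same arithmetic the GatewayCertExpiringSoon rule uses: (expiry timestamp - now) / 86400 < 14. A minimal sketch with hypothetical timestamps (in Prometheus this runs over probe_ssl_earliest_cert_expiry):

```python
# Mirrors the GatewayCertExpiringSoon expression:
# (probe_ssl_earliest_cert_expiry - time()) / 86400 < 14

def days_until_expiry(expiry_ts: float, now_ts: float) -> float:
    """Convert a Unix-timestamp difference into days (86400 s per day)."""
    return (expiry_ts - now_ts) / 86400

# Hypothetical: certificate expires 10 days from "now"
now = 1_700_000_000
expiry = now + 10 * 86400
days = days_until_expiry(expiry, now)
print(days)        # 10.0
print(days < 14)   # True -> alert fires after the 1h "for" window
```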

Related runbooks


« <- Grafana Dashboard | -> Operator overview »


Wolfgang van der Stille @ EMSR DATA d.o.o. - Data Gateway Professional

Last modified: 29.01.2026 at 23:40