====== Runbook: Alerting ======
**Duration:** ~15 minutes \\
**Role:** DevOps, SRE \\
**Prerequisite:** Prometheus, Alertmanager
This runbook sets up automatic notifications for Data Gateway problems.
----
===== Workflow =====
<code>
flowchart TD
    A[Start] --> B[Define alert rules]
    B --> C[Configure Alertmanager]
    C --> D[Set up receivers]
    D --> E[Trigger test alert]
    E --> F{Notified?}
    F -->|Yes| G[Done]
    F -->|No| H[Check config]
    style G fill:#e8f5e9
    style H fill:#ffebee
</code>
----
===== 1. Alert Rules (Prometheus) =====
**/etc/prometheus/rules/gateway-alerts.yml:**
<code yaml>
groups:
  - name: data-gateway
    interval: 30s
    rules:
      # Gateway not reachable
      - alert: GatewayDown
        expr: up{job="data-gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Data Gateway is not reachable"
          description: "{{ $labels.instance }} has been unreachable for 1 minute."

      # High error rate
      - alert: GatewayHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="data-gateway",status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{job="data-gateway"}[5m]))
            > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate in Gateway"
          description: "Error rate is {{ $value | humanizePercentage }} (> 5%)."

      # Slow response times
      - alert: GatewaySlowResponses
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{job="data-gateway"}[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Gateway responding slowly"
          description: "P95 response time is {{ $value | humanizeDuration }}."

      # High memory usage
      - alert: GatewayHighMemory
        expr: |
          process_resident_memory_bytes{job="data-gateway"}
            / 1024 / 1024 > 450
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Gateway using high memory"
          description: "Memory usage is {{ $value | humanize }}MB."

      # Certificate expiring soon
      - alert: GatewayCertExpiringSoon
        expr: |
          (probe_ssl_earliest_cert_expiry{job="gateway-tls"} - time())
            / 86400 < 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Gateway TLS certificate expiring soon"
          description: "Certificate expires in {{ $value | humanize }} days."
</code>
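The ''GatewayHighErrorRate'' expression divides the 5xx request rate by the total request rate. As a quick sanity check of that arithmetic, the same ratio can be computed in plain shell from two counter samples (the numbers below are made up, not live Prometheus data):

```shell
#!/bin/sh
# Hypothetical http_requests_total counter samples taken 5 minutes apart.
total_prev=10000; total_now=10600   # all statuses
err_prev=200;     err_now=240       # status=~"5.." only

# rate() over 5m is (now - prev) / 300s; the 300s cancels out in the ratio.
awk -v ep="$err_prev" -v en="$err_now" -v tp="$total_prev" -v tn="$total_now" \
    'BEGIN {
        ratio = (en - ep) / (tn - tp)
        printf "error ratio: %.3f\n", ratio
        if (ratio > 0.05) { print "would fire GatewayHighErrorRate" } else { print "below threshold" }
    }'
```

Here 40 error increments out of 600 total increments give a ratio of about 0.067, which is above the 0.05 threshold, so the alert would fire after the ''for: 5m'' hold period.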
----
===== 2. Update Prometheus Config =====
**/etc/prometheus/prometheus.yml:**
<code yaml>
rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
</code>

Reload Prometheus afterwards (the ''/-/reload'' endpoint only works when Prometheus is started with ''--web.enable-lifecycle''):

<code bash>
curl -X POST http://localhost:9090/-/reload
</code>
----
===== 3. Alertmanager Configuration =====
**/etc/alertmanager/alertmanager.yml:**
<code yaml>
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'secret'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    # Critical alerts immediately
    - match:
        severity: critical
      receiver: 'critical'
      group_wait: 10s
      repeat_interval: 1h

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@example.com'
        send_resolved: true
  - name: 'critical'
    email_configs:
      - to: 'oncall@example.com'
        send_resolved: true
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#alerts-critical'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
</code>
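The routing tree above sends ''severity: critical'' alerts to the ''critical'' receiver and everything else to ''default''. As an illustration only (this is not how Alertmanager is implemented), the match logic boils down to:

```shell
#!/bin/sh
# Sketch of the route matching above: the first matching child route wins,
# otherwise the alert falls through to the root receiver.
pick_receiver() {
    case "$1" in
        critical) echo "critical" ;;   # child route: match severity: critical
        *)        echo "default"  ;;   # root receiver
    esac
}

pick_receiver critical
pick_receiver warning
```

Note that the ''critical'' route also overrides ''group_wait'' (10s instead of 30s) and ''repeat_interval'' (1h instead of 4h), so critical pages arrive faster and nag more often.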
----
===== 4. Slack Integration =====
<code yaml>
# Slack receiver only
receivers:
  - name: 'slack-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#gateway-alerts'
        send_resolved: true
        title: '{{ if eq .Status "firing" }}ALERT{{ else }}RESOLVED{{ end }} {{ .CommonAnnotations.summary }}'
        text: |
          *Alert:* {{ .CommonLabels.alertname }}
          *Severity:* {{ .CommonLabels.severity }}
          *Description:* {{ .CommonAnnotations.description }}
          *Instance:* {{ .CommonLabels.instance }}
</code>
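The ''title'' template branches on the notification status: firing groups get an ''ALERT'' prefix, resolved groups get ''RESOLVED''. As a plain-shell sketch of that template logic (illustration, not Go templating):

```shell
#!/bin/sh
# Mirrors: {{ if eq .Status "firing" }}ALERT{{ else }}RESOLVED{{ end }} {{ summary }}
render_title() {
    status="$1"; summary="$2"
    if [ "$status" = "firing" ]; then
        echo "ALERT $summary"
    else
        echo "RESOLVED $summary"
    fi
}

render_title firing   "High error rate in Gateway"
render_title resolved "High error rate in Gateway"
```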
----
===== 5. Microsoft Teams =====
<code yaml>
# Via the prometheus-msteams webhook bridge
receivers:
  - name: 'teams-alerts'
    webhook_configs:
      - url: 'http://prometheus-msteams:2000/gateway'
        send_resolved: true
</code>
----
===== 6. Test Alert =====
<code bash>
# Check alert rules
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | {alert: .name, state: .state}'

# Active alerts
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[]'

# Alertmanager status
curl -s http://localhost:9093/api/v2/status | jq

# Send test alert
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {"alertname": "TestAlert", "severity": "warning"},
    "annotations": {"summary": "Test Alert", "description": "This is a test."}
  }]'
</code>
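When scripting the test alert, it helps to keep the payload in a variable and sanity-check it before POSTing. A minimal sketch (the grep check is a stand-in; a real setup might validate with ''jq'' or ''amtool''):

```shell
#!/bin/sh
# Keep the test payload in a variable so the same string is checked and sent.
payload='[{"labels":{"alertname":"TestAlert","severity":"warning"},"annotations":{"summary":"Test Alert","description":"This is a test."}}]'

# Every alert must carry at least an alertname label.
echo "$payload" | grep -q '"alertname"' && echo "payload ok"

# Then POST it, as in the curl call above:
# curl -X POST http://localhost:9093/api/v2/alerts \
#   -H "Content-Type: application/json" -d "$payload"
```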
----
===== 7. Checklist =====
^ # ^ Check ^ Done ^
| 1 | Alert rules created | [ ] |
| 2 | Prometheus config updated | [ ] |
| 3 | Alertmanager configured | [ ] |
| 4 | Receiver tested (email/Slack) | [ ] |
| 5 | Test alert received | [ ] |
----
===== Troubleshooting =====
^ Problem ^ Cause ^ Solution ^
| No alerts | Rule syntax wrong | ''promtool check rules /etc/prometheus/rules/gateway-alerts.yml'' |
| Alert not firing | Condition not met | Test the query manually in the Prometheus UI |
| No notification | Receiver misconfigured | Check the Alertmanager logs |
| Duplicate alerts | Wrong grouping | Adjust ''group_by'' |
----
===== Recommended Thresholds =====
^ Alert ^ Threshold ^ Duration ^
| GatewayDown | ''up == 0'' | 1m |
| HighErrorRate | > 5% | 5m |
| SlowResponses | p95 > 2s | 5m |
| HighMemory | > 450 MB | 10m |
| CertExpiring | < 14 days | 1h |
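The ''CertExpiringSoon'' row corresponds to the arithmetic in section 1: days left = (expiry timestamp - now) / 86400, alert when the result drops below 14. A quick shell check of that math, with an assumed expiry 10 days out:

```shell
#!/bin/sh
# Sketch of (probe_ssl_earliest_cert_expiry - time()) / 86400 < 14.
now=$(date +%s)
expiry=$(( now + 10 * 86400 ))            # assume the cert expires in 10 days
days_left=$(( (expiry - now) / 86400 ))

echo "days left: $days_left"
if [ "$days_left" -lt 14 ]; then
    echo "would fire GatewayCertExpiringSoon"
fi
```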
----
===== Related Runbooks =====
* [[.:prometheus|Prometheus]] - Collect metrics
* [[.:grafana-dashboard|Grafana]] - Visualization
* [[..:sicherheit:tls-einrichten|Set Up TLS]] - Certificate alerting
----
<< [[.:grafana-dashboard|<- Grafana Dashboard]] | [[..:start|-> Operator Overview]] >>
----
//Wolfgang van der Stille @ EMSR DATA d.o.o. - Data Gateway Professional//
{{tag>operator runbook alerting prometheus alertmanager}}