Alerting Setup

Complexity: Low-Medium
Duration: 30-60 minutes setup
Goal: Proactive notification of PKI problems

This section covers configuring alerting for PKI monitoring, from the Alertmanager routing tree to the individual notification channels.


Architecture

flowchart LR
    subgraph TRIGGER["TRIGGER"]
        T1[Prometheus Alert]
        T2[Grafana Alert]
        T3[Custom Script]
    end
    subgraph ROUTE["ROUTING"]
        R[Alertmanager]
    end
    subgraph NOTIFY["NOTIFICATION"]
        N1[E-Mail]
        N2[Slack]
        N3[MS Teams]
        N4[PagerDuty]
        N5[OpsGenie]
    end
    T1 --> R
    T2 --> R
    T3 --> R
    R --> N1 & N2 & N3 & N4 & N5
    style R fill:#fff3e0
    style N4 fill:#e8f5e9


Alert Categories

| Category | Examples                        | Severity | Response          |
|----------|---------------------------------|----------|-------------------|
| Critical | Certificate expired, CA down    | P1       | Immediate         |
| Warning  | Certificate < 7 days, CRL < 24h | P2       | 4h                |
| Info     | Certificate < 30 days           | P3       | Next business day |

Prometheus Alertmanager

Installation

# Download Alertmanager
wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
tar xzf alertmanager-*.tar.gz
sudo mv alertmanager-*/alertmanager /usr/local/bin/
sudo mv alertmanager-*/amtool /usr/local/bin/
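A quick sanity check that both binaries landed on the PATH (guarded so it is a no-op on a machine where the install has not happened yet):

```shell
# Confirm the binaries from the install step above are usable;
# skipped gracefully if Alertmanager is not installed yet.
for BIN in alertmanager amtool; do
    if command -v "$BIN" >/dev/null 2>&1; then
        "$BIN" --version
    fi
done
```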

Configuration

# /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'secret'

# Load notification templates (e.g. the email.tmpl defined later in this guide);
# without this, custom {{ template "..." }} references fail at notify time.
templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Critical PKI alerts -> PagerDuty + E-Mail
    - match:
        severity: critical
        job: pki
      receiver: 'pki-critical'
      repeat_interval: 15m
 
    # Warnings -> E-Mail + Slack
    - match:
        severity: warning
        job: pki
      receiver: 'pki-warning'
      repeat_interval: 4h
 
    # Info -> Slack only
    - match:
        severity: info
        job: pki
      receiver: 'pki-info'
      repeat_interval: 24h

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@example.com'

  - name: 'pki-critical'
    email_configs:
      - to: 'pki-team@example.com'
        send_resolved: true
    pagerduty_configs:
      - service_key: '<PAGERDUTY_SERVICE_KEY>'
        severity: critical
    slack_configs:
      - api_url: '<SLACK_WEBHOOK_URL>'
        channel: '#pki-alerts'
        title: 'PKI CRITICAL: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

  - name: 'pki-warning'
    email_configs:
      - to: 'pki-team@example.com'
    slack_configs:
      - api_url: '<SLACK_WEBHOOK_URL>'
        channel: '#pki-alerts'
        title: 'PKI Warning: {{ .GroupLabels.alertname }}'

  - name: 'pki-info'
    slack_configs:
      - api_url: '<SLACK_WEBHOOK_URL>'
        channel: '#pki-info'
        title: 'PKI Info: {{ .GroupLabels.alertname }}'

inhibit_rules:
  # Suppress warnings when critical is active
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname']
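The routing tree can be dry-run with amtool (which ships alongside Alertmanager) before anything goes live. The label sets below are examples matching the routes defined above:

```shell
# Dry-run the routing tree: which receiver would a given label set hit?
# Guarded: only runs where amtool and the config file actually exist.
CONFIG=/etc/alertmanager/alertmanager.yml
if command -v amtool >/dev/null 2>&1 && [ -f "$CONFIG" ]; then
    amtool config routes test --config.file="$CONFIG" severity=critical job=pki
    amtool config routes test --config.file="$CONFIG" severity=info job=pki
fi
```

Each invocation prints the receiver name the alert would be routed to (e.g. `pki-critical`), which makes routing regressions easy to catch in review.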

Systemd Service

# /etc/systemd/system/alertmanager.service
[Unit]
Description=Prometheus Alertmanager
After=network.target
 
[Service]
Type=simple
ExecStart=/usr/local/bin/alertmanager \
    --config.file=/etc/alertmanager/alertmanager.yml \
    --storage.path=/var/lib/alertmanager
Restart=always
 
[Install]
WantedBy=multi-user.target
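With the unit file in place, the service can be enabled and started; the commands are guarded so they are a no-op on machines without the unit. The reload endpoint is standard Alertmanager behavior:

```shell
# Enable and start the service once the unit file exists.
AM_URL=http://localhost:9093
if [ -f /etc/systemd/system/alertmanager.service ]; then
    sudo systemctl daemon-reload
    sudo systemctl enable --now alertmanager
fi
# After config changes, Alertmanager can reload without a restart:
# curl -X POST "$AM_URL/-/reload"
```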

Microsoft Teams

# Alertmanager Teams Webhook
receivers:
  - name: 'pki-teams'
    webhook_configs:
      - url: 'https://outlook.office.com/webhook/...'
        send_resolved: true
        http_config:
          bearer_token: ''

Teams Message Card Template:

{
  "@type": "MessageCard",
  "@context": "http://schema.org/extensions",
  "themeColor": "{{ if eq .Status \"firing\" }}FF0000{{ else }}00FF00{{ end }}",
  "summary": "PKI Alert: {{ .GroupLabels.alertname }}",
  "sections": [{
    "activityTitle": "{{ .GroupLabels.alertname }}",
    "activitySubtitle": "{{ .Status | toUpper }}",
    "facts": [
      {{ range .Alerts }}
      {
        "name": "{{ .Labels.instance }}",
        "value": "{{ .Annotations.summary }}"
      },
      {{ end }}
    ],
    "markdown": true
  }],
  "potentialAction": [{
    "@type": "OpenUri",
    "name": "Open Runbook",
    "targets": [{
      "os": "default",
      "uri": "{{ (index .Alerts 0).Annotations.runbook_url }}"
    }]
  }]
}
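Note that Alertmanager's generic webhook payload is not a Teams MessageCard; a template like the one above is typically rendered by a bridge such as prometheus-msteams sitting between Alertmanager and the Teams connector. Regardless of the bridge, the webhook itself can be smoke-tested with a minimal static card (`TEAMS_WEBHOOK_URL` is a placeholder; export your connector URL first):

```shell
# Smoke test: post a minimal static MessageCard to the Teams webhook.
# TEAMS_WEBHOOK_URL is a placeholder environment variable; the curl is
# skipped when it is unset.
PAYLOAD='{"@type":"MessageCard","@context":"http://schema.org/extensions","summary":"PKI test","text":"Test notification from the PKI alerting setup"}'
if [ -n "${TEAMS_WEBHOOK_URL:-}" ]; then
    curl -sf -H 'Content-Type: application/json' -d "$PAYLOAD" "$TEAMS_WEBHOOK_URL"
fi
```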

Slack

# Alertmanager Slack Configuration
receivers:
  - name: 'pki-slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
        channel: '#pki-alerts'
        username: 'PKI-Alertmanager'
        icon_emoji: ':lock:'
        send_resolved: true
        # "slack.title" / "slack.text" must be defined in a template file
        # loaded via the top-level templates option in alertmanager.yml
        title: '{{ template "slack.title" . }}'
        text: '{{ template "slack.text" . }}'
        actions:
          - type: button
            text: 'Runbook'
            url: '{{ (index .Alerts 0).Annotations.runbook_url }}'
          - type: button
            text: 'Dashboard'
            url: 'https://grafana.example.com/d/pki'
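Before wiring the webhook into Alertmanager, it is worth a quick smoke test with curl (`SLACK_WEBHOOK_URL` is a placeholder; export your webhook URL first):

```shell
# Smoke test the Slack incoming webhook with a minimal payload.
# SLACK_WEBHOOK_URL is a placeholder environment variable; the curl is
# skipped when it is unset.
PAYLOAD='{"text":"Test notification from the PKI alerting setup"}'
if [ -n "${SLACK_WEBHOOK_URL:-}" ]; then
    curl -sf -H 'Content-Type: application/json' -d "$PAYLOAD" "$SLACK_WEBHOOK_URL"
fi
```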

PagerDuty

# Alertmanager PagerDuty Integration
receivers:
  - name: 'pki-pagerduty'
    pagerduty_configs:
      - service_key: '<INTEGRATION_KEY>'
        severity: '{{ if eq .GroupLabels.severity "critical" }}critical{{ else }}warning{{ end }}'
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        details:
          # "pagerduty.firing" must be defined in a loaded template file
          firing: '{{ template "pagerduty.firing" . }}'
          num_firing: '{{ .Alerts.Firing | len }}'
          num_resolved: '{{ .Alerts.Resolved | len }}'
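The integration can also be verified end-to-end by triggering a test incident directly against the PagerDuty Events API v2 (`PD_ROUTING_KEY` is a placeholder for the integration key used above):

```shell
# Trigger a test incident via the PagerDuty Events API v2.
# PD_ROUTING_KEY is a placeholder environment variable; the curl is
# skipped when it is unset.
EVENT='{"routing_key":"PLACEHOLDER","event_action":"trigger","payload":{"summary":"PKI test alert","source":"pki-monitor","severity":"critical"}}'
if [ -n "${PD_ROUTING_KEY:-}" ]; then
    EVENT=$(printf '%s' "$EVENT" | sed "s/PLACEHOLDER/$PD_ROUTING_KEY/")
    curl -sf -H 'Content-Type: application/json' -d "$EVENT" https://events.pagerduty.com/v2/enqueue
fi
```

Remember to resolve the test incident afterwards so it does not page anyone on rotation.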

E-Mail Templates

# /etc/alertmanager/templates/email.tmpl
{{ define "email.subject" }}
[{{ .Status | toUpper }}] PKI Alert: {{ .GroupLabels.alertname }}
{{ end }}
 
{{ define "email.html" }}
<!DOCTYPE html>
<html>
<head>
  <style>
    .critical { background-color: #ffebee; border-left: 4px solid #f44336; }
    .warning { background-color: #fff3e0; border-left: 4px solid #ff9800; }
    .resolved { background-color: #e8f5e9; border-left: 4px solid #4caf50; }
  </style>
</head>
<body>
  <h2>PKI Alert: {{ .GroupLabels.alertname }}</h2>
  <p>Status: <strong>{{ .Status | toUpper }}</strong></p>

  {{ range .Alerts }}
  <div class="{{ .Labels.severity }}">
    <h3>{{ .Labels.instance }}</h3>
    <p><strong>Summary:</strong> {{ .Annotations.summary }}</p>
    <p><strong>Description:</strong> {{ .Annotations.description }}</p>
    {{ if .Annotations.runbook_url }}
    <p><a href="{{ .Annotations.runbook_url }}">Open Runbook</a></p>
    {{ end }}
  </div>
  {{ end }}
 
  <hr>
  <p>
    <a href="https://grafana.example.com/d/pki">Dashboard</a> |
    <a href="https://alertmanager.example.com">Alertmanager</a>
  </p>
</body>
</html>
{{ end }}

# /etc/prometheus/rules/pki-alerts.yml
groups:
  - name: pki-alerts
    rules:
      - alert: CertificateExpiringSoon
        expr: x509_cert_not_after - time() < 7 * 86400
        for: 1h
        labels:
          severity: warning
          team: pki
        annotations:
          summary: "Certificate {{ $labels.filepath }} expires in < 7 days"
          description: "Time remaining: {{ $value | humanizeDuration }}"
          runbook_url: "https://wiki.example.com/pki/runbook/renew-certificate"

      - alert: CertificateExpired
        expr: x509_cert_not_after - time() < 0
        labels:
          severity: critical
          team: pki
        annotations:
          summary: "CRITICAL: Certificate {{ $labels.filepath }} has EXPIRED"
          runbook_url: "https://wiki.example.com/pki/runbook/issue-certificate"

      - alert: CANotReachable
        expr: up{job="ca"} == 0
        for: 2m
        labels:
          severity: critical
          team: pki
        annotations:
          summary: "CA server not reachable"
          runbook_url: "https://wiki.example.com/pki/runbook/ca-troubleshooting"

Grafana Alerting (Alternative)

# Grafana Alert Rule (UI or Provisioning)
apiVersion: 1
groups:
  - orgId: 1
    name: PKI Alerts
    folder: PKI
    interval: 1m
    rules:
      - uid: cert-expiry-warning
        title: Certificate Expiring Soon
        condition: B
        data:
          - refId: A
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: prometheus
            model:
              expr: x509_cert_not_after - time() < 7 * 86400
          - refId: B
            datasourceUid: '-100'
            model:
              conditions:
                - evaluator:
                    params: [0]
                    type: gt
                  operator:
                    type: and
                  query:
                    params: [A]
                  reducer:
                    type: count
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: Certificate expiring soon

Test & Validation

# Check Alertmanager configuration
amtool check-config /etc/alertmanager/alertmanager.yml
 
# Send test alert
amtool alert add alertname=TestAlert severity=warning instance=test \
    --alertmanager.url=http://localhost:9093
 
# Show active alerts
amtool alert --alertmanager.url=http://localhost:9093
 
# Create silence (e.g., for maintenance)
amtool silence add alertname=CertificateExpiringSoon \
    --alertmanager.url=http://localhost:9093 \
    --comment="Planned maintenance" \
    --duration=2h
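Silences created for maintenance should be reviewed and cleaned up afterwards; amtool covers that too (the silence ID placeholder comes from the query output):

```shell
# Inspect and clean up silences once maintenance is over.
# Guarded: only runs where amtool is installed.
AM_URL=http://localhost:9093
if command -v amtool >/dev/null 2>&1; then
    # List active silences (IDs appear in the first column)
    amtool silence query --alertmanager.url="$AM_URL"
    # Expire one early once maintenance finishes:
    # amtool silence expire <SILENCE_ID> --alertmanager.url="$AM_URL"
fi
```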

Checklist

| # | Checkpoint             | Done |
|---|------------------------|------|
| 1 | Alertmanager installed |      |
| 2 | Routing configured     |      |
| 3 | E-Mail receiver        |      |
| 4 | Slack/Teams webhook    |      |
| 5 | PagerDuty integration  |      |
| 6 | Alert rules defined    |      |
| 7 | Runbook links inserted |      |
| 8 | Test alert sent        |      |




Wolfgang van der Stille @ EMSR DATA d.o.o. - Post-Quantum Cryptography Professional