Runbook: Grafana Dashboard

Duration: ~20 minutes
Role: DevOps, SRE
Prerequisites: Grafana, and Prometheus available as a data source

Visualizes the gateway metrics in Grafana.


Workflow

flowchart TD
    A[Start] --> B[Add data source]
    B --> C[Import dashboard]
    C --> D[Adjust panels]
    D --> E[Configure variables]
    E --> F[Save dashboard]
    F --> G[Done]
    style G fill:#e8f5e9


1. Prometheus Datasource

Grafana UI: Configuration → Data Sources → Add data source

Name: Prometheus
Type: Prometheus
URL: http://prometheus:9090
Access: Server (default)

Alternatively, via provisioning:

# /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

2. Dashboard JSON

Import the dashboard: Create → Import → Paste JSON

{
  "title": "Data Gateway",
  "uid": "data-gateway",
  "timezone": "browser",
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
      "targets": [{
        "expr": "sum(rate(http_requests_total{job=\"data-gateway\"}[5m])) by (endpoint)",
        "legendFormat": "{{endpoint}}"
      }]
    },
    {
      "title": "Response Time (p95)",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
      "targets": [{
        "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job=\"data-gateway\"}[5m])) by (le))",
        "legendFormat": "p95"
      }]
    },
    {
      "title": "Error Rate",
      "type": "stat",
      "gridPos": {"h": 4, "w": 6, "x": 0, "y": 8},
      "targets": [{
        "expr": "sum(rate(http_requests_total{job=\"data-gateway\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{job=\"data-gateway\"}[5m])) * 100",
        "legendFormat": "Error %"
      }],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "thresholds": {
            "steps": [
              {"color": "green", "value": null},
              {"color": "yellow", "value": 1},
              {"color": "red", "value": 5}
            ]
          }
        }
      }
    },
    {
      "title": "Memory Usage",
      "type": "gauge",
      "gridPos": {"h": 4, "w": 6, "x": 6, "y": 8},
      "targets": [{
        "expr": "process_resident_memory_bytes{job=\"data-gateway\"} / 1024 / 1024",
        "legendFormat": "Memory MiB"
      }],
      "fieldConfig": {
        "defaults": {
          "unit": "mbytes",
          "max": 512,
          "thresholds": {
            "steps": [
              {"color": "green", "value": null},
              {"color": "yellow", "value": 300},
              {"color": "red", "value": 450}
            ]
          }
        }
      }
    },
    {
      "title": "Active Requests",
      "type": "stat",
      "gridPos": {"h": 4, "w": 6, "x": 12, "y": 8},
      "targets": [{
        "expr": "http_requests_in_progress{job=\"data-gateway\"}",
        "legendFormat": "Active"
      }]
    },
    {
      "title": "Uptime",
      "type": "stat",
      "gridPos": {"h": 4, "w": 6, "x": 18, "y": 8},
      "targets": [{
        "expr": "time() - process_start_time_seconds{job=\"data-gateway\"}",
        "legendFormat": "Uptime"
      }],
      "fieldConfig": {
        "defaults": {"unit": "s"}
      }
    }
  ],
  "templating": {
    "list": [{
      "name": "instance",
      "type": "query",
      "query": "label_values(http_requests_total{job=\"data-gateway\"}, instance)",
      "multi": true,
      "includeAll": true
    }]
  },
  "refresh": "10s"
}
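
Before pasting a hand-edited dashboard, it can help to check that the gridPos values still form a clean layout. The following is a minimal sketch, assuming only that Grafana places panels on a 24-column grid; the function name layout_problems is hypothetical and not part of any Grafana tooling.

```python
from typing import Dict, List

GRID_COLUMNS = 24  # Grafana lays panels out on a 24-column grid


def layout_problems(dashboard: Dict) -> List[str]:
    """Return human-readable layout problems; an empty list means the grid is clean."""
    problems = []
    placed = []  # (title, x0, y0, x1, y1) of the panels checked so far
    for panel in dashboard.get("panels", []):
        pos = panel["gridPos"]
        title = panel.get("title", "<untitled>")
        x0, y0 = pos["x"], pos["y"]
        x1, y1 = x0 + pos["w"], y0 + pos["h"]
        if x0 < 0 or x1 > GRID_COLUMNS:
            problems.append(f"{title}: exceeds the {GRID_COLUMNS}-column grid")
        for other, ox0, oy0, ox1, oy1 in placed:
            # Two rectangles overlap when they intersect on both axes.
            if x0 < ox1 and ox0 < x1 and y0 < oy1 and oy0 < y1:
                problems.append(f"{title}: overlaps {other}")
        placed.append((title, x0, y0, x1, y1))
    return problems


# The first two panels of the dashboard above sit side by side without overlap.
example = {"panels": [
    {"title": "Request Rate", "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}},
    {"title": "Response Time (p95)", "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}},
]}
print(layout_problems(example))  # prints []
```

To check a saved file, load it with json.load and pass the resulting dictionary to layout_problems.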

3. Key Panels

| Panel | Query | Purpose |
|---|---|---|
| Request Rate | sum(rate(http_requests_total[5m])) | Throughput |
| Response Time | histogram_quantile(0.95, …) | Latency |
| Error Rate | …status=~"5.."… * 100 | Error percentage |
| Memory | process_resident_memory_bytes | RAM usage |
| CPU | rate(process_cpu_seconds_total[5m]) | CPU load |
| Active Requests | http_requests_in_progress | Concurrency |

4. Dashboard Variables

For multi-instance setups:

Name: instance
Type: Query
Query: label_values(http_requests_total{job="data-gateway"}, instance)
Multi-value: enabled
Include All: enabled

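Then reference the variable in queries: http_requests_total{instance=~"$instance"}

Note that the panel queries in the dashboard JSON from step 2 do not contain this matcher yet, so the variable has no effect until it is added to each query.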
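
Adding the matcher to every panel by hand is error-prone, so the step can be scripted. The sketch below is one possible approach, not part of Grafana: it relies on plain string replacement and assumes every selector contains the literal text job="data-gateway", as the queries in step 2 do. The function names are hypothetical.

```python
def add_instance_filter(expr: str, job: str = "data-gateway") -> str:
    """Append instance=~"$instance" to every selector that matches the given job."""
    if 'instance=~"$instance"' in expr:
        return expr  # already filtered; keeps the call idempotent
    job_matcher = f'job="{job}"'
    return expr.replace(job_matcher, f'{job_matcher},instance=~"$instance"')


def apply_instance_filter(dashboard: dict) -> dict:
    """Rewrite all panel queries in place; the template variable's own query is left alone."""
    for panel in dashboard.get("panels", []):
        for target in panel.get("targets", []):
            target["expr"] = add_instance_filter(target["expr"])
    return dashboard


print(add_instance_filter('http_requests_in_progress{job="data-gateway"}'))
# prints http_requests_in_progress{job="data-gateway",instance=~"$instance"}
```

Because the replacement is purely textual, queries that spell the job matcher differently (for example with a regex matcher) are left unchanged and need manual editing.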


5. Checklist

| # | Check |
|---|---|
| 1 | Prometheus data source configured |
| 2 | Dashboard imported |
| 3 | Metrics are displayed |
| 4 | Variables work |
| 5 | Dashboard saved |

Troubleshooting

| Problem | Cause | Solution |
|---|---|---|
| No data | Wrong job name | Check job="data-gateway" |
| Datasource error | Prometheus not reachable | Check the URL |
| Empty graphs | No traffic | Send requests through the gateway |
| Wrong values | Incorrect query | Check the PromQL syntax |
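
The first two rows can be told apart by querying Prometheus directly for the standard up metric. The following is a rough sketch using only the Python standard library and the Prometheus instant-query endpoint /api/v1/query; the function names and the wording of the hints are illustrative assumptions, and the base URL is the one used in step 1.

```python
import json
import urllib.parse
import urllib.request


def up_query_url(base_url: str, job: str) -> str:
    """Build the instant-query URL for up{job="<job>"}."""
    query = urllib.parse.urlencode({"query": f'up{{job="{job}"}}'})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def diagnose(response: dict) -> str:
    """Map a Prometheus instant-query response for `up` to a troubleshooting hint."""
    if response.get("status") != "success":
        return "query failed: check the PromQL syntax"
    series = response["data"]["result"]
    if not series:
        return "no targets for this job: check the job name"
    # Each sample is [timestamp, "value"]; up is "1" for a healthy target.
    down = [s["metric"].get("instance", "?") for s in series if s["value"][1] != "1"]
    if down:
        return "targets down: " + ", ".join(down)
    return "all targets up"


def check(base_url: str = "http://prometheus:9090", job: str = "data-gateway") -> str:
    """Run the query against a live Prometheus and return a hint."""
    try:
        with urllib.request.urlopen(up_query_url(base_url, job), timeout=5) as resp:
            return diagnose(json.load(resp))
    except OSError as exc:  # URLError and HTTPError are subclasses of OSError
        return f"request to Prometheus failed: {exc}"
```

Calling print(check()) from a host that can resolve prometheus:9090 gives a one-line hint that maps onto the table above.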

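Dashboard Export

# Export the dashboard as JSON (jq keeps the dashboard object and drops the API metadata)
curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
    "http://grafana:3000/api/dashboards/uid/data-gateway" | jq '.dashboard' > dashboard.json

# Import the dashboard: the API expects the JSON wrapped in a "dashboard" key.
# "id" is reset because numeric ids are specific to one Grafana instance.
jq '{dashboard: (. + {id: null}), overwrite: true}' dashboard.json |
    curl -s -X POST -H "Content-Type: application/json" \
    -H "Authorization: Bearer $GRAFANA_TOKEN" \
    -d @- \
    "http://grafana:3000/api/dashboards/db"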

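For scripted migrations between Grafana instances, the request body for POST /api/dashboards/db can also be prepared in Python. This is a small sketch under the assumption that the input is either the full response of GET /api/dashboards/uid/<uid> or the bare dashboard object; the helper name to_import_payload is hypothetical.

```python
import copy


def to_import_payload(exported: dict, overwrite: bool = True) -> dict:
    """Wrap an exported dashboard in the body expected by POST /api/dashboards/db."""
    # Accept both the full API response and an already extracted dashboard object.
    dashboard = copy.deepcopy(exported.get("dashboard", exported))
    # Numeric ids are local to one Grafana instance; the uid identifies the dashboard.
    dashboard["id"] = None
    return {"dashboard": dashboard, "overwrite": overwrite}


payload = to_import_payload({"meta": {}, "dashboard": {"id": 7, "uid": "data-gateway"}})
print(payload)
# prints {'dashboard': {'id': None, 'uid': 'data-gateway'}, 'overwrite': True}
```

Serializing the result with json.dumps yields a body that can be sent with any HTTP client, using the same Authorization header as the curl commands.
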
Related Runbooks


« ← Prometheus | → Alerting »


Wolfgang van der Stille @ EMSR DATA d.o.o. - Data Gateway Professional

Last modified: 29.01.2026 at 15:12