====== Runbook: Grafana Dashboard ======

**Duration:** ~20 minutes \\
**Role:** DevOps, SRE \\
**Prerequisite:** Grafana, Prometheus as datasource

Visualization of Gateway metrics in Grafana.

----

===== Workflow =====

<mermaid>
flowchart TD
    A[Start] --> B[Add datasource]
    B --> C[Import dashboard]
    C --> D[Customize panels]
    D --> E[Configure variables]
    E --> F[Save dashboard]
    F --> G[Done]

    style G fill:#e8f5e9
</mermaid>

----

===== 1. Prometheus Datasource =====

**Grafana UI:** Configuration -> Data Sources -> Add data source

<code>
Name: Prometheus
Type: Prometheus
URL: http://prometheus:9090
Access: Server (default)
</code>

Or via provisioning:

<code yaml>
# /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy          # "Server" access mode in the UI
    url: http://prometheus:9090
    isDefault: true
</code>

----

===== 2. Dashboard JSON =====

**Import dashboard:** Create -> Import -> Paste JSON

<code json>
{
  "title": "Data Gateway",
  "uid": "data-gateway",
  "timezone": "browser",
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
      "targets": [{
        "expr": "sum(rate(http_requests_total{job=\"data-gateway\"}[5m])) by (endpoint)",
        "legendFormat": "{{endpoint}}"
      }]
    },
    {
      "title": "Response Time (p95)",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
      "targets": [{
        "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job=\"data-gateway\"}[5m])) by (le))",
        "legendFormat": "p95"
      }]
    },
    {
      "title": "Error Rate",
      "type": "stat",
      "gridPos": {"h": 4, "w": 6, "x": 0, "y": 8},
      "targets": [{
        "expr": "sum(rate(http_requests_total{job=\"data-gateway\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{job=\"data-gateway\"}[5m])) * 100",
        "legendFormat": "Error %"
      }],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "thresholds": {
            "steps": [
              {"color": "green", "value": null},
              {"color": "yellow", "value": 1},
              {"color": "red", "value": 5}
            ]
          }
        }
      }
    },
    {
      "title": "Memory Usage",
      "type": "gauge",
      "gridPos": {"h": 4, "w": 6, "x": 6, "y": 8},
      "targets": [{
        "expr": "process_resident_memory_bytes{job=\"data-gateway\"} / 1024 / 1024",
        "legendFormat": "Memory MiB"
      }],
      "fieldConfig": {
        "defaults": {
          "unit": "mbytes",
          "max": 512,
          "thresholds": {
            "steps": [
              {"color": "green", "value": null},
              {"color": "yellow", "value": 300},
              {"color": "red", "value": 450}
            ]
          }
        }
      }
    },
    {
      "title": "Active Requests",
      "type": "stat",
      "gridPos": {"h": 4, "w": 6, "x": 12, "y": 8},
      "targets": [{
        "expr": "http_requests_in_progress{job=\"data-gateway\"}",
        "legendFormat": "Active"
      }]
    },
    {
      "title": "Uptime",
      "type": "stat",
      "gridPos": {"h": 4, "w": 6, "x": 18, "y": 8},
      "targets": [{
        "expr": "time() - process_start_time_seconds{job=\"data-gateway\"}",
        "legendFormat": "Uptime"
      }],
      "fieldConfig": {
        "defaults": {"unit": "s"}
      }
    }
  ],
  "templating": {
    "list": [{
      "name": "instance",
      "type": "query",
      "query": "label_values(http_requests_total{job=\"data-gateway\"}, instance)",
      "multi": true,
      "includeAll": true
    }]
  },
  "refresh": "10s"
}
</code>

The Memory Usage expression divides bytes by 1024 twice, so the panel value is in mebibytes; the unit ''mbytes'' is Grafana's identifier for mebibytes.

----

===== 3. Important Panels =====

| Panel | Query | Purpose |
|-------|-------|---------|
| Request Rate | ''sum(rate(http_requests_total[5m]))'' | Throughput |
| Response Time | ''histogram_quantile(0.95, ...)'' | Latency |
| Error Rate | ''...status=~"5.."... * 100'' | Share of 5xx responses in percent |
| Memory | ''process_resident_memory_bytes'' | RAM usage |
| CPU | ''rate(process_cpu_seconds_total[5m])'' | CPU load |
| Active Requests | ''http_requests_in_progress'' | Concurrency |

The CPU query is not part of the dashboard JSON in section 2; add it as an extra panel if needed.

----

===== 4. Dashboard Variables =====

For multi-instance setups:

<code>
Name: instance
Type: Query
Query: label_values(http_requests_total{job="data-gateway"}, instance)
Multi-value: enabled
Include All: enabled
</code>

The panel queries in section 2 filter only by ''job'', so the variable has no effect until the matcher is added to each selector:

''http_requests_total{instance=~"$instance"}''

----

===== 5. Checklist =====

| # | Check | Done |
|---|-------|------|
| 1 | Prometheus datasource configured | [ ] |
| 2 | Dashboard imported | [ ] |
| 3 | Metrics are displayed | [ ] |
| 4 | Variables work | [ ] |
| 5 | Dashboard saved | [ ] |

----

===== Troubleshooting =====

| Problem | Cause | Solution |
|---------|-------|----------|
| ''No data'' | Wrong job name | Check ''job="data-gateway"'' against the Prometheus scrape config |
| ''Datasource error'' | Prometheus not reachable | Check the datasource URL |
| Empty graphs | No traffic | Send requests to the Gateway |
| Wrong values | Wrong query | Check the PromQL expression |

----

===== Dashboard Export =====

<code bash>
# Export dashboard as JSON (keeps only the dashboard model, drops the "meta" wrapper)
curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  "http://grafana:3000/api/dashboards/uid/data-gateway" | jq '.dashboard' > dashboard.json

# Import dashboard: the API expects the model wrapped in a "dashboard" key.
# id is reset to null so the target instance matches the dashboard by uid.
jq '{dashboard: (.id = null), overwrite: true}' dashboard.json | \
  curl -s -X POST -H "Content-Type: application/json" \
    -H "Authorization: Bearer $GRAFANA_TOKEN" \
    -d @- \
    "http://grafana:3000/api/dashboards/db"
</code>

----

===== Related Runbooks =====

  * [[.:prometheus|Prometheus]] - Data source
  * [[.:alerting|Alerting]] - Notifications
  * [[..:tagesgeschaeft:health-check|Health Check]] - Manual check

----

<< [[.:prometheus|<- Prometheus]] | [[.:alerting|-> Alerting]] >>

----

//Wolfgang van der Stille @ EMSR DATA d.o.o. - Data Gateway Professional//

{{tag>operator runbook grafana dashboard visualization}}
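
Section 4 asks for the ''instance'' matcher in the panel queries. Adding it by hand to every panel is error-prone, so the following is a minimal sketch of how it could be scripted. It assumes the dashboard JSON is pretty-printed with one ''"expr"'' key per line (as ''jq'' writes it) and that every selector uses the literal label ''job="data-gateway"''. The function name ''add_instance_filter'' is hypothetical and not part of Grafana or the Gateway.

```shell
#!/bin/sh
# Sketch: add instance=~"$instance" to every panel expression in a
# pretty-printed dashboard JSON file and print the result to stdout.
#
# Only lines containing "expr": are rewritten. The label_values() query
# of the variable itself lives on a "query": line and stays untouched,
# which avoids a variable that filters on its own value.
add_instance_filter() {
  # \\ matches the literal backslash that escapes quotes inside JSON strings.
  # $instance is kept literal (single quotes), Grafana expands it at query time.
  sed '/"expr":/ s/job=\\"data-gateway\\"/job=\\"data-gateway\\",instance=~\\"$instance\\"/g' "$1"
}
```

Possible usage: ''add_instance_filter dashboard.json > dashboard-filtered.json'', then review the diff before importing the filtered file.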