====== Runbook: Prometheus ======

**Duration:** ~15 minutes \\
**Role:** DevOps, SRE \\
**Prerequisite:** Prometheus server available, gateway running

Collect metrics from the Data Gateway with Prometheus.

----

===== Workflow =====

<mermaid>
flowchart TD
    A[Start] --> B[Enable metrics]
    B --> C[Prometheus config]
    C --> D[Add scrape job]
    D --> E[Reload Prometheus]
    E --> F[Check targets]
    F --> G{Up?}
    G -->|Yes| H[Done]
    G -->|No| I[Check firewall/endpoint]

    style H fill:#e8f5e9
    style I fill:#ffebee
</mermaid>

----

===== 1. Enable metrics in the gateway =====

**appsettings.json:**

<code json>
{
  "Metrics": {
    "Enabled": true,
    "Endpoint": "/metrics"
  }
}
</code>

**Or via NuGet (if not built in):**

<code bash>
# prometheus-net.AspNetCore
dotnet add package prometheus-net.AspNetCore
</code>

**Program.cs:**

<code csharp>
using Prometheus;

// Metrics middleware: records HTTP request metrics
app.UseHttpMetrics();
// Exposes the /metrics endpoint
app.MapMetrics();
</code>

----

===== 2. Test the metrics endpoint =====

<code bash>
curl http://localhost:5000/metrics

# Expected output (Prometheus text format):
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
# http_requests_total{method="GET",endpoint="/api/v1/dsn/demo/tables",status="200"} 42
</code>

----

===== 3. Prometheus configuration =====

**/etc/prometheus/prometheus.yml:**

<code yaml>
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Data Gateway
  - job_name: 'data-gateway'
    static_configs:
      - targets: ['gateway.example.com:5000']
    metrics_path: /metrics
    scheme: http  # or https

  # Multiple instances
  - job_name: 'data-gateway-cluster'
    static_configs:
      - targets:
          - 'gateway-1.example.com:5000'
          - 'gateway-2.example.com:5000'
          - 'gateway-3.example.com:5000'
</code>

----

===== 4. Reload Prometheus =====

<code bash>
# Config reload without a restart
# (requires Prometheus to be started with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload

# Or restart the service
sudo systemctl restart prometheus
</code>

----

===== 5. Check targets =====

**Web UI:** ''http://prometheus:9090/targets''

Or via the API:

<code bash>
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
</code>

**Expected output:**

<code json>
{
  "job": "data-gateway",
  "health": "up"
}
</code>

----

===== 6. Important queries =====

**PromQL examples:**

The metric and label names below follow the sample output in step 2. If the endpoint is provided by the default HTTP metrics of prometheus-net, the request counter is exposed as ''http_requests_received_total'' and the status label is called ''code''; adjust the queries accordingly.

<code>
# Request rate (per second)
rate(http_requests_total{job="data-gateway"}[5m])

# Average response time
rate(http_request_duration_seconds_sum{job="data-gateway"}[5m])
/ rate(http_request_duration_seconds_count{job="data-gateway"}[5m])

# Error rate (5xx)
sum(rate(http_requests_total{job="data-gateway",status=~"5.."}[5m]))
/ sum(rate(http_requests_total{job="data-gateway"}[5m]))

# Memory usage
process_resident_memory_bytes{job="data-gateway"}

# Requests currently in progress
http_requests_in_progress{job="data-gateway"}
</code>

----

===== 7. Checklist =====

| # | Check | ✓ |
|---|-----------|---|
| 1 | Metrics endpoint enabled | ☐ |
| 2 | /metrics reachable | ☐ |
| 3 | Prometheus config updated | ☐ |
| 4 | Prometheus reloaded | ☐ |
| 5 | Target "up" in Prometheus | ☐ |
| 6 | Metrics visible in Grafana | ☐ |

----

===== Troubleshooting =====

| Problem | Cause | Solution |
|---------|---------|--------|
| Target "down" | Endpoint not reachable | Check firewall and URL |
| ''connection refused'' | Gateway is not running | Start the gateway |
| ''404 Not Found'' | Metrics not enabled | Check appsettings.json |
| No metrics | Wrong path | Check ''metrics_path'' |

----

===== Kubernetes ServiceMonitor =====

For the Prometheus Operator:

<code yaml>
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: data-gateway
  namespace: monitoring
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: data-gateway
  namespaceSelector:
    matchNames:
      - data-gateway
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
</code>

----

===== Related runbooks =====

  * [[.:grafana-dashboard|Grafana Dashboard]] – Visualization
  * [[.:alerting|Alerting]] – Notifications
  * [[..:automatisierung:kubernetes|Kubernetes]] – K8s deployment

----

<< [[.:start|← Monitoring]] | [[.:grafana-dashboard|→ Grafana Dashboard]] >>

----

//Wolfgang van der Stille @ EMSR DATA d.o.o. - Data Gateway Professional//

{{tag>operator runbook prometheus metrics monitoring}}
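
As a supplement to step 2 (testing the metrics endpoint), the following is a minimal shell sketch for a quick sanity check before Prometheus scrapes the target: it sums every sample of one metric from the text exposition output. The function name ''sum_metric'' is hypothetical and not part of the gateway or of Prometheus. The sketch assumes that samples carry no trailing timestamp, so the last field on each line is the value.

```shell
# sum_metric NAME: read Prometheus text exposition format on stdin and
# print the sum of all samples whose metric name equals NAME.
# Assumption: no trailing timestamps, so the last field is the sample value.
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                 # skip HELP and TYPE comment lines
    NF < 2 { next }               # skip blank or malformed lines
    {
      metric = $1
      sub(/\{.*/, "", metric)     # strip the label set, keep the bare name
      if (metric == name) total += $NF
    }
    END { print total + 0 }       # prints 0 when nothing matched
  '
}
```

Possible usage against the endpoint from step 2: ''curl -s http://localhost:5000/metrics | sum_metric http_requests_total''. A result that grows between two calls suggests the counter is being updated by incoming requests.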