The entire monitoring stack (Prometheus, Grafana, Alertmanager) has gone down because a persistent volume on the Kubernetes node ran out of IOPS. No metrics are being collected, no dashboards are accessible, and no alerts are firing. The infrastructure team is flying blind.

| Metric | Expected | Actual | Result |
|---|---|---|---|
| Pattern Recognition | STORAGE_IO_LATENCY | STORAGE_IO_LATENCY | |
| Severity Assessment | CRITICAL | CRITICAL | |
| Incident Correlation | Yes | 30 linked | |
| Cascade Escalation | Yes | Yes | |
| Remediation | N/A | Remote Hands: Corax contacts on-site support via call, email, or API | |

Prometheus, Grafana, and Alertmanager running on K8s. Persistent volume IOPS throttled. TSDB compaction failing. Grafana dashboards inaccessible. Alertmanager not processing alerts. Complete monitoring blackout for all services.
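
A blackout like this one cannot be reported by the stack itself, so it has to be detected from outside. The following is a minimal sketch of such an external probe, not a description of how Corax detects the incident. The service URLs and the names `HEALTH_ENDPOINTS`, `http_ok`, `check_stack`, and `is_blackout` are assumptions made for illustration. The health paths are the standard ones each component exposes (`/-/healthy` for Prometheus and Alertmanager, `/api/health` for Grafana).

```python
import http.client
from typing import Callable, Dict
from urllib.request import urlopen

# Hypothetical in-cluster service URLs; adjust hosts and ports to the real deployment.
HEALTH_ENDPOINTS: Dict[str, str] = {
    "prometheus": "http://prometheus:9090/-/healthy",
    "grafana": "http://grafana:3000/api/health",
    "alertmanager": "http://alertmanager:9093/-/healthy",
}


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, DNS failure, timeout, non-2xx status, or a malformed reply.
        return False


def check_stack(probe: Callable[[str], bool] = http_ok) -> Dict[str, bool]:
    """Probe every component and return a name -> healthy mapping."""
    return {name: probe(url) for name, url in HEALTH_ENDPOINTS.items()}


def is_blackout(status: Dict[str, bool]) -> bool:
    """A complete blackout means at least one component was probed and none is healthy."""
    return bool(status) and not any(status.values())


# Example usage (run from a host that does not depend on the monitoring stack):
#   status = check_stack()
#   if is_blackout(status):
#       print("Monitoring blackout:", status)
```

The probe function is injectable so the logic can be exercised without network access. Running the check from a machine outside the affected node avoids sharing the same storage failure.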