PASSED: infrastructure / prometheus_grafana_stack_down

Prometheus/Grafana Stack Down — Monitoring Blind Spot

The entire monitoring stack (Prometheus, Grafana, Alertmanager) has gone down due to a persistent volume running out of IOPS on the Kubernetes node. No metrics are being collected, no dashboards are accessible, and no alerts are firing. The infrastructure team is flying blind.

Pattern: STORAGE_IO_LATENCY
Severity: CRITICAL
Confidence: 90%
Remediation: Remote Hands

Test Results

Metric | Expected | Actual
Pattern Recognition | STORAGE_IO_LATENCY | STORAGE_IO_LATENCY
Severity Assessment | CRITICAL | CRITICAL
Incident Correlation | Yes | 30 linked
Cascade Escalation | Yes | Yes
Remediation | Remote Hands | Corax contacts on-site support via call, email, or API

Scenario Conditions

Prometheus, Grafana, and Alertmanager running on K8s. Persistent volume IOPS throttled. TSDB compaction failing. Grafana dashboards inaccessible. Alertmanager not processing alerts. Complete monitoring blackout for all services.

Injected Error Messages (3)

Prometheus TSDB storage failure — persistent volume 'prometheus-data' IOPS throttled to zero on underlying EBS volume, TSDB head compaction failing with 'write: input/output error', metrics ingestion halted, scrape targets all showing 'last scrape: 45 minutes ago', storage latency exceeding 30 seconds per write operation, 847 active scrape targets producing zero data
Grafana dashboard unavailable — all dashboards returning 'datasource error: Prometheus unreachable', 150+ monitoring dashboards non-functional, NOC team has zero visibility into infrastructure health, Grafana itself running but all Prometheus queries failing, storage latency on shared persistent volume affecting Grafana database as well
Alertmanager not processing — zero alerts firing despite known issues in infrastructure, alert pipeline completely broken, no notifications being sent to PagerDuty or email, silence/inhibition rules irrelevant as no alerts are being generated, monitoring blind spot affecting entire infrastructure estate of 500+ services, storage latency preventing Alertmanager state persistence
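The scrape staleness described above can be confirmed directly from Prometheus's own targets API, which keeps working even when TSDB writes fail. A minimal sketch, assuming the Prometheus service runs in a `monitoring` namespace and `jq` is installed (both assumptions, not details from the scenario):

```shell
# Port-forward the Prometheus service locally (namespace and service
# name "monitoring"/"prometheus" are assumptions for this sketch)
kubectl -n monitoring port-forward svc/prometheus 9090:9090 &

# List each active target's last scrape time and health; in this
# incident, every target would show a lastScrape ~45 minutes old
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | "\(.scrapeUrl) last=\(.lastScrape) health=\(.health)"'
```

`/api/v1/targets` is part of the stable Prometheus HTTP API; `lastScrape` and `health` are fields of each entry in `activeTargets`.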

Neural Engine Root Cause Analysis

The Prometheus server is experiencing a complete storage subsystem failure due to EBS volume IOPS throttling reaching zero, causing write operations to fail with I/O errors. This has cascaded into TSDB head compaction failures, halted metrics ingestion, and complete loss of monitoring capability across 847 scrape targets. The underlying EBS volume has likely exhausted its IOPS burst credits or hit account-level IOPS limits, creating a storage bottleneck that prevents Prometheus from persisting any new metrics data.
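The burst-credit theory is easy to sanity-check with AWS's published gp2 formulas: baseline is 3 IOPS per GiB (100 IOPS floor), burst goes up to 3,000 IOPS, and the credit bucket holds 5.4 million I/O credits. A minimal sketch, assuming an illustrative 100 GiB volume (the actual volume size is not stated in the scenario):

```shell
#!/bin/sh
# Estimate how long a gp2 volume sustains full burst before its credit
# bucket empties. Formulas are from AWS gp2 documentation; the 100 GiB
# volume size is an illustrative assumption.
size_gib=100
baseline=$(( size_gib * 3 ))              # gp2 baseline: 3 IOPS per GiB...
[ "$baseline" -lt 100 ] && baseline=100   # ...with a 100 IOPS floor
burst=3000                                # gp2 burst ceiling (volumes under 1 TiB)
bucket=5400000                            # credit bucket size, in I/O credits
secs=$(( bucket / (burst - baseline) ))   # net drain rate = burst - baseline
echo "baseline=${baseline} IOPS, full burst lasts ~${secs}s"
```

For this assumed 100 GiB volume, a sustained 3,000 IOPS workload exhausts the bucket in roughly 2,000 seconds (about 33 minutes), after which IOPS collapse to baseline and latency spikes exactly as seen here.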

Remediation Plan

1. Immediately check EBS volume metrics (IOPS utilization, burst credits, queue depth) in AWS CloudWatch.
2. If burst credits are depleted, migrate the EBS volume to gp3 with provisioned IOPS, or increase the volume size to raise baseline IOPS.
3. Check for AWS service limits or account-level IOPS throttling.
4. Once storage I/O is restored, restart the Prometheus pod to reinitialize the TSDB and resume metrics collection.
5. Verify that scrape targets return to normal collection intervals.
6. Implement EBS volume monitoring and alerting to prevent recurrence.
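The steps above can be sketched as commands; the volume ID, namespace, and workload names below are placeholders and assumptions, not values from the incident:

```shell
# Step 1: check burst credits on the volume (vol-... is a placeholder;
# BurstBalance is the CloudWatch metric that hits 0% when credits run out)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --start-time "$(date -u -d '1 hour ago' +%FT%TZ)" \
  --end-time "$(date -u +%FT%TZ)" \
  --period 300 --statistics Minimum

# Step 2: migrate the volume to gp3 with provisioned IOPS; modify-volume
# works online, no detach required (IOPS/throughput values are examples)
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 \
  --volume-type gp3 --iops 6000 --throughput 250

# Step 4: once I/O recovers, restart Prometheus to reinitialize the TSDB
# (namespace and StatefulSet name are assumptions)
kubectl -n monitoring rollout restart statefulset/prometheus
```

`date -d '1 hour ago'` assumes GNU date; on BSD/macOS the equivalent is `date -u -v-1H`.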
Tested: 2026-03-30 | Monitors: 3 | Incidents: 3 | Test ID: cmnckgaja09akobqezt79uqzx