The Elasticsearch cluster backing the ELK logging stack has run out of disk space. Elasticsearch has entered read-only mode, Logstash is backing up, and no new logs are being indexed. Security event logs, application logs, and audit trails are all being dropped.
Pattern: DISK_FULL
Severity: CRITICAL
Confidence: 95%
Remediation: Remote Hands
Test Results

| Metric | Expected | Actual |
| --- | --- | --- |
| Pattern Recognition | DISK_FULL | DISK_FULL |
| Severity Assessment | CRITICAL | CRITICAL |
| Incident Correlation | Yes | 22 linked |
| Cascade Escalation | N/A | No |
| Remediation | — | Remote Hands — Corax contacts on-site support via call, email, or API |
Scenario Conditions
3-node Elasticsearch cluster. All nodes at 95%+ disk. Cluster in read-only mode. Logstash pipeline backing up. Filebeat agents on 200 servers buffering locally. 15GB/hour log volume being dropped.
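The buffering figures above imply a deadline: with ingestion down, each of the 200 Filebeat agents must absorb its share of the 15GB/hour locally. A rough sketch of that arithmetic (the 5GB per-server spool size is an assumed placeholder, not a figure from the scenario):

```python
# Backlog arithmetic for the scenario above. Drop rate and server count
# come from the scenario; LOCAL_BUFFER_GB is an assumed placeholder.
DROP_RATE_GB_PER_HOUR = 15   # log volume currently being dropped
SERVERS = 200                # Filebeat agents buffering locally
LOCAL_BUFFER_GB = 5          # assumed spool space per source server

# Volume each server must absorb per hour while ingestion stays down
per_server_gb_per_hour = DROP_RATE_GB_PER_HOUR / SERVERS

# Hours until the assumed local buffer fills, i.e. when data loss begins
# at the source even though the agents are spooling
hours_until_loss = LOCAL_BUFFER_GB / per_server_gb_per_hour

print(f"{per_server_gb_per_hour:.3f} GB/h per server")
print(f"~{hours_until_loss:.0f} h until local buffers overflow")
```

Under these assumptions the agents buy roughly two to three days before source-side loss starts, which bounds how long the cluster remediation can take.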
Injected Error Messages (2)

1. Elasticsearch cluster RED — all 3 data nodes at 96% disk utilization, cluster entered read-only mode (index.blocks.read_only_allow_delete=true), no new documents being indexed, disk full on all data paths, 47 indices in RED state, disk space critical across all nodes: node-1: 1.8TB/1.9TB, node-2: 1.85TB/1.9TB, node-3: 1.87TB/1.9TB, no space left for new index creation, filesystem full preventing log ingestion
2. Logstash pipeline backed up — 2.3 million events in persistent queue (queue at 95% capacity), output to Elasticsearch failing with 'ClusterBlockException: index read-only', events being dropped at rate of 15GB/hour, Filebeat agents on 200 servers buffering locally (filling up on source servers), security event logs not being ingested, compliance audit trail gap growing, disk full on Elasticsearch preventing any log processing
Neural Engine Root Cause Analysis
The Elasticsearch cluster has reached critical disk capacity (96% utilization across all 3 data nodes), causing it to automatically enter read-only mode as a protective measure. With only ~30-100GB of free space remaining per node out of 1.9TB total, the cluster cannot perform normal operations such as indexing new documents or creating indices. The 47 indices in RED state and 11 correlated incidents indicate this disk exhaustion is causing cascading failures across dependent services that rely on log ingestion and search capabilities.
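The read-only mode described here is triggered by Elasticsearch's flood-stage disk watermark, which defaults to 95% (`cluster.routing.allocation.disk.watermark.flood_stage`). A small sketch checking the reported per-node figures against that default; note the raw figures round slightly differently than the alert's blanket "96% across all nodes":

```python
# Check the alert's per-node disk figures against the default flood-stage
# watermark (95%), which is what sets index.blocks.read_only_allow_delete.
# Node figures are taken from the injected alert above.
FLOOD_STAGE = 0.95  # default flood_stage watermark

nodes = {
    "node-1": (1.80, 1.9),  # (used TB, total TB)
    "node-2": (1.85, 1.9),
    "node-3": (1.87, 1.9),
}

for name, (used_tb, total_tb) in nodes.items():
    util = used_tb / total_tb
    flag = "OVER flood stage" if util >= FLOOD_STAGE else "below flood stage"
    print(f"{name}: {util:.1%} used ({total_tb - used_tb:.2f} TB free) -> {flag}")
```

By these numbers node-2 and node-3 are past the flood-stage watermark, which is enough: the block is applied per index as soon as any hosting node crosses it.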
Remediation Plan
1. Immediately delete old indices (snapshotting any that must be retained first) to reclaim disk space (target <80% utilization).
2. Remove the read-only block by setting 'index.blocks.read_only_allow_delete' to false (or null, which clears the setting entirely) via the Elasticsearch API.
3. Monitor cluster status until it returns to GREEN state.
4. Implement data retention policies to prevent recurrence.
5. Consider adding storage capacity or additional nodes for long-term scaling.
6. Set up proactive disk usage alerting at an 85% threshold.
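Steps 1-3 of the plan map onto specific Elasticsearch REST calls. A minimal sketch that lays the sequence out as data rather than sending it (the host is omitted and the index name is a placeholder assumption; the API paths themselves are the real Elasticsearch endpoints):

```python
import json

def read_only_block_settings() -> str:
    """Body for PUT /_all/_settings; null removes the flood-stage block
    (the plan's explicit 'false' also works)."""
    return json.dumps({"index.blocks.read_only_allow_delete": None})

# Ordered runbook: (HTTP method, API path, JSON body or None)
runbook = [
    # 1. Reclaim space: delete the oldest indices (placeholder name; pick
    #    real victims from GET /_cat/indices?s=creation.date).
    ("DELETE", "/logstash-2024.01.01", None),
    # 2. Clear the index-level read-only block cluster-wide.
    ("PUT", "/_all/_settings", read_only_block_settings()),
    # 3. Wait for recovery before re-enabling normal ingestion.
    ("GET", "/_cluster/health?wait_for_status=green&timeout=60s", None),
]

for method, path, body in runbook:
    print(method, path, body or "")
```

Keeping the calls as plain data makes the order explicit: space must be reclaimed before the block is cleared, otherwise Elasticsearch reapplies it on the next watermark check.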