PASSED · infrastructure / ceph_osd_failure

Ceph OSD Failure — Data Rebalancing

Multiple Ceph OSDs fail simultaneously on a storage node after a power supply unit failure, triggering a massive data rebalancing operation. The cluster enters HEALTH_WARN state and client I/O is severely impacted during recovery.

Pattern
PERFORMANCE_DEGRADATION
Severity
CRITICAL
Confidence
95%
Remediation
Remote Hands

Test Results

| Metric | Expected | Actual |
| --- | --- | --- |
| Pattern Recognition | PERFORMANCE_DEGRADATION | PERFORMANCE_DEGRADATION |
| Severity Assessment | CRITICAL | CRITICAL |
| Incident Correlation | Yes | 13 linked |
| Cascade Escalation | N/A | No |
| Remediation | Remote Hands | Corax contacts on-site support via call, email, or API |

Scenario Conditions

Ceph Quincy cluster with 5 nodes, 60 OSDs total. Node ceph-04 PSU failure takes out 12 OSDs simultaneously. Cluster enters HEALTH_WARN. Recovery/backfill operations consuming 80% of cluster bandwidth. Client I/O latency 10x normal.
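The capacity figures above are internally consistent, which can be verified with a quick sanity check (illustrative only; the node and OSD counts are taken from the scenario description):

```python
# Sanity check of the scenario's capacity figures.
TOTAL_OSDS = 60    # 5 nodes x 12 OSDs each
FAILED_OSDS = 12   # all OSDs on node ceph-04

capacity_reduction = FAILED_OSDS / TOTAL_OSDS
print(f"Capacity reduced by {capacity_reduction:.0%}")  # prints "Capacity reduced by 20%"
```

This matches the 20% capacity reduction reported on the Ceph dashboard.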

Injected Error Messages (2)

Ceph cluster HEALTH_WARN — 12 OSDs down on node ceph-04 after power supply failure, ceph health detail: 12 osds down, 847 pgs degraded, 312 pgs undersized, recovery operations consuming 80% cluster bandwidth, estimated recovery time: 6 hours, client I/O severely impacted
Ceph dashboard showing cluster degraded state — node ceph-04 offline (12 OSDs), data rebalancing in progress, 847 placement groups in degraded state, cluster capacity reduced by 20%, objects at risk until recovery completes, no data loss due to replication factor 3
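A monitor consuming these messages would need to extract the structured counts (OSDs down, degraded/undersized PGs, recovery bandwidth) from the `ceph health detail` text. The sketch below is a hypothetical parser, not Corax's actual implementation; the field names and regexes are assumptions, and the message text is taken from the first injected error above:

```python
import re

# Hypothetical parser for the injected health message above.
msg = ("12 osds down, 847 pgs degraded, 312 pgs undersized, "
       "recovery operations consuming 80% cluster bandwidth")

def parse_health(text):
    # Field names and patterns are illustrative assumptions.
    patterns = {
        "osds_down": r"(\d+) osds down",
        "pgs_degraded": r"(\d+) pgs degraded",
        "pgs_undersized": r"(\d+) pgs undersized",
        "recovery_bandwidth_pct": r"consuming (\d+)% cluster bandwidth",
    }
    return {key: int(m.group(1))
            for key, pat in patterns.items()
            if (m := re.search(pat, text))}

print(parse_health(msg))
# prints {'osds_down': 12, 'pgs_degraded': 847, 'pgs_undersized': 312, 'recovery_bandwidth_pct': 80}
```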

Neural Engine Root Cause Analysis

The root cause is a power supply failure on node ceph-04 that brought down 12 OSDs, causing the Ceph cluster to enter HEALTH_WARN state. This has triggered a cascade effect where 847 placement groups are degraded and 312 are undersized, severely impacting client I/O performance. The cluster is attempting automatic recovery but consuming 80% of available bandwidth, indicating the storage infrastructure is operating in a degraded state with reduced redundancy and performance.

Remediation Plan

1. Immediately assess the hardware status of the ceph-04 node and replace or repair the failed power supply.
2. Once the node is powered on, verify that all 12 OSDs on ceph-04 come back online cleanly.
3. Monitor cluster recovery progress and adjust recovery parameters as needed to balance recovery speed against client I/O impact.
4. Verify the cluster returns to HEALTH_OK and all placement groups are active+clean.
5. Implement power-redundancy monitoring to prevent future single-point-of-failure scenarios.
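Step 3 above is typically done by tuning Ceph's recovery throttles (`osd_max_backfills`, `osd_recovery_max_active`) via `ceph config set`. A minimal sketch of that decision, assuming threshold values chosen for illustration rather than taken from any real tuning guide:

```python
# Illustrative helper for step 3: pick conservative recovery settings while
# client latency is elevated, then print the `ceph config set` commands an
# operator would run. The thresholds and values below are assumptions, not
# recommended defaults for any particular cluster.
def recovery_settings(latency_multiplier):
    if latency_multiplier >= 5:    # clients badly impacted: throttle hard
        return {"osd_max_backfills": 1, "osd_recovery_max_active": 1}
    if latency_multiplier >= 2:    # moderate impact: throttle somewhat
        return {"osd_max_backfills": 2, "osd_recovery_max_active": 3}
    return {"osd_max_backfills": 4, "osd_recovery_max_active": 5}

# The scenario reports client I/O latency at 10x normal.
for opt, val in recovery_settings(10).items():
    print(f"ceph config set osd {opt} {val}")
```

Throttling recovery this way slows backfill (lengthening the estimated 6-hour recovery window) in exchange for restoring client I/O responsiveness; once latency normalizes, the limits can be raised again.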
Tested: 2026-03-30 | Monitors: 2 | Incidents: 2 | Test ID: cmnck07n605zxobqetwtagshc