Multiple Ceph OSDs fail simultaneously on a storage node after a power supply unit failure, triggering a massive data rebalancing operation. The cluster enters HEALTH_WARN state and client I/O is severely impacted during recovery.
Pattern: PERFORMANCE_DEGRADATION
Severity: CRITICAL
Confidence: 95%
Remediation: Remote Hands
Test Results

| Metric | Expected | Actual |
|---|---|---|
| Pattern Recognition | PERFORMANCE_DEGRADATION | PERFORMANCE_DEGRADATION |
| Severity Assessment | CRITICAL | CRITICAL |
| Incident Correlation | Yes | 13 linked |
| Cascade Escalation | N/A | No |
| Remediation | — | Remote Hands (Corax contacts on-site support via call, email, or API) |
- Alert: Ceph cluster HEALTH_WARN — 12 OSDs down on node ceph-04 after power supply failure. `ceph health detail` reports: 12 osds down, 847 pgs degraded, 312 pgs undersized; recovery operations consuming 80% of cluster bandwidth; estimated recovery time 6 hours; client I/O severely impacted.
- Dashboard: cluster in degraded state — node ceph-04 offline (12 OSDs), data rebalancing in progress, 847 placement groups degraded, cluster capacity reduced by 20%, objects at risk until recovery completes; no data loss thanks to replication factor 3.
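Alert text like the above can be machine-parsed when triaging. A minimal sketch, assuming the free-text summary format shown in the alert (the function name and regexes are illustrative, not part of any Ceph API):

```python
import re

def parse_ceph_summary(summary: str) -> dict:
    """Pull the key degradation counts out of a free-text
    'ceph health detail'-style summary line."""
    patterns = {
        "osds_down": r"(\d+)\s+osds?\s+down",
        "pgs_degraded": r"(\d+)\s+pgs?\s+degraded",
        "pgs_undersized": r"(\d+)\s+pgs?\s+undersized",
    }
    found = {}
    for key, pat in patterns.items():
        m = re.search(pat, summary, re.IGNORECASE)
        if m:
            found[key] = int(m.group(1))
    return found

alert = ("12 osds down, 847 pgs degraded, 312 pgs undersized, "
         "recovery operations consuming 80% cluster bandwidth")
print(parse_ceph_summary(alert))
# -> {'osds_down': 12, 'pgs_degraded': 847, 'pgs_undersized': 312}
```

Feeding the alert above through this yields exactly the counts the dashboard shows, which makes it easy to wire such alerts into automated correlation.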
Neural Engine Root Cause Analysis
The root cause is a power supply failure on node ceph-04 that brought down 12 OSDs, putting the Ceph cluster into HEALTH_WARN. This triggered a cascade: 847 placement groups are degraded and 312 are undersized, severely impacting client I/O. Automatic recovery is underway but is consuming 80% of available bandwidth, so the storage infrastructure is operating with reduced redundancy and performance until recovery completes.
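When recovery saturates cluster bandwidth like this, operators typically trade recovery speed for client I/O by lowering Ceph's recovery throttles (`osd_max_backfills` and `osd_recovery_max_active` are real Ceph OSD options, applied via `ceph config set osd ...`). A sketch of that decision; the percentage cutoffs are illustrative assumptions, not official guidance:

```python
def recovery_throttle(client_io_impact_pct: float) -> dict:
    """Pick conservative Ceph recovery throttles based on how badly
    client I/O is suffering.  The option names are real Ceph OSD
    settings; the cutoffs below are illustrative only."""
    if client_io_impact_pct >= 50:   # clients badly starved: slow recovery right down
        return {"osd_max_backfills": 1, "osd_recovery_max_active": 1}
    if client_io_impact_pct >= 20:   # noticeable latency: moderate throttle
        return {"osd_max_backfills": 2, "osd_recovery_max_active": 3}
    return {"osd_max_backfills": 4, "osd_recovery_max_active": 5}  # let recovery run faster

# Recovery consuming 80% of bandwidth means severe client impact:
print(recovery_throttle(80))
# -> {'osd_max_backfills': 1, 'osd_recovery_max_active': 1}
```

The chosen values would then be applied with e.g. `ceph config set osd osd_max_backfills 1`, and relaxed again once client latency recovers.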
Remediation Plan
1. Immediately assess the hardware status of the ceph-04 node and replace or repair the failed power supply.
2. Once the node is powered on, verify that all 12 OSDs on ceph-04 come back online cleanly.
3. Monitor cluster recovery progress and adjust recovery parameters as needed to balance recovery speed against client I/O impact.
4. Verify the cluster returns to HEALTH_OK and all placement groups are active+clean.
5. Implement power-redundancy monitoring to prevent future single-point-of-failure scenarios.
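Step 4 above amounts to polling cluster health until it reads HEALTH_OK. A minimal sketch of such a wait loop; the status source is injected (in production it would wrap `ceph health`), so the loop itself can be exercised without a live cluster:

```python
import time

def wait_for_health_ok(get_status, poll_s=30, timeout_s=6 * 3600):
    """Poll a status source until it reports HEALTH_OK or the timeout
    (defaulting to the 6-hour estimated recovery window) expires.
    get_status is injected so the loop is testable without a cluster."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_status() == "HEALTH_OK":
            return True
        time.sleep(poll_s)
    return False

# Fake status source standing in for the real cluster:
states = iter(["HEALTH_WARN", "HEALTH_WARN", "HEALTH_OK"])
print(wait_for_health_ok(lambda: next(states), poll_s=0))
# -> True
```

A real wrapper would also alert if the deadline passes with the cluster still degraded, since that suggests recovery has stalled.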