PASSED
infrastructure / pure_storage_controller_failover

Pure Storage Controller Failover

The active controller on a Pure Storage FlashArray fails, triggering an automatic failover to the standby controller. During the failover, I/O is briefly paused and disk queue depth spikes, causing latency-sensitive applications to experience errors.

Pattern: STORAGE_IO_LATENCY
Severity: CRITICAL
Confidence: 90%
Remediation: Remote Hands

Test Results

| Metric | Expected | Actual | Result |
| --- | --- | --- | --- |
| Pattern Recognition | STORAGE_IO_LATENCY | STORAGE_IO_LATENCY | |
| Severity Assessment | CRITICAL | CRITICAL | |
| Incident Correlation | Yes | 18 linked | |
| Cascade Escalation | N/A | No | |
| Remediation | Remote Hands — Corax contacts on-site support via call, email, or API | | |

Scenario Conditions

Pure Storage FlashArray//X50. Active controller CT0 kernel panic. Automatic failover to CT1. I/O paused for 12 seconds during failover. 30 hosts connected via FC. Disk queue depth spikes to 256 per LUN during pause.
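
The queue-depth spike described above can be sketched as a simple threshold check. This is a hypothetical illustration of per-LUN spike detection, not Corax's actual detection logic; the 32-per-LUN threshold is taken from the remediation plan's definition of "normal" queue depth.

```python
# Hypothetical per-LUN queue-depth spike detection; names and thresholds
# are illustrative, not Corax internals.

QUEUE_DEPTH_THRESHOLD = 32  # normal per-LUN queue depth per the remediation plan

def detect_spike(samples, threshold=QUEUE_DEPTH_THRESHOLD):
    """Return the (timestamp, depth) samples that exceed the threshold."""
    return [(t, depth) for t, depth in samples if depth > threshold]

# Queue depth for one LUN, sampled once per second around the failover window.
samples = [(0, 8), (1, 12), (2, 256), (3, 256), (4, 240), (14, 16)]
spikes = detect_spike(samples)
assert spikes == [(2, 256), (3, 256), (4, 240)]
```

In this scenario the samples during the 12-second I/O pause would sit at 256 per LUN, well above the threshold, while pre- and post-failover samples fall back under it.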

Injected Error Messages (2)

Pure Storage FlashArray controller failover — CT0 experienced kernel panic, automatic failover to CT1 in progress, disk queue depth spiked to 256 per LUN during 12-second I/O pause, 30 connected hosts experiencing elevated latency, disk queue backlog clearing on standby controller
Database storage latency spike during Pure Storage failover — disk queue depth exceeded threshold, I/O operations queued for 12 seconds, transaction log writes delayed, disk queue causing query execution delays on latency-sensitive workloads
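
Messages like the two above could be mapped to the STORAGE_IO_LATENCY pattern with simple keyword rules. The sketch below is an assumption about how such classification might work; the actual Neural Engine matching is not described in this report.

```python
import re

# Illustrative keyword rules for mapping raw error text to a pattern label;
# the real classification logic is not public.
PATTERN_RULES = {
    "STORAGE_IO_LATENCY": re.compile(
        r"disk queue|i/o pause|storage latency", re.IGNORECASE
    ),
}

def classify(message):
    """Return the first pattern label whose rule matches, else UNKNOWN."""
    for label, rule in PATTERN_RULES.items():
        if rule.search(message):
            return label
    return "UNKNOWN"

msg = ("Pure Storage FlashArray controller failover — disk queue depth "
       "spiked to 256 per LUN during 12-second I/O pause")
assert classify(msg) == "STORAGE_IO_LATENCY"
```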

Neural Engine Root Cause Analysis

The Pure Storage FlashArray CT0 controller experienced a kernel panic, triggering an automatic failover to CT1. This hardware-level failure caused a 12-second I/O pause with disk queue depths spiking to 256 per LUN, affecting 30 connected hosts with elevated storage latency. The 10 correlated incidents suggest a cascading effect where dependent services and applications experienced timeouts or performance degradation during the storage I/O interruption.
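
The incident correlation described above can be illustrated as a time-window join against the failover event. This is a hypothetical sketch of one plausible linking rule, not the engine's real correlation algorithm; the timestamps and 60-second window are invented for the example.

```python
from datetime import datetime, timedelta

# Hypothetical rule: link any incident whose start time falls within a
# window around the controller failover. All values here are illustrative.
FAILOVER_START = datetime(2026, 3, 30, 10, 0, 0)
WINDOW = timedelta(seconds=60)

def correlated(incidents, anchor=FAILOVER_START, window=WINDOW):
    """Return incidents that started within the window around the anchor."""
    return [i for i in incidents if abs(i["start"] - anchor) <= window]

incidents = [
    {"id": "db-latency", "start": FAILOVER_START + timedelta(seconds=5)},
    {"id": "app-timeout", "start": FAILOVER_START + timedelta(seconds=12)},
    {"id": "unrelated", "start": FAILOVER_START + timedelta(hours=3)},
]
linked = correlated(incidents)
assert [i["id"] for i in linked] == ["db-latency", "app-timeout"]
```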

Remediation Plan

1. Verify CT1 controller is fully operational and handling all I/O requests.
2. Monitor disk queue depths and confirm they are returning to normal levels (typically <32 per LUN).
3. Check all 30 connected hosts for application recovery and clear any stuck I/O operations.
4. Investigate CT0 controller kernel panic logs to identify hardware or firmware issues.
5. Contact Pure Storage support for CT0 controller replacement or repair.
6. Plan a maintenance window to fail back to CT0 once repaired, or continue running on CT1 if performance is acceptable.
7. Review and potentially adjust multipathing configurations on connected hosts to improve future failover performance.
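
Step 2 of the plan (confirming per-LUN queue depth returns below 32) can be spot-checked from a connected Linux host by reading `/sys/block/<dev>/inflight`, which reports in-flight reads and writes as two whitespace-separated counters. The helper below is a minimal sketch for that check; it assumes Linux hosts and uses total in-flight I/O as a proxy for queue depth.

```python
from pathlib import Path

NORMAL_QUEUE_DEPTH = 32  # per-LUN threshold from the remediation plan

def parse_inflight(text):
    """Parse /sys/block/<dev>/inflight content: '<reads> <writes>'."""
    reads, writes = (int(x) for x in text.split())
    return reads + writes

def busy_devices(block_dir="/sys/block", threshold=NORMAL_QUEUE_DEPTH):
    """Return block devices whose total in-flight I/O exceeds the threshold."""
    busy = []
    for dev in Path(block_dir).iterdir():
        inflight = dev / "inflight"
        if inflight.is_file() and parse_inflight(inflight.read_text()) > threshold:
            busy.append(dev.name)
    return sorted(busy)

assert parse_inflight("      12       20\n") == 32
```

An empty result from `busy_devices()` after the failover would support closing step 2; any device still listed warrants checking its paths before failing back to CT0.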
Tested: 2026-03-30 | Monitors: 2 | Incidents: 2 | Test ID: cmncjzu2705xsobqercqfxh50