The active controller on a Pure Storage FlashArray fails, triggering an automatic failover to the standby controller. During the failover, I/O is briefly paused and disk queue depth spikes, causing latency-sensitive applications to experience errors.
Pattern: STORAGE_IO_LATENCY
Severity: CRITICAL
Confidence: 90%
Remediation: Remote Hands
Test Results

| Metric | Expected | Actual | Result |
| --- | --- | --- | --- |
| Pattern Recognition | STORAGE_IO_LATENCY | STORAGE_IO_LATENCY | |
| Severity Assessment | CRITICAL | CRITICAL | |
| Incident Correlation | Yes | 18 linked | |
| Cascade Escalation | N/A | No | |
| Remediation | — | Remote Hands — Corax contacts on-site support via call, email, or API | |
Scenario Conditions
Pure Storage FlashArray//X50. Active controller CT0 kernel panic. Automatic failover to CT1. I/O paused for 12 seconds during failover. 30 hosts connected via FC. Disk queue depth spikes to 256 per LUN during pause.
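The queue-depth spike above is consistent with simple queueing arithmetic: during an I/O pause of T seconds, a steady per-LUN arrival rate of λ requests per second accumulates roughly λ·T outstanding I/Os. A minimal back-of-envelope sketch using the scenario's figures (the steady-state arrival rate is inferred from them, not stated in the scenario):

```python
# Back-of-envelope check of the scenario numbers: a 12-second pause that
# accumulates 256 outstanding I/Os per LUN implies the per-LUN arrival rate
# below. The rate is an inference, not a value given in the scenario.
pause_seconds = 12
peak_queue_depth = 256

implied_iops_per_lun = peak_queue_depth / pause_seconds
print(round(implied_iops_per_lun, 1))  # ~21.3 I/Os per second per LUN
```

This also explains why latency-sensitive workloads fail first: every request issued during the pause waits the full remaining pause duration plus its drain time once the standby controller takes over.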
Injected Error Messages (2)

1. Pure Storage FlashArray controller failover — CT0 experienced kernel panic, automatic failover to CT1 in progress, disk queue depth spiked to 256 per LUN during 12-second I/O pause, 30 connected hosts experiencing elevated latency, disk queue backlog clearing on standby controller
2. Database storage latency spike during Pure Storage failover — disk queue depth exceeded threshold, I/O operations queued for 12 seconds, transaction log writes delayed, disk queue causing query execution delays on latency-sensitive workloads
Neural Engine Root Cause Analysis
The Pure Storage FlashArray CT0 controller experienced a kernel panic, triggering an automatic failover to CT1. This hardware-level failure caused a 12-second I/O pause with disk queue depths spiking to 256 per LUN, affecting 30 connected hosts with elevated storage latency. The 18 correlated incidents suggest a cascading effect where dependent services and applications experienced timeouts or performance degradation during the storage I/O interruption.
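On Linux hosts, the per-device queue backlog described above can be observed directly: field 9 of `/sys/block/<dev>/stat` is the number of I/Os currently in flight. A hedged sketch that flags devices still above the normal-depth threshold; the device names, the threshold of 32, and the sample snapshot are illustrative assumptions, not values from this incident's telemetry:

```python
# Sketch: parse /sys/block/<dev>/stat lines and flag devices whose in-flight
# I/O count (field 9) exceeds a threshold. The threshold of 32 mirrors the
# "normal" queue depth cited in the remediation plan; the sample data below
# is hypothetical.
QUEUE_DEPTH_THRESHOLD = 32

def in_flight(stat_line: str) -> int:
    """Field 9 (index 8) of a /sys/block/<dev>/stat line is I/Os in flight."""
    return int(stat_line.split()[8])

def devices_over_threshold(stats: dict, threshold: int = QUEUE_DEPTH_THRESHOLD) -> dict:
    """Return {device: in_flight_count} for devices above the threshold."""
    return {dev: q for dev, line in stats.items()
            if (q := in_flight(line)) > threshold}

# Hypothetical snapshot taken during the failover pause:
sample = {
    "dm-3": "1200 0 9600 450 800 0 6400 300 256 700 750",  # 256 in flight
    "dm-4": "1100 0 8800 420 750 0 6000 280 12 650 700",   # 12 in flight
}
print(devices_over_threshold(sample))  # only dm-3 exceeds the threshold
```

In production the `sample` dict would be built by reading `/sys/block/<dev>/stat` for each mapped LUN on the 30 connected hosts and polling until all devices drop back under the threshold.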
Remediation Plan
1. Verify CT1 controller is fully operational and handling all I/O requests.
2. Monitor disk queue depths and confirm they are returning to normal levels (typically <32 per LUN).
3. Check all 30 connected hosts for application recovery and clear any stuck I/O operations.
4. Investigate CT0 controller kernel panic logs to identify hardware or firmware issues.
5. Contact Pure Storage support for CT0 controller replacement or repair.
6. Plan a maintenance window to fail back to CT0 once repaired, or continue running on CT1 if performance is acceptable.
7. Review and potentially adjust multipathing configurations on connected hosts to improve future failover performance.
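For the multipathing review in the last step, the per-path state on each host is visible in `multipath -ll` output: path lines end in states such as `active ready running` or `failed faulty running`. A minimal parsing sketch; the regex patterns and the sample excerpt are assumptions about typical dm-multipath output, not captures from this incident:

```python
# Sketch: count healthy vs. failed paths in `multipath -ll` output so a host
# with degraded redundancy can be flagged before the next failover. The
# SAMPLE excerpt is hypothetical.
import re

def path_health(multipath_ll: str) -> dict:
    """Count path lines by state in `multipath -ll` output."""
    counts = {"active": 0, "failed": 0}
    for line in multipath_ll.splitlines():
        if re.search(r"\bactive ready running\b", line):
            counts["active"] += 1
        elif re.search(r"\bfailed\b", line):
            counts["failed"] += 1
    return counts

# Hypothetical excerpt for one Pure LUN after the failover to CT1:
SAMPLE = """\
3624a9370deadbeef dm-3 PURE,FlashArray
size=2.0T features='0' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 1:0:0:1 sdb 8:16 active ready running
  |- 1:0:1:1 sdc 8:32 active ready running
  |- 2:0:0:1 sdd 8:48 failed faulty running
  `- 2:0:1:1 sde 8:64 active ready running
"""
print(path_health(SAMPLE))  # {'active': 3, 'failed': 1}
```

Any host reporting failed paths after the failover completes would be a candidate for remote-hands follow-up alongside the CT0 repair.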