A Dell PowerStore 5200 experiences an unexpected controller A failure, triggering automatic failover to controller B. During failover, all I/O is paused for 15 seconds, causing application-level errors.
Pattern: UNKNOWN
Severity: CRITICAL
Confidence: 90%
Remediation: Remote Hands
Test Results

Metric                 Expected   Actual
Pattern Recognition    UNKNOWN    UNKNOWN
Severity Assessment    CRITICAL   CRITICAL
Incident Correlation   Yes        20 linked
Cascade Escalation     N/A        No
Remediation            —          Remote Hands — Corax contacts on-site support via call, email, or API
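The Remote Hands remediation path fans one ticket out over several contact channels (call, email, API). A minimal sketch of that dispatch step, assuming a pluggable-channel design; the `RemoteHandsTicket` and `dispatch` names are illustrative, not Corax's actual interface:

```python
from dataclasses import dataclass, field

@dataclass
class RemoteHandsTicket:
    """Hypothetical ticket handed to on-site support."""
    array: str
    severity: str
    summary: str
    channels: list = field(default_factory=lambda: ["call", "email", "api"])

def dispatch(ticket: RemoteHandsTicket) -> dict:
    """Render one message body and return it keyed by delivery channel."""
    body = f"[{ticket.severity}] {ticket.array}: {ticket.summary}"
    return {channel: body for channel in ticket.channels}

messages = dispatch(RemoteHandsTicket(
    array="Dell PowerStore 5200",
    severity="CRITICAL",
    summary="Controller node-A kernel panic; cluster in reduced-redundancy mode",
))
```

Keeping the channel list on the ticket (rather than hard-coding it) lets a single failure type escalate differently per site.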
Scenario Conditions
Dell PowerStore 5200 dual-controller. Controller A kernel panic. Automatic failover to controller B. 15-second I/O pause. 200 LUNs served. 50 ESXi hosts connected. NVMe-oF and iSCSI paths.
Injected Error Messages (2)

1. dell PowerStore 5200 controller A failover — controller node-A experienced kernel panic and rebooted, dell PowerStore automatic failover to controller node-B completed in 15 seconds, 200 LUNs now served exclusively from node-B, dell PowerStore cluster status: reduced-redundancy, all I/O paused during failover window causing application-level errors across 50 connected hosts
2. dell PowerStore 5200 storage I/O disruption during controller failover — NVMe-oF and iSCSI paths to controller node-A marked dead, dell PowerStore ALUA path failover to node-B, 15-second I/O blackout affected 200 volumes, ESXi hosts reporting PDL (Permanent Device Loss) conditions during transition, dell PowerStore node-A attempting recovery restart
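Messages like these carry the key facts (failover target, failover duration, LUN count) in free text. A small sketch of extracting those fields with regular expressions; the field names and patterns are assumptions for illustration, not Corax's actual pattern-recognition logic:

```python
import re

# First injected message from the scenario above, abbreviated.
MSG = ("dell PowerStore 5200 controller A failover — controller node-A "
       "experienced kernel panic and rebooted, dell PowerStore automatic "
       "failover to controller node-B completed in 15 seconds, 200 LUNs now "
       "served exclusively from node-B")

def extract_fields(msg: str) -> dict:
    """Pull structured fields out of a free-text failover message."""
    fields = {}
    # Which controller took over, and how long the failover window was.
    if m := re.search(r"failover to controller (node-\w+) completed in (\d+) seconds", msg):
        fields["target_node"] = m.group(1)
        fields["failover_seconds"] = int(m.group(2))
    # How many LUNs were affected.
    if m := re.search(r"(\d+) LUNs", msg):
        fields["lun_count"] = int(m.group(1))
    return fields
```

Running `extract_fields(MSG)` yields the target node, the 15-second window, and the 200-LUN count as typed values ready for correlation.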
Neural Engine Root Cause Analysis
Dell PowerStore 5200 controller node-A experienced a kernel panic, causing an immediate system reboot and automatic failover to controller node-B. This is a hardware-level failure likely caused by memory corruption, a hardware fault, or a critical system software bug. The 15-second failover window disrupted I/O across all 200 LUNs and 50 connected hosts, resulting in application-level errors. The system is now operating in reduced-redundancy mode, with node-B as a single point of failure.
Remediation Plan
1. Immediately verify controller node-B health and performance to ensure stable operation.
2. Review controller node-A system logs and crash dumps to identify the kernel panic root cause.
3. Check hardware diagnostics, including memory, CPU, and storage controller components on node-A.
4. Coordinate with Dell support for hardware replacement or firmware updates if a hardware failure is detected.
5. Once node-A is restored and healthy, re-establish dual-controller redundancy.
6. Monitor system performance and validate that all LUNs are accessible from both controllers.
7. Review and potentially tune failover procedures to minimize future I/O disruption windows.
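Step 1 above amounts to polling node-B until it consistently reports healthy and serving all 200 LUNs. A minimal sketch, assuming a `get_node_health` callable that wraps whatever the array's management API actually exposes (the function and its returned keys are assumptions, not the PowerStore REST schema):

```python
import time

def verify_controller_health(get_node_health, node="node-B",
                             expected_luns=200, attempts=3, interval=0.0):
    """Return True only if the node reports healthy on every consecutive check.

    get_node_health(node) is a stand-in for a management-API call; it is
    assumed to return a dict like {"state": "healthy", "luns_served": 200}.
    """
    for _ in range(attempts):
        status = get_node_health(node)
        if status.get("state") != "healthy":
            return False
        if status.get("luns_served", 0) < expected_luns:
            return False
        time.sleep(interval)
    return True

# Usage with a fake API for illustration:
fake_api = lambda node: {"state": "healthy", "luns_served": 200}
stable = verify_controller_health(fake_api)
```

Requiring several consecutive healthy reads, rather than one, guards against declaring the surviving controller stable mid-recovery.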