PASSED

network / core_switch_psu_failure

Core Switch Power Supply Failure — Stack Degradation

The primary power supply in the core switch stack fails, causing the switch to reboot onto the secondary PSU. During the reboot, the switch stack ring breaks and a stack master re-election occurs, disrupting all traffic through the core for 90 seconds.

Pattern

SWITCH_STACK_EVENT

Severity

CRITICAL

Confidence

92%

Remediation

Remote Hands

Test Results

Metric	Expected	Actual
Pattern Recognition	SWITCH_STACK_EVENT	SWITCH_STACK_EVENT
Severity Assessment	CRITICAL	CRITICAL
Incident Correlation	Yes	36 linked
Cascade Escalation	Yes	Yes
Remediation	—	Remote Hands — Corax contacts on-site support via call, email, or API

Scenario Conditions

Cisco 9300 4-member switch stack. Primary PSU on stack member 1 (master) failed. Stack ring broken during reboot. Master re-election required. 48 trunk uplinks. 2000+ devices downstream.

Injected Error Messages (3)

Switch stack ring failure on core-sw-stack — member 1 PSU-A failed, member rebooting, switch stack master re-election triggered, stack ring broken between member 1 and member 2, all forwarding suspended during convergence

Distribution uplink to core down — switch stack reconverging, all trunk ports on former master (member 1) offline, STP reconvergence on 12 VLANs, 90-second traffic blackout

Data center ToR lost uplink to core — switch stack master election in progress, new master: member 3, all MAC address tables being rebuilt, ARP tables flushed
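As an illustration of how the three injected messages could all map to the SWITCH_STACK_EVENT pattern, here is a minimal sketch using keyword rules. The rule list and the `classify` function are hypothetical; this report does not describe how the actual pattern recognition works.

```python
import re

# Hypothetical keyword rules, chosen only because these phrases appear
# in the injected messages above. Not the product's real matching logic.
STACK_EVENT_RULES = [
    re.compile(r"switch stack", re.IGNORECASE),
    re.compile(r"stack ring", re.IGNORECASE),
    re.compile(r"master (re-)?election", re.IGNORECASE),
]


def classify(message: str) -> str:
    """Return 'SWITCH_STACK_EVENT' if any stack-related phrase matches, else 'UNKNOWN'."""
    if any(rule.search(message) for rule in STACK_EVENT_RULES):
        return "SWITCH_STACK_EVENT"
    return "UNKNOWN"


if __name__ == "__main__":
    samples = [
        "Switch stack ring failure on core-sw-stack",
        "Distribution uplink to core down, switch stack reconverging",
        "Data center ToR lost uplink to core, switch stack master election in progress",
    ]
    for sample in samples:
        print(classify(sample), "<-", sample)
```

Under these assumed rules, each of the three messages matches on the phrase "switch stack", which is consistent with them being grouped under one pattern.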

Neural Engine Root Cause Analysis

The core switch stack is experiencing a critical hardware failure where PSU-A (Power Supply Unit A) on member 1 has failed, causing that switch to reboot. This hardware failure has broken the stack ring topology between member 1 and member 2, triggering a master re-election process and suspending all network forwarding during convergence. The 14 correlated incidents indicate this is causing a network-wide outage affecting downstream systems.

Remediation Plan

1. Immediately contact network operations team for emergency hardware intervention.
2. If possible, temporarily isolate failed member 1 from the stack to restore partial connectivity through remaining stack members.
3. Replace the failed PSU-A unit on member 1.
4. Once hardware is replaced, re-add member 1 to the stack configuration.
5. Verify stack ring integrity and monitor for proper convergence.
6. Test network connectivity across all VLANs and validate all correlated incidents are resolved.
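The ring-integrity check in step 5 can be sketched as a small graph test: the stack is a full ring only if every member has exactly two stack neighbors and all members are reachable from one another. The function name, the `(a, b)` link representation, and the sample data are assumptions for illustration; on a real stack the link state would come from the switch itself.

```python
from collections import defaultdict


def ring_is_intact(members, links):
    """Return True if the working stack links form one closed ring over all members.

    members: iterable of member numbers, e.g. [1, 2, 3, 4]
    links:   iterable of (a, b) pairs, one per working stack link (hypothetical format)
    Assumes at least three members; a two-member ring needs separate handling.
    """
    members = set(members)
    if len(members) < 3:
        return False

    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # In a full ring every member has exactly two distinct stack neighbors.
    if any(len(neighbors[m]) != 2 for m in members):
        return False

    # Walk the links from one member; a single ring reaches every member.
    start = next(iter(members))
    seen, pending = {start}, [start]
    while pending:
        for n in neighbors[pending.pop()]:
            if n not in seen:
                seen.add(n)
                pending.append(n)
    return seen == members


if __name__ == "__main__":
    healthy = [(1, 2), (2, 3), (3, 4), (4, 1)]
    broken = [(2, 3), (3, 4), (4, 1)]  # link between member 1 and member 2 is down
    print(ring_is_intact([1, 2, 3, 4], healthy))
    print(ring_is_intact([1, 2, 3, 4], broken))
```

The `broken` case mirrors this scenario, where the ring is open between member 1 and member 2; the remaining links still form a chain, but not a ring.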

Tested: 2026-03-30 | Monitors: 3 | Incidents: 3 | Test ID: cmncjpant03fsobqeh0bfgq5z