PASSEDvendor / hyperv_cluster_node_failure

Hyper-V Host Cluster Node Failure

A Hyper-V host in a 3-node failover cluster experiences a blue screen of death. Failover clustering attempts to live-migrate VMs to surviving nodes but 4 highly available VMs fail to migrate due to anti-affinity rules and insufficient resources.

Pattern

HYPERV_EVENT

Severity

CRITICAL

Confidence

95%

Remediation

Remote Hands

Test Results

Metric	Expected	Actual
Pattern Recognition	HYPERV_EVENT	HYPERV_EVENT
Severity Assessment	CRITICAL	CRITICAL
Incident Correlation	Yes	41 linked
Cascade Escalation	Yes	Yes
Remediation	—	Remote Hands — Corax contacts on-site support via call, email, or API

Scenario Conditions

Windows Server 2022 Hyper-V failover cluster (3 nodes). Node HV-03 BSOD (WHEA_UNCORRECTABLE_ERROR). 18 VMs on failed node. Cluster quorum maintained. Anti-affinity rules for SQL VMs. Surviving nodes at 80% memory.

Injected Error Messages (3)

Hyper-V host HV-03 offline — BSOD WHEA_UNCORRECTABLE_ERROR (bugcheck 0x00000124), hardware MCE on CPU 2, Windows Event ID 6008: unexpected shutdown, host unreachable via WinRM, 18 VMs affected

Failover cluster event: node HV-03 faulted — cluster event ID 1135: cluster node removed from active membership, live migration initiated for 18 VMs, 4 VMs failed to migrate (anti-affinity violation and insufficient memory on target nodes)

SQL Server AlwaysOn availability group failover — secondary replica on HV-03 lost, automatic failover to HV-01 replica, database synchronization interrupted, RPO data loss window: 12 seconds

Neural Engine Root Cause Analysis

Hyper-V host HV-03 has experienced a critical hardware failure indicated by WHEA_UNCORRECTABLE_ERROR (0x00000124) with Machine Check Exception on CPU 2. This is a hardware-level CPU fault that caused an immediate system crash and blue screen, taking down the entire host and affecting 18 virtual machines. The 15 correlated incidents within the same time window suggest either cascading failures from dependent systems or a broader infrastructure issue affecting multiple components simultaneously.

Remediation Plan

1. Immediate: Verify host hardware status via out-of-band management (iLO/iDRAC) 2. Check system event logs and hardware diagnostics for CPU/memory errors 3. Attempt remote reboot if hardware appears stable 4. If reboot fails or errors persist, schedule emergency hardware maintenance 5. Evacuate affected VMs to alternate Hyper-V hosts 6. Replace or repair faulty CPU/motherboard components 7. Validate system stability before returning VMs to host 8. Investigate the 15 correlated incidents to determine if they are cascading effects or indicate broader infrastructure problems

Tested: 2026-03-30Monitors: 3 | Incidents: 3Test ID: cmncjhdjp01i4obqeufhkjrqm