PASSED

server / windows_cluster_quorum_loss

Windows Cluster Quorum Loss

A 3-node Windows Server Failover Cluster loses quorum when two nodes simultaneously fail due to a shared storage controller issue. The remaining node cannot form quorum alone, and all clustered services go offline including SQL Server Always On Availability Groups.

Pattern

UNKNOWN

Severity

CRITICAL

Confidence

85%

Remediation

Remote Hands

Test Results

Metric	Expected	Actual
Pattern Recognition	UNKNOWN	UNKNOWN
Severity Assessment	CRITICAL	CRITICAL
Incident Correlation	Yes	27 linked
Cascade Escalation	Yes	Yes
Remediation	—	Remote Hands — Corax contacts on-site support via call, email, or API

Scenario Conditions

3-node WSFC cluster with dynamic quorum. Node majority voting. Two nodes lose connectivity to shared storage and crash simultaneously. File share witness unreachable due to network partition. SQL Server AG hosted on cluster.
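The vote arithmetic behind these conditions can be sketched as follows. This is a minimal illustration that assumes static vote weights (one vote per node, one for the file share witness). It deliberately ignores the vote adjustments dynamic quorum makes, which help when nodes leave one at a time but not when two crash at once. The function names are illustrative and are not part of any Windows API.

```python
def votes_needed(total_votes: int) -> int:
    """Smallest strict majority of the configured votes."""
    return total_votes // 2 + 1


def has_quorum(total_votes: int, reachable_votes: int) -> bool:
    """True if the reachable voters form a strict majority."""
    return reachable_votes >= votes_needed(total_votes)


# Assumed static weights: 3 node votes + 1 file share witness vote.
TOTAL_VOTES = 3 + 1
# One node survives; the unreachable witness contributes no vote.
SURVIVING_VOTES = 1

if __name__ == "__main__":
    print("votes needed:", votes_needed(TOTAL_VOTES))
    print("quorum held:", has_quorum(TOTAL_VOTES, SURVIVING_VOTES))
```

If the witness is not counted at all (3 configured votes), the surviving node still holds 1 vote against a required 2, so the outcome is the same.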

Injected Error Messages (3)

Windows Failover Cluster node wsfc-01 offline — cluster quorum lost, event ID 1177: cluster service cannot form quorum with remaining nodes, file share witness unreachable, all clustered resources going offline

Windows Failover Cluster node wsfc-02 offline — shared storage controller failure caused simultaneous node crash, cluster quorum vote count insufficient, node isolated from cluster network

SQL Server Always On AG listener unreachable — Windows cluster quorum lost, availability group databases in resolving state, all database connections rejected, cluster cannot elect primary replica

Neural Engine Root Cause Analysis

Windows Failover Cluster node wsfc-01 has lost connectivity and cluster quorum cannot be formed with remaining nodes. Event ID 1177 indicates the cluster service cannot establish quorum, and the file share witness is unreachable, causing all clustered resources to go offline. The 12 correlated incidents within the same 5-minute window strongly suggest a broader infrastructure failure affecting multiple systems, likely a network outage, power failure, or shared infrastructure component failure rather than an isolated node issue.
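The time-window correlation mentioned in the analysis above can be pictured as a simple filter. This is a minimal sketch, assuming each incident carries a timestamp and that "the same 5-minute window" means a half-open interval starting at the first alert. The incident records, field names, and timestamps are hypothetical, and the sketch does not describe how the engine actually correlates incidents.

```python
from datetime import datetime, timedelta


def in_window(incidents, start, length=timedelta(minutes=5)):
    """Return incidents with start <= timestamp < start + length."""
    end = start + length
    return [inc for inc in incidents if start <= inc["at"] < end]


# Hypothetical incident records; names and times are illustrative only.
first_alert = datetime(2026, 3, 30, 12, 0, 0)
incidents = [
    {"name": "wsfc-01 offline", "at": first_alert},
    {"name": "wsfc-02 offline", "at": first_alert + timedelta(seconds=40)},
    {"name": "AG listener unreachable", "at": first_alert + timedelta(minutes=2)},
    {"name": "unrelated disk warning", "at": first_alert + timedelta(minutes=30)},
]

linked = in_window(incidents, first_alert)
print([inc["name"] for inc in linked])
```

A fixed window is the simplest possible grouping rule. A real correlator would likely also weigh shared hosts, services, or topology.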

Remediation Plan

1. Immediately verify network connectivity to wsfc-01 (10.10.5.30) and check if node is powered on.
2. Investigate the 12 correlated incidents to identify shared infrastructure impact (switches, power, storage).
3. Verify file share witness accessibility from remaining cluster nodes.
4. If wsfc-01 is recoverable, restart cluster service and verify node rejoin.
5. If node hardware failed, force quorum on remaining healthy nodes temporarily and plan node replacement.
6. Once connectivity restored, validate all cluster resources are online and functioning properly.
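The connectivity checks in steps 1 and 3 could be scripted with a plain TCP probe. This is a minimal sketch using only the Python standard library. The helper name `tcp_reachable` is hypothetical, and the port in the usage comment is an assumption (3343 is commonly documented as the cluster service port). Confirm the right port for your environment before relying on the result.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


# Example usage (assumed port; address taken from step 1 of the plan):
#   tcp_reachable("10.10.5.30", 3343)
```

A successful TCP connect only shows that the port answers. It does not confirm that the cluster service is healthy or that the node can rejoin.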

Tested: 2026-03-30 | Monitors: 3 | Incidents: 3 | Test ID: cmncjxldq05mwobqe0iot2vlf