PASSEDdatabase / mariadb_galera_split_brain

MariaDB Galera Cluster Split-Brain

A network partition splits a 3-node MariaDB Galera cluster into a 1-node partition and a 2-node partition. The isolated node enters non-primary state and rejects all queries. When the partition heals, the isolated node has divergent data that requires manual SST (State Snapshot Transfer) to resync, causing extended downtime.

Pattern

DATABASE_EVENT

Severity

CRITICAL

Confidence

85%

Remediation

Remote Hands

Test Results

Metric	Expected	Actual
Pattern Recognition	DATABASE_EVENT	DATABASE_EVENT
Severity Assessment	CRITICAL	CRITICAL
Incident Correlation	Yes	27 linked
Cascade Escalation	Yes	Yes
Remediation	—	Remote Hands — Corax contacts on-site support via call, email, or API

Scenario Conditions

MariaDB 10.11 Galera Cluster. 3 nodes across 2 network segments. Network partition isolates node-03 from nodes 01 and 02. Node-03 enters non-primary state. Writes continued on 2-node quorum. 45 minutes of divergent transactions on node-03 before partition detected.

Injected Error Messages (3)

MariaDB Galera cluster partition detected — wsrep_cluster_size: 2 (expected 3), node galera-03 removed from cluster view, wsrep_cluster_status: Primary, wsrep_connected: ON, slow query log showing increased replication write-set conflicts, flow control pauses increasing, database connection pool stable but cluster degraded to 2 nodes

Galera node-03 in non-primary state — WSREP: Node is non-primary, rejecting all queries with SQLSTATE 08S01: 'WSREP has not yet prepared node for application use', wsrep_cluster_status: non-Primary, wsrep_connected: OFF, 45 minutes of divergent transactions, full SST required to rejoin cluster, estimated resync time: 2 hours for 200GB dataset

HAProxy database pool degraded — backend galera-03 marked DOWN (health check: 'MariaDB Galera not synced'), active backend servers: 2/3, max connections redistributed across remaining nodes, connection queue depth increasing, galera-01 and galera-02 at 80% max connections, slow query response times climbing

Neural Engine Root Cause Analysis

The MariaDB Galera cluster is experiencing a network partition or node failure where galera-03 has been removed from the cluster view, reducing the cluster size from 3 to 2 nodes. While the remaining 2 nodes maintain quorum and primary status, the cluster is degraded and experiencing increased replication write-set conflicts and flow control pauses, indicating network latency or resource contention issues. The 12 correlated incidents suggest this is likely part of a broader infrastructure event affecting multiple services simultaneously.

Remediation Plan

1. Immediately check connectivity and status of galera-03 node (10.10.100.62 likely) 2. Verify network connectivity between all Galera nodes 3. Check galera-03 system resources (CPU, memory, disk I/O) and MariaDB error logs 4. If galera-03 is responsive, attempt to rejoin it to the cluster using 'SET GLOBAL wsrep_cluster_address' or restart MariaDB service 5. If galera-03 is unresponsive, investigate underlying infrastructure issues given the 12 correlated incidents 6. Monitor cluster performance metrics and reduce write load if necessary until full 3-node cluster is restored 7. Consider temporary read-only mode for non-critical applications to reduce replication conflicts

Tested: 2026-03-30Monitors: 3 | Incidents: 3Test ID: cmncjtu9l04mjobqew82a97ro