Back to All Scenarios
PASSEDdatabase / redis_cluster_node_failure

Redis Cluster Node Failure — Slot Coverage Lost

A Redis Cluster node holding 5,461 hash slots crashes due to a memory corruption bug. The cluster marks the node as FAIL and attempts automatic failover to its replica. The replica promotion fails because the replica was behind on replication. Queries to affected hash slots return CLUSTERDOWN errors.

Pattern
DATABASE_EVENT
Severity
CRITICAL
Confidence
85%
Remediation
Remote Hands

Test Results

MetricExpectedActualResult
Pattern RecognitionDATABASE_EVENTDATABASE_EVENT
Severity AssessmentCRITICALCRITICAL
Incident CorrelationYes42 linked
Cascade EscalationYesYes
RemediationRemote Hands — Corax contacts on-site support via call, email, or API

Scenario Conditions

Redis 7.2 Cluster with 6 masters, 6 replicas. Master redis-node-03 crashed. Hash slots 10923-16383 unavailable. Replica redis-node-09 behind by 1,200 operations. Cluster state: FAIL. Session store and cache affected.

Injected Error Messages (3)

Redis Cluster node failure — CLUSTER INFO: cluster_state:fail, cluster_slots_ok:10922, cluster_slots_fail:5461, node redis-node-03 (10.10.100.83:6379) marked as FAIL by majority, automatic failover to replica redis-node-09 attempted but failed: replica offset behind master by 1200 ops, database connection pool draining for affected slots
Redis node redis-node-03 unresponsive — process killed by kernel (signal 11, segmentation fault in jemalloc), core dump generated, hash slots 10923-16383 completely unavailable, CLUSTERDOWN response to all queries for affected key range, last known memory usage: 14.2GB of 16GB limit
Session service degraded — CLUSTERDOWN: The cluster is down, hash slot 12847 not served, user sessions in affected slot range returning errors, 33% of authenticated users losing sessions, login failures cascading, session lookup latency: N/A (errors), database connection retries exhausting thread pool

Neural Engine Root Cause Analysis

Redis cluster is experiencing a split-brain scenario where redis-node-03 has been marked as FAIL by majority consensus, causing 5,461 slots to become unavailable. The automatic failover to replica redis-node-09 failed because the replica is significantly behind the master (1,200 operations), indicating either network partitioning, replication lag, or resource constraints preventing proper data synchronization. The 16 correlated incidents suggest a broader infrastructure issue affecting multiple services simultaneously.

Remediation Plan

1. Immediately check network connectivity between all Redis cluster nodes, especially redis-node-03 and redis-node-09. 2. Verify system resources (CPU, memory, disk I/O) on all cluster nodes to identify bottlenecks causing replication lag. 3. Review Redis logs on redis-node-03 and redis-node-09 for specific error messages. 4. If network/resources are healthy, manually promote a healthy replica or restart the failed node after ensuring data consistency. 5. Investigate the 16 correlated incidents to address the underlying infrastructure issue. 6. Consider temporarily redistributing slots to healthy nodes if immediate recovery is needed.
Tested: 2026-03-30Monitors: 3 | Incidents: 3Test ID: cmncjtc4204inobqeqo95tzjl