PASSED: database / mysql_replication_break

MySQL Replication Break — Binary Log Position Lost

MySQL master-replica replication breaks after a storage volume snapshot invalidates the binary log position on the replica. The replica enters an error state with a GTID gap, and replication lag grows without bound. Applications relying on the read replicas begin returning stale data while writes to the master continue to succeed.

Pattern: DATABASE_EVENT
Severity: CRITICAL
Confidence: 85%
Remediation: Remote Hands

Test Results

Metric               | Expected       | Actual
Pattern Recognition  | DATABASE_EVENT | DATABASE_EVENT
Severity Assessment  | CRITICAL       | CRITICAL
Incident Correlation | Yes            | 27 linked
Cascade Escalation   | Yes            | Yes
Remediation          | Remote Hands — Corax contacts on-site support via call, email, or API

Scenario Conditions

MySQL 8.0 with GTID-based replication. 1 master, 2 replicas. Storage snapshot on replica-01 corrupted relay log. GTID auto-position enabled. Replication SQL thread stopped with error 1236. 300 read queries/second directed to replicas.
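The conditions above imply a specific replica configuration. As a verification sketch (the system variables below are real MySQL 8.0 variables; the query itself is illustrative and not part of the recorded test run), the relevant settings can be checked on replica-01 with:

```sql
-- Illustrative check, not part of the recorded test run
SELECT @@gtid_mode,                -- expected ON for GTID-based replication
       @@enforce_gtid_consistency, -- expected ON
       @@relay_log_recovery;       -- if OFF, relay logs are not rebuilt from the
                                   -- source after a crash or snapshot and can be
                                   -- left corrupt, as in this scenario
```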

Injected Error Messages (3)

MySQL master reporting replication lag on replica-01 — SHOW SLAVE HOSTS showing replica-01 with Seconds_Behind_Master: NULL, replication channel broken, binary log position 'mysql-bin.000847:4' no longer available on master, GTID set has gap, max connections approaching limit due to read traffic failover to master
MySQL replica-01 replication stopped — Last_SQL_Error: 'Could not execute relay log event, GTID 3e11fa47-9e56:1-984752 not found in binary log', Slave_SQL_Running: No, Slave_IO_Running: Yes, relay log corrupted after storage snapshot, replication lag: unbounded, read queries returning stale data from 6 hours ago
MySQL replica-02 under heavy load — read traffic shifted from broken replica-01, Threads_running: 247, slow query log filling rapidly, replication lag on this replica now 45 seconds and growing, max connections: 90% utilized, application read latency spiking to 12 seconds
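The three messages above hinge on a few replica status fields. A minimal sketch of how a monitor might classify replica health from those fields (the field names appear in the messages; the thresholds and function name are illustrative assumptions):

```python
# Classify replica health from a few SHOW REPLICA STATUS fields.
# Field names match the injected messages above; the 30-second
# threshold and the function name are assumptions for illustration.

def classify_replica(status: dict) -> str:
    io_ok = status.get("Slave_IO_Running") == "Yes"
    sql_ok = status.get("Slave_SQL_Running") == "Yes"
    lag = status.get("Seconds_Behind_Master")  # None (NULL) when a thread is stopped

    if not (io_ok and sql_ok) or lag is None:
        return "BROKEN"      # replication thread stopped, lag unbounded
    if lag > 30:
        return "LAGGING"     # e.g. replica-02 at 45 s in this scenario
    return "HEALTHY"

# replica-01 as reported above: SQL thread stopped, lag NULL
print(classify_replica({
    "Slave_IO_Running": "Yes",
    "Slave_SQL_Running": "No",
    "Seconds_Behind_Master": None,
}))  # BROKEN
```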

Neural Engine Root Cause Analysis

MySQL master-replica replication has broken down due to a cascade of issues: the binary log position referenced by the replica is no longer available on the master (suggesting log purging or a master restart), there is a gap in the GTID set causing replication inconsistency, and the master is approaching max_connections because read traffic has been redirected away from the failed replica. The NULL Seconds_Behind_Master indicates the replication SQL thread has stopped entirely, most likely because of the missing binary log position and the GTID gap.
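The GTID gap the analysis refers to can be detected mechanically. A sketch that parses a GTID set string in the format quoted in the error above ('3e11fa47-9e56:1-984752') and reports any missing transaction ranges (the function name is an assumption):

```python
# Find gaps in a single-UUID GTID set string such as
# 'uuid:1-4500:4600-984752'. A gap in the executed set means
# transactions were skipped -- the inconsistency described above.

def gtid_gaps(gtid_set: str) -> list[tuple[int, int]]:
    uuid, _, intervals = gtid_set.partition(":")
    ranges = []
    for part in intervals.split(":"):
        lo, _, hi = part.partition("-")
        ranges.append((int(lo), int(hi or lo)))
    ranges.sort()
    gaps = []
    for (_, prev_hi), (lo, _) in zip(ranges, ranges[1:]):
        if lo > prev_hi + 1:
            gaps.append((prev_hi + 1, lo - 1))
    return gaps

print(gtid_gaps("3e11fa47-9e56:1-4500:4600-984752"))  # [(4501, 4599)]
print(gtid_gaps("3e11fa47-9e56:1-984752"))            # [] -- contiguous, no gap
```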

Remediation Plan

1. Immediately increase max_connections on the master to absorb the redirected read traffic
2. Stop the replication threads on replica-01
3. Take a consistent backup/snapshot of the master
4. Restore replica-01 from the master backup to establish a clean baseline
5. Reset replication with the current master position and GTID set
6. Start the replication threads and verify replication is working
7. Gradually redirect read traffic back to the replica
8. Review binary log retention settings (binlog_expire_logs_seconds in MySQL 8.0) to prevent future log purging issues
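Steps 2–6 of the plan above can be sketched as MySQL 8.0 statements. This is an illustrative outline, not the recorded remediation: host names, the replication user, and the use of the CLONE plugin to re-seed the replica are assumptions (a dump-and-restore works equally well), and passwords are elided.

```sql
-- On replica-01: stop the broken channel and discard its state
STOP REPLICA;
RESET REPLICA ALL;

-- Re-seed replica-01 from the master, e.g. with the CLONE plugin
-- (plugin must be installed on both sides; host/user are assumptions)
CLONE INSTANCE FROM 'repl_user'@'master-host':3306 IDENTIFIED BY '...';

-- After the post-clone restart: point the replica at the master using
-- GTID auto-position, so it requests exactly the transactions it lacks
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'master-host',
  SOURCE_USER = 'repl_user',
  SOURCE_PASSWORD = '...',
  SOURCE_AUTO_POSITION = 1;
START REPLICA;

-- Verify both threads are running and lag is falling
SHOW REPLICA STATUS\G
```

With SOURCE_AUTO_POSITION = 1 there is no need to hand-compute a binary log file and position, which is exactly the piece of state this scenario lost.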
Tested: 2026-03-30 | Monitors: 3 | Incidents: 3 | Test ID: cmncjsu9l04crobqee188v79a