Warm Standby Database Out of Sync — Failover Would Cause Data Loss
The warm standby database at the DR site has fallen out of synchronous replication and has been running in asynchronous mode for 5 days without detection. The standby is now more than 2 hours behind the primary; if failover were triggered now, roughly 2 hours of committed transactions would be lost.
Pattern: DATABASE_EVENT
Severity: CRITICAL
Confidence: 95%
Remediation: Remote Hands
Test Results
| Metric | Expected | Actual | Result |
| --- | --- | --- | --- |
| Pattern Recognition | DATABASE_EVENT | DATABASE_EVENT | |
| Severity Assessment | CRITICAL | CRITICAL | |
| Incident Correlation | Yes | 17 linked | |
| Cascade Escalation | N/A | No | |
| Remediation | — | Remote Hands — Corax contacts on-site support via call, email, or API | |
Scenario Conditions
Synchronous replication configured between primary and standby. The standby silently switched to async 5 days ago after a network hiccup, and there is no monitoring on replication mode. The standby is 2 hours behind. pg_stat_replication shows sync_state 'async' instead of 'sync', even though synchronous_commit is still set to 'remote_apply'.
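The mismatch described above can be confirmed directly on the primary. A minimal sketch, assuming PostgreSQL 10 or later and a role with sufficient privileges (e.g. `pg_monitor`); the node name `pg-primary` comes from the alerts:

```sql
-- Run on the primary (pg-primary). A healthy synchronous setup shows
-- sync_state = 'sync' (or 'quorum'); 'async' means commits are not
-- waiting for the standby and the zero-data-loss guarantee is broken.
SHOW synchronous_commit;           -- expected: remote_apply
SHOW synchronous_standby_names;    -- must name the standby, or sync is off

SELECT application_name,
       state,
       sync_state,                 -- 'async' here is the failure signature
       replay_lag                  -- wall-clock estimate of standby delay
FROM pg_stat_replication;
```

Because `sync_state` is only exposed at runtime, a configuration that looks correct in `postgresql.conf` can still be running async, which is exactly how this scenario went unnoticed for 5 days.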
Injected Error Messages (2)
database replication mode mismatch — pg-primary synchronous_commit is set to 'remote_apply' but pg_stat_replication shows sync_state='async' for standby connection, replication fell out of synchronous mode 5 days ago after network interruption and never recovered, primary continued accepting writes without waiting for standby confirmation, max connections normal, replication lag now 2 hours 14 minutes, zero-data-loss guarantee broken
warm standby database behind primary — pg-standby replay_lsn is 2 hours 14 minutes behind primary's write_lsn, standby receiving WAL segments but applying them with increasing delay, standby was supposed to be in synchronous mode (zero data loss) but is actually running asynchronous, if failover triggered now: 2 hours of committed transactions would be permanently lost, replication lag trend: stable but gap persists
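The replay gap quoted in the second alert can also be measured from the standby itself. A minimal sketch, assuming PostgreSQL 10+ and run on the standby (`pg-standby` in the alerts):

```sql
-- Run on the standby. Compares WAL received vs WAL replayed, and
-- reports how stale the most recently replayed transaction is.
SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                       pg_last_wal_replay_lsn()) AS replay_gap_bytes,
       now() - pg_last_xact_replay_timestamp()   AS replay_delay;
```

A large `replay_delay` with a small `replay_gap_bytes` matches the alert's description: WAL segments are arriving but being applied with increasing delay.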
Neural Engine Root Cause Analysis
The PostgreSQL primary database has a critical replication configuration mismatch where synchronous_commit is set to 'remote_apply' but the standby is operating in asynchronous mode. This occurred 5 days ago after a network interruption caused the replication to fall out of sync, and the primary continued accepting writes without waiting for standby confirmation, breaking the zero-data-loss guarantee. The 2+ hour replication lag indicates the standby is significantly behind, creating a dangerous data consistency situation.
Remediation Plan
1. Immediately assess data criticality and consider putting the application in read-only mode to limit further data-loss exposure.
2. Check network connectivity between the primary and standby servers.
3. Verify standby server health and replication slot status.
4. Attempt to re-establish synchronous replication by restarting replication on the standby or recreating the replication slot.
5. If replication cannot be quickly restored, consider temporarily setting synchronous_commit to 'local' to preserve primary availability while replication is repaired.
6. Once the standby has caught up, restore synchronous_commit to 'remote_apply'.
7. Implement monitoring to detect future replication-mode failures immediately.
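Steps 5–7 above might look like the following on the primary. This is a hedged sketch, not a drop-in runbook: the 5-minute lag threshold is an illustrative assumption, and `synchronous_commit` changes made with ALTER SYSTEM take effect on reload because the parameter is reloadable:

```sql
-- Step 5: temporarily stop waiting on the broken standby so the
-- primary stays available while replication is repaired.
ALTER SYSTEM SET synchronous_commit = 'local';
SELECT pg_reload_conf();

-- Step 6: once the standby has caught up, restore the guarantee.
ALTER SYSTEM SET synchronous_commit = 'remote_apply';
SELECT pg_reload_conf();

-- Step 7: a monitoring probe — alert if any standby is not in
-- synchronous mode or its replay lag exceeds 5 minutes.
SELECT application_name,
       sync_state <> 'sync'                               AS sync_broken,
       coalesce(extract(epoch FROM replay_lag), 0) > 300  AS lag_over_5m
FROM pg_stat_replication;
```

Note that the step-7 probe also catches the silent failure mode in this scenario: a standby that drops out of `pg_stat_replication` entirely returns no rows, which the monitoring system should treat as an alert in its own right.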