Warm Standby Database Out of Sync — Failover Would Cause Data Loss
The warm standby database at the DR site has fallen out of synchronous replication and has been running in asynchronous mode for 5 days without detection. The standby is now more than 2 hours behind the primary; if failover were triggered now, roughly 2 hours of committed transactions would be lost.
Pattern: DATABASE_EVENT
Severity: CRITICAL
Confidence: 95%
Remediation: Remote Hands
Test Results
| Metric | Expected | Actual | Result |
| --- | --- | --- | --- |
| Pattern Recognition | DATABASE_EVENT | DATABASE_EVENT | |
| Severity Assessment | CRITICAL | CRITICAL | |
| Incident Correlation | Yes | 17 linked | |
| Cascade Escalation | N/A | No | |
| Remediation | — | Remote Hands — Corax contacts on-site support via call, email, or API | |
Scenario Conditions
Synchronous replication configured between primary and standby. The standby silently switched to async 5 days ago after a network hiccup, and there is no monitoring on replication mode. The standby is 2 hours behind. pg_stat_replication shows sync_state 'async' instead of 'sync', even though synchronous_commit is still set to 'remote_apply'.
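The mismatch described above can be confirmed directly on the primary. A minimal sketch, assuming PostgreSQL 10 or later and a role with sufficient privileges (e.g. `pg_monitor`); the node name `pg-primary` comes from the alerts:

```sql
-- Run on the primary (pg-primary). A healthy synchronous setup shows
-- sync_state = 'sync' (or 'quorum'); 'async' means commits are not
-- waiting for the standby and the zero-data-loss guarantee is broken.
SHOW synchronous_commit;           -- expected: remote_apply
SHOW synchronous_standby_names;    -- must name the standby, or sync is off

SELECT application_name,
       state,
       sync_state,                 -- 'async' here is the failure signature
       replay_lag                  -- wall-clock estimate of standby delay
FROM pg_stat_replication;
```

Because `sync_state` is only exposed at runtime, a configuration that looks correct in `postgresql.conf` can still be running async, which is exactly how this scenario went unnoticed for 5 days.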
Injected Error Messages (2)
database replication mode mismatch — pg-primary synchronous_commit is set to 'remote_apply' but pg_stat_replication shows sync_state='async' for standby connection, replication fell out of synchronous mode 5 days ago after network interruption and never recovered, primary continued accepting writes without waiting for standby confirmation, max connections normal, replication lag now 2 hours 14 minutes, zero-data-loss guarantee broken
warm standby database behind primary — pg-standby replay_lsn is 2 hours 14 minutes behind primary's write_lsn, standby receiving WAL segments but applying them with increasing delay, standby was supposed to be in synchronous mode (zero data loss) but is actually running asynchronous, if failover triggered now: 2 hours of committed transactions would be permanently lost, replication lag trend: stable but gap persists
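The replay gap quoted in the second alert can also be measured from the standby itself. A minimal sketch, assuming PostgreSQL 10+ and run on the standby (`pg-standby` in the alerts):

```sql
-- Run on the standby. Compares WAL received vs WAL replayed, and
-- reports how stale the most recently replayed transaction is.
SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                       pg_last_wal_replay_lsn()) AS replay_gap_bytes,
       now() - pg_last_xact_replay_timestamp()   AS replay_delay;
```

A large `replay_delay` with a small `replay_gap_bytes` matches the alert's description: WAL segments are arriving but being applied with increasing delay.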
Neural Engine Root Cause Analysis
The PostgreSQL primary database has a critical replication configuration mismatch where synchronous_commit is set to 'remote_apply' but the standby is operating in asynchronous mode. This occurred 5 days ago after a network interruption caused the replication to fall out of sync, and the primary continued accepting writes without waiting for standby confirmation, breaking the zero-data-loss guarantee. The 2+ hour replication lag indicates the standby is significantly behind, creating a dangerous data consistency situation.
Remediation Plan
1. Immediately assess data criticality and consider putting the application in read-only mode to limit further data-loss exposure.
2. Check network connectivity between the primary and standby servers.
3. Verify standby server health and replication slot status.
4. Attempt to re-establish synchronous replication by restarting replication on the standby or recreating the replication slot.
5. If replication cannot be quickly restored, consider temporarily setting synchronous_commit to 'local' to preserve primary availability while replication is repaired.
6. Once the standby has caught up, restore synchronous_commit to 'remote_apply'.
7. Implement monitoring to detect future replication-mode failures immediately.
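Steps 5–7 above might look like the following on the primary. This is a hedged sketch, not a drop-in runbook: the 5-minute lag threshold is an illustrative assumption, and `synchronous_commit` changes made with ALTER SYSTEM take effect on reload because the parameter is reloadable:

```sql
-- Step 5: temporarily stop waiting on the broken standby so the
-- primary stays available while replication is repaired.
ALTER SYSTEM SET synchronous_commit = 'local';
SELECT pg_reload_conf();

-- Step 6: once the standby has caught up, restore the guarantee.
ALTER SYSTEM SET synchronous_commit = 'remote_apply';
SELECT pg_reload_conf();

-- Step 7: a monitoring probe — alert if any standby is not in
-- synchronous mode or its replay lag exceeds 5 minutes.
SELECT application_name,
       sync_state <> 'sync'                               AS sync_broken,
       coalesce(extract(epoch FROM replay_lag), 0) > 300  AS lag_over_5m
FROM pg_stat_replication;
```

Note that the step-7 probe also catches the silent failure mode in this scenario: a standby that drops out of `pg_stat_replication` entirely returns no rows, which the monitoring system should treat as an alert in its own right.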