PASSED: infrastructure / cross_region_replication_lag_critical

Cross-Region Replication Lag Critical — Data Consistency at Risk

The cross-region database replication between the primary (us-east-1) and disaster recovery (eu-west-1) regions has fallen behind by 4 hours. A sustained write increase combined with network throttling between regions is causing the replica to fall further behind, with the gap widening every hour.

Pattern: DATABASE_EVENT
Severity: CRITICAL
Confidence: 85%
Remediation: Remote Hands

Test Results

Metric | Expected | Actual | Result
Pattern Recognition | DATABASE_EVENT | DATABASE_EVENT |
Severity Assessment | CRITICAL | CRITICAL |
Incident Correlation | Yes | 17 linked |
Cascade Escalation | N/A | No |
Remediation | Remote Hands (Corax contacts on-site support via call, email, or API) | |

Scenario Conditions

Primary database in us-east-1. DR replica in eu-west-1. Replication lag: 4 hours and growing. Write throughput: 5000 TPS (2x normal). Inter-region bandwidth throttled by ISP. RPO at risk.
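
For reference, the RPO exposure implied by these conditions can be worked out directly. The sketch below is a minimal Python illustration using the figures from this scenario (the 1-hour RPO SLA is the one quoted in the injected messages); the helper function and variable names are illustrative, not part of any monitoring tooling.

    # Minimal sketch: quantify the RPO exposure implied by the conditions above.
    # The lag and SLA figures come from this scenario; the helper is illustrative.

    RPO_SLA_HOURS = 1.0          # RPO SLA quoted in the injected messages
    replication_lag_hours = 4.0  # current primary -> DR replica lag

    def rpo_exposure(lag_hours: float, rpo_hours: float) -> float:
        """Hours of committed data that would be lost beyond the RPO
        if the primary failed right now (0.0 means within SLA)."""
        return max(0.0, lag_hours - rpo_hours)

    if __name__ == "__main__":
        breach = rpo_exposure(replication_lag_hours, RPO_SLA_HOURS)
        print(f"RPO breached by {breach:.1f} hours")  # -> 3.0, matching the alert text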

Injected Error Messages (2)

database replication lag CRITICAL — primary pg-prod-us-east-1 to replica pg-dr-eu-west-1 replication lag has reached 4 hours 23 minutes and growing, WAL shipping rate: 12GB/hour, WAL generation rate: 28GB/hour (deficit: 16GB/hour), inter-region network throughput degraded from 500Mbps to 180Mbps due to ISP congestion, replication lag increasing by approximately 40 minutes per hour at current rates, max connections on primary unaffected
DR replica falling further behind — pg-dr-eu-west-1 replay position is 4 hours behind primary, 67GB of WAL segments queued for replay, replica applying WAL at 12GB/hour but receiving only intermittent batches due to network throttling, if primary fails now DR would lose 4+ hours of data, RPO SLA (1 hour) breached by 3+ hours, replication lag trend: widening, estimated time to catch up at current rates: never (deficit growing)
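
The arithmetic behind these messages is straightforward: WAL is generated faster than it can be shipped, so the backlog can never drain while the deficit stays positive. The following is a minimal Python sketch of that calculation using only the rates quoted above; the function and variable names are illustrative.

    # Minimal sketch of the WAL deficit arithmetic in the messages above.
    # All rates are the figures quoted in the alerts; names are illustrative.
    import math

    wal_generation_gb_per_hour = 28.0  # WAL produced on the primary
    wal_shipping_gb_per_hour = 12.0    # WAL reaching the DR replica over the throttled link
    queued_wal_gb = 67.0               # WAL segments waiting for replay on pg-dr-eu-west-1

    deficit_gb_per_hour = wal_generation_gb_per_hour - wal_shipping_gb_per_hour  # 16 GB/hour

    def hours_to_catch_up(queued_gb: float, deficit: float) -> float:
        """The backlog only shrinks when shipping outpaces generation (negative deficit)."""
        if deficit >= 0:
            return math.inf  # backlog keeps growing; the replica never catches up
        return queued_gb / -deficit

    eta = hours_to_catch_up(queued_wal_gb, deficit_gb_per_hour)
    print(f"WAL deficit: {deficit_gb_per_hour:.0f} GB/hour")
    print("Estimated catch-up time:",
          "never (deficit growing)" if math.isinf(eta) else f"{eta:.1f} hours")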

Neural Engine Root Cause Analysis

The root cause is network bandwidth degradation between the us-east-1 and eu-west-1 regions due to ISP congestion, which has reduced throughput from 500Mbps to 180Mbps. This bottleneck prevents WAL shipping from keeping pace with WAL generation (12GB/hour versus 28GB/hour), creating a 16GB/hour deficit that manifests as steadily widening replication lag, currently 4 hours 23 minutes and growing by roughly 40 minutes per hour. The 9 correlated incidents suggest this network issue is affecting multiple services simultaneously.
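
A quick way to confirm this conclusion on the DR replica is to compare how much WAL has been received but not yet replayed: a small gap means the replica replays everything it receives, so the limiting factor is network transfer rather than replay speed. The following is a minimal sketch using standard PostgreSQL functions (pg_wal_lsn_diff, pg_last_wal_receive_lsn, pg_last_wal_replay_lsn, pg_last_xact_replay_timestamp) via psycopg2; the connection string is a placeholder, not a credential from this environment.

    # Minimal diagnostic sketch: on the DR replica, measure the gap between WAL
    # received and WAL replayed. A small gap means the replica keeps up with what
    # it receives, so the limiting factor is network transfer, not replay speed.
    import psycopg2  # assumes the psycopg2 driver is installed

    REPLICA_DSN = "host=pg-dr-eu-west-1 dbname=postgres user=monitor"  # placeholder DSN

    QUERY = """
    SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS receive_replay_gap_bytes,
           now() - pg_last_xact_replay_timestamp()                              AS replay_lag
    """

    def check_replica(dsn: str) -> None:
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(QUERY)
            gap_bytes, replay_lag = cur.fetchone()
            # gap_bytes is None if streaming replication is not currently attached
            print(f"received-but-not-replayed: {gap_bytes} bytes, replay lag: {replay_lag}")

    if __name__ == "__main__":
        check_replica(REPLICA_DSN)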

Remediation Plan

1. Immediately implement WAL compression to reduce shipping volume.
2. Enable parallel WAL streaming if not already active.
3. Contact the ISP to escalate the network congestion issue.
4. Consider temporarily routing replication traffic through alternate network paths or VPN tunnels.
5. Evaluate promoting the eu-west-1 replica to primary if the lag becomes unacceptable for business continuity.
6. Monitor other affected services from the correlated incidents for additional network-related issues (see the monitoring sketch below).
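
To support steps 5 and 6, lag can be watched from the primary's pg_stat_replication view and flagged whenever it exceeds the RPO SLA. The sketch below is a minimal illustration under stated assumptions: the DSN, the polling interval, and the replica's application_name are placeholders, and alerting is just printed text rather than an integration with Corax.

    # Minimal sketch: poll pg_stat_replication on the primary and flag when the
    # DR replica's replay lag exceeds the RPO SLA, as input to the promotion
    # decision in step 5. DSN, interval, and application_name are assumptions.
    import time
    import psycopg2  # assumes the psycopg2 driver is installed

    PRIMARY_DSN = "host=pg-prod-us-east-1 dbname=postgres user=monitor"  # placeholder DSN
    REPLICA_APP_NAME = "pg-dr-eu-west-1"  # assumed application_name of the DR replica
    RPO_SLA_SECONDS = 3600                # 1-hour RPO from this scenario
    POLL_INTERVAL_SECONDS = 300

    def replay_lag_seconds(dsn: str):
        """Return the replica's replay lag in seconds, or None if it is not reporting."""
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(
                "SELECT extract(epoch FROM replay_lag) FROM pg_stat_replication "
                "WHERE application_name = %s",
                (REPLICA_APP_NAME,),
            )
            row = cur.fetchone()
            return float(row[0]) if row and row[0] is not None else None

    if __name__ == "__main__":
        previous = None
        while True:
            lag = replay_lag_seconds(PRIMARY_DSN)
            if lag is None:
                print("ALERT: DR replica not reporting -- replication may have stopped")
            elif lag > RPO_SLA_SECONDS:
                trend = "widening" if previous is not None and lag > previous else "not widening"
                print(f"RPO SLA breached: replay lag {lag / 3600:.1f}h ({trend}) "
                      "-- feed into the promotion decision in step 5")
            previous = lag
            time.sleep(POLL_INTERVAL_SECONDS)
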
Tested: 2026-03-30 | Monitors: 2 | Incidents: 2 | Test ID: cmnckft7y096mobqe5o67mkxn