A PostgreSQL streaming replication replica falls 2GB behind the primary: a long-running analytical query on the replica blocks WAL replay while the replication slot forces the primary to retain WAL. Applications reading from the replica see stale data, and if the primary fails before the replica catches up, up to 2GB of WAL (roughly 8 minutes of writes) could be lost.
Pattern: DATABASE_EVENT
Severity: CRITICAL
Confidence: 85%
Remediation: Remote Hands
Test Results

| Metric | Expected | Actual | Result |
| --- | --- | --- | --- |
| Pattern Recognition | DATABASE_EVENT | DATABASE_EVENT | |
| Severity Assessment | CRITICAL | CRITICAL | |
| Incident Correlation | Yes | 18 linked | |
| Cascade Escalation | N/A | No | |
| Remediation | — | Remote Hands (Corax contacts on-site support via call, email, or API) | |
Scenario Conditions
PostgreSQL 16 with streaming replication. 1 primary, 1 replica. Replication slot 'replica_slot' active. Long-running query on replica blocking WAL apply. WAL lag: 2GB (8 minutes). Application reads balanced across primary and replica.
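The 2GB lag figure above is the byte distance between two WAL positions (pg_lsn values such as those reported by pg_stat_replication). A minimal sketch of that arithmetic; the function names are ours, not part of PostgreSQL or any tool:

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a pg_lsn string 'XX/YYYYYYYY' (two hex halves of a
    64-bit WAL position) into an integer byte offset."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def replay_lag_bytes(current_lsn: str, replay_lsn: str) -> int:
    """Byte distance between the primary's current WAL position and
    the replica's last replayed position."""
    return lsn_to_int(current_lsn) - lsn_to_int(replay_lsn)

# Illustrative LSNs chosen so the gap is exactly 2 GiB, as in this scenario:
print(replay_lag_bytes("1/0", "0/80000000"))  # 2147483648
```

In production the two positions would come from `pg_current_wal_lsn()` on the primary and `pg_last_wal_replay_lsn()` on the replica (or the `replay_lsn` column of `pg_stat_replication`).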
Injected Error Messages (2)
PostgreSQL replication lag critical — pg_stat_replication: replay_lag 2,147,483,648 bytes (2GB), write_lag: 00:08:12, slot 'replica_slot' retention growing, WAL accumulation on primary /pg_wal at 67% capacity, risk of pg_wal disk full if lag continues
PostgreSQL replica serving stale data — recovery replay paused by long-running query (PID 12847, running for 45 minutes), pg_last_xact_replay_timestamp 8 minutes behind primary, read-only application queries returning outdated results, RPO at risk
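The write_lag value in the first message is an HH:MM:SS interval. A small sketch of converting it to seconds for threshold checks; the 00:08:12 figure comes from the message above, while the 5-minute threshold is an assumed example, not a documented Corax setting:

```python
def interval_to_seconds(interval: str) -> int:
    """Parse an 'HH:MM:SS' lag interval (as shown in pg_stat_replication's
    write_lag/replay_lag columns) into whole seconds."""
    h, m, s = (int(part) for part in interval.split(":"))
    return h * 3600 + m * 60 + s

# Assumed alerting threshold for illustration: flag lag over 5 minutes.
LAG_CRITICAL_SECONDS = 300

lag = interval_to_seconds("00:08:12")
print(lag, lag > LAG_CRITICAL_SECONDS)  # 492 True
```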
Neural Engine Root Cause Analysis
The PostgreSQL replica is experiencing severe replication lag (2GB) with 8+ minute write lag, indicating the replica cannot keep up with the primary's write volume. The WAL directory is at 67% capacity and growing, suggesting the replica is either down, severely resource-constrained, or experiencing network connectivity issues. This creates a cascading failure where WAL files accumulate on the primary, risking disk exhaustion and potential primary database failure.
Remediation Plan
1. Immediately check replica server status and connectivity to the primary.
2. Verify the replica PostgreSQL service is running and accepting connections.
3. Check replica server resources (CPU, memory, disk I/O) for bottlenecks.
4. Examine replica PostgreSQL logs for errors or blocking operations.
5. If the replica is healthy, investigate network connectivity between primary and replica.
6. Consider temporarily increasing WAL retention settings to prevent primary failure.
7. If the replica cannot be quickly restored, consider promoting a standby or implementing emergency scaling.