server / postgres_replication_lag (PASSED)

PostgreSQL Replication Lag Critical

The PostgreSQL streaming replication replica falls 2 GB behind the primary: a long-running analytical query on the replica pauses WAL replay, while the replication slot keeps the unreplayed WAL pinned on the primary. Applications reading from the replica see stale data, and if the primary fails before the replica catches up, up to 2 GB of committed writes would be lost.
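The "2 GB behind" figure is a byte distance between two WAL positions (pg_lsn values, written as two hex halves). A minimal sketch of that arithmetic, mirroring what PostgreSQL's pg_wal_lsn_diff() computes server-side (the LSN values below are illustrative, not from this scenario):

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a pg_lsn string ('XLOGID/OFFSET', both hex) to an absolute byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def replay_lag_bytes(primary_lsn: str, replica_replay_lsn: str) -> int:
    """Byte distance between the primary's current WAL position and the replica's
    replay position -- the same quantity pg_wal_lsn_diff() reports on the server."""
    return lsn_to_bytes(primary_lsn) - lsn_to_bytes(replica_replay_lsn)

# A 2 GiB gap, as in this scenario: 2**31 = 2,147,483,648 bytes.
print(replay_lag_bytes("2/80000000", "2/00000000"))  # 2147483648
```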

Pattern: DATABASE_EVENT
Severity: CRITICAL
Confidence: 85%
Remediation: Remote Hands

Test Results

Metric | Expected | Actual
Pattern Recognition | DATABASE_EVENT | DATABASE_EVENT
Severity Assessment | CRITICAL | CRITICAL
Incident Correlation | Yes | 18 linked
Cascade Escalation | N/A | No
Remediation | Remote Hands | Corax contacts on-site support via call, email, or API

Scenario Conditions

PostgreSQL 16 with streaming replication. 1 primary, 1 replica. Replication slot 'replica_slot' active. Long-running query on replica blocking WAL apply. WAL lag: 2GB (8 minutes). Application reads balanced across primary and replica.
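Under these conditions, the usual first checks are pg_stat_replication on the primary and pg_stat_activity on the replica, plus a staleness test against pg_last_xact_replay_timestamp. A sketch with the SQL held as strings and the staleness check done locally (the 5-minute threshold and the timestamps are assumptions, not values from the scenario):

```python
from datetime import datetime, timedelta, timezone

# Run on the primary: per-replica lag, including the byte gap to the replay position.
LAG_QUERY = """
SELECT application_name, state, write_lag, replay_lag,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
"""

# Run on the replica: candidate long-running queries that can pause WAL replay.
BLOCKER_QUERY = """
SELECT pid, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state = 'active' AND backend_type = 'client backend'
ORDER BY query_start
LIMIT 5;
"""

def replica_is_stale(last_replay: datetime, now: datetime,
                     max_staleness: timedelta = timedelta(minutes=5)) -> bool:
    """Flag the replica as stale when pg_last_xact_replay_timestamp trails
    'now' by more than the allowed staleness window."""
    return (now - last_replay) > max_staleness

now = datetime(2026, 3, 30, 12, 8, 12, tzinfo=timezone.utc)
print(replica_is_stale(now - timedelta(minutes=8, seconds=12), now))  # True
```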

Injected Error Messages (2)

PostgreSQL replication lag critical — pg_stat_replication: replay_lag 2,147,483,648 bytes (2GB), write_lag: 00:08:12, slot 'replica_slot' retention growing, WAL accumulation on primary /pg_wal at 67% capacity, risk of pg_wal disk full if lag continues
PostgreSQL replica serving stale data — recovery replay paused by long-running query (PID 12847, running for 45 minutes), pg_last_xact_replay_timestamp 8 minutes behind primary, read-only application queries returning outdated results, RPO at risk

Neural Engine Root Cause Analysis

The PostgreSQL replica is experiencing severe replication lag (2GB) with 8+ minute write lag, indicating the replica cannot keep up with the primary's write volume. The WAL directory is at 67% capacity and growing, suggesting the replica is either down, severely resource-constrained, or experiencing network connectivity issues. This creates a cascading failure where WAL files accumulate on the primary, risking disk exhaustion and potential primary database failure.
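The disk-exhaustion risk is simple arithmetic over the pg_wal volume: remaining capacity divided by the WAL accumulation rate. A rough illustration (the 100 GB capacity and the growth rate are assumptions; only the 67% usage figure comes from the report):

```python
def hours_until_full(capacity_gb: float, used_fraction: float,
                     growth_gb_per_hour: float) -> float:
    """Rough time until the primary's pg_wal volume fills, given current usage
    and the rate at which retained WAL is accumulating."""
    free_gb = capacity_gb * (1.0 - used_fraction)
    return free_gb / growth_gb_per_hour

# A 100 GB pg_wal volume at 67% full, with lag growing ~2 GB per 8 minutes (15 GB/h):
print(round(hours_until_full(100, 0.67, 15.0), 1))  # 2.2
```

At that pace the primary has only a couple of hours of headroom, which is why the analysis treats this as a cascading-failure risk rather than a replica-only problem.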

Remediation Plan

1. Immediately check replica server status and connectivity to the primary.
2. Verify the replica PostgreSQL service is running and accepting connections.
3. Check replica server resources (CPU, memory, disk I/O) for bottlenecks.
4. Examine replica PostgreSQL logs for errors or blocking operations.
5. If the replica is healthy, investigate network connectivity between primary and replica.
6. Consider temporarily increasing WAL retention settings to prevent primary failure.
7. If the replica cannot be quickly restored, consider promoting a standby or implementing emergency scaling.
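The plan above is effectively a triage ladder. A hedged sketch of that decision logic (the 80% pg_wal threshold is an assumption; cancelling the blocker maps to PostgreSQL's pg_cancel_backend, and capping slot retention maps to max_slot_wal_keep_size, available since PostgreSQL 13):

```python
from typing import Optional

def triage(replica_reachable: bool, blocking_query_pid: Optional[int],
           pg_wal_fraction_used: float) -> str:
    """Pick the next remediation action, following the ordering of the plan:
    connectivity first, then the replay-blocking query, then primary WAL headroom."""
    if not replica_reachable:
        return "investigate replica host / network connectivity"
    if blocking_query_pid is not None:
        # A long-running query is pausing WAL replay; cancelling it lets
        # recovery resume (pg_cancel_backend, or tune hot-standby settings).
        return f"cancel blocking query pid={blocking_query_pid}"
    if pg_wal_fraction_used >= 0.80:
        # The primary's pg_wal is filling; cap slot retention so the primary
        # survives even if the replica stays behind (max_slot_wal_keep_size).
        return "raise WAL retention limits / cap slot with max_slot_wal_keep_size"
    return "monitor: replica catching up"

# The scenario's state: replica reachable, PID 12847 blocking replay, pg_wal at 67%.
print(triage(True, 12847, 0.67))  # cancel blocking query pid=12847
```

Note that capping slot retention trades durability for availability: once the slot's WAL is discarded, the replica must be rebuilt rather than resumed.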
Tested: 2026-03-30 | Monitors: 2 | Incidents: 2 | Test ID: cmncjn1cj02s9obqelje1khjn