PASSED: infrastructure / dr_site_failover_test_failure

DR Site Failover Test Failure — Recovery Environment Not Functional

A scheduled disaster recovery failover test reveals that the DR site cannot bring up critical services. The DR database has been silently failing replication for 2 weeks, the application servers have outdated configurations, and the DR network routing tables are stale.

Pattern: DATABASE_EVENT
Severity: CRITICAL
Confidence: 95%
Remediation: Remote Hands

Test Results

Metric | Expected | Actual
Pattern Recognition | DATABASE_EVENT | DATABASE_EVENT
Severity Assessment | CRITICAL | CRITICAL
Incident Correlation | Yes | 29 linked
Cascade Escalation | Yes | Yes
Remediation | Remote Hands | Corax contacts on-site support via call, email, or API

Scenario Conditions

Annual DR failover test. DR site in secondary data center. Database replication broken for 14 days. Application configs 3 versions behind. Network routes pointing to decommissioned subnets. RPO/RTO SLAs cannot be met.
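The configuration drift described above (DR application three versions behind production, required environment variables missing) is the kind of gap a preflight comparison can catch before a live failover test. A minimal sketch in Python; the variable names and version strings below are illustrative, not taken from the actual DR environment:

```python
# Hypothetical set of variables the DR config must define; real checks
# would pull this list from deployment metadata, not hard-code it.
REQUIRED_ENV_VARS = {"DR_DB_HOST", "DR_DB_PORT", "DR_CACHE_URL"}

def parse_version(v: str) -> tuple[int, ...]:
    """Turn a tag like 'v4.1.2' into (4, 1, 2) for ordered comparison."""
    return tuple(int(p) for p in v.lstrip("v").split("."))

def config_drift(prod_version: str, dr_version: str,
                 dr_env: dict[str, str]) -> list[str]:
    """Return human-readable drift findings; an empty list means in sync."""
    findings = []
    if parse_version(dr_version) < parse_version(prod_version):
        findings.append(f"DR app {dr_version} is behind production {prod_version}")
    missing = REQUIRED_ENV_VARS - dr_env.keys()
    if missing:
        findings.append(f"missing env vars: {sorted(missing)}")
    return findings

# Example mirroring the scenario: v4.1.2 at DR vs v4.4.0 in production.
print(config_drift("v4.4.0", "v4.1.2", {"DR_DB_HOST": "pg-dr-primary"}))
```

Run as part of a scheduled DR validation job, a non-empty result would have flagged this environment long before the annual test.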

Injected Error Messages (3)

DR database replication broken — pg-dr-primary has not received WAL segments from production for 14 days, replication lag: 14 days 6 hours, last successful replication: 2026-03-15T03:22:00Z, replication slot 'dr_replica' marked inactive, DR database missing 14 days of transactions, RPO violation: target 1 hour, actual 14 days, database recovery point is 2 weeks stale
DR application server startup failure — application version on DR site is v4.1.2, production is running v4.4.0, configuration files reference production database endpoints instead of DR endpoints, 3 required environment variables missing from DR configuration, application health check failing pointing to wrong host, DR environment has not been validated since initial setup
DR network routing failure — static routes on DR gateway reference decommissioned subnets (10.20.0.0/16 removed 6 months ago), no route to production DNS servers from DR site, BGP peering with ISP not established at DR location, name resolution failing for all internal domains, network path to monitoring systems broken, DR site effectively isolated
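The first alert above reduces to simple lag arithmetic: time elapsed since the last applied WAL segment versus the RPO target. A minimal sketch, using the timestamps from the alert and the one-hour target stated in the scenario conditions (the observation time is approximate):

```python
from datetime import datetime, timedelta, timezone

RPO_TARGET = timedelta(hours=1)  # SLA target from the scenario conditions

def rpo_status(last_replicated: datetime, now: datetime) -> tuple[timedelta, bool]:
    """Return (replication lag, True if the RPO target is violated)."""
    lag = now - last_replicated
    return lag, lag > RPO_TARGET

# Timestamps from the injected alert: last success 2026-03-15T03:22:00Z,
# observed 14 days 6 hours later.
last_ok = datetime(2026, 3, 15, 3, 22, tzinfo=timezone.utc)
now = datetime(2026, 3, 29, 9, 22, tzinfo=timezone.utc)
lag, violated = rpo_status(last_ok, now)
print(lag, violated)  # → 14 days, 6:00:00 True
```

In production, `last_replicated` would come from a replication-monitoring query rather than a hard-coded value; the point is that a violation by a factor of ~340 is trivially detectable with a daily check.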

Neural Engine Root Cause Analysis

The disaster recovery database replication has been completely broken for 14 days, indicating a catastrophic failure in the PostgreSQL WAL streaming replication mechanism. The replication slot 'dr_replica' is marked inactive, suggesting either the primary database stopped sending WAL segments, network connectivity between production and DR sites is broken, or the DR database service itself has failed. This represents a complete RPO violation (14 days vs 1 hour target) and poses extreme business continuity risk as there is no current disaster recovery capability.
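The three hypotheses enumerated above (primary stopped sending WAL, broken network path, failed DR database service) can be narrowed with three cheap probes before anyone touches replication config. A sketch of that triage logic; the probe mechanisms named in the comments (TCP connect, `pg_isready`, a catalog query) are assumptions about how the booleans would be gathered:

```python
def triage_replication_failure(
    dr_host_reachable: bool,       # e.g. TCP connect to pg-dr-primary:5432
    dr_service_up: bool,           # e.g. pg_isready result on the DR node
    slot_active_on_primary: bool,  # e.g. 'active' flag for the slot on production
) -> str:
    """Map probe results onto the failure modes from the root cause analysis."""
    if not dr_host_reachable:
        return "network path between production and DR is broken"
    if not dr_service_up:
        return "DR database service is down"
    if not slot_active_on_primary:
        return "primary is not streaming to the 'dr_replica' slot"
    return "probes healthy; inspect WAL archiving and PostgreSQL logs"
```

The ordering matters: an unreachable host makes the service and slot probes meaningless, so the network check comes first.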

Remediation Plan

1. Immediately verify network connectivity between the production and DR database servers.
2. Check whether the DR database service (pg-dr-primary) is running and accessible.
3. Examine the replication slot status on the production database and recreate the slot if necessary.
4. Review PostgreSQL logs on both the primary and DR systems for replication errors.
5. If replication cannot be restored quickly, take a full base backup and restore it to re-establish the replication baseline.
6. Once connectivity is restored, monitor replication lag until it recovers.
7. Conduct a post-incident review to determine how a 14-day replication failure went undetected.
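Step 6 of the plan above, monitoring lag recovery, can be sketched as a polling loop with a pluggable lag source. The `get_lag_seconds` callable is an assumption standing in for a real query against the primary's replication statistics; the fake iterator in the usage example simulates lag shrinking as the replica catches up:

```python
import time
from typing import Callable

def wait_for_replication_catchup(
    get_lag_seconds: Callable[[], float],  # in practice, a replication-stats query
    target_seconds: float = 3600.0,        # the 1-hour RPO target
    poll_interval: float = 0.0,            # raise for real use; 0 keeps the sketch fast
    max_polls: int = 1000,
) -> bool:
    """Poll until replication lag drops below the RPO target; True on success."""
    for _ in range(max_polls):
        if get_lag_seconds() < target_seconds:
            return True
        time.sleep(poll_interval)
    return False

# Usage with a fake lag source that shrinks on each poll:
lags = iter([86400.0, 7200.0, 1800.0])
print(wait_for_replication_catchup(lambda: next(lags)))  # → True
```

Bounding the loop with `max_polls` matters here: after a 14-day outage, catch-up can stall (e.g. missing WAL segments force a fresh base backup), and an unbounded wait would mask that.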
Tested: 2026-03-30 | Monitors: 3 | Incidents: 3 | Test ID: cmnckfd1j093jobqewt6tjs3n