PASSED: infrastructure / failover_dns_propagation_delay

Failover DNS Propagation Delay — Extended Outage

After activating disaster recovery, the DNS records are updated to point to the DR site, but propagation is taking much longer than expected due to high TTL values that were never reduced pre-failover. Many clients continue hitting the dead primary site for hours after the DNS change.
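The root of the delay is the record's TTL, which is the upper bound on how long any resolver may keep serving the old answer after the record changes upstream. A minimal pre-failover TTL check, assuming the dnspython package is available; "app.example.com" is a hypothetical record name standing in for the production domain:

```python
# Minimal TTL check (assumes dnspython; the record name is a placeholder).
import dns.resolver

answer = dns.resolver.resolve("app.example.com", "A")
# The TTL on the answer caps how long resolvers may cache the old record
# after a cutover, so it should be checked (and lowered) before failover.
print(f"record: {answer.rrset.name}  TTL: {answer.rrset.ttl} seconds")
```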

Pattern: DNS_FAILURE
Severity: CRITICAL
Confidence: 95%
Remediation: Remote Hands

Test Results

Metric               | Expected     | Actual
Pattern Recognition  | DNS_FAILURE  | DNS_FAILURE
Severity Assessment  | CRITICAL     | CRITICAL
Incident Correlation | Yes          | 18 linked
Cascade Escalation   | N/A          | No
Remediation          | Remote Hands | Corax contacts on-site support via call, email, or API

Scenario Conditions

Production outage triggers DR activation. DNS records updated to DR IPs. Original DNS TTL: 86400 seconds (24 hours). No TTL pre-staging performed. Global DNS propagation taking 12+ hours. Many clients still resolving to dead primary.
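The 12+ hour figure follows directly from the TTL arithmetic. A worked version of the worst-case bound, using only the scenario's own numbers:

```python
# Worst-case staleness for this scenario: a resolver that cached the old
# record just before the cutover keeps serving it for the full TTL.
TTL_SECONDS = 86_400          # original TTL (24 hours)
ELAPSED_SECONDS = 3 * 3_600   # time since the DNS record change
remaining_hours = (TTL_SECONDS - ELAPSED_SECONDS) / 3_600
print(f"stale answers possible for up to {remaining_hours:.0f} more hours")  # 21
```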

Injected Error Messages (2)

dns resolution failed for 60% of global clients — DNS records updated 3 hours ago to point to DR site (192.168.200.10) but resolvers worldwide still returning old production IP (10.10.1.10), original TTL was 86400 seconds (24 hours), cached entries will persist for up to 21 more hours, only 40% of DNS resolvers have picked up the new records, enotfound errors for clients whose resolvers have purged the old entry but not yet received the new one due to propagation delays
DR site operational but unreachable by majority of users — DR site at 192.168.200.10 is fully functional and passing all health checks, but only 40% of global traffic reaching it, 60% of users still being directed to dead primary IP by cached DNS records, customer support receiving hundreds of reports, dns resolution failed for production domain from most geographic regions, estimated propagation completion: 12-21 hours from now

Neural Engine Root Cause Analysis

The incident is caused by DNS propagation delays following a DR cutover 3 hours ago. The original DNS records had a 24-hour TTL (86400 seconds), causing global DNS resolvers to cache the old production IP (10.10.1.10) for up to 24 hours. Only 40% of resolvers have updated to the new DR site IP (192.168.200.10), while 60% still serve stale records, creating inconsistent resolution and ENOTFOUND errors during the transition period. This is expected behavior given the high TTL value, not a system failure.
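The 40/60 split is directly measurable. A rough propagation probe, assuming dnspython; the domain and DR IP are the scenario's placeholders, and a real check would sample many more resolvers per region:

```python
# Rough propagation probe: what fraction of sampled public resolvers
# already return the DR IP? Assumes dnspython; names/IPs are placeholders.
import dns.exception
import dns.resolver

RESOLVERS = {"Google": "8.8.8.8", "Cloudflare": "1.1.1.1", "Quad9": "9.9.9.9"}
DOMAIN = "app.example.com"   # hypothetical production domain
DR_IP = "192.168.200.10"     # DR site IP from the scenario

def a_records(resolver_ip):
    r = dns.resolver.Resolver(configure=False)
    r.nameservers = [resolver_ip]
    try:
        return {rr.to_text() for rr in r.resolve(DOMAIN, "A")}
    except dns.exception.DNSException:
        return set()  # treat timeouts/NXDOMAIN as "not yet propagated"

updated = [name for name, ip in RESOLVERS.items() if DR_IP in a_records(ip)]
print(f"{len(updated)}/{len(RESOLVERS)} sampled resolvers serve the DR record: {updated}")
```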

Remediation Plan

1. Implement DNS flush acceleration by contacting major public DNS providers (Google 8.8.8.8, Cloudflare 1.1.1.1, OpenDNS) to manually purge cached records.
2. Configure temporary DNS health checks to monitor propagation progress across global resolvers.
3. For future DR scenarios, pre-configure DNS records with lower TTL values (300-900 seconds) before planned failovers (see the sketch after this list).
4. Consider implementing DNS failover automation with health checks to avoid manual record updates.
5. Monitor the remaining 21-hour propagation window and document the resolution timeline for post-incident review.
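Step 3 is the one that prevents a repeat. A sketch of TTL pre-staging, assuming the zone is hosted in AWS Route 53 and managed with boto3 (the scenario does not name a provider; the zone ID, record name, and IP are placeholders). Run it at least one full old-TTL period (24 hours here) before the planned cutover so every cache has refreshed onto the short TTL:

```python
# TTL pre-staging sketch (assumes AWS Route 53 via boto3; zone ID,
# record name, and IP below are hypothetical).
import boto3

route53 = boto3.client("route53")
route53.change_resource_record_sets(
    HostedZoneId="Z0000000EXAMPLE",           # hypothetical zone
    ChangeBatch={
        "Comment": "Pre-stage low TTL ahead of DR failover",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com.",   # hypothetical record
                "Type": "A",
                "TTL": 300,                   # within the 300-900 s range above
                "ResourceRecords": [{"Value": "10.10.1.10"}],  # keep primary IP for now
            },
        }],
    },
)
```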
Tested: 2026-03-30 | Monitors: 2 | Incidents: 2 | Test ID: cmnckfufz096nobqed3sy96zx