PASSED: infrastructure / dns_ttl_too_low_resolver_overload

DNS TTL Too Low — Resolver Overload and Lookup Latency

A well-intentioned change to reduce DNS TTL to 10 seconds for faster failover has overwhelmed the internal DNS resolvers. With 500 services making DNS lookups at 10-second intervals, the resolvers are processing 300,000 queries per minute and experiencing query failures that cascade into application-level failures.
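The 30x jump follows directly from the TTL change, since every cached name must be re-resolved once per TTL interval. A minimal sketch of the arithmetic, assuming each of the 500 services resolves roughly 100 internal names (the per-service name count is an assumption, not stated in the scenario):

```python
def dns_queries_per_minute(services: int, names_per_service: int, ttl_seconds: int) -> int:
    """Each cached name must be re-resolved once per TTL interval."""
    refreshes_per_minute = 60 / ttl_seconds
    return int(services * names_per_service * refreshes_per_minute)

# Assumed workload: 500 services, ~100 internal names each (hypothetical figure).
before = dns_queries_per_minute(500, 100, 300)  # 300s TTL → 10000
after = dns_queries_per_minute(500, 100, 10)    # 10s TTL  → 300000
print(before, after, after // before)           # 30x increase
```

Under these assumed figures the model reproduces both the normal rate (10,000/minute) and the incident rate (300,000/minute) quoted in the error messages.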

Pattern
HIGH_CPU
Severity
CRITICAL
Confidence
95%
Remediation
Remote Hands

Test Results

Metric | Expected | Actual | Result
Pattern Recognition | HIGH_CPU | HIGH_CPU |
Severity Assessment | CRITICAL | CRITICAL |
Incident Correlation | Yes | 16 linked |
Cascade Escalation | N/A | No |
Remediation | Remote Hands — Corax contacts on-site support via call, email, or API

Scenario Conditions

Internal DNS resolvers handling 500 services. DNS TTL reduced from 300s to 10s. Query volume: 300,000/minute (30x increase). Resolver CPU at 100%. DNS query latency: 5 seconds. Applications failing DNS lookups.

Injected Error Messages (2)

dns resolution failed — primary DNS resolver 10.50.7.10 overwhelmed, query rate: 300,000/minute (normal: 10,000/minute), 30x increase caused by TTL reduction from 300s to 10s across all zones, resolver CPU at 100%, query response latency: 5.2 seconds (normal: 2ms), recursive query queue depth: 47,000, 15% of queries returning SERVFAIL, cache effectiveness dropped from 95% to 8% due to 10-second TTL forcing constant re-resolution
dns resolution failed — secondary resolver also at capacity, query spillover from primary plus own client load, combined resolver infrastructure unable to handle 300K queries/minute, enotfound errors propagating to application layer, microservices unable to resolve internal service names, service mesh DNS-based discovery completely broken, all inter-service communication failing during DNS lookup phase, cascade of application errors across 500+ services

Neural Engine Root Cause Analysis

The primary DNS resolver is experiencing severe performance degradation due to a recent TTL reduction from 300 seconds to 10 seconds across all DNS zones. This configuration change has increased the query rate by 30x (from 10,000 to 300,000 queries/minute), overwhelming the resolver's capacity and causing CPU saturation at 100%. The dramatic TTL reduction has effectively disabled DNS caching (cache effectiveness dropped from 95% to 8%), forcing constant re-resolution of queries and creating a recursive query backlog of 47,000 requests. The resolver is fundamentally under-provisioned for this new query load pattern.
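The collapse in cache effectiveness can be illustrated with a simple steady-state model: a name queried at rate λ costs the resolver one upstream miss per TTL window, so the hit rate is roughly 1 − 1/(λ·TTL). This is a back-of-the-envelope sketch, not the resolver's actual cache accounting, and the per-name query rate below is an assumed value chosen to match the 95% baseline:

```python
def cache_hit_rate(queries_per_second_per_name: float, ttl_seconds: float) -> float:
    """One upstream miss per TTL window; every other query in the window is a hit."""
    queries_per_window = queries_per_second_per_name * ttl_seconds
    if queries_per_window <= 1:
        return 0.0  # TTL shorter than the inter-query gap: nearly every query misses
    return 1.0 - 1.0 / queries_per_window

# Assumed per-name query rate: one query every 15 seconds (hypothetical).
rate = 1 / 15
print(round(cache_hit_rate(rate, 300), 2))  # → 0.95
print(round(cache_hit_rate(rate, 10), 2))   # → 0.0
```

The crude model bottoms out at 0% rather than the observed 8%, but it captures the mechanism: a 10-second TTL defeats caching entirely for any name queried less than once per 10 seconds.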

Remediation Plan

1. Immediate: Revert TTL values from 10 seconds back to 300 seconds across all DNS zones to restore caching effectiveness and reduce query load.
2. Scale horizontally: Activate secondary DNS resolvers and implement load balancing to distribute the query load.
3. Optimize resolver configuration: Increase recursive query limits, adjust cache size, and tune performance parameters.
4. Monitor: Verify query rates return to normal levels (~10,000/minute) and response latency drops below 50ms.
5. Long-term: Implement proper capacity planning and gradual TTL changes with load testing before production deployment.
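The monitoring step of the plan can be expressed as a simple recovery check. A sketch, assuming query rate and latency are available as plain numbers; the thresholds come from the plan above, while the function name and the 15,000/minute headroom ceiling are hypothetical choices:

```python
def dns_recovery_ok(queries_per_minute: int, p99_latency_ms: float) -> bool:
    """Recovery criteria from the remediation plan: ~10,000 qpm and <50ms latency."""
    QPM_CEILING = 15_000        # modest headroom above the normal ~10,000/minute
    LATENCY_CEILING_MS = 50.0   # plan target: latency below 50ms
    return queries_per_minute <= QPM_CEILING and p99_latency_ms < LATENCY_CEILING_MS

print(dns_recovery_ok(300_000, 5200.0))  # incident state → False
print(dns_recovery_ok(9_800, 2.1))       # healthy baseline → True
```

In practice a check like this would run against resolver metrics after the TTL revert propagates, which can take up to one old TTL interval (300 seconds) as cached 10-second records expire.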
Tested: 2026-03-30 | Monitors: 2 | Incidents: 2 | Test ID: cmnckiury09vpobqehukc6xh9