PASSED | network / lb_health_check_cascade

Load Balancer Health Check Cascade Failure

An F5 BIG-IP load balancer's health check monitor becomes too aggressive after a config change (interval: 1s, timeout: 2s). A brief 3-second network blip causes all pool members to be marked DOWN simultaneously. The LB returns 503 to all clients.

Pattern: LOAD_BALANCER_EVENT
Severity: CRITICAL
Confidence: 85%
Remediation: Auto-Heal

Test Results

Metric | Expected | Actual | Result
Pattern Recognition | LOAD_BALANCER_EVENT | LOAD_BALANCER_EVENT |
Severity Assessment | CRITICAL | CRITICAL |
Incident Correlation | Yes | 29 linked |
Cascade Escalation | Yes | Yes |
Remediation | Auto-Heal: Corax resolves autonomously

Scenario Conditions

F5 BIG-IP LTM i5800 hosting the virtual server for the production web app, with a pool of 6 web servers. The health monitor's interval/timeout was changed from 30s/90s to 1s/2s. A 3-second network blip then occurs on the server VLAN while 10,000 connections are active.
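To see why the 1s/2s settings turn a 3-second blip into a full pool outage while 30s/90s would not, the conditions above can be approximated with a small simulation. This is a minimal sketch under a simplified model, not BIG-IP's actual monitor logic: it assumes probes go out every interval starting at t=0, that probes sent during the blip get no reply, and that a member is marked DOWN once the timeout elapses without a successful probe. The function names and the blip start time of 10 seconds are illustrative assumptions.

```python
def marked_down(interval_s, timeout_s, blip_start_s, blip_len_s, horizon_s=300):
    """Return True if one member is marked DOWN under the simplified model.

    Assumptions (not taken from BIG-IP internals): a probe is sent every
    interval_s seconds from t=0, probes sent during the blip fail, and the
    member goes DOWN once timeout_s seconds pass without a successful probe.
    """
    last_ok = 0  # time of the most recent successful probe
    t = 0
    while t <= horizon_s:
        in_blip = blip_start_s <= t < blip_start_s + blip_len_s
        if not in_blip:
            last_ok = t
        elif t - last_ok >= timeout_s:
            return True
        t += interval_s
    return False


def members_down(pool_size, interval_s, timeout_s, blip_start_s, blip_len_s):
    """All members sit on the same server VLAN, so each sees the same blip."""
    return sum(
        marked_down(interval_s, timeout_s, blip_start_s, blip_len_s)
        for _ in range(pool_size)
    )


if __name__ == "__main__":
    # Aggressive monitor from the scenario: 1s interval, 2s timeout.
    print(members_down(6, 1, 2, blip_start_s=10, blip_len_s=3))    # 6
    # Previous monitor settings: 30s interval, 90s timeout.
    print(members_down(6, 30, 90, blip_start_s=10, blip_len_s=3))  # 0
```

Because every member shares the VLAN, the failure is correlated: the monitor cannot distinguish six unhealthy servers from one short network event.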

Injected Error Messages (4)

F5 BIG-IP pool 'web-prod-pool' ALL members DOWN — health monitor /Common/http-aggressive marked 6/6 members offline after 3-second VLAN blip, monitor interval: 1s timeout: 2s, no available pool members, virtual server returning 503 to all clients
F5 pool member web-01 marked DOWN — health check failure, /Common/http-aggressive: no response within 2 seconds, server is healthy but LB removed from pool, 0/6 active members, cannot serve traffic
F5 pool member web-02 marked DOWN — health monitor timeout during network blip, TCP RST to health check probe, server is operational but marked unavailable by load balancer, pool status: offline
www.acmecorp.com returning HTTP 503 Service Unavailable — F5 load balancer has no available pool members, sorry page displayed, 10,000 active sessions dropped, customer-facing outage in progress
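
As an illustration of how messages like these could be screened for a pool-wide outage, the sketch below looks for the member counts and the "no available pool members" phrase that appear in the injected text. It is a hypothetical example and not a description of how Corax classifies events; the regular expressions, the function name, and the shortened excerpts are assumptions made for the demo.

```python
import re

# Patterns matching the count formats seen in the injected messages.
OFFLINE_RE = re.compile(r"(\d+)/(\d+) members offline")
ACTIVE_RE = re.compile(r"(\d+)/(\d+) active members")


def reports_total_outage(message):
    """Return True if a message states that the whole pool is unavailable."""
    offline = OFFLINE_RE.search(message)
    if offline and offline.group(1) == offline.group(2):
        return True  # e.g. "6/6 members offline"
    active = ACTIVE_RE.search(message)
    if active and int(active.group(1)) == 0:
        return True  # e.g. "0/6 active members"
    return "no available pool members" in message


# Shortened excerpts of the four injected messages, in order.
EXCERPTS = [
    "health monitor /Common/http-aggressive marked 6/6 members offline after 3-second VLAN blip",
    "server is healthy but LB removed from pool, 0/6 active members, cannot serve traffic",
    "server is operational but marked unavailable by load balancer, pool status: offline",
    "F5 load balancer has no available pool members, sorry page displayed",
]

if __name__ == "__main__":
    for text in EXCERPTS:
        print(reports_total_outage(text), "|", text)
```

Under these patterns, three of the four excerpts report a total outage; the web-02 message describes a single member and is only meaningful as part of the cascade when correlated with the others.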

Neural Engine Root Cause Analysis

A 3-second VLAN network disruption caused the aggressive health monitor (/Common/http-aggressive, 1-second interval, 2-second timeout) to mark all 6 backend web servers in 'web-prod-pool' offline. Because the blip outlasted the 2-second timeout, no member could answer a health check within the timeout window, so F5 BIG-IP marked every member down even though the servers were likely healthy. With no available pool members, the virtual server is returning 503 Service Unavailable to all clients.

Remediation Plan

1. Immediately force-enable pool members in 'web-prod-pool' to restore service
2. Verify backend server health by testing connectivity and application responsiveness
3. Review and potentially adjust health monitor settings to be less aggressive (increase timeout or add retry logic)
4. Monitor pool member status for stability
5. Investigate VLAN infrastructure for recurring network issues to prevent future occurrences
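
For step 3, one way to review monitor settings is to compare them against F5's commonly documented guideline that the timeout should be three times the interval plus one second, and against the longest network blip the pool is expected to ride out. The sketch below is a hedged illustration: `MonitorSettings`, `review`, and the `max_blip_s` parameter are hypothetical names, and the 5s/16s values are just an example that satisfies the guideline, not a recommendation from the report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitorSettings:
    interval_s: int
    timeout_s: int


def guideline_timeout(interval_s):
    """Guideline timeout: three times the interval plus one second."""
    return 3 * interval_s + 1


def review(settings, max_blip_s):
    """Return findings for settings that look too aggressive.

    max_blip_s is an assumed operational input: the longest network
    interruption the pool should survive without members going DOWN.
    """
    findings = []
    wanted = guideline_timeout(settings.interval_s)
    if settings.timeout_s < wanted:
        findings.append(
            f"timeout {settings.timeout_s}s is below the 3n+1 guideline ({wanted}s)"
        )
    if settings.timeout_s <= max_blip_s:
        findings.append(
            f"timeout {settings.timeout_s}s does not exceed a {max_blip_s}s blip"
        )
    return findings


if __name__ == "__main__":
    # Settings from the scenario: flagged on both checks.
    for finding in review(MonitorSettings(1, 2), max_blip_s=3):
        print(finding)
    # Example settings that satisfy both checks: prints an empty list.
    print(review(MonitorSettings(5, 16), max_blip_s=3))
```

The two checks are deliberately separate: a monitor can satisfy the 3n+1 ratio and still have an absolute timeout shorter than a realistic blip, which is the failure mode in this scenario.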
Tested: 2026-03-30 | Monitors: 4 | Incidents: 4 | Test ID: cmncjm6mj02lbobqeka67ffxd