PASSED

network / lb_health_check_cascade

Load Balancer Health Check Cascade Failure

An F5 BIG-IP load balancer's health check monitor becomes too aggressive after a config change (interval: 1s, timeout: 2s). A brief 3-second network blip causes all pool members to be marked DOWN simultaneously. The LB returns 503 to all clients.

Pattern

LOAD_BALANCER_EVENT

Severity

CRITICAL

Confidence

85%

Remediation

Auto-Heal

Test Results

Metric	Expected	Actual
Pattern Recognition	LOAD_BALANCER_EVENT	LOAD_BALANCER_EVENT
Severity Assessment	CRITICAL	CRITICAL
Incident Correlation	Yes	29 linked
Cascade Escalation	Yes	Yes
Remediation	—	Auto-Heal — Corax resolves autonomously

Scenario Conditions

F5 BIG-IP LTM i5800. Virtual server for production web app. Pool: 6 web servers. Health monitor changed from 30s/90s to 1s/2s interval/timeout. 3-second network blip on server VLAN. 10,000 active connections.

Injected Error Messages (4)

F5 BIG-IP pool 'web-prod-pool' ALL members DOWN — health monitor /Common/http-aggressive marked 6/6 members offline after 3-second VLAN blip, monitor interval: 1s timeout: 2s, no available pool members, virtual server returning 503 to all clients

F5 pool member web-01 marked DOWN — health check failure, /Common/http-aggressive: no response within 2 seconds, server is healthy but LB removed from pool, 0/6 active members, cannot serve traffic

F5 pool member web-02 marked DOWN — health monitor timeout during network blip, TCP RST to health check probe, server is operational but marked unavailable by load balancer, pool status: offline

www.acmecorp.com returning HTTP 503 Service Unavailable — F5 load balancer has no available pool members, sorry page displayed, 10,000 active sessions dropped, customer-facing outage in progress

Neural Engine Root Cause Analysis

A 3-second VLAN network disruption caused all 6 backend web servers in the 'web-prod-pool' to be marked offline by the aggressive health monitor (/Common/http-aggressive) which has a very tight 1-second interval and 2-second timeout. The brief network blip prevented health checks from reaching the backend servers, causing F5 BIG-IP to mark them as down despite the servers likely being healthy. With no available pool members, the virtual server is returning 503 Service Unavailable to all clients.
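The timing argument in the analysis can be checked with a few lines of arithmetic. The sketch below is a simplified model and not F5's actual monitor implementation: it assumes probes are sent every `interval_s` seconds, that every probe sent during the blip fails, and that a member is marked down once the gap since its last successful probe exceeds `timeout_s`. The function name `blip_outcome` is hypothetical.

```python
def blip_outcome(blip_s: float, interval_s: float, timeout_s: float) -> str:
    """Classify the effect of a network blip on one pool member.

    Simplified model (an assumption, not BIG-IP internals): the member is
    marked down when the gap between two successful probes exceeds timeout_s.
    """
    # The gap between successes can never be shorter than the blip itself.
    min_gap = blip_s
    # Worst case: the last success landed almost one interval before the
    # blip started and the first good probe goes out almost one interval
    # after it ended.
    max_gap = blip_s + 2 * interval_s

    if min_gap > timeout_s:
        return "down"  # marked down regardless of probe phase
    if max_gap <= timeout_s:
        return "up"    # survives regardless of probe phase
    return "depends on probe phase"


# The two monitor configurations from the scenario, with the 3-second blip.
print(blip_outcome(3, interval_s=1, timeout_s=2))    # aggressive 1s/2s monitor
print(blip_outcome(3, interval_s=30, timeout_s=90))  # previous 30s/90s monitor
```

Under this model the 1s/2s monitor marks every member down because the blip alone outlasts the timeout, while the previous 30s/90s setting would have absorbed the same blip.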

Remediation Plan

1. Immediately force-enable pool members in 'web-prod-pool' to restore service
2. Verify backend server health by testing connectivity and application responsiveness
3. Review and potentially adjust health monitor settings to be less aggressive (increase timeout or add retry logic)
4. Monitor pool member status for stability
5. Investigate VLAN infrastructure for recurring network issues to prevent future occurrences
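For the monitor review in step 3, one possible way to size the new values is sketched below. It uses the commonly cited F5 guideline that a monitor timeout should be three times the interval plus one second, and it assumes a worst case in which a member goes without a successful probe for the blip duration plus two intervals. Both the worst-case model and the helper names (`suggested_timeout`, `smallest_safe_interval`) are illustrative assumptions, not part of the test output.

```python
from typing import Optional


def suggested_timeout(interval_s: int) -> int:
    """Timeout from the 3 x interval + 1 second guideline."""
    return 3 * interval_s + 1


def smallest_safe_interval(blip_s: int, max_interval_s: int = 60) -> Optional[int]:
    """Smallest whole-second interval whose guideline timeout rides out a blip.

    Assumed worst case: the member sees no successful probe for
    blip_s + 2 * interval_s seconds.
    """
    for interval_s in range(1, max_interval_s + 1):
        worst_gap = blip_s + 2 * interval_s
        if worst_gap <= suggested_timeout(interval_s):
            return interval_s
    return None  # no interval in the searched range tolerates the blip


interval = smallest_safe_interval(blip_s=3)
if interval is not None:
    print(f"interval={interval}s timeout={suggested_timeout(interval)}s")
```

Any value chosen this way is a starting point only; the blip duration to design for should come from the VLAN investigation in step 5.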

Tested: 2026-03-30 | Monitors: 4 | Incidents: 4 | Test ID: cmncjm6mj02lbobqeka67ffxd