We test Corax against real-world infrastructure failures across every vendor, platform, and scenario. Browse the results below.
The NTP server goes offline and server clocks begin drifting. After 3 hours, the domain controller clock is 7 minutes ahead of workstations. Kerberos authentication fails because the maximum clock skew tolerance is 5 minutes. Users cannot log in, access file shares, or use any domain-authenticated service.
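The failure above comes down to a single comparison: Kerberos rejects tickets when the absolute difference between two clocks exceeds the configured tolerance (5 minutes by default). A minimal sketch of that check — the function name and timestamps are illustrative, not Kerberos internals:

```python
from datetime import datetime, timedelta

MAX_SKEW = timedelta(minutes=5)  # Kerberos' default clock-skew tolerance

def within_skew(client_time: datetime, server_time: datetime,
                max_skew: timedelta = MAX_SKEW) -> bool:
    """Return True if two clocks are close enough for Kerberos to accept."""
    return abs(client_time - server_time) <= max_skew

dc = datetime(2024, 1, 1, 12, 7)   # domain controller, 7 minutes ahead
ws = datetime(2024, 1, 1, 12, 0)   # workstation
ok = within_skew(dc, ws)           # 7 min > 5 min tolerance, so auth fails
```

Note that the skew is symmetric: it does not matter which side is ahead, only the magnitude of the drift.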
The server room CRAC (Computer Room Air Conditioning) unit fails at 2AM. Temperature rises from 72F to 95F in 45 minutes. Server thermal throttling begins. If temperature reaches 104F, automatic thermal shutdown will occur on all servers.
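Assuming the temperature keeps climbing at roughly the observed rate, a simple linear extrapolation estimates how long until the 104F shutdown threshold is hit — about 17–18 minutes in this scenario. A rough sketch (real thermal curves are nonlinear; this is a first-order estimate only):

```python
def minutes_to_threshold(t_start: float, t_now: float,
                         elapsed_min: float, threshold: float) -> float:
    """Linearly extrapolate room temperature toward a shutdown threshold."""
    rate = (t_now - t_start) / elapsed_min  # degrees F per minute
    if rate <= 0:
        return float("inf")  # not heating up: no deadline
    return (threshold - t_now) / rate

# 72F -> 95F in 45 minutes; automatic shutdown at 104F
eta = minutes_to_threshold(72, 95, 45, 104)
```

That ETA is the budget for either restoring cooling or starting an orderly shutdown.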
A prolonged power outage has drained the UPS batteries to 15%. The generator failed to start due to a fuel pump failure. With 10 minutes of battery remaining, a graceful shutdown of non-critical systems must begin immediately or risk unclean shutdowns and data corruption.
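The core decision here is ordering: with minutes of runtime left, non-critical systems must go down first so that critical ones get the remaining battery. A minimal sketch, assuming each host carries a criticality tier (the host names and tiers below are hypothetical):

```python
def shutdown_order(hosts: list[tuple[str, int]]) -> list[str]:
    """Order hosts for graceful shutdown: lower tier = less critical,
    shut down first; the most critical systems hold out longest."""
    return [name for name, tier in sorted(hosts, key=lambda h: h[1])]

hosts = [
    ("db-primary", 3),    # most critical: last to go
    ("file-server", 2),
    ("test-vm", 1),       # non-critical: shut down immediately
    ("build-agent", 1),
]
order = shutdown_order(hosts)
```

In practice this list would drive UPS shutdown agents (e.g. NUT or PowerChute-style tooling) rather than being computed by hand.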
The primary ISP circuit is experiencing intermittent packet loss (5-15%) due to a degraded fiber segment. Not a full outage — the circuit stays up but quality degrades. VoIP calls have choppy audio, video conferences freeze, and cloud app performance is poor. ISP ticket opened but ETA unknown.
Ransomware (LockBit 3.0 variant) is detected spreading laterally via SMB (port 445) from a compromised workstation. The malware is encrypting shared drives and attempting to reach backup servers. 3 file servers already affected. EDR alerts are firing but automated containment is not configured.
An attacker compromises the internal DNS server and injects fraudulent A records for banking and M365 login pages. Users are redirected to phishing pages that harvest credentials. The poisoned cache affects 3 client tenants on the MSP's shared DNS infrastructure.
A poorly optimized batch job triggers a deadlock storm in SQL Server. 150+ deadlocks in 10 minutes as the batch process and OLTP transactions compete for the same table locks. Application transactions are chosen as deadlock victims, causing user-facing errors.
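Deadlocks like this require a circular wait: the batch job locks table A then wants B while an OLTP transaction holds B and wants A. The standard mitigation is to acquire locks in one global order so a cycle can never form. A minimal sketch of the idea using Python threads as stand-in "transactions" (the ordering key and names are illustrative):

```python
import threading

def run_txn(locks: list, work) -> None:
    """Acquire all locks in a single global order (here: by id()) so two
    transactions touching the same rows can never wait on each other in
    a cycle, then run the transaction body."""
    ordered = sorted(locks, key=id)
    for lk in ordered:
        lk.acquire()
    try:
        work()
    finally:
        for lk in reversed(ordered):
            lk.release()

row_a, row_b = threading.Lock(), threading.Lock()
done = []
# The batch job and an OLTP transaction touch the same rows in opposite order,
# which would deadlock without the global ordering above.
t1 = threading.Thread(target=run_txn, args=([row_a, row_b], lambda: done.append("batch")))
t2 = threading.Thread(target=run_txn, args=([row_b, row_a], lambda: done.append("oltp")))
t1.start(); t2.start()
t1.join(); t2.join()
```

In SQL Server the same principle applies at the query level: have the batch job and the application touch shared tables in the same order, or reduce lock footprint with smaller batches.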
The application's database connection pool (PgBouncer) is exhausted after a slow query causes connections to pile up. New requests queue behind the pool, causing cascading timeouts. The application returns 503 errors while the database itself is healthy.
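The failure mode here is unbounded queueing: requests wait forever for a connection that never frees up. The usual fix is a checkout timeout so the application fails fast with a 503 instead of cascading. A minimal sketch of a bounded pool with that behavior (class and method names are illustrative, not the PgBouncer API):

```python
import threading

class ConnectionPool:
    """Bounded pool with a checkout timeout: fail fast with 503 rather
    than letting requests queue indefinitely behind a slow query."""
    def __init__(self, size: int, checkout_timeout: float):
        self._slots = threading.Semaphore(size)
        self._timeout = checkout_timeout

    def handle(self, query):
        # Semaphore models the fixed number of database connections.
        if not self._slots.acquire(timeout=self._timeout):
            return 503  # pool exhausted: surface the error immediately
        try:
            return query()
        finally:
            self._slots.release()

pool = ConnectionPool(size=1, checkout_timeout=0.05)
```

The timeout converts a silent pile-up into an explicit, monitorable error — which is also why the database itself looks healthy while the app returns 503s.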
The PostgreSQL streaming replication replica falls 2GB behind the primary due to a long-running analytical query holding a replication slot open. Applications reading from the replica see stale data. If the primary fails, 2GB of WAL data would be lost.
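The "2GB behind" figure is computed from PostgreSQL log sequence numbers (LSNs), which are 64-bit WAL byte positions written as two 32-bit hex halves (e.g. `16/B374D848`). A sketch of the arithmetic — the sample LSN values are made up for illustration:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '16/B374D848' into an absolute
    byte position in the WAL stream (high/low 32-bit halves in hex)."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Replication lag = primary write position minus replica replay position."""
    return lsn_to_bytes(primary_lsn) - lsn_to_bytes(replica_lsn)

# A replica exactly 2 GiB behind the primary:
lag = lag_bytes("17/00000000", "16/80000000")
```

In a live system these positions come from `pg_current_wal_lsn()` on the primary and `pg_last_wal_replay_lsn()` on the replica.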
A Java-based microservice container is being repeatedly OOMKilled because the JVM heap (-Xmx) is set to 512MB but the container memory limit is also 512MB, leaving no room for JVM metaspace, thread stacks, and native memory. The pod restarts every 3-5 minutes.
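The arithmetic behind the OOMKill: the cgroup limit must cover the heap *plus* everything the JVM allocates outside it. A back-of-the-envelope budget check — the overhead figures below are rough assumptions for illustration, not JVM defaults:

```python
def jvm_fits(limit_mb: int, heap_mb: int, metaspace_mb: int = 64,
             threads: int = 200, stack_kb: int = 1024,
             native_mb: int = 50) -> bool:
    """Crude JVM memory budget: heap + metaspace + thread stacks +
    native overhead must fit under the container's memory limit,
    or the kernel OOM killer fires. Overhead numbers are assumptions."""
    total = heap_mb + metaspace_mb + threads * stack_kb / 1024 + native_mb
    return total <= limit_mb

# -Xmx512m inside a 512 MiB limit: the heap alone consumes the whole
# budget, so the non-heap overhead guarantees an OOMKill.
broken = jvm_fits(limit_mb=512, heap_mb=512)
fixed = jvm_fits(limit_mb=1024, heap_mb=512)
```

The common fixes are raising the container limit well above `-Xmx`, or letting the JVM size itself from the cgroup via `-XX:MaxRAMPercentage`.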
A Kubernetes worker node enters NotReady state due to kubelet losing contact with the control plane after a network partition. 25 pods on the node are marked for eviction after the 5-minute toleration period. Pods reschedule to other nodes but some fail due to resource constraints.
The private Docker registry (Harbor) becomes unreachable due to a TLS certificate renewal failure. All Kubernetes pods that need to pull or repull images fail with ImagePullBackOff. Existing running containers are fine but no new deployments or restarts work.
During a certificate renewal, the wrong certificate is applied to the load balancer's SSL offload profile. The certificate is for a different domain (staging.acmecorp.com instead of www.acmecorp.com). Browsers show certificate name mismatch warnings. HPKP pins do not match.
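The browser warning comes from hostname verification: the name the client requested must match one of the names in the certificate, with at most a single left-most wildcard label. A simplified sketch of that match in the spirit of RFC 6125 (real validators also handle SANs, IP addresses, and IDNA; this is illustrative only):

```python
def hostname_matches(hostname: str, cert_names: list[str]) -> bool:
    """Simplified certificate-name match: exact match, or a single
    left-most wildcard label ('*.example.com' matches 'www.example.com'
    but not 'a.b.example.com')."""
    host_labels = hostname.lower().split(".")
    for name in cert_names:
        labels = name.lower().split(".")
        if len(labels) != len(host_labels):
            continue  # wildcard covers exactly one label, so lengths must match
        if labels[0] == "*" and labels[1:] == host_labels[1:]:
            return True
        if labels == host_labels:
            return True
    return False

# The scenario: the staging cert was applied to the production VIP.
mismatch = hostname_matches("www.acmecorp.com", ["staging.acmecorp.com"])
```

A cert for `*.acmecorp.com` would have covered both names — one reason wildcard or multi-SAN certs are common on shared load balancer VIPs.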
A WAF rule update on the F5 ASM introduces a false positive that matches a common HTTP header sent by the company's mobile app. All mobile API requests are blocked with 403 Forbidden. 60% of customer traffic comes from the mobile app.
An F5 BIG-IP load balancer's health check monitor becomes too aggressive after a config change (interval: 1s, timeout: 2s). A brief 3-second network blip causes all pool members to be marked DOWN simultaneously. The LB returns 503 to all clients.
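The failure condition reduces to one inequality: with probes failing throughout the blip, a member is marked DOWN once the time since its last successful probe exceeds the monitor timeout. A minimal sketch (simplified model; it ignores probe phase and retry behavior):

```python
def member_marked_down(blip_s: float, timeout_s: float) -> bool:
    """With every health probe during the blip failing, the member is
    declared DOWN as soon as the blip outlasts the monitor timeout."""
    return blip_s > timeout_s

# Aggressive monitor (timeout 2 s): a 3 s blip downs the entire pool.
aggressive = member_marked_down(3, timeout_s=2)
# A conservative timeout (e.g. 16 s) absorbs the same blip.
conservative = member_marked_down(3, timeout_s=16)
```

The rule of thumb is to size the timeout to tolerate expected transient blips (commonly several probe intervals plus one second) so that a momentary network hiccup cannot take the whole pool down at once.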
The wireless LAN controller (WLC) detects a rogue access point broadcasting a corporate SSID ('Corp-WiFi') in the parking lot. The rogue AP is performing an evil twin attack, capturing credentials from employees whose devices auto-connect. WIDS alerts trigger but containment is not automatic.
The Cisco 9800 Wireless LAN Controller crashes, orphaning 60 managed access points. APs enter standalone mode with limited functionality. New client authentications fail because RADIUS proxy is unavailable. Existing clients remain associated but cannot roam.
The master switch in a 3-member stack reboots unexpectedly due to a firmware bug. A new master election occurs, causing a 90-second control plane outage. During the election, no configuration changes can be made, and STP reconverges, causing brief traffic interruption.
A Cisco 9300 4-member switch stack experiences a stack cable failure, splitting the stack into two independent 2-member stacks. Both halves claim the same management IP. MAC address tables conflict. Half the access ports become unreachable from management.
A sudden work-from-home mandate floods the SSL VPN concentrator with 500+ simultaneous connections, but the device supports only 250 concurrent sessions. Users see 'maximum sessions reached' errors. Split tunneling is not configured, so all remote traffic routes through the VPN, saturating the office internet link.
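The session ceiling makes the outcome deterministic: once the concentrator is at capacity, every further attempt is rejected. A minimal sketch of that admission logic (the function is illustrative, not a vendor API):

```python
def admit_vpn_sessions(active: int, capacity: int, attempts: int):
    """Admission against a hard session ceiling: returns how many new
    connections succeed and how many see 'maximum sessions reached'."""
    free = max(capacity - active, 0)
    accepted = min(attempts, free)
    return accepted, attempts - accepted

# 500 users hit a 250-session device: half are turned away.
accepted, rejected = admit_vpn_sessions(active=0, capacity=250, attempts=500)
```

The mitigation options follow directly: raise licensed capacity, add a second concentrator, or enable split tunneling so each session consumes less bandwidth even if the session count stays capped.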
Every scenario is tested against Corax's Neural Engine in a production environment with AI-powered root cause analysis.
Tests run continuously as new infrastructure patterns are added.