We test Corax against real-world infrastructure failures across every vendor, platform, and scenario. Browse the results below.
The application's database connection pool (PgBouncer) is exhausted after a slow query causes connections to pile up. New requests queue behind the pool, causing cascading timeouts. The application returns 503 errors while the database itself is healthy.
The PostgreSQL streaming replication replica falls 2GB behind the primary due to a long-running analytical query holding a replication slot open. Applications reading from the replica see stale data. If the primary fails, 2GB of WAL data would be lost.
A Java-based microservice container is being repeatedly OOMKilled because the JVM heap (-Xmx) is set to 512MB but the container memory limit is also 512MB, leaving no room for JVM metaspace, thread stacks, and native memory. The pod restarts every 3-5 minutes.
A Kubernetes worker node enters NotReady state due to kubelet losing contact with the control plane after a network partition. 25 pods on the node are marked for eviction after the 5-minute toleration period. Pods reschedule to other nodes but some fail due to resource constraints.
The private Docker registry (Harbor) becomes unreachable due to a TLS certificate renewal failure. All Kubernetes pods that need to pull or repull images fail with ImagePullBackOff. Existing running containers are fine but no new deployments or restarts work.
The Windows Print Spooler service crashes on the central print server after processing a corrupted print job from an updated driver. All 45 network printers become inaccessible. 400 users across 3 floors cannot print.
The FreePBX server crashes due to an Asterisk segfault after a problematic module update. All 150 SIP phones lose registration. IVR, voicemail, call queues, and ring groups all go offline. No redundant PBX in place.
The SIP trunk between the on-premises PBX and the ITSP loses registration after the provider changes their SBC IP without notice. All outbound and inbound PSTN calls fail. Internal extension-to-extension calls still work.
The offsite backup copy job to Wasabi S3 stalls at 23% due to ISP bandwidth throttling during business hours. The 8TB backup set cannot complete within the backup window. Offsite RPO violated for disaster recovery compliance.
A Veeam incremental backup fails because the previous restore point was corrupted by a storage error. The backup chain is broken, requiring a new active full backup. With 95 VMs, the full backup will take 18 hours and consume 4TB of additional space.
The Veeam backup repository runs out of disk space during the nightly backup window. All backup jobs fail with 'insufficient disk space' errors. No backups have completed for 3 nights. RPO violated for all protected VMs.
An Exchange Database Availability Group (DAG) member server experiences a storage failure causing the active database copy to dismount. The passive copy on the second server activates but is 5 minutes behind due to log shipping lag.
The Exchange 2019 Hub Transport service freezes after a malformed email triggers an anti-malware scanning loop. The mail queue grows to 15,000 messages. Internal and external email delivery stops completely. Users unaware until critical business emails bounce.
Group Policy processing fails across the domain after a SYSVOL replication issue leaves the DFS-R replicated SYSVOL share inconsistent between DCs. Workstations receive incomplete or conflicting policies. Security baselines not enforcing.
The domain controller holding all 5 FSMO roles (PDC Emulator, RID Master, Infrastructure Master, Schema Master, Domain Naming Master) goes down due to a motherboard failure. Password changes, account creation, and domain joins all fail.
Active Directory replication between the two domain controllers fails due to a lingering object conflict. Users at the branch office (authenticating against DC-02) see stale group memberships and GPOs. Password changes on DC-01 not replicating to DC-02.
The internal enterprise CA intermediate certificate is revoked by mistake during a PKI cleanup. All certificates issued by the intermediate CA are now untrusted. Internal web apps, RADIUS 802.1X auth, and LDAPS all fail certificate validation.
The SSL/TLS certificate on the public-facing customer portal expires at midnight. Chrome and Edge users see NET::ERR_CERT_DATE_INVALID. HSTS enforcement prevents bypass. API clients receiving TLS handshake failures. Revenue-impacting for e-commerce.
The primary Windows DHCP server crashes and the DHCP failover partner does not transition to 'Partner Down' state due to a misconfigured maximum client lead time (MCLT). Clients with expiring leases cannot renew and start losing connectivity.
The DHCP scope for the main user VLAN is 99% exhausted. New devices connecting to the network fail to obtain IP addresses. Users reporting 169.254.x.x APIPA addresses. Scope was sized for 200 but 210 devices now on the VLAN due to BYOD growth.
Every scenario is tested against Corax's Neural Engine in a production environment with AI-powered root cause analysis.
Tests run continuously as new infrastructure patterns are added.