We test Corax against real-world infrastructure failures across major vendors and platforms: Kubernetes, AWS, Azure, Google Cloud, relational and NoSQL databases, and enterprise telephony. Browse the results below.
Multiple PersistentVolumeClaims in a Kubernetes cluster are stuck in Pending state after the cloud provider's storage provisioner hits its volume limit. New StatefulSet pods cannot start because they require persistent storage. The storage class provisioner logs show quota exceeded errors.
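The stuck-PVC symptom is easy to surface programmatically. A minimal sketch, assuming PVC objects in the shape returned by `kubectl get pvc -A -o json` (a `List` object with an `items` array):

```python
def pending_pvcs(pvc_list: dict) -> list[str]:
    """Return namespace/name for every PVC stuck in Pending.

    `pvc_list` is assumed to be parsed JSON from
    `kubectl get pvc -A -o json`.
    """
    return [
        f'{p["metadata"]["namespace"]}/{p["metadata"]["name"]}'
        for p in pvc_list.get("items", [])
        if p.get("status", {}).get("phase") == "Pending"
    ]
```

Pairing this with the events from `kubectl describe pvc` confirms whether the cause is the provisioner's quota, as in this scenario.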
The NGINX Ingress Controller in a production Kubernetes cluster crashes after a malformed Ingress resource is applied. The controller enters a CrashLoopBackOff state. All external HTTP/HTTPS traffic to the cluster is blocked because no ingress controller pods are running to route traffic to backend services.
A Google Cloud Spanner regional instance experiences a zone-level failure in us-central1-a. The multi-zone configuration should provide automatic failover, but a configuration error in the instance's node count causes the remaining zones to be overloaded. Read and write latencies spike above SLA thresholds.
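The sizing mistake behind this scenario is simple arithmetic: in a three-zone regional instance, losing one zone concentrates the same load onto two-thirds of the serving capacity. A back-of-the-envelope check, with an illustrative CPU ceiling (real Spanner sizing guidance depends on instance configuration and workload):

```python
def survives_zone_loss(peak_cpu_pct: float, zones: int = 3,
                       cpu_ceiling_pct: float = 65.0) -> bool:
    """True if peak load still fits under the CPU ceiling after one
    zone's share of capacity is unavailable. Simplified model: load
    redistributes evenly over the surviving zones."""
    surviving_fraction = (zones - 1) / zones
    return peak_cpu_pct / surviving_fraction <= cpu_ceiling_pct

survives_zone_loss(55.0)  # False: 55 / (2/3) = 82.5% > 65%
survives_zone_loss(40.0)  # True:  40 / (2/3) = 60%  <= 65%
```

An instance already peaking at 55% CPU cannot absorb a zone loss; the node count must leave that headroom before the failure, not after.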
Azure Cosmos DB begins aggressively throttling requests after a marketing campaign drives 10x normal traffic. The provisioned RU/s budget is exhausted and autoscale max is reached. Applications receive HTTP 429 (Too Many Requests) responses. Retry storms amplify the problem as clients retry throttled requests.
Azure Key Vault becomes unreachable due to a misconfigured private endpoint and NSG rule change during a network security audit. All applications that fetch secrets, encryption keys, or certificates from Key Vault at startup or rotation time fail. Services that cache secrets continue working but cannot rotate credentials.
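The one mitigation the scenario notes, services that cache secrets keep working, can be made deliberate rather than accidental. A minimal sketch of a TTL cache that serves the stale value when the vault is unreachable (the `fetch` callable and TTL are assumptions, not a Key Vault SDK API):

```python
import time

class SecretCache:
    """Cache secrets with a TTL; if refresh fails, serve the stale
    cached value instead of failing the caller outright."""

    def __init__(self, fetch, ttl_seconds: float = 300.0,
                 clock=time.monotonic):
        self._fetch = fetch          # e.g. a Key Vault client call
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: dict = {}       # name -> (value, fetched_at)

    def get(self, name: str):
        now = self._clock()
        hit = self._cache.get(name)
        if hit is not None and now - hit[1] < self._ttl:
            return hit[0]
        try:
            value = self._fetch(name)
        except Exception:
            if hit is not None:      # vault down: stale beats an outage
                return hit[0]
            raise
        self._cache[name] = (value, now)
        return value
```

This keeps services up during a vault outage, but as the scenario shows, rotation still stalls: stale-serving needs its own alerting so the outage is noticed before credentials expire.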
An EC2 instance with instance store (ephemeral) volumes experiences a hardware failure on the underlying host. AWS stops and restarts the instance on new hardware, but all instance store data is lost. The application had been incorrectly storing session data and temporary processing files on instance store volumes instead of EBS.
A misconfigured S3 bucket policy denies all access including the root account. The bucket contains 15TB of production assets (user uploads, documents, media). All applications that read from or write to the bucket receive AccessDenied errors. Even the AWS console shows access denied.
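Lockouts like this are preventable with a pre-apply lint. A minimal sketch that flags bucket-policy statements denying every principal every action; field names follow the IAM policy JSON schema, but the matching here is deliberately naive and illustrative:

```python
def deny_all_statements(policy: dict) -> list[dict]:
    """Flag Deny statements that apply to all principals and all
    actions: the pattern that locks everyone out of the bucket."""
    def _as_list(v):
        return v if isinstance(v, list) else [v]

    hits = []
    for stmt in _as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Deny":
            continue
        actions = _as_list(stmt.get("Action", []))
        if stmt.get("Principal") == "*" and any(
                a in ("*", "s3:*") for a in actions):
            hits.append(stmt)
    return hits
```

Running a check like this in the pipeline that applies bucket policies turns a multi-hour lockout of 15TB of assets into a failed CI job.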
During a disaster recovery test, the database backup restore fails at 78% completion due to a corrupted backup chain. The backup verification job discovers that 3 incremental backups have invalid checksums, making the entire backup chain since the last full backup unrestorable. Production database has no valid point-in-time recovery option for the last 5 days.
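A backup chain is only restorable up to its first bad link: the full backup and every incremental after it must verify, in order. A minimal sketch of the verification the scenario's job performs, using SHA-256 (the hash choice and data shape are assumptions):

```python
import hashlib

def restorable_prefix(chain: list[tuple[bytes, str]]) -> int:
    """chain: (payload, expected_sha256_hex) pairs ordered
    full -> incrementals. Returns how many links from the start
    verify; everything after the first checksum mismatch is
    unusable for point-in-time recovery."""
    good = 0
    for payload, expected in chain:
        if hashlib.sha256(payload).hexdigest() != expected:
            break
        good += 1
    return good
```

The lesson of the scenario is to run this before you need it: a scheduled restore test would have surfaced the three bad incrementals days before the disaster recovery exercise did.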
A network partition splits a 3-node MariaDB Galera cluster into a 1-node partition and a 2-node partition. The isolated node enters non-primary state and rejects all queries. When the partition heals, the isolated node has divergent data that requires manual SST (State Snapshot Transfer) to resync, causing extended downtime.
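Galera's behavior here is standard quorum arithmetic: only a partition holding a strict majority of the last known membership remains the Primary Component. A sketch of that decision (simplified; real Galera also supports weighted nodes and tracks membership changes):

```python
def is_primary_component(partition_size: int, cluster_size: int) -> bool:
    """A partition keeps Primary Component status only with a strict
    majority of the previous membership; minority partitions go
    non-primary and reject queries to prevent split-brain writes."""
    return 2 * partition_size > cluster_size

is_primary_component(1, 3)  # False: the isolated node rejects queries
is_primary_component(2, 3)  # True: the two-node side keeps serving
```

This is also why three-node clusters are the practical minimum: a 1/1 split of a two-node cluster leaves neither side with a majority.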
PostgreSQL autovacuum has been unable to keep up with a high-write workload, and the database is approaching transaction ID wraparound. The autovacuum_freeze_max_age threshold is reached, forcing aggressive anti-wraparound vacuums that consume all I/O. Database performance degrades severely as aggressive vacuum competes with production queries.
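The thresholds at work are worth making concrete: PostgreSQL tracks the age of the oldest unfrozen transaction ID, launches aggressive anti-wraparound vacuums once it passes `autovacuum_freeze_max_age` (200 million by default), and eventually refuses new transactions as the age approaches two billion. A simplified sketch (the hard-limit constant is an approximation):

```python
def wraparound_status(oldest_xid_age: int,
                      freeze_max_age: int = 200_000_000,
                      hard_limit: int = 2_000_000_000) -> str:
    """Classify transaction-ID wraparound pressure. Defaults are the
    stock autovacuum_freeze_max_age and a rough approximation of the
    point where PostgreSQL stops accepting commands."""
    if oldest_xid_age >= hard_limit:
        return "shutdown"       # database refuses new transactions
    if oldest_xid_age >= freeze_max_age:
        return "forced-vacuum"  # aggressive anti-wraparound vacuum
    return "ok"
```

Monitoring `age(datfrozenxid)` from `pg_database` against `autovacuum_freeze_max_age` gives days of warning before forced vacuums start competing with production I/O, as they do in this scenario.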
An Oracle 19c production database runs out of space in the USERS tablespace after an overnight ETL job loads 3x the expected data volume. All INSERT and UPDATE operations fail with ORA-01653. The application returns errors on any write operation while reads continue to function.
SQL Server TempDB runs out of space due to a runaway query creating massive temp tables and sort operations. All concurrent queries requiring TempDB (sorts, hash joins, temp tables, version store) are blocked. The entire instance becomes effectively frozen.
A Redis Cluster node holding 5,461 hash slots crashes due to a memory corruption bug. The cluster marks the node as FAIL and attempts automatic failover to its replica. The replica promotion fails because the replica was behind on replication. Queries to affected hash slots return CLUSTERDOWN errors.
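The 5,461-slot figure is one third of Redis Cluster's 16,384 hash slots: keys map to slots via CRC16 mod 16384, so a down slot range makes a deterministic subset of keys unreachable. A self-contained slot calculator using the CRC16/XMODEM variant the cluster specification names:

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and init 0 (the variant Redis
    Cluster specifies; its check value for b'123456789' is 0x31C3)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """Map a key to its hash slot, honoring {hash tags} so related
    keys can be forced onto the same slot."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:        # only a non-empty tag is hashed
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) % 16384
```

During an incident like this one, mapping your hottest keys through `key_slot` tells you immediately whether they fall in the failed node's range.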
A network partition between MongoDB replica set members triggers repeated elections. The primary steps down, but no node can achieve majority quorum due to split-brain networking. Applications receive 'not master' errors and writes fail across all connected services.
MySQL master-replica replication breaks after a storage volume snapshot causes the binary log position to become invalid on the replica. The replica enters an error state with GTID gap, and the replication lag grows unbounded. Applications relying on read replicas begin returning stale data while writes to the master succeed.
Zoom Phone cloud service experiences an outage during a severe weather event that also knocks out the primary internet circuit. The Zoom Phone Survivability Gateway (local appliance) fails to activate because it was never properly configured after installation. The office has no phone service — cloud calling is down and local failover doesn't work.
After a PBX configuration change, all ring groups are pointing to non-existent extensions. Inbound PSTN calls ring once and go to a generic voicemail box instead of reaching the intended departments. Sales, support, and main line calls are all misrouted. The issue was caused by an extension renumbering project that didn't update ring group memberships.
The call recording system stops capturing calls after the recording storage volume fills up. The VoIP system continues functioning but no calls are being recorded, putting the organization in violation of regulatory compliance requirements (HIPAA/PCI). The failure went undetected for 48 hours.
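The 48-hour blind spot is the real failure here: a storage-capacity check is trivial, and alerting on it would have caught the gap within minutes. A minimal sketch (the threshold is illustrative):

```python
import shutil

def volume_nearly_full(path: str, warn_pct: float = 80.0) -> bool:
    """True when the filesystem holding `path` has crossed the
    warning threshold; wire this into whatever pages the on-call."""
    usage = shutil.disk_usage(path)
    used_pct = 100.0 * usage.used / usage.total
    return used_pct >= warn_pct
```

For compliance recording specifically, it is worth also alerting on the absence of newly written recording files, which catches failure modes a disk-space check cannot see.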
The primary SBC (Session Border Controller) suffers a hardware failure, dropping all active calls and preventing new call setup. The standby SBC fails to take over because the HA license expired. All inbound and outbound PSTN calls are completely offline.
Microsoft Teams Phone System experiences a regional outage affecting all Teams calling features. Users cannot make or receive PSTN calls through Teams. Direct Routing SBC shows the Teams backend as unreachable. Internal Teams chat and meetings work but all telephony features are offline.
Every scenario is tested against Corax's Neural Engine in a production environment with AI-powered root cause analysis.
Tests run continuously as new infrastructure patterns are added.