We test Corax against real-world infrastructure failures across every vendor, platform, and scenario. Browse the results below.
A Kubernetes cluster's nodes all pull images from Docker Hub simultaneously during a rolling deployment, exceeding the Docker Hub pull rate limit (100 pulls per 6 hours for anonymous clients, 200 for authenticated free accounts). All image pulls fail with 429 Too Many Requests, blocking deployments and pod restarts across the cluster.
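On the client side, the standard mitigation for 429 responses is retrying with exponential backoff and jitter so a fleet of nodes does not hammer the registry in lockstep. A minimal sketch, where `pull` is a hypothetical callable standing in for the actual registry request (the real cluster-level fixes are an authenticated pull secret or a pull-through cache):

```python
import random
import time

def pull_with_backoff(pull, max_attempts=5, base_delay=1.0):
    """Retry a pull on HTTP 429 with exponential backoff plus jitter.

    `pull` is a hypothetical callable returning an HTTP status code;
    this is an illustrative retry loop, not a Docker or kubelet API.
    """
    for attempt in range(max_attempts):
        status = pull()
        if status != 429:
            return status
        # Back off 1x, 2x, 4x, ... the base delay, with random jitter
        # so simultaneous retries from many nodes spread out in time.
        time.sleep(base_delay * (2 ** attempt + random.random()))
    return 429
```

The jitter term matters as much as the exponent: without it, every node that failed together retries together and trips the limit again.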
A newly applied Kubernetes NetworkPolicy with an overly restrictive ingress rule blocks all traffic between the API pods and the database pods. The policy was intended to restrict external access but inadvertently blocks intra-cluster communication, causing a complete application outage.
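The failure mode follows directly from NetworkPolicy semantics: once any policy selects a pod, traffic to it is denied unless some ingress rule matches the peer. A simplified evaluator over plain dicts mirroring the Kubernetes fields (hypothetical data, not the real API client) shows how a rule meant for an external gateway silently drops API-to-database traffic:

```python
def ingress_allowed(policy, client_labels):
    """Simplified NetworkPolicy check: a selected pod accepts traffic only
    if some ingress rule's podSelector matches the client's labels.
    Default deny applies to anything that matches no rule."""
    for rule in policy.get("ingress", []):
        for peer in rule.get("from", []):
            selector = peer.get("podSelector", {}).get("matchLabels", {})
            if all(client_labels.get(k) == v for k, v in selector.items()):
                return True
    return False  # selected by a policy, matched by no rule: dropped

# The new policy selects the database pods and allows only a gateway role...
policy = {
    "podSelector": {"matchLabels": {"app": "db"}},
    "ingress": [{"from": [{"podSelector": {"matchLabels": {"role": "gateway"}}}]}],
}
# ...so the API pods match no ingress rule and their traffic is dropped:
print(ingress_allowed(policy, {"app": "api"}))  # False
```

The intended restriction and the outage are the same rule; nothing in the policy is malformed, it is simply narrower than the traffic it now governs.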
The Kubernetes metrics-server deployment is crashlooping due to a misconfigured TLS flag after a Helm chart upgrade. Without metrics-server, the Horizontal Pod Autoscaler cannot read CPU/memory metrics and stops scaling pods. During a traffic surge, pods remain at minimum replica count and become overwhelmed.
A multi-cloud architecture uses DNS-based failover between primary (cloud provider A) and secondary (cloud provider B). The DNS failover mechanism itself fails because the health check endpoint uses a shared authentication service that is down on both clouds, causing the DNS provider to mark both targets as unhealthy.
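The trap is that the probe path, not the endpoint, carries the shared dependency. A toy model (hypothetical names; the point is the shared auth service sitting inside every health check) shows both clouds vanishing from DNS at once:

```python
def resolvable_targets(primary_up, secondary_up, auth_up):
    """The health check authenticates against a shared service before
    probing the endpoint itself, so an auth outage hides both targets
    even when the endpoints are healthy. Illustrative model only."""
    checks = {
        "primary": primary_up and auth_up,
        "secondary": secondary_up and auth_up,
    }
    return [name for name, ok in checks.items() if ok]

# Both clouds are up, but the shared auth service is down:
print(resolvable_targets(True, True, auth_up=False))  # []
```

A failover design is only as independent as its health checks; probing through a shared dependency collapses two availability domains into one.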
The CDN origin shield layer fails, causing all 50+ edge locations to request content directly from the origin server simultaneously. The origin is sized to handle coalesced requests from a single shield, not direct requests from 50+ edges, and collapses under the load.
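The amplification is simple arithmetic: with a shield, concurrent edge misses for the same object coalesce into roughly one origin fetch; without it, every edge fetches independently. A back-of-the-envelope model (illustrative numbers, not vendor data):

```python
def origin_fetches(edges, objects, miss_rate, shield_up):
    """Rough request-amplification model: fetches hitting the origin
    per cache cycle. With the shield, one fetcher per object; without,
    every edge fetches each missed object on its own."""
    fetchers = 1 if shield_up else edges
    return int(objects * miss_rate * fetchers)

print(origin_fetches(50, 10_000, 0.2, shield_up=True))   # 2000
print(origin_fetches(50, 10_000, 0.2, shield_up=False))  # 100000
```

Losing the shield multiplies origin load by the edge count, here a 50x step change with no change in client traffic at all.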
An attacker exploits an unkeyed header vulnerability to poison the CDN cache, causing all users requesting a specific page to receive a response containing injected malicious JavaScript. The poisoned cache entry has a 24-hour TTL and is replicated across all edge locations.
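The vulnerability hinges on the gap between what varies the response and what varies the cache key. A sketch of a cache-key function (illustrative, not any specific CDN's key scheme) makes the poisoning mechanics concrete:

```python
import hashlib

def cache_key(method, path, headers, keyed_headers=()):
    """Build a cache key from the request. Any header that alters the
    response but is absent from `keyed_headers` is 'unkeyed': two
    different responses collapse onto one cache entry."""
    parts = [method, path]
    for h in keyed_headers:
        parts.append(f"{h}={headers.get(h, '')}")
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

# X-Forwarded-Host changes the generated page but is unkeyed, so the
# attacker's poisoned response and the legitimate one share an entry:
k_attacker = cache_key("GET", "/page", {"X-Forwarded-Host": "evil.example"})
k_victim = cache_key("GET", "/page", {})
print(k_attacker == k_victim)  # True
```

Adding the header to `keyed_headers` splits the entries, which is why the standard fix is either keying on the header or refusing to reflect it.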
A GCP Pub/Sub subscription's dead letter topic accumulates millions of unprocessable messages after a schema change breaks the consumer application. The dead letter topic has no subscriber to drain it, and the original subscription's backlog grows unbounded.

A GCP HTTP(S) load balancer marks all backend instances as unhealthy after a firewall rule change blocks the health check probe source IP range (35.191.0.0/16). The load balancer returns HTTP errors to all clients despite all backend VMs being fully operational.
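Diagnosing this class of outage comes down to checking whether the firewall still admits the documented probe source ranges; for most Google Cloud load balancer types those are 35.191.0.0/16 and 130.211.0.0/22. A stdlib check (the function name is ours, not a GCP API):

```python
import ipaddress

# Documented Google Cloud health-check probe ranges for most LB types.
PROBE_RANGES = [ipaddress.ip_network(n) for n in ("35.191.0.0/16", "130.211.0.0/22")]

def firewall_must_allow(source_ip):
    """True if `source_ip` falls inside the probe ranges, i.e. an ingress
    allow rule must cover it or every backend is marked unhealthy."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in PROBE_RANGES)

print(firewall_must_allow("35.191.12.34"))  # True
```

The scenario's signature is exactly this mismatch: backends serve fine from inside the VPC, yet the load balancer sees nothing but failed probes.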
An Azure Front Door configuration update introduces a routing rule error that sends 40% of production traffic to a staging backend pool. Users intermittently see staging data mixed with production data, causing data integrity concerns and customer confusion.
All Azure DevOps self-hosted build agents are stuck on hung builds, preventing any new CI/CD pipelines from running. The agent pool shows 0 available agents. Development velocity drops to zero as no code can be built, tested, or deployed.
A misconfigured Route 53 health check threshold causes all three regional endpoints to be marked unhealthy simultaneously during a brief network blip. Route 53 removes all records from DNS, causing a complete global outage even though all regions are actually healthy.
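Route 53's failure threshold counts consecutive failed probes before an endpoint is marked unhealthy, so the threshold decides whether a short blip trips failover. A simplified model (our own simulation, not the Route 53 evaluation engine):

```python
def marked_unhealthy(probe_results, failure_threshold):
    """An endpoint is marked unhealthy once `failure_threshold`
    consecutive probes fail. Simplified model of a DNS health check."""
    streak = 0
    for ok in probe_results:
        streak = 0 if ok else streak + 1
        if streak >= failure_threshold:
            return True
    return False

blip = [True, False, False, True, True]  # a two-probe network blip
print(marked_unhealthy(blip, failure_threshold=2))  # True
print(marked_unhealthy(blip, failure_threshold=3))  # False
```

With the threshold set too low, the same transient blip hits all three regional checks at once, and every record is withdrawn together.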
An ECS service cannot place new tasks because every container instance in the cluster has exhausted its CPU and memory reservations. The Auto Scaling group is at maximum capacity, and deployments are stuck with the desired count never matching the running count.
An Ansible playbook run against 200 production servers fails midway through execution due to a changed SSH host key on the jump host. 87 servers received the updated configuration while 113 did not, creating a split-brain configuration state across the fleet.
A CloudFormation stack update to production fails during resource creation, triggering an automatic rollback that itself gets stuck in UPDATE_ROLLBACK_FAILED state due to a manually modified resource outside of CloudFormation control.
Two CI/CD pipelines triggered simultaneously attempt to run terraform apply against the same state file. The DynamoDB state lock prevents both from proceeding, but a stale lock from a crashed previous run is never released, blocking all infrastructure deployments for 4 hours.
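Terraform's DynamoDB lock has no expiry, which is exactly why a crashed holder blocks everyone until an operator runs `terraform force-unlock <LOCK_ID>`. An in-memory sketch of a lock record with the stale-lock expiry the incident lacked (our own variation on the technique, not Terraform's actual backend):

```python
import time

def acquire(locks, name, holder, now=None, stale_after=3600.0):
    """Conditional-write style lock over a dict standing in for the lock
    table. A lock held longer than `stale_after` seconds is treated as
    abandoned and stolen; Terraform itself never does this automatically."""
    now = time.time() if now is None else now
    current = locks.get(name)
    if current is not None and now - current["since"] < stale_after:
        return False  # held and fresh: caller must wait
    locks[name] = {"holder": holder, "since": now}
    return True
```

Automatic expiry trades one failure mode for another (a slow-but-alive holder can be stolen from), which is why Terraform leaves the decision to a human with `force-unlock`.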
Azure AD Connect enters a synchronization loop in which the delta sync cycle never completes and continuously restarts. Password hash synchronization stops working, and on-premises changes are not reflected in Azure AD.
A Helm chart upgrade fails partway through due to a resource conflict, and the automatic rollback also fails because the previous release's CRDs are incompatible with the current cluster state. The release is stuck in a 'pending-rollback' state. Kubernetes resources are in a mixed state between the old and new versions.
An automated Kubernetes secret rotation job fails silently, leaving database credentials expired in 15 Kubernetes Secrets across 3 namespaces. Pods that restart or scale up pick up the expired credentials and cannot connect to databases. Running pods with cached credentials continue working until their connection pools recycle.
The Docker daemon on a production host becomes unresponsive due to a deadlock in the containerd shim layer. All container operations (start, stop, exec, logs) hang indefinitely. Running containers continue to operate but cannot be managed. New deployments and health checks fail.
Multiple PersistentVolumeClaims in a Kubernetes cluster are stuck in Pending state after the cloud provider's storage provisioner hits its volume limit. New StatefulSet pods cannot start because they require persistent storage. The storage class provisioner logs show quota exceeded errors.
Every scenario is tested against Corax's Neural Engine in a production environment with AI-powered root cause analysis.
Tests run continuously as new infrastructure patterns are added.