We test Corax against real-world infrastructure failures across major cloud vendors, platforms, and failure scenarios. Browse the results below.
An update to an Azure Network Security Group accidentally removed the allow rule for HTTPS (443). All web traffic to the subnet is now blocked.
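A minimal detection sketch, assuming the subnet's NSG rules have been exported as dictionaries mirroring Azure's rule fields (access, direction, protocol, destinationPortRange); the sample rules are illustrative, not a real configuration.

```python
# Minimal sketch: detect a missing HTTPS (443) allow rule in an exported set
# of NSG security rules. Field names mirror Azure NSG rule properties; the
# sample data below is illustrative only.
def has_https_allow(rules):
    """Return True if any inbound rule allows TCP/443."""
    for r in rules:
        if (
            r.get("access") == "Allow"
            and r.get("direction") == "Inbound"
            and r.get("protocol") in ("Tcp", "*")
            and r.get("destinationPortRange") in ("443", "*")
        ):
            return True
    return False

# Rules as they look after the faulty update: the 443 allow rule is gone.
current_rules = [
    {"name": "allow-ssh", "access": "Allow", "direction": "Inbound",
     "protocol": "Tcp", "destinationPortRange": "22"},
    {"name": "deny-all", "access": "Deny", "direction": "Inbound",
     "protocol": "*", "destinationPortRange": "*"},
]

if not has_https_allow(current_rules):
    print("ALERT: no inbound allow rule for HTTPS (443) -- web traffic blocked")
```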
An EC2 instance failed the system status check due to underlying hardware issues. Instance is unreachable. AWS recommends stop/start to migrate to healthy hardware.
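A recovery sketch using boto3, following AWS's stop/start guidance; the instance ID is a placeholder and the check-before-acting logic is deliberately minimal.

```python
# Sketch: confirm the system status check is impaired, then stop and start the
# instance, which relocates it onto healthy underlying hardware. The instance
# ID is a placeholder; snapshot/backup considerations are omitted.
import boto3

ec2 = boto3.client("ec2")
instance_id = "i-0123456789abcdef0"  # placeholder

statuses = ec2.describe_instance_status(InstanceIds=[instance_id])["InstanceStatuses"]
if statuses and statuses[0]["SystemStatus"]["Status"] == "impaired":
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    ec2.start_instances(InstanceIds=[instance_id])
    print(f"{instance_id} stopped and started to migrate off the failed hardware")
```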
ECS tasks in the production cluster cannot start because the ECR image repository authentication token has expired. All new task launches failing.
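A sketch of re-fetching the ECR authorization token (valid for 12 hours) with boto3; the region is a placeholder, and in practice the refreshed credentials would feed the cluster's registry login rather than stdout.

```python
# Sketch: fetch a fresh ECR authorization token and decode the docker login
# credentials from it. The region is a placeholder.
import base64
import boto3

ecr = boto3.client("ecr", region_name="us-east-1")  # placeholder region
auth = ecr.get_authorization_token()["authorizationData"][0]

user, password = base64.b64decode(auth["authorizationToken"]).decode().split(":", 1)
registry = auth["proxyEndpoint"]
print(f"docker login -u {user} -p <redacted> {registry}")
print(f"token expires at {auth['expiresAt']}")
```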
An AWS Lambda function is being throttled because the account-level concurrent execution limit has been reached. Event-driven processing completely stalled.
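A sketch, assuming a hypothetical function name order-events-processor, that inspects the account's concurrency limits with boto3 and reserves concurrency for the critical consumer; a Service Quotas increase may still be required.

```python
# Sketch: compare the account concurrency limit with remaining unreserved
# capacity, then carve out reserved concurrency for a critical function so
# other functions cannot starve it. Names and numbers are placeholders.
import boto3

lam = boto3.client("lambda")
limits = lam.get_account_settings()["AccountLimit"]
print("account concurrent execution limit:", limits["ConcurrentExecutions"])
print("unreserved concurrency remaining:  ", limits["UnreservedConcurrentExecutions"])

lam.put_function_concurrency(
    FunctionName="order-events-processor",   # placeholder function name
    ReservedConcurrentExecutions=100,        # placeholder reservation
)
```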
An S3 bucket containing customer data was accidentally made public through a policy change. AWS Access Analyzer detected the exposure.
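A containment sketch with boto3, assuming a placeholder bucket name: confirm the policy makes the bucket public, then turn bucket-level Block Public Access back on.

```python
# Sketch: verify the bucket policy exposes the bucket publicly, then re-enable
# Block Public Access at the bucket level. Bucket name is a placeholder.
import boto3

s3 = boto3.client("s3")
bucket = "customer-data-bucket"  # placeholder

if s3.get_bucket_policy_status(Bucket=bucket)["PolicyStatus"]["IsPublic"]:
    s3.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )
    print(f"public access blocked on {bucket}; review the offending policy change")
```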
An RDS MySQL instance hit the max_connections limit. New connections rejected. Application connection pools failing to acquire connections.
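A diagnostic sketch, assuming a placeholder instance identifier prod-mysql-01, that reads the CloudWatch DatabaseConnections metric to see how close the instance is running to max_connections.

```python
# Sketch: pull the last hour of DatabaseConnections for the instance to see
# how the connection count tracks against max_connections. Identifier is a
# placeholder.
from datetime import datetime, timedelta, timezone
import boto3

cw = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

resp = cw.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="DatabaseConnections",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "prod-mysql-01"}],  # placeholder
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Maximum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])
```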
An RDS PostgreSQL instance hit its allocated storage limit. All write operations failing. Autoscaling storage not enabled.
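A remediation sketch with boto3, assuming a placeholder identifier and sizes: grow the allocated storage and set MaxAllocatedStorage, which turns on RDS storage autoscaling.

```python
# Sketch: increase allocated storage immediately and set MaxAllocatedStorage,
# which enables storage autoscaling so writes stop failing on a full volume.
# Identifier and sizes are placeholders.
import boto3

rds = boto3.client("rds")
rds.modify_db_instance(
    DBInstanceIdentifier="prod-postgres-01",  # placeholder
    AllocatedStorage=200,           # new size in GiB
    MaxAllocatedStorage=1000,       # autoscaling ceiling in GiB
    ApplyImmediately=True,
)
```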
A T3 burstable instance has exhausted all CPU credits. Performance throttled to baseline (20% of a vCPU for this instance size). Application response times unacceptable.
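A sketch of the immediate mitigation, with a placeholder instance ID: switch the instance to unlimited credit mode so it can burst above baseline (surplus credits are billed); resizing the instance is the longer-term fix.

```python
# Sketch: switch the burstable instance to "unlimited" credit mode so it can
# burst above its baseline right away. Instance ID is a placeholder.
import boto3

ec2 = boto3.client("ec2")
ec2.modify_instance_credit_specification(
    InstanceCreditSpecifications=[
        {"InstanceId": "i-0123456789abcdef0", "CpuCredits": "unlimited"}  # placeholder
    ]
)
```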
A Kubernetes cluster's nodes are all pulling images from Docker Hub simultaneously during a rolling deployment, exceeding the Docker Hub rate limit (100 pulls/6 hours for anonymous, 200 for authenticated). All image pulls fail with 429 Too Many Requests, blocking deployments and pod restarts across the cluster.
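A sketch using the kubernetes Python client, with placeholder credentials and namespace: create a dockerconfigjson pull secret so pulls are authenticated and get the higher limit; a pull-through cache or registry mirror is the more durable fix.

```python
# Sketch: create an image pull secret so Docker Hub pulls are authenticated
# rather than anonymous. Credentials and namespace are placeholders.
import base64
import json
from kubernetes import client, config

config.load_kube_config()
auth = base64.b64encode(b"hub-user:hub-token").decode()  # placeholder credentials
dockerconfig = json.dumps({"auths": {"https://index.docker.io/v1/": {"auth": auth}}})

secret = client.V1Secret(
    metadata=client.V1ObjectMeta(name="dockerhub-creds"),
    type="kubernetes.io/dockerconfigjson",
    string_data={".dockerconfigjson": dockerconfig},
)
client.CoreV1Api().create_namespaced_secret(namespace="production", body=secret)
# Reference the secret via imagePullSecrets in each workload's pod spec.
```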
A newly applied Kubernetes NetworkPolicy with an overly restrictive ingress rule blocks all traffic between the API pods and the database pods. The policy was intended to restrict external access but inadvertently blocks intra-cluster communication, causing a complete application outage.
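A sketch of a corrective policy expressed as a plain manifest dict; the labels (app: api, app: database), namespace, and port 5432 are assumptions chosen to illustrate allowing the API pods to reach the database pods.

```python
# Sketch: a NetworkPolicy that explicitly allows ingress from the API pods to
# the database pods on the database port. Labels, namespace, and port are
# assumptions; adjust to the real workloads before applying.
import json

allow_api_to_db = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "NetworkPolicy",
    "metadata": {"name": "allow-api-to-db", "namespace": "production"},
    "spec": {
        "podSelector": {"matchLabels": {"app": "database"}},
        "policyTypes": ["Ingress"],
        "ingress": [
            {
                "from": [{"podSelector": {"matchLabels": {"app": "api"}}}],
                "ports": [{"protocol": "TCP", "port": 5432}],
            }
        ],
    },
}

print(json.dumps(allow_api_to_db, indent=2))  # apply with: kubectl apply -f -
```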
The Kubernetes metrics-server deployment is crashlooping due to a misconfigured TLS flag after a Helm chart upgrade. Without metrics-server, the Horizontal Pod Autoscaler cannot read CPU/memory metrics and stops scaling pods. During a traffic surge, pods remain at minimum replica count and become overwhelmed.
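A sketch that patches the metrics-server Deployment's arguments with the kubernetes Python client; the flags shown are common metrics-server options and would need to match the Helm chart's actual values.

```python
# Sketch: patch the metrics-server Deployment with a corrected argument list so
# it stops crashlooping and the HPA can read CPU/memory metrics again. The
# flags are common metrics-server options; align them with your chart's values.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "metrics-server",
                        "args": [
                            "--secure-port=10250",
                            "--kubelet-preferred-address-types=InternalIP",
                            "--kubelet-use-node-status-port",
                            "--metric-resolution=15s",
                        ],
                    }
                ]
            }
        }
    }
}
apps.patch_namespaced_deployment(name="metrics-server", namespace="kube-system", body=patch)
```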
A multi-cloud architecture uses DNS-based failover between primary (cloud provider A) and secondary (cloud provider B). The DNS failover mechanism itself fails because the health check endpoint uses a shared authentication service that is down on both clouds, causing the DNS provider to mark both targets as unhealthy.
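A sketch of the underlying fix: a shallow health endpoint that reports only local liveness, so the DNS provider's failover checks do not inherit a dependency on the shared authentication service. The port and path are placeholders.

```python
# Sketch: a shallow /healthz endpoint that answers from local state only, so a
# shared-dependency outage (e.g. the auth service) cannot make both failover
# targets look unhealthy at once. Port and path are placeholders.
from http.server import BaseHTTPRequestHandler, HTTPServer

class ShallowHealth(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), ShallowHealth).serve_forever()
```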
The CDN origin shield layer fails, causing all 50+ edge locations to simultaneously request content directly from the origin server. The origin is designed to handle coalesced requests from 1 shield, not direct requests from 50+ edges. Origin collapses under the load.
An attacker exploits an unkeyed header vulnerability to poison the CDN cache, causing all users requesting a specific page to receive a response containing injected malicious JavaScript. The poisoned cache entry has a 24-hour TTL and is replicated across all edge locations.
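A sketch of the defensive principle: derive the cache key from every request input that can influence the response, including headers that would otherwise be unkeyed. The header name is illustrative; in a real CDN this is enforced through cache-key configuration rather than application code.

```python
# Sketch: include reflected-but-unkeyed headers in the cache key so a poisoned
# response can only be served back to the request that produced it. The keyed
# header list is illustrative.
import hashlib

def cache_key(method, url, headers, keyed_headers=("x-forwarded-host",)):
    parts = [method.upper(), url]
    for name in keyed_headers:
        parts.append(f"{name}={headers.get(name, '')}")
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

# Same URL, different header value -> different cache entries.
print(cache_key("GET", "https://example.com/page", {"x-forwarded-host": "evil.example"}))
print(cache_key("GET", "https://example.com/page", {}))
```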
A GCP Pub/Sub subscription's dead letter topic accumulates millions of unprocessable messages after a schema change breaks the consumer application. The dead letter queue has no consumer, messages are piling up, and the original subscription's backlog is growing exponentially.
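A drain sketch using google-cloud-pubsub, assuming a subscription (orders-dead-letter-sub) attached to the dead letter topic and a placeholder project: pull a batch of unprocessable messages, inspect them, and acknowledge them once archived or repaired.

```python
# Sketch: pull a batch of dead-lettered messages from a subscription on the
# dead letter topic, inspect them, then acknowledge. Project and subscription
# names are placeholders.
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
dlq_sub = subscriber.subscription_path("my-project", "orders-dead-letter-sub")

response = subscriber.pull(request={"subscription": dlq_sub, "max_messages": 100})
ack_ids = []
for received in response.received_messages:
    print(received.message.message_id, received.message.data[:80])
    ack_ids.append(received.ack_id)  # archive or repair elsewhere before acking

if ack_ids:
    subscriber.acknowledge(request={"subscription": dlq_sub, "ack_ids": ack_ids})
```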
A GCP HTTP(S) load balancer marks all backend instances as unhealthy after a firewall rule change blocks the health check probe source IP range (35.191.0.0/16). The load balancer returns HTTP errors to all clients despite all backend VMs being fully operational.
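A sketch of the firewall rule that must exist for the probes to reach the backends, expressed as a plain dict in the Compute Engine firewall resource shape; the network, ports, and target tag are placeholders, and 130.211.0.0/22 is the other documented probe range for global HTTP(S) load balancers.

```python
# Sketch: the ingress rule that re-admits Google's health check probe ranges.
# Network, ports, project, and target tag are placeholders; create the rule
# via gcloud or the Compute Engine API.
import json

health_check_fw_rule = {
    "name": "allow-gcp-health-checks",
    "network": "projects/my-project/global/networks/default",  # placeholder
    "direction": "INGRESS",
    "sourceRanges": ["35.191.0.0/16", "130.211.0.0/22"],
    "allowed": [{"IPProtocol": "tcp", "ports": ["80", "443"]}],
    "targetTags": ["lb-backend"],  # placeholder target tag
}

print(json.dumps(health_check_fw_rule, indent=2))
```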
An Azure Front Door configuration update introduces a routing rule error that sends 40% of production traffic to a staging backend pool. Users intermittently see staging data mixed with production data, causing data integrity concerns and customer confusion.
All Azure DevOps self-hosted build agents are stuck on hung builds, preventing any new CI/CD pipelines from running. The agent pool shows 0 available agents. Development velocity drops to zero as no code can be built, tested, or deployed.
A misconfigured Route 53 health check threshold causes all three regional endpoints to be marked unhealthy simultaneously during a brief network blip. Route 53 removes all records from DNS, causing a complete global outage even though all regions are actually healthy.
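A sketch with boto3, using a placeholder health check ID: read the current failure threshold and raise it so a single failed probe during a brief blip cannot mark an endpoint unhealthy.

```python
# Sketch: inspect a health check's failure threshold and raise it so transient
# network blips don't flip an endpoint to unhealthy. ID and target threshold
# are placeholders.
import boto3

r53 = boto3.client("route53")
hc_id = "abcd1234-ab12-cd34-ef56-abcdef123456"  # placeholder

config = r53.get_health_check(HealthCheckId=hc_id)["HealthCheck"]["HealthCheckConfig"]
print("current failure threshold:", config.get("FailureThreshold"))

r53.update_health_check(HealthCheckId=hc_id, FailureThreshold=3)
```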
ECS service cannot place new tasks because all container instances in the cluster have exhausted their CPU and memory reservations. Auto-scaling group is at max capacity. Deployments are stuck with desired count never matching running count.
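A diagnostic sketch with boto3, assuming a placeholder cluster name: list the container instances and print their remaining CPU and memory reservations, which shows why placement is failing.

```python
# Sketch: enumerate the cluster's container instances and print remaining CPU
# and memory reservations to confirm the capacity exhaustion. Cluster name is
# a placeholder.
import boto3

ecs = boto3.client("ecs")
cluster = "production"  # placeholder

arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
if arns:
    described = ecs.describe_container_instances(cluster=cluster, containerInstances=arns)
    for ci in described["containerInstances"]:
        remaining = {r["name"]: r.get("integerValue") for r in ci["remainingResources"]}
        print(ci["ec2InstanceId"],
              "remaining CPU:", remaining.get("CPU"),
              "remaining MEMORY:", remaining.get("MEMORY"))
```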
Every scenario is tested against Corax's Neural Engine in a production environment with AI-powered root cause analysis.
Tests run continuously as new infrastructure patterns are added.