We test Corax against real-world infrastructure failures spanning vendors, platforms, and failure modes. Browse the results below.
During a disaster recovery drill, the automated runbook scripts fail to execute because the orchestration server's credentials to the DR environment have expired, the API endpoints have changed, and 3 of 8 recovery scripts reference infrastructure that has been decommissioned.
The warm standby database at the DR site has fallen out of synchronous replication mode and has been running in async mode for 5 days without anyone noticing. The standby is now 2 hours behind the primary. If failover were triggered now, 2 hours of committed transactions would be lost.
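A monitoring check for this failure is straightforward once you know what to look at. The sketch below is a hypothetical version of such a check: the input rows are shaped like PostgreSQL's `pg_stat_replication` view (standby name, `sync_state`, replay lag), but the row data, names, and threshold are assumptions for illustration.

```python
from datetime import timedelta

def check_standbys(rows, max_lag=timedelta(minutes=5)):
    """Return alert strings for standbys that are either running async
    when sync is expected, or lagging past max_lag. Each row is a
    (name, sync_state, replay_lag) tuple, mirroring pg_stat_replication."""
    alerts = []
    for name, sync_state, replay_lag in rows:
        if sync_state != "sync":
            alerts.append(f"{name}: running '{sync_state}', expected 'sync'")
        if replay_lag > max_lag:
            alerts.append(f"{name}: replay lag {replay_lag} exceeds {max_lag}")
    return alerts

# The DR standby from the scenario: async for days, 2 hours behind.
rows = [("dr-standby-1", "async", timedelta(hours=2))]
for alert in check_standbys(rows):
    print(alert)
```

Run on any schedule, a check like this would have flagged the silent sync-to-async downgrade on day one instead of day five.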
After activating disaster recovery, the DNS records are updated to point to the DR site, but propagation is taking much longer than expected due to high TTL values that were never reduced pre-failover. Many clients continue hitting the dead primary site for hours after the DNS change.
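The arithmetic behind the "reduce TTLs pre-failover" step is worth making explicit. A resolver may serve a cached record for up to its TTL, so a record changed at the cutover time can still be served stale for TTL seconds afterward; the TTL must therefore be lowered at least one full old-TTL period before the cutover. The numbers below are hypothetical:

```python
def latest_ttl_drop_time(cutover, old_ttl):
    """Latest moment (in seconds) the TTL can be reduced so that, by
    `cutover`, every resolver has expired its long-TTL cached copy."""
    return cutover - old_ttl

old_ttl = 86_400   # a 24-hour TTL that was never reduced pre-failover
cutover = 100_000  # planned cutover time, arbitrary epoch seconds
print(latest_ttl_drop_time(cutover, old_ttl))  # 13600
```

With a 24-hour TTL left in place, clients can keep hitting the dead primary for up to a full day after the DNS change, which is exactly the behavior the scenario describes.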
The cross-region database replication between the primary (us-east-1) and disaster recovery (eu-west-1) regions has fallen behind by 4 hours. A sustained write increase combined with network throttling between regions is causing the replica to fall further behind, with the gap widening every hour.
A monitoring check reveals that the most recent successful backup of the production database is 72 hours old, far exceeding the 1-hour RPO SLA. The backup job has been failing silently due to a full backup repository, and the replication to the offsite location has also been paused.
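The core of that monitoring check is a single comparison: backup age versus the RPO. A minimal sketch, with the 1-hour RPO from the scenario and hypothetical timestamps:

```python
from datetime import datetime, timedelta, timezone

RPO = timedelta(hours=1)  # the 1-hour RPO SLA from the scenario

def rpo_violation(last_backup_at, now):
    """Return how far past the RPO the newest backup is, or None if OK."""
    age = now - last_backup_at
    return age - RPO if age > RPO else None

now = datetime(2024, 1, 4, 12, 0, tzinfo=timezone.utc)   # assumed clock
last_ok = now - timedelta(hours=72)  # job failing silently for 3 days
excess = rpo_violation(last_ok, now)
print(excess)  # 2 days, 23:00:00 over the 1-hour RPO
```

The important design point is that the check measures the *last successful backup*, not whether the backup job ran: a job that runs and fails silently (here, against a full repository) passes a "did the job run" check but fails this one.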
During an actual disaster recovery activation, the recovery process is taking far longer than the 4-hour RTO SLA. After 6 hours, only 3 of 12 critical services are operational. The recovery runbook is outdated, automation scripts are failing, and key personnel are unreachable.
A scheduled disaster recovery failover test reveals that the DR site cannot bring up critical services. The DR database has been silently failing replication for 2 weeks, the application servers have outdated configurations, and the DR network routing tables are stale.
An automated vulnerability scan discovers a critical CVE with a CVSS score of 10.0 affecting the production web server software. The vulnerability allows unauthenticated remote code execution and has a known public exploit. The affected software is running on 15 production servers.
A quarterly security audit discovers that 23 production services are still offering deprecated cipher suites including RC4, DES, and 3DES. Several services also support TLS 1.0 and 1.1. This fails PCI DSS Requirement 4.1 and multiple CIS benchmarks.
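The remediation on each service is to raise the TLS floor and prune the cipher list. A server-side sketch using Python's `ssl` module (the cipher string is an assumed policy, not a mandated one): setting `minimum_version` to TLS 1.2 refuses TLS 1.0/1.1 handshakes outright, and the explicit cipher string excludes RC4, DES, and 3DES.

```python
import ssl

# Hypothetical hardened server context: TLS 1.2+ only, modern AEAD
# cipher suites, legacy algorithms explicitly excluded.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.minimum_version = ssl.TLSVersion.TLSv1_2
ctx.set_ciphers("ECDHE+AESGCM:ECDHE+CHACHA20:!RC4:!3DES:!DES:!aNULL")
```

On current OpenSSL builds RC4 and single DES are usually absent already, but excluding them explicitly keeps the configuration self-documenting for the next audit.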
An external penetration test discovers a critical remote code execution vulnerability in the production API through an unsanitized file upload endpoint. The pentester demonstrates full shell access to the application server and lateral movement to the database server.
A weekly CIS benchmark compliance scan detects that 43 production servers have drifted from their hardened baseline. An unauthorized configuration management change reverted SSH hardening, disabled audit logging, and re-enabled insecure protocols across the production fleet.
A GDPR compliance scan discovers that the automated data retention purge job has been silently skipping records due to a foreign key constraint error. 2.3 million EU user records past their retention period have not been deleted, violating GDPR Article 5(1)(e), the storage limitation principle.
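The failure mode is easy to reproduce in miniature. The sketch below uses SQLite and an assumed two-table schema: `user_events` holds a foreign key into `users`, so deleting an expired user row fails until the dependent rows go first, which is the error the purge job was silently swallowing.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
db.execute("CREATE TABLE user_events (user_id INTEGER REFERENCES users(id))")
db.execute("INSERT INTO users VALUES (1)")
db.execute("INSERT INTO user_events VALUES (1)")

try:
    db.execute("DELETE FROM users WHERE id = 1")  # what the purge job did
except sqlite3.IntegrityError as e:
    print("purge failed:", e)  # FOREIGN KEY constraint failed

# The fix: purge dependents first (or declare ON DELETE CASCADE).
db.execute("DELETE FROM user_events WHERE user_id = 1")
db.execute("DELETE FROM users WHERE id = 1")
```

The second lesson from the scenario is operational rather than relational: a purge job that logs the constraint error and alerts, instead of skipping the row, turns a 2.3-million-record backlog into a same-day fix.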
An automated SOC 2 compliance scan discovers 14 terminated employee accounts that are still active across production systems. The offboarding automation failed silently for 3 months, and these accounts retain full production access including database admin and cloud console roles.
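At its core the detection is a set intersection between two systems of record. A minimal sketch with hypothetical usernames: anyone in the HR termination list who still appears in the production IAM export is a finding.

```python
# Hypothetical reconciliation between HR and production IAM.
terminated = {"jdoe", "asmith", "bnguyen"}            # HR: terminated staff
active_accounts = {"jdoe", "asmith", "cwei", "dlee"}  # prod IAM: active accounts

orphaned = terminated & active_accounts
print(sorted(orphaned))  # ['asmith', 'jdoe'] -- should have been disabled
```

Running this reconciliation daily, independently of the offboarding automation it audits, is what catches the automation failing silently, as it did here for three months.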
The centralized audit logging service for a healthcare application has been silently failing for 72 hours. No access logs for electronic protected health information (ePHI) were captured during this period, creating a HIPAA audit trail gap that must be reported and remediated.
A database scan discovers unencrypted cardholder data (primary account numbers) stored in a staging database that was never intended to be in PCI scope. A developer copied production data to staging for debugging without masking sensitive fields, violating PCI DSS Requirement 3.4.
A Kubernetes cluster's nodes are all pulling images from Docker Hub simultaneously during a rolling deployment, exceeding the Docker Hub rate limit (100 pulls/6 hours for anonymous, 200 for authenticated). All image pulls fail with 429 Too Many Requests, blocking deployments and pod restarts across the cluster.
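The durable fix is a pull-through registry cache or mirror so nodes stop hitting Docker Hub directly, but clients should also back off on 429s rather than hammer the limit. A hedged sketch, where `pull` stands in for the real image pull and the simulated responses are invented:

```python
import time

def pull_with_backoff(pull, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry `pull` on HTTP 429 with exponential backoff (1s, 2s, 4s, ...).
    Returns the final status code; gives up after max_attempts."""
    for attempt in range(max_attempts):
        status = pull()
        if status != 429:
            return status
        sleep(base_delay * (2 ** attempt))
    return 429

# Simulated registry: two rate-limited responses, then success.
responses = iter([429, 429, 200])
print(pull_with_backoff(lambda: next(responses), sleep=lambda s: None))  # 200
```

Backoff only spreads the load over the rate-limit window; it does not raise the limit, which is why the mirror is the real remediation for a whole cluster pulling at once.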
A newly applied Kubernetes NetworkPolicy with an overly restrictive ingress rule blocks all traffic between the API pods and the database pods. The policy was intended to restrict external access but inadvertently blocks intra-cluster communication, causing a complete application outage.
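A corrective policy for that outage might look like the sketch below; the namespace, labels, and port are assumptions, not taken from the scenario. It still denies external traffic by default (any pod selected by a NetworkPolicy with an `Ingress` rule drops everything not explicitly allowed) but admits the API pods explicitly, restoring the intra-cluster path the original rule severed.

```yaml
# Hypothetical fix: the original policy allowed no in-cluster sources,
# so database pods received no traffic at all. This version keeps
# external access blocked but whitelists the API pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-allow-api        # name and labels are assumed
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: database
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api      # restore API -> database traffic
      ports:
        - protocol: TCP
          port: 5432
```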
The Kubernetes metrics-server deployment is crashlooping due to a misconfigured TLS flag after a Helm chart upgrade. Without metrics-server, the Horizontal Pod Autoscaler cannot read CPU/memory metrics and stops scaling pods. During a traffic surge, pods remain at minimum replica count and become overwhelmed.
A multi-cloud architecture uses DNS-based failover between primary (cloud provider A) and secondary (cloud provider B). The DNS failover mechanism itself fails because the health check endpoint uses a shared authentication service that is down on both clouds, causing the DNS provider to mark both targets as unhealthy.
The CDN origin shield layer fails, causing all 50+ edge locations to simultaneously request content directly from the origin server. The origin is designed to handle coalesced requests from 1 shield, not direct requests from 50+ edges. Origin collapses under the load.
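What the shield layer provides is request coalescing: many concurrent requests for the same object collapse into one origin fetch, a pattern often called "single-flight". The sketch below is a simplified, assumed implementation (error handling omitted), not any CDN's actual code, but it shows why losing that layer multiplies origin load by the number of edges.

```python
import threading

class SingleFlight:
    """Concurrent calls for the same key share one fetch: the first
    caller (the leader) hits the origin, the rest wait and reuse its
    result. Error handling is omitted for brevity."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done event, shared result holder)

    def do(self, key, fetch):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), [])
                self._inflight[key] = entry
        event, result = entry
        if leader:
            result.append(fetch())  # only the leader hits the origin
            with self._lock:
                del self._inflight[key]
            event.set()
        else:
            event.wait()            # followers reuse the leader's result
        return result[0]

sf = SingleFlight()
print(sf.do("/asset.js", lambda: "origin response"))  # origin response
```

With a shield in front, the origin sees one coalesced fetch per object per cache miss; without it, it sees one fetch per edge, which is the 50x load spike that collapsed the origin here.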
Every scenario is tested against Corax's Neural Engine in a production environment with AI-powered root cause analysis.
Tests run continuously as new infrastructure patterns are added.