We test Corax against real-world infrastructure failures across every vendor, platform, and scenario. Browse the results below.
A well-intentioned change to reduce DNS TTL to 10 seconds for faster failover has overwhelmed the internal DNS resolvers. With 500 services making DNS lookups at 10-second intervals, the resolvers are processing 300,000 queries per minute and experiencing query failures that cascade into application-level failures.
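The quoted query rate follows from simple arithmetic, sketched below. The 100 upstream hostnames per service is an assumed figure chosen to match the stated load, not a number from the scenario itself.

```python
# Back-of-the-envelope resolver load. `lookups_per_service` is an
# assumption: each service is taken to re-resolve ~100 upstream hostnames.
services = 500
lookups_per_service = 100        # assumed fan-out per service
refreshes_per_minute = 60 // 10  # a 10-second TTL forces 6 refreshes per minute

qpm = services * lookups_per_service * refreshes_per_minute
print(qpm)  # 300000 queries per minute, i.e. ~5,000 queries per second
```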
A legacy 32-bit system component used for timestamp calculations in the billing pipeline is experiencing integer overflow when processing dates beyond 19 January 2038, the limit of a signed 32-bit Unix timestamp. A new long-term contract with an end date of 2040 causes the billing calculation to overflow, generating negative timestamps that crash the entire billing pipeline.
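The overflow mechanism is easy to reproduce. The sketch below emulates a signed 32-bit timestamp with `struct`; it is an illustration of the failure class, not the billing component's actual code.

```python
import struct
from datetime import datetime, timezone

def to_unix32(dt: datetime) -> int:
    """Pack a Unix timestamp into a signed 32-bit integer the way a
    legacy component would, wrapping on overflow instead of raising."""
    ts = int(dt.timestamp())
    return struct.unpack("<i", struct.pack("<I", ts & 0xFFFFFFFF))[0]

print(to_unix32(datetime(2030, 1, 1, tzinfo=timezone.utc)))  # 1893456000
print(to_unix32(datetime(2040, 1, 1, tzinfo=timezone.utc)))  # -2085978496
```

A 2040 end date wraps past 2^31 - 1 and comes back negative, which is exactly the value that downstream billing code cannot handle.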
After a server migration, the new application servers are configured in UTC while the database remains in US/Eastern. This 4-5 hour offset (4 hours during daylight saving time, 5 otherwise) causes scheduled jobs to run at the wrong times, time-based queries to return incorrect results, and audit timestamps to be inconsistent.
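The skew is easy to demonstrate with `zoneinfo` (using America/New_York, the canonical name for US/Eastern): the same wall-clock reading names two different instants depending on which side of the migration interprets it.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# The same wall-clock reading, interpreted by each side of the migration
app_side = datetime(2024, 6, 1, 18, 0, tzinfo=ZoneInfo("UTC"))
db_side = datetime(2024, 6, 1, 18, 0, tzinfo=ZoneInfo("America/New_York"))

skew_hours = (db_side - app_side).total_seconds() / 3600
print(skew_hours)  # 4.0 in June (EDT); the same check in January gives 5.0
```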
A customer-facing application that processes international domain names has a punycode encoding bug that causes DNS lookups to fail for domains containing non-ASCII characters. The bug was introduced in a library update and affects 15% of international customers whose email domains contain Unicode characters.
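The failure mode can be sketched with Python's built-in IDNA codec: a code path that assumes ASCII domains raises on Unicode input, while the correct path punycode-encodes the labels before the DNS query. The domain below is illustrative.

```python
domain = "bücher.example"

# What the buggy code path effectively does — raises on non-ASCII labels
try:
    domain.encode("ascii")
except UnicodeEncodeError as err:
    print("DNS lookup fails:", err)

# The correct path: IDNA (punycode) encoding before resolving
print(domain.encode("idna"))  # b'xn--bcher-kva.example'
```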
A network upgrade enables IPv6 on production servers without properly configuring the application firewall rules for IPv6. The application binds to both IPv4 and IPv6, but the firewall only has IPv4 rules. The IPv6 interface is completely unprotected, and port scans from the internet are discovering open management ports via IPv6.
A positive leap second insertion causes time synchronization issues across the infrastructure. Some servers handle it by stepping the clock, others by smearing, and a few that missed the NTP update have clocks 1 second ahead. This causes TLS handshake failures, database replication errors, and authentication failures due to clock skew.
During the 'fall back' daylight saving time transition, cron jobs scheduled between 1:00 AM and 2:00 AM execute twice because the clock passes through that hour twice. A billing job runs twice, double-charging 15,000 customers. A cleanup job runs twice, deleting data that the first run already processed.
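The repeated hour is visible directly in `zoneinfo` via the `fold` attribute: on the 2023 fall-back date in America/New_York, 01:30 names two real instants an hour apart, and any scheduler keyed purely on wall-clock time fires at both.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

tz = ZoneInfo("America/New_York")
first = datetime(2023, 11, 5, 1, 30, tzinfo=tz, fold=0)   # 01:30 EDT
second = datetime(2023, 11, 5, 1, 30, tzinfo=tz, fold=1)  # 01:30 EST

print(first.utcoffset(), second.utcoffset())  # UTC-4 vs UTC-5
elapsed = second.timestamp() - first.timestamp()
print(elapsed)  # 3600.0 — the same wall-clock time, one hour apart
```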
The external OAuth2/OIDC identity provider (Okta) is experiencing a major outage. All SSO login attempts fail because the authorization endpoint is unreachable. Users cannot authenticate to any application that relies on Okta for SSO, affecting the entire organization.
A JWT signing key rotation was performed on the authentication service, but the new public key was not distributed to 4 of 7 microservices that validate tokens. These services are rejecting all tokens signed with the new key, while the auth service has stopped issuing tokens with the old key.
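The rejection path comes down to `kid`-based key lookup. The sketch below is a minimal illustration of that lookup (key ids and the keyset layout are invented, and the signature check is elided), not the real services' validation code.

```python
# A service whose local keyset was never updated cannot resolve the
# new key id and rejects every freshly issued token.
stale_keyset = {"key-2023": "<old public key>"}   # never received key-2024

def can_validate(token_header: dict, keyset: dict) -> bool:
    """A service can only verify a token whose `kid` it knows."""
    return token_header.get("kid") in keyset       # signature check elided

print(can_validate({"kid": "key-2024"}, stale_keyset))  # False — rejected
print(can_validate({"kid": "key-2023"}, stale_keyset))  # True — but no longer issued
```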
A new GraphQL resolver deployed to production has an N+1 query problem. A single client query that fetches a list of 500 items triggers 501 database queries (1 for the list + 500 for related data). With 200 concurrent users making this query, the database is executing 100,000+ queries per second and grinding to a halt.
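The amplification can be sketched with a stand-in query counter; the SQL below is illustrative, not the production schema. The standard fix is batching the per-item lookups into one query (the DataLoader pattern), which collapses 501 round trips into 2.

```python
query_count = 0

def run_query(sql, params=()):
    """Stand-in for a database round trip; only counts calls."""
    global query_count
    query_count += 1
    return [{"id": p} for p in params]

item_ids = list(range(500))

# N+1 resolver: one query for the list, then one per item
run_query("SELECT id FROM items")
for item_id in item_ids:
    run_query("SELECT * FROM related WHERE item_id = ?", (item_id,))
n_plus_one = query_count
print(n_plus_one)  # 501

# Batched resolver: the list query plus a single IN (...) lookup
query_count = 0
run_query("SELECT id FROM items")
run_query("SELECT * FROM related WHERE item_id IN (?)", tuple(item_ids))
batched = query_count
print(batched)  # 2
```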
A mobile app update with a WebSocket reconnection bug is causing millions of simultaneous WebSocket connection attempts. Each failed connection immediately retries, creating a connection storm that exhausts the server's file descriptor limit and blocks all new connections including the web application.
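The standard mitigation missing from the buggy app is capped exponential backoff with jitter, so failed connections spread their retries out instead of reconnecting immediately. A minimal sketch (base and cap values are arbitrary):

```python
import random

def retry_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Capped exponential backoff with full jitter: each retry sleeps a
    random amount inside an exponentially growing window."""
    window = min(cap, base * (2 ** attempt))
    return random.uniform(0, window)

delays = [retry_delay(n) for n in range(8)]
print(delays)  # windows grow 0.5s, 1s, 2s, ... capped at 30s
```

Full jitter matters as much as the exponent: without it, every client that failed at the same moment retries at the same moment, recreating the storm on each cycle.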
A partner integration is being blocked by API rate limiting after a bug in the partner's code causes it to retry failed requests in a tight loop. The rate limiter is correctly blocking the abusive traffic but is also blocking legitimate requests from the same partner, disrupting a critical business integration.
Redis has hit its maxmemory limit and is aggressively evicting cached entries under the volatile-lru policy. The cache hit ratio has dropped from 95% to 12%, and the resulting thundering herd of cache misses is overwhelming the backend database with 50x the normal query load.
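One standard defense against this thundering herd is single-flight (stampede) protection: on a miss, only one caller recomputes the value while the rest wait and reuse it. A minimal in-process sketch, not Redis-specific:

```python
import threading

cache = {}
locks = {}
locks_guard = threading.Lock()
db_queries = 0

def load_from_db(key):
    """Stand-in for the expensive backend query."""
    global db_queries
    db_queries += 1
    return f"value-for-{key}"

def get(key):
    if key in cache:
        return cache[key]
    with locks_guard:                     # one lock per key
        lock = locks.setdefault(key, threading.Lock())
    with lock:
        if key not in cache:              # re-check after acquiring
            cache[key] = load_from_db(key)
    return cache[key]

threads = [threading.Thread(target=get, args=("hot-key",)) for _ in range(50)]
for t in threads: t.start()
for t in threads: t.join()
print(db_queries)  # 1 — a single backend query despite 50 concurrent misses
```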
The Kafka consumer group for the real-time analytics pipeline has accumulated 50 million unprocessed messages across 24 partitions. The consumers crashed when the schema registry became unavailable, and auto-restart is failing because the schema registry is still down.
The payment service circuit breaker has tripped open after the payment gateway became unresponsive. This causes the order service to fail, which causes the cart service to fail, creating a cascade of open circuit breakers across 5 microservices. The entire checkout flow is non-functional.
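The breaker mechanics themselves are simple to sketch. The class below is a minimal illustration (thresholds and timings are arbitrary), not the payment service's actual implementation.

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures, fails fast while
    open, and half-opens to allow one probe after `reset_after` seconds."""
    def __init__(self, threshold=3, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None        # half-open: allow one probe
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

cb = CircuitBreaker(threshold=3)
def flaky():
    raise IOError("payment gateway timeout")

for _ in range(3):
    try: cb.call(flaky)
    except IOError: pass

tripped = False
try:
    cb.call(lambda: "ok")                # never reaches the gateway
except RuntimeError as err:
    tripped = True
    print(err)                           # circuit open: failing fast
```

The cascade happens when each upstream service treats a fast-failing breaker as its own dependency failure, tripping its own breaker in turn.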
The SIEM correlation engine is overwhelmed by a 10x increase in log volume caused by misconfigured firewall debug logging. The correlation engine cannot keep up, creating a 3-hour processing backlog. Real-time security alerting is non-functional during this period.
PagerDuty webhook integrations for all monitoring tools have been failing for 6 hours due to an expired mutual TLS credential on the PagerDuty integration proxy. Alerts are being generated by monitoring systems but never reaching the on-call team. Multiple production incidents have gone unnoticed.
A Datadog API key rotation was performed incorrectly, deploying the new key to only 30 of 500 servers. The remaining 470 agents are now sending data with the revoked API key, resulting in silent data loss. Datadog shows green dashboards but only from 30 servers.
The Elasticsearch cluster backing the ELK logging stack has run out of disk space. Elasticsearch has entered read-only mode, Logstash is backing up, and no new logs are being indexed. Security event logs, application logs, and audit trails are all being dropped.
The entire monitoring stack (Prometheus, Grafana, Alertmanager) has gone down due to a persistent volume running out of IOPS on the Kubernetes node. No metrics are being collected, no dashboards are accessible, and no alerts are firing. The infrastructure team is flying blind.
Every scenario is tested against Corax's Neural Engine in a production environment with AI-powered root cause analysis.
Tests run continuously as new infrastructure patterns are added.