Infrastructure Scenario Tests

We test Corax against real-world infrastructure failures across every vendor, platform, and scenario. Browse the results below.

276 total tests | 100.0% pass rate | 276 passed | 0 failed

DNS TTL Too Low — Resolver Overload and Lookup Latency

PASS

A well-intentioned change to reduce DNS TTL to 10 seconds for faster failover has overwhelmed the internal DNS resolvers. With 500 services each re-resolving roughly 100 upstream hostnames every 10 seconds, the resolvers are processing 300,000 queries per minute and experiencing query failures that cascade into application-level failures.

Infrastructure | Pattern: HIGH_CPU | Severity: CRITICAL | Confidence: 95% | Remote Hands | 16 correlated
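The resolver load in this scenario is a direct function of the TTL. A back-of-envelope sketch (service and hostname counts are illustrative assumptions, not incident measurements):

```python
# Estimate resolver load as a function of DNS TTL: every cached record
# expires after ttl_seconds, forcing a fresh lookup.
def resolver_qpm(services: int, hostnames_per_service: int, ttl_seconds: int) -> float:
    """Queries per minute hitting the resolvers once caches expire."""
    lookups_per_minute = 60 / ttl_seconds
    return services * hostnames_per_service * lookups_per_minute

# Dropping the TTL from 300s to 10s multiplies resolver load 30x.
print(resolver_qpm(500, 100, 300))  # 10000.0 qpm at a 5-minute TTL
print(resolver_qpm(500, 100, 10))   # 300000.0 qpm at a 10-second TTL
```

The same arithmetic explains why TTL reductions are usually staged alongside resolver capacity planning or a caching layer.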

Epoch Timestamp Overflow in 32-Bit System

PASS

A legacy 32-bit system component used for timestamp calculations in the billing pipeline is experiencing integer overflow when processing dates beyond January 2038, the signed 32-bit time_t limit. A new long-term contract with an end date in 2040 causes the billing calculation to overflow, producing negative timestamps that crash the entire billing pipeline.

Infrastructure | Pattern: UNKNOWN | Severity: CRITICAL | Confidence: 95% | Remote Hands | 17 correlated
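The overflow is easy to reproduce. A minimal sketch that wraps a Unix timestamp the way a signed 32-bit time_t does (the 2040 contract date is taken from the scenario):

```python
from datetime import datetime, timezone

def to_int32(ts: int) -> int:
    """Wrap a Unix timestamp into signed 32-bit range, as a legacy C
    component storing time_t in an int32 would."""
    return (ts + 2**31) % 2**32 - 2**31

contract_end = int(datetime(2040, 1, 1, tzinfo=timezone.utc).timestamp())
print(contract_end)            # 2208988800 -- fine as a 64-bit value
print(to_int32(contract_end))  # negative: the sign bit flipped past 2038-01-19
```

Any date after 2038-01-19 03:14:07 UTC goes negative in the 32-bit view, which is exactly the failure mode described above.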

Time Zone Mismatch Between Database and Application Servers

PASS

After a server migration, the new application servers are configured in UTC while the database remains in US/Eastern. This 4- or 5-hour offset (depending on daylight saving time) causes scheduled jobs to run at the wrong times, time-based queries to return incorrect results, and audit timestamps to be inconsistent.

Infrastructure | Pattern: UNKNOWN | Severity: CRITICAL | Confidence: 95% | Auto-Heal | 17 correlated
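The skew is visible by attaching each side's zone to the same wall-clock schedule. A small sketch (the 02:00 job time is a hypothetical example):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# The same wall-clock schedule ("run at 02:00") lands on different instants
# depending on which zone the host is configured in.
run_at = datetime(2024, 6, 1, 2, 0)

as_utc = run_at.replace(tzinfo=ZoneInfo("UTC"))
as_eastern = run_at.replace(tzinfo=ZoneInfo("US/Eastern"))

skew = as_eastern - as_utc
print(skew)  # 4:00:00 in summer (EDT); would be 5:00:00 in winter (EST)
```

Storing and comparing timestamps in UTC end to end, and treating local time as a display concern, removes this class of bug.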

Unicode Domain Name (IDN) Causing DNS Resolution Failure

PASS

A customer-facing application that processes international domain names has a punycode encoding bug that causes DNS lookups to fail for domains containing non-ASCII characters. The bug was introduced in a library update and affects 15% of international customers whose email domains contain Unicode characters.

Infrastructure | Pattern: DNS_FAILURE | Severity: CRITICAL | Confidence: 95% | Remote Hands | 18 correlated
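What correct punycode handling should produce can be checked with Python's built-in "idna" codec (which implements the older IDNA 2003 rules; the domain below is the classic example, not one from the incident):

```python
# An internationalized domain must be converted to its ASCII (punycode) form
# before it is handed to DNS; skipping or botching this step makes the
# lookup fail for any label with non-ASCII characters.
def to_ascii_dns_name(domain: str) -> str:
    return domain.encode("idna").decode("ascii")

print(to_ascii_dns_name("bücher.example"))  # xn--bcher-kva.example
```

Libraries that resolve the Unicode form directly, or encode only some labels, exhibit exactly the partial-failure pattern described above.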

IPv6 Transition Issue — Dual-Stack Misconfiguration

PASS

A network upgrade enables IPv6 on production servers without properly configuring the application firewall rules for IPv6. The application binds to both IPv4 and IPv6, but the firewall only has IPv4 rules. The IPv6 interface is completely unprotected, and port scans from the internet are discovering open management ports via IPv6.

Infrastructure | Pattern: FIREWALL_RULE_BLOCK | Severity: CRITICAL | Confidence: 95% | Auto-Heal | 18 correlated

Leap Second Causing Time Synchronization Drift

PASS

A positive leap second insertion causes time synchronization issues across the infrastructure. Some servers handle it by stepping the clock, others by smearing, and a few that missed the NTP update have clocks 1 second ahead. This causes TLS handshake failures, database replication errors, and authentication failures due to clock skew.

Infrastructure | Pattern: UNKNOWN | Severity: CRITICAL | Confidence: 95% | Remote Hands | 18 correlated

Daylight Saving Time Causing Cron Job Double-Execution

PASS

During the 'fall back' daylight saving time transition, cron jobs scheduled between 1:00 AM and 2:00 AM execute twice because the clock passes through that hour twice. A billing job runs twice, double-charging 15,000 customers. A cleanup job runs twice, deleting data that the first run already processed.

Infrastructure | Pattern: UNKNOWN | Severity: CRITICAL | Confidence: 95% | Remote Hands | 26 correlated
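The repeated hour can be demonstrated with PEP 495 `fold` semantics, using the actual 2024 US transition date as an example:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

tz = ZoneInfo("America/New_York")
# 2024-11-03 01:30 local time occurs twice: once in EDT, once in EST.
first = datetime(2024, 11, 3, 1, 30, tzinfo=tz, fold=0)   # first pass (EDT)
second = datetime(2024, 11, 3, 1, 30, tzinfo=tz, fold=1)  # second pass (EST)

print(first.astimezone(timezone.utc))   # 05:30 UTC
print(second.astimezone(timezone.utc))  # 06:30 UTC -- same wall clock, 1h later
```

Scheduling jobs in UTC, or making them idempotent, prevents the double-charge and double-delete effects described above.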

OAuth2 Provider Outage — All SSO Logins Failing

PASS

The external OAuth2/OIDC identity provider (Okta) is experiencing a major outage. All SSO login attempts fail because the authorization endpoint is unreachable. Users cannot authenticate to any application that relies on Okta for SSO, affecting the entire organization.

Security | Pattern: CONNECTION_REFUSED | Severity: CRITICAL | Confidence: 95% | Remote Hands | 36 correlated

JWT Token Signing Key Rotation Failure — Authentication Broken

PASS

A JWT signing key rotation was performed on the authentication service, but the new public key was not distributed to 4 of 7 microservices that validate tokens. These services are rejecting all tokens signed with the new key, while the auth service has stopped issuing tokens with the old key.

Security | Pattern: UNKNOWN | Severity: CRITICAL | Confidence: 95% | Auto-Heal | 36 correlated
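The standard fix is for validators to hold a keyset indexed by the token's `kid` header, keeping the old key trusted until its last tokens expire. A minimal sketch (key IDs and key material are placeholders, not values from the incident):

```python
# Rotation-safe verification-key lookup: tokens signed with either the old
# or the new key validate during the rollover window.
class KeyNotTrusted(Exception):
    pass

TRUSTED_KEYS = {
    "2024-01": "<old public key PEM>",  # keep until the last old-key token expires
    "2024-07": "<new public key PEM>",
}

def select_verification_key(token_kid: str) -> str:
    """Pick the verification key named by the token's 'kid' header."""
    try:
        return TRUSTED_KEYS[token_kid]
    except KeyError:
        raise KeyNotTrusted(f"unknown kid {token_kid!r}") from None
```

The outage above corresponds to 4 of 7 services having only the `"2024-01"` entry while the auth service signs with `"2024-07"`.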

GraphQL N+1 Query — Database Overload

PASS

A new GraphQL resolver deployed to production has an N+1 query problem. A single client query that fetches a list of 500 items triggers 501 database queries (1 for the list + 500 for related data). With 200 concurrent users making this query, the database is executing 100,000+ queries per second and grinding to a halt.

Infrastructure | Pattern: SERVER_ERROR | Severity: CRITICAL | Confidence: 95% | Remote Hands | 18 correlated
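The 501-vs-2 query arithmetic can be illustrated with a toy resolver, where a dict stands in for the database and a counter stands in for round-trips (all names here are illustrative):

```python
# N+1 pattern vs. a batched lookup, with query_count simulating DB round-trips.
AUTHORS = {i: f"author-{i}" for i in range(500)}
query_count = 0

def fetch_items():
    global query_count
    query_count += 1                      # 1 query for the list
    return list(range(500))

def fetch_author(item_id):
    global query_count
    query_count += 1                      # +1 query per item: the N+1 problem
    return AUTHORS[item_id]

def fetch_authors_batch(item_ids):
    global query_count
    query_count += 1                      # one IN (...)-style batched query
    return {i: AUTHORS[i] for i in item_ids}

naive = [fetch_author(i) for i in fetch_items()]
print(query_count)                        # 501

query_count = 0
batched = fetch_authors_batch(fetch_items())
print(query_count)                        # 2
```

Dataloader-style batching in the resolver collapses the per-item lookups into the second shape, which is why it is the usual remediation here.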

WebSocket Connection Storm — Server Connection Limit Reached

PASS

A mobile app update with a WebSocket reconnection bug is causing millions of simultaneous WebSocket connection attempts. Each failed connection immediately retries, creating a connection storm that exhausts the server's file descriptor limit and blocks all new connections including the web application.

Infrastructure | Pattern: CONNECTION_REFUSED | Severity: CRITICAL | Confidence: 95% | Remote Hands | 18 correlated
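The client-side fix for immediate-retry storms is exponential backoff with jitter, so reconnect attempts spread out instead of arriving in synchronized waves. A minimal sketch (base and cap values are illustrative):

```python
import random

def reconnect_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Seconds to wait before reconnect attempt `attempt` (0-indexed).

    "Full jitter": pick uniformly in [0, min(cap, base * 2**attempt)] so
    millions of clients do not retry at the same instant.
    """
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Server-side, a per-IP connection rate limit and a raised file-descriptor ulimit buy time while the fixed app rolls out.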

API Rate Limit Exceeded — Partner Integration Blocked

PASS

A partner integration is being blocked by API rate limiting after a bug in the partner's code causes it to retry failed requests in a tight loop. The rate limiter is correctly blocking the abusive traffic but is also blocking legitimate requests from the same partner, disrupting a critical business integration.

Infrastructure | Pattern: FIREWALL_RULE_BLOCK | Severity: CRITICAL | Confidence: 85% | Remote Hands | 18 correlated
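Rate limiters of this kind are commonly token buckets; the scenario's collateral damage comes from keying the bucket per partner rather than per endpoint or per client instance. A minimal sketch (rates are illustrative):

```python
import time

class TokenBucket:
    """Token bucket: `rate` tokens/sec refill, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=10, capacity=20)       # 10 req/s, bursts of 20
allowed = sum(bucket.allow() for _ in range(100))
print(allowed)  # the initial burst passes; the tight retry loop is rejected
```

Scoping buckets more finely (per endpoint, per API key, per client process) lets the limiter shed the retry loop without starving the partner's legitimate calls.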

Redis Cache Eviction Storm — Database Overwhelmed

PASS

Redis has hit its maxmemory limit and is aggressively evicting cached entries under the volatile-lru policy. The cache hit ratio has dropped from 95% to 12%, causing a thundering herd of cache misses that is overwhelming the backend database with 50x the normal query load.

Infrastructure | Pattern: UNKNOWN | Severity: CRITICAL | Confidence: 95% | Remote Hands | 22 correlated
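A common application-side mitigation for the thundering herd is "single-flight" loading: when a hot key is evicted, only one caller recomputes it while concurrent requests for the same key wait. A thread-based sketch (names are illustrative, not from any particular cache library):

```python
import threading

class SingleFlight:
    """Coalesce concurrent cache-miss loads for the same key into one call."""

    def __init__(self):
        self._locks: dict = {}
        self._guard = threading.Lock()
        self.cache: dict = {}

    def get(self, key, loader):
        if key in self.cache:
            return self.cache[key]
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have loaded the key while we waited.
            if key not in self.cache:
                self.cache[key] = loader()
            return self.cache[key]
```

With this in place, an eviction storm produces one database query per hot key instead of one per waiting request.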

Kafka Consumer Lag Critical — Event Processing Stalled

PASS

The Kafka consumer group for the real-time analytics pipeline has accumulated 50 million unprocessed messages across 24 partitions. The consumers crashed when the schema registry became unavailable, and auto-restart is failing because the registry is still down.

Infrastructure | Pattern: UNKNOWN | Severity: CRITICAL | Confidence: 85% | Remote Hands | 22 correlated

Microservice Circuit Breaker Open — Cascading Failures

PASS

The payment service circuit breaker has tripped open after the payment gateway became unresponsive. This causes the order service to fail, which causes the cart service to fail, creating a cascade of open circuit breakers across 5 microservices. The entire checkout flow is non-functional.

Infrastructure | Pattern: CONNECTION_REFUSED | Severity: CRITICAL | Confidence: 85% | Remote Hands | 30 correlated
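The mechanism behind the cascade is the breaker's state machine: closed while healthy, open after repeated failures, half-open after a cooldown to probe recovery. A minimal sketch (thresholds are illustrative; a real deployment adds timeouts, fallbacks, and per-dependency bulkheads):

```python
import time

class CircuitBreaker:
    """Closed -> open after repeated failures -> half-open after a cooldown."""

    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if time.monotonic() - self.opened_at >= self.reset_after:
            return "half-open"   # allow one trial request through
        return "open"

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

    def record_success(self):
        self.failures = 0
        self.opened_at = None
```

Without a fallback (cached response, queued order, degraded mode), an open breaker simply propagates its failure upstream, which is how one gateway outage opens breakers across five services.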

SIEM Correlation Engine Overloaded — Security Events Unprocessed

PASS

The SIEM correlation engine is overwhelmed by a 10x log volume increase caused by misconfigured firewall debug logging. The correlation engine cannot keep up, creating a 3-hour processing backlog. Real-time security alerting is non-functional during this period.

Infrastructure | Pattern: HIGH_CPU | Severity: CRITICAL | Confidence: 92% | Remote Hands | 18 correlated

PagerDuty Webhook Delivery Failure — Alerts Not Reaching On-Call

PASS

PagerDuty webhook integrations for all monitoring tools have been failing for 6 hours due to an expired mutual TLS credential on the PagerDuty integration proxy. Alerts are being generated by monitoring systems but never reaching the on-call team. Multiple production incidents have gone unnoticed.

Infrastructure | Pattern: UNKNOWN | Severity: CRITICAL | Confidence: 95% | Auto-Heal | 18 correlated

Datadog Agent Mass Disconnect — Fleet Monitoring Gap

PASS

A Datadog API key rotation was performed incorrectly, deploying the new key to only 30 of 500 servers. The remaining 470 agents are now sending data with the revoked API key, resulting in silent data loss. Datadog shows green dashboards but only from 30 servers.

Infrastructure | Pattern: SERVER_ERROR | Severity: CRITICAL | Confidence: 95% | Auto-Heal | 18 correlated

ELK Stack Disk Full — Log Ingestion Halted

PASS

The Elasticsearch cluster backing the ELK logging stack has run out of disk space. Elasticsearch has entered read-only mode, Logstash is backing up, and no new logs are being indexed. Security event logs, application logs, and audit trails are all being dropped.

Infrastructure | Pattern: DISK_FULL | Severity: CRITICAL | Confidence: 95% | Remote Hands | 22 correlated

Prometheus/Grafana Stack Down — Monitoring Blind Spot

PASS

The entire monitoring stack (Prometheus, Grafana, Alertmanager) has gone down due to a persistent volume running out of IOPS on the Kubernetes node. No metrics are being collected, no dashboards are accessible, and no alerts are firing. The infrastructure team is flying blind.

Infrastructure | Pattern: STORAGE_IO_LATENCY | Severity: CRITICAL | Confidence: 90% | Remote Hands | 30 correlated
Page 1 of 14

Every scenario is tested against Corax's Neural Engine in a production environment with AI-powered root cause analysis.

Tests run continuously as new infrastructure patterns are added.