We test Corax against real-world infrastructure failures spanning the major vendors, platforms, and failure modes enterprises actually run. Browse the results below.
Active Directory password hash synchronization between on-premises AD and Azure AD breaks after a domain controller is decommissioned. Users who change their on-premises passwords find their Azure AD passwords still use the old value, causing login failures for cloud services.
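In a sync-drift scenario like this, one hedged way to surface candidates is to query on-premises AD for recently changed passwords and flag them for verification on the cloud side. A minimal sketch, assuming the `ldap3` library and a hypothetical domain controller, bind account, and base DN (not a confirmed Corax check):

```python
# Minimal sketch: list users whose on-prem password changed recently, as
# candidates to verify against Azure AD. Hostname, bind account, and base
# DN are hypothetical.
from datetime import datetime, timedelta, timezone
from ldap3 import Server, Connection, ALL

WINDOWS_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

def filetime_to_dt(filetime: int) -> datetime:
    # pwdLastSet is a Windows FILETIME: 100-ns ticks since 1601-01-01 UTC.
    return WINDOWS_EPOCH + timedelta(microseconds=filetime // 10)

server = Server("dc01.corp.example.com", get_info=ALL)
conn = Connection(server, user="svc-audit@corp.example.com",
                  password="...", auto_bind=True)
conn.search("dc=corp,dc=example,dc=com",
            "(&(objectClass=user)(objectCategory=person))",
            attributes=["sAMAccountName", "pwdLastSet"])

cutoff = datetime.now(timezone.utc) - timedelta(hours=24)
for entry in conn.entries:
    raw = entry.pwdLastSet.value
    # ldap3 may already decode pwdLastSet to a datetime; handle both forms.
    changed = raw if isinstance(raw, datetime) else filetime_to_dt(int(raw or 0))
    if changed >= cutoff:
        print(f"{entry.sAMAccountName}: changed {changed:%Y-%m-%d %H:%M}Z; "
              "verify the new password works against Azure AD")
```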
The ADFS token signing certificate expires, breaking all federated SSO authentication. Users cannot sign into Office 365, SaaS applications, or any relying party trusts configured to use ADFS for authentication.
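Because ADFS publishes its token-signing certificate in its federation metadata document, expiry can be watched externally. A minimal sketch, assuming `requests`, `cryptography` 42+ (for `not_valid_after_utc`), and a hypothetical ADFS hostname:

```python
# Sketch: pull certificates from ADFS federation metadata and warn before
# expiry. Checks every cert embedded in the metadata (token-signing and
# token-decrypting alike). Hostname is hypothetical.
import base64
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

import requests
from cryptography import x509

METADATA_URL = ("https://adfs.corp.example.com"
                "/FederationMetadata/2007-06/FederationMetadata.xml")

resp = requests.get(METADATA_URL, timeout=10)
resp.raise_for_status()
root = ET.fromstring(resp.content)

# Certificates appear as base64 <X509Certificate> elements (XML-DSig ns).
for cert_el in root.iter("{http://www.w3.org/2000/09/xmldsig#}X509Certificate"):
    cert = x509.load_der_x509_certificate(base64.b64decode(cert_el.text))
    days_left = (cert.not_valid_after_utc - datetime.now(timezone.utc)).days
    status = "OK" if days_left > 30 else "RENEW NOW"
    print(f"{cert.subject.rfc4514_string()}: {days_left} days left [{status}]")
```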
DFS Replication backlog between two domain controllers reaches critical levels, with SYSVOL replication lagging by thousands of files. Group Policy is inconsistent across the domain, and some users receive stale GPOs depending on which DC they authenticate against.
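One hedged way to watch for this is to sample the SYSVOL backlog with the built-in `dfsrdiag` tool. A sketch, assuming a Windows host with the DFS management tools installed and hypothetical DC names:

```python
# Sketch: measure SYSVOL replication backlog between two domain
# controllers via dfsrdiag. DC names are hypothetical.
import re
import subprocess

cmd = [
    "dfsrdiag", "backlog",
    "/rgname:Domain System Volume",
    "/rfname:SYSVOL Share",
    "/smem:DC01", "/rmem:DC02",
]
out = subprocess.run(cmd, capture_output=True, text=True).stdout

# dfsrdiag prints either "No Backlog" or "Backlog File Count: N".
match = re.search(r"Backlog File Count:\s*(\d+)", out)
backlog = int(match.group(1)) if match else 0
print(f"DC01 -> DC02 SYSVOL backlog: {backlog} files")
if backlog > 1000:
    print("WARNING: backlog critical; expect inconsistent Group Policy")
```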
Active Directory inter-site replication breaks after a network change removes IP connectivity between two AD sites. The Knowledge Consistency Checker (KCC) cannot generate a replication topology, and AD objects created in one site do not replicate to the other.
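A sketch of how broken replication links can be enumerated from `repadmin`'s CSV output, assuming a domain-joined Windows host with the AD DS tools; column names follow `repadmin /showrepl * /csv`:

```python
# Sketch: detect broken AD replication links from repadmin CSV output.
import csv
import io
import subprocess

out = subprocess.run(
    ["repadmin", "/showrepl", "*", "/csv"],
    capture_output=True, text=True,
).stdout

for row in csv.DictReader(io.StringIO(out)):
    failures = int(row.get("Number of Failures", "0") or 0)
    if failures > 0:
        print(f"{row['Source DSA']} -> {row['Destination DSA']} "
              f"({row['Naming Context']}): {failures} consecutive failures, "
              f"last error {row.get('Last Failure Status')}")
```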
The robotic tape library experiences a mechanical jam in the tape picker mechanism, preventing all tape load/unload operations. Backup jobs queue indefinitely as no tapes can be mounted, and offsite tape rotation is halted.
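A hedged probe for distinguishing a mechanical fault from a software queueing problem, assuming a Linux backup host with the `mtx` changer tool; device path, slot, and drive numbers are hypothetical:

```python
# Sketch: probe the changer and attempt one tape move; a load that errors
# or hangs points at a mechanical fault rather than job scheduling.
import subprocess

CHANGER = "/dev/sg3"  # hypothetical changer device

status = subprocess.run(["mtx", "-f", CHANGER, "status"],
                        capture_output=True, text=True)
if status.returncode != 0:
    print(f"changer not responding: {status.stderr.strip()}")

try:
    load = subprocess.run(["mtx", "-f", CHANGER, "load", "1", "0"],
                          capture_output=True, text=True, timeout=300)
    if load.returncode != 0:
        print(f"tape load failed: {load.stderr.strip()} -- possible picker jam")
except subprocess.TimeoutExpired:
    print("tape load hung for 5 minutes -- picker likely jammed")
```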
The S3-compatible object storage (MinIO) reaches its configured quota limit. Application uploads fail, log shipping stops, and backup jobs writing to object storage all fail simultaneously.
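A canary write catches quota exhaustion before production jobs hit it. A sketch using `boto3` against a hypothetical MinIO endpoint and bucket; since the exact error code a full bucket returns varies by MinIO version, any failed canary write is treated as actionable:

```python
# Sketch: canary write to a MinIO bucket to catch quota exhaustion early.
# Endpoint, credentials, and bucket are hypothetical.
import datetime

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client(
    "s3",
    endpoint_url="https://minio.corp.example.com:9000",
    aws_access_key_id="canary",
    aws_secret_access_key="...",
)

key = f"canary/{datetime.datetime.utcnow():%Y%m%dT%H%M%S}.txt"
try:
    s3.put_object(Bucket="backups", Key=key, Body=b"canary")
    s3.delete_object(Bucket="backups", Key=key)
    print("object storage accepting writes")
except ClientError as exc:
    code = exc.response["Error"]["Code"]
    print(f"canary write failed ({code}); check bucket quota and capacity")
```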
Multiple Ceph OSDs fail simultaneously on a storage node after a power supply unit failure, triggering a massive data rebalancing operation. The cluster enters HEALTH_WARN state and client I/O is severely impacted during recovery.
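A sketch of spotting correlated OSD loss from `ceph status` JSON, run where a Ceph admin keyring is available; the exact field layout varies slightly across Ceph releases, so both the nested and flat `osdmap` forms are handled:

```python
# Sketch: spot correlated OSD loss from `ceph status` JSON.
import json
import subprocess

status = json.loads(subprocess.run(
    ["ceph", "status", "--format", "json"],
    capture_output=True, text=True, check=True).stdout)

# Older releases nest the counters one level deeper.
osdmap = status["osdmap"].get("osdmap", status["osdmap"])
down = osdmap["num_osds"] - osdmap["num_up_osds"]
print(f"health: {status['health']['status']}, OSDs down: {down}")
if down >= 2:
    print("multiple OSDs down at once -- check for a shared failure "
          "domain (host, PSU, rack) before treating them as disk faults")
```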
An iSCSI storage target becomes unreachable on the primary path after a switch failure. Multipath failover engages but the secondary path is congested, causing severe performance degradation for all connected initiators.
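A sketch of flagging degraded paths by parsing `multipath -ll`, which requires root on a Linux host running multipathd:

```python
# Sketch: flag failed or faulty paths in `multipath -ll` output.
import subprocess

out = subprocess.run(["multipath", "-ll"],
                     capture_output=True, text=True).stdout

current_map = None
for line in out.splitlines():
    if line and not line.startswith((" ", "|", "`")):
        current_map = line.split()[0]  # a map name heads each block
    elif "failed" in line or "faulty" in line:
        print(f"{current_map}: degraded path -> {line.strip()}")
```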
A ZFS storage pool enters degraded state after a drive failure in a RAIDZ2 vdev. The pool remains operational but with reduced redundancy. A second drive in the same vdev is showing SMART warnings, indicating imminent failure.
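A sketch combining `zpool status -x` with per-disk SMART health, using a hypothetical device list; the point is to catch the second failing drive before the RAIDZ2 vdev loses its remaining parity:

```python
# Sketch: check pool health, then SMART status of the surviving members.
import subprocess

# `zpool status -x` prints "all pools are healthy" when nothing is wrong.
pools = subprocess.run(["zpool", "status", "-x"],
                       capture_output=True, text=True).stdout
print(pools.strip())

if "all pools are healthy" not in pools:
    for dev in ("/dev/sda", "/dev/sdb", "/dev/sdc"):  # hypothetical members
        smart = subprocess.run(["smartctl", "-H", dev],
                               capture_output=True, text=True).stdout
        if "PASSED" not in smart:
            print(f"{dev}: SMART health not passing -- replace before a "
                  "second vdev member fails")
```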
The active controller on a Pure Storage FlashArray fails, triggering an automatic failover to the standby controller. During the failover, I/O is briefly paused and disk queue depth spikes, causing latency-sensitive applications to experience errors.
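Independent of any one vendor's API, a crude timed-read canary makes failover pauses visible from the host side. A sketch with a hypothetical multipath device and threshold; note that without O_DIRECT the page cache can mask real device latency, so this understates the problem:

```python
# Sketch: time small reads from a LUN and flag pauses like those seen
# during a controller failover. Device path and threshold are hypothetical.
import os
import time

DEV = "/dev/mapper/mpatha"   # hypothetical multipath device
THRESHOLD_MS = 500

fd = os.open(DEV, os.O_RDONLY)
try:
    while True:
        start = time.monotonic()
        os.pread(fd, 4096, 0)          # read one block from offset 0
        elapsed_ms = (time.monotonic() - start) * 1000
        if elapsed_ms > THRESHOLD_MS:
            print(f"I/O pause: read took {elapsed_ms:.0f} ms")
        time.sleep(1)
finally:
    os.close(fd)
```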
Thinly provisioned LUNs on an EMC Unity storage array are oversubscribed beyond physical capacity. The storage pool runs out of space, causing write failures on every LUN in the pool. VMware datastores become read-only and VMs begin crashing.
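The underlying risk here is arithmetic: the sum of provisioned LUN sizes exceeds what the pool can physically hold. A sketch of the headroom math with hypothetical numbers; in practice the inputs would come from the array's API or CLI:

```python
# Sketch: thin-provisioning headroom math. All numbers are hypothetical.
pool_physical_tb = 100.0
pool_used_tb = 97.5
luns_provisioned_tb = [40.0, 40.0, 40.0, 30.0]  # logical LUN sizes

oversubscription = sum(luns_provisioned_tb) / pool_physical_tb
headroom_tb = pool_physical_tb - pool_used_tb

print(f"oversubscription ratio: {oversubscription:.1f}:1")
print(f"physical headroom: {headroom_tb:.1f} TB")
if headroom_tb / pool_physical_tb < 0.05:
    print("CRITICAL: <5% physical space left; new writes to thin LUNs "
          "will fail even though the LUNs appear to have free space")
```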
A NetApp ONTAP volume goes offline due to an aggregate running out of space, causing all LUNs and NFS exports on that volume to become unavailable. Multiple application servers lose access to their primary storage.
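A hedged sketch of enumerating offline volumes through the REST API available in ONTAP 9.6+, with a hypothetical cluster address and credentials; the field names used are assumptions based on that API's documented shape:

```python
# Sketch: list offline volumes via the ONTAP REST API (9.6+).
import requests

resp = requests.get(
    "https://cluster1.corp.example.com/api/storage/volumes",
    params={"state": "offline", "fields": "name,aggregates,state"},
    auth=("monitor", "..."),
    timeout=10,
)
resp.raise_for_status()

for vol in resp.json().get("records", []):
    aggrs = ",".join(a["name"] for a in vol.get("aggregates", []))
    print(f"volume {vol['name']} offline (aggregate(s): {aggrs}) -- "
          "check aggregate free space before bringing it back online")
```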
During a new client onboarding, the automated network discovery scan fails to complete due to aggressive IDS/IPS rules on the client firewall. The scan times out after 4 hours with only 30% of the network discovered. The MSP has an incomplete view of the client infrastructure.
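A gentler discovery pass can trade speed for completeness. A sketch that scans hypothetical client ranges one /24 at a time with polite `nmap` timing and per-chunk timeouts, so one IPS-blocked range cannot stall the rest:

```python
# Sketch: IDS-friendlier discovery -- small chunks, polite timing, and a
# per-chunk timeout. Subnets and limits are hypothetical.
import ipaddress
import subprocess

SUBNETS = ["10.10.0.0/22"]   # hypothetical client ranges

for subnet in SUBNETS:
    for chunk in ipaddress.ip_network(subnet).subnets(new_prefix=24):
        try:
            result = subprocess.run(
                ["nmap", "-sn", "-T2", "--max-retries", "1",
                 "--host-timeout", "30s", str(chunk)],
                capture_output=True, text=True, timeout=900,
            )
            print(f"{chunk}: {result.stdout.count('Host is up')} hosts responded")
        except subprocess.TimeoutExpired:
            print(f"{chunk}: chunk timed out; likely filtered -- continuing")
```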
During a scheduled SNMP community string rotation across client infrastructure, 40% of devices fail to update to the new community string. The NOC monitoring platform can no longer poll these devices, creating a critical blind spot across 6 client networks.
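Rotation jobs like this benefit from a verification pass. A sketch using the classic pysnmp 4.x `hlapi` (newer pysnmp releases use an async API) with hypothetical hosts and community strings:

```python
# Sketch: verify each device answers on the NEW community string and flag
# any still stuck on the old one. Hosts and strings are hypothetical.
from pysnmp.hlapi import (
    getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
    ContextData, ObjectType, ObjectIdentity,
)

SYS_NAME = "1.3.6.1.2.1.1.5.0"  # sysName.0

def polls_ok(host: str, community: str) -> bool:
    error_indication, error_status, _, _ = next(getCmd(
        SnmpEngine(), CommunityData(community),
        UdpTransportTarget((host, 161), timeout=2, retries=1),
        ContextData(), ObjectType(ObjectIdentity(SYS_NAME)),
    ))
    return error_indication is None and not error_status

for host in ["10.20.0.1", "10.20.0.2"]:          # hypothetical devices
    if polls_ok(host, "new-string"):
        continue
    stuck = polls_ok(host, "old-string")
    print(f"{host}: not answering on new community "
          f"({'still on OLD string' if stuck else 'not answering at all'})")
```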
The ConnectWise Manage PSA platform becomes completely unreachable after a database failover goes wrong. The MSP service desk cannot create, update, or view tickets. Automated ticket creation from monitoring alerts queues up and eventually starts dropping. SLA tracking is offline.
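The dropped-alert failure mode is avoidable: spool alerts to local disk while the PSA is unreachable and replay them when it returns. A sketch with a hypothetical ticket endpoint and payload shape:

```python
# Sketch: spool alerts locally while the PSA is down; drain oldest-first
# once it returns. Endpoint and payload shape are hypothetical.
import json
import pathlib
import time

import requests

SPOOL = pathlib.Path("/var/spool/psa-alerts")
SPOOL.mkdir(parents=True, exist_ok=True)
PSA_URL = "https://psa.msp.example.com/api/tickets"  # hypothetical

def submit(alert: dict) -> bool:
    try:
        return requests.post(PSA_URL, json=alert, timeout=5).status_code < 300
    except requests.RequestException:
        return False

def handle(alert: dict) -> None:
    if not submit(alert):
        # PSA unreachable: persist the alert instead of losing it.
        (SPOOL / f"{time.time_ns()}.json").write_text(json.dumps(alert))

def drain() -> None:
    # Called periodically; replays spooled alerts oldest-first.
    for path in sorted(SPOOL.glob("*.json")):
        if submit(json.loads(path.read_text())):
            path.unlink()
        else:
            break  # still down; retry next cycle
```

Here `handle()` would be called from the alert pipeline and `drain()` from a periodic timer, so SLA-relevant tickets are delayed rather than dropped.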
The shared cloud backup repository used for 15 MSP clients becomes corrupted after a storage controller firmware bug. Backup jobs for all tenants fail with integrity check errors. The most recent valid restore point for some clients is 72 hours old, violating SLA RPO requirements.
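RPO compliance here reduces to comparing the newest restore point that verified clean against each client's RPO. A sketch with hypothetical data; a real check would pull restore-point ages from the backup platform's API:

```python
# Sketch: flag clients whose newest verified restore point breaches RPO.
# Data is hypothetical.
from datetime import datetime, timedelta, timezone

RPO = timedelta(hours=24)
now = datetime.now(timezone.utc)

# client -> newest restore point that passed integrity verification
restore_points = {
    "client-a": now - timedelta(hours=6),
    "client-b": now - timedelta(hours=72),
}

for client, newest_valid in restore_points.items():
    age = now - newest_valid
    if age > RPO:
        print(f"{client}: RPO VIOLATION -- newest valid restore point is "
              f"{age.total_seconds() / 3600:.0f}h old against a "
              f"{RPO.total_seconds() / 3600:.0f}h RPO")
```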
A failed RMM platform update pushes a corrupt agent binary to all managed endpoints. The agent enters a crash loop on 400+ devices across 12 client organizations, leaving the MSP completely blind to endpoint health and unable to run remote management tasks.
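A local watchdog can break a crash loop by rolling back to the previous binary rather than endlessly restarting a corrupt one. A sketch with hypothetical paths and thresholds:

```python
# Sketch: detect an agent crash loop (N restarts within a window) and
# restore the last known-good binary. Paths and limits are hypothetical.
import shutil
import subprocess
import time

AGENT_CMD = ["/opt/rmm/agent"]
BACKUP_BIN = "/opt/rmm/agent.previous"
WINDOW_S, MAX_RESTARTS = 300, 5

restarts: list[float] = []
while True:
    subprocess.run(AGENT_CMD)          # blocks until the agent exits
    now = time.monotonic()
    restarts = [t for t in restarts if now - t < WINDOW_S] + [now]
    if len(restarts) >= MAX_RESTARTS:
        # Crash loop: roll back to the previous binary and keep going.
        shutil.copy2(BACKUP_BIN, AGENT_CMD[0])
        restarts.clear()
        print("crash loop detected; rolled agent back to previous binary")
    time.sleep(2)  # brief backoff between restarts
```

Staged rollouts with a canary ring would catch the corrupt build before it ever reached 400 devices; the watchdog is the last line of defense.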
The captive portal web server crashes after its VM runs out of memory handling a spike in connections, preventing guests from completing the splash page authentication. Guest devices connect to WiFi but cannot access the internet because the portal redirect times out.
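An external health probe that restarts the portal buys time while the capacity problem is fixed (a hard memory limit on the service is the durable fix). A sketch with a hypothetical portal URL and systemd unit name:

```python
# Sketch: restart the captive portal when the splash page stops answering.
# URL and service name are hypothetical.
import subprocess

import requests

PORTAL_URL = "https://portal.guest.example.com/splash"

def portal_healthy() -> bool:
    try:
        return requests.get(PORTAL_URL, timeout=5).status_code == 200
    except requests.RequestException:
        return False

if not portal_healthy():
    # Recover the service, then alert so the capacity problem gets fixed.
    subprocess.run(["systemctl", "restart", "captive-portal"], check=False)
    print("portal unresponsive; restarted service -- investigate memory use")
```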
A neighboring tenant in the shared office building installs high-power wireless equipment on overlapping channels, causing severe co-channel interference. Client devices experience packet loss, low throughput, and frequent disconnections across the entire 2.4GHz band and DFS channels on 5GHz.
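A rough channel census shows where a noisy neighbor is camped. A sketch parsing `iw` scan output on a hypothetical interface; requires root, and the output format differs slightly across `iw` versions:

```python
# Sketch: count visible BSSes per frequency from an `iw` scan.
# Interface name is hypothetical.
import collections
import subprocess

out = subprocess.run(["iw", "dev", "wlan0", "scan"],
                     capture_output=True, text=True).stdout

# Each BSS block includes a "freq: <MHz>" line; tally BSSes per frequency.
per_freq = collections.Counter(
    int(float(line.split(":")[1])) for line in out.splitlines()
    if line.strip().startswith("freq:")
)
for freq, count in per_freq.most_common():
    print(f"{freq} MHz: {count} BSSes")
```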
The RADIUS server certificate used for EAP-TLS authentication expires, causing all 802.1X wireless clients to fail authentication. Supplicants reject the expired certificate, and no clients can connect to the enterprise SSID. Guest network remains functional.
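This failure is fully preventable with an expiry watch on the EAP-TLS server certificate. A sketch using `cryptography` 42+ against a hypothetical FreeRADIUS-style PEM path:

```python
# Sketch: warn before the RADIUS server's EAP-TLS certificate expires.
# The PEM path is hypothetical.
from datetime import datetime, timezone

from cryptography import x509

with open("/etc/raddb/certs/server.pem", "rb") as f:   # hypothetical path
    cert = x509.load_pem_x509_certificate(f.read())

days_left = (cert.not_valid_after_utc - datetime.now(timezone.utc)).days
if days_left <= 30:
    print(f"RADIUS EAP-TLS cert expires in {days_left} days -- rotate now, "
          "or every 802.1X client will start failing authentication")
```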
Every scenario is tested against Corax's Neural Engine in a production environment with AI-powered root cause analysis.
Tests run continuously as new infrastructure patterns are added.