An AWS RDS PostgreSQL Multi-AZ instance experiences a hardware failure in the primary AZ. Automatic failover to the standby in the secondary AZ is triggered, and applications experience 60-120 seconds of downtime. The DNS endpoint then resolves to the new primary.
Pattern: AWS_CLOUD
Severity: CRITICAL
Confidence: 95%
Remediation: Remote Hands
Test Results

| Metric | Expected | Actual | Result |
|---|---|---|---|
| Pattern Recognition | AWS_CLOUD | AWS_CLOUD | |
| Severity Assessment | CRITICAL | CRITICAL | |
| Incident Correlation | Yes | 34 linked | |
| Cascade Escalation | Yes | Yes | |
| Remediation | — | Remote Hands — Corax contacts on-site support via call, email, or API | |
Scenario Conditions
AWS RDS PostgreSQL 15. Multi-AZ deployment. Primary in us-east-1a, standby in us-east-1b. db.r6g.xlarge instance. 3 application servers connecting via RDS endpoint. Connection pooling with PgBouncer.
Injected Error Messages (3)
AWS RDS Multi-AZ failover in progress — RDS event: 'Multi-AZ instance failover started', hardware issue on primary in us-east-1a, promoting standby in us-east-1b, DNS propagation in progress, estimated downtime: 60-120 seconds, Event ID: RDS-EVENT-0049
Application database connection errors — PgBouncer reporting 'server conn crashed' for 12 active connections, PostgreSQL connection to RDS endpoint refused during failover, retrying with exponential backoff, HTTP 503 responses to clients
Application health check failing — ECS task reporting unhealthy, database connection pool exhausted after RDS failover, all connections in 'waiting for server' state, ALB draining traffic from unhealthy target
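The second injected message describes clients retrying with exponential backoff while the endpoint is unreachable. A minimal sketch of that pattern, with full jitter and a cap; `connect` stands in for whatever psycopg/PgBouncer call the application actually uses (all names here are illustrative, not taken from the scenario's codebase):

```python
import random
import time


def backoff_delays(max_retries=6, base=0.5, cap=30.0):
    """Yield exponentially growing, jittered delays, capped at `cap` seconds."""
    for attempt in range(max_retries):
        yield random.uniform(0, min(cap, base * (2 ** attempt)))


def connect_with_retry(connect, max_retries=6, base=0.5, cap=30.0, sleep=time.sleep):
    """Call `connect` until it succeeds; sleep a backoff delay after each failure.

    Raises the last ConnectionError once retries are exhausted.
    """
    last_err = None
    for delay in backoff_delays(max_retries, base, cap):
        try:
            return connect()
        except ConnectionError as err:
            last_err = err
            sleep(delay)
    raise last_err
```

With `base=0.5` and `cap=30.0`, the retry window comfortably spans the 60-120 second failover estimate; the jitter keeps the 12 crashed PgBouncer connections from reconnecting in lockstep.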
Neural Engine Root Cause Analysis
AWS RDS is experiencing a Multi-AZ failover due to a hardware issue on the primary instance in us-east-1a. The system is automatically promoting the standby instance in us-east-1b and updating DNS records to point to the new primary. This is a managed AWS process designed for high availability, with 14 correlated incidents likely representing downstream application connection failures during the failover window.
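The DNS flip described above can be observed from a client host by polling the RDS endpoint until it stops resolving to the old primary's addresses. A minimal sketch, assuming the hostname and port are supplied by the operator (nothing here is taken from the scenario's actual endpoint):

```python
import socket
import time


def resolve_ipv4(hostname, port=5432):
    """Return the sorted set of IPv4 addresses the hostname currently resolves to."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def wait_for_dns_flip(hostname, old_addrs, timeout=180, interval=5):
    """Poll until the endpoint resolves to addresses differing from `old_addrs`."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        current = resolve_ipv4(hostname)
        if current and current != sorted(old_addrs):
            return current
        time.sleep(interval)
    raise TimeoutError(f"{hostname} still resolves to {old_addrs}")
```

Note that client-side resolver caches (and any JVM-style in-process DNS cache) can lag the RDS CNAME update, so a successful flip here does not guarantee every application server sees it at the same moment.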
Remediation Plan
1. Monitor AWS RDS console for failover completion status
2. Wait for DNS propagation to complete (estimated 60-120 seconds)
3. Verify new primary instance health in us-east-1b
4. Test database connectivity once failover completes
5. Monitor application logs for connection recovery
6. Address any downstream application issues from the 14 correlated incidents once RDS is stable
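Step 1 can be scripted rather than watched in the console. A hedged sketch using boto3's `describe_events` API: the instance identifier `prod-postgres` is an assumption, credentials and region come from the environment, and the completion check matches on event message text rather than a specific event ID:

```python
def failover_finished(events):
    """True once the RDS event stream contains a Multi-AZ failover completion message."""
    return any("failover completed" in e.get("Message", "").lower() for e in events)


def fetch_rds_events(instance_id="prod-postgres", minutes=30):
    """Fetch recent events for one DB instance (requires boto3 + AWS credentials)."""
    import boto3  # imported lazily so the pure check above works without AWS access

    rds = boto3.client("rds")
    resp = rds.describe_events(
        SourceIdentifier=instance_id,   # illustrative identifier, not from the scenario
        SourceType="db-instance",
        Duration=minutes,               # look back this many minutes
    )
    return resp["Events"]
```

A polling loop would call `fetch_rds_events` every few seconds and proceed to the connectivity checks (steps 3-4) once `failover_finished` returns True.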