An AWS RDS PostgreSQL Multi-AZ instance experiences a hardware failure in the primary AZ. Automatic failover to the standby in the secondary AZ is triggered, and applications experience 60-120 seconds of downtime. The DNS endpoint then resolves to the new primary.
Pattern: AWS_CLOUD
Severity: CRITICAL
Confidence: 95%
Remediation: Remote Hands
Test Results

| Metric | Expected | Actual | Result |
|---|---|---|---|
| Pattern Recognition | AWS_CLOUD | AWS_CLOUD | |
| Severity Assessment | CRITICAL | CRITICAL | |
| Incident Correlation | Yes | 34 linked | |
| Cascade Escalation | Yes | Yes | |
| Remediation | — | Remote Hands — Corax contacts on-site support via call, email, or API | |
Scenario Conditions
AWS RDS PostgreSQL 15. Multi-AZ deployment. Primary in us-east-1a, standby in us-east-1b. db.r6g.xlarge instance. 3 application servers connecting via RDS endpoint. Connection pooling with PgBouncer.
Injected Error Messages (3)
AWS RDS Multi-AZ failover in progress — RDS event: 'Multi-AZ instance failover started', hardware issue on primary in us-east-1a, promoting standby in us-east-1b, DNS propagation in progress, estimated downtime: 60-120 seconds, Event ID: RDS-EVENT-0049
Application database connection errors — PgBouncer reporting 'server conn crashed' for 12 active connections, PostgreSQL connection to RDS endpoint refused during failover, retrying with exponential backoff, HTTP 503 responses to clients
Application health check failing — ECS task reporting unhealthy, database connection pool exhausted after RDS failover, all connections in 'waiting for server' state, ALB draining traffic from unhealthy target
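The second injected message describes clients retrying with exponential backoff while the endpoint is unreachable. A minimal sketch of that pattern, with full jitter and a cap; `connect` stands in for whatever psycopg/PgBouncer call the application actually uses (all names here are illustrative, not taken from the scenario's codebase):

```python
import random
import time


def backoff_delays(max_retries=6, base=0.5, cap=30.0):
    """Yield exponentially growing, jittered delays, capped at `cap` seconds."""
    for attempt in range(max_retries):
        yield random.uniform(0, min(cap, base * (2 ** attempt)))


def connect_with_retry(connect, max_retries=6, base=0.5, cap=30.0, sleep=time.sleep):
    """Call `connect` until it succeeds; sleep a backoff delay after each failure.

    Raises the last ConnectionError once retries are exhausted.
    """
    last_err = None
    for delay in backoff_delays(max_retries, base, cap):
        try:
            return connect()
        except ConnectionError as err:
            last_err = err
            sleep(delay)
    raise last_err
```

With `base=0.5` and `cap=30.0`, the retry window comfortably spans the 60-120 second failover estimate; the jitter keeps the 12 crashed PgBouncer connections from reconnecting in lockstep.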
Neural Engine Root Cause Analysis
AWS RDS is experiencing a Multi-AZ failover due to a hardware issue on the primary instance in us-east-1a. The system is automatically promoting the standby instance in us-east-1b and updating DNS records to point to the new primary. This is a managed AWS process designed for high availability, with 14 correlated incidents likely representing downstream application connection failures during the failover window.
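The DNS flip described above can be observed from a client host by polling the RDS endpoint until it stops resolving to the old primary's addresses. A minimal sketch, assuming the hostname and port are supplied by the operator (nothing here is taken from the scenario's actual endpoint):

```python
import socket
import time


def resolve_ipv4(hostname, port=5432):
    """Return the sorted set of IPv4 addresses the hostname currently resolves to."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def wait_for_dns_flip(hostname, old_addrs, timeout=180, interval=5):
    """Poll until the endpoint resolves to addresses differing from `old_addrs`."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        current = resolve_ipv4(hostname)
        if current and current != sorted(old_addrs):
            return current
        time.sleep(interval)
    raise TimeoutError(f"{hostname} still resolves to {old_addrs}")
```

Note that client-side resolver caches (and any JVM-style in-process DNS cache) can lag the RDS CNAME update, so a successful flip here does not guarantee every application server sees it at the same moment.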
Remediation Plan
1. Monitor AWS RDS console for failover completion status
2. Wait for DNS propagation to complete (estimated 60-120 seconds)
3. Verify new primary instance health in us-east-1b
4. Test database connectivity once failover completes
5. Monitor application logs for connection recovery
6. Address any downstream application issues from the 14 correlated incidents once RDS is stable
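Step 1 can be scripted rather than watched in the console. A hedged sketch using boto3's `describe_events` API: the instance identifier `prod-postgres` is an assumption, credentials and region come from the environment, and the completion check matches on event message text rather than a specific event ID:

```python
def failover_finished(events):
    """True once the RDS event stream contains a Multi-AZ failover completion message."""
    return any("failover completed" in e.get("Message", "").lower() for e in events)


def fetch_rds_events(instance_id="prod-postgres", minutes=30):
    """Fetch recent events for one DB instance (requires boto3 + AWS credentials)."""
    import boto3  # imported lazily so the pure check above works without AWS access

    rds = boto3.client("rds")
    resp = rds.describe_events(
        SourceIdentifier=instance_id,   # illustrative identifier, not from the scenario
        SourceType="db-instance",
        Duration=minutes,               # look back this many minutes
    )
    return resp["Events"]
```

A polling loop would call `fetch_rds_events` every few seconds and proceed to the connectivity checks (steps 3-4) once `failover_finished` returns True.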