PASSED | cloud / aws_ec2_instance_store_loss

AWS EC2 Instance Store Loss — Ephemeral Data Gone

An EC2 instance with instance store (ephemeral) volumes experiences a hardware failure on the underlying host. AWS stops and restarts the instance on new hardware, but all instance store data is lost. The application had been incorrectly storing session data and temporary processing files on instance store volumes instead of EBS.

Pattern

AWS_CLOUD

Severity

CRITICAL

Confidence

95%

Remediation

Remote Hands

Test Results

Metric	Expected	Actual
Pattern Recognition	AWS_CLOUD	AWS_CLOUD
Severity Assessment	CRITICAL	CRITICAL
Incident Correlation	Yes	20 linked
Cascade Escalation	N/A	No
Remediation	—	Remote Hands — Corax contacts on-site support via call, email, or API

Scenario Conditions

AWS EC2 i3.2xlarge with 1.9TB NVMe instance store. Application storing session data, temp files, and ML model cache on instance store. No EBS backup of instance store data. Host hardware failure triggers instance stop/start. Instance relaunches on new host with empty instance store.
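
A placement audit is one way to surface this kind of risk before a host failure does. The following is a minimal sketch, not part of the tested system: `DataPath`, `audit_placement`, and the `must_survive_stop` flag are hypothetical names, and the inventory is assumed. Only the `/mnt/ephemeral` mount and its `models/` subdirectory come from the injected error messages.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Mount point taken from the injected error messages; the rest is illustrative.
EPHEMERAL_MOUNT = PurePosixPath("/mnt/ephemeral")


@dataclass(frozen=True)
class DataPath:
    name: str
    path: str
    must_survive_stop: bool  # hypothetical flag: would losing this cause an outage?


def on_ephemeral(path: str, mount: PurePosixPath = EPHEMERAL_MOUNT) -> bool:
    """True if `path` is the mount itself or lives underneath it."""
    p = PurePosixPath(path)
    return p == mount or mount in p.parents


def audit_placement(paths: list[DataPath]) -> list[str]:
    """Names of data sets that need durability but sit on instance store."""
    return [d.name for d in paths if d.must_survive_stop and on_ephemeral(d.path)]


# Assumed inventory mirroring the scenario: sessions, temp files, model cache.
INVENTORY = [
    DataPath("sessions", "/mnt/ephemeral/sessions", must_survive_stop=True),
    DataPath("temp_files", "/mnt/ephemeral/tmp", must_survive_stop=False),
    DataPath("model_cache", "/mnt/ephemeral/models", must_survive_stop=True),
    DataPath("app_logs", "/var/log/app", must_survive_stop=True),
]

print(audit_placement(INVENTORY))  # ['sessions', 'model_cache']
```

Temporary files are deliberately not flagged: scratch data that can be regenerated is a reasonable use of instance store, whereas session data and a model cache that takes hours to rebuild are not.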

Injected Error Messages (2)

AWS EC2 instance store data loss — ec2 instance i-0a1b2c3d4e5f67890 stopped by AWS due to underlying host degradation, EC2 System Status Check: impaired, instance restarted on new host hardware, all NVMe instance store volumes (1.9TB) returned empty, /mnt/ephemeral mount shows 0 bytes used, session data for 12,000 active users lost, application bootstrap failing: required files missing from instance store

ML pipeline data loss — trained model cache (340GB) on instance store wiped after EC2 host migration, model inference endpoint returning errors: 'model file not found at /mnt/ephemeral/models/', retraining required (estimated 8 hours), batch processing queue backed up with 4,700 pending jobs, no EBS snapshot of instance store data available for recovery

Neural Engine Root Cause Analysis

AWS automatically stopped EC2 instance i-0a1b2c3d4e5f67890 due to underlying host hardware degradation and migrated it to new hardware. During this migration, all ephemeral instance store volumes (1.9TB NVMe) were permanently lost, as instance store data is not persistent across instance stops/migrations. The application was critically dependent on instance store data including session information for 12,000 active users and bootstrap files, causing complete service failure upon restart.
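
The bootstrap failure follows from treating the instance store as the only copy of files the application needs at startup. A minimal sketch of a startup check that treats the mount as a cache and refills it from a durable copy is shown here. It assumes a durable source directory exists (for example an EBS volume or a locally synced S3 prefix); `missing_files`, `rehydrate`, and the file names are hypothetical, and temporary directories stand in for both volumes.

```python
import shutil
import tempfile
from pathlib import Path


def missing_files(root: Path, required: list[str]) -> list[str]:
    """Relative paths from `required` that do not exist under `root`."""
    return [rel for rel in required if not (root / rel).is_file()]


def rehydrate(ephemeral: Path, durable: Path, required: list[str]) -> list[str]:
    """Copy each missing required file from the durable copy.

    Returns the relative paths that were restored. Raises FileNotFoundError
    if the durable copy lacks a required file as well.
    """
    restored = []
    for rel in missing_files(ephemeral, required):
        src = durable / rel
        if not src.is_file():
            raise FileNotFoundError(f"no durable copy of {rel}")
        dst = ephemeral / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        restored.append(rel)
    return restored


# Demo: temporary directories stand in for the two volumes.
with tempfile.TemporaryDirectory() as tmp:
    durable = Path(tmp) / "durable"      # stand-in for EBS or a synced S3 prefix
    ephemeral = Path(tmp) / "ephemeral"  # stand-in for the empty instance store
    required = ["bootstrap/app.conf", "models/model.bin"]  # hypothetical names
    for rel in required:
        (durable / rel).parent.mkdir(parents=True, exist_ok=True)
        (durable / rel).write_text("placeholder")
    ephemeral.mkdir()

    first_pass = rehydrate(ephemeral, durable, required)   # restores both files
    second_pass = rehydrate(ephemeral, durable, required)  # nothing left to do
    print(first_pass, second_pass)
```

Running the check on every boot makes a relaunch on a new host behave like a cold cache rather than an outage, provided the durable copy is kept current.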

Remediation Plan

1. Immediately assess data recovery options from backups or replicas if available
2. Restore critical application bootstrap files from deployment artifacts or S3 backups
3. Invalidate all user sessions and force re-authentication to handle lost session data
4. Restart application services once bootstrap files are restored
5. Implement architectural changes to move critical data from ephemeral instance store to persistent EBS volumes or external storage
6. Set up automated backups for any remaining instance store usage
7. Consider implementing session persistence in external stores (Redis, DynamoDB) rather than local storage
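
Steps 3 and 7 can be illustrated with a small session layer that keeps state outside the instance. This is a sketch under stated assumptions, not the application's actual design: `SessionBackend`, `InMemoryBackend`, and `SessionStore` are hypothetical names, and the in-memory backend only stands in for an external service such as Redis or DynamoDB so the example runs without dependencies.

```python
import time
from typing import Callable, Optional, Protocol


class SessionBackend(Protocol):
    """Minimal interface an external store would need to provide."""

    def put(self, key: str, value: dict, expires_at: float) -> None: ...
    def get(self, key: str) -> Optional[tuple[dict, float]]: ...
    def delete(self, key: str) -> None: ...


class InMemoryBackend:
    """Stand-in for an external store; a real one lives outside any instance."""

    def __init__(self) -> None:
        self._items: dict[str, tuple[dict, float]] = {}

    def put(self, key: str, value: dict, expires_at: float) -> None:
        self._items[key] = (value, expires_at)

    def get(self, key: str) -> Optional[tuple[dict, float]]:
        return self._items.get(key)

    def delete(self, key: str) -> None:
        self._items.pop(key, None)


class SessionStore:
    """Per-instance facade; holds no session state of its own."""

    def __init__(self, backend: SessionBackend, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.time) -> None:
        self._backend = backend
        self._ttl = ttl_seconds
        self._clock = clock

    def save(self, session_id: str, data: dict) -> None:
        self._backend.put(session_id, dict(data), self._clock() + self._ttl)

    def load(self, session_id: str) -> Optional[dict]:
        item = self._backend.get(session_id)
        if item is None:
            return None
        data, expires_at = item
        if self._clock() >= expires_at:  # expired: drop it and report absent
            self._backend.delete(session_id)
            return None
        return data

    def invalidate(self, session_id: str) -> None:
        """Force re-authentication for one session (step 3)."""
        self._backend.delete(session_id)


# Demo: the "old" instance is replaced, but the shared backend keeps the session.
shared = InMemoryBackend()
old_instance = SessionStore(shared)
old_instance.save("s1", {"user": "alice"})

new_instance = SessionStore(shared)  # simulates relaunch on a new host
survived = new_instance.load("s1")
new_instance.invalidate("s1")
after_invalidate = new_instance.load("s1")
print(survived, after_invalidate)
```

Because `SessionStore` holds only a reference to the backend, replacing the instance loses nothing, and invalidation becomes an explicit operation rather than a side effect of hardware failure.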

Tested: 2026-03-30 | Monitors: 2 | Incidents: 2 | Test ID: cmncjuaj404qeobqe5y7qqkeq