An EC2 instance with instance store (ephemeral) volumes experiences a hardware failure on the underlying host. AWS stops and restarts the instance on new hardware, but all instance store data is lost. The application had been incorrectly storing session data and temporary processing files on instance store volumes instead of EBS.
Pattern: AWS_CLOUD
Severity: CRITICAL
Confidence: 95%
Remediation: Remote Hands
Test Results
Metric | Expected | Actual | Result
Pattern Recognition | AWS_CLOUD | AWS_CLOUD |
Severity Assessment | CRITICAL | CRITICAL |
Incident Correlation | Yes | 20 linked |
Cascade Escalation | N/A | No |
Remediation | - | Remote Hands: Corax contacts on-site support via call, email, or API |
Scenario Conditions
AWS EC2 i3.2xlarge with 1.9TB NVMe instance store. Application storing session data, temp files, and ML model cache on instance store. No EBS backup of instance store data. Host hardware failure triggers instance stop/start. Instance relaunches on new host with empty instance store.
Injected Error Messages (2)
AWS EC2 instance store data loss — ec2 instance i-0a1b2c3d4e5f67890 stopped by AWS due to underlying host degradation, EC2 System Status Check: impaired, instance restarted on new host hardware, all NVMe instance store volumes (1.9TB) returned empty, /mnt/ephemeral mount shows 0 bytes used, session data for 12,000 active users lost, application bootstrap failing: required files missing from instance store
ML pipeline data loss — trained model cache (340GB) on instance store wiped after EC2 host migration, model inference endpoint returning errors: 'model file not found at /mnt/ephemeral/models/', retraining required (estimated 8 hours), batch processing queue backed up with 4,700 pending jobs, no EBS snapshot of instance store data available for recovery
Neural Engine Root Cause Analysis
AWS automatically stopped EC2 instance i-0a1b2c3d4e5f67890 because the underlying host hardware was degraded, then restarted it on new hardware. All data on the ephemeral instance store volumes (1.9TB NVMe) was permanently lost in the process, because instance store data does not persist when an instance is stopped or moved to a different host. The application depended on instance store data, including session information for 12,000 active users, bootstrap files, and the 340GB trained model cache, so the service failed completely on restart.
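The failure mode described above (an application booting against an empty ephemeral mount) can be caught early with a preflight check. The following is a minimal sketch, not the application's actual bootstrap logic: the function names and the list of required paths are hypothetical, and only the /mnt/ephemeral mount point comes from the injected error messages. Note that a freshly formatted filesystem may contain entries such as lost+found, so the emptiness test is only an approximation.

```python
from pathlib import Path
from typing import Iterable, List


def ephemeral_mount_is_empty(mount_root: Path) -> bool:
    """Return True if the mount point is missing or has no entries at all."""
    return not mount_root.is_dir() or not any(mount_root.iterdir())


def missing_required_paths(mount_root: Path, required: Iterable[str]) -> List[str]:
    """Return the relative paths under mount_root that do not exist."""
    return [rel for rel in required if not (mount_root / rel).exists()]


def preflight(mount_root: Path, required: Iterable[str]) -> List[str]:
    """Collect human-readable problems; an empty list means the check passed."""
    problems = []
    if ephemeral_mount_is_empty(mount_root):
        problems.append(f"{mount_root} is empty or not mounted")
    for rel in missing_required_paths(mount_root, required):
        problems.append(f"required path missing: {mount_root / rel}")
    return problems
```

A service could call preflight(Path("/mnt/ephemeral"), ["models"]) before accepting traffic and, if the list is non-empty, trigger a restore instead of failing later with "model file not found".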
Remediation Plan
1. Immediately assess data recovery options from backups or replicas, if available.
2. Restore critical application bootstrap files from deployment artifacts or S3 backups.
3. Invalidate all user sessions and force re-authentication to handle the lost session data.
4. Restart application services once the bootstrap files are restored.
5. Implement architectural changes to move critical data from ephemeral instance store to persistent EBS volumes or external storage.
6. Set up automated backups for any remaining instance store usage.
7. Consider implementing session persistence in external stores (Redis, DynamoDB) rather than local storage.
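One way to read the restore and backup steps of the plan is to treat the instance store strictly as a cache that can be refilled from a durable copy at boot. The sketch below illustrates that idea under stated assumptions: rehydrate is a hypothetical helper, and durable_root stands in for whatever persistent location is chosen (for example an EBS mount or a locally synced copy of an S3 bucket). It is not Corax's remediation logic or an AWS API.

```python
import shutil
from pathlib import Path
from typing import List


def rehydrate(durable_root: Path, ephemeral_root: Path) -> List[str]:
    """Copy files that exist under durable_root but not under ephemeral_root.

    Existing files on the ephemeral side are left untouched. Returns the
    relative POSIX paths of the files that were restored.
    """
    restored = []
    for src in sorted(durable_root.rglob("*")):
        if not src.is_file():
            continue  # directories are created on demand below
        rel = src.relative_to(durable_root)
        dst = ephemeral_root / rel
        if dst.exists():
            continue  # already present, nothing to restore
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also preserves timestamps
        restored.append(rel.as_posix())
    return restored
```

Run at instance start, a helper like this makes a stop/start on new hardware a slow boot rather than an outage, provided the durable copy is kept current. Session data is a poor fit for this pattern and is better served by the external store suggested in the plan.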