PASSED · infrastructure / rpo_violation_backup_gap

RPO Violation — Backup Gap Exceeds SLA

A monitoring check reveals that the most recent successful backup of the production database is 72 hours old, far exceeding the 1-hour RPO SLA. The backup job has been failing silently because the backup repository is full, and offsite replication has also been paused.

Pattern: BACKUP_FAILURE
Severity: CRITICAL
Confidence: 95%
Remediation: Remote Hands

Test Results

Metric               | Expected       | Actual
Pattern Recognition  | BACKUP_FAILURE | BACKUP_FAILURE
Severity Assessment  | CRITICAL       | CRITICAL
Incident Correlation | Yes            | 19 linked
Cascade Escalation   | N/A            | No
Remediation          | Remote Hands — Corax contacts on-site support via call, email, or API

Scenario Conditions

Production database with 1-hour RPO SLA. Last successful backup: 72 hours ago. Backup repository at 100% capacity. Backup job failing with 'no space left' error. Offsite replication paused. No alerts configured on backup age.
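
The last condition is what lets the gap grow silently: nothing watches the age of the newest restore point. A minimal sketch of such a check in Python, assuming the scenario's 1-hour RPO; in practice the `last_success` timestamp would come from the backup server's API, and is supplied directly here:

```python
from datetime import datetime, timedelta, timezone

RPO = timedelta(hours=1)  # contractual recovery point objective

def check_backup_age(last_success: datetime, now: datetime | None = None) -> bool:
    """Return True if the newest successful backup is within the RPO."""
    now = now or datetime.now(timezone.utc)
    age = now - last_success
    if age > RPO:
        # A real check would page on-call here instead of printing.
        print(f"CRITICAL: recovery point age {age} exceeds RPO {RPO}")
        return False
    return True

# With the scenario's figures, a backup from 72 hours ago fails the check.
check_backup_age(datetime.now(timezone.utc) - timedelta(hours=72))
```

Run on a schedule, a check like this would have fired roughly an hour after the first failed job rather than 72 hours later.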

Injected Error Messages (2)

veeam backup job CRITICAL — last successful backup of 'pg-prod-daily' completed 72 hours ago on 2026-03-26T02:00:00Z, subsequent 36 backup attempts all failed with error: backup repository 'repo-primary' is at 100% capacity (48TB/48TB), no space to write new backup files, RPO SLA: 1 hour, actual recovery point age: 72 hours, offsite replication also paused due to source backup failures, zero usable restore points created in 3 days
production database at risk — pg-prod generating approximately 150GB of WAL per day, 72 hours of transactions (450GB) not backed up, if production database fails now data loss would be 72 hours of business transactions including orders, payments, and customer records, estimated financial exposure: 72 hours of revenue at risk, veeam repository full preventing any new backups

Neural Engine Root Cause Analysis

The backup failures are caused by the primary backup repository reaching 100% capacity (48TB/48TB), which prevents any new backup files from being written. This storage exhaustion has produced 36 consecutive backup failures over 72 hours, breaking the 1-hour RPO SLA and pausing offsite replication. The cascade effect suggests other systems may depend on backup status or storage resources.
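
The root cause also points at the cheapest prevention: capacity exhaustion is visible long before the repository hits 100%. A sketch of a fill-ratio alert, assuming a hypothetical /mnt/repo-primary mount point and an 80% warning threshold:

```python
import shutil

def repo_capacity_alert(path: str, threshold: float = 0.80) -> bool:
    """Return True and warn when a repository crosses the fill threshold."""
    usage = shutil.disk_usage(path)
    ratio = usage.used / usage.total
    if ratio >= threshold:
        print(f"WARNING: {path} is {ratio:.0%} full "
              f"({usage.used / 1e12:.1f} TB of {usage.total / 1e12:.1f} TB)")
        return True
    return False

repo_capacity_alert("/mnt/repo-primary")  # hypothetical mount point
```

At the repository's 48 TB size, an 80% threshold leaves roughly 9.6 TB of headroom before writes start failing.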

Remediation Plan

1. Immediately check disk usage on the backup repository and identify the largest/oldest backup files for cleanup (see the sketch after this list).
2. Delete expired or redundant backup chains to free space.
3. Verify that backup retention policies are properly configured.
4. Add storage capacity to the backup repository, or configure a secondary repository.
5. Resume backup jobs once space is available.
6. Validate that offsite replication resumes automatically.
7. Implement storage monitoring alerts to prevent future capacity issues.
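
For step 1, a sketch of how cleanup candidates might be surfaced, again assuming the hypothetical /mnt/repo-primary mount; note that backup chains should be deleted through Veeam itself so dependent restore points are not orphaned:

```python
from pathlib import Path

def cleanup_candidates(repo: str, top_n: int = 10) -> list[tuple[Path, int]]:
    """List the largest files in a repository as candidates for review."""
    files = [(p, p.stat().st_size) for p in Path(repo).rglob("*") if p.is_file()]
    files.sort(key=lambda f: f[1], reverse=True)  # largest first
    return files[:top_n]

# Candidates only: remove chains via the backup software, not the filesystem.
for path, size in cleanup_candidates("/mnt/repo-primary"):
    print(f"{size / 1e9:8.1f} GB  {path}")
```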
Tested: 2026-03-30 | Monitors: 2 | Incidents: 2 | Test ID: cmnckffh9093lobqejyutj3a9