A monitoring check reveals that the most recent successful backup of the production database is 72 hours old, far exceeding the 1-hour RPO SLA. The backup job has been failing silently due to a full backup repository, and the replication to the offsite location has also been paused.
Pattern: BACKUP_FAILURE
Severity: CRITICAL
Confidence: 95%
Remediation: Remote Hands
Test Results

Metric               | Expected       | Actual                                                                 | Result
---------------------|----------------|------------------------------------------------------------------------|-------
Pattern Recognition  | BACKUP_FAILURE | BACKUP_FAILURE                                                         |
Severity Assessment  | CRITICAL       | CRITICAL                                                               |
Incident Correlation | Yes            | 19 linked                                                              |
Cascade Escalation   | N/A            | No                                                                     |
Remediation          | —              | Remote Hands — Corax contacts on-site support via call, email, or API  |
Scenario Conditions
Production database with 1-hour RPO SLA. Last successful backup: 72 hours ago. Backup repository at 100% capacity. Backup job failing with 'no space left' error. Offsite replication paused. No alerts configured on backup age.
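The missing safeguard in this scenario is an alert on backup age. Such a check can be sketched as a simple RPO comparison; the function name, timestamps, and threshold below are illustrative, not part of any specific monitoring product:

```python
from datetime import datetime, timedelta, timezone

RPO_SLA = timedelta(hours=1)  # the scenario's 1-hour RPO

def rpo_status(last_successful_backup, now=None):
    """Compare recovery point age against the RPO SLA."""
    now = now or datetime.now(timezone.utc)
    age = now - last_successful_backup
    return {
        "recovery_point_age_hours": age.total_seconds() / 3600,
        "in_violation": age > RPO_SLA,
    }

# Example using the scenario's figures: last success 72 hours before the check.
now = datetime(2026, 3, 29, 2, 0, tzinfo=timezone.utc)
last = datetime(2026, 3, 26, 2, 0, tzinfo=timezone.utc)
status = rpo_status(last, now)
```

Wired to an alerting channel, a check like this would have fired within the first missed backup window instead of letting the violation surface 72 hours later.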
Injected Error Messages (2)
veeam backup job CRITICAL — last successful backup of 'pg-prod-daily' completed 72 hours ago on 2026-03-26T02:00:00Z, subsequent 36 backup attempts all failed with error: backup repository 'repo-primary' is at 100% capacity (48TB/48TB), no space to write new backup files, RPO SLA: 1 hour, actual recovery point age: 72 hours, offsite replication also paused due to source backup failures, zero usable restore points created in 3 days
production database at risk — pg-prod generating approximately 150GB of WAL per day, 72 hours of transactions (450GB) not backed up, if production database fails now data loss would be 72 hours of business transactions including orders, payments, and customer records, estimated financial exposure: 72 hours of revenue at risk, veeam repository full preventing any new backups
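The exposure figure in the second message follows directly from the quoted WAL rate; a quick sanity check, using only numbers taken from the alert text:

```python
# WAL generation rate and unprotected window, as quoted in the alert.
wal_per_day_gb = 150
hours_unprotected = 72

unprotected_gb = wal_per_day_gb * (hours_unprotected / 24)
# 150 GB/day over 3 days of missed backups = 450 GB of transactions at risk
```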
Neural Engine Root Cause Analysis
The backup failure is caused by the primary backup repository reaching 100% capacity (48TB/48TB), preventing any new backup files from being written. This storage exhaustion has caused 36 consecutive backup failures over 72 hours, breaking the 1-hour RPO SLA and pausing offsite replication, which cannot proceed without fresh source backups. The 19 linked incidents suggest that other systems may depend on backup status or on the same storage resources.
Remediation Plan
1. Immediately check disk usage on the backup repository and identify the largest/oldest backup files for cleanup
2. Delete expired or redundant backup chains to free space
3. Verify backup retention policies are properly configured
4. Add storage capacity to the backup repository, or configure a secondary repository
5. Resume backup jobs once space is available
6. Validate that offsite replication resumes automatically
7. Implement storage monitoring alerts to prevent future capacity issues
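Step 7 can be sketched as a threshold check that pages before the repository fills. The thresholds and the pure-function shape are illustrative; a real check would feed this from `shutil.disk_usage()` on the repository mount and route the result to the alerting system:

```python
WARN_PCT = 80.0   # start reclaiming space or expanding capacity
CRIT_PCT = 90.0   # page on-call before backups start failing

def capacity_status(used_tb: float, total_tb: float) -> str:
    """Classify repository fullness into OK / WARNING / CRITICAL."""
    used_pct = used_tb / total_tb * 100
    if used_pct >= CRIT_PCT:
        return "CRITICAL"
    if used_pct >= WARN_PCT:
        return "WARNING"
    return "OK"

# The scenario's repository: 48TB of 48TB used.
print(capacity_status(48, 48))  # CRITICAL at 100% usage
```

Alerting at 80% rather than at failure is the key design choice: it leaves time to prune backup chains or add capacity while restore points are still being created.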