PASSED · infrastructure / rpo_violation_backup_gap

RPO Violation — Backup Gap Exceeds SLA

A monitoring check reveals that the most recent successful backup of the production database is 72 hours old, far exceeding the 1-hour RPO SLA. The backup job has been failing silently because the backup repository is full, and offsite replication has also been paused.

Pattern: BACKUP_FAILURE
Severity: CRITICAL
Confidence: 95%
Remediation: Remote Hands

Test Results

Metric               | Expected       | Actual
Pattern Recognition  | BACKUP_FAILURE | BACKUP_FAILURE
Severity Assessment  | CRITICAL       | CRITICAL
Incident Correlation | Yes            | 19 linked
Cascade Escalation   | N/A            | No
Remediation          | Remote Hands — Corax contacts on-site support via call, email, or API

Scenario Conditions

Production database with 1-hour RPO SLA. Last successful backup: 72 hours ago. Backup repository at 100% capacity. Backup job failing with 'no space left' error. Offsite replication paused. No alerts configured on backup age.
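
The last condition is what lets the gap grow silently: nothing watches the age of the newest restore point. A minimal sketch of such a check in Python, assuming the scenario's 1-hour RPO; in practice the `last_success` timestamp would come from the backup server's API, and is supplied directly here:

```python
from datetime import datetime, timedelta, timezone

RPO = timedelta(hours=1)  # contractual recovery point objective

def check_backup_age(last_success: datetime, now: datetime | None = None) -> bool:
    """Return True if the newest successful backup is within the RPO."""
    now = now or datetime.now(timezone.utc)
    age = now - last_success
    if age > RPO:
        # A real check would page on-call here instead of printing.
        print(f"CRITICAL: recovery point age {age} exceeds RPO {RPO}")
        return False
    return True

# With the scenario's figures, a backup from 72 hours ago fails the check.
check_backup_age(datetime.now(timezone.utc) - timedelta(hours=72))
```

Run on a schedule, a check like this would have fired roughly an hour after the first failed job rather than 72 hours later.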

Injected Error Messages (2)

veeam backup job CRITICAL — last successful backup of 'pg-prod-daily' completed 72 hours ago on 2026-03-26T02:00:00Z, subsequent 36 backup attempts all failed with error: backup repository 'repo-primary' is at 100% capacity (48TB/48TB), no space to write new backup files, RPO SLA: 1 hour, actual recovery point age: 72 hours, offsite replication also paused due to source backup failures, zero usable restore points created in 3 days
production database at risk — pg-prod generating approximately 150GB of WAL per day, 72 hours of transactions (450GB) not backed up, if production database fails now data loss would be 72 hours of business transactions including orders, payments, and customer records, estimated financial exposure: 72 hours of revenue at risk, veeam repository full preventing any new backups

Neural Engine Root Cause Analysis

The backup failures are caused by the primary backup repository reaching 100% capacity (48TB/48TB), which prevents any new backup files from being written. This storage exhaustion has produced 36 consecutive backup failures over 72 hours, breaking the 1-hour RPO SLA and pausing offsite replication. The cascade effect suggests other systems may depend on backup status or storage resources.
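
The root cause also points at the cheapest prevention: capacity exhaustion is visible long before the repository hits 100%. A sketch of a fill-ratio alert, assuming a hypothetical /mnt/repo-primary mount point and an 80% warning threshold:

```python
import shutil

def repo_capacity_alert(path: str, threshold: float = 0.80) -> bool:
    """Return True and warn when a repository crosses the fill threshold."""
    usage = shutil.disk_usage(path)
    ratio = usage.used / usage.total
    if ratio >= threshold:
        print(f"WARNING: {path} is {ratio:.0%} full "
              f"({usage.used / 1e12:.1f} TB of {usage.total / 1e12:.1f} TB)")
        return True
    return False

repo_capacity_alert("/mnt/repo-primary")  # hypothetical mount point
```

At the repository's 48 TB size, an 80% threshold leaves roughly 9.6 TB of headroom before writes start failing.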

Remediation Plan

1. Immediately check disk usage on the backup repository and identify the largest/oldest backup files for cleanup (see the sketch after this list).
2. Delete expired or redundant backup chains to free space.
3. Verify that backup retention policies are properly configured.
4. Add storage capacity to the backup repository, or configure a secondary repository.
5. Resume backup jobs once space is available.
6. Validate that offsite replication resumes automatically.
7. Implement storage monitoring alerts to prevent future capacity issues.
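
For step 1, a sketch of how cleanup candidates might be surfaced, again assuming the hypothetical /mnt/repo-primary mount; note that backup chains should be deleted through Veeam itself so dependent restore points are not orphaned:

```python
from pathlib import Path

def cleanup_candidates(repo: str, top_n: int = 10) -> list[tuple[Path, int]]:
    """List the largest files in a repository as candidates for review."""
    files = [(p, p.stat().st_size) for p in Path(repo).rglob("*") if p.is_file()]
    files.sort(key=lambda f: f[1], reverse=True)  # largest first
    return files[:top_n]

# Candidates only: remove chains via the backup software, not the filesystem.
for path, size in cleanup_candidates("/mnt/repo-primary"):
    print(f"{size / 1e9:8.1f} GB  {path}")
```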
Tested: 2026-03-30 | Monitors: 2 | Incidents: 2 | Test ID: cmnckffh9093lobqejyutj3a9