PASSED | infrastructure / netapp_volume_offline

NetApp ONTAP Volume Offline

A NetApp ONTAP volume goes offline due to an aggregate running out of space, causing all LUNs and NFS exports on that volume to become unavailable. Multiple application servers lose access to their primary storage.

Pattern: STORAGE_IO_LATENCY
Severity: CRITICAL
Confidence: 95%
Remediation: Remote Hands

Test Results

Metric               | Expected           | Actual                                                  | Result
Pattern Recognition  | STORAGE_IO_LATENCY | STORAGE_IO_LATENCY                                      | Pass
Severity Assessment  | CRITICAL           | CRITICAL                                                | Pass
Incident Correlation | Yes                | 18 linked                                               | Pass
Cascade Escalation   | N/A                | No                                                      | Pass
Remediation          | Remote Hands       | Corax contacts on-site support via call, email, or API | Pass

Scenario Conditions

NetApp FAS8200 running ONTAP 9.12. Aggregate aggr1 at 98% capacity. Volume vol_prod_data goes offline. 8 LUNs and 3 NFS exports affected. 12 application servers impacted. Snapshot reserve consumed by runaway snapshots.
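
A first-pass verification of these conditions from the clustered ONTAP CLI might look like the sketch below. The SVM name svm1 is an assumption (the scenario does not name one), and the # lines are annotations for readability rather than clustershell syntax.

    # Confirm aggregate fullness (scenario reports aggr1 at 98%)
    storage aggregate show -aggregate aggr1
    storage aggregate show-space

    # Confirm volume state, NFS junction, and snapshot consumption
    volume show -vserver svm1 -volume vol_prod_data -fields state,junction-path
    volume snapshot show -vserver svm1 -volume vol_prod_data

    # Enumerate the affected LUNs on the offline volume
    lun show -vserver svm1 -volume vol_prod_data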

Injected Error Messages (2)

NetApp ONTAP volume vol_prod_data offline — aggregate aggr1 at 98% full, volume taken offline by ONTAP space enforcement, read latency spiked to 450ms before volume went offline, 8 LUNs and 3 NFS exports unavailable, write latency was 380ms in final minutes before offline event
NFS exports from NetApp unavailable — vol_prod_data offline, 12 application servers reporting stale file handles, read latency exceeded 500ms threshold before complete storage failure on aggregate aggr1
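
On the application-server side, the stale-file-handle symptom can be confirmed with standard Linux tools. This is a sketch only: the mount point /mnt/prod_data is hypothetical, and the final mount assumes the export is defined in /etc/fstab.

    # List NFS mounts and probe one for staleness
    grep nfs /proc/mounts
    stat /mnt/prod_data || echo "export offline or stale file handle"

    # After the volume is back online, a lazy remount clears lingering stale handles
    umount -l /mnt/prod_data
    mount /mnt/prod_data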

Neural Engine Root Cause Analysis

The NetApp ONTAP volume vol_prod_data has been automatically taken offline due to aggregate aggr1 reaching 98% capacity, triggering ONTAP's space enforcement policies. This storage exhaustion caused severe I/O degradation (read latency 450ms, write latency 380ms) before the automatic offline event, resulting in 8 LUNs and 3 NFS exports becoming unavailable. The 10 correlated incidents within the same timeframe indicate a cascading failure affecting multiple dependent systems and services that rely on this storage infrastructure.
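
The volume-level settings behind this enforcement behavior can be inspected, and runaway snapshot growth can be capped with snapshot autodelete. A sketch under the same hypothetical svm1 assumption; confirm the trigger and policy choices against data-protection requirements before enabling autodelete.

    # Inspect the space settings that govern offline enforcement
    volume show -vserver svm1 -volume vol_prod_data -fields space-guarantee,percent-snapshot-space

    # Cap runaway snapshots by enabling autodelete on the volume
    volume snapshot autodelete show -vserver svm1 -volume vol_prod_data
    volume snapshot autodelete modify -vserver svm1 -volume vol_prod_data -enabled true -trigger volume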

Remediation Plan

1. Immediately assess aggregate aggr1 space usage and identify the largest volumes and files consuming space.
2. Check for and delete any snapshots, temporary files, or logs that can be safely removed to free space (see the command sketch after this list).
3. If possible, extend the aggregate by adding disks, or move volumes to other aggregates with available space.
4. Once usage is below 90%, bring vol_prod_data back online using the 'volume online' command.
5. Monitor I/O latency and verify that all 8 LUNs and 3 NFS exports are accessible.
6. Implement proactive space monitoring and alerting at 85% capacity to prevent future occurrences.
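
A minimal command sketch of steps 2 through 5, assuming the clustered ONTAP 9 CLI and the same hypothetical SVM svm1; the snapshot name, the spare-disk count, and the names vol_archive and aggr2 are illustrative and must be verified against the live system before running anything.

    # Step 2: identify and delete reclaimable snapshots (name is illustrative)
    volume snapshot show -vserver svm1 -volume vol_prod_data
    volume snapshot delete -vserver svm1 -volume vol_prod_data -snapshot hourly.2026-03-30_0105

    # Step 3: grow the aggregate, or move a volume to a less-full aggregate
    storage aggregate add-disks -aggregate aggr1 -diskcount 4
    volume move start -vserver svm1 -volume vol_archive -destination-aggregate aggr2

    # Step 4: bring the volume back online once aggr1 usage is below 90%
    volume online -vserver svm1 -volume vol_prod_data

    # Step 5: verify LUN visibility and watch latency recover
    lun show -vserver svm1 -volume vol_prod_data
    qos statistics volume latency show -volume vol_prod_data
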
Tested: 2026-03-30 | Monitors: 2 | Incidents: 2 | Test ID: cmncjzsoc05xqobqeiufwmoro