PASSED

server / storage_io_latency

SAN Storage Latency Spike — Noisy Neighbor on Shared Storage

A dev team runs a massive data import job on a shared SAN, consuming all available IOPS. Production VMs on the same storage pool see read latency rise 15x (3ms to 45ms), causing application timeouts.

Pattern

STORAGE_IO_LATENCY

Severity

CRITICAL

Confidence

92%

Remediation

Remote Hands

Test Results

Metric	Expected	Actual
Pattern Recognition	STORAGE_IO_LATENCY	STORAGE_IO_LATENCY
Severity Assessment	CRITICAL	CRITICAL
Incident Correlation	Yes	28 linked
Cascade Escalation	Yes	Yes
Remediation	—	Remote Hands — Corax contacts on-site support via call, email, or API

Scenario Conditions

NetApp FAS8200 SAN. Shared storage pool for dev and prod. Dev job consuming 50K IOPS (normal: 2K). Production VM latency: 45ms (baseline: 3ms).
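
The size of the deviation follows directly from these figures. The short sketch below works through the arithmetic; the helper name `deviation_factor` is illustrative and not part of the tested system.

```python
# Scenario figures taken from the conditions above.
BASELINE_LATENCY_MS = 3
OBSERVED_LATENCY_MS = 45
NORMAL_DEV_IOPS = 2_000
OBSERVED_DEV_IOPS = 50_000


def deviation_factor(observed: float, baseline: float) -> float:
    """Return how many times larger the observed value is than the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return observed / baseline


latency_factor = deviation_factor(OBSERVED_LATENCY_MS, BASELINE_LATENCY_MS)
iops_factor = deviation_factor(OBSERVED_DEV_IOPS, NORMAL_DEV_IOPS)

print(f"Production latency: {latency_factor:.0f}x baseline")  # 45 / 3
print(f"Dev IOPS: {iops_factor:.0f}x normal")                 # 50,000 / 2,000
```

Production latency is 15 times its baseline while the dev job runs at 25 times its normal IOPS.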

Injected Error Messages (3)

storage latency critical — SAN pool read latency 45ms (baseline 3ms), disk queue depth 128, iowait 89%, disk i/o throughput saturated by dev workload

SQL Server query timeouts due to storage I/O latency — read latency 47ms, write latency 52ms, iowait high on VM host, disk i/o bottleneck from shared SAN

Application server slow due to storage latency — response time 12s (baseline 200ms), i/o latency on SAN causing cascading timeouts
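
As a rough illustration of how numeric signals could be pulled out of messages like these, the sketch below extracts latency and response-time values with a regular expression and normalizes them to milliseconds. It is an assumption-laden example, not the detection logic of the system under test; `METRIC_RE` and `extract_metrics_ms` are hypothetical names.

```python
import re

# The three injected messages, copied verbatim from above.
MESSAGES = [
    "storage latency critical — SAN pool read latency 45ms (baseline 3ms), "
    "disk queue depth 128, iowait 89%, disk i/o throughput saturated by dev workload",
    "SQL Server query timeouts due to storage I/O latency — read latency 47ms, "
    "write latency 52ms, iowait high on VM host, disk i/o bottleneck from shared SAN",
    "Application server slow due to storage latency — response time 12s "
    "(baseline 200ms), i/o latency on SAN causing cascading timeouts",
]

# Hypothetical pattern: a known metric name followed by a number and a unit.
METRIC_RE = re.compile(
    r"(?P<name>read latency|write latency|response time)\s+"
    r"(?P<value>\d+(?:\.\d+)?)(?P<unit>ms|s)\b"
)


def extract_metrics_ms(message: str) -> dict[str, float]:
    """Return the metrics found in a message, converted to milliseconds."""
    metrics: dict[str, float] = {}
    for match in METRIC_RE.finditer(message):
        value = float(match.group("value"))
        if match.group("unit") == "s":
            value *= 1000  # seconds to milliseconds
        metrics[match.group("name")] = value
    return metrics


for message in MESSAGES:
    print(extract_metrics_ms(message))
```

Baseline values in parentheses are deliberately not captured, since the pattern only matches a value that directly follows a metric name.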

Neural Engine Root Cause Analysis

The SAN storage pool is experiencing severe I/O performance degradation caused by development workload saturation. Read latency has increased 15x from baseline (3ms to 45ms), with disk queue depth at maximum (128) and I/O wait at 89%, indicating the storage subsystem is completely overwhelmed. The 12 correlated incidents suggest this storage bottleneck is cascading to dependent services and applications that rely on this SAN pool.

Remediation Plan

1. Immediately identify and throttle/suspend the saturating development workload to restore service
2. Implement I/O priority/QoS controls to prevent dev workloads from impacting production
3. Move development workloads to separate storage tier or schedule during off-peak hours
4. Monitor storage performance recovery and validate dependent services return to normal
5. Establish long-term capacity planning and workload isolation policies
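
The QoS control in step 2 amounts to capping how many I/O operations per second the dev workload may issue. The sketch below shows the idea with a token bucket. It is a conceptual illustration only, not how the storage array implements QoS; the class name `IopsLimiter` and the 2,000 IOPS cap (chosen to match the dev workload's normal rate in this scenario) are assumptions.

```python
import time


class IopsLimiter:
    """Token bucket in which each token permits one I/O operation."""

    def __init__(self, max_iops: float, burst: float, clock=time.monotonic):
        self.max_iops = max_iops  # sustained refill rate, tokens per second
        self.burst = burst        # bucket capacity
        self.clock = clock        # injectable so the example is deterministic
        self.tokens = burst       # start with a full bucket
        self.last = clock()

    def try_acquire(self, ops: int = 1) -> bool:
        """Admit `ops` operations if enough tokens are available."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.max_iops)
        self.last = now
        if self.tokens >= ops:
            self.tokens -= ops
            return True
        return False


# Simulate the dev job attempting 50,000 operations at a single instant,
# then again half a second later, using a fake clock.
fake_now = [0.0]
limiter = IopsLimiter(max_iops=2000, burst=2000, clock=lambda: fake_now[0])

admitted = sum(limiter.try_acquire() for _ in range(50_000))
fake_now[0] = 0.5
admitted_later = sum(limiter.try_acquire() for _ in range(50_000))

print(admitted, admitted_later)
```

With the cap in place, the first burst admits only the 2,000 operations in the bucket, and half a second later only the 1,000 refilled tokens are available, regardless of how many operations the job attempts.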

Tested: 2026-03-30 | Monitors: 3 | Incidents: 3 | Test ID: cmncjenx800qxobqe0lc77j5p