PASSED | server / storage_io_latency

SAN Storage Latency Spike — Noisy Neighbor on Shared Storage

A dev team runs a massive data import job on a shared SAN, consuming all available IOPS. Production VMs on the same storage pool see read latency rise about 15x (from a 3ms baseline to 45ms), causing application timeouts.

Pattern: STORAGE_IO_LATENCY
Severity: CRITICAL
Confidence: 92%
Remediation: Remote Hands

Test Results

Metric | Expected | Actual | Result
Pattern Recognition | STORAGE_IO_LATENCY | STORAGE_IO_LATENCY |
Severity Assessment | CRITICAL | CRITICAL |
Incident Correlation | Yes | 28 linked |
Cascade Escalation | Yes | Yes |
Remediation | Remote Hands: Corax contacts on-site support via call, email, or API

Scenario Conditions

NetApp FAS8200 SAN. Shared storage pool for dev and prod. Dev job consuming 50K IOPS (normal: 2K). Production VM latency: 45ms (baseline: 3ms).
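
The figures in these conditions can be turned into degradation ratios with a few lines of arithmetic. The sketch below is only an illustration: the helper name `degradation_ratio` is hypothetical and not part of the tested system, and the inputs are the numbers quoted in the conditions.

```python
def degradation_ratio(observed: float, baseline: float) -> float:
    """Return how many times larger the observed value is than its baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return observed / baseline


# Figures taken from the scenario conditions.
latency_ratio = degradation_ratio(observed=45.0, baseline=3.0)    # milliseconds
iops_ratio = degradation_ratio(observed=50_000, baseline=2_000)   # IOPS

print(f"production latency: {latency_ratio:.0f}x baseline")  # 15x
print(f"dev job IOPS: {iops_ratio:.0f}x normal")             # 25x
```

Under these numbers the dev job runs at 25 times its normal IOPS while production read latency sits at 15 times its baseline.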

Injected Error Messages (3)

storage latency critical — SAN pool read latency 45ms (baseline 3ms), disk queue depth 128, iowait 89%, disk i/o throughput saturated by dev workload
SQL Server query timeouts due to storage I/O latency — read latency 47ms, write latency 52ms, iowait high on VM host, disk i/o bottleneck from shared SAN
Application server slow due to storage latency — response time 12s (baseline 200ms), i/o latency on SAN causing cascading timeouts

Neural Engine Root Cause Analysis

The SAN storage pool is experiencing severe I/O performance degradation caused by development workload saturation. Read latency has increased 15x from baseline (3ms to 45ms), with disk queue depth at maximum (128) and I/O wait at 89%, indicating the storage subsystem is completely overwhelmed. The 12 correlated incidents suggest this storage bottleneck is cascading to dependent services and applications that rely on this SAN pool.
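
The analysis leans on three signals: latency relative to baseline, disk queue depth, and I/O wait. A minimal sketch of how such signals might be combined is shown below. It is not the engine's actual logic; the class name, function names, and all thresholds are assumptions chosen for illustration, and only the sample values come from the injected messages.

```python
from dataclasses import dataclass


@dataclass
class StorageSample:
    read_latency_ms: float
    baseline_latency_ms: float
    queue_depth: int
    iowait_pct: float


# Hypothetical thresholds for illustration only; real values depend on the array.
LATENCY_MULTIPLE_LIMIT = 10.0
QUEUE_DEPTH_LIMIT = 128
IOWAIT_LIMIT_PCT = 80.0


def saturation_signals(sample: StorageSample) -> list[str]:
    """List which of the three saturation indicators are over their limit."""
    signals = []
    if sample.read_latency_ms / sample.baseline_latency_ms >= LATENCY_MULTIPLE_LIMIT:
        signals.append("latency")
    if sample.queue_depth >= QUEUE_DEPTH_LIMIT:
        signals.append("queue_depth")
    if sample.iowait_pct >= IOWAIT_LIMIT_PCT:
        signals.append("iowait")
    return signals


def classify(sample: StorageSample) -> str:
    """Map the number of tripped signals to a coarse severity label."""
    tripped = len(saturation_signals(sample))
    if tripped == 3:
        return "CRITICAL"
    return "WARNING" if tripped else "OK"


# Values from the first injected message: 45ms vs 3ms, queue depth 128, iowait 89%.
sample = StorageSample(45.0, 3.0, 128, 89.0)
print(saturation_signals(sample), classify(sample))
```

With the values from the first injected message, all three indicators trip under these assumed limits, which is consistent with the CRITICAL assessment reported above.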

Remediation Plan

1. Immediately identify and throttle/suspend the saturating development workload to restore service
2. Implement I/O priority/QoS controls to prevent dev workloads from impacting production
3. Move development workloads to separate storage tier or schedule during off-peak hours
4. Monitor storage performance recovery and validate dependent services return to normal
5. Establish long-term capacity planning and workload isolation policies
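
The throttling and QoS steps in the plan amount to capping how many operations per second a workload may issue. The sketch below illustrates that idea with a generic token bucket. It is not how the NetApp array or Corax enforces limits; the class name `IopsTokenBucket` and the burst size are hypothetical, and only the 2K and 50K IOPS figures come from the scenario.

```python
class IopsTokenBucket:
    """Generic token-bucket limiter: one token is spent per I/O operation."""

    def __init__(self, rate_iops: float, burst: float):
        if rate_iops <= 0 or burst <= 0:
            raise ValueError("rate_iops and burst must be positive")
        self.rate = rate_iops      # tokens added per second
        self.capacity = burst      # maximum tokens that can accumulate
        self.tokens = burst        # start with a full bucket
        self.last = 0.0            # timestamp of the previous call, in seconds

    def allow(self, now: float, ops: int = 1) -> bool:
        """Refill for the elapsed time, then admit the request if tokens suffice."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= ops:
            self.tokens -= ops
            return True
        return False


# Cap the dev workload at its normal 2K IOPS; the burst of 200 is an assumption.
bucket = IopsTokenBucket(rate_iops=2_000, burst=200)

# Simulate the dev job attempting 50,000 operations spread evenly over one second.
attempted = 50_000
admitted = sum(bucket.allow(now=i / attempted) for i in range(attempted))
print(f"admitted {admitted} of {attempted} attempted operations")
```

In this simulation roughly 2,200 operations get through in the first second (the initial burst plus one second of refill), and the rest are rejected, leaving the remaining capacity of the pool for production.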
Tested: 2026-03-30 | Monitors: 3 | Incidents: 3 | Test ID: cmncjenx800qxobqe0lc77j5p