PASSED | infrastructure / redis_cache_eviction_storm

Redis Cache Eviction Storm — Database Overwhelmed

Redis has hit its maxmemory limit and is aggressively evicting cached entries under the volatile-lru policy. The cache hit ratio has dropped from 95% to 12%, and the resulting thundering herd of cache misses is overwhelming the backend database with 50x the normal query load.
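The hit ratio quoted above is derived from the `keyspace_hits` and `keyspace_misses` counters in Redis `INFO stats`. A minimal sketch of that calculation (the sample counts are illustrative, chosen to reproduce the 95% baseline and the 12% storm reading):

```python
def cache_hit_ratio(keyspace_hits: int, keyspace_misses: int) -> float:
    """Hit ratio as commonly computed from INFO stats: hits / (hits + misses)."""
    total = keyspace_hits + keyspace_misses
    return keyspace_hits / total if total else 0.0

# Illustrative counter samples matching this scenario's numbers.
baseline = cache_hit_ratio(keyspace_hits=9_500, keyspace_misses=500)   # 0.95
storm = cache_hit_ratio(keyspace_hits=1_200, keyspace_misses=8_800)    # 0.12
```

Because these counters are cumulative since server start, production monitoring should compute the ratio over deltas between samples rather than over the lifetime totals.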

Pattern: UNKNOWN
Severity: CRITICAL
Confidence: 95%
Remediation: Remote Hands

Test Results

Metric | Expected | Actual
Pattern Recognition | UNKNOWN | UNKNOWN
Severity Assessment | CRITICAL | CRITICAL
Incident Correlation | Yes | 22 linked
Cascade Escalation | N/A | No
Remediation | Remote Hands (Corax contacts on-site support via call, email, or API)

Scenario Conditions

Redis 6GB maxmemory. Cache hit ratio dropped from 95% to 12%. Eviction rate: 50000 keys/second. Database connections maxed out at 500. Database query latency: 8 seconds (normal: 50ms). Application returning errors.

Injected Error Messages (2)

Redis eviction storm — maxmemory 6GB reached, eviction policy: volatile-lru, evicted_keys: 50000/second, keyspace_hits falling rapidly, cache hit ratio: 12% (baseline: 95%), connected_clients: 847, used_memory: 6.0GB/6.0GB, 4.2 million keys evicted in last 90 seconds, memory fragmentation ratio: 1.8, all cached session data and query results being purged
database connection pool exhausted — pg-prod receiving 50x normal query load due to Redis cache eviction storm, max connections reached (500/500), new connections being rejected, database query latency: 8 seconds (normal: 50ms), database cpu at 100% across all cores, application returning 'database connection pool exhausted' errors, thundering herd of cache misses overwhelming database backend, deadlock detected on multiple tables
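The "thundering herd of cache misses" named in the second error message has a standard mitigation: request coalescing (single-flight), where concurrent misses on the same key wait for one loader call instead of each hitting the database. A minimal, hypothetical Python sketch, not part of the scenario's stack; the in-process dict stands in for Redis and `loader` for the database query:

```python
import threading

class SingleFlightCache:
    """Coalesce concurrent misses on the same key so only one caller hits the backend."""

    def __init__(self, loader):
        self._loader = loader           # invoked once per missing key
        self._cache = {}
        self._locks = {}
        self._guard = threading.Lock()  # protects _cache and _locks

    def get(self, key):
        with self._guard:
            if key in self._cache:
                return self._cache[key]
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:                      # first caller loads; the rest block, then read
            with self._guard:
                if key in self._cache:
                    return self._cache[key]
            value = self._loader(key)
            with self._guard:
                self._cache[key] = value
                self._locks.pop(key, None)
            return value

calls = []
cache = SingleFlightCache(lambda k: calls.append(k) or f"value:{k}")

threads = [threading.Thread(target=cache.get, args=("user:42",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Despite 8 concurrent misses on the same key, the loader ran only once.
```

With coalescing in place, a mass eviction produces one backend query per hot key rather than one per waiting client, which directly caps the 50x load amplification described here.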

Neural Engine Root Cause Analysis

Redis production cache has reached its configured 6GB maxmemory limit, triggering an eviction storm with 50,000 keys being evicted per second under the volatile-lru policy. The cache hit ratio has collapsed from 95% to 12% as critical session data and query results are being purged, likely causing the 11 correlated incidents including the Azure App Service customer portal failure. This indicates either a memory leak in the application writing to Redis, an unexpected surge in data volume, or insufficient memory allocation for current workload demands.
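The eviction rate cited in the analysis (50,000 keys/second) comes from `evicted_keys`, a cumulative counter in `INFO stats`; a storm is flagged by the rate of change between samples, not the absolute value. A hedged sketch with an illustrative threshold:

```python
def eviction_rate(prev_evicted: int, curr_evicted: int, interval_s: float) -> float:
    """evicted_keys is cumulative; the rate is the delta over the sampling interval."""
    return (curr_evicted - prev_evicted) / interval_s

# Two INFO samples taken 10 s apart during the storm: 500,000 keys evicted.
rate = eviction_rate(prev_evicted=4_000_000, curr_evicted=4_500_000, interval_s=10.0)

STORM_THRESHOLD = 10_000  # keys/second; tune to the workload's normal churn
is_storm = rate > STORM_THRESHOLD
```

Pairing this rate check with the hit-ratio trend distinguishes an eviction storm from routine TTL expiry, which also increments eviction-adjacent counters but does not collapse the hit ratio.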

Remediation Plan

1. Immediately increase the Redis maxmemory limit, or scale to a larger instance if possible.
2. Analyze the Redis keyspace to identify large keys or unexpected data growth patterns.
3. Review application logs for memory leaks or unusual write patterns.
4. Consider a temporary cache flush and application restart if a memory increase isn't possible.
5. Implement monitoring alerts for memory usage at an 80% threshold.
6. Review and optimize cache expiration policies for non-critical data.
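The 80% alert in step 5 of the plan reduces to a ratio check on `used_memory` against `maxmemory`, both reported by `INFO memory`. A minimal sketch (the threshold default and byte figures are illustrative):

```python
def memory_alert(used_memory: int, maxmemory: int, threshold: float = 0.80) -> bool:
    """True when Redis memory use crosses the alerting threshold."""
    if maxmemory <= 0:  # maxmemory 0 means "no limit" in Redis; ratio is undefined
        return False
    return used_memory / maxmemory >= threshold

GiB = 1024 ** 3
alert_now = memory_alert(used_memory=5 * GiB, maxmemory=6 * GiB)   # ~83% of 6 GiB
alert_ok = memory_alert(used_memory=4 * GiB, maxmemory=6 * GiB)    # ~67% of 6 GiB
```

Alerting at 80% of the 6 GiB limit would have fired well before the storm in this scenario, leaving time for step 1 (raising maxmemory or scaling out) before evictions began.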
Tested: 2026-03-30 | Monitors: 2 | Incidents: 2 | Test ID: cmnckh60b09hlobqe7q2as2x3