cloud / azure_cosmos_db_throttling (PASSED)

Azure Cosmos DB Throttling — RU/s Budget Exhausted

Azure Cosmos DB begins aggressively throttling requests after a marketing campaign drives 10x normal traffic. The provisioned RU/s budget is exhausted and autoscale max is reached. Applications receive HTTP 429 (Too Many Requests) responses. Retry storms amplify the problem as clients retry throttled requests.

Pattern: AZURE_CLOUD
Severity: CRITICAL
Confidence: 95%
Remediation: Remote Hands

Test Results

Metric               | Expected     | Actual
Pattern Recognition  | AZURE_CLOUD  | AZURE_CLOUD
Severity Assessment  | CRITICAL     | CRITICAL
Incident Correlation | Yes          | 42 linked
Cascade Escalation   | Yes          | Yes
Remediation          | Remote Hands — Corax contacts on-site support via call, email, or API

Scenario Conditions

Azure Cosmos DB account 'prod-cosmos'. Provisioned 10,000 RU/s with autoscale max 40,000 RU/s. Traffic spike: 10x normal (marketing campaign). Current demand: 95,000 RU/s. All partitions throttling. 5 microservices dependent on Cosmos DB.

Injected Error Messages (3)

Azure Cosmos DB throttling critical — account 'prod-cosmos' returning HTTP 429 on 78% of requests, consumed RU/s: 95,000 (max provisioned: 40,000), autoscale ceiling reached, Azure Monitor: TotalRequests throttled across all partitions, retry-after header: 47ms but retry storms amplifying load, partition key hot spots detected on /category, Azure cost alert triggered at $2,400/day
Product catalog service degraded — 78% of Azure Cosmos DB read queries returning 429 TooManyRequests, product listing pages loading in 12 seconds (baseline: 400ms), SDK retry policy exhausted after 9 retries, cache miss rate: 85% due to traffic spike, customer-facing product pages showing 'unable to load products', Azure App Service scaling but backend bottleneck is Cosmos DB RU budget
Shopping cart service failing — cart add/update operations rejected by Azure Cosmos DB throttling, 429 response rate: 92% on write operations, customer carts not persisting, checkout flow broken, estimated lost revenue: $12,000/hour, Azure Service Bus dead letter queue growing (3,400 failed cart events), retry storms consuming 2x the original RU demand

Neural Engine Root Cause Analysis

The Azure Cosmos DB instance 'prod-cosmos' is experiencing severe throttling due to consuming 95,000 RU/s against a maximum provisioned capacity of 40,000 RU/s, despite autoscale being enabled and having reached its ceiling. The throttling is exacerbated by hot partition issues on the /category partition key, creating uneven load distribution that prevents effective scaling. Additionally, retry storms are amplifying the problem as applications retry throttled requests with only 47ms delays, further overwhelming the database.
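The retry-storm dynamic described above can be broken with capped exponential backoff plus jitter instead of immediate retries on the 47 ms hint. A minimal sketch, independent of any Azure SDK: `ThrottledError` and `call_with_backoff` are hypothetical names standing in for an SDK's HTTP 429 exception and the application's retry wrapper.

```python
import random
import time


class ThrottledError(Exception):
    """Stands in for an SDK's HTTP 429 exception (assumption)."""

    def __init__(self, retry_after=None):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after  # server hint, in seconds, if any


def call_with_backoff(operation, max_retries=5, base_delay=0.1, max_delay=30.0):
    """Retry a throttled operation with capped exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except ThrottledError as err:
            if attempt == max_retries:
                raise  # retry budget exhausted; surface the 429 to the caller
            # Honor the server's retry-after hint if present, but never retry
            # faster than our own exponential schedule: tight 47 ms hints are
            # exactly what fuels a retry storm under sustained overload.
            backoff = min(max_delay, base_delay * (2 ** attempt))
            delay = max(err.retry_after or 0.0, backoff)
            # Full jitter de-synchronizes clients that were throttled together.
            time.sleep(random.uniform(0, delay))
```

The jittered sleep matters as much as the exponential growth: without it, every client throttled in the same instant retries in the same instant, recreating the spike.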

Remediation Plan

1. Immediately increase the autoscale maximum RU/s limit from 40,000 to at least 100,000 RU/s to handle current load.
2. Implement exponential backoff in application retry logic to respect the retry-after headers and reduce retry storms.
3. Analyze and redesign the partition key strategy to distribute load more evenly across partitions, potentially using a composite key instead of just /category.
4. Review and optimize high-frequency queries and operations consuming excessive RUs.
5. Set up proper monitoring and alerting for RU consumption before reaching throttling thresholds.
6. Consider implementing request queuing or circuit breaker patterns in applications to handle throttling gracefully.
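The composite-key redesign in step 3 is often implemented as a synthetic partition key: append a bounded, deterministic suffix to the hot /category value so writes spread across several logical partitions. A sketch under assumed names (the function, document shape, and bucket count are illustrative, not taken from the account's schema):

```python
import hashlib

NUM_BUCKETS = 16  # illustrative; tune to the observed hot-partition skew


def synthetic_partition_key(category: str, item_id: str) -> str:
    """Spread a hot category across NUM_BUCKETS logical partitions.

    The suffix is derived deterministically from the item id, so a point
    read that knows the id can recompute the exact partition key instead
    of paying for a cross-partition fan-out query.
    """
    digest = hashlib.sha256(item_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return f"{category}-{bucket}"
```

Queries scoped to a whole category become 16-way fan-outs under this scheme, so the bucket count is a trade-off between write spread and read amplification.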
Tested: 2026-03-30 | Monitors: 3 | Incidents: 3 | Test ID: cmncjusrn04waobqe26gmqywu