PASSED: infrastructure / kafka_consumer_lag_critical

Kafka Consumer Lag Critical — Event Processing Stalled

The Kafka consumer group for the real-time analytics pipeline has accumulated 50 million unprocessed messages across 24 partitions. The consumers crashed when the schema registry became unavailable, and auto-restart keeps failing because the registry is still down.
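
For context, "lag" here is measured per partition as the gap between the broker's high-water mark and the group's last committed offset. The snapshot below is a minimal sketch using the confluent-kafka Python client; the broker address comes from the remediation plan and the group/topic names from this scenario, so treat all of them as placeholders.

```python
# Minimal per-partition lag snapshot for a consumer group, assuming the
# confluent-kafka Python client. Broker, group, and topic names are taken
# from this scenario and are otherwise placeholder assumptions.
from confluent_kafka import Consumer, TopicPartition

BOOTSTRAP = "10.40.1.10:9092"   # broker address from the remediation plan
GROUP = "analytics-pipeline"
TOPIC = "events.analytics"

consumer = Consumer({
    "bootstrap.servers": BOOTSTRAP,
    "group.id": GROUP,
    "enable.auto.commit": False,  # read-only: we only inspect offsets
})

# Enumerate the topic's partitions from cluster metadata.
meta = consumer.list_topics(TOPIC, timeout=10)
partitions = [TopicPartition(TOPIC, p) for p in meta.topics[TOPIC].partitions]

# Fetch the group's committed offsets, then compare each against the
# partition's high-water mark to get per-partition lag.
total = 0
for tp in consumer.committed(partitions, timeout=10):
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    # A negative offset (OFFSET_INVALID) means nothing committed yet.
    start = tp.offset if tp.offset >= 0 else low
    lag = high - start
    total += lag
    print(f"partition {tp.partition}: lag={lag}")
print(f"total lag: {total}")

consumer.close()
```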

Pattern: UNKNOWN
Severity: CRITICAL
Confidence: 85%
Remediation: Remote Hands

Test Results

| Metric | Expected | Actual |
| --- | --- | --- |
| Pattern Recognition | UNKNOWN | UNKNOWN |
| Severity Assessment | CRITICAL | CRITICAL |
| Incident Correlation | Yes | 22 linked |
| Cascade Escalation | N/A | No |
| Remediation | Remote Hands | Corax contacts on-site support via call, email, or API |

Scenario Conditions

Kafka cluster with 24 partitions. Consumer group 'analytics-pipeline' crashed. Schema registry unavailable. Consumer lag: 50 million messages. Auto-restart failing. Real-time analytics dashboard showing stale data.

Injected Error Messages (2)

Kafka consumer group 'analytics-pipeline' lag CRITICAL — total lag across 24 partitions: 50,247,831 messages, lag growing at 2.1 million messages per minute, oldest unprocessed message offset: 45 minutes old, consumer group has 0 active members (all consumers crashed), no rebalancing in progress, topic 'events.analytics' retention: 72 hours, if lag not resolved within 71 hours oldest messages will be purged unprocessed
analytics pipeline consumer startup failure — all 8 consumer instances failing on startup with 'SchemaRegistryException: econnrefused on schema-registry:8081', schema registry pod in error state, consumers cannot deserialize Avro messages without schema registry, consumer loop: start -> connect to schema registry -> econnrefused -> fail -> restart, real-time analytics dashboard showing data 45 minutes stale and aging
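
The first message implies a hard deadline: with 72-hour retention and the oldest unprocessed message already 45 minutes old, roughly 71 hours remain before data starts being purged unread. The back-of-the-envelope numbers below use only figures from the alert itself.

```python
# Back-of-the-envelope deadline math using only figures from the first alert.
lag = 50_247_831                 # unprocessed messages right now
arrival_per_min = 2_100_000      # lag growth while consumers are down
retention_min = 72 * 60          # topic retention window, in minutes
oldest_age_min = 45              # age of the oldest unprocessed message

# Time left before the oldest unprocessed messages hit the retention limit.
deadline_min = retention_min - oldest_age_min          # 4275 min, about 71 h

# To finish before the purge, restarted consumers must absorb new arrivals
# AND drain the existing backlog within the deadline.
required_per_min = arrival_per_min + lag / deadline_min
print(f"time to purge: {deadline_min / 60:.0f} h")
print(f"required rate: {required_per_min:,.0f} msgs/min")
# About 2.11M msgs/min, only ~0.6% above the arrival rate. The bigger risk
# is downtime itself: every hour the group stays at 0 members adds roughly
# 126M messages (2.1M/min * 60) to the backlog.
```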
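
The second message names the actual blocker: the consumers cannot deserialize Avro without the schema registry, so blind restarts just repeat the crash loop. One mitigation is to gate startup on a registry liveness check, sketched below; GET /subjects is a standard Confluent Schema Registry REST endpoint, and the URL is taken from the error message but is otherwise an assumption.

```python
# Gate consumer startup on Schema Registry health so consumers don't enter
# the start -> ECONNREFUSED -> fail -> restart loop from the error message.
# GET /subjects is a standard Confluent Schema Registry REST endpoint; the
# URL comes from the injected error and is otherwise an assumption.
import time
import urllib.error
import urllib.request

REGISTRY_URL = "http://schema-registry:8081/subjects"

def registry_is_up(timeout_s: float = 3.0) -> bool:
    try:
        with urllib.request.urlopen(REGISTRY_URL, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

# Capped exponential backoff instead of a tight crash/restart loop.
delay = 1.0
while not registry_is_up():
    print(f"schema registry unreachable, retrying in {delay:.0f}s")
    time.sleep(delay)
    delay = min(delay * 2, 60.0)
print("schema registry reachable; safe to start consumers")
```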

Neural Engine Root Cause Analysis

The Kafka consumer group 'analytics-pipeline' has 0 active members, indicating all consumers have crashed and are not processing messages from the 'events.analytics' topic. This has resulted in a massive lag of over 50 million messages growing at 2.1M messages/minute, with the oldest unprocessed message being 45 minutes old. The presence of 12 correlated incidents within a 5-minute window suggests a broader infrastructure or deployment issue that caused multiple consumers to crash simultaneously, rather than an isolated consumer application failure.
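
The "0 active members" signal the analysis leans on can be confirmed directly against the cluster. A sketch using confluent-kafka's AdminClient follows; describe_consumer_groups requires a recent (2.x) client version, so treat its availability as an assumption.

```python
# Confirm the "0 active members" finding via the Kafka admin API.
# describe_consumer_groups needs a recent (2.x) confluent-kafka client;
# treat availability and field names as assumptions.
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "10.40.1.10:9092"})

# Returns a dict of group id -> Future of a ConsumerGroupDescription.
futures = admin.describe_consumer_groups(["analytics-pipeline"])
desc = futures["analytics-pipeline"].result(timeout=10)

print(f"state:   {desc.state}")          # e.g. EMPTY when all members died
print(f"members: {len(desc.members)}")   # 0 matches the incident signal
for m in desc.members:
    print(f"  {m.member_id} on {m.host}")
```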

Remediation Plan

1. Investigate the 12 correlated incidents to identify the common root cause (network outage, deployment, resource exhaustion).
2. Check Kafka broker health and cluster status at 10.40.1.10:9092.
3. Examine consumer application logs for crash reasons (OOM, exceptions, connectivity issues).
4. Verify consumer host infrastructure (CPU, memory, disk, network connectivity).
5. Restart consumer applications/services for the 'analytics-pipeline' group.
6. Monitor consumer lag recovery and processing rate (see the sketch after this list).
7. If lag doesn't decrease rapidly, consider scaling out consumers or increasing partition parallelism.
8. Implement alerting for consumer group health to prevent future incidents.
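
For step 6, a simple recovery monitor can poll total group lag once a minute and report the net drain rate plus a rough catch-up ETA. The sketch below reuses the lag-snapshot logic from the earlier example; the broker, group, and topic names come from this scenario, and the one-minute interval is an illustrative assumption.

```python
# Sketch of remediation step 6: poll total group lag once a minute and
# report the net drain rate and a rough ETA. Names come from the scenario;
# the polling interval is an illustrative assumption.
import time
from confluent_kafka import Consumer, TopicPartition

def total_lag(consumer: Consumer, topic: str) -> int:
    """Sum of (high watermark - committed offset) across all partitions."""
    meta = consumer.list_topics(topic, timeout=10)
    tps = [TopicPartition(topic, p) for p in meta.topics[topic].partitions]
    lag = 0
    for tp in consumer.committed(tps, timeout=10):
        low, high = consumer.get_watermark_offsets(tp, timeout=10)
        lag += high - (tp.offset if tp.offset >= 0 else low)
    return lag

c = Consumer({"bootstrap.servers": "10.40.1.10:9092",
              "group.id": "analytics-pipeline",
              "enable.auto.commit": False})

prev, prev_t = total_lag(c, "events.analytics"), time.time()
while prev > 0:
    time.sleep(60)
    cur, now = total_lag(c, "events.analytics"), time.time()
    rate = (prev - cur) / ((now - prev_t) / 60)      # net msgs/min drained
    eta = cur / rate if rate > 0 else float("inf")
    print(f"lag={cur:,}  drain={rate:,.0f}/min  eta={eta:,.0f} min")
    if rate <= 0:
        print("lag not shrinking; consider scaling out (plan step 7)")
    prev, prev_t = cur, now
```
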
Tested: 2026-03-30 | Monitors: 2 | Incidents: 2 | Test ID: cmnckh5a709hkobqez8kliuj4