A Kubernetes worker node enters the NotReady state after a network partition causes the kubelet to lose contact with the control plane. The 25 pods on the node are marked for eviction once the 5-minute toleration period expires. Most pods reschedule to other nodes, but some remain Pending due to resource and node-affinity constraints.
Pattern: CONTAINER_EVENT
Severity: CRITICAL
Confidence: 85%
Remediation: Remote Hands
Test Results

Metric               | Expected        | Actual                                                                | Result
Pattern Recognition  | CONTAINER_EVENT | CONTAINER_EVENT                                                       | —
Severity Assessment  | CRITICAL        | CRITICAL                                                              | —
Incident Correlation | Yes             | 22 linked                                                             | —
Cascade Escalation   | N/A             | No                                                                    | —
Remediation          | —               | Remote Hands (Corax contacts on-site support via call, email, or API) | —
Scenario Conditions
Kubernetes 1.29. 8-node cluster. Worker node k8s-worker-05 network partitioned. 25 pods on the node. Pod eviction timeout: 5 minutes. Remaining nodes at 70% resource utilization. 3 pods with nodeAffinity constraints.
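The 3 pods with nodeAffinity constraints cannot reschedule if no healthy node satisfies their rule. A hypothetical example of such a hard constraint (the label key and value are assumptions, not taken from the scenario):

```yaml
# Hypothetical pod spec fragment. "required..." nodeAffinity is a hard
# scheduling constraint: if no schedulable node carries a matching label,
# the pod stays Pending rather than landing elsewhere.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a"]
```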
Kubernetes pod eviction in progress: 25 pods evicted from k8s-worker-05; 22 rescheduled to healthy nodes; 3 Pending (insufficient CPU/memory on target nodes, nodeAffinity constraint not satisfiable); cluster resource utilization at 85%.
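The 5-minute eviction window comes from the default tolerations that the DefaultTolerationSeconds admission controller adds to pods, which tolerate a NotReady or unreachable node for 300 seconds before NoExecute eviction kicks in. A sketch of those defaults as they appear in a pod spec:

```yaml
# Default tolerations added to pods (tolerationSeconds: 300 = the
# 5-minute window before pods on a NotReady/unreachable node are evicted).
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
```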
Neural Engine Root Cause Analysis
Kubernetes worker node k8s-worker-05 has experienced a critical failure: the kubelet has stopped responding and is no longer posting status updates to the control plane. The node is marked NotReady with a network partition detected, causing 25 pods to be scheduled for eviction. Likely causes are a kubelet service failure, severe resource exhaustion (CPU, memory, or disk), or an underlying infrastructure issue preventing the node from communicating with the cluster.
Remediation Plan
1. Deploy Corax agent to investigate node health (disk space, memory, CPU usage).
2. Check kubelet service status and logs for failure indicators.
3. Attempt a kubelet service restart if the service is stopped or crashed.
4. Verify node network connectivity to the control plane.
5. If the kubelet restart fails or the node remains unhealthy, cordon and drain the node safely, then escalate to the infrastructure team for investigation.
6. Monitor pod rescheduling to healthy nodes.
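The manual side of the plan above can be sketched as a small runbook script. The node name, SSH access, and the DRY_RUN toggle are assumptions; by default the script only prints each command so the sequence can be reviewed before anything runs against the cluster.

```shell
#!/bin/sh
# Linear sketch of the triage steps; not decision logic. With DRY_RUN=1
# (the default) each command is printed instead of executed.
NODE="${NODE:-k8s-worker-05}"
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "WOULD RUN: $*"
  else
    "$@"
  fi
}

# 1. Inspect node conditions and recent events.
run kubectl describe node "$NODE"
# 2. Check kubelet status and recent logs on the node (via SSH or remote hands).
run ssh "$NODE" systemctl status kubelet
run ssh "$NODE" journalctl -u kubelet --since "30 min ago" --no-pager
# 3. Restart kubelet if it is stopped or crashed.
run ssh "$NODE" sudo systemctl restart kubelet
# 4. If the node stays NotReady, cordon and drain it safely.
run kubectl cordon "$NODE"
run kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
# 5. Watch evicted pods reschedule onto healthy nodes.
run kubectl get pods --all-namespaces --field-selector status.phase=Pending
```

Setting DRY_RUN=0 executes the commands in order; in practice the cordon/drain step should only follow a failed restart attempt, as in step 5 of the plan.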