A Kubernetes worker node enters the NotReady state after a network partition causes the kubelet to lose contact with the control plane. The 25 pods on the node are marked for eviction once the 5-minute toleration period expires. Most pods reschedule to other nodes, but some remain Pending due to resource and node-affinity constraints.
Pattern: CONTAINER_EVENT
Severity: CRITICAL
Confidence: 85%
Remediation: Remote Hands
Test Results

Metric               | Expected        | Actual                                                                | Result
Pattern Recognition  | CONTAINER_EVENT | CONTAINER_EVENT                                                       | —
Severity Assessment  | CRITICAL        | CRITICAL                                                              | —
Incident Correlation | Yes             | 22 linked                                                             | —
Cascade Escalation   | N/A             | No                                                                    | —
Remediation          | —               | Remote Hands (Corax contacts on-site support via call, email, or API) | —
Scenario Conditions
Kubernetes 1.29. 8-node cluster. Worker node k8s-worker-05 network partitioned. 25 pods on the node. Pod eviction timeout: 5 minutes. Remaining nodes at 70% resource utilization. 3 pods with nodeAffinity constraints.
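The 3 pods with nodeAffinity constraints cannot reschedule if no healthy node satisfies their rule. A hypothetical example of such a hard constraint (the label key and value are assumptions, not taken from the scenario):

```yaml
# Hypothetical pod spec fragment. "required..." nodeAffinity is a hard
# scheduling constraint: if no schedulable node carries a matching label,
# the pod stays Pending rather than landing elsewhere.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a"]
```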
Kubernetes pod eviction in progress: 25 pods evicted from k8s-worker-05; 22 rescheduled to healthy nodes; 3 Pending (insufficient CPU/memory on target nodes, nodeAffinity constraint not satisfiable); cluster resource utilization at 85%.
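The 5-minute eviction window comes from the default tolerations that the DefaultTolerationSeconds admission controller adds to pods, which tolerate a NotReady or unreachable node for 300 seconds before NoExecute eviction kicks in. A sketch of those defaults as they appear in a pod spec:

```yaml
# Default tolerations added to pods (tolerationSeconds: 300 = the
# 5-minute window before pods on a NotReady/unreachable node are evicted).
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
```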
Neural Engine Root Cause Analysis
Kubernetes worker node k8s-worker-05 has experienced a critical failure: the kubelet has stopped responding and is no longer posting status updates to the control plane. The node is marked NotReady with a network partition detected, causing 25 pods to be scheduled for eviction. Likely causes are a kubelet service failure, severe resource exhaustion (CPU, memory, or disk), or an underlying infrastructure issue preventing the node from communicating with the cluster.
Remediation Plan
1. Deploy Corax agent to investigate node health (disk space, memory, CPU usage).
2. Check kubelet service status and logs for failure indicators.
3. Attempt a kubelet service restart if the service is stopped or crashed.
4. Verify node network connectivity to the control plane.
5. If the kubelet restart fails or the node remains unhealthy, cordon and drain the node safely, then escalate to the infrastructure team for investigation.
6. Monitor pod rescheduling to healthy nodes.
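The manual side of the plan above can be sketched as a small runbook script. The node name, SSH access, and the DRY_RUN toggle are assumptions; by default the script only prints each command so the sequence can be reviewed before anything runs against the cluster.

```shell
#!/bin/sh
# Linear sketch of the triage steps; not decision logic. With DRY_RUN=1
# (the default) each command is printed instead of executed.
NODE="${NODE:-k8s-worker-05}"
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "WOULD RUN: $*"
  else
    "$@"
  fi
}

# 1. Inspect node conditions and recent events.
run kubectl describe node "$NODE"
# 2. Check kubelet status and recent logs on the node (via SSH or remote hands).
run ssh "$NODE" systemctl status kubelet
run ssh "$NODE" journalctl -u kubelet --since "30 min ago" --no-pager
# 3. Restart kubelet if it is stopped or crashed.
run ssh "$NODE" sudo systemctl restart kubelet
# 4. If the node stays NotReady, cordon and drain it safely.
run kubectl cordon "$NODE"
run kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
# 5. Watch evicted pods reschedule onto healthy nodes.
run kubectl get pods --all-namespaces --field-selector status.phase=Pending
```

Setting DRY_RUN=0 executes the commands in order; in practice the cordon/drain step should only follow a failed restart attempt, as in step 5 of the plan.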