PASSED — cloud / k8s_hpa_autoscaling_failure

Kubernetes HPA Autoscaling Failure — Metrics Server Down

The Kubernetes metrics-server deployment is crashlooping due to a misconfigured TLS flag after a Helm chart upgrade. Without metrics-server, the Horizontal Pod Autoscaler cannot read CPU/memory metrics and stops scaling pods. During a traffic surge, pods remain at minimum replica count and become overwhelmed.

Pattern: CONTAINER_EVENT
Severity: CRITICAL
Confidence: 95%
Remediation: Auto-Heal

Test Results

Metric                  Expected           Actual
Pattern Recognition     CONTAINER_EVENT    CONTAINER_EVENT
Severity Assessment     CRITICAL           CRITICAL
Incident Correlation    Yes                2 linked
Cascade Escalation      N/A                No
Remediation             Auto-Heal          Corax resolves autonomously

Scenario Conditions

K8s cluster with 12 HPAs. Metrics-server crashed after Helm upgrade. HPAs cannot read metrics. Traffic surge 5x normal. All HPAs stuck at minReplicas. Application pods overwhelmed.
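For context, an HPA in this scenario would look roughly like the sketch below. All names (frontend, the target utilization, maxReplicas) are illustrative assumptions, not values taken from the cluster; the key detail is minReplicas: 2, the floor the HPAs were stuck at once the metrics API disappeared.

```shell
# Write a hypothetical HPA manifest matching the scenario.
# (All resource names and thresholds are assumptions for illustration.)
cat > frontend-hpa.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
EOF
# kubectl apply -f frontend-hpa.yaml   # requires a live cluster
```

Without metrics-server, the HPA controller cannot evaluate the CPU target at all, so the replica count never moves off the minimum regardless of load.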

Injected Error Messages (2)

kubelet metrics-server pod restarting repeatedly — pod 'metrics-server-7bc6d75975-x4k2m' in kube-system namespace restarting every 30 seconds, error: 'failed to verify TLS connection to kubelet, flag --kubelet-insecure-tls removed in chart v3.12', all 12 HPAs reporting 'unable to fetch metrics: metrics API not available', notready state on metrics aggregation, horizontal pod autoscaler disabled cluster-wide
frontend pods overwhelmed — HPA stuck at minReplicas=2 during 5x traffic surge, kubelet reporting resource pressure on nodes hosting frontend pods, requests queuing with 15-second latency, autoscaling non-functional due to missing metrics, container restart count increasing on overloaded pods

Neural Engine Root Cause Analysis

The Kubernetes metrics-server pod is failing to start due to a configuration incompatibility after a Helm chart upgrade to v3.12+. The error indicates that the --kubelet-insecure-tls flag was removed, but the current deployment configuration still relies on insecure TLS connections to kubelets. This is causing the pod to crash-loop every 30 seconds, resulting in complete HPA functionality loss across the cluster as the metrics API is unavailable.
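When a chart release removes a flag the deployment still depends on, one mitigation is to re-add it explicitly through a Helm values override. The flag name comes from the error message above; the args key layout is an assumption about the metrics-server chart's values schema and should be checked against the installed chart version.

```shell
# Sketch of a Helm values override that restores insecure kubelet TLS
# (appropriate only for clusters whose kubelets serve self-signed certs).
# The 'args' key name is an assumption about the chart's values layout.
cat > metrics-server-values.yaml <<'EOF'
args:
  - --kubelet-insecure-tls
  - --kubelet-preferred-address-types=InternalIP
EOF
# helm upgrade metrics-server metrics-server/metrics-server \
#   -n kube-system -f metrics-server-values.yaml   # requires a live cluster
```

The longer-term fix is proper kubelet serving certificates, at which point this override can be dropped.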

Remediation Plan

1. Update the metrics-server deployment to use secure TLS by adding the --kubelet-use-node-status-port flag or configuring proper TLS certificates.
2. Alternatively, if secure TLS setup is complex, temporarily add the --kubelet-insecure-tls flag back to the deployment args until proper certificates can be configured.
3. Apply the configuration change and wait for the pod to stabilize.
4. Verify that the metrics API is accessible and HPA controllers can fetch metrics.
5. Monitor for successful pod startup and resumption of metrics collection.
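The steps above could be executed along these lines. The cluster-touching commands are commented because they assume kubectl access to the affected cluster; the deployment and namespace names match the scenario, and the JSON patch path assumes args is the first container's existing argument list.

```shell
# Step 2 (temporary fix): restore the insecure-TLS flag on the deployment.
# Assumes metrics-server is container index 0 and already has an args list.
cat > metrics-server-patch.json <<'EOF'
[{"op": "add",
  "path": "/spec/template/spec/containers/0/args/-",
  "value": "--kubelet-insecure-tls"}]
EOF
# kubectl patch deployment metrics-server -n kube-system \
#   --type=json --patch-file metrics-server-patch.json

# Step 3: wait for the pod to stabilize.
# kubectl rollout status deployment/metrics-server -n kube-system

# Steps 4-5: verify the metrics API and HPA metric collection.
# kubectl get apiservice v1beta1.metrics.k8s.io   # should report Available
# kubectl top nodes                               # should return node metrics
# kubectl get hpa -A   # TARGETS should show percentages, not <unknown>
```

Once the HPAs show real utilization figures instead of `<unknown>`, scaling resumes and the surge can be absorbed.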
Tested: 2026-03-30 | Monitors: 2 | Incidents: 2 | Test ID: cmnckdmzc08qiobqe53wsh3uy