A Nutanix Controller VM (CVM) crashes on one node of a 4-node cluster. All VMs on that node lose local storage access. The cluster attempts to serve I/O from surviving CVMs but performance degrades significantly.
Pattern: NUTANIX_EVENT
Severity: CRITICAL
Confidence: 85%
Remediation: Remote Hands
Test Results
| Metric | Expected | Actual | Result |
| --- | --- | --- | --- |
| Pattern Recognition | NUTANIX_EVENT | NUTANIX_EVENT | |
| Severity Assessment | CRITICAL | CRITICAL | |
| Incident Correlation | Yes | 36 linked | |
| Cascade Escalation | Yes | Yes | |
| Remediation | - | Remote Hands: Corax contacts on-site support via call, email, or API | |
Scenario Conditions
4-node Nutanix cluster (NX-3460). RF2 storage. 60 VMs total, 15 on affected node. CVM crashed due to Stargate process segfault. Prism Central monitoring active.
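The scenario numbers imply some quick arithmetic that is worth making explicit. The sketch below is a simplified model, not Nutanix's own fault-tolerance calculation: it assumes RF2 means two copies of each piece of data, so one failure can be absorbed, and it ignores metadata and quorum services. The class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterScenario:
    """Hypothetical container for the scenario conditions listed above."""
    nodes: int
    replication_factor: int
    total_vms: int
    vms_on_failed_node: int

    def affected_vm_fraction(self) -> float:
        # Share of VMs whose local storage path is lost with the CVM.
        return self.vms_on_failed_node / self.total_vms

    def tolerable_failures(self) -> int:
        # Assumption: RF-n keeps n copies, so n - 1 failures can be absorbed.
        return self.replication_factor - 1

    def remaining_tolerance(self, failed_nodes: int) -> int:
        # Failures the cluster could still absorb after `failed_nodes` are down.
        return max(self.tolerable_failures() - failed_nodes, 0)


scenario = ClusterScenario(
    nodes=4, replication_factor=2, total_vms=60, vms_on_failed_node=15
)
print(f"{scenario.affected_vm_fraction():.0%}")      # 25%
print(scenario.remaining_tolerance(failed_nodes=1))  # 0
```

Under these assumptions, a quarter of the VMs are directly affected, and the cluster has no remaining headroom for a second storage failure until the CVM on node-3 is back.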
Injected Error Messages (3)
Nutanix CVM down on node-3 — CVM unreachable at 10.10.50.103, Stargate process crash (segfault), local storage I/O redirected to remote CVMs, Prism alert: CRITICAL
Nutanix cluster health degraded — storage degraded after CVM failure on node-3, Prism Central alert: node storage offline, curator scan initiated for data rebalance
SQL Server experiencing high I/O latency after Nutanix CVM down on its host node — storage latency 45ms (baseline 2ms), query timeouts increasing
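To put the reported latency in proportion, the figures in the third message can be turned into a degradation factor. This is a minimal sketch: the regular expression is an assumption based only on the wording of that message, and the function name is hypothetical.

```python
import re
from typing import Optional

# Assumed message shape: "storage latency <N>ms (baseline <M>ms)".
LATENCY_PATTERN = re.compile(
    r"storage latency (?P<current>\d+(?:\.\d+)?)ms "
    r"\(baseline (?P<baseline>\d+(?:\.\d+)?)ms\)"
)


def latency_degradation(message: str) -> Optional[float]:
    """Return current/baseline latency, or None if the message has no such figures."""
    match = LATENCY_PATTERN.search(message)
    if match is None:
        return None
    baseline = float(match.group("baseline"))
    if baseline == 0:
        return None  # avoid dividing by a zero baseline
    return float(match.group("current")) / baseline


message = "storage latency 45ms (baseline 2ms), query timeouts increasing"
factor = latency_degradation(message)
print(f"{factor:.1f}x")  # 22.5x
```

A 45 ms latency against a 2 ms baseline is a 22.5-fold increase, which is consistent with the query timeouts reported for SQL Server.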
Neural Engine Root Cause Analysis
The Nutanix Controller VM (CVM) on node-3 has crashed due to a segmentation fault in the Stargate process, which is the core distributed storage service. This segfault has rendered the entire CVM unreachable at 10.10.50.103:9440, causing local storage I/O to be redirected to remote CVMs and triggering a critical Prism alert. The presence of 14 correlated incidents suggests this may be part of a broader cluster instability or cascading failure affecting multiple nodes.
Remediation Plan
1. Immediately check cluster health and remaining node capacity to ensure service continuity.
2. Access the hypervisor hosting node-3 and attempt to restart the CVM via vCenter/AHV.
3. Monitor CVM boot process and check /home/nutanix/data/logs for segfault details in stargate.log.
4. If restart fails, escalate to Nutanix support with crash dumps and consider emergency procedures to maintain cluster quorum.
5. Investigate root cause of segfault (memory corruption, hardware issues, or software bugs) to prevent recurrence.
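Step 3 of the plan, checking the log for segfault details, could be partly automated along the following lines. This is a sketch under stated assumptions: the keyword list is a guess at how a crash might be recorded, the function names are hypothetical, and the sample lines are placeholders rather than real Stargate output.

```python
import re
from pathlib import Path
from typing import Iterable, List, Tuple

# Assumed crash markers; adjust to whatever the actual log contains.
CRASH_PATTERN = re.compile(r"segfault|sigsegv|segmentation fault", re.IGNORECASE)


def find_crash_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line that mentions a crash marker."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if CRASH_PATTERN.search(line)
    ]


def scan_log(path: Path) -> List[Tuple[int, str]]:
    """Scan a log file on disk, tolerating undecodable bytes."""
    with path.open(errors="replace") as handle:
        return find_crash_lines(handle)


# Placeholder lines for illustration only, not real Stargate log output.
sample = [
    "placeholder: service started\n",
    "placeholder: received SIGSEGV\n",
    "placeholder: Segmentation fault in worker\n",
]
for number, text in find_crash_lines(sample):
    print(number, text)
```

On the CVM itself, `scan_log` would be pointed at the log file named in step 3. The matching lines, with their line numbers, are a reasonable starting point for the crash details requested in step 4.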