PASSED: cloud / ansible_playbook_failure

Ansible Playbook Failure — Configuration Drift Across Fleet

An Ansible playbook run against 200 production servers fails midway through execution due to a changed SSH host key on the jump host. 87 servers received the updated configuration while 113 did not, creating a split-brain configuration state across the fleet.

Pattern
FIREWALL_RULE_BLOCK
Severity
CRITICAL
Confidence
85%
Remediation
Remote Hands

Test Results

Metric | Expected | Actual
Pattern Recognition | FIREWALL_RULE_BLOCK | FIREWALL_RULE_BLOCK
Severity Assessment | CRITICAL | CRITICAL
Incident Correlation | Yes | 21 linked
Cascade Escalation | N/A | No

Remediation: Remote Hands (Corax contacts on-site support via call, email, or API)

Scenario Conditions

200 production servers managed by Ansible. Jump host SSH key rotated without updating known_hosts. Playbook partially applied: 87/200 servers updated. No rollback playbook available. NTP, firewall rules, and syslog configuration now inconsistent.
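Before any repair, the split state can be audited by partitioning the fleet by reported version. A minimal sketch, assuming version strings have already been collected into a `hostname version` file (for example with an ad-hoc Ansible command reading a version marker file; the filenames and hostnames below are illustrative, not from the incident):

```shell
# Illustrative collected state: "hostname version" per line.
# In the incident this file would hold 200 entries (87 v2.4, 113 v2.3).
cat > versions.txt <<'EOF'
web01 v2.4
web02 v2.3
db01 v2.4
db02 v2.3
EOF

# Partition the fleet so each half can be targeted explicitly.
awk '$2 == "v2.4" {print $1}' versions.txt > updated.txt
awk '$2 == "v2.3" {print $1}' versions.txt > pending.txt

echo "updated: $(wc -l < updated.txt), pending: $(wc -l < pending.txt)"
```

The `pending.txt` list can then drive a targeted re-run (e.g. via `--limit`) instead of replaying the playbook against hosts that already converged.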

Injected Error Messages (2)

ansible playbook 'prod-hardening-v2.4' failed at task 88/142 — SSH connection to jump host 10.20.1.5 rejected due to host key verification failure, WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED, 87 of 200 hosts completed successfully before failure, 113 hosts unreachable, fleet now in split configuration state, firewall rules inconsistent between updated and non-updated servers, syslog forwarding misconfigured on partial fleet
configuration drift detected across production fleet — 87 servers running config v2.4, 113 servers still on v2.3, firewall rule mismatch causing intermittent connectivity failures between application tiers, NTP source inconsistent (87 servers using new NTP pool, 113 on old), monitoring agents reporting mixed syslog destinations
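The drift described in the second message can be measured without changing anything by replaying the playbook in check mode: hosts still on v2.3 report pending changes, while the 87 updated hosts come back unchanged. A sketch (the `prod.ini` inventory name is an assumption; the playbook name is from the scenario):

```shell
# Dry-run the hardening playbook: --check applies nothing, --diff prints
# the file-level differences, so v2.3 stragglers identify themselves.
ansible-playbook -i prod.ini prod-hardening-v2.4.yml --check --diff
```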

Neural Engine Root Cause Analysis

The Ansible playbook failed due to SSH host key verification failure when connecting to jump host 10.20.1.5, indicating the host's SSH fingerprint has changed (likely due to key rotation, system rebuild, or security incident). This caused the automation to halt mid-execution at task 88/142, leaving the production fleet in an inconsistent state with partial hardening applied, creating security and operational risks. The 10 correlated incidents within the same timeframe suggest this jump host failure has cascaded to other dependent systems and services.

Remediation Plan

1. Immediately verify the legitimacy of the SSH host key change on 10.20.1.5 through out-of-band verification (console access, physical inspection, or a trusted network path).
2. If legitimate, update the known_hosts file in Ansible Tower's SSH configuration to accept the new key.
3. If suspicious, investigate for potential security compromise before proceeding.
4. Re-run the 'prod-hardening-v2.4' playbook from task 88 to complete hardening on the remaining 113 hosts.
5. Validate fleet consistency by running configuration compliance checks across all 200 hosts.
6. Monitor the 10 correlated incidents for resolution as jump host connectivity is restored.
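Steps 1 and 2 can be rehearsed safely against a scratch known_hosts copy before touching the controller. A minimal sketch with placeholder keys (in production you would fetch the presented key with `ssh-keyscan -t ed25519 10.20.1.5`, compare its `ssh-keygen -lf` fingerprint against the out-of-band value, and remove the stale entry from the real file with `ssh-keygen -R 10.20.1.5`):

```shell
# Scratch known_hosts with a stale entry for the jump host (placeholder keys).
printf '10.20.1.5 ssh-ed25519 AAAA...stale\nother.host ssh-ed25519 AAAA...ok\n' > known_hosts.demo

# Placeholder for the new, out-of-band-verified key; in production this
# would come from: ssh-keyscan -t ed25519 10.20.1.5
printf '10.20.1.5 ssh-ed25519 AAAA...rotated\n' > jump-host.pub

# Drop the stale entry (production equivalent: ssh-keygen -R 10.20.1.5),
# then pin the verified replacement; other entries are left untouched.
grep -v '^10.20.1.5 ' known_hosts.demo > known_hosts.tmp
cat known_hosts.tmp jump-host.pub > known_hosts.demo
rm known_hosts.tmp

grep '10.20.1.5' known_hosts.demo   # exactly one entry remains: the rotated key
```

With the key pinned, step 4 can resume against only the unfinished hosts, e.g. `ansible-playbook prod-hardening-v2.4.yml --limit @prod-hardening-v2.4.retry`, assuming retry files are enabled in ansible.cfg.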
Tested: 2026-03-30 | Monitors: 2 | Incidents: 2 | Test ID: cmnckcb3r08fmobqent6weg6k