PSA/Ticketing Platform Outage — Service Desk Paralyzed
The ConnectWise Manage PSA platform becomes completely unreachable after a database failover goes wrong. The MSP service desk cannot create, update, or view tickets. Automated ticket creation from monitoring alerts queues up and eventually starts dropping. SLA tracking is offline.
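The alert backlog behavior described above, where ticket-creation requests queue up and then start dropping, can be sketched as a bounded buffer. This is a minimal sketch, not the actual pipeline: the class name, the alert shape, and the demo capacity of 3 are illustrative assumptions.

```python
from collections import deque

class AlertBuffer:
    """Bounded buffer for monitoring alerts while the PSA API is down.

    Once capacity is reached, the oldest queued alerts are evicted,
    mirroring the "queues up and eventually starts dropping" failure mode.
    """

    def __init__(self, capacity: int = 200):  # 200 mirrors the scenario's queue depth
        self.queue: deque = deque(maxlen=capacity)
        self.dropped = 0

    def enqueue(self, alert: dict) -> None:
        if len(self.queue) == self.queue.maxlen:
            self.dropped += 1  # deque will silently evict the oldest alert
        self.queue.append(alert)

# Demo with a tiny capacity so the drop behavior is visible:
buf = AlertBuffer(capacity=3)
for i in range(5):
    buf.enqueue({"id": i})
print(len(buf.queue), buf.dropped)  # → 3 2
```

Dropping the oldest alerts (rather than rejecting new ones) is itself a design choice; either way, SLA-relevant alerts are lost until the PSA is reachable again.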
Pattern: CONNECTION_REFUSED
Severity: CRITICAL
Confidence: 85%
Remediation: Remote Hands
Test Results

| Metric | Expected | Actual | Result |
| --- | --- | --- | --- |
| Pattern Recognition | CONNECTION_REFUSED | CONNECTION_REFUSED | |
| Severity Assessment | CRITICAL | CRITICAL | |
| Incident Correlation | Yes | 42 linked | |
| Cascade Escalation | Yes | Yes | |
| Remediation | — | Remote Hands (Corax contacts on-site support via call, email, or API) | |
Scenario Conditions
ConnectWise Manage cloud-hosted PSA. Database failover during maintenance window failed. API returning connection refused. 25 technicians unable to access tickets. 200+ queued alerts not creating tickets. Client portal also down.
Injected Error Messages (3)

1. ConnectWise Manage PSA completely unreachable — connection refused on all endpoints, database failover failed during maintenance, web interface returning connection refused, 25 technicians locked out of service desk
2. PSA API returning connection refused — all API calls to /v4_6_release/apis/3.0 failing with connection refused, automated ticket creation halted, 200+ monitoring alerts queued without ticket creation, SLA clock tracking offline
3. Webhook receiver unable to reach PSA — ticket creation webhook returning connection refused when forwarding to Manage API, 200+ alerts queued in dead letter queue, integration pipeline stalled, manual ticket triage required
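The webhook receiver's fallback path from error 3 can be sketched as a POST with a dead-letter fallback. Only the `/v4_6_release/apis/3.0` path comes from the error message above; the host and the in-memory list standing in for the dead letter queue are assumptions.

```python
import json
import urllib.error
import urllib.request

# Hypothetical host; only the API path appears in the injected error message.
MANAGE_API = "https://psa.example.com/v4_6_release/apis/3.0/service/tickets"

dead_letter_queue: list = []

def forward_alert(alert: dict, url: str = MANAGE_API) -> bool:
    """Try to create a ticket from a monitoring alert.

    On connection refused (or any transport error), park the alert in
    the dead letter queue for replay once the PSA accepts connections.
    """
    req = urllib.request.Request(
        url,
        data=json.dumps(alert).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=5):
            return True
    except (urllib.error.URLError, OSError):  # ConnectionRefusedError is an OSError
        dead_letter_queue.append(alert)
        return False
```

Once the Manage API is back, the queued alerts can be drained from `dead_letter_queue` and re-posted in order, which is the manual-triage work the error message describes avoiding.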
Neural Engine Root Cause Analysis
The ConnectWise Manage PSA system has suffered a complete service failure following a database failover operation during maintenance. The connection refused errors across all endpoints indicate that the primary application services are not running or accepting connections, likely due to the database failover not completing successfully. This has created a cascading failure affecting 25 technicians and generating 16 correlated incidents, suggesting multiple dependent services are impacted by the PSA system being unreachable.
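The cascade the analysis describes, one failed component taking down every service that transitively depends on it, can be sketched as a breadth-first walk over a dependency graph. The service names and the dependency map below are hypothetical, chosen to match the components mentioned in this scenario.

```python
from collections import deque

# Hypothetical dependency map: service -> services it depends on.
DEPENDS_ON = {
    "ticket-webhook": ["manage-api"],
    "client-portal": ["manage-api"],
    "sla-tracker": ["manage-api"],
    "manage-api": ["manage-db"],
    "alert-pipeline": ["ticket-webhook"],
}

def impacted_by(failed: str) -> set:
    """Breadth-first walk: collect every service that directly or
    transitively depends on the failed component."""
    hit, frontier = set(), deque([failed])
    while frontier:
        down = frontier.popleft()
        for svc, deps in DEPENDS_ON.items():
            if down in deps and svc not in hit:
                hit.add(svc)
                frontier.append(svc)
    return hit

print(sorted(impacted_by("manage-db")))
# → ['alert-pipeline', 'client-portal', 'manage-api', 'sla-tracker', 'ticket-webhook']
```

A failed database failover therefore surfaces not as one incident but as one incident per dependent service, which is consistent with the correlated incidents reported above.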
Remediation Plan
1. Immediately check database cluster status and failover state to confirm that primary/secondary roles are properly established.
2. Verify database connectivity from the application servers using the configured connection strings and credentials.
3. Restart ConnectWise Manage application services in the proper sequence (database connections first, then web services).
4. Check application logs for database connection errors or startup failures.
5. Verify load balancer health checks and endpoint configurations.
6. Test service restoration with limited user access before opening full access.
7. Monitor for cascading recovery of the 16 correlated incidents.
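Steps 1, 2, and 5 all reduce to reachability checks, which can be sketched with a TCP socket probe; the hostnames and ports below are illustrative assumptions, not ConnectWise defaults.

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """TCP-level reachability check: True if the service accepts a
    connection, False on connection refused, timeout, or DNS failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical endpoints for the database and web tiers.
checks = {
    "database": ("db.psa.example.com", 1433),
    "web": ("psa.example.com", 443),
}
for name, (host, port) in checks.items():
    state = "listening" if port_open(host, port) else "refused/unreachable"
    print(f"{name}: {state}")
```

A socket probe only proves the listener is up; step 2's connection-string check (authenticating and running a query) is still needed to confirm the database is actually serving the application.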