PASSED

infrastructure / zfs_pool_degraded

ZFS Pool Degraded — Drive Failure

A ZFS storage pool enters degraded state after a drive failure in a RAIDZ2 vdev. The pool remains operational but with reduced redundancy. A second drive in the same vdev is showing SMART warnings, indicating imminent failure.

Pattern

PERFORMANCE_DEGRADATION

Severity

CRITICAL

Confidence

95%

Remediation

Remote Hands

Test Results

Metric	Expected	Actual
Pattern Recognition	PERFORMANCE_DEGRADATION	PERFORMANCE_DEGRADATION
Severity Assessment	CRITICAL	CRITICAL
Incident Correlation	Yes	5 linked
Cascade Escalation	N/A	No
Remediation	—	Remote Hands — Corax contacts on-site support via call, email, or API

Scenario Conditions

ZFS pool 'datapool' with 3x RAIDZ2 vdevs (6 drives each). Drive sd-e3 failed in vdev-1. Drive sd-e5 in same vdev showing SMART reallocated sector warnings. Pool degraded but operational. Resilver operation in progress on hot spare.
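The drive counts implied by this layout can be worked out with simple arithmetic. The sketch below is illustrative only: the `PoolLayout` class and its field names are hypothetical, the hot spare is left out because the scenario does not say how many spares exist, and the figures ignore ZFS metadata and padding overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolLayout:
    """Hypothetical helper describing a pool built from identical RAIDZ vdevs."""
    vdevs: int
    drives_per_vdev: int
    parity_per_vdev: int  # 2 for RAIDZ2

    @property
    def total_drives(self) -> int:
        return self.vdevs * self.drives_per_vdev

    @property
    def parity_drives(self) -> int:
        return self.vdevs * self.parity_per_vdev

    @property
    def data_drive_equivalents(self) -> int:
        # Capacity equivalent of the non-parity drives, before any overhead.
        return self.total_drives - self.parity_drives


# Layout taken from the scenario: 3x RAIDZ2 vdevs with 6 drives each.
datapool = PoolLayout(vdevs=3, drives_per_vdev=6, parity_per_vdev=2)
print(datapool.total_drives, datapool.parity_drives, datapool.data_drive_equivalents)
```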

Injected Error Messages (1)

ZFS pool 'datapool' degraded — drive sd-e3 in vdev-1 FAULTED with too many errors, pool running in degraded state with reduced redundancy, resilver in progress on hot spare: 23% complete, WARNING: drive sd-e5 in same vdev showing 847 reallocated sectors in SMART data, second failure would result in data loss for vdev-1, zpool status showing DEGRADED state
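To show which facts a detector could extract from this message, here is a minimal parsing sketch. The `PoolAlert` class, the `parse_alert` function, and the regular expressions are assumptions written for this one message format; they are not part of Corax or of any ZFS tooling.

```python
import re
from dataclasses import dataclass
from typing import Optional

# The injected message from the scenario (the dash is written as an escape).
MESSAGE = (
    "ZFS pool 'datapool' degraded \u2014 drive sd-e3 in vdev-1 FAULTED with too many errors, "
    "pool running in degraded state with reduced redundancy, "
    "resilver in progress on hot spare: 23% complete, "
    "WARNING: drive sd-e5 in same vdev showing 847 reallocated sectors in SMART data, "
    "second failure would result in data loss for vdev-1, "
    "zpool status showing DEGRADED state"
)


@dataclass
class PoolAlert:
    """Hypothetical container for the fields pulled out of the message."""
    pool: str
    faulted_drive: Optional[str] = None
    vdev: Optional[str] = None
    resilver_percent: Optional[int] = None
    warning_drive: Optional[str] = None
    reallocated_sectors: Optional[int] = None


def parse_alert(message: str) -> PoolAlert:
    """Extract pool, drive, resilver, and SMART details; missing parts stay None."""
    pool = re.search(r"ZFS pool '([^']+)'", message)
    if pool is None:
        raise ValueError("message does not name a ZFS pool")
    alert = PoolAlert(pool=pool.group(1))

    faulted = re.search(r"drive (\S+) in (vdev-\d+) FAULTED", message)
    if faulted:
        alert.faulted_drive, alert.vdev = faulted.group(1), faulted.group(2)

    resilver = re.search(r"(\d+)% complete", message)
    if resilver:
        alert.resilver_percent = int(resilver.group(1))

    smart = re.search(r"drive (\S+) in same vdev showing (\d+) reallocated sectors", message)
    if smart:
        alert.warning_drive = smart.group(1)
        alert.reallocated_sectors = int(smart.group(2))
    return alert


alert = parse_alert(MESSAGE)
print(alert)
```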

Neural Engine Root Cause Analysis

The ZFS storage server has experienced a hardware drive failure: drive sd-e3 in vdev-1 has FAULTED due to excessive errors, putting the ZFS pool 'datapool' into a degraded state. A resilver onto a hot spare is in progress (23% complete), but there is a critical secondary risk: drive sd-e5 in the same vdev shows 847 reallocated sectors, which indicates a high likelihood of imminent failure. RAIDZ2 tolerates two failed drives per vdev, so if sd-e5 fails before the resilver completes, vdev-1 would be left with no remaining redundancy. Any further drive failure or unrecoverable read error in vdev-1 would then cause data loss, and because the pool depends on every vdev, losing vdev-1 would mean losing 'datapool'.
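The redundancy margin for vdev-1 can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the standard RAIDZ parity counts (1, 2, and 3 for RAIDZ1, RAIDZ2, and RAIDZ3) and treating the hot spare as contributing nothing until its resilver finishes. The function names are hypothetical.

```python
# Number of drive failures each RAIDZ level tolerates per vdev.
RAIDZ_PARITY = {"raidz1": 1, "raidz2": 2, "raidz3": 3}


def remaining_redundancy(level: str, failed_drives: int) -> int:
    """Additional failures the vdev can absorb; a negative value means data loss."""
    return RAIDZ_PARITY[level] - failed_drives


def describe(level: str, failed_drives: int) -> str:
    margin = remaining_redundancy(level, failed_drives)
    if margin < 0:
        return "data loss"
    if margin == 0:
        return "no redundancy left"
    if failed_drives == 0:
        return "healthy"
    return f"degraded, tolerates {margin} more failure(s)"


# vdev-1 today: sd-e3 has failed and the resilver is still running.
print(describe("raidz2", 1))
# vdev-1 if sd-e5 also fails before the resilver completes.
print(describe("raidz2", 2))
```

The second case is why the suspect drive matters so much here: the vdev would still be readable, but with nothing left to absorb another fault.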

Remediation Plan

1. Immediately monitor the resilver progress and ensure it completes successfully.
2. Once resilver is complete, physically replace the faulted drive sd-e3.
3. Add the new drive back to the pool to restore redundancy.
4. Urgently replace drive sd-e5 due to high reallocated sector count indicating imminent failure.
5. Monitor SMART data on all remaining drives in the pool.
6. Consider adding additional hot spares to prevent future single points of failure.
7. Implement proactive SMART monitoring alerts to catch drive degradation earlier.
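The proactive SMART alerting in the last step could look roughly like the sketch below. The thresholds are placeholder assumptions rather than vendor or ZFS guidance, the function names are hypothetical, and the reallocated-sector counts are expected to come from whatever SMART collector is already in place. The drive name `example-healthy` is invented for illustration; only sd-e5 and its count come from the scenario.

```python
from typing import Dict, List, Tuple

# Placeholder thresholds (assumptions); tune them to the drive fleet in use.
WARN_THRESHOLD = 1
CRITICAL_THRESHOLD = 100


def classify(reallocated: int) -> str:
    """Map a reallocated-sector count to an alert level."""
    if reallocated >= CRITICAL_THRESHOLD:
        return "CRITICAL"
    if reallocated >= WARN_THRESHOLD:
        return "WARNING"
    return "OK"


def drives_needing_attention(counts: Dict[str, int]) -> List[Tuple[str, str, int]]:
    """Return (drive, level, count) for every non-OK drive, worst first."""
    flagged = [
        (drive, classify(count), count)
        for drive, count in counts.items()
        if classify(count) != "OK"
    ]
    return sorted(flagged, key=lambda row: row[2], reverse=True)


# sd-e5 and its count are from the scenario; the second entry is made up.
readings = {"sd-e5": 847, "example-healthy": 0}
for drive, level, count in drives_needing_attention(readings):
    print(f"{level}: {drive} has {count} reallocated sectors")
```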

Tested: 2026-03-30 | Monitors: 1 | Incidents: 1 | Test ID: cmnck066s05zvobqesh7vu6wd