ESXi Host Path Inflation (Target ID Storm) and Cross-Fabric I/O Blocking triggered by Cisco SAN "Slow Drain" and nfnic Driver Bug



Article ID: 436399


Products

VMware vSphere ESXi

Issue/Introduction

ESXi hosts in a Cisco UCS environment may experience a massive, rapid increase in storage path counts (exceeding 3,000), which eventually results in both vmhba adapters reporting path-down events. This behavior leads to severe performance degradation, "All Paths Down" (APD) conditions, or a Purple Screen of Death (PSOD), even when the physical root cause is limited to a single faulty SFP on a single SAN fabric.

The issue typically begins with redundancy loss on the HBA connected to the failing fabric (e.g., vmhba0). However, due to logical congestion spreading and a "Target ID Storm" within the storage stack, the failure eventually escalates to impact the redundant healthy adapter (e.g., vmhba1). This cascading effect can occur across multiple ESXi hosts and across different vCenter environments simultaneously.

Additional symptoms reported:

  • Guest OS logs reporting "pvscsi" resets and "iScsiPrt" connection lost events.

  • Degraded VM performance leading to SQL database corruption or VMs becoming inaccessible.

  • Path counts reaching near the ESXi configuration maximum (4,096) before resetting and beginning to increment again.

  • A physical-layer trigger identified as a "sick but not dead" SFP on a Cisco MDS switch port, causing Fibre Channel Slow Drain.
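Path inflation can be gauged by counting runtime path names per adapter in `esxcli storage core path list` output. The sketch below runs against a saved copy of that output; the sample entries and the `/tmp/paths.txt` file name are invented for illustration.

```shell
# Sample excerpt of `esxcli storage core path list` output; on a real host,
# redirect the live command's output to a file instead.
cat > /tmp/paths.txt <<'EOF'
   Runtime Name: vmhba0:C0:T1380:L1
   Runtime Name: vmhba0:C0:T1381:L1
   Runtime Name: vmhba1:C0:T2:L1
EOF
# A healthy host has a stable count per adapter; a Target ID Storm shows
# thousands of entries with ever-increasing target numbers on one adapter.
awk -F'[ :]+' '/Runtime Name/ {c[$4]++} END {for (a in c) print a, c[a]}' /tmp/paths.txt
```

The count on the adapter facing the sick fabric climbs toward the 4,096 maximum described above, while the healthy adapter's count stays near baseline until the cross-fabric escalation begins.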

Environment

  • ESXi 7.x or 8.x

  • Pure Storage FlashArray (Fibre Channel)

  • Cisco MDS 9000 Series Switches

  • Cisco UCS B-Series or C-Series Servers

Cause

The failure is the result of a three-stage cascading event involving the physical layer, fabric logic, and host driver behavior:

  1. Fibre Channel Slow Drain (Physical Trigger): A "sick but not dead" SFP on a Cisco MDS switch generates bit errors or fails to return Buffer-to-Buffer (B2B) credits. This creates "back-pressure" or a "Slow Drain" that spreads from the switch to the storage array controller. Ref: KB429270 (Congestion Spreading)

  2. Target ID Storm / nfnic Driver Bug (CSCwt85731): The unstable fabric triggers Registered State Change Notifications (RSCNs). The nfnic driver attempts to recover by re-logging into the fabric. Due to the congestion, logout requests time out. The driver fails to recognize the returning port and assigns it a brand-new logical Target ID (e.g., T1380 to T1458). This results in the rapid accumulation of thousands of "ghost paths" in the kernel. Ref: KB404949

  3. Cross-Fabric I/O Blocking (Host Escalation): The ghost paths consume kernel heap and drive the total path count toward the ESXi maximum (4,096). The storage stack's continuous discovery and recovery work then blocks I/O on the healthy adapter (e.g., vmhba1) as well, turning a single-fabric fault into host-wide APD or a PSOD.
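The Slow Drain trigger can often be confirmed from the MDS CLI before any host-side work. The commands below are standard MDS NX-OS show commands; the port number fc1/12 is illustrative only.

```
! Look for non-zero TxWait, credit-loss recovery, CRC/invalid words
show interface fc1/12 counters detailed
! Check SFP Rx/Tx power and current against alarm/warning thresholds
show interface fc1/12 transceiver details
! Review historical TxWait recorded by on-board failure logging (OBFL)
show logging onboard txwait
```

A "sick but not dead" SFP typically shows marginal Rx power with intermittent CRC/invalid-word counts rather than a hard link-down event.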

Resolution

1. Fabric Remediation

  • Administrative Shutdown: Identify the MDS switch port with the faulty SFP and perform an administrative shutdown. Simply replacing the SFP while the port is live may not stop the RSCN storm or clear the nfnic driver's login loop.
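The administrative shutdown is a standard NX-OS interface operation; a minimal sketch follows, assuming the faulty SFP sits in port fc1/12 (substitute the port identified during fabric diagnostics).

```
configure terminal
interface fc1/12
  shutdown
end
```

Leave the port administratively down until the replacement SFP is installed and the RSCN storm has subsided, then re-enable it with `no shutdown`.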

2. Software and Driver Upgrades

Upgrade the environment to the following levels (or higher) to address the Target ID Storm bug: https://bst.cloudapps.cisco.com/bugsearch/bug/CSCwt85731

  • nfnic Driver: Version 5.0.0.50 (Addresses Cisco Bug CSCwt85731).

  • UCS Firmware: Baseline 6.0.1e (Improves ABTS/Abort sequence handling between the VIC and the driver).
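Current driver levels can be verified per host from the ESXi Shell or an SSH session; these are standard esxcli commands.

```shell
# List the installed nfnic VIB and its version
esxcli software vib list | grep -i nfnic
# Show the loaded nfnic kernel module details
esxcli system module get -m nfnic
```

Hosts reporting a version below 5.0.0.50 remain exposed to the Target ID Storm behavior even after the fabric is repaired.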

3. Host Recovery

  • Reboot: Any ESXi host that has reached a path count significantly higher than its baseline (e.g., 1,000+ paths) should be gracefully rebooted after the fabric is stabilized. This is the only way to reliably flush the thousands of "unregistered" path objects and reclaim kernel memory heap.
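A quick way to compare a host against its baseline before deciding to reboot (standard esxcli command, run per host):

```shell
# Total claimed paths on this host; compare against the environment's
# normal per-host baseline (hosts at 1,000+ above baseline need a reboot)
esxcli storage core path list | grep -c "Runtime Name"
```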

4. Fabric Hardening (Prevention)

  • Enable Port Guard / MAPS: Configure the Cisco MDS switches to use Port Guard or Congestion Drop. This ensures that if a port exceeds a threshold of txwait or CRC errors, the switch will automatically disable the port (err-disable), preventing the "Slow Drain" from poisoning the rest of the SAN fabric.
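On MDS, this protection is configured through a port-monitor policy with a portguard action. The sketch below is illustrative only: the policy name and threshold values are invented, and the exact `counter txwait` syntax and supported options vary by NX-OS release, so validate against the Interfaces Configuration Guide for the deployed version.

```
configure terminal
port-monitor name SLOWDRAIN-GUARD
  ! err-disable the port if TxWait exceeds the threshold in a poll interval
  counter txwait poll-interval 1 delta rising-threshold 40 event 4 falling-threshold 0 event 4 portguard errordisable
exit
port-monitor activate SLOWDRAIN-GUARD
```

With this in place, a future "sick but not dead" SFP is isolated automatically instead of congesting the fabric until an administrator intervenes.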