Storage Heartbeat Timeouts and Path Instability on ESXi Hosts with ALUA Configuration



Article ID: 434717


Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms

  • ESXi hosts report intermittent "Lost connectivity to storage heartbeat" or "Host-path redundancy degraded" alarms.
  • Specific datastores (e.g., L2) show high latency or disconnects, while others (L0, L1) remain functional.
  • vmkernel.log displays frequent H:0xc (Soft Error) SCSI status codes only on one path: 
    vmkernel: cpu48:2098390)NMP: nmp_ThrottleLogForDevice:3893: Cmd 0xa0 (0x45d9ae0b35c0, 0) to dev "naa.xxx" on path "vmhba64:C0:T0:L0" Failed:
    vmkernel: cpu48:2098390)NMP: nmp_ThrottleLogForDevice:3898: H:0xc D:0x0 P:0x0 . Act:NONE. cmdId.initiator=0x45392079b908 CmdSN 0x0
  • Hardware statistics show physical link degradation on one HBA: 
    localcli storage san fc stats get -a vmhba64
       ...
       Link Failure Count: 2100.
       ...
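The per-adapter error counters can be filtered mechanically rather than read by eye. The following shell sketch defines a hypothetical helper, `flag_link_failures`, that scans the output of `localcli storage san fc stats get` for an elevated Link Failure Count; the threshold of 100 is an arbitrary illustrative value, not an official VMware limit:

```shell
# Sketch: flag an elevated "Link Failure Count" in the output of
# `localcli storage san fc stats get -a vmhbaN`.
# flag_link_failures is a hypothetical helper name; the threshold (100)
# is illustrative only.
flag_link_failures() {
  awk -F': ' '
    /Link Failure Count/ {
      count = $2
      gsub(/[^0-9]/, "", count)      # strip the trailing "." seen in the stats output
      if (count + 0 > 100) print "WARN link failures: " count
    }
  '
}

# Example usage (adapter name is illustrative):
#   localcli storage san fc stats get -a vmhba64 | flag_link_failures
```

A count of 2100, as in the excerpt above, would be flagged; a healthy adapter with a near-zero count produces no output.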

Environment

ESXi

Cause

The issue occurs when a datastore's Active (Preferred) path is assigned to an HBA experiencing physical link instability (e.g., a faulty SFP module or fiber cable). In an ALUA environment, the ESXi host sends I/O via the preferred "Active" path. If that HBA (e.g., vmhba64) has hardware faults, every datastore whose preferred path runs through it suffers timeouts. Datastores whose preferred "Active" paths are on a healthy HBA (e.g., vmhba2) appear normal, because they only use the problematic HBA as a standby "Active Unoptimized" path.

Sample Diagnostic Output

The output of localcli storage nmp path list | egrep "Runtime Name:|Device:|Group State:" shows the following pattern, which identifies the imbalance:

   Runtime Name: vmhba64:C0:T0:L2   Device: naa.xxx   Group State: active              <-- Preferred path is on unstable HBA
   Runtime Name: vmhba2:C0:T0:L2    Device: naa.xxx   Group State: active unoptimized  <-- Standby path is on healthy HBA

Note: In this scenario, the datastore on LUN L2 will experience timeouts because its preferred (Active) path runs through vmhba64, which has physical link errors.
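The pairing of each Runtime Name with its Group State can also be extracted automatically. This is a minimal sketch (the helper name `list_active_paths` is hypothetical) that prints only the paths whose ALUA group state is exactly "active", i.e. the preferred/optimized paths; it assumes the Runtime Name and Group State fields appear on separate lines, as in the actual nmp path list output:

```shell
# Sketch: print the runtime name of every path whose Group State is
# exactly "active" (the ALUA optimized path), from the output of
# `localcli storage nmp path list`.
# list_active_paths is a hypothetical helper name.
list_active_paths() {
  awk '
    /Runtime Name:/ { rt = $3 }           # remember the most recent path name
    /Group State:/ {
      state = $3
      for (i = 4; i <= NF; i++) state = state " " $i
      if (state == "active") print rt     # "active unoptimized" paths are skipped
    }
  '
}

# Example usage:
#   localcli storage nmp path list | list_active_paths
```

In the scenario above, this would print only vmhba64:C0:T0:L2, immediately showing that the unstable HBA owns the preferred path.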

 

Resolution

  1. Identify the failing HBA: Review the ESXi vmkernel.log or hardware management logs (e.g., IMM, iLO) for link resets or high error counts on a specific vmhba.
  2. Verify Path Assignments: Use the following command to identify which HBA holds the Active path for the affected datastore: localcli storage nmp path list | egrep "Runtime Name:|Device:|Group State:"
  3. Replace the Physical Hardware: Replace the SFP module and fiber-optic cable on the identified unstable HBA.
  4. Confirm Path Stability: After the hardware replacement, verify that all paths return to a stable state and that the "Active" paths are no longer toggling.
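Step 4 can be made concrete by capturing two snapshots of the path listing some time apart and comparing them; any difference means path states are still changing. A minimal sketch, where the helper name `paths_stable` and the snapshot file names are illustrative:

```shell
# Sketch: compare two snapshots of path state taken some time apart.
# paths_stable is a hypothetical helper; it returns success (0) when the
# two snapshot files are identical, i.e. no path changed state between them.
paths_stable() {
  diff -q "$1" "$2" >/dev/null
}

# Example usage (file names and interval are illustrative):
#   localcli storage nmp path list | egrep "Runtime Name:|Group State:" > /tmp/paths.1
#   sleep 60
#   localcli storage nmp path list | egrep "Runtime Name:|Group State:" > /tmp/paths.2
#   paths_stable /tmp/paths.1 /tmp/paths.2 && echo "paths stable"
```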