ESX Host Unresponsiveness and Virtual Machine Inaccessibility Due to Storage Latency or Fabric Issues

Article ID: 392616


Products

VMware vSphere ESXi

Issue/Introduction

  • The ESX host may appear connected in the vCenter Server inventory but might not respond to management operations.
  • The host may also show as Unresponsive or Not Connected, and attempting to reconnect the host to vCenter through the vSphere UI fails.
  • Some, or all, virtual machines on the affected host become inaccessible and may show as Disconnected in the vSphere UI.
  • Power operations such as Power On, Power Off, or Reset may fail for virtual machines residing on the host.
  • Restarting ESX management agents (hostd, vpxa) may not restore host or virtual machine responsiveness.
  • Navigation in the host's Direct Console User Interface (DCUI, accessed via KVM, IPMI, etc.) may be slow, with long delays when moving between or within menus, and accessing the shell through the DCUI may not work correctly.
  • In the /var/run/log/vmkernel.log file, the following warning messages may be seen:

    • "ALERT: hostd performance has degraded due to high system latency"
    • "Devices/volumes experiencing 'Internal Target Failure'"
    • Sense Code 0xB 44/00 = Aborted Command / Internal Target Failure
      Refer: Interpreting SCSI sense codes in VMware ESXi

    • YYYY-MM-DDTHH:MM:SSZ cpu##:#######)ALERT: hostd performance has degraded due to high system latency
      -----
      YYYY-MM-DDTHH:MM:SSZ cpu##:#######)ScsiDeviceIO: ####: Cmd(#x############) #x##, CmdSN #x######## from world ####### to dev "naa.#######" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x44 0x0
      YYYY-MM-DDTHH:MM:SSZ cpu##:#######)NMP: nmp_ThrottleLogForDevice:####: Cmd #x## (#x############, #######) to dev "naa.#######" on path "vmhba#:##:##:###" Failed:

  • In the /var/run/log/vmkwarning.log file, messages like "state in doubt; requested fast path state update..." may appear, as well as messages stating that hostd was detected to be non-responsive and/or reporting PDL (Permanent Device Loss):

    • YYYY-MM-DDTHH:MM:SSZ cpu###:#######)WARNING: nfnic: <#>: fnic_abort_cmd: ####: Abort for cmd tag: #x### in pending state
      YYYY-MM-DDTHH:MM:SSZ cpu###:#######)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:###: NMP device "naa.#######" state in doubt; requested fast path state update...
      -----
      YYYY-MM-DDTHH:MM:SSZ Al(###) vmkalert: cpu#:#######)ALERT: hostd detected to be non-responsive
      -----
      YYYY-MM-DDTHH:MM:SSZ Wa(###) vmkwarning: cpu#:#######)WARNING: NMP: nmp_PathDetermineFailure:####: Cmd (#x##) PDL error (0x5/0x25/0x0) - path vmhba#:C#:T#:L# device naa.#### - triggering path failover
      YYYY-MM-DDTHH:MM:SSZ Wa(###) vmkwarning: cpu#:#######)WARNING: NMP: nmp_DeviceRetryCommand:###: Device "naa.####": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.

  • The hostd log (/var/run/log/hostd.log) shows messages reporting increasing I/O latency. This can happen even when the vmkernel logs do not show driver or SCSI I/O errors:

    • YYYY-MM-DDTHH:MM:SSZ Wa(###) Hostd[#######] [Originator@#### sub=IoTracker] In thread #######, stat("/vmfs/volumes/datastoreUUID/folderName/VMname-sesparse.vmdk") took over 43799 sec.
      YYYY-MM-DDTHH:MM:SSZ Wa(###) Hostd[#######] [Originator@#### sub=IoTracker] In thread #######, stat("/vmfs/volumes/datastoreUUID/folderName/VMname-sesparse.vmdk") took over 43809 sec

    • This may result in "ALERT: hostd detected to be non-responsive" messages in the vmkernel logs.
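When triaging, the log signatures above can be located quickly with grep. The paths below are the live-host defaults; in a support bundle the same files sit under var/run/log/, so LOGDIR is overridable:

```shell
# Pull the key symptom lines from the host logs (or from a support bundle).
# LOGDIR defaults to the live ESXi log directory; override it for bundles.
LOGDIR="${LOGDIR:-/var/run/log}"
grep -hE "hostd performance has degraded|Valid sense data: 0xb 0x44" "$LOGDIR/vmkernel.log" 2>/dev/null || true
grep -hE "state in doubt|hostd detected to be non-responsive|PDL error" "$LOGDIR/vmkwarning.log" 2>/dev/null || true
grep -hE "sub=IoTracker.*took over" "$LOGDIR/hostd.log" 2>/dev/null || true
```

Matching lines across all three logs, correlated by timestamp, help establish whether the latency started at the device, the path, or hostd itself.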

Cause

  • This issue typically occurs when high system latency or storage-related delays impact the responsiveness of the ESX management service, hostd.
  • As a result, the host becomes unresponsive to management operations while still appearing connected in vCenter. Contributing factors may include:
    • Storage array performance degradation
    • Fabric issues, such as SAN switch/zoning delays or intermittent path failures
    • SCSI command failures with sense key 0xB / ASC 44/00 indicating Internal Target Failure
    • Aborted commands observed due to path or array-level issues
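The "Valid sense data" triple in these log lines (sense key, ASC, ASCQ) decodes field by field. A minimal sketch covering only the codes seen in this issue (not a full SPC sense-code table):

```shell
# Decode the "Valid sense data: KEY ASC ASCQ" triple for the codes in this KB.
# Only the codes relevant to this issue are mapped; consult the full SPC
# sense-code tables for anything else.
decode_sense() {
  key="$1"; asc="$2"; ascq="$3"
  case "$key" in
    0xb) key_txt="Aborted Command" ;;
    0x5) key_txt="Illegal Request" ;;
    *)   key_txt="(see SPC sense-key table)" ;;
  esac
  case "$asc/$ascq" in
    0x44/0x0) asc_txt="Internal Target Failure" ;;
    0x25/0x0) asc_txt="Logical Unit Not Supported (seen with PDL)" ;;
    *)        asc_txt="(see SPC ASC/ASCQ table)" ;;
  esac
  echo "$key_txt / $asc_txt"
}
decode_sense 0xb 0x44 0x0   # -> Aborted Command / Internal Target Failure
```

So 0xb 0x44 0x0 points at the target (array) rather than the host, which is why the storage vendor is engaged first in the resolution steps below.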

Resolution

The issue may be caused by storage array performance degradation or a fabric-related issue. To resolve it:

    1. Engage the storage vendor to investigate latency at the storage-array level.
    2. Check the storage fabric health, including SAN switches, zoning, and connectivity between the ESX host and storage array.
    3. Monitor storage response times to identify anomalies or bottlenecks in the data path.

Workarounds:

  • Option 1: Reset the Fibre Channel HBA(s) serving the affected datastore.
    1. Identify the vmhba used by the affected datastore. If it is not evident from the logs, use:
       esxcfg-mpath -L | grep naa.################################

    2. Reset the related vmhbaX (repeat the command for every related vmhba):
       localcli storage san fc reset -A vmhba1
       localcli storage san fc reset -A vmhba2

  • Option 2: Restart the management agents as per KB 320280 (Restarting Management Agents in ESXi):
       /etc/init.d/hostd restart
       /etc/init.d/vpxa restart

  • Option 3: Reboot the ESX host.
    1. Migrate the virtual machines that can be moved to other hosts, and gracefully shut down the virtual machines that cannot be migrated but are still accessible via the guest OS.

    2. Reboot the ESX host. If necessary, hard-reset the affected host using the KVM/IPMI console.
       Note: Upon reboot, the High Availability (HA) mechanism, if configured, may restart the affected virtual machines on other available hosts within the cluster.
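The identify-and-reset workaround steps above can be combined into one pass. A sketch that only prints the reset commands for manual review, using hypothetical `esxcfg-mpath -L` output (the device ID and worldwide names below are placeholders; on a live host capture the real output instead):

```shell
# Hypothetical captured `esxcfg-mpath -L` output; on a live host use
#   MPATH_OUT="$(esxcfg-mpath -L)" instead.
MPATH_OUT='vmhba1:C0:T0:L1 state:active naa.60000000000000ff vmhba1 0 0 1 NMP active san fc.2000:2100 fc.5000:5001
vmhba2:C0:T0:L1 state:active naa.60000000000000ff vmhba2 0 0 1 NMP active san fc.2001:2101 fc.5002:5003'
DEVICE="naa.60000000000000ff"   # hypothetical affected device ID

# Derive the unique vmhba list serving the device, then print the reset
# commands for review (localcli bypasses the unresponsive hostd service).
printf '%s\n' "$MPATH_OUT" | grep "$DEVICE" | awk -F: '{print $1}' | sort -u |
  while read -r hba; do
    echo "localcli storage san fc reset -A $hba"
  done
```

Run the printed commands only after confirming each listed adapter is a Fibre Channel HBA; resetting an HBA briefly disrupts all I/O on its paths.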

Additional Information