VMs inaccessible with "State in Doubt" and "VMFS Heartbeat Timeouts" reported in the logs.

Article ID: 432799

Products

VMware vSphere ESXi

Issue/Introduction

  • The VM may experience latency that makes applications unstable or unusable.

  • The virtual machine may become completely unresponsive, so that no remote connections can be established.

  • The virtual machine may also appear unresponsive and inaccessible in vSphere.

 

 

Environment

VMware vSphere ESXi 8.x

VMware vSphere ESX 9.x

 

Cause

This issue is caused by instability in the external storage array or the SAN fabric layer, or by the local HBA driver/firmware stack failing to process I/O acknowledgments. In either case, the ESXi host reports that it cannot maintain a consistent I/O path to the storage devices.

The "state in doubt" message indicates that the storage stack cannot determine the status of the device, typically because of physical-layer interruptions or hardware-level congestion.

VMFS heartbeat timeouts confirm that the host is losing access to the volume metadata for extended periods.
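
To gauge how long each loss of access lasted, the `esx.problem.vmfs.heartbeat.timedout` and `esx.problem.vmfs.heartbeat.recovered` events recorded in /var/log/vobd.log can be paired. The following is a minimal sketch, not a supported tool. It assumes that each line begins with an ISO-8601 UTC timestamp ending in `Z` and that the first token after the event tag identifies the volume; both are assumptions about the log layout and should be checked against the actual file. The function name `heartbeat_outages` is hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: "<ISO timestamp>Z ... [esx.problem.vmfs.heartbeat.<kind>] <volume> ..."
EVENT_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?)Z.*"
    r"\[esx\.problem\.vmfs\.heartbeat\.(timedout|recovered)\]\s*(\S*)"
)


def _parse_ts(text):
    """Parse the timestamp with or without fractional seconds."""
    fmt = "%Y-%m-%dT%H:%M:%S.%f" if "." in text else "%Y-%m-%dT%H:%M:%S"
    return datetime.strptime(text, fmt)


def heartbeat_outages(lines):
    """Pair each timedout event with the next recovered event for the same volume.

    Returns a list of (volume, start_time, duration_in_seconds) tuples.
    Recovered events with no preceding timedout event are ignored.
    """
    pending = {}   # volume -> time of the first unmatched timedout event
    outages = []
    for line in lines:
        m = EVENT_RE.match(line)
        if m is None:
            continue
        when, kind, volume = _parse_ts(m.group(1)), m.group(2), m.group(3)
        if kind == "timedout":
            pending.setdefault(volume, when)
        elif volume in pending:
            start = pending.pop(volume)
            outages.append((volume, start, (when - start).total_seconds()))
    return outages
```

For example, `heartbeat_outages(open("vobd.log", errors="replace"))` would return one tuple per completed timeout/recovery pair, which gives the storage team concrete outage windows to compare against array and switch logs.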

 

  • In /var/log/vmkernel.log, entries similar to the following appear:

             state in doubt; requested fast path state update

  • /var/log/vobd.log reports [esx.problem.vmfs.heartbeat.timedout] and [esx.problem.vmfs.heartbeat.recovered] events.
  • The vmkernel.log contains log throttle entries, which indicate a large number of errors:

     NMP: nmp_ResetDeviceLogThrottling:####: last error status from device naa.################################

  • Where the source of the issue is on the SAN, you may also see indications of Permanent Device Loss (PDL) with the SCSI code H: 0x0 D: 0x2 P: 0x0 Valid sense data: 0x5 0x25 0x0
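
The signatures listed above can be checked programmatically when reviewing a large log bundle. The following is a minimal sketch that matches only the strings quoted in this article and decodes the SCSI status fields from a line. The function names (`classify_line`, `parse_sense`, `is_pdl_sense`) are hypothetical, and `is_pdl_sense` recognizes only the single sense code quoted above, not every code that can indicate PDL.

```python
import re

# Signatures taken from the symptoms listed in this article.
SIGNATURES = {
    "state_in_doubt": re.compile(r"state in doubt; requested fast path state update"),
    "heartbeat_timedout": re.compile(r"\[esx\.problem\.vmfs\.heartbeat\.timedout\]"),
    "heartbeat_recovered": re.compile(r"\[esx\.problem\.vmfs\.heartbeat\.recovered\]"),
    "log_throttle": re.compile(r"nmp_ResetDeviceLogThrottling"),
}

# Host/device/plugin status followed by sense key, ASC, and ASCQ.
SENSE_RE = re.compile(
    r"H:\s*(0x[0-9a-fA-F]+)\s+D:\s*(0x[0-9a-fA-F]+)\s+P:\s*(0x[0-9a-fA-F]+)"
    r"\s+Valid sense data:\s*(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)"
)


def classify_line(line):
    """Return the names of all signatures found in one log line."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(line)]


def parse_sense(line):
    """Return the SCSI status fields as integers, or None if the line has none."""
    m = SENSE_RE.search(line)
    if m is None:
        return None
    keys = ("host", "device", "plugin", "sense_key", "asc", "ascq")
    return dict(zip(keys, (int(value, 16) for value in m.groups())))


def is_pdl_sense(fields):
    """True only for the example quoted above: D: 0x2 with sense data 0x5 0x25 0x0."""
    return (
        fields is not None
        and fields["device"] == 0x2
        and (fields["sense_key"], fields["asc"], fields["ascq"]) == (0x5, 0x25, 0x0)
    )
```

Running `classify_line` over each line of vmkernel.log and vobd.log gives a quick count of each symptom, and `is_pdl_sense(parse_sense(line))` flags lines that carry the PDL sense code shown above.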

 

Resolution

Verify the HBA stack on the host, then work with the storage (SAN) and fabric vendors to investigate the storage array and the physical paths (switches, cables, and SFPs):

  1. HBA Driver and Firmware Check: Verify that the Host Bus Adapter (HBA) driver and firmware versions are listed in the Broadcom Compatibility Guide (HCL) and are supported for the version of ESXi in the environment.

  2. Engage Storage and Fabric Vendors: Contact the storage (SAN) and fabric vendors to investigate the health of the storage array controllers and the physical paths (switches, cables, and SFPs).

  3. Data Correlation: Provide the storage and fabric teams with the specific device ID (naa.################################) and the timestamps from vmkernel.log and vobd.log at which the log throttling or heartbeat timeouts were recorded.

  4. Hardware Inspection: Inspect the physical SAN switch logs for port flapping or faulty SFP modules that could cause intermittent path instability.
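
The data gathering in step 3 can be sketched as a short script that groups the relevant log entries by device ID and records the first and last timestamp for each. This is an illustration under stated assumptions, not a supported tool: it assumes each log line begins with an ISO-8601 UTC timestamp ending in `Z` and that lines are in chronological order. The function name `summarize` and the placeholder key for lines without a device ID are hypothetical.

```python
import re
from collections import defaultdict

# Assumed line prefix: an ISO-8601 UTC timestamp such as 2024-01-01T10:00:00.123Z
TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z)")
DEVICE_RE = re.compile(r"\bnaa\.[0-9a-fA-F]+")

# Substrings quoted in this article that mark an entry worth reporting.
KEYWORDS = (
    "state in doubt",
    "nmp_ResetDeviceLogThrottling",
    "esx.problem.vmfs.heartbeat.timedout",
)
NO_DEVICE = "(no device ID in line)"


def summarize(lines):
    """Group matching entries by device ID.

    Returns {device: {"count": int, "first": str, "last": str}}.
    Lines without a leading timestamp are skipped.
    """
    summary = defaultdict(lambda: {"count": 0, "first": None, "last": None})
    for line in lines:
        if not any(keyword in line for keyword in KEYWORDS):
            continue
        ts = TIMESTAMP_RE.match(line)
        if ts is None:
            continue
        dev = DEVICE_RE.search(line)
        entry = summary[dev.group(0) if dev else NO_DEVICE]
        entry["count"] += 1
        if entry["first"] is None:
            entry["first"] = ts.group(1)
        entry["last"] = ts.group(1)
    return dict(summary)
```

Calling `summarize` on the lines of vmkernel.log and vobd.log yields a per-device table of event counts and time ranges that can be passed to the storage and fabric teams. Entries that do not contain an naa identifier, such as heartbeat events, are collected under the placeholder key.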

Additional Information

The following Knowledge Base articles are related:

  • "state in doubt; requested fast path state update" error in vmkernel.log

  • Understanding lost access to volume messages in ESXi

  • Permanent Device Loss (PDL) and All-Paths-Down (APD) on host