Hosts enter a Not Responding state.
VMs residing on the affected host become disconnected.
VMware ESXi 7.x
VMware ESXi 8.x
The issue is caused by transient storage errors in which vmhba paths enter a failed state, resulting in All Paths Down (APD) events. This temporary loss of storage connectivity impacts hostd, causing the host to become non-responsive.
Reference:
From the ESXi host, /var/log/vmkernel.log:
2025-02-16T09:56:42.769Z cpu0:2097409)VMW_SATP_ALUA: satp_alua_issueCommandOnPath:843: Path (vmhba2:C0:T0:L7) command 0xa3 failed with transient error status Transient storage condition, suggest retry. sense data: 0x6 0x29 0x0. Waiting for 20 seconds fo$
2025-02-16T09:56:42.770Z cpu3:2098243)NMP: nmp_ThrottleLogForDevice:3867: Cmd 0xa3 (0x45b9e368fc88, 0) to dev "naa.###########################" on path "vmhba2:C0:T0:L8" Failed.
2025-02-16T09:57:49.777Z cpu64:2098246)NMP: nmp_ThrottleLogForDevice:3875: H:0x2 D:0x8 P:0x0 . Act:EVAL. cmdId.initiator=0x############ CmdSN 0x59
2025-02-16T09:57:49.777Z cpu64:2098246)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.###########################" state in doubt; requested fast path state update...
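The status fields in these log entries can be decoded programmatically when scanning a large vmkernel.log. The following is a minimal Python sketch, not a VMware tool. The function name `summarize` and the lookup tables are hypothetical and deliberately partial: they cover only the values seen in the excerpt above (host status 0x2 = BUS BUSY, device status 0x8 = BUSY, sense key 0x6 = UNIT ATTENTION). Unknown values are returned as raw hex.

```python
import re

# Partial lookup tables (assumption: only values from the excerpt are mapped).
HOST_STATUS = {0x0: "OK", 0x2: "BUS BUSY"}
DEVICE_STATUS = {0x0: "GOOD", 0x2: "CHECK CONDITION", 0x8: "BUSY"}
SENSE_KEYS = {0x6: "UNIT ATTENTION"}

HEX = r"(0x[0-9a-fA-F]+)"
STATUS_RE = re.compile(rf"H:{HEX} D:{HEX} P:{HEX}")
SENSE_RE = re.compile(rf"sense data: {HEX} {HEX} {HEX}")
PATH_RE = re.compile(r"vmhba\d+:C\d+:T\d+:L\d+")


def summarize(line):
    """Extract path names, H/D/P status codes, and sense data from one log line."""
    result = {"paths": PATH_RE.findall(line)}

    match = STATUS_RE.search(line)
    if match:
        host, device, plugin = (int(value, 16) for value in match.groups())
        result["host"] = HOST_STATUS.get(host, hex(host))
        result["device"] = DEVICE_STATUS.get(device, hex(device))
        result["plugin"] = hex(plugin)

    match = SENSE_RE.search(line)
    if match:
        key, asc, ascq = (int(value, 16) for value in match.groups())
        result["sense_key"] = SENSE_KEYS.get(key, hex(key))
        result["asc_ascq"] = f"{asc:#04x}/{ascq:#04x}"

    return result
```

Calling `summarize` on each line of the log and filtering for entries that contain a `host` or `sense_key` field gives a quick view of which vmhba paths reported errors and when. In the excerpt, sense data 0x6 0x29 0x0 corresponds to a UNIT ATTENTION condition with ASC/ASCQ 0x29/0x00 (power on, reset, or bus device reset occurred), which is consistent with a transient disruption on the storage path.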
To resolve the issue, address the underlying FC storage connectivity problems, which are reported as "Lost access to volume due to connectivity issues" or "Path redundancy to storage device degraded".
NOTE: The HBA driver and firmware must be listed in the Broadcom Compatibility Guide.
Contact your server manufacturer, or check their website, for more information and best practices regarding the driver and firmware versions compatible with your HBA hardware.