Virtual Machines are either frozen or have turned invalid

Products

VMware vSphere ESXi

Issue/Introduction

VMFS datastores show a size of 0 MB.

Environment

ESXi 8.0
ESXi 7.0

Cause

The storage pool on the storage array has run out of space.
The LUNs are thin-provisioned and the storage array is overprovisioned.

/var/log/vobd.log
####-##-##T##:##: [vmfsCorrelator] ############us: [vob.vmfs.heartbeat.recovered] Reclaimed heartbeat for volume ########-########-####-############ (Datastore_Name): [Timeout] [HB state ######## offset 4075520 gen 21 stampUS 732062193077 uuid ########-########-####-############ jrnl <FB 23646400> drv 14.81]
####-##-##T##:##: [vmfsCorrelator] ############us: [esx.problem.vmfs.heartbeat.recovered] ########-########-####-############ Datastore_Name
####-##-##T##:##: [vmfsCorrelator] ############us: [vob.vmfs.heartbeat.timedout] ########-########-####-############ Datastore_Name
####-##-##T##:##: [vmfsCorrelator] ############us: [esx.problem.vmfs.heartbeat.timedout] ########-########-####-############ Datastore_Name

/var/log/vmkernel.log
####-##-##T##:## cpu1:3952877)HBX: 3063: 'Datastore_Name': HB at offset 4075520 - Waiting for timed out HB:
####-##-##T##:## cpu1:3952877) [HB state abcdef02 offset 4075520 gen 21 stampUS 732067584325 uuid ########-########-####-############ jrnl <FB 2420800> drv 14.81 lockImpl 4 ip 10.153.43.120]
####-##-##T##:## cpu23:2098079)ScsiDeviceIO: 4115: Cmd(0x45b93a990fc8) 0x2a, CmdSN 0x9d from world 2103896 to dev "naa.###############################" failed H:0x0 D:0x8 P:0x0
####-##-##T##:## cpu23:2098088)ScsiDeviceIO: 4115: Cmd(0x45b916c9bb88) 0x2a, CmdSN 0xb7 from world 3597869 to dev "naa.################################" failed H:0x0 D:0x8 P:0x0
####-##-##T##:## cpu23:2098088)ScsiDeviceIO: 4115: Cmd(0x45b93a7a2348) 0x2a, CmdSN 0xbf from world 3597869 to dev "naa.################################" failed H:0x0 D:0x8 P:0x0
####-##-##T##:## cpu3:2098092)NMP: nmp_ThrottleLogForDevice:3798: last error status from device naa.################################ repeated 2560 times
..
..
####-##-##T##:## cpu0:2097223)ScsiDeviceIO: 4154: Cmd(0x45b93a94f5c8) 0xfe, cmdId.initiator=0x4306e93aa200 CmdSN 0xce272e from world 3821736 to dev "naa.################################" failed H:0x5 D:0x0 P:0x0 . Cmd count Active:0

D:0x8 - Device status (BUSY). This status is returned when a LUN cannot accept SCSI commands at the moment. Because this should be a temporary condition, the command is retried.
H:0x5 - Host status (ABORT). This status is returned if the driver has to abort commands in flight to the target. This can occur due to a command timeout or a parity error in the frame.
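
When many of these entries are present, it can help to pull the H:/D:/P: values out of vmkernel.log programmatically. The following is a minimal sketch, not a supported tool. It assumes the line layout shown in the excerpts above; the function name `decode_scsi_failure` and the two lookup tables are hypothetical and cover only the two codes described in this article.

```python
import re

# Matches the device name and the host/device/plugin status triplet in a
# "ScsiDeviceIO ... failed H:0x.. D:0x.. P:0x.." line, as seen in the excerpts above.
STATUS_RE = re.compile(
    r'dev "(?P<dev>[^"]+)" failed '
    r'H:(?P<host>0x[0-9a-fA-F]+) '
    r'D:(?P<device>0x[0-9a-fA-F]+) '
    r'P:(?P<plugin>0x[0-9a-fA-F]+)'
)

# Assumption: only the two codes discussed in this article are listed here.
HOST_STATUS = {0x5: "driver aborted commands in flight to the target"}
DEVICE_STATUS = {0x8: "LUN cannot accept SCSI commands at the moment"}


def decode_scsi_failure(line):
    """Return the decoded status fields of one log line, or None if it does not match."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    host = int(match.group("host"), 16)
    device = int(match.group("device"), 16)
    plugin = int(match.group("plugin"), 16)
    notes = []
    if host:
        notes.append(HOST_STATUS.get(host, "host status not covered by this sketch"))
    if device:
        notes.append(DEVICE_STATUS.get(device, "device status not covered by this sketch"))
    return {
        "device": match.group("dev"),
        "host": host,
        "device_status": device,
        "plugin": plugin,
        "notes": notes,
    }
```

Feeding each line of the log through `decode_scsi_failure` and counting the results per device gives a quick view of which LUNs are returning D:0x8 or H:0x5.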

Resolution

If the array reports higher usage on the LUNs than the datastores do, space can be reclaimed by running an UNMAP operation. For instructions, see the article Reclaiming VMFS deleted blocks on Thin Provisioned LUNs.
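
As an illustration of the UNMAP step, the sketch below assembles the `esxcli storage vmfs unmap` command line for a datastore. It only builds the argument list and does not run anything. The helper name `build_unmap_command` is hypothetical, and the referenced article remains the authoritative procedure, including whether a manual UNMAP is appropriate for the VMFS version in use.

```python
def build_unmap_command(volume_label, reclaim_unit=None):
    """Build the argv for a manual UNMAP on the datastore with the given label.

    reclaim_unit is the optional number of VMFS blocks to unmap per iteration;
    when omitted, esxcli uses its own default.
    """
    if not volume_label:
        raise ValueError("volume_label must be a non-empty datastore name")
    argv = ["esxcli", "storage", "vmfs", "unmap", "--volume-label=" + volume_label]
    if reclaim_unit is not None:
        if reclaim_unit <= 0:
            raise ValueError("reclaim_unit must be a positive number of blocks")
        argv.append("--reclaim-unit=" + str(reclaim_unit))
    return argv


# Example: print the command for a datastore named Datastore_Name.
print(" ".join(build_unmap_command("Datastore_Name")))
```

Building the command as a list keeps datastore names that contain spaces intact if the list is later passed to a process launcher instead of being joined into a string.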

If additional space is required, consider the following options. It is recommended that these steps be performed under the supervision of the storage vendor.

Remove old snapshot LUNs that are no longer required
Identify unused LUNs and remove them
If spare or unused disks are available, use them to expand the storage pool
If there are failed disks present, replace them to restore capacity of the storage pool
Add another storage shelf and use it to contribute more space to the storage pool