vSphere HA agent failure / invalid on ESXi host due to vmware-fdm agent running out of memory
Article ID: 379616

Products

  • VMware vSphere ESX 7.x
  • VMware vSphere ESX 8.x

Issue/Introduction

The ESXi hosts are running Cisco NFNIC drivers for storage.

The vSphere HA status is invalid, and attempts to enable HA on the cluster do not succeed.

In a cluster with both HA and DRS enabled, a mass migration event can also occur, causing resource contention. This happens when the vmware-fdm vib fails on some of the hosts in the cluster and DRS then attempts to migrate VMs to the hosts on which the vSphere HA agent is still responding.

Environment

  • ESXi 7.0.x
  • ESXi 8.0.x

Cause

Due to a known issue with the interaction between the vmware-fdm vib and Cisco NFNIC drivers, when an environment experiences storage path flapping events the Cisco NFNIC driver increments the target number for the storage connection each time, rather than re-establishing the storage path connection on the existing path number.

When investigating the vobd.log file on the affected hosts, entries like the following appear, where the T# (Target) number continually increases each time the path is removed and re-added:

YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5638946463382us: [vob.scsi.scsipath.add] Add path: vmhba2:C0:T579:L254
YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5638956792108us: [vob.scsi.scsipath.add] Add path: vmhba2:C0:T578:L254
YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5639016597552us: [vob.scsi.scsipath.add] Add path: vmhba2:C0:T580:L254
YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5639189521271us: [vob.scsi.scsipath.add] Add path: vmhba2:C0:T581:L254
YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5639275852865us: [vob.scsi.scsipath.add] Add path: vmhba2:C0:T582:L254
YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5639277608251us: [vob.scsi.scsipath.add] Add path: vmhba2:C0:T583:L254
YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5639315873118us: [vob.scsi.scsipath.add] Add path: vmhba2:C0:T584:L254
YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5639492899791us: [vob.scsi.scsipath.remove] Remove path: vmhba2:C0:T578:L254
YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5639573618367us: [vob.scsi.scsipath.remove] Remove path: vmhba2:C0:T540:L254
YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5639585295240us: [vob.scsi.scsipath.remove] Remove path: vmhba2:C0:T556:L254
YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5639586984720us: [vob.scsi.scsipath.remove] Remove path: vmhba2:C0:T579:L254
YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5639587700619us: [vob.scsi.scsipath.remove] Remove path: vmhba2:C0:T577:L254
YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5639589197433us: [vob.scsi.scsipath.remove] Remove path: vmhba2:C0:T575:L254
YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5639590633999us: [vob.scsi.scsipath.remove] Remove path: vmhba2:C0:T576:L254
YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5639592140707us: [vob.scsi.scsipath.remove] Remove path: vmhba2:C0:T574:L254
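As a quick check, the path churn can be quantified by counting add and remove events in the log. The snippet below is a minimal sketch that runs against sample lines written to a temporary file; on an affected host, set LOG to /var/log/vobd.log instead.

```shell
# Count SCSI path add/remove events. The sample lines below stand in for
# /var/log/vobd.log so the snippet can run anywhere; on an affected ESXi
# host, set LOG=/var/log/vobd.log instead.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
[scsiCorrelator] [vob.scsi.scsipath.add] Add path: vmhba2:C0:T579:L254
[scsiCorrelator] [vob.scsi.scsipath.add] Add path: vmhba2:C0:T580:L254
[scsiCorrelator] [vob.scsi.scsipath.remove] Remove path: vmhba2:C0:T578:L254
EOF
adds=$(grep -c 'vob\.scsi\.scsipath\.add' "$LOG")
removes=$(grep -c 'vob\.scsi\.scsipath\.remove' "$LOG")
echo "path adds: $adds, path removes: $removes"
rm -f "$LOG"
```

A healthy host shows only occasional events; on an affected host both counts keep climbing as the paths flap.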

This constant iteration of the vmhba paths causes the hostAgentStats-20.stats file in the /var/lib/vmware/hostd/stats directory to grow unchecked. Once hostAgentStats-20.stats grows too large, the vmware-fdm agent crashes.
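The growth can be confirmed by checking the size of the stats file. The sketch below demonstrates the check against a temporary file so it runs anywhere; on the host, set STATS to /var/lib/vmware/hostd/stats/hostAgentStats-20.stats.

```shell
# Report the size of the hostd stats file. A 1 MiB temporary file stands
# in here; on an affected ESXi host, set
# STATS=/var/lib/vmware/hostd/stats/hostAgentStats-20.stats instead.
STATS=$(mktemp)
head -c 1048576 /dev/zero > "$STATS"    # simulate a 1 MiB stats file
size=$(wc -c < "$STATS" | tr -d ' ')
echo "hostAgentStats size: $size bytes"
rm -f "$STATS"
```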

The /var/log/fdm.log file shows log messages similar to the following:

YYYY-MM-DD HH:MM:SS:MS verbose fdm[5174368] [Originator@6876 sub=Election opID=SWI-60b7acd9] Default send buffer size is 9216 msg size 312
YYYY-MM-DD HH:MM:SS:MS verbose fdm[5174368] [Originator@6876 sub=Election opID=SWI-60b7acd9] Set send buffer size to 65536 bytes
YYYY-MM-DD HH:MM:SS:MS verbose fdm[5174368] [Originator@6876 sub=Election opID=SWI-60b7acd9] New send buffer size is 65536
YYYY-MM-DD HH:MM:SS:MS error fdm[5174372] [Originator@6876 sub=Default opID=SWI-41a7] Unable to allocate memory
YYYY-MM-DD HH:MM:SS:MS panic fdm[5174372] [Originator@6876 sub=Default opID=SWI-41a7]
-->
--> Panic: Unable to allocate memory
--> Backtrace:
--> [backtrace begin] product: VMware Fault Domain Manager, version: 7.0.3, build: build-24024786, tag: fdm, cpu: x86_64, os: linux, buildType: release
--> backtrace[00] fdm[0x00E560C9]
--> backtrace[01] fdm[0x00DD2305]
--> backtrace[02] fdm[0x00D10649]
--> backtrace[03] fdm[0x00D7AFEA]
<snip>
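A quick way to check a host for this signature is to grep fdm.log for the allocation failure. The snippet below is a sketch using sample lines in a temporary file; on an affected host, set LOG to /var/log/fdm.log instead.

```shell
# Look for the fdm out-of-memory signature. Sample lines stand in for
# /var/log/fdm.log; on an affected ESXi host, set LOG=/var/log/fdm.log.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
error fdm[5174372] [Originator@6876 sub=Default] Unable to allocate memory
panic fdm[5174372] [Originator@6876 sub=Default]
EOF
if grep -q 'Unable to allocate memory' "$LOG"; then
  result="signature found"
else
  result="signature not found"
fi
echo "$result"
rm -f "$LOG"
```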

Resolution

VMware is investigating this issue for a permanent fix.

The following steps provide a temporary workaround for the vmware-fdm agent failure.

  1. Disable HA on the cluster where the affected hosts reside.
    In the vSphere Client, select the Cluster object > Configure > vSphere Availability > Edit > Turn off vSphere HA.

  2. Set DRS automation to Manual. (This prevents a potential mass migration of VMs caused by the vSphere HA agent crash on a host.)
    In the vSphere Client, select the Cluster object > Configure > vSphere DRS > Edit > Automation Level > Set to Manual.

  3. Connect to each of the affected hosts via SSH.
  4. Stop hostd on the host:
    /etc/init.d/hostd stop
  5. On the ESXi host, delete the files in /var/lib/vmware/hostd/stats:

    # cd /var/lib/vmware/hostd/stats
    # rm hostAgentStats-20.stats
    # rm hostAgentStats-metadata.xml
    # rm hostAgentStats.idMap
    # rm hostAgentStats.xml

  6. Start hostd:
    /etc/init.d/hostd start

    Upon start, hostd detects that the files are missing and recreates them.

  7. Once the above steps have been completed on all affected hosts, re-enable HA and DRS on the cluster.
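The per-host portion of the workaround (steps 4 through 6) can be sketched as a single session. By default the script only prints the commands it would run (DRYRUN=1); set DRYRUN=0 on an affected ESXi host to actually execute them, and only after HA has been disabled as in step 1.

```shell
# Consolidated per-host workaround (steps 4-6). With DRYRUN=1 (the
# default) the commands are only printed; set DRYRUN=0 on an affected
# ESXi host to actually run them.
DRYRUN=${DRYRUN:-1}
run() {
  if [ "$DRYRUN" = "1" ]; then echo "would run: $*"; else "$@"; fi
}

run /etc/init.d/hostd stop
for f in hostAgentStats-20.stats hostAgentStats-metadata.xml \
         hostAgentStats.idMap hostAgentStats.xml; do
  run rm "/var/lib/vmware/hostd/stats/$f"
done
run /etc/init.d/hostd start
```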