ESXi hosts are running Cisco NFNIC drivers for storage.
HA status is invalid and attempting to enable HA on the cluster will not succeed.
In an environment with both HA and DRS enabled, a mass migration event can also occur, which can cause resource contention. This happens when the vmware-fdm VIB fails on some of the hosts in the cluster and DRS then attempts to migrate VMs to hosts on which the vSphere HA agent is still responding.
This is caused by a known issue in the interaction between the vmware-fdm VIB and the Cisco NFNIC drivers. When an environment experiences storage path flapping events, the Cisco NFNIC drivers increment the path number for the storage connection rather than trying to re-establish the connection on the existing path number.
The vobd.log file on the affected hosts shows entries like the following, in which the T# (Target) value continually increases each time the path is removed and re-added:
YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5638946463382us: [vob.scsi.scsipath.add] Add path: vmhba2:C0:T579:L254
YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5638956792108us: [vob.scsi.scsipath.add] Add path: vmhba2:C0:T578:L254
YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5639016597552us: [vob.scsi.scsipath.add] Add path: vmhba2:C0:T580:L254
YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5639189521271us: [vob.scsi.scsipath.add] Add path: vmhba2:C0:T581:L254
YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5639275852865us: [vob.scsi.scsipath.add] Add path: vmhba2:C0:T582:L254
YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5639277608251us: [vob.scsi.scsipath.add] Add path: vmhba2:C0:T583:L254
YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5639315873118us: [vob.scsi.scsipath.add] Add path: vmhba2:C0:T584:L254
YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5639492899791us: [vob.scsi.scsipath.remove] Remove path: vmhba2:C0:T578:L254
YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5639573618367us: [vob.scsi.scsipath.remove] Remove path: vmhba2:C0:T540:L254
YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5639585295240us: [vob.scsi.scsipath.remove] Remove path: vmhba2:C0:T556:L254
YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5639586984720us: [vob.scsi.scsipath.remove] Remove path: vmhba2:C0:T579:L254
YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5639587700619us: [vob.scsi.scsipath.remove] Remove path: vmhba2:C0:T577:L254
YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5639589197433us: [vob.scsi.scsipath.remove] Remove path: vmhba2:C0:T575:L254
YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5639590633999us: [vob.scsi.scsipath.remove] Remove path: vmhba2:C0:T576:L254
YYYY-MM-DD HH:MM:SS:MS: [scsiCorrelator] 5639592140707us: [vob.scsi.scsipath.remove] Remove path: vmhba2:C0:T574:L254
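As a rough aid for spotting this pattern, the sketch below counts path add and remove events in a vobd.log file and reports the highest target number seen in an add event. It assumes the line format shown in the excerpt above and a default log location of /var/log/vobd.log; the function name summarize_path_churn is hypothetical and not part of any VMware tooling.

```shell
# Sketch only: summarize SCSI path churn from a vobd.log file.
# Assumes the "[vob.scsi.scsipath.add] Add path: vmhbaN:C#:T#:L#" format shown above.
# summarize_path_churn is a hypothetical helper name, not a VMware command.
summarize_path_churn() {
    log="${1:-/var/log/vobd.log}"

    # grep -c exits non-zero when the count is 0, so tolerate that case.
    adds=$(grep -c -F 'vob.scsi.scsipath.add' "$log" || true)
    removes=$(grep -c -F 'vob.scsi.scsipath.remove' "$log" || true)

    # Extract the numeric T# value from each add event and keep the largest.
    max_target=$(grep -F 'vob.scsi.scsipath.add' "$log" \
        | sed -n 's/.*:C[0-9]*:T\([0-9]*\):L[0-9]*.*/\1/p' \
        | sort -n | tail -n 1)

    echo "adds=$adds removes=$removes max_target=${max_target:-none}"
}
```

Running summarize_path_churn /var/log/vobd.log at intervals and seeing max_target keep climbing would be consistent with the behavior described here, though it is not conclusive by itself.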
This constant iteration of the vmhba paths causes the hostAgentStats-20.stats file in the /var/lib/vmware/hostd/stats folder to grow unchecked. Once hostAgentStats-20.stats grows too large, the vmware-fdm VIB crashes.
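To see how large the stats file has become, a minimal check such as the following can be used. This article does not state the size at which the crash occurs, so the sketch only reports the size in bytes and leaves interpretation to the reader. The function name stats_file_size is hypothetical.

```shell
# Sketch only: print the size in bytes of the hostd stats file.
# stats_file_size is a hypothetical helper name; no size threshold is assumed.
stats_file_size() {
    f="${1:-/var/lib/vmware/hostd/stats/hostAgentStats-20.stats}"

    if [ ! -f "$f" ]; then
        echo "missing"
        return 1
    fi

    # Strip any padding that some wc implementations add around the number.
    wc -c < "$f" | tr -d ' '
}
```

Comparing the value across hosts in the same cluster may help identify which hosts are affected by the path iteration.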
The /var/log/fdm.log file will show log messages similar to the following:
YYYY-MM-DD HH:MM:SS:MS verbose fdm[5174368] [Originator@6876 sub=Election opID=SWI-60b7acd9] Default send buffer size is 9216 msg size 312
YYYY-MM-DD HH:MM:SS:MS verbose fdm[5174368] [Originator@6876 sub=Election opID=SWI-60b7acd9] Set send buffer size to 65536 bytes
YYYY-MM-DD HH:MM:SS:MS verbose fdm[5174368] [Originator@6876 sub=Election opID=SWI-60b7acd9] New send buffer size is 65536
YYYY-MM-DD HH:MM:SS:MS error fdm[5174372] [Originator@6876 sub=Default opID=SWI-41a7] Unable to allocate memory
YYYY-MM-DD HH:MM:SS:MS panic fdm[5174372] [Originator@6876 sub=Default opID=SWI-41a7]
-->
--> Panic: Unable to allocate memory
--> Backtrace:
--> [backtrace begin] product: VMware Fault Domain Manager, version: 7.0.3, build: build-24024786, tag: fdm, cpu: x86_64, os: linux, buildType: release
--> backtrace[00] fdm[0x00E560C9]
--> backtrace[01] fdm[0x00DD2305]
--> backtrace[02] fdm[0x00D10649]
--> backtrace[03] fdm[0x00D7AFEA]
<snip>
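A quick way to check a host for this signature is to count the matching panic lines in fdm.log. The sketch below assumes the panic text appears exactly as in the excerpt above; the function name count_fdm_oom_panics is hypothetical.

```shell
# Sketch only: count out-of-memory panic lines in an fdm.log file.
# Assumes the literal text "Panic: Unable to allocate memory" as shown above.
# count_fdm_oom_panics is a hypothetical helper name.
count_fdm_oom_panics() {
    log="${1:-/var/log/fdm.log}"

    # -F matches the string literally; grep -c exits non-zero on a count of 0,
    # so "|| true" keeps the function usable under "set -e".
    grep -c -F 'Panic: Unable to allocate memory' "$log" || true
}
```

A non-zero count, together with the vobd.log pattern and a large hostAgentStats-20.stats file, suggests that the host matches the issue described in this article.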
VMware is investigating this issue for a permanent fix.
The following steps provide a temporary workaround to resolve the vmware-fdm agent failure.
Stop hostd: /etc/init.d/hostd stop
On the ESXi host, navigate to /var/lib/vmware/hostd/stats and delete the following files:
# cd /var/lib/vmware/hostd/stats
# rm hostAgentStats-20.stats
# rm hostAgentStats-metadata.xml
# rm hostAgentStats.idMap
# rm hostAgentStats.xml
Start hostd: /etc/init.d/hostd start
Upon start, hostd detects that the files are missing and recreates them.
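For hosts where the workaround has to be repeated, the steps above can be wrapped in a small shell function. This is only a sketch of the documented steps, not a VMware-supplied script. The function name reset_hostd_stats and its two optional arguments are hypothetical, and rm -f is used here (instead of plain rm) so that an already-missing file does not abort the sequence.

```shell
# Sketch only: stop hostd, delete the four stats files, start hostd.
# reset_hostd_stats and its arguments are hypothetical, for illustration.
#   $1: stats directory (defaults to the path given in this article)
#   $2: hostd init script (defaults to /etc/init.d/hostd)
reset_hostd_stats() {
    stats_dir="${1:-/var/lib/vmware/hostd/stats}"
    hostd_init="${2:-/etc/init.d/hostd}"

    # Do not touch the files if hostd could not be stopped.
    "$hostd_init" stop || return 1

    # Remove only the files named in the workaround; leave anything else.
    for f in hostAgentStats-20.stats hostAgentStats-metadata.xml \
             hostAgentStats.idMap hostAgentStats.xml; do
        rm -f "$stats_dir/$f"
    done

    # hostd recreates the missing files when it starts.
    "$hostd_init" start
}
```

Calling reset_hostd_stats with no arguments uses the paths from this article. Because the underlying path iteration is not addressed by the workaround, the stats file can be expected to grow again until a permanent fix is available.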