ESXi hosts become not responding in the vCenter due to missing heartbeats in vCenter 7.0 U3

Products

VMware vCenter Server 7.0

Issue/Introduction

After upgrading VC to 7.0 U3 or while running on 7.0 U3, all hosts in the environment enter a not responding state. The vCenter logs state that the host moved to this state due to missing heartbeats, even though heartbeats are showing as received on the network of the VCSA.
vpxd logs, located at /var/log/vmware/vpxd/, show repeated messages "Certificate not available, starting hostsync"
The operation timeout can happen anywhere between 2-30 mins as in the example below:

[YYYY-MM-DDTHH:MM:SS] info vpxd[34753] [Originator@6876 sub=Vmomi opID=HostSync-host-147-79013b58] Creating SOAP stub adapter for /vpxa on <ESXihostfqdn>:443
[YYYY-MM-DDTHH:MM:SS] info vpxd[34753] [Originator@6876 sub=IO.Http opID=HostSync-host-147-79013b58] Terminating user agent request; (null), N7Vmacore16TimeoutExceptionE(Operation timed out)
[YYYY-MM-DDTHH:MM:SS] info vpxd[34753] [Originator@6876 sub=IO.Http opID=HostSync-host-147-79013b58] Set user agent error; state: 1, (null), N7Vmacore16TimeoutExceptionE(Operation timed out)
[YYYY-MM-DDTHH:MM:SS] warning vpxd[34781] [Originator@6876 sub=IO.Connection opID=HostSync-host-XXX-79013b58] Failed to connect; <io_obj p:0x00007fc56ca86dc8, h:-1, <TCP 'xxx.xxx.xxx.xxx : 45208'>, <TCP '<IP> : 443'>>, e: 104(Connection reset by peer), duration: 120020msec
[YYYY-MM-DDTHH:MM:SS] warning vpxd[34753] [Originator@6876 sub=Vmomi opID=HostSync-host-147-79013b58] Creating SOAP stub adapter failed: N7Vmacore24InvalidArgumentExceptionE(No version for VMODL calls to <<cs p:00007fc5a8348eb0, TCP:<ESXihostfqdn>:443>, /vpxa>)
[YYYY-MM-DDTHH:MM:SS] info vpxd[23864] [Originator@6876 sub=Vmomi opID=HostSync-host-147-2f8fc66f] Creating SOAP stub adapter for /vpxa on <ESXihostfqdn>:443

Storage migration of VMs may not proceed after 'select storage page' while validating storage.
VPXD logs showing issues with FdmMonitor:

[YYYY-MM-DDTHH:MM:SS] info vpxd[09943] [Originator@6876 sub=Vmomi opID=sps-Main-416743-850-449976-4] Creating SOAP stub adapter for /vpxa on <hostname>
c.com:443
[YYYY-MM-DDTHH:MM:SS] error vpxd[09943] [Originator@6876 sub=Vmomi opID=sps-Main-416743-850-449976-4] [VpxaClientAdapter] Last login to <Hostname>
failed; timeout; no host sync yet
[YYYY-MM-DDTHH:MM:SS] error vpxd[09943] [Originator@6876 sub=Vmomi opID=sps-Main-416743-850-449976-4] Cannot contact the specified host (<hostname)>
[YYYY-MM-DDTHH:MM:SS] info vpxd[15009] [Originator@6876 sub=Vmomi opID=FdmMonitor-domain-c261957-2919bc76] Creating SOAP stub adapter for /fdm on <hostname>:443
[YYYY-MM-DDTHH:MM:SS] warning vpxd[15009] [Originator@6876 sub=Vmomi opID=FdmMonitor-domain-c261957-2919bc76] [FdmClientAdapter] Login request throttled on host <hostname>
[YYYY-MM-DDTHH:MM:SS] error vpxd[15009] [Originator@6876 sub=Vmomi opID=FdmMonitor-domain-c261957-2919bc76] Cannot contact the specified host (<hostname>)
[YYYY-MM-DDTHH:MM:SS] info vpxd[15009] [Originator@6876 sub=Vmomi opID=FdmMonitor-domain-c261957-2919bc76] Creating SOAP stub adapter for /fdm on <hostname>:443
[YYYY-MM-DDTHH:MM:SS] warning vpxd[15009] [Originator@6876 sub=Vmomi opID=FdmMonitor-domain-c261957-2919bc76] [FdmClientAdapter] Login request throttled on host <hostname>
[YYYY-MM-DDTHH:MM:SS] error vpxd[15009] [Originator@6876 sub=Vmomi opID=FdmMonitor-domain-c261957-2919bc76] Cannot contact the specified host (<hostname>)

[YYYY-MM-DDTHH:MM:SS] warning vpxd[07057] [Originator@6876 sub=InvtHostCnx opID=HostSync-host-<moid>-2f729613] Exception occurred during host sync; Got method fault; [vim.HostSystem:host-<moid>,<hostname>], e: N3Vim5Fault12InvalidLogin9ExceptionE(Fault cause: vim.fault.InvalidLogin
--> )

Environment

VMware vCenter Server 7.0.3

Cause

Usually, this is caused by a lock which is blocking heartbeats. Some other common causes for this issue can be:

If the password of ESX user "vpxuser" is out of sync with the password maintained by vpxd, which can happen in various scenarios such as:
1. VCSA has been restored from a snapshot, and "vpxuser" password has been updated between the time snapshot was taken and restore time.
2. The password of ESX user "vpxuser" has expired on ESX due to ESX updated setting 'Security.PasswordMaxDays' (default is "never expire" but the user may update that advanced setting)
3. The ESX host has been added to another VC without having been removed from the first VC (the first VC will fail to login to ESX in that case)
Due to mis-configured network connectivity, which causes connection failure from vpxd to ESX host after ~120 seconds timeout.
Any other underlying issue causing ESXi host not responding, typically related to storage/network loss (e.g SD card as described in the KB: 318794)

Resolution

This issue is fixed in VMware vCenter Server 7.0 U3i and VMware vCenter Server 8.0a releases.

To workaround this issue:

Identify the offending host(s) that are unresponsive in the environment and disconnect them from vCenter. Then you can address the issue with the host(s) while allowing all other hosts to remain connected.
Investigate this host for problems and resolve them or disconnect this host from the vCenter Server until the issue can be resolved.
A single host, or a small number of hosts which are not responding to the connection attempt can cause this due to hostd being unresponsive, etc.

To identify if this is the issue you are encountering, look at the vpxd logs for one of the not responding hosts that you have attempted to reconnect. Find the reconnect task:

[YYYY-MM-DDTHH:MM:SS] info vpxd[36085] [Originator@6876 sub=vpxLro opID=l6ejabx0-47722379-auto-seuto-h5:72353065-61] [VpxLRO] -- BEGIN task-5224518 -- host-1607205 -- vim.HostSystem.reconnect -- xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx(xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)

Using the opID, find where this task finished and notice there was no error with the task:

[YYYY-MM-DDTHH:MM:SS] info vpxd[36085] [Originator@6876 sub=vpxLro opID=l6ejabx0-47722379-auto-seuto-h5:72353065-61] [VpxLRO] -- FINISH task-5224518

Next, use the host ID to find the host sync task in the logs:

[YYYY-MM-DDTHH:MM:SS] verbose vpxd[36097] [Originator@6876 sub=InvtHostCnx opID=l6ejabx0-47722570-auto-seuz0-h5:72353107-1a] Queuing host sync; [vim.HostSystem:host-1607205,esxi.domain.com]
[YYYY-MM-DDTHH:MM:SS] verbose vpxd[59454] [Originator@6876 sub=vpxLro opID=l6ejabx0-47722570-auto-seuz0-h5:72353107-1a-51c0b834] [VpxLRO] -- BEGIN lro-1301294 -- -- VpxdInvtHostSyncHostLRO.Synchronize --

Immediately after the host sync is started, we see a failure stating that the connection is not alive due to missing heartbeats:

[YYYY-MM-DDTHH:MM:SS] warning vpxd[59454] [Originator@6876 sub=InvtHostCnx opID=l6ejabx0-47722570-auto-seuz0-h5:72353107-1a-51c0b834] Connection not alive due to missing heartbeats; [vim.HostSystem:host-1607205,esxi.domain.com], cnx: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
[YYYY-MM-DDTHH:MM:SS] warning vpxd[59454] [Originator@6876 sub=MoHost opID=l6ejabx0-47722570-auto-seuz0-h5:72353107-1a-51c0b834] host [vim.HostSystem:host-1607205,esxi.domain.com] connection state changed to NO_RESPONSE
[YYYY-MM-DDTHH:MM:SS] warning vpxd[59454] [Originator@6876 sub=InvtHostCnx opID=l6ejabx0-47722570-auto-seuz0-h5:72353107-1a-51c0b834] Exception occurred during host sync; Host communication failed; [vim.HostSystem:host-1607205,esxi.domain.com], e: N5Vmomi5Fault17HostCommunication9ExceptionE(Fault cause: vmodl.fault.HostCommunication

Note that the time between when the reconnect is started and the failure is less than 30 seconds in these logs. It may be longer, but we should not be marking a host as unresponsive for missing heartbeats in this short of a time span.

If you had run a packet capture looking for network problems, you may notice that the host is reporting a TCP Window Full Message like below:

3685 19:08:08.897786 YYYY-MM-DD HH:MM:SS <Host IP> <vCenter IP> TCP 1602 [TCP Window Full] 443 → 50118 [ACK] Seq=3191140846 Ack=3569770941 Win=66560 Len=1536 TSval=3535247282 TSecr=2844746348 [TCP segment of a reassembled PDU]
3688 19:08:08.899182 YYYY-MM-DD HH:MM:SS <vCenter IP> <Host IP> TCP 66 [TCP ZeroWindow] 50118 → 443 [ACK] Seq=3569770941 Ack=3191202286 Win=0 Len=0 TSval=2844746349 TSecr=353524728

You may also be able to use the following command run against the vpxd logs to identify a set of hosts to investigate for this issue. Note that you must change the <VPXD LOGS> instances in the sample to the log or logs you want to review (replace <VPXD LOGS> with the name of the specific log or a regex for running the command on multiple logs):

or