VCF Cluster Join fails due to vCLS Unhealthy Status caused by Port 902 Heartbeat Loss

Article ID: 438899


Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

  •  During VMware Cloud Foundation (VCF) operations such as adding a host or joining a cluster, the workflow may fail at the "Validate Cluster Health" stage. The vSphere Client indicates that the vSphere Cluster Services (vCLS) status is Unhealthy.



  • Logs in /var/log/vmware/vpxd/vpxd.log show frequent host connection resets:

    YYYY-MM-DDT HH:MM:SS.316Z warning vpxd[#######] [Originator@6876 sub=InvtHostCnx] Connection not alive due to missing heartbeats; [vim.HostSystem:host-####, <ESXi Host Name>]
    YYYY-MM-DDT HH:MM:SS.673Z info vpxd[#######] connection state changed to NO_RESPONSE
  • Logs in /var/log/vmware/eam/eam.log show the affected hosts changing from connected to notResponding:

YYYY-MM-DDT HH:MM:SS.971Z |  INFO | vim-inv-update | VcHostSystem.java | 462 | VcHostSystem(ID: host-2004) connection state changed from connected to notResponding
YYYY-MM-DDT HH:MM:SS.971Z |  INFO | vim-inv-update | VcHostSystem.java | 469 | VcHostSystem(ID: host-2004) power state changed from poweredOn to unknown
YYYY-MM-DDT HH:MM:SS.971Z |  INFO | vim-monitor | OpId.java | 37 | [PropertyCollector(session[5241ef80-2e13-9c76-07fc-a668134aeb99]526c8d48-ecd6-c149-8f78-60f382af1c11)->WaitForUpdatesEx:af84d48639e04880] created from [WaitForUpdatesEx:d2ab38ac16f1bf6]
YYYY-MM-DDT HH:MM:SS.984Z |  INFO | vim-async-2 | OpIdLogger.java | 35 | [PropertyCollector(session[5241ef80-2e13-9c76-07fc-a668134aeb99]526c8d48-ecd6-c149-8f78-60f382af1c11)->WaitForUpdatesEx:af84d48639e04880] Completed.
YYYY-MM-DDT HH:MM:SS.984Z |  INFO | vim-inv-update | VcHostSystem.java | 462 | VcHostSystem(ID: host-2015) connection state changed from connected to notResponding
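To see which hosts are affected and how often, the vpxd.log warnings shown above can be tallied per host. The following is a minimal Python sketch, not a supported tool. The function name count_missed_heartbeats is hypothetical, and the pattern assumes the warning text matches the sample above with a numeric host ID (host-####).

```python
import re
from collections import Counter

# Matches the vpxd warning shown in the sample above and captures the
# host managed object ID (for example host-2004). The exact message
# wording is assumed to match the sample.
HEARTBEAT_RE = re.compile(
    r"Connection not alive due to missing heartbeats; "
    r"\[vim\.HostSystem:(host-\d+)"
)


def count_missed_heartbeats(lines):
    """Return a Counter mapping host ID to number of missed-heartbeat warnings."""
    counts = Counter()
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

The function accepts any iterable of lines, so an open file handle for a copy of vpxd.log can be passed directly. Hosts with the highest counts are the first candidates for the firewall check.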

Environment

  • VMware Cloud Foundation 4.x, 5.x
  • VMware vSphere 7.x
  • VMware vSphere 8.x

Cause

The root cause is a firewall or network security appliance blocking port 902 traffic between vCenter Server and the ESXi hosts. ESXi hosts send heartbeats to vCenter Server over UDP port 902, while TCP port 902 carries management traffic and Network File Copy (NFC) operations between vCenter Server and the ESXi hosts.

Port 902 must be open bi-directionally between vCenter Server and the ESXi management network.

When heartbeats are missed, the host enters a transient "Not Responding" state, which triggers a failure in the vSphere ESX Agent Manager (EAM) and results in an unhealthy vCLS state. VCF automation is programmed to halt when vCLS health is not "Green."
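As a quick first check of the TCP side of these requirements, a plain connection attempt from a machine on the vCenter Server network segment can show whether a firewall is dropping TCP 443 or TCP 902 toward a host. The following is a minimal Python sketch under that assumption. The function names are hypothetical. It cannot confirm UDP 902 heartbeat delivery, because UDP is connectionless, so a packet capture is still required for that.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def check_tcp_ports(host, ports=(443, 902)):
    """Return a dict mapping each TCP port to its reachability for one host."""
    return {port: tcp_port_open(host, port) for port in ports}
```

For example, check_tcp_ports("<ESXi_IP_Address>") returns a dictionary keyed by port number. A False value for a port points to a firewall rule or a stopped service on that path, not necessarily to the heartbeat problem itself.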

Resolution

To resolve this issue, perform the following steps:

  1. Ensure that UDP Port 902 and TCP Port 443 are permitted for bi-directional communication between the vCenter Server and the ESXi Management network.

  2. Verify that heartbeats are flowing:

    1. From an SSH session on the ESXi host, verify that heartbeats are being sent: pktcap-uw --uplink <vmnic> --capture UplinkSndKernel --udpport 902 -o - | tcpdump-uw -enr -

    2. On the vCenter Server Appliance (VCSA), verify that heartbeats are being received: tcpdump src host <ESXi_IP_Address> and udp port 902

  3. Log in to the VCSA via SSH and restart the ESX Agent Manager service: service-control --restart vmware-eam

  4. Reset vCLS Health:

    1. Navigate to the affected Cluster in the vSphere Client.

    2. Retrieve the Cluster Domain ID from the browser URL (e.g., domain-c###).

    3. Navigate to vCenter Server > Configure > Advanced Settings.

    4. Set config.vcls.clusters.domain-c###.enabled to False.

    5. Wait for vCLS VMs to be removed, then set the value back to True.
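
The advanced setting name in step 4 is built from the Cluster Domain ID, so a typo in the ID creates a setting that has no effect. The following minimal Python sketch derives the setting name from a copied vSphere Client URL. The function name vcls_setting_key is hypothetical, and the sketch only assumes that the URL contains the domain-c### token described above.

```python
import re

# The Cluster Domain ID appears in the vSphere Client URL as domain-c###.
DOMAIN_RE = re.compile(r"domain-c\d+")


def vcls_setting_key(client_url):
    """Build the vCLS advanced setting name for the cluster in the given URL."""
    match = DOMAIN_RE.search(client_url)
    if match is None:
        raise ValueError("no Cluster Domain ID (domain-c###) found in URL")
    return f"config.vcls.clusters.{match.group(0)}.enabled"
```

Pass the full URL from the browser address bar while the affected cluster is selected. The returned string is the name to enter under Advanced Settings.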