VMware vSAN 8.x
VMware vSphere 8.x
When vLCM remediates hosts in a vSAN cluster, it performs the following steps for each host:
It selects a host, checks for compliance, and then places it into maintenance mode.
Performs patching of the host.
Reboots the host.
It then runs a second compliance check on the host before taking it out of maintenance mode. This check includes a vSAN health check for the cluster, and this is where the remediation fails.
The errors reported are: "vSAN health test 'vMotion: Basic (unicast) connectivity check'" and "vSAN health test 'vMotion: MTU check (ping with large packet size)'".
Because of these errors, the host does not exit maintenance mode and the remediation fails.
The failing vMotion ping tests can be confirmed in the vSAN health summary log on the vCenter Server, /var/log/vmware/vsan-health/vmware-vsan-health-summary-result.log:
2024-03-18T08:48:34.345Z INFO vsan-mgmt[324367] [VsanHealthSummaryLogUtil::PrintHealthResult opID=650af2da] Cluster vSAN_cluster_name Overall Health : red
Group network health : red
Test clusterpartition health : green
Test vsanvmknic health : green
Test smallping health : green
Test largeping health : green
Test vmotionpingsmall health : red
OnlyFailedPings: FromHost ToHost ToDevice PacketSize PingResult
(Host-4139, Host-4142, Vmk1, 64, Red), (Host-4123, Host-4142, Vmk1, 64, Red), (Host-4126, Host-4142, Vmk1, 64, Red), (Host-4147, Host-4142, Vmk1, 64, Red),
(Host-4127, Host-4142, Vmk1, 64, Red), (Host-4130, Host-4142, Vmk1, 64, Red), (Host-4154, Host-4142, Vmk1, 64, Red),
Test vmotionpinglarge health : red
OnlyFailedPings: FromHost ToHost ToDevice PacketSize PingResult
(Host-4139, Host-4142, Vmk1, 8972, Red), (Host-4123, Host-4142, Vmk1, 8972, Red), (Host-4126, Host-4142, Vmk1, 8972, Red), (Host-4147, Host-4142, Vmk1, 8972, Red),
(Host-4127, Host-4142, Vmk1, 8972, Red), (Host-4130, Host-4142, Vmk1, 8972, Red), (Host-4154, Host-4142, Vmk1, 8972, Red),
Group physicaldisks health : green
Test physdiskoverall health : green
Test lsomheap health : green
Test lsomslab health : green
Group hcl health : green
Test controlleronhcl health : green
Test nvmeonhcl health : green
Test controllerreleasesupport health : green
In the output above, every failed vMotion ping (both the 64-byte and the 8972-byte tests) targets Vmk1 on Host-4142, which is the host that was being remediated.
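On larger clusters the failed-ping tuples can be hard to read by eye. The following is a minimal Python sketch, not a VMware tool, that counts failed pings per destination host and device. It assumes the tuple layout matches the excerpt above (FromHost, ToHost, ToDevice, PacketSize, PingResult); the function name `failed_ping_targets` and the shortened `sample` text are illustrative only.

```python
import re
from collections import Counter

# One failed-ping tuple: (FromHost, ToHost, ToDevice, PacketSize, PingResult).
# The layout is assumed from the log excerpt above.
PING_RE = re.compile(r"\((Host-\d+),\s*(Host-\d+),\s*(\w+),\s*(\d+),\s*(\w+)\)")


def failed_ping_targets(log_text):
    """Count failed (red) pings per (destination host, device) pair."""
    counts = Counter()
    for _src, dst, device, _size, result in PING_RE.findall(log_text):
        if result.lower() == "red":
            counts[(dst, device)] += 1
    return counts


# Shortened sample taken from the excerpt above.
sample = """\
Test vmotionpingsmall health : red
OnlyFailedPings: FromHost ToHost ToDevice PacketSize PingResult
(Host-4139, Host-4142, Vmk1, 64, Red), (Host-4123, Host-4142, Vmk1, 64, Red),
Test vmotionpinglarge health : red
OnlyFailedPings: FromHost ToHost ToDevice PacketSize PingResult
(Host-4139, Host-4142, Vmk1, 8972, Red), (Host-4123, Host-4142, Vmk1, 8972, Red),
"""

print(failed_ping_targets(sample).most_common())
# → [(('Host-4142', 'Vmk1'), 4)]
```

If a single host and device account for all failures, as here, the problem is most likely on that host's vMotion VMkernel adapter rather than in the wider network.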
The following snippet from the vCenter log /var/log/vmware/vmware-updatemgr/vum-server/vmware-vum-server.log shows the result of the vSAN health check that vLCM initiates before taking the host out of maintenance mode:
2024-03-18T07:31:20.752Z info vmware-vum-server[41337] [Originator@6876 sub=EHP] [host-4142] [vsan] [com.vmware.vcIntegrity.lifecycle.health.vsan.cluster_before_exit_mm] reported issue: vSAN health test 'vMotion: Basic (unicast) connectivity check' reported an issue for cluster 'vSAN_cluster_name'. Check the vSAN health.
2024-03-18T07:31:20.752Z info vmware-vum-server[41337] [Originator@6876 sub=EHP] [host-4142] [vsan] [com.vmware.vcIntegrity.lifecycle.health.vsan.cluster_before_exit_mm] reported issue: vSAN health test 'vMotion: MTU check (ping with large packet size)' reported an issue for cluster 'vSAN_cluster_name'. Check the vSAN health.
2024-03-18T07:31:20.752Z info vmware-vum-server[41337] [Originator@6876 sub=EHP] [host-4142] All providers have finished. Elapsed time (sec): 81
2024-03-18T07:31:20.752Z info vmware-vum-server[41337] [Originator@6876 sub=EHP] [host-4142] [vSAN] [com.vmware.vcIntegrity.lifecycle.health.vsan.cluster_before_exit_mm] returned status: NOT_OK
2024-03-18T07:31:20.752Z info vmware-vum-server[41337] [Originator@6876 sub=EHP] Entity [host-4142] health status for perspective [BEFORE_EXIT_MAINTENANCE] is: NOT_OK for service(s) [vSAN]
2024-03-18T07:31:20.752Z info vmware-vum-server[40595] [Originator@6876 sub=RemediateClusterTask] [HealthCheck 507] CheckHostHealth - check status name = Cluster health before exit maintenance mode -
2024-03-18T07:31:20.752Z info vmware-vum-server[40595] [Originator@6876 sub=RemediateClusterTask] [ApplyHelpers 481] CheckHostHealth - (hostid = host-4142) - (hostName = hostname.domain.com) - (perspective = 3) - (check result status = 3) - (check timeout = 4200)
.
.
.
2024-03-18T07:31:20.752Z info vmware-vum-server[10309] [Originator@6876 sub=Telemetry] [TelemetryManager 423] Sending telemetry data: {"@type":"pman_error_report","taskId":"XXXXXXXXXXXXXXXXXXXX|XXXXXXXXXXXXXXXXXXXXXXXXXXX","entityId":"XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX|host-4142","parentTaskId":"","errorMessageId":"com.vmware.vcIntegrity.lifecycle.health.common.overall_health_not_green","errorMessage":"vSAN health test 'vMotion: Basic (unicast) connectivity check' reported an issue for cluster 'vSAN_cluster_name'. Check the vSAN health.","errorTime":"2024-03-18T07:31:20.752492Z"}
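To pull the failing health tests out of a long vmware-vum-server.log, a small filter can help. The sketch below is only an illustration under the assumption that the "reported issue" lines follow the format shown above; `failed_health_tests` and `ISSUE_RE` are hypothetical names, not part of any VMware tooling.

```python
import re

# Matches lines such as:
#   ... [host-4142] [vsan] [...] reported issue: vSAN health test '<name>' reported an issue ...
# The line format is assumed from the excerpt above.
ISSUE_RE = re.compile(
    r"\[(host-\d+)\].*?reported issue: vSAN health test '([^']+)'"
)


def failed_health_tests(log_lines):
    """Map each host ID to the vSAN health tests reported as failing for it."""
    issues = {}
    for line in log_lines:
        match = ISSUE_RE.search(line)
        if match:
            host, test = match.groups()
            issues.setdefault(host, []).append(test)
    return issues


# Two lines copied from the excerpt above, plus one line that should be ignored.
prefix = ("2024-03-18T07:31:20.752Z info vmware-vum-server[41337] "
          "[Originator@6876 sub=EHP] [host-4142] ")
provider = "[vsan] [com.vmware.vcIntegrity.lifecycle.health.vsan.cluster_before_exit_mm] "
suffix = " reported an issue for cluster 'vSAN_cluster_name'. Check the vSAN health."
sample = [
    prefix + provider + "reported issue: vSAN health test "
    "'vMotion: Basic (unicast) connectivity check'" + suffix,
    prefix + provider + "reported issue: vSAN health test "
    "'vMotion: MTU check (ping with large packet size)'" + suffix,
    prefix + "All providers have finished. Elapsed time (sec): 81",
]

for host, tests in failed_health_tests(sample).items():
    print(host, tests)
```

In practice the lines would be read from the log file rather than from a hard-coded list.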
However, a few minutes later, manually pinging the host's vMotion VMkernel adapter succeeds.
This happens because the environment relies on APIPA (Automatic Private IP Addressing) for the vMotion VMkernel adapter.
An APIPA address, from the 169.254.0.0/16 link-local range, is auto-assigned when the vMotion VMkernel adapter has neither a static IP address nor a DHCP lease.
If this assignment is slow and completes only after the vSAN health check has run, the host's vMotion VMkernel adapter has no IP address at the time of the check and the pings fail.
As a result, the vLCM remediation fails at the final compliance check.
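To find out which hosts are affected, it helps to check whether each vMotion VMkernel address falls in the APIPA range. The following is a minimal sketch using Python's standard `ipaddress` module; the host names and addresses in `vmotion_addresses` are made-up placeholders, and how the addresses are collected (for example from the vSphere Client) is left out.

```python
import ipaddress


def is_apipa(address):
    """Return True if address is in the IPv4 link-local (APIPA) range 169.254.0.0/16."""
    ip = ipaddress.ip_address(address)
    return ip.version == 4 and ip.is_link_local


# Hypothetical inventory: host name -> vMotion VMkernel IPv4 address.
vmotion_addresses = {
    "esx-01.example.com": "169.254.12.34",
    "esx-02.example.com": "192.168.10.15",
}

for host, address in vmotion_addresses.items():
    kind = "APIPA (auto-assigned)" if is_apipa(address) else "static or DHCP"
    print(f"{host}: {address} -> {kind}")
```

Any host reported as APIPA is a candidate for the timing issue described above.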
To resolve this issue, assign static IP addresses to the vMotion VMkernel adapters on all vSAN hosts.
If APIPA addressing must be retained, temporarily assign static IP addresses until the upgrade or patch activity is complete.
The vLCM remediation task for vSAN hosts fails.