Remediating a vSAN cluster using vLCM (vSphere Lifecycle Manager) fails

Article ID: 317242

Products

VMware vSAN
VMware vSphere ESXi

Issue/Introduction

Symptoms:

  • While patching hosts in a vSAN cluster using vLCM, the first host is patched but then fails the vSAN health check with the errors "vSAN health test 'vMotion: Basic (unicast) connectivity check'" and "vSAN health test 'vMotion: MTU check (ping with large packet size)'".
  • Because of these errors, the patched host does not exit maintenance mode and the remediation task fails.
  • A manual check of vMotion network connectivity succeeds.
  • Manually taking the host out of maintenance mode and performing a vMotion also works.
  • Resuming the remediation results in the same failure on the next host.

 

Environment

VMware vSAN 8.x
VMware vSphere 8.x

Cause

When vLCM remediates the hosts in a vSAN cluster, it performs the following steps:

  1. It selects a host, checks it for compliance, and places it into maintenance mode.

  2. It patches the host.

  3. It reboots the host.

  4. It performs a compliance check on the host again, before the host exits maintenance mode. This check includes a vSAN health check for the cluster, and this is where the remediation fails.

The errors reported are "vSAN health test 'vMotion: Basic (unicast) connectivity check'" and "vSAN health test 'vMotion: MTU check (ping with large packet size)'".

Because of these errors, the host does not exit maintenance mode and the remediation fails.

 

The failing vMotion ping tests can be confirmed in the vSAN health summary log on the vCenter Server, /var/log/vmware/vsan-health/vmware-vsan-health-summary-result.log:

2024-03-18T08:48:34.345Z INFO vsan-mgmt[324367] [VsanHealthSummaryLogUtil::PrintHealthResult opID=650af2da] Cluster vSAN_cluster_name Overall Health : red

  Group network health : red

   Test clusterpartition health : green

   Test vsanvmknic health : green

   Test smallping health : green

   Test largeping health : green

   Test vmotionpingsmall health : red

     OnlyFailedPings: FromHost ToHost ToDevice PacketSize PingResult

             (Host-4139, Host-4142, Vmk1, 64, Red), (Host-4123, Host-4142, Vmk1, 64, Red), (Host-4126, Host-4142, Vmk1, 64, Red), (Host-4147, Host-4142, Vmk1, 64, Red),

             (Host-4127, Host-4142, Vmk1, 64, Red), (Host-4130, Host-4142, Vmk1, 64, Red), (Host-4154, Host-4142, Vmk1, 64, Red),

   Test vmotionpinglarge health : red

     OnlyFailedPings: FromHost ToHost ToDevice PacketSize PingResult

             (Host-4139, Host-4142, Vmk1, 8972, Red), (Host-4123, Host-4142, Vmk1, 8972, Red), (Host-4126, Host-4142, Vmk1, 8972, Red), (Host-4147, Host-4142, Vmk1, 8972, Red),

             (Host-4127, Host-4142, Vmk1, 8972, Red), (Host-4130, Host-4142, Vmk1, 8972, Red), (Host-4154, Host-4142, Vmk1, 8972, Red),

  Group physicaldisks health : green

   Test physdiskoverall health : green

   Test lsomheap health : green

   Test lsomslab health : green

  Group hcl health : green

   Test controlleronhcl health : green

   Test nvmeonhcl health : green

   Test controllerreleasesupport health : green

 

In the output above, every failed vMotion ping has Host-4142 as its destination (ToHost). This is the host that was being remediated.
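On a larger cluster the list of failed pings can be long, so it may help to tally the destination hosts programmatically. The following is a minimal sketch, not a VMware tool: the function name `failed_ping_targets` is hypothetical, and the regular expression assumes the tuple layout shown in the log excerpt above (FromHost, ToHost, ToDevice, PacketSize, PingResult).

```python
import re
from collections import Counter

# Matches one failed-ping tuple such as "(Host-4139, Host-4142, Vmk1, 64, Red)".
# Assumption: the tuple layout is exactly the one shown in the log excerpt.
PING_TUPLE = re.compile(
    r"\((Host-\d+),\s*(Host-\d+),\s*(\w+),\s*(\d+),\s*(\w+)\)"
)


def failed_ping_targets(log_text):
    """Count how often each host appears as the ToHost of a Red ping."""
    targets = Counter()
    for _from_host, to_host, _device, _size, result in PING_TUPLE.findall(log_text):
        if result.lower() == "red":
            targets[to_host] += 1
    return targets


# Tuples copied from the vmotionpingsmall section of the excerpt above.
sample = """
   Test vmotionpingsmall health : red
     OnlyFailedPings: FromHost ToHost ToDevice PacketSize PingResult
             (Host-4139, Host-4142, Vmk1, 64, Red), (Host-4123, Host-4142, Vmk1, 64, Red), (Host-4126, Host-4142, Vmk1, 64, Red), (Host-4147, Host-4142, Vmk1, 64, Red),
             (Host-4127, Host-4142, Vmk1, 64, Red), (Host-4130, Host-4142, Vmk1, 64, Red), (Host-4154, Host-4142, Vmk1, 64, Red),
"""

print(failed_ping_targets(sample).most_common(1))  # [('Host-4142', 7)]
```

A single host dominating the ToHost column, as here, points at that host's vMotion VMkernel adapter rather than at the network between the other hosts.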

 

Below is the error snippet from the vCenter Server log /var/log/vmware/vmware-updatemgr/vum-server/vmware-vum-server.log. It shows the result of the vLCM-initiated vSAN health check, which is performed before the host exits maintenance mode:

2024-03-18T07:31:20.752Z info vmware-vum-server[41337] [Originator@6876 sub=EHP] [host-4142] [vsan] [com.vmware.vcIntegrity.lifecycle.health.vsan.cluster_before_exit_mm] reported issue: vSAN health test 'vMotion: Basic (unicast) connectivity check' reported an issue for cluster 'vSAN_cluster_name'. Check the vSAN health.

2024-03-18T07:31:20.752Z info vmware-vum-server[41337] [Originator@6876 sub=EHP] [host-4142] [vsan] [com.vmware.vcIntegrity.lifecycle.health.vsan.cluster_before_exit_mm] reported issue: vSAN health test 'vMotion: MTU check (ping with large packet size)' reported an issue for cluster 'vSAN_cluster_name'. Check the vSAN health.

2024-03-18T07:31:20.752Z info vmware-vum-server[41337] [Originator@6876 sub=EHP] [host-4142] All providers have finished. Elapsed time (sec): 81

2024-03-18T07:31:20.752Z info vmware-vum-server[41337] [Originator@6876 sub=EHP] [host-4142] [vSAN] [com.vmware.vcIntegrity.lifecycle.health.vsan.cluster_before_exit_mm] returned status: NOT_OK

2024-03-18T07:31:20.752Z info vmware-vum-server[41337] [Originator@6876 sub=EHP] Entity [host-4142] health status for perspective [BEFORE_EXIT_MAINTENANCE] is: NOT_OK for service(s) [vSAN]

2024-03-18T07:31:20.752Z info vmware-vum-server[40595] [Originator@6876 sub=RemediateClusterTask] [HealthCheck 507] CheckHostHealth - check status name = Cluster health before exit maintenance mode -

2024-03-18T07:31:20.752Z info vmware-vum-server[40595] [Originator@6876 sub=RemediateClusterTask] [ApplyHelpers 481] CheckHostHealth - (hostid = host-4142) - (hostName = hostname.domain.com) - (perspective = 3) - (check result status = 3) - (check timeout = 4200)

.

.

.

2024-03-18T07:31:20.752Z info vmware-vum-server[10309] [Originator@6876 sub=Telemetry] [TelemetryManager 423] Sending telemetry data: {"@type":"pman_error_report","taskId":"XXXXXXXXXXXXXXXXXXXX|XXXXXXXXXXXXXXXXXXXXXXXXXXX","entityId":"XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX|host-4142","parentTaskId":"","errorMessageId":"com.vmware.vcIntegrity.lifecycle.health.common.overall_health_not_green","errorMessage":"vSAN health test 'vMotion: Basic (unicast) connectivity check' reported an issue for cluster 'vSAN_cluster_name'. Check the vSAN health.","errorTime":"2024-03-18T07:31:20.752492Z"}
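To pull only the relevant entries out of a large vmware-vum-server.log, the lines can be filtered by the "reported issue" text. The snippet below is a rough sketch under the assumption that the message format matches the excerpt above; the function name `reported_issues` is hypothetical and the two sample lines are copied from that excerpt.

```python
import re

# Assumption: issue lines carry the host ID in square brackets and the test
# name in single quotes, as in the log excerpt above.
ISSUE_LINE = re.compile(
    r"\[(host-\d+)\].*reported issue: vSAN health test '(.+?)' reported an issue"
)


def reported_issues(lines):
    """Return (host_id, health_test_name) pairs for each reported issue."""
    matches = (ISSUE_LINE.search(line) for line in lines)
    return [m.groups() for m in matches if m]


sample_lines = [
    "2024-03-18T07:31:20.752Z info vmware-vum-server[41337] [Originator@6876 sub=EHP] [host-4142] [vsan] [com.vmware.vcIntegrity.lifecycle.health.vsan.cluster_before_exit_mm] reported issue: vSAN health test 'vMotion: Basic (unicast) connectivity check' reported an issue for cluster 'vSAN_cluster_name'. Check the vSAN health.",
    "2024-03-18T07:31:20.752Z info vmware-vum-server[41337] [Originator@6876 sub=EHP] [host-4142] [vsan] [com.vmware.vcIntegrity.lifecycle.health.vsan.cluster_before_exit_mm] reported issue: vSAN health test 'vMotion: MTU check (ping with large packet size)' reported an issue for cluster 'vSAN_cluster_name'. Check the vSAN health.",
    "2024-03-18T07:31:20.752Z info vmware-vum-server[41337] [Originator@6876 sub=EHP] [host-4142] All providers have finished. Elapsed time (sec): 81",
]

for host, test in reported_issues(sample_lines):
    print(host, "->", test)
```

If both vMotion tests appear for the host that was just patched, and no other health test is listed, the symptom matches this article.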

 

However, a few minutes later, manually pinging the host's vMotion VMkernel adapter succeeds.

This is because the environment relies on APIPA (Automatic Private IP Addressing) to assign the IP address of the vMotion VMkernel adapter.

An APIPA address, taken from the link-local range 169.254.0.0/16, is auto-assigned when the vMotion VMkernel adapter has neither a static IP address nor a DHCP-assigned address.

If this assignment completes only after the vSAN health check has run, the host's vMotion VMkernel adapter has no IP address at the time of the check, so the pings fail.

As a result, the vLCM remediation fails at the final compliance check.
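To find out which hosts are exposed to this timing issue, the vMotion VMkernel addresses can be checked against the link-local range. The following is a minimal sketch using only the Python standard library. The host names and addresses in `vmotion_ips` are made-up placeholders, and how the addresses are collected (for example from the vSphere Client or from each host's VMkernel adapter settings) is left open.

```python
import ipaddress

# APIPA / IPv4 link-local range.
APIPA_NET = ipaddress.ip_network("169.254.0.0/16")


def classify_vmotion_ip(addr):
    """Classify a vMotion VMkernel IPv4 address as unassigned, apipa, or other."""
    if not addr or addr == "0.0.0.0":
        return "unassigned"
    ip = ipaddress.ip_address(addr)
    return "apipa" if ip in APIPA_NET else "other"


# Hypothetical inventory: host name -> vMotion VMkernel IPv4 address.
vmotion_ips = {
    "esx01.example.com": "169.254.12.7",
    "esx02.example.com": "192.168.50.12",
    "esx03.example.com": "",
}

for host, addr in vmotion_ips.items():
    print(host, classify_vmotion_ip(addr))
```

Hosts reported as "apipa" or "unassigned" are the ones where the address may not yet be present when the post-reboot health check runs.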

Resolution

To resolve this issue, assign a static IP address to the vMotion VMkernel adapter on every host in the vSAN cluster.

If APIPA addressing must be retained, temporarily assign static IP addresses until the upgrade or patch activity is complete.
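The static address can be set in the vSphere Client or from the ESXi shell. As one possible illustration, the sketch below builds an `esxcli network ip interface ipv4 set` command line for a host. The adapter name `vmk1` is taken from the log excerpt in this article, while the IP address and netmask are placeholders to be replaced with values from the vMotion network plan. The helper name `static_ip_command` is hypothetical, and the command syntax should be confirmed against the esxcli reference for the ESXi version in use before it is run.

```python
import ipaddress
import shlex


def static_ip_command(vmk, ip, netmask):
    """Build an esxcli command that assigns a static IPv4 address to a vmk adapter."""
    addr = ipaddress.ip_address(ip)
    if addr.is_link_local:
        # A 169.254.x.x address would not move the adapter off the APIPA range.
        raise ValueError(f"{ip} is in the APIPA range; choose a different address")
    # Raises ValueError if the netmask is not a valid IPv4 mask.
    ipaddress.ip_network(f"0.0.0.0/{netmask}")
    args = [
        "esxcli", "network", "ip", "interface", "ipv4", "set",
        f"--interface-name={vmk}",
        f"--ipv4={addr}",
        f"--netmask={netmask}",
        "--type=static",
    ]
    return shlex.join(args)


# Placeholder values; not taken from the environment described in this article.
print(static_ip_command("vmk1", "192.168.50.11", "255.255.255.0"))
```

The function only produces the command text, so the result can be reviewed before it is executed on a host.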

 

 

Additional Information

Impact/Risks:

  • The vLCM remediation task for the vSAN hosts fails.

Note: 

  • This only applies if the APIPA-assigned IP is picked by the host after the vLCM-initiated vSAN health check occurs, causing the vMotion ping test to fail.