vCLS VM Cluster Agent VM is expected to be removed

Products

VMware vCenter Server

Issue/Introduction

A vCenter Server paired with SRM experiences repeated vpxd crashes.

vCenter

/var/log/vmware/vmon

vmon-1.log:YYYY-MM-DDTHH:MM:50.283Z Wa(03) host-1234 <vpxd> Service exited unexpectedly. Crash count 4. Taking configured recovery action.
vmon-3.log:YYYY-MM-DDTHH:MM:57.047Z Wa(03) host-1234 <vpxd> Service exited unexpectedly. Crash count 0. Taking configured recovery action.
vmon-3.log:YYYY-MM-DDTHH:MM:59.566Z Wa(03) host-1234 <vpxd> Service exited unexpectedly. Crash count 1. Taking configured recovery action.
vmon-3.log:YYYY-MM-DDTHH:MM:50.065Z Wa(03) host-1234 <vpxd> Service exited unexpectedly. Crash count 2. Taking configured recovery action.
vmon-3.log:YYYY-MM-DDTHH:MM:18.976Z Wa(03) host-1234 <vpxd> Service exited unexpectedly. Crash count 3. Taking configured recovery action.
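
To check for the same pattern, the crash counters can be pulled out of the vmon logs with a short script. This is a minimal sketch: the regular expression is built only from the log lines shown above, and the `vmon*.log` file pattern and the function name `vpxd_crash_counts` are assumptions made for illustration.

```python
import glob
import re

# Pattern taken from the vmon log lines shown above.
CRASH_RE = re.compile(r"<vpxd> Service exited unexpectedly\. Crash count (\d+)\.")


def vpxd_crash_counts(lines):
    """Return the crash counter from every vpxd crash line, in input order."""
    counts = []
    for line in lines:
        match = CRASH_RE.search(line)
        if match:
            counts.append(int(match.group(1)))
    return counts


if __name__ == "__main__":
    # Assumption: the rotated logs match vmon*.log in this directory.
    for path in sorted(glob.glob("/var/log/vmware/vmon/vmon*.log")):
        with open(path, errors="replace") as handle:
            counts = vpxd_crash_counts(handle)
        if counts:
            print(f"{path}: {len(counts)} vpxd crash entries, highest count {max(counts)}")
```

A run of increasing crash counts within a short time window matches the symptom described here.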

A vpxd.core-worker core dump is produced.

Debugging the vpxd.core-worker core dump shows "Memory exceeds hard limit. Panic"

/var/log/vmware/eam/eam_api.log

YYYY-MM-DDTHH:MM:16.097Z |  INFO | vlsi | LocalizationFilter.java | 108 | API COMPLETE: ClusterVMAgency(ID:'Agency:########-####-####-####-XXXXXXXX:null').queryRuntime[opId=1181127682, sessionId=5393671A]. Result:
eam.EamObject.RuntimeInfo {
   issue = (eam.issue.Issue) [
      (eam.issue.cluster.agent.VmNotRemoved) {
         time = yyyy-mm-dd hh:mm:ss,158,
         description = <unset>,
         key = 15,
         agency = 'Agency:########-####-####-####-XXXXXXXX:null',
         solutionId = 'VSPHERE.LOCAL\vpxd-extension-########-####-####-####-YYYYYYY',
         agencyName = 'vCLS',
         solutionName = ' ',
         agent = 'Agent:########-####-####-####-AAAAAAA:null',
         cluster = 'ClusterComputeResource:domain-cnumber:########-####-####-####-ZZZZZZZZ',
         vm = 'VirtualMachine:vm-ID:########-####-####-####-ZZZZZZZZ',
      },
   ],
   goalState = 'enabled',
   entity = 'Agency:########-####-####-####-XXXXXXXX:null',
   status = 'red',
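
A small parser can list every `VmNotRemoved` issue in eam_api.log without reading the file by hand. The sketch below assumes the issue blocks are laid out exactly as in the excerpt above (one quoted `name = 'value',` field per line, closed by a line starting with `}`). The function name `find_vm_not_removed` is hypothetical.

```python
import re

# Opening line of the issue block, as it appears in the excerpt above.
ISSUE_START = "(eam.issue.cluster.agent.VmNotRemoved) {"
# Matches quoted fields such as:   agencyName = 'vCLS',
FIELD_RE = re.compile(r"^\s*(\w+) = '(.*)',\s*$")


def find_vm_not_removed(lines):
    """Return one dict of quoted fields per VmNotRemoved issue block."""
    issues = []
    current = None  # fields of the block being read, or None outside a block
    for line in lines:
        stripped = line.strip()
        if current is None:
            if stripped.endswith(ISSUE_START):
                current = {}
        elif stripped.startswith("}"):
            issues.append(current)
            current = None
        else:
            match = FIELD_RE.match(line)
            if match:
                current[match.group(1)] = match.group(2)
    return issues


if __name__ == "__main__":
    with open("/var/log/vmware/eam/eam_api.log", errors="replace") as handle:
        for issue in find_vm_not_removed(handle):
            print(issue.get("agencyName"), issue.get("cluster"), issue.get("vm"))
```

Unquoted fields such as `time` and `key` are skipped on purpose, because only the agency, cluster, and VM references are needed to identify the affected vCLS VM.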

SRM Appliance

SRM receives production VM changes from vCenter

Example: /var/log/vmware/srm/vmware-dr.log

2024-06-26T17:51:36.213Z info vmware-dr[02698] [SRM@6666 sub=Replication opID=9a3efe88] [HandleProductionVmLocationChange]: Start handle Production VM location change for protectd VM protected-vm-vmid. Folder: 'vim.Folder:#####-######-#####-#####-#######:group-vID, Resource pool 'vim.ResourcePool:#####-######-#####-#####-#######:resgroup-01'

A large number of placeholder VM changes are logged

Example: /var/log/vmware/srm/vmware-dr.log

2024-06-26T17:51:35.914Z verbose vmware-dr[02698] [SRM@6666 sub=PlaceholderVmManager] Placeholder VM inventory data has changed: -->  vmMoRef: vim.VirtualMachine:#####-######-#####-#####-#######:vm-ID --> ["datastore" => "vim.#####-######-#####-#####-#######:datastore-XXXXX"]
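
To gauge how many placeholder VM changes SRM is receiving, the matching log lines can be counted per minute. This is a rough sketch that assumes every line starts with an ISO timestamp as in the examples above. The function name `placeholder_changes_per_minute` is hypothetical, and this article does not define a threshold for what counts as "many".

```python
from collections import Counter

# Message text taken from the vmware-dr.log example above.
MARKER = "Placeholder VM inventory data has changed"


def placeholder_changes_per_minute(lines):
    """Count placeholder VM change messages, keyed by timestamp minute."""
    per_minute = Counter()
    for line in lines:
        if MARKER in line:
            # The first 16 characters are "YYYY-MM-DDTHH:MM".
            per_minute[line[:16]] += 1
    return per_minute


if __name__ == "__main__":
    with open("/var/log/vmware/srm/vmware-dr.log", errors="replace") as handle:
        counts = placeholder_changes_per_minute(handle)
    for minute, count in sorted(counts.items()):
        print(minute, count)
```

A sustained high count across many consecutive minutes is consistent with the flood of inventory changes described in the Cause section.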

Environment

vCenter 8.x
SRM 9.x

Cause

A vCLS problem can generate a large number of inventory changes in vCenter, which are then pushed to SRM or HMS. HMS can run into memory problems when it has to handle a huge volume of property changes.
vCenter is the server side of the property collector and can consume a large amount of memory while pushing these property changes, which leads to the vpxd panic and the vCenter crash.

Resolution

1. Confirm that there are no empty clusters in the vCenter inventory with DRS and HA enabled.
1.1 Turn DRS and HA off on any empty clusters.
1.2 Place the cluster into retreat mode to remove any vCLS VMs (see "Placing the cluster into retreat mode").

2. Upgrade vCenter
3. Restart the SRM Appliance
4. Upgrade SRM
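
The check in step 1 can be sketched as a filter over cluster objects. The example assumes objects shaped like the vSphere API `ClusterComputeResource` type (`name`, `host`, `configurationEx.drsConfig.enabled`, `configurationEx.dasConfig.enabled`), for example as returned by pyVmomi. Connecting to vCenter and retrieving the clusters is left out. The helper `fake_cluster` is a hypothetical stand-in used only to demonstrate the filter, and flagging clusters with either DRS or HA enabled (rather than both) is a deliberate widening of the check.

```python
from types import SimpleNamespace


def empty_clusters_with_drs_or_ha(clusters):
    """Return (name, drs_enabled, ha_enabled) for clusters with no hosts
    that still have DRS or HA turned on."""
    flagged = []
    for cluster in clusters:
        config = cluster.configurationEx
        drs_on = bool(config.drsConfig.enabled)
        ha_on = bool(config.dasConfig.enabled)  # "das" is the API name for HA
        if not cluster.host and (drs_on or ha_on):
            flagged.append((cluster.name, drs_on, ha_on))
    return flagged


def fake_cluster(name, hosts, drs, ha):
    """Hypothetical stand-in for a ClusterComputeResource, for illustration."""
    return SimpleNamespace(
        name=name,
        host=hosts,
        configurationEx=SimpleNamespace(
            drsConfig=SimpleNamespace(enabled=drs),
            dasConfig=SimpleNamespace(enabled=ha),
        ),
    )


if __name__ == "__main__":
    inventory = [
        fake_cluster("prod", ["host-1", "host-2"], True, True),
        fake_cluster("empty-lab", [], True, False),
    ]
    for name, drs_on, ha_on in empty_clusters_with_drs_or_ha(inventory):
        print(f"{name}: DRS={drs_on} HA={ha_on}")
```

Any cluster this reports is a candidate for steps 1.1 and 1.2.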