vCenter with SRM experiences vpxd crash
vCenter
/var/log/vmware/vmon
vmon-1.log: YYYY-MM-DDTHH:MM:50.283Z Wa(03) host-1234 <vpxd> Service exited unexpectedly. Crash count 4. Taking configured recovery action.
vmon-3.log:YYYY-MM-DDTHH:MM:57.047Z Wa(03) host-1234 <vpxd> Service exited unexpectedly. Crash count 0. Taking configured recovery action.
vmon-3.log:YYYY-MM-DDTHH:MM:59.566Z Wa(03) host-1234 <vpxd> Service exited unexpectedly. Crash count 1. Taking configured recovery action.
vmon-3.log:YYYY-MM-DDTHH:MM:50.065Z Wa(03) host-1234 <vpxd> Service exited unexpectedly. Crash count 2. Taking configured recovery action.
vmon-3.log:YYYY-MM-DDTHH:MM:18.976Z Wa(03) host-1234 <vpxd> Service exited unexpectedly. Crash count 3. Taking configured recovery action.
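To gauge how often vpxd is being restarted, the vMon entries above can be tallied with a short script. The following is a minimal sketch, not a supported tool: the function names are hypothetical, and the regular expression assumes the exact message wording shown in the excerpts above.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Matches the vMon warning shown above, for example:
# "<vpxd> Service exited unexpectedly. Crash count 4. Taking configured recovery action."
CRASH_RE = re.compile(
    r"<(?P<service>[^>]+)> Service exited unexpectedly\.\s*Crash count (?P<count>\d+)\."
)


def count_crashes(lines: Iterable[str]) -> Counter:
    """Return how many unexpected exits were logged for each service."""
    crashes = Counter()
    for line in lines:
        match = CRASH_RE.search(line)
        if match:
            crashes[match.group("service")] += 1
    return crashes


def highest_crash_count(lines: Iterable[str], service: str = "vpxd") -> Optional[int]:
    """Return the largest 'Crash count' value logged for a service, or None."""
    counts = [
        int(match.group("count"))
        for match in map(CRASH_RE.search, lines)
        if match and match.group("service") == service
    ]
    return max(counts) if counts else None
```

Both functions accept any iterable of log lines, so the lines can come from an open file handle for one of the vmon logs under /var/log/vmware/vmon or from a list collected across several of them.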
A vpxd.core-worker core dump is produced.
Debugging the vpxd.core-worker core dump shows "Memory exceeds hard limit. Panic".
/var/log/vmware/eam/eam_api.log
YYYY-MM-DDTHH:MM:16.097Z | INFO | vlsi | LocalizationFilter.java | 108 | API COMPLETE: ClusterVMAgency(ID:'Agency:########-####-####-####-XXXXXXXX:null').queryRuntime[opId=1181127682, sessionId=5393671A]. Result:
eam.EamObject.RuntimeInfo {
issue = (eam.issue.Issue) [
(eam.issue.cluster.agent.VmNotRemoved) {
time = yyyy-mm-dd hh:mm:ss,158,
description = <unset>,
key = 15,
agency = 'Agency:########-####-####-####-XXXXXXXX:null',
solutionId = 'VSPHERE.LOCAL\vpxd-extension-########-####-####-####-YYYYYYY',
agencyName = 'vCLS',
solutionName = ' ',
agent = 'Agent:########-####-####-####-AAAAAAA:null',
cluster = 'ClusterComputeResource:domain-cnumber:########-####-####-####-ZZZZZZZZ',
vm = 'VirtualMachine:vm-ID:########-####-####-####-ZZZZZZZZ',
},
],
goalState = 'enabled',
entity = 'Agency:########-####-####-####-XXXXXXXX:null',
status = 'red',
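The issue blocks in eam_api.log can be pulled out programmatically to confirm that the vCLS agency is reporting VmNotRemoved. This is a rough sketch that assumes the `key = value,` layout shown in the excerpt above; the function names are hypothetical, and the parser ignores anything outside an `(eam.issue...) { ... }` block.

```python
import re
from typing import Dict, Iterable, List

# Start of an issue block, e.g. "(eam.issue.cluster.agent.VmNotRemoved) {"
ISSUE_START_RE = re.compile(r"\((eam\.issue\.[\w.]+)\)\s*\{")
# A "name = value," field line inside an issue block
FIELD_RE = re.compile(r"^\s*(\w+)\s*=\s*(.*?),?\s*$")


def extract_issues(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Collect each EAM issue block as a dict of its fields plus a 'type' entry."""
    issues: List[Dict[str, str]] = []
    current = None
    for line in lines:
        start = ISSUE_START_RE.search(line)
        if start:
            current = {"type": start.group(1)}
            continue
        if current is None:
            continue  # not inside an issue block
        if line.strip().startswith("}"):
            issues.append(current)
            current = None
            continue
        field = FIELD_RE.match(line)
        if field:
            # Drop the surrounding single quotes used for string values
            current[field.group(1)] = field.group(2).strip("'")
    return issues


def vcls_vm_not_removed(issues: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only VmNotRemoved issues raised by the vCLS agency."""
    return [
        issue
        for issue in issues
        if issue["type"].endswith("VmNotRemoved") and issue.get("agencyName") == "vCLS"
    ]
```

Each returned dict carries the `cluster` and `vm` fields from the log, which identify the cluster and the vCLS VM that EAM failed to remove.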
SRM Appliance
Production VM changes are received from vCenter.
Example: /var/log/vmware/srm/vmware-dr.log
2024-06-26T17:51:36.213Z info vmware-dr[02698] [SRM@6666 sub=Replication opID=9a3efe88] [HandleProductionVmLocationChange]: Start handle Production VM location change for protectd VM protected-vm-vmid. Folder: 'vim.Folder:#####-######-#####-#####-#######:group-vID, Resource pool 'vim.ResourcePool:#####-######-#####-#####-#######:resgroup-01'
There are many placeholder VM changes.
Example: /var/log/vmware/srm/vmware-dr.log
2024-06-26T17:51:35.914Z verbose vmware-dr[02698] [SRM@6666 sub=PlaceholderVmManager] Placeholder VM inventory data has changed: --> vmMoRef: vim.VirtualMachine:#####-######-#####-#####-#######:vm-ID --> ["datastore" => "vim.#####-######-#####-#####-#######:datastore-XXXXX"]
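To judge whether the volume of changes is unusually high, the two kinds of vmware-dr.log entries shown above can be counted per minute. The sketch below assumes each entry starts with an ISO 8601 timestamp as in the examples; the marker strings are copied from those examples, and the function and label names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Leading timestamp truncated to the minute, e.g. "2024-06-26T17:51"
TS_MINUTE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):")

# Label -> substring identifying the log entry (taken from the examples above)
MARKERS = {
    "placeholder_change": "Placeholder VM inventory data has changed",
    "production_location_change": "[HandleProductionVmLocationChange]",
}


def changes_per_minute(lines: Iterable[str]) -> Counter:
    """Count matching entries, keyed by (minute, label)."""
    counts = Counter()
    for line in lines:
        timestamp = TS_MINUTE_RE.match(line)
        if not timestamp:
            continue  # continuation line or unrelated format
        for label, marker in MARKERS.items():
            if marker in line:
                counts[(timestamp.group(1), label)] += 1
    return counts
```

Sorting the result with `counts.most_common()` puts the busiest minutes first, which helps line up bursts of SRM activity with the vpxd crash times in the vMon logs.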
vCenter 8.x
SRM 9.x
A vCLS problem may generate a large number of inventory changes in vCenter, which are then pushed to SRM or HMS. HMS will encounter memory problems when it receives a huge volume of property changes.
vCenter is the server side of the property collector and may consume a large amount of memory while pushing these property changes, leading to a vpxd panic and a vCenter crash.
1. Confirm that the vCenter inventory contains no empty clusters with DRS and HA enabled.
1.1 Turn DRS and HA off on any empty clusters.
1.2 Place the cluster into retreat mode to remove any vCLS VMs. See "Placing the cluster into retreat mode".
2. Upgrade vCenter
3. Restart SRM Appliance
4. Upgrade SRM
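The check in step 1 can be expressed as a simple filter over the cluster inventory. This is an illustrative sketch only: `ClusterSummary` is a hypothetical record that would need to be filled from your own inventory export or API client, and the filter assumes that "empty" means a cluster with zero hosts and that both DRS and HA must be enabled for a cluster to be flagged.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ClusterSummary:
    """Hypothetical per-cluster record built from an inventory export."""
    name: str
    host_count: int
    drs_enabled: bool
    ha_enabled: bool


def empty_clusters_with_drs_and_ha(clusters: Iterable[ClusterSummary]) -> List[str]:
    """Return names of clusters that have no hosts but have DRS and HA enabled."""
    return [
        cluster.name
        for cluster in clusters
        if cluster.host_count == 0 and cluster.drs_enabled and cluster.ha_enabled
    ]
```

Any cluster name returned here is a candidate for steps 1.1 and 1.2. An empty result means step 1 is already satisfied under the assumptions above.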