ESXi maintenance mode task gets stuck and cannot be cancelled in a cluster with VKS configured



Article ID: 408654


Updated On:

Products

VMware

Issue/Introduction

- Putting an ESXi host into maintenance mode gets stuck, and the task cannot be cancelled

- The ESXi host is in a cluster where VKS is configured

- The only way to cancel the task is to restart the vCenter Server

- The VM tab in the vCenter UI for the ESXi host shows that no VMs are running on it

Environment

vCenter 8.x

TKGs/VKS

Cause

- In an SSH session to the ESXi host, the following command shows that there is an envoy-xxxx PodVM running on the host: vim-cmd vmsvc/getallvms

Vmid    Name
00      envoy-xxxx

- In the ESXi host UI, the above envoy-xxxx PodVM shows as powered off

- In /var/log/vmware/vpxd/vpxd.log, the host already appears to be in maintenance mode, per the error below:

Error:
-->    com.vmware.vapi.std.errors.already_in_desired_state
--> Messages:
-->    vcenter.wcp.node.alreadyindesiredstate<Node identified by host-xxxx is already in state NodeMaintenance.>
-->

- WCP is supposed to delete this PodVM when the host enters maintenance mode, but this does not happen, so the task gets stuck.

Resolution

This stale PodVM must be manually deleted so that the host can enter maintenance mode.

Important: Ensure that a fresh backup or offline snapshot of the vCenter Server Appliance has been created. If the vCenter Server is part of a Linked Mode replication group, backups/offline snapshots need to be created for every member of the Linked Mode group. Do not skip this step.

1. SSH to the affected host that will not enter maintenance mode and log in as root

2. Run the following command to get the ID of the envoy PodVM: vim-cmd vmsvc/getallvms

Make a note of the Vmid for the PodVM

3. Run this command to remove the terminating PodVM, using the Vmid noted above: vim-cmd vmsvc/destroy <vmid>
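Steps 2 and 3 can be sketched as a small shell snippet. This is a sketch only: the sample output below stands in for a live vim-cmd vmsvc/getallvms call, the envoy- name prefix is assumed from the symptom above, and the destroy command is left commented out so nothing is removed by accident.

```shell
# Sample stand-in for live output; on a real host you would instead run:
#   getallvms_output="$(vim-cmd vmsvc/getallvms)"
getallvms_output='Vmid      Name
00    envoy-xxxx'

# Extract the Vmid of the first VM whose name starts with "envoy-"
# (assumes Vmid and Name are the first two columns, as in the output above)
vmid="$(printf '%s\n' "$getallvms_output" | awk '$2 ~ /^envoy-/ {print $1; exit}')"
echo "Found envoy PodVM with Vmid: $vmid"

# On the real host, after verifying the Vmid, destroy the PodVM:
# vim-cmd vmsvc/destroy "$vmid"
```

Always review the getallvms output manually before destroying anything; the pattern match is only a convenience.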
 
4. After the above step, the PodVM should show as orphaned in vCenter and should be gone from the host UI

5. If it is still present at this point, it will need to be deleted from the vCenter database.

6. Open an SSH session to the vCenter Server, log in as root, and enter the shell

7. Stop the vpxd and wcp services:

service-control --stop vpxd

service-control --stop wcp

8. Access the vCenter database:

/opt/vmware/vpostgres/current/bin/psql -U postgres -d VCDB

9. Query the ID of the problem VM (replace <VM-Name> with the VM name) in the VPX_ENTITY table:

   select * from vpx_entity where name like '%<VM-Name>%';

10. Delete the VM's rows from the following tables, in this order (replace #### with the ID found in step 9):

delete from VPX_COMPUTE_RESOURCE_DAS_VM where VM_ID=####;
delete from VPX_COMPUTE_RESOURCE_DRS_VM where VM_ID=####;
delete from VPX_COMPUTE_RESOURCE_ORC_VM where VM_ID=####;
delete from VPX_VM_SGXINFO where VM_ID=####;
delete from VPX_GUEST_DISK where VM_ID=####;
delete from VPX_VM_VIRTUAL_DEVICE where ID=####;
delete from VPX_VM_DS_SPACE where VM_ID=####;
delete from VPX_NON_ORM_VM_CONFIG_INFO where ID=####;
delete from VPX_NORM_VM_FLE_FILE_INFO where VM_ID=####;
delete from VPX_VDEVICE_BACKING_REL where VM_ID=####;
delete from VPX_VIRTUAL_DISK_IOFILTERS where VM_ID=####;
delete from VPX_VM_STATIC_OVERHEAD_MAP where VM_ID=####;
delete from VPX_VM_TEXT where VM_ID=####;
delete from VPX_VM where ID=####;
delete from VPX_ENTITY where ID=####;
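To reduce the chance of a typo deleting the wrong rows, the statements from step 10 can be generated by a small shell helper and wrapped in a transaction, so an error can be rolled back before COMMIT. The gen_vm_delete_sql function name is an illustrative choice, not part of the official procedure; review its output before piping it to psql.

```shell
# Hypothetical helper: emit the step-10 DELETE statements for one VM ID,
# wrapped in BEGIN/COMMIT so a mistake can be rolled back before committing.
gen_vm_delete_sql() {
  vm_id="$1"
  cat <<EOF
BEGIN;
DELETE FROM VPX_COMPUTE_RESOURCE_DAS_VM WHERE VM_ID=${vm_id};
DELETE FROM VPX_COMPUTE_RESOURCE_DRS_VM WHERE VM_ID=${vm_id};
DELETE FROM VPX_COMPUTE_RESOURCE_ORC_VM WHERE VM_ID=${vm_id};
DELETE FROM VPX_VM_SGXINFO WHERE VM_ID=${vm_id};
DELETE FROM VPX_GUEST_DISK WHERE VM_ID=${vm_id};
DELETE FROM VPX_VM_VIRTUAL_DEVICE WHERE ID=${vm_id};
DELETE FROM VPX_VM_DS_SPACE WHERE VM_ID=${vm_id};
DELETE FROM VPX_NON_ORM_VM_CONFIG_INFO WHERE ID=${vm_id};
DELETE FROM VPX_NORM_VM_FLE_FILE_INFO WHERE VM_ID=${vm_id};
DELETE FROM VPX_VDEVICE_BACKING_REL WHERE VM_ID=${vm_id};
DELETE FROM VPX_VIRTUAL_DISK_IOFILTERS WHERE VM_ID=${vm_id};
DELETE FROM VPX_VM_STATIC_OVERHEAD_MAP WHERE VM_ID=${vm_id};
DELETE FROM VPX_VM_TEXT WHERE VM_ID=${vm_id};
DELETE FROM VPX_VM WHERE ID=${vm_id};
DELETE FROM VPX_ENTITY WHERE ID=${vm_id};
COMMIT;
EOF
}

# Print the SQL for review (1234 is a placeholder ID from step 9):
gen_vm_delete_sql 1234
# Once verified, pipe it to psql on the appliance:
#   gen_vm_delete_sql 1234 | /opt/vmware/vpostgres/current/bin/psql -U postgres -d VCDB
```

Running the statements inside one transaction also keeps the tables consistent: either every row for the VM is removed or none are.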

Reference KB: Manually removing a stale VM from the vCenter Server vpostgres database

11. Start the services that were stopped in step 7.

12. Log back into vCenter; the envoy PodVM should be gone, and the host should now enter maintenance mode successfully.