Unable to disable workload management after disabling DRS on the cluster in vSphere with Tanzu.

Products

VMware vSphere with Tanzu

Issue/Introduction

Disabling workload management (WCP) on a cluster fails if DRS was disabled on that cluster first.
Errors similar to the following appear in /var/log/vmware/wcp/wcpsvc.log:

2022-10-10T13:26.586Z error wcp [workload/vcinvnt.go:151] [opID=CLUSTER_NAME] Failed to find resource pool for workload ServerFaultCode: The object 'vim.ResourcePool:resgroup-1234' has already been deleted or has not been completely created. Err %!v(MISSING)

2022-10-10T13:38:31.581Z error wcp [workload/workload_impl.go:1498] Workload CLUSTER_NAME is already being removed

Entries similar to the following appear in /var/log/vmware/vpxd/vpxd.log:

2022-10-10T13:26.586Z info vpxd[06272] [Originator@6876 sub=Default opID=wcp-6245a1a1-68] [VpxLRO] -- ERROR lro-577005 -- AuthorizationManager -- vim.AuthorizationManager.setEntityPermissions: vmodl.fault.ManagedObjectNotFound:
--> Result:
--> (vmodl.fault.ManagedObjectNotFound) {
--> faultCause = (vmodl.MethodFault) null,
--> faultMessage = <unset>,
--> obj = 'vim.ResourcePool:resgroup-1234'
--> msg = ""
--> }
--> Args:
-->
--> Arg entity:
--> 'vim.ResourcePool:resgroup-1234'

NOTE: The same error can occur when other vSphere objects cannot be found because they were deleted in the wrong order. Check /var/log/vmware/wcp/wcpsvc.log to identify where the failure occurs before running the workaround in this KB.
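
As a starting point for that log check, here is a minimal sketch (not part of the official workaround) that pulls the missing resource pool references out of the log. The function name missing_resource_pools is hypothetical, and the match string is taken from the sample wcpsvc.log line shown earlier, so adjust it if your log wording differs.

```shell
# Hypothetical helper: print each distinct resource pool reference that
# appears on a "Failed to find resource pool" line. Reads log text on stdin.
missing_resource_pools() {
  grep -F 'Failed to find resource pool' \
    | grep -oE 'vim\.ResourcePool:resgroup-[0-9]+' \
    | sort -u
}

# Example usage on the vCenter Server appliance:
#   missing_resource_pools < /var/log/vmware/wcp/wcpsvc.log
```

If the function prints nothing, the failure may involve a different object type, and the full log should be reviewed before proceeding.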

Environment

vCenter Server
VMware vSphere 7.0 with Tanzu

Cause

This failure occurs because the resource pools backing the Namespace objects in the vSphere inventory are provided by DRS. When DRS is disabled, the association to the resource pool can no longer be found, so removal of the Namespace object fails.

Resolution

Caution:

This workaround requires manual database edits. Ensure you have vCenter snapshots or backups before running it.
Also, if NSX-T is in use as the backing network, ensure the WCP service has cleaned up the NSX-T objects BEFORE running this workaround. See the steps below:
- SSH to the vCenter server appliance.
- Run the following command:
  - # python3 /usr/lib/vmware-wcp/nsx_policy_cleanup.py --cluster domain-c11040:035xxxe8-eb20-41xx-a7xx-972068xxxxxx -u <nsx admin user> -p '<nsx mgr admin pass>' --mgr-ip=<nsx mgr ip> --no-warning --top-tier-router-id=domain-c11040:035xxxe8-eb20-41xx-a7xx-972068xxxxxx --all-res -r
Ensure the WCP service is stopped before making any database edits.

Preparation:

Take snapshots of the vCenter. If the vCenter is running in linked mode with other vCenters, take powered-off snapshots of all vCenters so that a consistent PSC replication state is captured.
Use the decrypt script on the vCenter to identify your WCP Supervisor Cluster IDs:

# /usr/lib/vmware-wcp/decryptK8Pwd.py

Example output:

Read key from file

Connected to PSQL

Cluster: domain-c8:4f67k834-5436-7456-b307-467g109j5xxx
IP: <CLUSTER_IP>
PWD: <ROOT_PASSWORD>

Cluster: domain-c20:4f67k834-5436-7456-b307-467g109j5xxx
IP: <CLUSTER_IP>
PWD: <ROOT_PASSWORD>

NOTE: To compare this cluster ID with the vCenter GUI, go to Inventory, select the cluster that is stuck in the disabling state, and check the browser URL. It contains a string similar to: ClusterComputeResource:domain-c8:4f67k834-5436-7456-b307-467g109j5xxx

For the purpose of this example:

The good cluster is: domain-c20:4f67k834-5436-7456-b307-467g109j5xxx
The bad cluster is: domain-c8:4f67k834-5436-7456-b307-467g109j5xxx

Multiple WCP clusters can be deployed per vCenter, so it is critical to identify the cluster ID of the old "bad" cluster before deleting anything from the database.
In this example, the bad cluster ID begins with domain-c8.
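
To make the identification step less error-prone, the following sketch filters the decrypt script output down to the cluster IDs. Both function names are hypothetical, and the sketch assumes the output format matches the example output above (one "Cluster:" line per Supervisor Cluster).

```shell
# Hypothetical helper: read decryptK8Pwd.py output on stdin and print only the
# cluster IDs, dropping the IP and PWD lines so passwords are not echoed again.
list_cluster_ids() {
  sed -n 's/^Cluster:[[:space:]]*//p'
}

# Hypothetical helper: print the managed object ID part (for example domain-c8)
# of a full cluster ID by removing everything from the first colon onward.
cluster_moid() {
  printf '%s\n' "${1%%:*}"
}

# Example usage on the vCenter Server appliance:
#   /usr/lib/vmware-wcp/decryptK8Pwd.py | list_cluster_ids
#   cluster_moid 'domain-c8:4f67k834-5436-7456-b307-467g109j5xxx'
```

The short form printed by cluster_moid is the part to look for in the vCenter GUI URL. The full ID is what the database tables store.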

Procedure:

Stop the WCP service on vCenter
# vmon-cli -k wcp
Completed Stop service request
Connect to vCenter database
# PGPASSFILE=/etc/vmware/wcp/.pgpass psql -U wcpuser -h localhost VCDB
<snip>
VCDB=>
List the clusters in cluster_db_configs. The bad cluster to be removed in this example is domain-c8:4f67k834-5436-7456-b307-467g109j5xxx
VCDB=> select cluster from cluster_db_configs ;
cluster
-------
domain-c8:4f67xxxx-5436-7456-b307-467g10xxxxxx
domain-c20:4f67xxxx-5436-7456-b307-467g10xxxxxx
(2 rows)
List the clusters in workload_configs. The bad cluster to be removed in this example is domain-c8:4f67k834-5436-7456-b307-467g109j5xxx
VCDB=> select cluster from workload_configs ;
cluster
-------
domain-c8:4f67kxxx-5436-7456-b307-467g109j5xxx
domain-c20:4f67kxxx-5436-7456-b307-467g109j5xxx
(2 rows)
Delete the workload_configs row for the bad cluster.
VCDB=> delete from workload_configs where cluster = 'domain-c8:4f67k834-5436-7456-b307-467g109j5xxx';
DELETE 1
Exit the database prompt and return to bash
VCDB=> \q
Start WCP service
# vmon-cli -i wcp
Completed Start service request.
Once the WCP service has started again, navigate to Workload Management in the vCenter and select the cluster to be disabled, then select DEACTIVATE to completely remove WCP from the cluster.
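
Because the delete in the procedure above removes a row by exact cluster ID, a typo can target the wrong cluster or match nothing. The following is a minimal sketch, under the assumption that cluster IDs always have the domain-c<number>:<suffix> shape shown in this KB. The function name build_delete_sql is hypothetical. It only prints the statement for review and does not connect to the database.

```shell
# Hypothetical helper: print the DELETE statement for one cluster ID, or fail
# if the ID contains unexpected characters or does not have the expected shape.
build_delete_sql() {
  id=$1
  # Reject empty input and any character outside letters, digits, ':' and '-'.
  case $id in
    ''|*[!0-9A-Za-z:-]*)
      echo "refusing: unexpected characters in cluster ID" >&2
      return 1 ;;
  esac
  # Require the domain-c<number>:<suffix> shape used in this KB's examples.
  if ! printf '%s\n' "$id" | grep -Eq '^domain-c[0-9]+:[0-9A-Za-z-]+$'; then
    echo "refusing: '$id' does not look like a cluster ID" >&2
    return 1
  fi
  printf "delete from workload_configs where cluster = '%s';\n" "$id"
}

# Example usage:
#   build_delete_sql 'domain-c8:4f67k834-5436-7456-b307-467g109j5xxx'
```

Printing the statement first gives a chance to compare the ID against the select output before pasting it at the VCDB=> prompt.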