How to Remove Excess TKGm-Generated Cluster Modules from vCenter to Resolve 'failed to verify cluster module for object' in CAPV Logs

Article ID: 313091

Products

VMware Tanzu Kubernetes Grid

Issue/Introduction

Symptoms:
A TKGm cluster generates two cluster modules on vSphere: a KubeadmControlPlanes module and a MachineDeployment module. Each cluster module accounts for 36 bytes of the vAPI response that lists cluster modules on vCenter. In some cases, additional cluster modules are created on vSphere even though nothing uses them, which makes the vAPI response significantly larger. If the response size exceeds the 7 MB limit (7000000 bytes), CAPV can no longer retrieve cluster modules from vCenter and the CAPV pod may crash. When this happens, the CAPV logs show the following error:
 
E0719 08:57:47.271292       1 clustermodule_reconciler.go:93] capv-controller-manager/vspherecluster-controller/rms/test01-tkg "msg"="failed to verify cluster module for object" "error"="GET https://endpoint.test/rest/vcenter/cluster/modules: 500 Internal Server Error"  "moduleUUID"="6114a41f-3451-79f2-77b8-24f7031676fl" "name"="test01-tkg-control-plane"
 
Additionally, you may see the following error message in the vCenter endpoint logs when the response size limit is exceeded:
 
2023-07-19T08:57:47.266Z | ERROR | vAPI-I/O dispatcher-1     | SessionFacade                  | Unexpected error occurred while executing the call with session [email protected] (internal id 82w43c20-7351-44f1-9974-3740ff89v283, token 9a85t...) for method com.vmware.vcenter.cluster.modules.list with uuid 50e1s528-1019-49gf-8llb-fe4k93bdk935.
com.vmware.vapi.endpoint.common.UnacceptableResponseException: Response size 106942232b is greater than allowed 7000000b
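
As a rough illustration of the numbers above, the following sketch converts a response size into an approximate module count. It assumes the 36 bytes per module quoted in this article holds exactly, which is a simplification; `estimate_modules` is a hypothetical helper name, not part of any VMware tool.

```shell
#!/usr/bin/env bash
# Rough estimate only: assumes 36 bytes of response per cluster module,
# the figure quoted in this article.
BYTES_PER_MODULE=36

# estimate_modules RESPONSE_BYTES
# Prints the approximate number of modules behind a response of that size.
estimate_modules() {
  echo $(( $1 / BYTES_PER_MODULE ))
}

estimate_modules 7000000     # allowed size from the log above, prints 194444
estimate_modules 106942232   # reported size from the log above, prints 2970617
```

Under that assumption, the limit is reached at just under 200,000 modules, and the response in the example log corresponds to close to 3 million modules.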
 

To get the number of expected modules, we can run:

kubectl get vspherecluster -A -o json | jq -r '.items[].spec.clusterModules[].moduleUUID' | wc -l
 

And compare that number against the number of modules existing on vCenter:

govc cluster.module.ls | wc -l
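
The comparison of the two numbers can be sketched as a small helper. This is only an illustration: `module_excess` is a hypothetical function name, and the commented usage assumes `kubectl`, `jq` and `govc` are already configured for the management cluster and vCenter.

```shell
# module_excess EXPECTED ACTUAL
# Prints how many modules on vCenter exceed the expected count (0 if none).
module_excess() {
  expected=$1
  actual=$2
  if [ "$actual" -gt "$expected" ]; then
    echo $(( actual - expected ))
  else
    echo 0
  fi
}

# Example usage against a live environment (not executed here):
#   expected=$(kubectl get vspherecluster -A -o json \
#     | jq -r '.items[].spec.clusterModules[].moduleUUID' | wc -l)
#   actual=$(govc cluster.module.ls | wc -l)
#   module_excess "$expected" "$actual"
```

A result well above 0 suggests that excess cluster modules exist, provided no other system creates cluster modules on the same vCenter.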

 


Resolution

A permanent resolution for this issue is still under investigation.

Workaround:

Note: Before doing this, check that no other systems create or use cluster modules on this vCenter. Otherwise, modules that belong to those systems will be deleted as well.

Note: If we delete a module which was used by CAPV, a new reconciliation will recreate it.

Note: If there are multiple management clusters within the same vCenter, you will need to follow the steps below for each additional management cluster. For each subsequent cluster, use "filtered.txt" in place of "govc-modules.txt". Continue this process until the modules referenced by every management cluster have been removed from "filtered.txt".

Procedure to remove excess cluster modules

First generate a complete list of cluster modules using the below command:

govc cluster.module.ls > govc-modules.txt

 

We can get the list of clustermodules from the management cluster by using the following command (note: this uses the commands `kubectl` and `jq`):

kubectl get vspherecluster -A -o json | jq -r '.items[].spec.clusterModules[].moduleUUID'
 

Now, to filter out the modules that the management cluster still references, we can run the command below. If we are repeating this for a second management cluster, use the previous "filtered.txt" as the input in place of "govc-modules.txt", and rename it first (for example, to "filtered-1.txt") so that the command does not overwrite its own input file:

(kubectl get vspherecluster -A -o json | jq '.items[].spec.clusterModules[].moduleUUID' -r; cat govc-modules.txt | awk '{print $NF}') | sort | uniq -c | grep -E '^ +1 ' | awk '{print $NF}' > filtered.txt
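
To illustrate what this filter does, here is a minimal sketch that applies the same `sort | uniq -c` idea to made-up data. The IDs `aaa` to `ddd`, the file names and the helper `find_unreferenced` are all hypothetical; real module IDs come from the `kubectl` and `govc` commands above.

```shell
# find_unreferenced REFERENCED_FILE ALL_FILE
# Prints the IDs that appear exactly once across both lists.
find_unreferenced() {
  cat "$1" "$2" | awk '{print $NF}' | sort | uniq -c | awk '$1 == 1 {print $2}'
}

# Hypothetical sample data, not real module IDs.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
printf '%s\n' aaa bbb > "$tmpdir/referenced.txt"    # IDs referenced by CAPV
printf '%s\n' aaa bbb ccc ddd > "$tmpdir/all.txt"   # IDs listed by vCenter

find_unreferenced "$tmpdir/referenced.txt" "$tmpdir/all.txt"   # prints ccc and ddd
```

An ID that is present in both lists is counted twice and dropped, so only IDs seen once remain. Note that an ID that appears only in the referenced list would also be counted once, so it may be worth reviewing "filtered.txt" before deleting anything.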

 

And lastly we should execute `govc cluster.module.rm` for every entry in that list. Depending on the version of the govc CLI that you have, please use the appropriate method:

For govc version 0.31.0 or higher, please use the below command:

govc cluster.module.rm - < filtered.txt


For govc versions lower than 0.31.0, please use the below command (this could take a long time, depending on the number of excess modules that exist on vCenter):

while read -r ID; do echo "Deleting $ID"; govc cluster.module.rm "$ID"; done < filtered.txt
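
Before running either deletion command, it can help to preview what would be removed. The following is a minimal dry-run sketch; `preview_removals` is a hypothetical helper that only prints the commands and never calls `govc`.

```shell
# preview_removals FILE
# Prints the govc command that would run for each non-empty line in FILE,
# without executing anything.
preview_removals() {
  while read -r id; do
    [ -n "$id" ] || continue   # skip blank lines
    printf 'govc cluster.module.rm %s\n' "$id"
  done < "$1"
}

# Example usage (assumes filtered.txt from the previous step exists):
#   preview_removals filtered.txt | head
#   preview_removals filtered.txt | wc -l
```

If the number of previewed commands roughly matches the excess found earlier, the list is probably what we expect.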

 

Some time after the cleanup, we should check whether the number of cluster modules is still increasing, and compare it against the expected number.

To get the number of expected modules, we could run:

kubectl get vspherecluster -A -o json | jq -r '.items[].spec.clusterModules[].moduleUUID' | wc -l


And compare that number against the number of modules existing on vCenter:

govc cluster.module.ls | wc -l
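
To tell whether the count keeps growing, one option is to record it periodically and look at the difference between the first and last sample. This is only a sketch: `record_count`, `growth` and the log file name are hypothetical, and how often to sample is left open.

```shell
# record_count LOG COUNT
# Appends a UTC timestamp and a module count to LOG.
record_count() {
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$2" >> "$1"
}

# growth LOG
# Prints the last recorded count minus the first recorded count.
growth() {
  awk 'NR == 1 {first = $2} {last = $2} END {if (NR > 0) print last - first}' "$1"
}

# Example usage (for instance from a scheduled job; not executed here):
#   record_count modules.log "$(govc cluster.module.ls | wc -l)"
#   growth modules.log
```

A growth value that stays at 0 while no clusters are created or deleted suggests that no new excess modules are being generated.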

 


Additional Information

Impact/Risks:
When this issue occurs, you will see many 500 errors. After some time, VPXD might crash, which is indicated by "503 Service Unavailable" in the CAPV logs:
 
E0719 09:02:38.748461       1 controller.go:317] controller/vspherecluster "msg"="Reconciler error" "error"="unexpected error while probing vcenter for infrastructure.cluster.x-k8s.io/v1beta1, Kind=VSphereCluster test/test01-tkg: POST \"/sdk\": 503 Service Unavailable" "name"="test01-tkg" "namespace"="test" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="VSphereCluster"