During node pool customization, worker nodes restart randomly, causing cluster nodes to transition to the 'NotReady' state. This also leaves the node pool stuck indefinitely in the 'Customizing' state.
3.2.0.1
When clusters are deleted, their associated policies are sometimes left behind. If a cluster is re-created with the same name, multiple policies then point to the same workload cluster and node pool but carry different cluster IDs. This can cause problems when the CaaS Spoke pod restarts: because it matches target node pools by cluster and node pool name, it may reconcile the policies inconsistently.
The root cause is that policies are not properly removed when clusters are deleted, leaving "stale" policies behind in the database.
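As an illustration only, duplicates of this kind could be surfaced with a query along the following lines. The column names cluster_name, node_pool_name, and cluster_id are assumptions for illustration and may not match the actual policy_intent schema; the supported way to find stale policies is the detection script in the workaround below.
# Hypothetical check for policies that share a cluster/node pool name but reference different cluster IDs.
# Column names are assumed; verify them against the real policy_intent schema before running.
kubectl exec -n tca-mgr postgres-0 -- psql -d caas_hub -c "select cluster_name, node_pool_name, count(distinct cluster_id) from policy_intent group by cluster_name, node_pool_name having count(distinct cluster_id) > 1;"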
App.log: CaaS.Wait - Expected attributes did not match in cr status. Attributes to match was {"phase":"Provisioned","pause":true} and cr status is {"autoscalerEnabled":false,"conditions":[{"lastTransitionTime":"2025-xx-xxT13:57:01Z","message":"MachineDeployment still provisioning, DesiredReplicas=4 Replicas=5 ReadyReplicas=4 UpdatedReplicas=5","reason":"CAPVResourceNotReady","severity":"Info","status":"False","type":"Ready"}
This is a known issue; a permanent fix is included in TCA 3.4. A patch is underway and will be available in the next GA version.
Workaround:
Identify the Stale Node Policies:
This script queries all policies and all clusters from the database, stores the cluster and policy information locally, and then prints the list of stale policies (a conceptual sketch of this check appears after the note below).
Run the following script on TCA-M:
$ curl -kLO https://vmwaresaas.jfrog.io/artifactory/cnf-generic-local/kb/3.3.0/detect-staled-poilcies
$ bash detect-staled-poilcies
Note: For air-gapped environments, download this script to a machine with access to TCA, then transfer it to the TCA system and execute it locally.
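For context, the detection logic can be thought of as cross-referencing every policy's cluster ID against the clusters that still exist. The sketch below illustrates that idea only; the cluster table name (cluster_info) and the cluster_id column on policy_intent are assumptions, so use the official detect-staled-poilcies script for the actual check.
# Conceptual sketch of the staleness check; table/column names other than policy_intent and uid are assumed.
policies=$(kubectl exec -n tca-mgr postgres-0 -- psql -d caas_hub -At -c "select uid, cluster_id from policy_intent;")
clusters=$(kubectl exec -n tca-mgr postgres-0 -- psql -d caas_hub -At -c "select id from cluster_info;")
while IFS='|' read -r uid cluster_id; do
  # A policy whose cluster ID no longer appears in the cluster table is stale.
  echo "$clusters" | grep -qx "$cluster_id" || echo "stale policy uid: $uid"
done <<< "$policies"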
Clean up Stale Node Policies:
1. SSH into both TCA-M and TCA-CP.
2. Delete the stale policies from the policy_intent table in the caas_hub database on TCA-M and in the caas_spoke database on TCA-CP. Use the following commands, replacing 'replace-with-real-policies-uid' with the actual UID of each stale policy (an optional verification sketch follows these commands):
kubectl exec -n tca-mgr postgres-0 -- psql -d caas_hub -c "delete from policy_intent where uid='replace-with-real-policies-uid';"
kubectl exec -n tca-cp-cn postgres-0 -- psql -d caas_spoke -c "delete from policy_intent where uid='replace-with-real-policies-uid';"
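Optionally, the UID can be confirmed to exist before (and to be gone after) deletion, and several stale policies can be removed in one statement. This is a convenience sketch using the same placeholder UIDs as above:
# Optional sanity check on both databases before deleting (same placeholder as above):
kubectl exec -n tca-mgr postgres-0 -- psql -d caas_hub -c "select uid from policy_intent where uid='replace-with-real-policies-uid';"
kubectl exec -n tca-cp-cn postgres-0 -- psql -d caas_spoke -c "select uid from policy_intent where uid='replace-with-real-policies-uid';"
# If the detection script reported several stale policies, they can be deleted in one statement, e.g.:
kubectl exec -n tca-mgr postgres-0 -- psql -d caas_hub -c "delete from policy_intent where uid in ('uid-1','uid-2');"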
Patch TCA-M services:
1. Verify the TCA version is 24441748 (3.2.0.1) and the current web engine version is 3.2.0-ob-24441215.
cat /opt/vmware/config/cnva_version.properties
…
em.buildNumber=24441748
…
kubectl -n tca-mgr describe deploy/tca-api | grep tag
{"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{"kbld.k14s.io/images":"- origins:\n - resolved:\n tag: 3.2.0-...
tag: 3.2.0-ob-24441215
tag: 3.2.0-ob-24441215
tag: 3.2.0-ob-24441215
2. Switch to sudo mode.
3. Download the patch-changes.tar file: curl -kLO https://vmwaresaas.jfrog.io/artifactory/cnf-generic-local/kb/3.2.0/patch-changes.tar
4. Extract the tar file: tar -xvf patch-changes.tar
5. Navigate into the patch-changes folder: cd patch-changes
6. Execute the patch script: ./patch-tca.sh
7. Wait for all tcxproduct Ready to be True by running: watch kubectl get tcxproduct
8. Double-confirm the patch is applied by checking the tca-api deployment tag (a scripted version of steps 7 and 8 is sketched below, after the expected output):
kubectl -n tca-mgr describe deploy/tca-api | grep tag
{"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{"kbld.k14s.io/images":"- origins:\n - resolved:\n tag: 3.2.0-...
tag: 3.2.0-ob-24441215
tag: 3.2.0-ob-24441215
tag: 3.2.0-ob-24767039
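The checks in steps 7 and 8 can also be scripted. The sketch below assumes the tcxproduct resources expose a Ready condition (as step 7 implies) and simply counts how many image tags on tca-api reference the patched build:
# Wait for all tcxproduct resources to report Ready=True (assumes a Ready condition exists):
kubectl wait tcxproduct --all --for=condition=Ready --timeout=30m
# Confirm the tca-api deployment now references the patched build 3.2.0-ob-24767039:
kubectl -n tca-mgr describe deploy/tca-api | grep -c "tag: 3.2.0-ob-24767039"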
Cleaning up the stale node policies directly addresses the core issue of leftover policies after cluster deletion, preventing future conflicts when clusters are re-created with the same name. Patching the TCA-M services ensures the system includes the fixes needed to keep the issue from recurring and to maintain consistent policy reconciliation.