During node pool customization, worker nodes restart randomly, causing cluster nodes to transition to the 'NotReady' state. This also leaves the node pool stuck indefinitely in the 'Customizing' state.
3.2.0.1
When clusters are deleted, their associated policies are sometimes left behind. If a cluster is re-created with the same name, multiple policies then point to the same workload cluster and node pool but carry different cluster IDs. This can cause problems when the CaaS Spoke pod restarts: because it matches target node pools by cluster and node pool name, it may reconcile the policies inconsistently.
The root cause is that policies are not properly removed when clusters are deleted, leaving "stale" policies behind in the database.
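As an illustration only, duplicates of this kind could be surfaced with a query along the following lines. The column names cluster_name, node_pool_name, and cluster_id are assumptions for illustration and may not match the actual policy_intent schema; the supported way to find stale policies is the detection script in the workaround below.
# Hypothetical check for policies that share a cluster/node pool name but reference different cluster IDs.
# Column names are assumed; verify them against the real policy_intent schema before running.
kubectl exec -n tca-mgr postgres-0 -- psql -d caas_hub -c "select cluster_name, node_pool_name, count(distinct cluster_id) from policy_intent group by cluster_name, node_pool_name having count(distinct cluster_id) > 1;"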
App.log: CaaS.Wait - Expected attributes did not match in cr status. Attributes to match was {"phase":"Provisioned","pause":true} and cr status is {"autoscalerEnabled":false,"conditions":[{"lastTransitionTime":"2025-xx-xxT13:57:01Z","message":"MachineDeployment still provisioning, DesiredReplicas=4 Replicas=5 ReadyReplicas=4 UpdatedReplicas=5","reason":"CAPVResourceNotReady","severity":"Info","status":"False","type":"Ready"}
This is a known issue; a permanent fix is included in TCA 3.4. A patch is underway and will be available in the next GA version.
Workaround:
Identify the Stale Node Policies:
This script queries all policies and all clusters from the database, stores the cluster and policy information locally, and then prints the list of stale policies (a conceptual sketch of this check appears after the note below).
Run the following script on TCA-M:
$ curl -kLO https://vmwaresaas.jfrog.io/artifactory/cnf-generic-local/kb/3.3.0/detect-staled-poilcies
$ bash detect-staled-poilcies
Note: For air-gapped environments, download this script to a machine with access to TCA, then transfer it to the TCA system and execute it locally.
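For context, the detection logic can be thought of as cross-referencing every policy's cluster ID against the clusters that still exist. The sketch below illustrates that idea only; the cluster table name (cluster_info) and the cluster_id column on policy_intent are assumptions, so use the official detect-staled-poilcies script for the actual check.
# Conceptual sketch of the staleness check; table/column names other than policy_intent and uid are assumed.
policies=$(kubectl exec -n tca-mgr postgres-0 -- psql -d caas_hub -At -c "select uid, cluster_id from policy_intent;")
clusters=$(kubectl exec -n tca-mgr postgres-0 -- psql -d caas_hub -At -c "select id from cluster_info;")
while IFS='|' read -r uid cluster_id; do
  # A policy whose cluster ID no longer appears in the cluster table is stale.
  echo "$clusters" | grep -qx "$cluster_id" || echo "stale policy uid: $uid"
done <<< "$policies"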
Clean up Stale Node Policies:
1. SSH into both TCA-M and TCA-CP.
2. Delete the stale policies from the policy_intent table in the caas_hub database on TCA-M and in the caas_spoke database on TCA-CP. Use the following commands, replacing 'replace-with-real-policies-uid' with the actual UID of each stale policy (an optional verification sketch follows these commands):
kubectl exec -n tca-mgr postgres-0 -- psql -d caas_hub -c "delete from policy_intent where uid='replace-with-real-policies-uid';"
kubectl exec -n tca-cp-cn postgres-0 -- psql -d caas_spoke -c "delete from policy_intent where uid='replace-with-real-policies-uid';"
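Optionally, the UID can be confirmed to exist before (and to be gone after) deletion, and several stale policies can be removed in one statement. This is a convenience sketch using the same placeholder UIDs as above:
# Optional sanity check on both databases before deleting (same placeholder as above):
kubectl exec -n tca-mgr postgres-0 -- psql -d caas_hub -c "select uid from policy_intent where uid='replace-with-real-policies-uid';"
kubectl exec -n tca-cp-cn postgres-0 -- psql -d caas_spoke -c "select uid from policy_intent where uid='replace-with-real-policies-uid';"
# If the detection script reported several stale policies, they can be deleted in one statement, e.g.:
kubectl exec -n tca-mgr postgres-0 -- psql -d caas_hub -c "delete from policy_intent where uid in ('uid-1','uid-2');"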
Patch TCA-M services:
1. Verify the TCA version is 24441748 (3.2.0.1) and the current web engine version is 3.2.0-ob-24441215.
cat /opt/vmware/config/cnva_version.properties
…
em.buildNumber=24441748
…
kubectl -n tca-mgr describe deploy/tca-api | grep tag
{"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{"kbld.k14s.io/images":"- origins:\n - resolved:\n tag: 3.2.0-...
tag: 3.2.0-ob-24441215
tag: 3.2.0-ob-24441215
tag: 3.2.0-ob-24441215
2. Switch to sudo mode.
3. Download the patch-changes.tar file: curl -kLO https://vmwaresaas.jfrog.io/artifactory/cnf-generic-local/kb/3.2.0/patch-changes.tar
4. Extract the tar file: tar -xvf patch-changes.tar
5. Navigate into the patch-changes folder: cd patch-changes
6. Execute the patch script: ./patch-tca.sh
7. Wait for all tcxproduct Ready to be True by running: watch kubectl get tcxproduct
8. Double-confirm the patch is applied by checking the tca-api deployment tag (a scripted version of steps 7 and 8 is sketched below, after the expected output):
kubectl -n tca-mgr describe deploy/tca-api | grep tag
{"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{"kbld.k14s.io/images":"- origins:\n - resolved:\n tag: 3.2.0-...
tag: 3.2.0-ob-24441215
tag: 3.2.0-ob-24441215
tag: 3.2.0-ob-24767039
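The checks in steps 7 and 8 can also be scripted. The sketch below assumes the tcxproduct resources expose a Ready condition (as step 7 implies) and simply counts how many image tags on tca-api reference the patched build:
# Wait for all tcxproduct resources to report Ready=True (assumes a Ready condition exists):
kubectl wait tcxproduct --all --for=condition=Ready --timeout=30m
# Confirm the tca-api deployment now references the patched build 3.2.0-ob-24767039:
kubectl -n tca-mgr describe deploy/tca-api | grep -c "tag: 3.2.0-ob-24767039"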
Cleaning up the stale node policies directly addresses the core issue of leftover policies after cluster deletion, preventing future conflicts when clusters are re-created with the same name. Patching the TCA-M services ensures the system includes the fixes needed to keep the issue from recurring and to maintain consistent policy reconciliation.