The event with the error "Failed to customize node pool. See node pool condition for error details" is triggered repeatedly.
The node pool has already been deleted and does not exist in VMware Telco Cloud Automation.
Affected version: 3.1
This issue occurs when a node pool is deleted while a node pool customization task is still running. The backend resource gets deleted, but the customization job remains stuck and never completes.
2024-08-20T04:35:13.760723384Z stdout F 2024-08-20 04:35:13.760 UTC [ClusterAutomationService_SvcThread-4, Ent: HybridityAdmin, Usr: HybridityAdmin, , TxId: 7518eae4-b224-4df2-b660-e2f296a9025a] WARN CaaS- Operation : EventNotification,Intent : xGnL9d3dQSqhrpHF7LA8dg Error adding customization errors: Did not find node policy for "node pool name"
2024-08-20T04:35:13.760735339Z stdout F java.lang.Exception: Did not find node policy for "node pool name"
2024-08-20T04:35:13.760743316Z stdout F at com.vmware.telco.service.clusterautomation.jobs.EventNotifier.fillCustomisationErrorMessage(EventNotifier.java:252)
2024-08-20T04:35:13.760751141Z stdout F at com.vmware.telco.service.clusterautomation.jobs.EventNotifier.doRequestStatusUpdate(EventNotifier.java:175)
2024-08-20T04:35:13.760777478Z stdout F at com.vmware.telco.service.clusterautomation.jobs.EventNotifier.executeState(EventNotifier.java:49)
2024-08-20T04:35:13.760787525Z stdout F at com.vmware.telco.service.clusterautomation.jobs.BaseJob.run(BaseJob.java:67)
The logs also show an invalid (orphaned) intent ID for the node pool:
2024-08-19T06:05:49.605641164Z stdout F 2024-08-19 06:05:49.605 UTC [ClusterAutomationService_SvcThread-73523, Ent: HybridityAdmin, Usr: HybridityAdmin, , TxId: 7518eae4-b224-4df2-b660-e2f296a9025a] INFO c.v.t.s.c.ClusterAutomationService- Updating nodepool having intent id 6bb69855-e74c-4cc3-8e96-ec3c1fcce0ca, condition NodepolicyReady, as Customizing node pool which occured at 2024-08-19T06:06:11.538559274Z
2024-08-19T06:05:49.606354803Z stdout F 2024-08-19 06:05:49.606 UTC [ClusterAutomationService_SvcThread-73523, Ent: HybridityAdmin, Usr: HybridityAdmin, , TxId: 7518eae4-b224-4df2-b660-e2f296a9025a] WARN c.v.t.c.utils.NodePoolStatusDbUtil- Unable to find node pool record with intent id: 6bb69855-e74c-4cc3-8e96-ec3c1fcce0ca
2024-08-19T06:05:49.61206233Z stdout F 2024-08-19 06:05:49.611 UTC [ClusterAutomationService_SvcThread-73521, Ent: HybridityAdmin, Usr: HybridityAdmin, , TxId: 7518eae4-b224-4df2-b660-e2f296a9025a] WARN c.v.t.s.c.ClusterAutomationService- Dropping Condition update for event : {"type":"MESSAGE_BASED_RATE_LIMITABLE_EVENT","eventName":"NodepolicyReady","details":{"reason":"K8S121004","status":"False","message":"Failed to customize node pool","k8sEvent":{"kind":"Event","type":"Error","count":1,"owner":"tca","reason":"CustomizationRequestFailed","source":{"component":"tca-nodepool"},"message":"Failed to customize node pool. See node pool condition for error details","metadata":{"uid":"38cae3e3-edf0-410e-8acc-3815b5c5a88d","name":"abcd-nodepool-CustomiseNodePool.38cae3e3-edf0-410e-8acc-3815b5c5a88d","namespace":"udg"},"objectId":"38cae3e3-edf0-410e-8acc-3815b5c5a88d:CustomizationRequestFailed","clusterId":"abcdcfcc-3861-41b8-8846-a10cb1d054ac","clusterName":"mgmt-cluster","involvedObject":{"name":"nodepoolname","namespace":"udg"}},"severity":"Warning","clusterName":"test","errorMessage":"[NodepolicyIsReconciling] . Node-Policy stage failed. Error: node test-abc-9f422-mzvsk-mmh2n err: vmconfig status is Failed, nodeconfig status is Normal.vmconfig failed: plugin vmReconfigPlugin reconcile failed: reconfigure VM failed with error *types.InvalidArgument. A specified parameter was not correct: spec.deviceChange.device.port.switchUuid\n.node udg-sfmu-9f422-mzvsk-stznw err: vmconfig status is Failed, nodeconfig status is Normal.vmconfig failed: plugin vmReconfigPlugin reconcile failed: reconfigure VM failed with error *types.InvalidArgument. A specified parameter was not correct: spec.deviceChange.device.port.switchUuid\n.. 
Reason: MachineDeployment test\/"node pool name" is not found, skip reconcileError adding customization errors: Did not find node policy for udg:node poll name","nodePoolName":"node pool name","raiseK8sEvent":true,"mgmtClusterName":"mgmt-cluster","k8sEventLocation":"20240529022948707-2b99de15-fd43-44c3-b1b9-e2cd8bad4ec9","clusterObjectType":"nodepool","clusterObjectTypes":"nodepool","needConditionUpdate":true,"fillCustomisationError":true,"conditionUpdateLocation":"20240529022939056-4121057e-1667-4367-ab27-19718ca5fd8b"},"eventCount":71527,"firstTimestamp":"2024-08-19T06:06:11.529197313Z","sequence":143095,"eventSource":"20240529022948707-2b99de15-fd43-44c3-b1b9-e2cd8bad4ec9","intentId":"6bb69855-e74c-4cc3-8e96-ec3c1fcce0ca","previousMessage":"[NodepolicyIsReconciling] . Node-Policy stage failed. Error: node test-nodepoolname-9f422-mzvsk-mmh2n err: vmconfig status is Failed, nodeconfig status is Normal.vmconfig failed: plugin vmReconfigPlugin reconcile failed: reconfigure VM failed with error *types.InvalidArgument. A specified parameter was not correct: spec.deviceChange.device.port.switchUuid\n.node test-nodepoolname-9f422-mzvsk-stznw err: vmconfig status is Failed, nodeconfig status is Normal.vmconfig failed: plugin vmReconfigPlugin reconcile failed: reconfigure VM failed with error *types.InvalidArgument. A specified parameter was not correct: spec.deviceChange.device.port.switchUuid\n.. Reason: MachineDeployment test\/nodepoolname is not found, skip reconcile"}
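The orphaned intent ID can be extracted from the WARN line mechanically. The sketch below parses a sample line in the format shown above; in practice you would pipe the actual cluster automation service log through the same filter (the sample log variable here is only an illustration):

```shell
#!/bin/sh
# Hypothetical sample line, matching the WARN format from the service log above.
log='2024-08-19 06:05:49.606 UTC WARN c.v.t.c.utils.NodePoolStatusDbUtil- Unable to find node pool record with intent id: 6bb69855-e74c-4cc3-8e96-ec3c1fcce0ca'

# Pull out the UUID that follows the fixed "intent id:" prefix.
intent_id=$(printf '%s\n' "$log" | grep -o 'intent id: [0-9a-f-]*' | awk '{print $3}')
echo "$intent_id"
```

Running the same `grep -o 'intent id: [0-9a-f-]*'` filter over the full service log lists every intent ID the service failed to resolve, which helps when more than one node pool was deleted mid-customization.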
The invalid intent ID identified in the logs can be cleaned up manually with the following PostgreSQL queries:
1) SSH to TCA-CP
2) Connect to the postgres pod and open a psql session:
kubectl exec -it postgres-0 -n tca-cp-cn -- /bin/bash
psql -U postgres -d tca
3) Verify that the intent exists with ID 6bb69855-e74c-4cc3-8e96-ec3c1fcce0ca. The output should contain exactly 1 record.
select * from "cluster_intents" where val->>'intentId'='6bb69855-e74c-4cc3-8e96-ec3c1fcce0ca';
4) Delete the orphaned intent
delete from "cluster_intents" where val->>'intentId'='6bb69855-e74c-4cc3-8e96-ec3c1fcce0ca';
5) Validate that the intent is deleted by repeating step 3. The output should now contain 0 records.
This removes the orphaned intent and stops the recurring error event.
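For convenience, steps 2 through 5 can also be run non-interactively from the TCA-CP shell with `psql -c`, reusing the pod name, namespace, and intent ID from the steps above. This is a sketch, not an official tool; substitute the intent ID found in your own logs:

```shell
#!/bin/sh
# Intent ID taken from the log analysis above; replace with your own value.
INTENT_ID="6bb69855-e74c-4cc3-8e96-ec3c1fcce0ca"

# Step 3: verify the orphaned intent exists (expect count = 1).
kubectl exec -it postgres-0 -n tca-cp-cn -- psql -U postgres -d tca \
  -c "select count(*) from \"cluster_intents\" where val->>'intentId'='${INTENT_ID}';"

# Step 4: delete the orphaned intent.
kubectl exec -it postgres-0 -n tca-cp-cn -- psql -U postgres -d tca \
  -c "delete from \"cluster_intents\" where val->>'intentId'='${INTENT_ID}';"

# Step 5: re-run the count to confirm deletion (expect count = 0).
kubectl exec -it postgres-0 -n tca-cp-cn -- psql -U postgres -d tca \
  -c "select count(*) from \"cluster_intents\" where val->>'intentId'='${INTENT_ID}';"
```

Using `count(*)` rather than `select *` keeps the verification output to a single number, which is easier to check when running the commands non-interactively.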