Workload cluster rehoming operation stuck due to failed destination management cluster

Article ID: 409633

Products

VMware Telco Cloud Automation, VMware Telco Cloud Platform

Issue/Introduction

  • When attempting to rehome (move) a Workload Cluster from a Source Management Cluster to a Destination Management Cluster (e.g., during an upgrade procedure), the operation becomes stuck in a "Pending" state.
  • The re-homing task does not complete and hangs indefinitely.

  • The first error observed may indicate that the vcPrime Edit operation is failing.

  • The Destination Management Cluster shows signs of instability, such as:

    • Inconsistent VM states in the vCenter UI.

    • Broken "Cluster Diagnosis" checks in the TCA UI.

    • Critical pods (such as kube-vip) are down or unreachable in the destination cluster.

Environment

TCA: 3.2

TCP: 5.0

Cause

  • This issue occurs when the Destination Management Cluster is in an irrecoverable or broken state during the move operation.
  • For a re-homing operation to succeed, the destination cluster must be fully healthy so that it can accept ownership of the Workload Cluster resources.
  • If the destination cluster fails (e.g., CAPI/CAPV services are down) after the move has been initiated but before it completes, the Workload Cluster record in the TCA database becomes "stuck" pointing to a non-functional destination.
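One way to spot the kind of destination-cluster failure described above is to list pods that are not healthy. The following is a minimal sketch, not part of the TCA tooling: the function name unhealthy_pods is hypothetical, and it assumes the usual column layout of `kubectl get pods -A --no-headers` (namespace, name, ready, status, and so on).

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get pods -A --no-headers` output on stdin
# and prints "namespace/name status (ready x/y)" for every pod that is either
# not in the Running state or is Running with fewer ready containers than expected.
unhealthy_pods() {
  awk '
    NF < 4 { next }               # skip blank or truncated lines
    $4 == "Completed" { next }    # finished jobs are not a problem
    {
      split($3, ready, "/")       # READY column, for example 0/1
      if ($4 != "Running" || ready[1] != ready[2])
        print $1 "/" $2, $4, "(ready " $3 ")"
    }
  '
}
```

A possible invocation, with kubectl pointed at the Destination Management Cluster, is `kubectl get pods -A --no-headers | unhealthy_pods`. If the API server itself is unreachable (for example because kube-vip is down), kubectl fails before the filter receives any input, which is itself a sign of the broken state.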

Resolution

Manually revert the cluster association in the TCA Manager database:

  1. Create a backup of the TCA Manager appliance. See "Backing Up VMware Telco Cloud Automation Control Plane".
  2. Prepare the Environment:

    • Verify the health of the Source Management Cluster and ensure its CAPI/CAPV resources are running.

    • Identify the exact names of your clusters:

      • <WORKLOAD_CLUSTER_NAME>: The name of the cluster being moved.

      • <SOURCE_MC_NAME>: The name of the original (healthy) Management Cluster.

  3. Access the TCA Database: SSH into the TCA Manager appliance as the admin user.

  4. Connect to the PostgreSQL database pod:

    kubectl exec -it postgres-0 -n tca-mgr -- psql -d tca -U tca_admin -h localhost

  5. Revert NodePool Relationships: Check whether the NodePools are mapped to the broken destination cluster and, if so, revert them to the source.

    1. Check current mapping:

      select val->'metadata'->>'mgmtClusterName' from "K8sClusterDetails" where val->>'rowType'='nodePool' and val->'metadata'->>'clusterName'='<WORKLOAD_CLUSTER_NAME>';

      select val->'metadata'->>'mgmtClusterName' from "K8sClusterNodeConfiguration" where val->>'rowType'='nodePool' and val->'metadata'->>'clusterName'='<WORKLOAD_CLUSTER_NAME>';

    2. Update mapping (run only if the output above shows the Destination Management Cluster):

      UPDATE public."K8sClusterDetails" SET val = jsonb_set(val, '{metadata, mgmtClusterName}', '"<SOURCE_MC_NAME>"') where val->>'rowType'='nodePool' and val->'metadata'->>'clusterName'='<WORKLOAD_CLUSTER_NAME>';

      UPDATE public."K8sClusterNodeConfiguration" SET val = jsonb_set(val, '{metadata, mgmtClusterName}', '"<SOURCE_MC_NAME>"') where val->>'rowType'='nodePool' and val->'metadata'->>'clusterName'='<WORKLOAD_CLUSTER_NAME>';

  6. Revert Addon Relationships: Check and revert the management cluster association for Addons.

    1. Check current mapping:

      select val->'metadata'->>'mgmtClusterName' from "K8sClusterDetails" where val->>'rowType'='addOn' and val->'metadata'->>'clusterName'='<WORKLOAD_CLUSTER_NAME>';

    2. Update mapping (run only if the output above shows the Destination Management Cluster):

      UPDATE public."K8sClusterDetails" SET val = jsonb_set(val, '{metadata, mgmtClusterName}', '"<SOURCE_MC_NAME>"') where val->>'rowType'='addOn' and val->'metadata'->>'clusterName'='<WORKLOAD_CLUSTER_NAME>';

  7. Revert Cluster Relationship: Check and revert the management cluster association of the main cluster object.

    1. Check current mapping:

      select val->'metadata'->>'mgmtClusterName' from "K8sClusterDetails" where val->>'rowType'='cluster' and val->'metadata'->>'name'='<WORKLOAD_CLUSTER_NAME>';

    2. Update mapping (run only if the output above shows the Destination Management Cluster):

      UPDATE public."K8sClusterDetails" SET val = jsonb_set(val, '{metadata, mgmtClusterName}', '"<SOURCE_MC_NAME>"') where val->>'rowType'='cluster' and val->'metadata'->>'name'='<WORKLOAD_CLUSTER_NAME>';

  8. Clear Re-homing Error Flags: Remove the status flags that indicate a stuck pivot operation.

    UPDATE public."K8sClusterDetails" SET val = val #- '{status, hasPivotError}' where val->>'rowType'='cluster' and val->'metadata'->>'name'='<WORKLOAD_CLUSTER_NAME>';

    UPDATE public."K8sClusterDetails" SET val = val #- '{status, pivotAccepted}' where val->>'rowType'='cluster' and val->'metadata'->>'name'='<WORKLOAD_CLUSTER_NAME>';

  9. Sync State via UI

    1. Log out of the database and the SSH session.

    2. Log in to the TCA Manager UI.

    3. Locate the Workload Cluster (it should now reflect the Source Management Cluster).

    4. Trigger an Edit operation on the Workload Cluster.

    5. Do not make any changes; simply click Next through the wizard until you reach Apply.

    6. Click Apply. This forces the TCA Manager to refresh the database state and reconcile the cluster status.
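The UPDATE statements from steps 5 to 8 can also be generated as one script and run in a single PostgreSQL transaction, so that either all rows are reverted or none are. The following is a minimal sketch under stated assumptions: the function name revert_sql is hypothetical and not part of TCA, and the check queries from steps 5 to 7 are still expected to be run first to confirm that the rows point at the destination cluster.

```shell
#!/bin/sh
# Hypothetical helper (not part of TCA): prints the UPDATE statements from
# steps 5 to 8 wrapped in BEGIN/COMMIT, with the two names substituted.
# Usage: revert_sql <WORKLOAD_CLUSTER_NAME> <SOURCE_MC_NAME>
revert_sql() {
  cluster=${1:-}
  source_mc=${2:-}
  # Refuse empty names and any character that could break out of a SQL literal.
  for n in "$cluster" "$source_mc"; do
    case $n in
      ''|*[!A-Za-z0-9._-]*) echo "invalid name: '$n'" >&2; return 1 ;;
    esac
  done
  cat <<EOF
BEGIN;
-- Step 5: NodePool rows
UPDATE public."K8sClusterDetails" SET val = jsonb_set(val, '{metadata, mgmtClusterName}', '"$source_mc"') WHERE val->>'rowType'='nodePool' AND val->'metadata'->>'clusterName'='$cluster';
UPDATE public."K8sClusterNodeConfiguration" SET val = jsonb_set(val, '{metadata, mgmtClusterName}', '"$source_mc"') WHERE val->>'rowType'='nodePool' AND val->'metadata'->>'clusterName'='$cluster';
-- Step 6: Addon rows
UPDATE public."K8sClusterDetails" SET val = jsonb_set(val, '{metadata, mgmtClusterName}', '"$source_mc"') WHERE val->>'rowType'='addOn' AND val->'metadata'->>'clusterName'='$cluster';
-- Step 7: main cluster row
UPDATE public."K8sClusterDetails" SET val = jsonb_set(val, '{metadata, mgmtClusterName}', '"$source_mc"') WHERE val->>'rowType'='cluster' AND val->'metadata'->>'name'='$cluster';
-- Step 8: clear the pivot flags
UPDATE public."K8sClusterDetails" SET val = val #- '{status, hasPivotError}' WHERE val->>'rowType'='cluster' AND val->'metadata'->>'name'='$cluster';
UPDATE public."K8sClusterDetails" SET val = val #- '{status, pivotAccepted}' WHERE val->>'rowType'='cluster' AND val->'metadata'->>'name'='$cluster';
COMMIT;
EOF
}
```

After reviewing the printed SQL, it could be piped into the same psql invocation used in step 4, for example `revert_sql <WORKLOAD_CLUSTER_NAME> <SOURCE_MC_NAME> | kubectl exec -i postgres-0 -n tca-mgr -- psql -d tca -U tca_admin -h localhost -v ON_ERROR_STOP=1`. With ON_ERROR_STOP set, psql exits on the first error and the open transaction is rolled back when the session ends.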

Once the Workload Cluster is associated with the Source Management Cluster again, choose one of the following cleanup options:

Option A: Decommission destination management cluster (Preferred)

  1. Delete the broken Destination Management Cluster.

  2. Deploy a new, healthy Destination Management Cluster.

  3. Run the Cluster Diagnosis workflow to verify health before attempting the re-home operation again.

Option B: Retain and clean up the destination management cluster

To prevent the Workload Cluster CRs from remaining registered on both the source and destination management clusters, manually purge the stale Workload Cluster CRs on the destination management cluster.

  1. Delete all NodePools (this also deletes all associated MachineDeployments):

    kubectl delete -n <WORKLOAD_CLUSTER_NAME> tknp --all

  2. Delete the TKC object (this also deletes the TKCP and all associated cluster-api resources):

    kubectl delete -n <WORKLOAD_CLUSTER_NAME> tkc <WORKLOAD_CLUSTER_NAME>

  3. Delete the Workload Cluster namespace to remove all remaining orphaned resources:

    kubectl delete ns <WORKLOAD_CLUSTER_NAME>
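Because the three deletions above are destructive and order-dependent, it can help to wrap them so that they are printed for review first and stop at the first failure. The following is a minimal sketch under stated assumptions: the function names and the APPLY and KUBECTL variables are hypothetical conventions of this example, not TCA or kubectl features.

```shell
#!/bin/sh
# Hypothetical wrapper for Option B. By default it only prints the commands;
# set APPLY=1 to execute them. KUBECTL can be overridden, for example to add
# a --kubeconfig argument for the destination management cluster.
run_or_print() {
  if [ "${APPLY:-0}" = "1" ]; then
    "$@"
  else
    echo "+ $*"
  fi
}

cleanup_stale_crs() {
  cluster=${1:-}
  if [ -z "$cluster" ]; then
    echo "usage: cleanup_stale_crs <WORKLOAD_CLUSTER_NAME>" >&2
    return 1
  fi
  # KUBECTL is intentionally unquoted so that extra arguments are split into words.
  # The && chain keeps the documented order and stops at the first failure.
  run_or_print ${KUBECTL:-kubectl} delete -n "$cluster" tknp --all &&
  run_or_print ${KUBECTL:-kubectl} delete -n "$cluster" tkc "$cluster" &&
  run_or_print ${KUBECTL:-kubectl} delete ns "$cluster"
}
```

Calling `cleanup_stale_crs <WORKLOAD_CLUSTER_NAME>` prints the three commands without running them. Calling it again with APPLY=1 set executes them against whichever cluster kubectl currently targets, so confirm that this is the destination management cluster before doing so.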