Cluster Custom Resource (CR) displays a false status (waiting for remediation) in a stretched cluster scenario.

Article ID: 325403

Products

VMware Telco Cloud Automation

Issue/Introduction

This article provides the steps to work around this issue in TCA 2.0.

The workaround corrects the incorrect cluster status on the stretched cluster by removing the stale remediation condition before importing the cluster and/or associated node pools to TCA.

Symptoms:
When a cluster custom resource (CR) reports a status of "WaitingForRemediation", the cluster and/or all associated node pools cannot be imported to Telco Cloud Automation (TCA) using tcactl.

See the sample CR status below:
 status:
  conditions:
  - lastTransitionTime: "2022-05-07T18:58:28Z"
    reason: WaitingForRemediation @ Machine/<node_name>
    severity: Warning
    status: "False"
    type: Ready

Operations to instantiate CNFs using TCA will fail when the cluster is in this state.
 
Check the cluster CR status:


1) SSH to the control plane node of the management cluster corresponding to the stretched cluster where the problem is observed.

2) Verify the cluster CR status:
 kubectl get tcakubernetesclusters -n <cluster_name> <cluster_name> -o yaml

The status of the Ready condition for a problematic cluster will be as shown below:
 status:
  conditions:
  - lastTransitionTime: "2022-05-07T18:58:28Z"
    reason: WaitingForRemediation @ Machine/<cluster_node_name>
    severity: Warning
    status: "False"
    type: Ready
  - lastTransitionTime: "2022-05-07T18:58:28Z"
    reason: WaitingForRemediation @ Machine/<cluster_node_name>
    severity: Warning
    status: "False"
    type: ControlPlaneReady
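
For a quicker spot check, just the Ready condition can be printed with a jsonpath query (a convenience sketch; it assumes the same resource, namespace, and name placeholders as above):
 kubectl get tcakubernetesclusters -n <cluster_name> <cluster_name> -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'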

3) Verify the KubeadmControlPlane (kcp) status:
kubectl get kcp -n <cluster_name> <cluster_node_name> -o yaml
 status:
  conditions:
  - lastTransitionTime: "2022-05-07T18:58:28Z"
    reason: WaitingForRemediation @ Machine/<cluster_node_name>
    severity: Warning
    status: "False"
    type: Ready

  - lastTransitionTime: "2022-05-07T18:58:28Z"
    reason: WaitingForRemediation @ Machine/<cluster_node_name>
    severity: Warning
    status: "False"
    type: MachinesReady
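
If the KubeadmControlPlane object name is not known, list the kcp objects in the cluster namespace first (a convenience step; the namespace placeholder is the same as above):
 kubectl get kcp -n <cluster_name>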
 
4) Verify machine status:
kubectl get machines -n <cluster_name> <cluster_node_name> -o yaml
conditions:
  - lastTransitionTime: "2022-05-07T18:58:26Z"
    reason: WaitingForRemediation
    severity: Warning
    status: "False"
    type: Ready
  - lastTransitionTime: "2022-05-07T18:57:59Z"
    reason: WaitingForRemediation
    severity: Warning
    status: "False"
    type: OwnerRemediated
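
If it is not clear which machine carries the failing condition, all machines in the cluster namespace can be scanned in one pass (a convenience sketch; it assumes the same namespace placeholder and the condition fields shown above):
 kubectl get machines -n <cluster_name> -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.conditions[?(@.type=="Ready")].reason}{"\n"}{end}'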

Do not proceed with the workaround steps if these conditions are not present as shown above.

Environment

VMware Telco Cloud Automation 2.0
VMware Telco Cloud Automation 2.0.1

Cause

This problem is caused by a clean-up issue in Cluster API (capi) 0.3.23, the version used with TCA 2.0/TKG 1.4.

Resolution

This issue is to be addressed in Telco Cloud Automation (TCA) 2.1.

Workaround:
1) Download kubectl 1.24:
 https://dl.k8s.io/release/v1.24.0/bin/linux/amd64/kubectl
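
For example, the binary can be downloaded and made executable as follows (a sketch; it assumes a Linux amd64 workstation with internet access):
 curl -LO https://dl.k8s.io/release/v1.24.0/bin/linux/amd64/kubectl
 chmod +x ./kubectl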
 
2) Validate the version of kubectl by running the following command from the directory where the downloaded kubectl binary is located:
./kubectl version --client
The reported client version should be v1.24.0.

3) SSH to the management cluster (corresponding to the problematic TKG stretched cluster) using capv@<mgmt-cluster-controlplane-ip>

4) Generate a kubeconfig file with the following command, replacing CLUSTERNAME-CONF with a sensible name based on the cluster you are working with:
kubectl config view --minify --raw >> /tmp/CLUSTERNAME-CONF

5) SCP the kubeconfig file to the system where the kubectl binary from step 1 was downloaded.
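
For example (a sketch; the remote path assumes the file generated in step 4 and the control plane IP from step 3):
 scp capv@<mgmt-cluster-controlplane-ip>:/tmp/CLUSTERNAME-CONF .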

6) Run the following command:
./kubectl --kubeconfig <kubeconfig_filename> get kcp <cluster_node_name> -n <cluster_name> -o yaml

Replace the namespace and control plane object name with those of the affected cluster, and replace the kubeconfig file name with the file copied in step 5. This validates the kubectl connection directly from your system to the management cluster, using the provided kubeconfig file.

7) Run the following command:
 ./kubectl --kubeconfig <kubeconfig_filename> edit machine -n <cluster_name> <cluster_node_name> --subresource=status

8) Delete the condition with type OwnerRemediated.
Note: Delete the entire condition, not only the reason line. Each condition starts at a "-" character and ends just before the next "-" (or the end of the conditions list). The first line of a condition will normally be "- lastTransitionTime: xxxxx".

Note: The "--subresource=status" flag is mandatory and must not be skipped.
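
To confirm exactly which entry will be removed before opening the editor, the OwnerRemediated condition can be printed on its own (a convenience check using the same placeholders as step 7):
 ./kubectl --kubeconfig <kubeconfig_filename> get machine -n <cluster_name> <cluster_node_name> -o jsonpath='{.status.conditions[?(@.type=="OwnerRemediated")]}'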

9) Run the following command to verify that the Ready status is set to "True":
kubectl get kcp -n <cluster_name> <cluster_node_name> -o yaml
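
Alternatively, just the Ready condition status can be checked (a spot check using the same placeholders):
 kubectl get kcp -n <cluster_name> <cluster_node_name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'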


Additional Information

Affected versions:
TCA 2.0 / TKG 1.4 / capi 0.3.23