Workload cluster lifecycle operations are stuck with reason: EndpointMonitorApplyingFailed

Products

VMware Telco Cloud Automation

Issue/Introduction

TCA was migrated from TCA 2.3 to 3.2.
In TCA 2.3, the Airgap and Harbor partner systems were registered with FQDNs in uppercase, and no issues were observed with this configuration.
The following error is observed when creating or updating a cluster:

Error is due to Endpoint monitor reconciling failure, requeuing" tcakuberneterscluster="test-tkg-harbor-01" namespace="test-tkg-harbor-01" error="failed to apply endpoint monitor for airgap appliance: failed to create the endpoint: Endpoint.monitoring.telco.vmware.com \"TESTAIRGAP.example.com.443\" is invalid: metadata.name: Invalid value: \"TESTAIRGAP.example.com.443\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')"

Environment

3.2

Cause

In TCA 3.2, during workload cluster lifecycle operations such as create and update, the management cluster tries to create custom resources known as "endpoints" on the target workload cluster. Each endpoint uses the address or FQDN as part of its name. If the address contains uppercase letters, the resulting name is not a valid custom resource name, so the endpoint creation fails and the cluster remains stuck in "processing" status.

The same issue occurs if the vCenter FQDN contains uppercase letters.

For clarification:

vCenter and/or Airgap FQDN contains uppercase letters: the workload cluster is stuck in "processing" status with the endpoint creation error.
Harbor FQDN contains uppercase letters: the workload cluster is shown as "provisioned" if the vCenter and Airgap FQDNs are in lowercase, but the Harbor addon is shown as "processing".
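
The name check that rejects the endpoint can be reproduced locally, for example before registering a partner system. The sketch below is a minimal illustration and is not part of the attached script. It assumes the endpoint name is formed as `<FQDN>.<port>`, as the error message suggests, and the function names `is_rfc1123_subdomain` and `check_endpoint_name` are hypothetical. The pattern is the RFC 1123 subdomain regex that Kubernetes quotes in its validation message.

```shell
# Return 0 if $1 is a valid lowercase RFC 1123 subdomain, 1 otherwise.
# The regex and the 253-character limit follow Kubernetes object-name validation.
is_rfc1123_subdomain() {
  local name="$1"
  local re='^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$'
  # LC_ALL=C keeps [a-z] from matching uppercase letters in some locales.
  [ "${#name}" -le 253 ] && printf '%s\n' "$name" | LC_ALL=C grep -Eq "$re"
}

# Assumption: the endpoint name is "<FQDN>.<port>", as seen in the error message.
check_endpoint_name() {
  local fqdn="$1"
  local port="${2:-443}"
  if is_rfc1123_subdomain "${fqdn}.${port}"; then
    echo "OK: ${fqdn}.${port}"
  else
    echo "INVALID: ${fqdn}.${port}"
    return 1
  fi
}
```

With these definitions, `check_endpoint_name TESTAIRGAP.example.com 443` prints `INVALID: TESTAIRGAP.example.com.443` and returns a non-zero status, while `check_endpoint_name testairgap.example.com 443` prints `OK: testairgap.example.com.443`.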

Resolution

A script is provided as a workaround for this issue. At a high level, the script does the following:

It logs in to the specified management cluster and looks up the configuration of the monitor operator addon.

It copies that configuration and modifies the copy to mark the monitor operator addon as disabled.

It changes the related TBR to use the newly copied addon configuration.

As a result, for an existing workload cluster stuck in "processing" status, the management cluster sees that the monitor operator is disabled and no longer tries to create any "endpoints" for it. Likewise, the management cluster skips creating "endpoints" for workload clusters created afterwards.

Follow the steps below to apply the script:

SSH into the TCA-CP VM that manages the management cluster as the "admin" user.
Copy the attached script to the TCA-CP VM, either by pasting its content into a file or by using scp.
Execute `bash <script filename> --mc <MC name>`.
Example: `bash disable-monitor-operator --mc mc-tkg-252`.
To revert the change, run `bash <script filename> --mc <MC name> --restore`.
Add `--verbose` to show more detailed information.
If multiple management clusters have this issue, run the script for each of them.
If the management cluster is upgraded later, the script must be run again.
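
When several management clusters are affected, the per-cluster invocation above can be wrapped in a loop. The following is a minimal sketch and is not part of the attached script. The function name `run_for_each_mc`, the `DRY_RUN` variable, and any cluster name other than `mc-tkg-252` are illustrative assumptions. Only the `--mc` option described above is used.

```shell
# Run the workaround script once per management cluster.
# Usage: run_for_each_mc <script filename> <MC name> [<MC name> ...]
# Set DRY_RUN=1 to print the commands instead of executing them.
run_for_each_mc() {
  local script="$1"
  shift
  local mc failed=0
  for mc in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "bash ${script} --mc ${mc}"
    elif ! bash "$script" --mc "$mc"; then
      # Keep going so one failing cluster does not block the others.
      echo "Script failed for management cluster: ${mc}" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

For example, after setting `DRY_RUN=1`, calling `run_for_each_mc disable-monitor-operator mc-tkg-252 mc-tkg-253` prints one `bash disable-monitor-operator --mc <MC name>` line per cluster without changing anything. Unset `DRY_RUN` to run the script for real.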

When the `--restore` option is provided, the script only changes the TBR back to the old addon configuration.

After the script is applied, the "health" column in the "connected endpoint" section of the cluster's detail page displays "Unknown". Clicking the "View Health Details" link next to the "Unknown" health status shows an error message saying that the endpoint is not found. This is expected, because the creation of the endpoint has been disabled.
The "status" column is not impacted, and the "connected endpoint" page in the administration page is not impacted either.

Attachments

disable-monitor-operator-7e85476