Clusters Stuck in Detaching status from TMC Self-Managed UI

Article ID: 378981

Products

VMware Tanzu Mission Control - SM
VMware Tanzu Mission Control Self-Managed

Issue/Introduction

When you detach clusters from the Tanzu Mission Control Self-Managed (TMC SM) UI, the clusters remain stuck in the Detaching state and the detachment never completes.

Although the TMC UI still shows the status as Detaching, the tmc namespace and agents have already been removed from the backend cluster.
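
One way to confirm this symptom is to check, from the backend (workload) cluster context, that the agent namespace is gone. The sketch below assumes the agents run in a namespace named vmware-system-tmc; the article only says "tmc namespace", so override AGENT_NS if your environment uses a different name. The function name is illustrative.

```shell
# Assumption: the TMC agents run in "vmware-system-tmc"; override AGENT_NS if not.
AGENT_NS="${AGENT_NS:-vmware-system-tmc}"

# Returns 0 when the agent namespace is absent from the current kubectl context.
# Note: any kubectl failure (for example, no connectivity) is also reported as
# "not found", so confirm the context is reachable first.
agent_namespace_removed() {
  if kubectl get namespace "$AGENT_NS" >/dev/null 2>&1; then
    echo "namespace $AGENT_NS still exists"
    return 1
  fi
  echo "namespace $AGENT_NS not found"
}
```

Run agent_namespace_removed while kubectl points at the workload cluster, not at the TMC SM cluster.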

Cause

The issue is caused by the expiration of certificates used by key services such as Kafka Exporter and Cluster Reaper. Although cert-manager rotates the certificates, the running pods do not pick up the newly issued certificates and continue to use the expired ones, which leads to authentication failures.

The following log entries indicate the certificate expiration:

Cluster Reaper service logs:

tls: failed to verify certificate: x509: certificate has expired or is not yet valid: current time 

Kafka Exporter logs:

tls: failed to verify certificate: x509: certificate has expired or is not yet valid: current time 2024-10-01T10:18:05Z is after 2024-09-21T06:54:48Z
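
To find which pods are logging this error, the recent logs of every pod in the namespace can be scanned for the x509 message. This is a minimal sketch: the function name and the 200-line tail are arbitrary choices, and it assumes the affected services run in the tmc-local namespace named in the Resolution section.

```shell
# Namespace taken from the article; adjust if your installation differs.
NS="tmc-local"

# Print the name of every pod whose recent logs contain the expired-certificate error.
find_expired_cert_pods() {
  for pod in $(kubectl -n "$NS" get pods -o name); do
    # Pods with no readable logs are skipped silently.
    if kubectl -n "$NS" logs "$pod" --all-containers --tail=200 2>/dev/null \
        | grep -q 'x509: certificate has expired'; then
      echo "$pod"
    fi
  done
}
```

Calling find_expired_cert_pods prints one pod name per line; an empty result means the message was not found in the sampled log lines, not that every certificate is valid.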

 

This is a recurring issue where cert-manager successfully renews the certificates, but the pods do not automatically reload them. As a result, the affected services fail to authenticate or connect, leading to problems such as "stuck" cluster detachments.
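
A quick way to spot pods that may still hold an old certificate is to compare each certificate's validity window with the pod start times. The sketch below assumes the cert-manager Certificate resources live in the tmc-local namespace and relies on the notBefore and notAfter fields that cert-manager records in the Certificate status; the function names are illustrative.

```shell
# Namespace taken from the article; adjust if your installation differs.
NS="tmc-local"

# List cert-manager Certificate resources with their current validity window.
list_cert_validity() {
  kubectl -n "$NS" get certificates.cert-manager.io \
    -o custom-columns='NAME:.metadata.name,NOT_BEFORE:.status.notBefore,NOT_AFTER:.status.notAfter'
}

# List pods with their start time. A pod that started before a certificate's
# NOT_BEFORE may still be using the previous, now expired, certificate.
list_pod_start_times() {
  kubectl -n "$NS" get pods \
    -o custom-columns='NAME:.metadata.name,STARTED:.status.startTime'
}
```

Run list_cert_validity and then list_pod_start_times and compare the timestamps by eye. A pod older than the renewal is only a candidate, since some services reload certificates on their own.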

Resolution

To resolve the issue, perform a rollout restart of all relevant pods in the tmc-local namespace. This ensures that the restarted pods pick up the new certificates.


Run the following command for every deployment and stateful set in the namespace, replacing the resource type and name as needed (account-manager-server is shown as an example):

kubectl -n tmc-local rollout restart deployment account-manager-server


These commands restart each service so that it loads the new certificates. After the restart, the new pods pick up the updated certificates, and the clusters that were previously stuck in the Detaching state detach as expected.
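
Rather than running the command once per resource, the restart can be scripted as a loop over every deployment and stateful set in the namespace. The following is a sketch under the article's assumptions (namespace tmc-local); the function names and the 5-minute timeout are illustrative choices.

```shell
# Namespace taken from the article; adjust if your installation differs.
NS="tmc-local"

# Trigger a rollout restart of every deployment and stateful set in the namespace.
restart_all_workloads() {
  for kind in deployment statefulset; do
    # "-o name" prints entries such as deployment.apps/account-manager-server
    for name in $(kubectl -n "$NS" get "$kind" -o name); do
      kubectl -n "$NS" rollout restart "$name"
    done
  done
}

# Wait for each restarted workload to finish rolling out.
# Note: "rollout status" reports an error for stateful sets that do not use
# the RollingUpdate strategy; check those pods manually.
wait_for_rollouts() {
  for kind in deployment statefulset; do
    for name in $(kubectl -n "$NS" get "$kind" -o name); do
      kubectl -n "$NS" rollout status "$name" --timeout=5m
    done
  done
}
```

Call restart_all_workloads first and then wait_for_rollouts. Once the rollouts complete, check the cluster status in the TMC UI again.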


This process ensures that all affected services resume normal operation with valid certificates.