Tanzu Mission Control Self-Managed (TMC-SM) Lifecycle Cluster Operations (Deploy, Create, Delete, Modify) Stuck due to Kafka Stale Lock

Article ID: 429040

Updated On:

Products

VMware Tanzu Platform - Kubernetes

Issue/Introduction

Lifecycle cluster operations in Tanzu Mission Control Self-Managed, such as deploying, deleting, or modifying clusters, may become stuck in a Pending state within the TMC UI.

Investigation of the TMC-SM management cluster pods reveals that the Kafka broker is in a CrashLoopBackOff state, displaying the following error in the logs:

org.apache.kafka.common.KafkaException: Failed to acquire lock on file .lock in /bitnami/kafka/data. A Kafka instance in another process or thread is using this directory.

This failure prevents the platform from processing task status updates or state changes across the environment.
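The check described above can be scripted. The following is a minimal sketch, not part of the original article: the function name has_kafka_lock_error is hypothetical, and the pod name and namespace in the usage comment are placeholders to replace with values from your environment.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads log text on stdin and succeeds (exit 0)
# if it contains the Kafka lock error quoted in this article.
has_kafka_lock_error() {
  # -F treats the pattern as a fixed string, so the dot in ".lock" is literal.
  grep -qF 'Failed to acquire lock on file .lock in'
}

# Example usage (run manually against the management cluster):
#   kubectl logs <kafka-pod-name> -n <tmc-namespace> --previous \
#     | has_kafka_lock_error && echo "stale Kafka lock detected"
```

The --previous flag asks kubectl for the logs of the last terminated container, which is usually where the error appears when a pod is in CrashLoopBackOff.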

Cause

The underlying cause is a stale filesystem lock on the Kafka data directory within the TMC-SM management cluster.

This can occur after an ungraceful shutdown (e.g., node failure or abrupt pod restart) that leaves a .lock file behind on the persistent volume. When the Kafka process restarts, it cannot acquire the lock on this file and refuses to start in order to prevent potential data corruption. As a result, lifecycle cluster operations cannot move past the "Pending" phase because the backend cannot acknowledge the completion of tasks.

Resolution

To resolve this issue, manually remove the stale lock file from the Kafka persistent volume so that the broker can start and resume processing lifecycle tasks.

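  1. Find the Kafka pods and their parent StatefulSet in the TMC-SM namespace (commonly tmc-local or similar).
    • kubectl get pods -A | grep kafka
    • kubectl get statefulset -A | grep kafka
  2. Note the current number of replicas so that it can be restored later.
    • kubectl get statefulset <kafka-statefulset-name> -n <tmc-namespace> -o jsonpath='{.spec.replicas}'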
  3. Scale the StatefulSet to Zero: Scale down the Kafka StatefulSet to ensure no active processes are accessing the volume.
    • kubectl scale statefulset <kafka-statefulset-name> -n <tmc-namespace> --replicas=0
  4. Remove the Stale Lock File: Mount the Kafka PVC in a temporary maintenance pod (or use an existing tool that has access to the volume) and delete the lock file.
    • rm /bitnami/kafka/data/.lock
  5. Restore Kafka Service: Scale the StatefulSet back to its original replica count.
    • kubectl scale statefulset <kafka-statefulset-name> -n <tmc-namespace> --replicas=<previous count>
  6. Verify Recovery:
    • Monitor the Kafka logs to confirm the broker transitions to a STARTED state.
    • Once Kafka is healthy, verify that the lifecycle cluster operations in the TMC UI begin to update and complete successfully.
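
Step 4 does not spell out how to mount the PVC, so the following is one possible sketch rather than the documented procedure. Everything it introduces is an assumption: the function name, the pod name kafka-lock-cleanup, the busybox image (an air-gapped environment would need an image from its own registry), and the default mount path /bitnami/kafka. The mount path should match the volumeMount of the Kafka container, which can be checked in the StatefulSet spec, so that the lock file appears at the path used in step 4.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a manifest for a temporary pod that mounts
# the given Kafka PVC. Pod name, image, and default mount path are
# assumptions; adjust them to the environment.
#   $1 = PVC name (list with: kubectl get pvc -n <tmc-namespace>)
#   $2 = mount path (optional, defaults to /bitnami/kafka)
kafka_maintenance_pod_manifest() {
  local pvc="$1"
  local mount_path="${2:-/bitnami/kafka}"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: kafka-lock-cleanup
spec:
  restartPolicy: Never
  containers:
  - name: cleanup
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: kafka-data
      mountPath: ${mount_path}
  volumes:
  - name: kafka-data
    persistentVolumeClaim:
      claimName: ${pvc}
EOF
}

# Example usage (run manually, after the StatefulSet is scaled to zero):
#   kafka_maintenance_pod_manifest <kafka-pvc-name> \
#     | kubectl apply -n <tmc-namespace> -f -
#   kubectl exec -n <tmc-namespace> kafka-lock-cleanup -- rm /bitnami/kafka/data/.lock
#   kubectl delete pod kafka-lock-cleanup -n <tmc-namespace>
```

Delete the maintenance pod before scaling the StatefulSet back up, since a volume with ReadWriteOnce access may not be attachable to the Kafka pod while another pod still holds it. If the StatefulSet has several replicas, each has its own PVC, and the cleanup would be repeated for every PVC whose broker reports the lock error.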