TMC-SM Agent Pods in CrashLoopBackOff After Supervisor Upgrade or Re-registration

Article ID: 437606


Products

  • VMware vSphere Kubernetes Service
  • VMware Tanzu Mission Control - SM

Issue/Introduction

  • After a vSphere Supervisor upgrade, or after a Supervisor is de-registered from and then re-registered to Tanzu Mission Control Self-Managed (TMC-SM), pods of the domain-local-ds DaemonSet in the TMC namespace may fail to initialize and remain in the Init:CrashLoopBackOff state.
  • In the affected pods, the init-node init container repeatedly terminates with exit code 255. This prevents the TMC agents from reconciling on the worker nodes during deployment or upgrade.
  • Running kubectl get pods -n <tmc-namespace> shows pods belonging to the domain-local-ds DaemonSet stuck in Init:CrashLoopBackOff.

  • Describing the pod (kubectl describe pod <pod-name> -n <tmc-namespace>) reveals the init-node container failing repeatedly.

  • Container logs for the init-node container are often empty or abruptly cut off.
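The checks above can be gathered in one pass. The following is a minimal sketch, not part of TMC-SM: the helper name list_init_crashloop_pods is hypothetical, and it assumes the default column layout of kubectl get pods (NAME, READY, STATUS, ...).

```shell
#!/bin/sh
# Hypothetical helper (not shipped with TMC-SM): reads `kubectl get pods`
# output on stdin and prints the names of pods whose STATUS column is
# Init:CrashLoopBackOff. Assumes the default column order NAME READY STATUS.
list_init_crashloop_pods() {
    awk '$3 == "Init:CrashLoopBackOff" { print $1 }'
}

# Example usage against a live cluster (replace the placeholders first):
#   kubectl get pods -n <tmc-namespace> | list_init_crashloop_pods
#
# Logs from the previous attempt and the last exit code of init-node:
#   kubectl logs <pod-name> -n <tmc-namespace> -c init-node --previous
#   kubectl get pod <pod-name> -n <tmc-namespace> \
#     -o jsonpath='{.status.initContainerStatuses[?(@.name=="init-node")].lastState.terminated.exitCode}'
```

An exit code of 255 together with empty or truncated logs matches the symptoms described in this article.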

 

Environment

  • VMware Tanzu Mission Control - SM

Cause

  • This issue is caused by a race condition between the init-node container's initialization script and the host's container runtime (containerd). It is not a TMC connectivity failure. It is a side effect of the agent installer restarting the container runtime from inside a container during deployment or upgrade.
  • The domain-local-ds DaemonSet uses an init container (init-node) that enters the host's mount namespace with nsenter. It writes a custom TLS certificate to the worker node and then restarts the node's containerd service to apply the new certificate.
  • The original execution command looks like this:

nsenter --mount=/proc/1/ns/mnt -- sh -c 'printenv "tls.crt" > /etc/ssl/certs/$stack_type.crt ; systemctl restart containerd'

  • Restarting containerd stops the very service that is running the init-node container. The container is therefore killed before it can report a successful exit to the Kubelet.

  • A review of the node-level system logs during pod execution confirms this sequence:
    1. The container successfully issues the restart command to the host OS.
    2. The new containerd process boots up, scrubs the abruptly killed container process, and flags it as a "dead shim".

    3. The Kubelet temporarily loses connection to the containerd.sock socket, registers a hard failure, and triggers the CrashLoopBackOff.
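To review this sequence on an affected node, the relevant journal entries can be filtered as sketched below. This is an illustration under assumptions: the helper name filter_runtime_restart_lines is hypothetical, the exact log wording differs between containerd and kubelet versions, and the search terms (shim, containerd.sock) are taken from the description above rather than from a known log format.

```shell
#!/bin/sh
# Hypothetical helper: reads journal text on stdin and keeps only lines that
# mention a shim or the containerd socket (case-insensitive match).
# The exact message text is an assumption and varies by version.
filter_runtime_restart_lines() {
    grep -i -E 'shim|containerd\.sock'
}

# Example usage on the worker node (requires shell access to the node):
#   journalctl -u containerd --since "15 minutes ago" --no-pager | filter_runtime_restart_lines
#   journalctl -u kubelet --since "15 minutes ago" --no-pager | filter_runtime_restart_lines
```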

Resolution

This issue is resolved in TMC-SM v1.4.4.

Workaround

  • To work around this issue, modify the domain-local-ds DaemonSet so that a short sleep (sleep 5) runs immediately after the systemctl restart containerd command.
  • The delay keeps the shell process alive long enough for the containerd service to finish restarting. This prevents the runtime from marking the container as a dead shim and lets the Kubelet process the state change gracefully.

Step 1: Patch the DaemonSet

Run the following command, replacing <tmc-namespace> with your TMC namespace:

kubectl patch ds domain-local-ds -n <tmc-namespace> --type='strategic' -p '{"spec": {"template": {"spec": {"initContainers": [{"name": "init-node", "command": ["nsenter", "--mount=/proc/1/ns/mnt", "--", "sh", "-c", "printenv TLS_CRT > /etc/ssl/certs/$stack_type.crt ; systemctl restart containerd && sleep 5"], "env": [{"name": "TLS_CRT", "valueFrom": {"configMapKeyRef": {"name": "stack-config", "key": "tls.crt", "optional": false}}}]}]}}}}'
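Before restarting the rollout, it can help to confirm that the patch took effect. The following is a minimal sketch: the helper name has_sleep_workaround is hypothetical, and it assumes the init-node command is read back as text with a kubectl jsonpath query.

```shell
#!/bin/sh
# Hypothetical helper: reads the init-node command (as text) on stdin and
# succeeds only if a "sleep 5" follows the containerd restart. The pattern
# deliberately does not match the "&&" itself, in case it is printed escaped.
has_sleep_workaround() {
    grep -q -E 'systemctl restart containerd.*sleep 5'
}

# Example usage against a live cluster (replace the placeholder first):
#   kubectl get ds domain-local-ds -n <tmc-namespace> \
#     -o jsonpath='{.spec.template.spec.initContainers[?(@.name=="init-node")].command}' \
#     | has_sleep_workaround && echo "workaround present" || echo "workaround missing"
```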

Step 2: Restart the DaemonSet Rollout

Force the DaemonSet to redeploy its pods so that they pick up the patched configuration:

kubectl rollout restart ds/domain-local-ds -n <tmc-namespace>

Step 3: Monitor the Rollout

Verify that the new pods initialize successfully and reach the Running state:

kubectl rollout status ds/domain-local-ds -n <tmc-namespace>
kubectl get pods -n <tmc-namespace>
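On clusters with many pods in the TMC namespace, the output of the last command can be narrowed to the DaemonSet's pods that are not yet Running. This is a minimal sketch: the helper name list_unhealthy_domain_local_pods is hypothetical, and it assumes the pods carry the usual domain-local-ds- name prefix and the default kubectl get pods column layout.

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get pods` output on stdin and prints
# the name and status of every domain-local-ds pod that is not Running.
# Prints nothing when all of the DaemonSet's pods are Running.
list_unhealthy_domain_local_pods() {
    awk '$1 ~ /^domain-local-ds-/ && $3 != "Running" { print $1, $3 }'
}

# Example usage against a live cluster (replace the placeholder first):
#   kubectl get pods -n <tmc-namespace> | list_unhealthy_domain_local_pods
```

Empty output after the rollout completes indicates that no domain-local-ds pod is still initializing or crash-looping.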