TMC-SM Agent Pods in CrashLoopBackOff After Supervisor Upgrade or Re-registration

Products

VMware vSphere Kubernetes Service
VMware Tanzu Mission Control - SM

Issue/Introduction

Following a vSphere Supervisor cluster upgrade, or after de-registering and subsequently re-registering a Supervisor to Tanzu Mission Control Self-Managed (TMC-SM), domain-local pods within the TMC namespace may fail to initialize, entering a persistent Init:CrashLoopBackOff state.
Inspecting the affected pods shows the init-node container repeatedly terminating with exit code 255. This prevents the TMC agents from reconciling on the worker nodes during the deployment or upgrade phase.
Running kubectl get pods -n <tmc-namespace> shows pods belonging to the domain-local-ds DaemonSet stuck in Init:CrashLoopBackOff.
Describing the pod (kubectl describe pod <pod-name> -n <tmc-namespace>) reveals the init-node container failing repeatedly.
Container logs for the init-node container are often empty or abruptly cut off.
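
The symptoms above can be checked with a short script. The following is a minimal sketch, not part of the official procedure: the function names list_stuck_pods and inspect_init_node are hypothetical, and it assumes kubectl is already pointed at the cluster where the TMC agents run and that kubectl get pods prints its default columns (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: list domain-local-ds pods stuck in Init:CrashLoopBackOff.
# $1 is the TMC namespace.
list_stuck_pods() {
  ns="$1"
  # --no-headers drops the header row; STATUS is the third default column.
  kubectl get pods -n "$ns" --no-headers |
    awk '$1 ~ /^domain-local-ds-/ && $3 == "Init:CrashLoopBackOff" { print $1 }'
}

# Hypothetical helper: print the last exit code of the init-node container,
# then the logs of its previous (terminated) instance.
# $1 is the TMC namespace, $2 is the pod name.
inspect_init_node() {
  ns="$1"
  pod="$2"
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.initContainerStatuses[?(@.name=="init-node")].lastState.terminated.exitCode}{"\n"}'
  # --previous reads the last terminated instance; output may be empty or cut off.
  kubectl logs "$pod" -n "$ns" -c init-node --previous
}
```

For example, list_stuck_pods <tmc-namespace> prints the affected pod names, and inspect_init_node <tmc-namespace> <pod-name> prints the recorded exit code followed by whatever log output survived.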

Environment

VMware Tanzu Mission Control - SM

Cause

This issue is caused by a race condition between the init-node container's initialization script and the underlying host's container runtime (containerd). It is not a failure of TMC connectivity, but rather an architectural quirk triggered by how the agent installer interacts with the container runtime in the Kubernetes environment during deployment or upgrade cycles.
The domain-local-ds DaemonSet utilizes an initialization container (init-node) designed to escape its containerized isolation using nsenter. Its purpose is to write a custom TLS certificate to the underlying worker node and subsequently restart the host node's containerd service to apply the new certificate.
The original execution command looks like this:

nsenter --mount=/proc/1/ns/mnt -- sh -c 'printenv "tls.crt" > /etc/ssl/certs/$stack_type.crt ; systemctl restart containerd'

Because containerd is the service that keeps the init-node container itself running, restarting it terminates the container before it can report a successful exit to the Kubelet.
A review of the node-level system logs during pod execution confirms this sequence:

1. The container successfully issues the restart command to the host OS.
2. The new containerd process boots up, scrubs the abruptly killed container process, and flags it as a "dead shim".
3. The Kubelet temporarily loses connection to the containerd.sock socket, registers a hard failure, and triggers the CrashLoopBackOff.
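
The sequence above can be reviewed on an affected worker node along these lines. This is a rough sketch under several assumptions: shell access to the node is available, the Kubelet runs as a systemd unit named kubelet (containerd is already managed through systemctl in the command shown earlier), and the function names and the 15-minute default window are hypothetical choices. The filter patterns are deliberately loose, since exact log wording varies between containerd and Kubelet versions.

```shell
# Hypothetical helper: print containerd and kubelet journal entries, interleaved by time.
# $1 is an optional journalctl time expression; the default window is an arbitrary choice.
show_runtime_journal() {
  since="${1:-15 minutes ago}"
  journalctl -u containerd -u kubelet --since "$since" --no-pager
}

# Hypothetical helper: keep only lines likely to relate to the restart sequence.
# These patterns are assumptions, not exact log messages.
filter_restart_events() {
  grep -i -E 'shim|containerd\.sock|Stopping|Starting|Started'
}
```

A typical invocation would be show_runtime_journal "30 minutes ago" | filter_restart_events, run shortly after an init-node container has crashed.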

Resolution

This issue is resolved in TMC-SM v1.4.4.

Workaround

To work around this issue, modify the domain-local-ds DaemonSet to add a brief sleep command (sleep 5) immediately after the systemctl restart command.
This delay keeps the shell process alive just long enough for the containerd service to cycle successfully, preventing the runtime from flagging the container as a dead shim and allowing the Kubelet to process the state change gracefully.

Step 1: Patch the DaemonSet
Run the following command, replacing <tmc-namespace> with your actual TMC namespace:

kubectl patch ds domain-local-ds -n <tmc-namespace> --type='strategic' -p '{"spec": {"template": {"spec": {"initContainers": [{"name": "init-node", "command": ["nsenter", "--mount=/proc/1/ns/mnt", "--", "sh", "-c", "printenv TLS_CRT > /etc/ssl/certs/$stack_type.crt ; systemctl restart containerd && sleep 5"], "env": [{"name": "TLS_CRT", "valueFrom": {"configMapKeyRef": {"name": "stack-config", "key": "tls.crt", "optional": false}}}]}]}}}}'
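
Because the one-line patch is hard to read and easy to mangle when copied, the same patch can be kept in a file and applied with the --patch-file option of kubectl patch. The following is a sketch of that alternative, not a different fix: the JSON is the same content as the command above, while the PATCH_FILE variable, its default path, and the apply_patch function are hypothetical names chosen for illustration.

```shell
# Location of the patch file; the default path is an arbitrary choice.
PATCH_FILE="${PATCH_FILE:-${TMPDIR:-/tmp}/domain-local-ds-patch.json}"

# The quoted 'EOF' delimiter stops the local shell from expanding $stack_type,
# so the variable reaches the init container unchanged.
cat > "$PATCH_FILE" <<'EOF'
{
  "spec": {
    "template": {
      "spec": {
        "initContainers": [
          {
            "name": "init-node",
            "command": [
              "nsenter", "--mount=/proc/1/ns/mnt", "--", "sh", "-c",
              "printenv TLS_CRT > /etc/ssl/certs/$stack_type.crt ; systemctl restart containerd && sleep 5"
            ],
            "env": [
              {
                "name": "TLS_CRT",
                "valueFrom": {
                  "configMapKeyRef": {"name": "stack-config", "key": "tls.crt", "optional": false}
                }
              }
            ]
          }
        ]
      }
    }
  }
}
EOF

# Hypothetical wrapper: apply the patch file to the DaemonSet.
# $1 is the TMC namespace.
apply_patch() {
  kubectl patch ds domain-local-ds -n "$1" --type=strategic --patch-file "$PATCH_FILE"
}
```

Running apply_patch <tmc-namespace> then has the same effect as the inline command. Keeping the patch in a file also makes it easier to review before applying.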

Step 2: Restart the DaemonSet Rollout
Force the DaemonSet to redeploy its pods so they pick up the patched configuration:

kubectl rollout restart ds/domain-local-ds -n <tmc-namespace>

Step 3: Monitor the Rollout
Verify that the new pods initialize successfully and transition into a Running state:

kubectl rollout status ds/domain-local-ds -n <tmc-namespace>
kubectl get pods -n <tmc-namespace>
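
If pods still fail to initialize, it can help to confirm that the live DaemonSet actually carries the modified command. The check below is a minimal sketch; patch_is_applied is a hypothetical function name, and it only looks for the literal text "sleep 5" in the init-node command as returned by a kubectl JSONPath query.

```shell
# Hypothetical helper: succeed if the live init-node command contains the added sleep.
# $1 is the TMC namespace.
patch_is_applied() {
  ns="$1"
  cmd="$(kubectl get ds domain-local-ds -n "$ns" \
    -o jsonpath='{.spec.template.spec.initContainers[?(@.name=="init-node")].command}')"
  case "$cmd" in
    *"sleep 5"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example: patch_is_applied <tmc-namespace> && echo "patch present" || echo "patch missing".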