Workload Cluster upgrade stuck with newly upgraded node in NotReady status and constantly being recreated

Article ID: 391877


Products

VMware Tanzu Kubernetes Grid Management

Issue/Introduction

During a TKGm Workload Cluster upgrade, the following symptoms are observed:

  • One or more ControlPlane and/or Worker nodes get successfully upgraded.
  • At some point, one of the nodes gets stuck in NotReady status and is constantly recreated in a loop by ClusterAPI.
  • Logging into the problematic node and checking the containerd logs shows the following error:

    # ssh capv@<node-ip>
    # sudo -i
    # journalctl -u containerd

    "PullImage \"projects.registry.vmware.com/tkg/antrea-advanced-debian@sha256:<sha>\" failed" error="rpc error: code = Canceled desc = failed to pull and unpack image \"projects.registry.vmware.com/tkg/antrea-advanced-debian@sha256:<sha>\": context canceled"

  • "crictl images" output on the node doesn't show any Antrea image.
  • "crictl pull projects.registry.vmware.com/tkg/antrea-advanced-debian@sha256:<sha>" on the node hangs and doesn't complete.
  • "projects.registry.vmware.com" is reachable via ping and curl from the problematic node.
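
The recreation loop can also be observed from the Management Cluster context. As an illustrative check (the <namespace> placeholder and the interpretation of the output are assumptions, not part of the original procedure), listing the Machine objects repeatedly shows the stuck node's Machine being deleted and recreated under a new name, while "kubectl get nodes" in the Workload Cluster context shows the recreated node remaining NotReady:

    # kubectl get machines -n <namespace>
    # kubectl get nodes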

Cause

A possible cause for this issue is an underlying networking problem on the ESXi host where the stuck node is running.

The fact that other ControlPlane/Worker nodes were upgraded successfully usually indicates that networking at the cluster level is working fine.

Since the registry is reachable, containerd may be timing out when pulling the image because connectivity to the registry is slow.
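
A quick way to spot-check this from the problematic node (an informal check, not part of the official procedure; the /v2/ endpoint is used only because OCI registries respond to it, and the timing figures are illustrative) is to measure the response time with curl's write-out variables:

    # curl -s -o /dev/null -w 'HTTP %{http_code}  total: %{time_total}s\n' https://projects.registry.vmware.com/v2/

A consistently high total time compared with the same command run from a healthy node points to the slow-connectivity scenario described above.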

 

Resolution

Based on the symptoms described above, follow these steps to verify whether there is a connectivity issue with the registry:

  1. To prevent the problematic node from being constantly recreated, pause the cluster's reconciliation; this gives more time to troubleshoot on the node.
    In the Management Cluster context:
    # kubectl patch cluster <Workload Cluster> --type merge -p '{"spec":{"paused": true}}' -n <namespace>

  2. Use the "ctr" tool to try to pull the image from the registry. The "ctr" utility offers more verbose output than "crictl", which helps identify slow connection issues (see also the note on containerd namespaces after these steps).
    # ssh capv@<node-ip>
    # sudo -i
    # ctr image pull projects.registry.vmware.com/tkg/antrea-advanced-debian@sha256:<sha>

    If the download speed is very slow, containerd will time out while pulling the image.

    In the example below, the image was still being pulled after more than 130 seconds, which caused the timeouts:

    [Screenshot: "ctr image pull" output showing the Antrea image download still in progress after 130+ seconds]

  3. After troubleshooting is complete, unpause the cluster's reconciliation:
    In the Management Cluster context:
    # kubectl patch cluster <Workload Cluster> --type merge -p '{"spec":{"paused": false}}' -n <namespace>
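
Note on containerd namespaces: "ctr" and the Kubernetes CRI use different containerd namespaces, so an image pulled with the plain "ctr image pull" command in step 2 will not appear in "crictl images". If the goal is to make a manually pulled image visible to the kubelet, the pull can target the "k8s.io" namespace instead (an optional variant of the step 2 command, shown here for illustration):

    # ctr -n k8s.io image pull projects.registry.vmware.com/tkg/antrea-advanced-debian@sha256:<sha>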

If slow connectivity with the registry is detected from the problematic VM, but not from other VMs, this could indicate an issue with the underlying ESXi host.

In the vSphere Client, migrate the VM to a different ESXi host following How to Migrate Your Virtual Machine to a New Compute Resource.

Once migrated, reboot the VM and verify its status. If containerd still fails to pull the Antrea image, run the "ctr image pull" command above again and verify connectivity with the registry.
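
As an additional verification (these commands are illustrative and assume access to the node and to the Workload Cluster context), confirm that the image is now present and that the node eventually reports Ready:

    # crictl images | grep antrea
    # kubectl get nodes -o wide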

If the issue is resolved after migrating the VM to a different ESXi host, this most likely means there's an issue with the original ESXi host's networking.
It is recommended to place the problematic ESXi host in Maintenance Mode after migrating/powering off all its hosted VMs: Place an ESXi Host in Maintenance Mode in the VMware Host Client

From the vSphere Client: right-click the ESXi host > Maintenance Mode > Enter Maintenance Mode
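
Alternatively (an equivalent approach, assuming SSH access to the ESXi host is enabled), the host can be placed in Maintenance Mode from its command line:

    # esxcli system maintenanceMode set --enable true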