Cluster upgrade fails because no IPs are available in the defined CIDR

Products

VMware vSphere Kubernetes Service
VMware vCenter Server 8.0

Issue/Introduction

During a cluster upgrade, the update process hangs in a "deleting" state. Consequently, pods go down and become unavailable, accompanied by the following event log:

kubectl describe pod <pod-name>
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePodSandBox 85s (x585 over 128m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "71xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx9536": plugin type="antrea" failed (add): rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /var/run/antrea/cni.sock: connect: no such file or directory"

containerd logs on the worker node:

MM DD HH:MM:SS tkc-xxxxx-workers-xxxxx-xxxxx-xxxx5 containerd[621780]: time="YYYY-MM-DDTHH:MM:SS.xxxxxxxxxZ" level=error msg="Failed to destroy network for sandbox \"c1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx9b\"" error="plugin type=\"antrea\" failed (delete): rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /var/run/antrea/cni.sock: connect: no such file or directory\""
MM DD HH:MM:SS tkc-xxxxx-workers-xxxxx-xxxxx-xxxx5 containerd[621780]: time="YYYY-MM-DDTHH:MM:SS.xxxxxxxxxZ" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:argocd-application-controller-0,Uid:d5bxxxxx-xxxx-xxxx-xxxx-xxxxxxxx5cfc,Namespace:argocd,Attempt:0,} failed, error" error="rpc error: code = Unknown desc = failed to setup network for sandbox \"c1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx9b\": plugin type=\"antrea\" failed (add): rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /var/run/antrea/cni.sock: connect: no such file or directory\""
MM DD HH:MM:SS tkc-xxxxx-workers-xxxxx-xxxxx-xxxx5 containerd[621780]: time="YYYY-MM-DDTHH:MM:SS.xxxxxxxxxZ" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:cert-manager-846c766d98-dzb4v,Uid:acexxxxx-xxxx-xxxx-xxxx-xxxxxxxx53f4,Namespace:cert-manager,Attempt:0,}"
MM DD HH:MM:SS tkc-xxxxx-workers-xxxxx-xxxxx-xxxx5 containerd[621780]: time="YYYY-MM-DDTHH:MM:SS.xxxxxxxxxZ" level=error msg="Failed to destroy network for sandbox \"52xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx64\"" error="plugin type=\"antrea\" failed (delete): rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /var/run/antrea/cni.sock: connect: no such file or directory\""
MM DD HH:MM:SS tkc-xxxxx-workers-xxxxx-xxxxx-xxxx5 containerd[621780]: time="YYYY-MM-DDTHH:MM:SS.xxxxxxxxxZ" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:cert-manager-846xxxxxxx-xxxxx,Uid:acexxxxx-xxxx-xxxx-xxxx-xxxxxxxx53f4,Namespace:cert-manager,Attempt:0,} failed, error" error="rpc error: code = Unknown desc = failed to setup network for sandbox \"52xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx64\": plugin type=\"

Environment

vCenter 8.0

Cause

The CIDR (Classless Inter-Domain Routing) range currently defined for the guest cluster has no available IP addresses left to continue the upgrade.
For example, a network with a /28 mask has 14 usable IPs. Keep in mind that, apart from the nodes, the cluster needs additional IPs:

The Virtual IP (VIP) requires an IP from the same network.
Infrastructure components such as the NSX Container Plugin (NCP) or internal load balancers may reserve IPs for management or gateway interfaces.
Rolling updates and maintenance require IPs from the same range. During an upgrade, new nodes are provisioned before old nodes are deleted.

For a 9-node cluster, even a single-node surge could temporarily require additional IPs that exceed the 14 available in a /28 network.
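
The arithmetic behind this can be sketched in a few lines of Python. This is only an illustration: the subnet 192.168.10.0/28 and the count of reserved infrastructure IPs are assumed values, not taken from any real environment, and should be replaced with the figures observed in your own deployment.

```python
import ipaddress

def usable_hosts(cidr: str) -> int:
    """Usable IPv4 host addresses: total minus the network and broadcast addresses."""
    net = ipaddress.ip_network(cidr, strict=False)
    return max(net.num_addresses - 2, 0)

# Hypothetical demand figures, for illustration only.
nodes = 9        # current cluster nodes
surge_nodes = 1  # extra node created before an old one is deleted
vip = 1          # Virtual IP on the same network
reserved = 4     # ASSUMED count of gateway / load balancer / NCP reservations

needed = nodes + surge_nodes + vip + reserved
available = usable_hosts("192.168.10.0/28")  # example subnet, not from the article

print(f"needed={needed}, available={available}, fits={needed <= available}")
```

With these assumed numbers the demand is 15 addresses against 14 usable, so the rolling update cannot obtain an IP for the new node.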
To see the actual number of allocated IPs, check the usage in the NSX-T Manager UI under Networking > IP Address Pools, or on the specific Segment.
For a 9-node production cluster, a /27 (30 usable IPs) is generally recommended to leave enough room for upgrades and infrastructure services.
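
When planning the new range, a small helper can pick the longest prefix (the smallest subnet) that still covers a given IP requirement. The function name `smallest_prefix` is hypothetical, and the requirement passed to it is whatever total you derive from the NSX-T allocation view plus upgrade headroom.

```python
def smallest_prefix(required_ips: int) -> int:
    """Return the longest IPv4 prefix length whose usable host count
    (2^(32 - prefix) - 2) is at least required_ips."""
    if required_ips < 1:
        raise ValueError("required_ips must be at least 1")
    # Walk from the smallest conventional subnet (/30) toward larger ones.
    for prefix in range(30, -1, -1):
        if 2 ** (32 - prefix) - 2 >= required_ips:
            return prefix
    raise ValueError("requirement exceeds the IPv4 address space")

# A requirement of 14 still fits a /28, while 15 already needs a /27.
for required in (14, 15, 30):
    print(f"{required} IPs -> /{smallest_prefix(required)}")
```

Choosing a prefix one step larger than the strict minimum leaves headroom for future node pool growth.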

Resolution

Redesign the cluster network to provide a larger pool of available IP addresses. This is done by modifying and expanding the Service and Pod CIDR ranges of the vSphere Kubernetes Guest Cluster.
Refer to: Changing Service and Pod CIDR Ranges in vSphere Kubernetes Guest Cluster