Cluster upgrade fails because no IP addresses are available in the defined CIDR

Article ID: 439101

Products

VMware vSphere Kubernetes Service
VMware vCenter Server 8.0

Issue/Introduction

During a cluster upgrade, the update process hangs in a "deleting" state. Consequently, pods go down and become unavailable, accompanied by the following event log:

kubectl describe pod <pod-name>
...
...
Events:

  Type     Reason                  Age                   From     Message
  ----     ------                  ----                  ----     -------
  Warning  FailedCreatePodSandBox  85s (x585 over 128m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "71xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx9536": plugin type="antrea" failed (add): rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /var/run/antrea/cni.sock: connect: no such file or directory"

containerd logs on the worker node:

MM DD HH:MM:SS tkc-xxxxx-workers-xxxxx-xxxxx-xxxx5 containerd[621780]: time="YYYY-MM-DDTHH:MM:SS.xxxxxxxxxZ" level=error msg="Failed to destroy network for sandbox \"c1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx9b\"" error="plugin type=\"antrea\" failed (delete): rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /var/run/antrea/cni.sock: connect: no such file or directory\""
MM DD HH:MM:SS tkc-xxxxx-workers-xxxxx-xxxxx-xxxx5 containerd[621780]: time="YYYY-MM-DDTHH:MM:SS.xxxxxxxxxZ" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:argocd-application-controller-0,Uid:d5bxxxxx-xxxx-xxxx-xxxx-xxxxxxxx5cfc,Namespace:argocd,Attempt:0,} failed, error" error="rpc error: code = Unknown desc = failed to setup network for sandbox \"c1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx9b\": plugin type=\"antrea\" failed (add): rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /var/run/antrea/cni.sock: connect: no such file or directory\""
MM DD HH:MM:SS tkc-xxxxx-workers-xxxxx-xxxxx-xxxx5 containerd[621780]: time="YYYY-MM-DDTHH:MM:SS.xxxxxxxxxZ" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:cert-manager-846c766d98-dzb4v,Uid:acexxxxx-xxxx-xxxx-xxxx-xxxxxxxx53f4,Namespace:cert-manager,Attempt:0,}"
MM DD HH:MM:SS tkc-xxxxx-workers-xxxxx-xxxxx-xxxx5 containerd[621780]: time="YYYY-MM-DDTHH:MM:SS.xxxxxxxxxZ" level=error msg="Failed to destroy network for sandbox \"52xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx64\"" error="plugin type=\"antrea\" failed (delete): rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /var/run/antrea/cni.sock: connect: no such file or directory\""
MM DD HH:MM:SS tkc-xxxxx-workers-xxxxx-xxxxx-xxxx5 containerd[621780]: time="YYYY-MM-DDTHH:MM:SS.xxxxxxxxxZ" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:cert-manager-846xxxxxxx-xxxxx,Uid:acexxxxx-xxxx-xxxx-xxxx-xxxxxxxx53f4,Namespace:cert-manager,Attempt:0,} failed, error" error="rpc error: code = Unknown desc = failed to setup network for sandbox \"52xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx64\": plugin type=\"

Environment

vCenter 8.0

Cause

The CIDR (Classless Inter-Domain Routing) range defined for the guest cluster has no IP addresses left, so the upgrade cannot continue.
For example, a network with a /28 mask has 14 usable IP addresses. Keep in mind that, in addition to its nodes, the cluster needs further IP addresses:

  • The virtual IP (VIP) requires an IP address from the same network.
  • Infrastructure components such as the NSX Container Plugin (NCP) or internal load balancers may reserve IP addresses for management or gateway interfaces.
  • Rolling updates and maintenance require IP addresses from the same range. During an upgrade, new nodes are provisioned before old nodes are deleted.

For a 9-node cluster, even a single-node surge could temporarily require additional IPs that exceed the 14 available in a /28 network.
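The arithmetic behind this example can be sketched in a few lines of Python. This is only an illustration: the network address 192.168.10.0 and the figure of 4 addresses reserved by infrastructure components are assumptions chosen for the example, not values taken from a real environment.

```python
import ipaddress


def usable_hosts(cidr: str) -> int:
    """Usable host addresses in an IPv4 network (network and broadcast excluded)."""
    net = ipaddress.ip_network(cidr, strict=False)
    return max(net.num_addresses - 2, 0)


def ip_demand(nodes: int, surge: int = 1, vips: int = 1, infra_reserved: int = 0) -> int:
    """Peak number of addresses needed while an upgrade is in progress."""
    return nodes + surge + vips + infra_reserved


# Hypothetical inputs: 9 nodes, 1 surge node, 1 VIP, 4 addresses held by
# infrastructure components (assumed value for illustration).
demand = ip_demand(nodes=9, surge=1, vips=1, infra_reserved=4)

for cidr in ("192.168.10.0/28", "192.168.10.0/27"):
    available = usable_hosts(cidr)
    verdict = "fits" if demand <= available else "does NOT fit"
    print(f"{cidr}: {available} usable, {demand} needed -> {verdict}")
```

With these assumed inputs the peak demand is 15 addresses, which is one more than a /28 can provide and well within a /27.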
To check current usage, open the NSX-T Manager UI and look under Networking > IP Address Pools, or at the specific Segment, to see the actual number of allocated IP addresses.
For a 9-node production cluster, a /27 (30 usable IP addresses) is generally recommended to leave enough room for upgrades and infrastructure services.
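For clusters of a different size, the same reasoning can be turned around to find the smallest network that covers a given peak demand. The helper below is a minimal sketch; the function name and the sample demand values are hypothetical, and the real demand should come from the allocation figures shown in NSX.

```python
def smallest_prefix(required_ips: int) -> int:
    """Return the longest IPv4 prefix whose usable host count covers required_ips."""
    if required_ips < 1:
        raise ValueError("required_ips must be at least 1")
    # Walk from the smallest conventional subnet (/30) toward larger ones.
    for prefix in range(30, -1, -1):
        usable = 2 ** (32 - prefix) - 2  # exclude network and broadcast addresses
        if usable >= required_ips:
            return prefix
    raise ValueError("demand exceeds the IPv4 address space")


# Hypothetical peak demands (nodes + surge + VIP + reserved addresses).
for needed in (11, 15, 40):
    print(f"{needed} addresses needed -> /{smallest_prefix(needed)}")
```

A result that lands exactly on the limit of a network leaves no headroom, so it is usually safer to size for the peak demand plus a margin.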

Resolution

Redesign the cluster network to accommodate a larger pool of available IP addresses. This is achieved by modifying and expanding the Service and Pod CIDR ranges within the vSphere Kubernetes Guest Cluster.
Refer to: Changing Service and Pod CIDR Ranges in vSphere Kubernetes Guest Cluster