Pods lose network connectivity after Tanzu Kubernetes Grid (TKG) Upgrade

Article ID: 438912

Updated On:

Products

VMware Tanzu Kubernetes Grid Management

Issue/Introduction

After a TKG upgrade, pods on different worker nodes can no longer communicate with each other (e.g., connections time out or are refused).

  • "kube-proxy" pods on the affected nodes log the following errors:     

    "Failed to retrieve node IPs err=host IP unknown; known addresses" 

    "Can't determine this node's IP, assuming loopback; if this is incorrect, please set the --bind-address flag"

  • "antrea-agent" pods continuously log failures to bind IP addresses to local interfaces or to resolve internal services:

    "IP address not set for Pod: <namespace>/<pod-name>"

    "Error when initializing flow exporter" err="failed to resolve FlowAggregator Service
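
To check a node's logs for these symptoms, a small filter along the following lines may help. This is a minimal sketch: the function name "count_symptoms" is hypothetical, and the match strings are copied from the log excerpts above, so they may need adjusting if the wording differs in your version.

```shell
#!/bin/sh
# Sketch: count log lines on stdin that contain any of the symptom
# messages quoted in this article. "count_symptoms" is a hypothetical
# helper name, not part of kubectl, kube-proxy, or Antrea.
count_symptoms() {
  # -F treats each pattern as a fixed string; -c prints the number of matching lines.
  grep -c -F \
    -e 'Failed to retrieve node IPs' \
    -e "Can't determine this node's IP, assuming loopback" \
    -e 'IP address not set for Pod' \
    -e 'failed to resolve FlowAggregator Service'
}
```

Example usage, piping the logs of one pod into the filter: kubectl logs -n kube-system <pod-name> | count_symptoms. A non-zero count suggests the pod started before the node address was available.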

Environment

TKG 2.5

Cause

  • A transient initialization race condition occurs during the node upgrade and provisioning sequence.
  • The Kubernetes scheduler places the "antrea-agent" and "kube-proxy" DaemonSet pods on the newly provisioned node before the vSphere Cloud Provider Interface (CPI) has populated the "InternalIP" field on that node's Kubernetes Node object.
  • Consequently, the network daemons read an empty address array and cache an incomplete node identity, defaulting to the loopback address.
  • This prevents the Antrea agent from programming the required L2Forwarding and SpoofGuard Open vSwitch (OVS) flow rules, resulting in localized packet drops at the node boundary.

Resolution

Force a state synchronization by restarting the affected network daemons.

Once the CPI has populated the node's "InternalIP" (verify with "kubectl get nodes -o wide"), restarting the pods lets them fetch the correct node identity from the Downward API and rebuild the local OVS flow tables.
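
Before restarting anything, it can be useful to confirm that no node is still missing its address. The following is a sketch under stated assumptions: the function name "nodes_missing_internal_ip" is hypothetical, and it relies only on the standard "kubectl get nodes -o jsonpath" output and awk.

```shell
#!/bin/sh
# Sketch: print the names of nodes that have no InternalIP entry in
# .status.addresses. "nodes_missing_internal_ip" is a hypothetical helper.
nodes_missing_internal_ip() {
  # Emit one "<name><TAB><InternalIP>" line per node; the IP column is
  # empty when the address has not been populated yet.
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}' |
    awk -F '\t' '$1 != "" && $2 == "" { print $1 }'
}
```

If the function prints nothing, every node reports an InternalIP and the restart steps below can proceed.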
      
  1. Identify the affected worker node where pod traffic is being dropped.

  2. Restart the "antrea-agent" pod on the affected node:

     kubectl delete pod -n kube-system -l app=antrea,component=antrea-agent --field-selector spec.nodeName=<NODE_NAME>

  3. Restart the "kube-proxy" pod on the affected node to ensure iptables NAT rules are accurately mapped:

     kubectl delete pod -n kube-system -l k8s-app=kube-proxy --field-selector spec.nodeName=<NODE_NAME>
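
Steps 2 and 3 can be wrapped in a single helper when several nodes are affected. This is only a sketch: the function name "restart_node_network_pods" is hypothetical, the label selectors are the ones used in the steps above, and the final listing is an added convenience for confirming that replacement pods were created.

```shell
#!/bin/sh
# Sketch: restart antrea-agent and kube-proxy on one node.
# "restart_node_network_pods" is a hypothetical helper name.
restart_node_network_pods() {
  node="$1"
  if [ -z "$node" ]; then
    echo "usage: restart_node_network_pods <NODE_NAME>" >&2
    return 1
  fi

  # Step 2: delete the antrea-agent pod; its DaemonSet recreates it.
  kubectl delete pod -n kube-system -l app=antrea,component=antrea-agent \
    --field-selector "spec.nodeName=${node}" || return 1

  # Step 3: delete the kube-proxy pod; its DaemonSet recreates it.
  kubectl delete pod -n kube-system -l k8s-app=kube-proxy \
    --field-selector "spec.nodeName=${node}" || return 1

  # List kube-system pods on the node so the replacements can be checked.
  kubectl get pods -n kube-system -o wide --field-selector "spec.nodeName=${node}"
}
```

Example usage: restart_node_network_pods <NODE_NAME>. Afterwards, the new pods should report Running and the log errors described above should no longer appear.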