Inter-Node Pod Communication Disruption in Antrea Hybrid Mode Following Node Lifecycle Operations in VCF 9.x



Article ID: 429416


Updated On:

Products

VMware vSphere Kubernetes Service

Tanzu Kubernetes Runtime

Issue/Introduction

In environments utilizing guest clusters with Antrea hybrid mode enabled, network traffic between pods located on nodes in different subnets may be disrupted. This condition is typically observed following specific lifecycle operations, including:

  • Updates to the VM Class of the cluster nodes.

  • Node scale-down followed by a scale-up event.

Traffic between pods on nodes in the same subnet remains unaffected, while cross-subnet traffic is lost entirely or times out.
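The symptom can be confirmed by testing connectivity between two pods scheduled on nodes in different subnets. The following is a minimal sketch, not an official diagnostic procedure. The function names `list_pod_placement` and `check_pod_to_pod` are hypothetical, and the sketch assumes the source pod's container image provides a `ping` binary that accepts `-c` and `-W`.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative and not part of any VMware tooling.

# List every pod with its node and pod IP, so that two pods on nodes in
# different subnets can be chosen for the test.
list_pod_placement() {
  kubectl get pods --all-namespaces \
    -o custom-columns=NAMESPACE:.metadata.namespace,POD:.metadata.name,NODE:.spec.nodeName,IP:.status.podIP
}

# check_pod_to_pod <src-namespace> <src-pod> <dst-namespace> <dst-pod>
# Pings the destination pod IP from inside the source pod.
# Assumption: the source container image includes `ping`.
check_pod_to_pod() {
  local src_ns="$1" src_pod="$2" dst_ns="$3" dst_pod="$4"
  local dst_ip
  dst_ip="$(kubectl get pod "$dst_pod" -n "$dst_ns" -o jsonpath='{.status.podIP}')" || return 1
  if [ -z "$dst_ip" ]; then
    echo "no pod IP found for $dst_ns/$dst_pod" >&2
    return 1
  fi
  if kubectl exec -n "$src_ns" "$src_pod" -- ping -c 3 -W 2 "$dst_ip" >/dev/null 2>&1; then
    echo "OK: $src_ns/$src_pod -> $dst_ns/$dst_pod ($dst_ip)"
  else
    echo "FAIL: $src_ns/$src_pod -> $dst_ns/$dst_pod ($dst_ip)"
    return 1
  fi
}
```

Running `check_pod_to_pod` once for a pod pair in the same subnet and once for a pair in different subnets distinguishes this issue from a general connectivity failure: only the cross-subnet pair is expected to fail.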

Environment

  • VMware Cloud Foundation (VCF) 9.1

  • vSphere Kubernetes Releases (VKR) 1.35.0

  • vSphere Kubernetes Service (VKS) 3.6.0

  • Antrea CNI in Hybrid Mode
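Whether a guest cluster actually runs Antrea in hybrid mode can be checked from the agent configuration. The sketch below assumes the upstream Antrea layout: a ConfigMap named `antrea-config` in the `kube-system` namespace, containing an `antrea-agent.conf` key with a `trafficEncapMode` setting. The ConfigMap name may differ in a given release, and the helper name `get_encap_mode` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only. Assumes the upstream Antrea ConfigMap name and key;
# adjust the ConfigMap name if the cluster uses a different one.

# Print the configured trafficEncapMode value (for example "hybrid").
# Prints nothing if the setting is commented out or absent.
get_encap_mode() {
  kubectl get configmap antrea-config -n kube-system \
    -o jsonpath='{.data.antrea-agent\.conf}' \
    | grep -E '^[[:space:]]*trafficEncapMode:' \
    | awk '{print $2}' \
    | tr -d '"'
}
```

An empty result means the setting is not explicitly configured, in which case the agent uses its default mode rather than hybrid.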

Cause

The root cause has not been identified. The issue is associated with a failure to synchronize networking state during node reconfiguration events: when a node object is modified or replaced, the CNI does not correctly update the routing or encapsulation rules used for inter-subnet communication.

Resolution

This issue is resolved in vSphere Kubernetes Releases (VKR) 1.35.1. An upgrade to this version or a later release is recommended to permanently address the synchronization failure.

Workaround: If an upgrade cannot be performed immediately, connectivity can be restored by restarting the antrea-agent pods on the nodes experiencing the disruption.

  1. Identify the affected nodes where inter-subnet traffic is failing.

  2. Execute the following command to restart the agent on a specific node, replacing <Node_Name> with the name of the affected node:

    kubectl delete pod -n kube-system -l app=antrea,component=antrea-agent --field-selector spec.nodeName=<Node_Name>

  3. Verify that the antrea-agent pod has returned to a Running state and test pod-to-pod connectivity across subnets.
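The workaround steps above can be combined into a small helper for cases where several nodes are affected. This is a sketch under stated assumptions: the function name `restart_antrea_agent` is hypothetical, the DaemonSet is assumed to be named `antrea-agent` in `kube-system` (the upstream Antrea default), and the 180-second timeout is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Sketch only: wraps the documented delete command with a readiness wait.

# restart_antrea_agent <node-name>
restart_antrea_agent() {
  local node="$1"
  if [ -z "$node" ]; then
    echo "usage: restart_antrea_agent <node-name>" >&2
    return 2
  fi

  # Delete the agent pod on the given node; the DaemonSet controller
  # recreates it automatically.
  kubectl delete pod -n kube-system -l app=antrea,component=antrea-agent \
    --field-selector "spec.nodeName=${node}" || return 1

  # Coarse readiness check: wait until the DaemonSet reports all pods
  # available. Assumption: the DaemonSet is named antrea-agent.
  kubectl rollout status daemonset/antrea-agent -n kube-system --timeout=180s || return 1

  # Show the replacement pod so its Running state can be confirmed by eye.
  kubectl get pod -n kube-system -l app=antrea,component=antrea-agent \
    --field-selector "spec.nodeName=${node}" -o wide
}
```

For example, `for n in node-a node-b; do restart_antrea_agent "$n"; done` restarts the agents one node at a time (the node names here are placeholders). Cross-subnet pod-to-pod connectivity should still be tested afterwards, since a Running agent pod alone does not prove that traffic has recovered.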

Note: This workaround is temporary. The disruption may recur if further node lifecycle operations are performed before the fix version is applied.