In guest clusters with Antrea hybrid mode enabled, network traffic between pods on nodes in different subnets may be disrupted. This condition is typically observed after specific node lifecycle operations, including:
Updates to the VM Class of the cluster nodes.
Node scale-down followed by a scale-up event.
Traffic between pods on the same subnet remains unaffected, while cross-subnet traffic is lost entirely or times out.
VMware Cloud Foundation (VCF) 9.1
vSphere Kubernetes Releases (VKR) 1.35.0
vSphere Kubernetes Service (VKS) 3.6.0
Antrea CNI in Hybrid Mode
The root cause has not been identified. The issue is associated with a failure to synchronize networking state during node reconfiguration events: the CNI does not correctly update routing or encapsulation rules for inter-subnet communication when the node object is modified or replaced.
This issue is resolved in vSphere Kubernetes Releases (VKR) 1.35.1. An upgrade to this version or a later release is recommended to permanently address the synchronization failure.
Workaround: If an upgrade cannot be performed immediately, connectivity can be restored by restarting the antrea-agent pods on the nodes experiencing the disruption.
Identify the affected nodes where inter-subnet traffic is failing.
Execute the following command to restart the agent on a specific node:
kubectl delete pod -n kube-system -l app=antrea,component=antrea-agent --field-selector spec.nodeName=<Node_Name>
Verify that the antrea-agent pod has returned to a Running state and test pod-to-pod connectivity across subnets.
Note: This workaround is temporary. The disruption may recur if further node lifecycle operations are performed before the fix version is applied.
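The workaround steps above can be sketched as a small set of shell helpers. This is a minimal illustration, not a supported tool: the function names, the retry count, and the polling delay are hypothetical, and it assumes the same namespace and labels as the workaround command, as well as a kubeconfig context that already points at the affected guest cluster.

```shell
#!/bin/sh
# Sketch only: helper names and polling defaults are assumptions for illustration.
# Namespace and labels are taken from the workaround command in this article.
ANTREA_NS="kube-system"
ANTREA_SELECTOR="app=antrea,component=antrea-agent"

# Delete the antrea-agent pod scheduled on the given node.
# The pod is expected to be recreated automatically by its controller.
restart_antrea_agent() {
  kubectl delete pod -n "$ANTREA_NS" -l "$ANTREA_SELECTOR" \
    --field-selector "spec.nodeName=$1"
}

# Print the phase of the antrea-agent pod on the given node (empty if none exists yet).
antrea_agent_phase() {
  kubectl get pod -n "$ANTREA_NS" -l "$ANTREA_SELECTOR" \
    --field-selector "spec.nodeName=$1" \
    -o jsonpath='{.items[*].status.phase}'
}

# Poll until the agent on the node reports Running.
# $1 = node name, $2 = max attempts (default 30), $3 = seconds between attempts (default 5).
# Returns 0 on success, 1 if the pod never reached Running.
wait_for_antrea_agent() {
  wait_node="$1"
  wait_attempts="${2:-30}"
  wait_delay="${3:-5}"
  wait_i=0
  while [ "$wait_i" -lt "$wait_attempts" ]; do
    if [ "$(antrea_agent_phase "$wait_node")" = "Running" ]; then
      return 0
    fi
    wait_i=$((wait_i + 1))
    sleep "$wait_delay"
  done
  return 1
}
```

After sourcing the file, a call such as `restart_antrea_agent <Node_Name> && wait_for_antrea_agent <Node_Name>` covers the restart and the Running-state check. Cross-subnet pod-to-pod connectivity still needs to be tested separately, because a Running phase alone does not confirm that traffic has recovered.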