- Intermittent HTTP 401 errors when hitting the endpoints of a service backed by an NSX-T load balancer
- POD1 and POD2 are on the same worker node
- POD1 traffic towards the NSX load balancer works fine; all requests complete successfully with HTTP 200 OK responses
- POD2 traffic towards the NSX load balancer fails with HTTP 401 errors
- Traffic flow: POD <---> Open vSwitch <---> Worker node's vNIC on ESXi host <---> Host uplinks <---> NSX LB
VMware NSX
VMware vSphere Kubernetes Service (vSphere with Tanzu)
- Reviewing the logs on the NSX load balancer after enabling the Debug log (on the load balancer) and the Access log (on the virtual server) shows that POD1 requests arrive, are forwarded to the pool members, and complete successfully, but no POD2 requests ever reach the NSX load balancer (see the API sketch after this list)
- Because both pods are on the same worker node, packet captures taken at the node's switchport on the ESXi host also show only POD1 traffic and no POD2 traffic, indicating that the issue lies within POD2 itself (see the capture example after this list)
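For reference, a minimal sketch of enabling these logs through the NSX-T Policy API (the same settings can be changed in the NSX UI). The manager address, credentials, and object IDs below are placeholders; the error_log_level and access_log_enabled fields are assumed from the Policy API LBService and LBVirtualServer schemas and may vary by NSX version, and manual changes may be reverted if these objects are managed by the Supervisor (NCP):

    # Set the load balancer service error log level to DEBUG (placeholder IDs/credentials)
    curl -k -u 'admin:<password>' -X PATCH \
      'https://<nsx-manager>/policy/api/v1/infra/lb-services/<lb-service-id>' \
      -H 'Content-Type: application/json' \
      -d '{"error_log_level": "DEBUG"}'

    # Enable the access log on the virtual server
    curl -k -u 'admin:<password>' -X PATCH \
      'https://<nsx-manager>/policy/api/v1/infra/lb-virtual-servers/<virtual-server-id>' \
      -H 'Content-Type: application/json' \
      -d '{"access_log_enabled": true}'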
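To reproduce the switchport capture on the ESXi host that runs the worker node, the standard pktcap-uw utility can be used. This is a sketch; the port ID is a placeholder taken from the net-stats output for the worker node VM's vNIC:

    # On the ESXi host, list switchports to find the worker node VM's vNIC port ID
    net-stats -l

    # Capture traffic on that switchport (placeholder port ID) and write it to a file
    pktcap-uw --switchport <worker-node-port-id> -o /tmp/worker-node-vnic.pcap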
Steps to capture packets at the worker node level:
- Traffic flow: POD2 ---> Open vSwitch ---> Open vSwitch uplink on the worker node ---> Node's vNIC on the ESXi host ---> ESXi host uplinks (vmnics) ---> NSX LB (#.#.#.#) (Active Edge) ---> traffic is routed towards the pool members
- Since no POD2 traffic is seen at the node's vNIC on the ESXi host, the next step is to capture inside the worker node, right before the traffic reaches the node's vNIC:
1. On the worker node of the guest cluster, identify the interface of the pod you want to capture from, using the command: ip link | grep <pod name> (a combined example is shown after these steps)
2. Once you have that interface, capture on the worker node using the command: tcpdump -i <pod interface> -w file_name.pcap
3. From the packet capture of the pod's interface, you can see where the traffic is being routed and why it is not reaching the node's vNIC
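Putting steps 1-3 together, a minimal sketch of the capture on the worker node; the pod name, interface name, capture file name, and NSX LB VIP below are placeholders:

    # 1. On the worker node, find the pod's interface (placeholder pod name)
    ip link | grep <pod name>

    # 2. Capture traffic on that interface, optionally filtered to the NSX LB VIP,
    #    and write it to a file
    tcpdump -i <pod interface> -w file_name.pcap host <nsx-lb-vip>

    # 3. Read the capture back to see where POD2's traffic is being routed and why
    #    it never reaches the node's vNIC (e.g. failed ARP, unexpected next hop)
    tcpdump -nn -r file_name.pcap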