During normal operations it was noticed that one of the pods of a StatefulSet could not resolve a cluster-internal service (kafka-test.kafka-workload.svc).
The problem was fixed by restarting the failed pod on another worker node, which finally allowed the pod to communicate again and connect to the service above.
This issue happens occasionally, mainly during a restart (or rollout restart) of StatefulSet pods.
TKGi 1.18
Affected pod: backup-solution-2 in namespace kafka-workload (NSX container key nsx.kafka-workload.backup-solution-2)
The new pod was started at the time below:
2024-09-23T20:00:36.085Z ffe9a282-942a-4a86-806e-0614bda84d67 NSX 11351 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.cni_watcher Received CNI request message: {"version": "2.0.0", "config": {"netns_path": "/var/run/netns/cni-34378352-39f9-9547-ba15-01b170e0e9c5", "container_id": "5b4beae8007373614649a7e72d40db7a06c71c9192efce21c85fe3db45aa63b6", "dev": "eth0", "mtu": null, "container_key": "nsx.kafka-workload.backup-solution-2", "dns": null, "runtime_config": {}}, "op": "ADD"}
2024-09-23T20:00:49.848Z ffe9a282-942a-4a86-806e-0614bda84d67 NSX 11351 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.cni_watcher_lin Created OVS port backup-solution-2_5b4beae80073736(5b4beae80073736) for container nsx.kafka-workload.backup-solution-2(5b4beae8007373614649a7e72d40db7a06c71c9192efce21c85fe3db45aa63b6)
2024-09-23T20:00:49.848Z ffe9a282-942a-4a86-806e-0614bda84d67 NSX 11351 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.cni_watcher Sent network configuration back to CNI for container nsx.kafka-workload.backup-solution-2: {'return_code': '200', 'return_status': 'OK', 'ip_address': '10.10.5.20/24', 'gateway_ip': '10.10.5.1', 'mac_address': '04:50:56:00:4d:2f', 'vlan_id': 77}
Based on the CNI response, this pod was supposed to have the IP 10.10.5.20.
However, the nsx-node-agent logs show that something else happened.
The CNI response with IP 10.10.5.20 was sent back:
2024-09-23T20:00:49.848Z ffe9a282-942a-4a86-806e-0614bda84d67 NSX 11351 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.cni_watcher Sent network configuration back to CNI for container nsx.kafka-workload.backup-solution-2: {'return_code': '200', 'return_status': 'OK', 'ip_address': '10.10.5.20/24', 'gateway_ip': '10.10.5.1', 'mac_address': '04:50:56:00:4d:2f', 'vlan_id': 77}
But the IP was then changed to 10.10.5.34:
2024-09-23T20:01:03.246Z ffe9a282-942a-4a86-806e-0614bda84d67 NSX 11351 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.hyperbus_service Deleted app nsx.kafka-workload.backup-solution-2% from cache
2024-09-23T20:01:03.247Z ffe9a282-942a-4a86-806e-0614bda84d67 NSX 11351 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.hyperbus_service Put app_id nsx.kafka-workload.backup-solution-2% with IP 10.10.5.34/24, MAC 04:50:56:00:80:a5, gateway 10.10.5.1/24, vlan 2222,CIF 202c5bb0-b8a4-493c-b126-caa1b1db2a26, wait_for_sync False into queue, current size: 1
2024-09-23T20:01:04.025Z ffe9a282-942a-4a86-806e-0614bda84d67 NSX 11351 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.hyperbus_service Updated app nsx.kafka-workload.backup-solution-2% with IP 10.10.5.34/24, MAC 04:50:56:00:80:a5, vlan 2222,gateway 10.10.5.1/24, CIF 202c5bb0-b8a4-493c-b126-caa1b1db2a26, wait_for_sync False
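Both IPs can be extracted from the nsx-node-agent log for this container key to confirm the divergence. A minimal sketch, assuming the log format shown above; the log path is a placeholder:

```python
# Minimal sketch: list the IPs the nsx-node-agent reported for one container key,
# both in the CNI response and in the hyperbus updates, to spot a divergence.
# The log path is a placeholder; the regexes follow the log lines quoted above.
import re

LOG = "nsx-node-agent.stdout.log"                       # placeholder path
KEY = "nsx.kafka-workload.backup-solution-2"

cni_re = re.compile(r"Sent network configuration back to CNI for container "
                    + re.escape(KEY) + r".*?'ip_address': '([\d./]+)'")
hyperbus_re = re.compile(r"(?:Put app_id|Updated app) " + re.escape(KEY)
                         + r".*?with IP ([\d./]+)")

with open(LOG) as f:
    for line in f:
        m = cni_re.search(line)
        if m:
            print(f"{line.split()[0]}  CNI response IP:     {m.group(1)}")
        m = hyperbus_re.search(line)
        if m:
            print(f"{line.split()[0]}  hyperbus update IP:  {m.group(1)}")
```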
It looks like this change was never reflected in the pod itself, and the pod became isolated: the new IP 10.10.5.34 was added to several IPSets, while the IP the pod actually received (10.10.5.20) was not.
2024-09-23T20:00:36.316Z 7cc0a95a-aa4f-4e4b-83e2-bf8ae36a0138 NSX 75997 - [nsx@6876 comp="nsx-container-ncp" subcomp="ncp" level="INFO"] nsx_ujo.ncp.nsx.manager.base_k8s_nsxapi Updated IPSet 00b452f5-335a-4f48-9952-51e89222288f9 with IPs ['10.10.5.4', '10.10.5.30', '10.10.5.32', '10.10.5.34', '10.10.5.19', '10.10.5.31', '10.10.5.33']
For example, the pod with 10.10.5.30 is another instance of the same StatefulSet, backup-solution-5:
time="2024-09-23T19:56:50.560870782Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:backup-solution-5,Uid:a11f1578-4c65-4790-9f62-e9e3bf4ab4e2,Namespace:kafka-workload,Attempt:0,}"
This could be a race condition in which a stale IP is returned during CNI port creation, while NCP has already allocated a different IP for the pod.
As visible below, the same pod was originally created with IP 10.10.5.20 and received the same IP at the time of the event; however, the next restart on the same worker node shows a different IP, 10.10.5.34. This is a discrepancy between the NSX and Kubernetes states (a quick way to pull the current Kubernetes view for comparison is sketched after these log excerpts).
worker-16c64r.9a65171e-6ff1-40e3-b13a-1284c0f87bd1.2024-09-24-14-27-01/nsx-node-agent/nsx-node-agent.stdout.log:
188: 2024-08-22T19:11:58.273Z ffe9a282-942a-4a86-806e-0614bda84d67 NSX 11351 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.cni_watcher Sent network configuration back to CNI for container nsx.kafka-workload.backup-solution-2: {'return_code': '200', 'return_status': 'OK', 'ip_address': '10.10.5.20/24', 'gateway_ip': '10.10.5.1', 'mac_address': '04:50:56:00:4d:2f', 'vlan_id': 77}
61557: 2024-09-23T20:00:49.848Z ffe9a282-942a-4a86-806e-0614bda84d67 NSX 11351 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.cni_watcher Sent network configuration back to CNI for container nsx.kafka-workload.backup-solution-2: {'return_code': '200', 'return_status': 'OK', 'ip_address': '10.10.5.20/24', 'gateway_ip': '10.10.5.1', 'mac_address': '04:50:56:00:4d:2f', 'vlan_id': 77}
61777: 2024-09-24T09:32:22.179Z ffe9a282-942a-4a86-806e-0614bda84d67 NSX 11351 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.cni_watcher Sent network configuration back to CNI for container nsx.kafka-workload.backup-solution-2: {'return_code': '200', 'return_status': 'OK', 'ip_address': '10.10.5.34/24', 'gateway_ip': '10.10.5.1', 'mac_address': '04:50:56:00:80:a5', 'vlan_id': 2222}
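For comparison with this timeline, the current Kubernetes view (pod name, node, IP and start time) can be dumped with the Python client; the label selector below is an assumption about how the StatefulSet labels its pods:

```python
# List the StatefulSet pods with the node and IP Kubernetes currently reports,
# to compare against the IPs in the nsx-node-agent log timeline above.
# The label selector is an assumption; adjust it to the real StatefulSet labels.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
pods = v1.list_namespaced_pod("kafka-workload",
                              label_selector="app=backup-solution")  # assumed label
for p in pods.items:
    print(f"{p.metadata.name:<22} node={p.spec.node_name:<35} "
          f"ip={p.status.pod_ip:<15} started={p.status.start_time}")
```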
Normally, if there is no network policy (which creates a default deny rule for anything not matching the required criteria) and no default deny rule configured on the NSX manager, a pod whose IP differs from what was pushed to NSX will still be allowed to connect to services. However, if there is a network policy with more restrictive access, such a pod will be completely isolated.
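A quick way to confirm that such a restrictive policy applies is to list the NetworkPolicies of the namespace, since NCP enforces them through NSX firewall rules that reference IPSets/groups like the one shown earlier. A hedged sketch with the Python client:

```python
# Sketch: list NetworkPolicies in the namespace to see whether the pod falls
# under a restrictive (default deny style) policy that would isolate it when
# its real IP is missing from the corresponding NSX IPSet.
from kubernetes import client, config

config.load_kube_config()
net = client.NetworkingV1Api()
for np in net.list_namespaced_network_policy("kafka-workload").items:
    print(f"policy={np.metadata.name} "
          f"podSelector={np.spec.pod_selector.match_labels} "
          f"types={np.spec.policy_types}")
```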
At this stage, restarting the pod on another worker node allows it to become operational again.
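The same workaround can be scripted: cordon the node the pod currently runs on and delete the pod so the StatefulSet controller recreates it on a different worker. A sketch with the Python client (the plain kubectl cordon/delete equivalent works just as well):

```python
# Workaround sketch: force the pod onto a different worker by cordoning its
# current node and deleting it; the StatefulSet controller recreates it elsewhere.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod("backup-solution-2", "kafka-workload")
node = pod.spec.node_name

# kubectl cordon <node>
v1.patch_node(node, {"spec": {"unschedulable": True}})

# kubectl delete pod backup-solution-2 -n kafka-workload
v1.delete_namespaced_pod("backup-solution-2", "kafka-workload")

# Afterwards, uncordon the node:
# v1.patch_node(node, {"spec": {"unschedulable": False}})
```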