Reproduction steps:
Deployed the PostgreSQL operator and created a cluster with one instance.
Then tested kubectl delete pod <PODNAME>
In another tab, watched for IP changes during the restart process: kubectl get po -owide -w
After recreation, the pod comes back with the old IP instead of receiving a new one.
Testing from the postgres pod to confirm whether it can connect to the Kubernetes service IP results in a "no route to host" error:
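For reference, the old and new pod IPs can also be captured directly before and after the delete (a sketch using the example pod name from the connectivity test below):
kubectl get pod cluster-example-1 -o jsonpath='{.status.podIP}{"\n"}'
kubectl delete pod cluster-example-1
kubectl get pod cluster-example-1 -o jsonpath='{.status.podIP}{"\n"}'   # once the pod is Running again, the same (old) IP is still reported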
kubectl exec -it cluster-example-1 -- bash -c 'timeout 3 bash -c "echo > /dev/tcp/10.xxx.xxx.1/443" && echo Open || echo Closed'
Defaulted container "postgres" out of: postgres, bootstrap-controller (init), plugin-barman-cloud (init)
Closed
kubectl exec -it cluster-example-1 -- bash -c 'timeout 3 bash -c "echo > /dev/tcp/10.xxx.xxx.1/443" && echo Open || echo Closed'
Defaulted container "postgres" out of: postgres, bootstrap-controller (init), plugin-barman-cloud (init)
bash: connect: No route to host
bash: line 1: /dev/tcp/10.xxx.xxx.1/443: No route to host
Closed
Environment: TKGi 1.2x
Issue: nsx-node-agent configures the pod network interface with a wrong (stale) IP address. In that case the pod will never be able to send or receive traffic.
This is not a common scenario; a similar situation can also happen if the NCP service is not running. Such a delay in receiving the new configuration is an indication of slowness either on the ESXi host or in the NSX services.
This can happen only for StatefulSet members and standalone pods, and only under the conditions described below.
An important validation point is to confirm that NCP was running during the pod deletion process. To confirm that there is a delay in the new IP configuration reaching nsx-node-agent, look for the following lines in the nsx-node-agent logs:
2025-10-01T10:47:15.592Z 01738526-xxxx-xxxx-xxxx-8d97ccce00f1 NSX 1019181 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.cni_watcher Mark cni_delete_timestamp 1759315635.5921113 for CIF ContainerNetworkInfo('11.xxx.xxx.4/24', '11.xxx.xxx.1', '04:50:56:xx:xx:18', 8, '8ab27b55-a8bc-48ea-aa50-1e432b3c7282')
2025-10-01T10:47:25.132Z 01738526-xxxx-xxxx-xxxx-8d97ccce00f1 NSX 1019181 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.cni_watcher Received CNI request message: {"version": "2.0.0", "config": {"netns_path": "/var/run/netns/cni-4dddb7ab-xxx-xxx-xxx-9da8837ed6aa", "container_id": "a2f095eb1253dfd42bc655d1c8f710dc486732ebfceafdb81a9bdee5b27e4a71", "dev": "eth0", "mtu": null, "container_key": "nsx.cnpg-system.cloudnative-pg-cluster-4-010-1", "dns": null, "runtime_config": {}}, "op": "ADD"}
2025-10-01T10:47:25.598Z 01738526-xxxx-xxxx-xxxx-8d97ccce00f1 NSX 1019181 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.cni_watcher Skip an exsiting CIF config for container nsx.example-system.cloudnative-pg-cluster-example-1 until backoff expires. Last used at 1759315635.5921113
2025-10-01T10:47:26.599Z 01738526-xxxx-xxxx-xxxx-8d97ccce00f1 NSX 1019181 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.cni_watcher Skip an exsiting CIF config for container nsx.example-system.cloudnative-pg-cluster-example-1 until backoff expires. Last used at 1759315635.5921113
2025-10-01T10:47:27.599Z 01738526-xxxx-xxxx-xxxx-8d97ccce00f1 NSX 1019181 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.cni_watcher Skip an exsiting CIF config for container nsx.example-system.cloudnative-pg-cluster-example-1 until backoff expires. Last used at 1759315635.5921113
2025-10-01T10:47:28.600Z 01738526-xxxx-xxxx-xxxx-8d97ccce00f1 NSX 1019181 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.cni_watcher Skip an exsiting CIF config for container nsx.example-system.cloudnative-pg-cluster-example-1 until backoff expires. Last used at 1759315635.5921113
2025-10-01T10:47:29.600Z 01738526-xxxx-xxxx-xxxx-8d97ccce00f1 NSX 1019181 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.cni_watcher Skip an exsiting CIF config for container nsx.example-system.cloudnative-pg-cluster-example-1 until backoff expires. Last used at 1759315635.5921113
2025-10-01T10:47:30.620Z 01738526-xxxx-xxxx-xxxx-8d97ccce00f1 NSX 1019181 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.cni_watcher_lin Adding container nsx.example-system.cloudnative-pg-cluster-example-1 in namespace /var/run/netns/cni-4dddb7ab-xxx-xxx-xxx-9da8837ed6aa (IP: 11.xxx.xxx.4/24, MAC: 04:50:56:xx:xx:18, gateway: 11.xxx.xxx.1, VLAN: 8, dev: eth0)
From the snippet above it is visible that the old IP was marked for deletion, but because the 15-second backoff expired without a new IP being received, nsx-node-agent reuses the old IP.
Alternatively, if the problem was observed earlier, another log sequence can be seen, indicating that the sequence above has already happened and the backoff timer expired before this point. The following messages indicate that there is already a mismatch between the hyperbus cache and the OVS port for the container:
2025-10-14T12:12:53.615Z 1d56895e-676d-43ed-a901-2f742e4f47b1 NSX 3111098 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.cni_watcher Checking if CNI port ContainerNetworkInfo('11.xx.xx.14/24', None, '04:50:xx:xx:xx:79', 8, '6c87f7f0-41fc-4304-a416-4d07d3ae3671') match cache port ContainerNetworkInfo('11.xx.xx.20/24', '11.xx.xx.1', '04:50:xx:xx:xx:87', 9, 'd35a55ff-7d7d-4b22-9a46-4f58834167e0')
2025-10-14T12:12:53.615Z 1d56895e-676d-43ed-a901-2f742e4f47b1 NSX 3111098 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.cache Value mismatch, key: attachment_id, v1 :6c87f7f0-41fc-4304-a416-4d07d3ae3671, v2: d35a55ff-7d7d-4b22-9a46-4f58834167e0
The message is not followed by "until backoff expires."; however, the DEL and ADD hyperbus messages trigger the pod network isolation by removing the old interface and adding the new one:
2025-10-14T12:13:18.947Z 1d56895e-676d-43ed-a901-2f742e4f47b1 NSX 3111098 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.hyperbus_service Put app_id nsx.xxxx.xxxx-xxxxx-example-0% with IP 11.xx.xx.20/24, MAC 04:50:xx:xx:xx:87, gateway 11.xx.xx.1/24, vlan x,CIF d35a55ff-7d7d-4b22-9a46-4f58834167e0, wait_for_sync False into queue for hyperbus DEL,current size: 1
2025-10-14T12:13:18.947Z 1d56895e-676d-43ed-a901-2f742e4f47b1 NSX 3111098 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.hyperbus_service Put app_id nsx.xxxx.xxxx-xxxx-example-0% with IP 11.xx.xx.7/24, MAC 04:50:xx:xx:xx:3e, gateway 11.xx.xx.1/24, vlan 3,CIF 28c95482-d8af-4775-9be4-36129a36bde2, wait_for_sync False into queue for hyperbus ADD,current size: 2
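To check whether a node has already hit either of these conditions, the nsx-node-agent logs on the affected worker can be searched for the above signatures, for example (a sketch; the exact log file name and path under /var/vcap/sys/log/ may differ between TKGi versions):
grep -E "until backoff expires|Value mismatch, key: attachment_id" /var/vcap/sys/log/nsx-node-agent/*.log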
This problem is expected to be addressed in NCP release 4.2.4 and in the corresponding TKGi patch release.
This behaviour can be mitigated by setting a higher threshold in the nsx-node-agent configuration.
This is controlled by the configuration parameter config_reuse_backoff_time in the [nsx_node_agent] section; the default value is 15 seconds (the parameter is not defined as a variable in the configuration template).
This parameter is not exposed in the NCP BOSH job. Therefore it needs to be configured directly on the worker nodes, and the setting will be overwritten by a TKGi cluster upgrade.
It is possible to apply this setting using a DaemonSet (see Approach #2 below).
The file below needs to have "config_reuse_backoff_time = 30" in place:
cat /var/vcap/jobs/nsx-node-agent/config/ncp.ini
[DEFAULT]
use_stderr = False
[coe]
connect_retry_timeout = 30
[nsx_node_agent]
config_reuse_backoff_time = 30
proc_mount_path_prefix = ''
This is followed by a monit reload and a monit restart nsx-node-agent for the change to take effect.
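For example, on the worker node as root (a sketch; on BOSH stemcells the monit binary is at /var/vcap/bosh/bin/monit if it is not already in the PATH, and the grep and summary lines are only there to verify the setting and the service state):
grep -A4 '^\[nsx_node_agent\]' /var/vcap/jobs/nsx-node-agent/config/ncp.ini
monit reload
monit restart nsx-node-agent
monit summary | grep nsx-node-agent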
Approach #1 - BOSH Runtime Config (os-conf)
Note: Runtime Configs are applied to all VMs managed by the BOSH Director. If you need to apply the change to only a subset of clusters and VMs/nodes, it is important to configure the Runtime Config appropriately using the corresponding include and exclude rules. An incorrect Runtime Config can result in undesirable updates to clusters and VMs/nodes.
Example of Runtime Config setup:
releases:
- name: "os-conf"
  version: "23.0.0"
addons:
- name: nsx-node-agent-update
  jobs:
  - name: pre-start-script
    release: os-conf
    properties:
      script: |-
        #!/bin/bash
        INI_FILE="/var/vcap/jobs/nsx-node-agent/config/ncp.ini"
        SEARCH_KEY="config_reuse_backoff_time"
        SECTION="[nsx_node_agent]"
        echo "Checking for $SEARCH_KEY in $INI_FILE"
        if grep -q "^${SEARCH_KEY}" "$INI_FILE"; then
          echo "No changes to apply: $SEARCH_KEY already present in $INI_FILE"
        else
          echo "Adding $SEARCH_KEY=30 under $SECTION in $INI_FILE"
          sed -i '/^\[nsx_node_agent\]/a config_reuse_backoff_time=30' "$INI_FILE"
        fi
  include:
    deployments: [<service-instance_XXXXXXXXXX>] # Optional: define which deployments (TKGi clusters) this runtime config will be applied to.
    instance_groups: [<master and/or worker, as defined in the deployment manifest>] # Optional: define which instance_groups (cluster nodes, i.e. masters/workers) this runtime config will be applied to.
  exclude:
    deployments: [<service-instance_XXXXXXXXXX>] # Optional: define which deployments (TKGi clusters) this runtime config will not be applied to.
    instance_groups: [<master and/or worker, as defined in the deployment manifest>] # Optional: define which instance_groups (cluster nodes, i.e. masters/workers) this runtime config will not be applied to.
Upload the runtime config:
bosh update-config --type=runtime --name nsx-node-agent os-conf.yaml
List the configs and get the ID of the config created:
bosh configs
Review the created config:
bosh config <ID>
Apply the change to the cluster nodes:
tkgi upgrade-cluster <NAME>
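After the upgrade completes, the change can be spot-checked on a worker node over bosh ssh, for example (a sketch; the deployment and instance names are placeholders that can be found with bosh deployments and bosh -d <deployment> instances):
bosh -d service-instance_XXXXXXXXXX ssh worker/0 -c "sudo grep config_reuse_backoff_time /var/vcap/jobs/nsx-node-agent/config/ncp.ini"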
Approach #2 - DaemonSet
Apply a DaemonSet that verifies whether the file /var/vcap/jobs/nsx-node-agent/config/ncp.ini on each worker contains the setting above and, if not, appends the line in the correct section.
The DaemonSet requires privileged mode in order to access the worker node file system.
Once the change is applied, a restart of the nsx-node-agent service is required for it to take effect.
This change will not be preserved during an upgrade or recreation of a worker, but the line will be re-added as long as the DaemonSet is running.
The pause image (the version might differ between TKGi versions) and the ubuntu image have to be pushed to a private registry if the cluster does not have internet access.
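A minimal sketch of mirroring the two images to a private registry (registry.example.com is a placeholder; the image references in the manifest below would then need to be updated accordingly):
docker pull projects.registry.vmware.com/tkg/pause:3.10
docker tag projects.registry.vmware.com/tkg/pause:3.10 registry.example.com/tkg/pause:3.10
docker push registry.example.com/tkg/pause:3.10
docker pull ubuntu:23.04
docker tag ubuntu:23.04 registry.example.com/library/ubuntu:23.04
docker push registry.example.com/library/ubuntu:23.04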
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: update-node-agent-admin
  namespace: pks-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      tkg: update-node-agent-admin
  template:
    metadata:
      creationTimestamp: null
      labels:
        tkg: update-node-agent-admin
    spec:
      containers:
      # The pause container only keeps the DaemonSet pod running after the init container has completed.
      - image: projects.registry.vmware.com/tkg/pause:3.10
        imagePullPolicy: IfNotPresent
        name: sleep
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      hostPID: true
      initContainers:
      # The init container edits ncp.ini on the worker through the /var/vcap hostPath mount.
      - command:
        - /bin/sh
        - -xc
        - |
          set -e
          INI_FILE="/var/vcap/jobs/nsx-node-agent/config/ncp.ini"
          SEARCH_KEY="config_reuse_backoff_time"
          SECTION="[nsx_node_agent]"
          if grep -q "^${SEARCH_KEY}" "$INI_FILE"; then
            echo "No changes to apply: $SEARCH_KEY already present in $INI_FILE"
          else
            echo "Adding $SEARCH_KEY under $SECTION in $INI_FILE"
            sed -i '/^\[nsx_node_agent\]/a config_reuse_backoff_time=30' "$INI_FILE"
          fi
        image: ubuntu:23.04
        imagePullPolicy: IfNotPresent
        name: init
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/vcap
          name: hostfs
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /var/vcap
          type: ""
        name: hostfs
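Once the manifest above is applied and the init container has run on each worker, the nsx-node-agent service still has to be restarted for the new value to take effect, for example (a sketch; the deployment name is a placeholder, and bosh ssh with -c runs the command on every instance of the worker group):
kubectl apply -f update-node-agent-admin.yaml
kubectl -n pks-system get ds update-node-agent-admin
bosh -d service-instance_XXXXXXXXXX ssh worker -c "sudo /var/vcap/bosh/bin/monit restart nsx-node-agent"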
KB with additional information on the os-conf Runtime Config approach: https://knowledge.broadcom.com/external/article/394333/install-custom-os-packages-in-tkgi-nodes.html