Rapid pod recreation causes NSX slowness

Article ID: 403619


Updated On:

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

During the upgrade process from TKGi 1.19 to TKGi 1.20 (Kubernetes 1.28 to 1.29 respectively), unexpected recreation of pods can occur when a 3rd party custom resource definition (CRD) and its Operator take control of the pods:

Reproduction steps:

1. Create 2 clusters on 1.19 (K8s 1.28): a control cluster (kept on 1.19) and a test cluster (partially upgraded to 1.20)

2. Deploy the Postgres custom resource definitions and the barman-cloud sidecar controllers following:

https://cloudnative-pg.io/documentation/1.26/installation_upgrade/

3. Deploy a Postgres cluster with a sidecar using the plugin barman-cloud.cloudnative-pg.io:

cat cluster.yaml

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cluster-example
spec:
  instances: 3
  imagePullPolicy: Always
  plugins:
  - name: barman-cloud.cloudnative-pg.io
    isWALArchiver: true
    parameters:
      barmanObjectName: minio-store
  storage:
    size: 1Gi
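
Assuming the CRDs and operator from step 2 are installed, the cluster can be created with the standard kubectl workflow (a minimal sketch):

# Create the Postgres cluster from the manifest above, then check what the operator provisions
kubectl apply -f cluster.yaml
kubectl get cluster,pods -owide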

At this stage, before the upgrade to 1.20, there is no pod/cluster-example-1 container in an init state; the request to create it is actually denied by kube-api 1.28.

4. During the upgrade, after the master is fully upgraded, the operator can successfully provision pods with sidecar containers.

5. After the start of the upgrade, once the master node was upgraded to 1.29, pod/cluster-example-1 appeared and was scheduled on a worker still running 1.28.

The example below shows the pod with an error status:

kubectl get all,cluster -owide
NAME                                 READY   STATUS                                RESTARTS   AGE   IP           NODE                                   NOMINATED NODE   READINESS GATES
pod/cluster-example-1                0/2     InitContainerRestartPolicyForbidden   0          0s    <none>       bb55a3ba-xxxx-4880-xxxx-371acfc5e390   <none>           <none>
pod/cluster-example-1-initdb-mfpsh   0/1     Completed                             0          62m   172.xx.xx.2   bb55a3ba-xxxx-4880-xxxx-371acfc5e390   <none>           <none>

 

NAME                                 COMPLETIONS   DURATION   AGE   CONTAINERS   IMAGES                                   SELECTOR
job.batch/cluster-example-1-initdb   1/1           18s        62m   initdb       ghcr.io/cloudnative-pg/postgresql:17.5   batch.kubernetes.io/controller-uid=dd08ee4d-xxxx-xxxx-xxxx-89e9cbc3e7c2

NAME                                         AGE   INSTANCES   READY   STATUS                                       PRIMARY
cluster.postgresql.cnpg.io/cluster-example   62m   1                   Waiting for the instances to become active


Within about 20 minutes, NSX shows 1991 logical port creations for the same pod:

grep -E 'Operation="CreateLogicalPort".*Operation status="success"' /var/log/nsx-audit-write.* | grep cluster-example-1 | wc -l
1991

 

There is no indication that something is wrong prior to the upgrade, as there are no init pods in a failing state. Unfortunately, Kubernetes offers no direct way to throttle the pod recreation, as it is handled by the 3rd party Postgres Operator, which recreates the pod every second.

Environment

TKGi 1.19 upgrade to TKGi 1.20

Cause

Kubernetes 1.29 (TKGi 1.20) introduced a new feature, Sidecar Containers (enabled by default), allowing initContainers to use a restartPolicy:
https://v1-29.docs.kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/

As noted in the discussion below, Kubernetes 1.28 does not fully support the above feature and the kubelet also needs the feature enabled to accept such containers; this feature is not available in TKGi 1.19:
https://stackoverflow.com/questions/77894124/kubernetes-sidecar-not-workinginitcontainerrestartpolicyforbidden


Enabling sidecar containers
Starting with Kubernetes 1.29, the SidecarContainers feature is enabled by default. This allows setting a restartPolicy for containers defined within a pod's initContainers section. These restartable sidecar containers operate independently from the other init containers and the main application container in the pod. They can be started, stopped, or restarted without affecting the rest of the containers.


In this particular incident, there were 3rd party components involved:

  • A Postgres operator deployed on the cluster, which manages the lifecycle of Postgres clusters (functionally similar to a StatefulSet, but managed by the operator's controller)
  • Alongside the operator, a backup solution was used: plugin-barman-cloud

After a Postgres cluster is created, each pod within that cluster includes an init container named plugin-barman-cloud that acts as a sidecar:

initContainers:
- image: ghcr.io/cloudnative-pg/plugin-barman-cloud-sidecar:v0.5.0
  imagePullPolicy: IfNotPresent
  name: plugin-barman-cloud
  resources: {}
  restartPolicy: Always



During the upgrade, the kube-apiserver was updated to 1.29 first (which accepts init containers with a restart policy). However, the pod cloudnative-pg-cluster-1-1 got scheduled onto a worker node still running kubelet 1.28, which does not support this feature, causing the init container to be rejected:

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "RequestResponse",
  "auditID": "78d31db2-xxxx-42c8-xxxx-77d920a72fa0",
  "stage": "ResponseComplete",
  "requestURI": "/api/v1/namespaces/pg-native-cluster-1/pods/cloudnative-pg-cluster-1-1/status",
  "verb": "patch",
  "user": {
    "username": "kubelet",
    "uid": "kubelet",
    "groups": [
      "system:authenticated"
    ]
  },
  "sourceIPs": [
    "11.xx.xx.6"
  ],
  "userAgent": "kubelet/v1.28.11+vmware.2 (linux/amd64) kubernetes/xxxxxx",
  "objectRef": {
    "resource": "pods",
    "namespace": "pg-native-cluster-1",
    "name": "cloudnative-pg-cluster-1-1",
    "apiVersion": "v1",
    "subresource": "status"
  },
  "responseStatus": {
    "metadata": {},
    "code": 200
  },
  "requestObject": {
    "metadata": {
      "uid": "309b94f2-xxxx-40e0-xxxx-756d96363b08"
    },
    "status": {
      "conditions": null,
      "message": "Pod was rejected: Init container \"plugin-barman-cloud\" may not have a non-default restartPolicy",
      "phase": "Failed",
      "qosClass": null,
      "reason": "InitContainerRestartPolicyForbidden",
      "startTime": "2025-xx-27T10:xx:49Z"
    }
  }
}
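
During a rolling upgrade, this version skew can be spotted by listing the kubelet version reported by each node; a minimal sketch using standard kubectl output:

# Workers still reporting v1.28.x reject init containers with a restartPolicy
kubectl get nodes -o custom-columns=NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion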

During the upgrade, the Postgres operator continuously tried to recreate the cloudnative-pg-cluster-1-1 pod every second after it was rejected by a worker node still running kubelet v1.28 (which did not support init containers with a restartPolicy).


Each new request reached the kube-api and was accepted, which triggered NCP to call NSX to reserve the necessary logical ports and other resources. Since the worker node couldn’t admit the pod, the operator deleted it and immediately retried, creating a loop of create/delete operations. 


This constant churn overwhelmed the nestdb database used by NCP, causing significant slowness and disruption to network operations. 

Resolution

Deletion of the Postgres-related objects (cluster.namespace) stopped the rapid recreation of pods.
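
For the reproduction example above, this corresponds to deleting the Cluster custom resource; a sketch (substitute the actual object name and namespace):

# Removing the CNPG Cluster object also removes the pods it manages, which stops the create/delete loop
kubectl delete clusters.postgresql.cnpg.io cluster-example -n <namespace>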

Once the problematic pod was deleted, the loop stopped, the tunnels came back online, and nestdb gradually caught up with the backlog of pending operations.

Additional Information

Recommendations (Checks before/during upgrading) to prevent or detect such problems:


There is no straightforward method to prevent this incident from occurring, and there are other possible causes where similar behaviour could happen:

  • A CronJob with a job configured to run every second
  • The above scenario but triggered by a different operator
  • Other, unknown reasons

To reduce the risk and better understand whether there is a potential problem:

Check for containers in an init state. After further analysis and reproduction of this issue, it was confirmed that, because this feature was not allowed by kube-api 1.28, there are no pods in an init or failed state prior to the upgrade, so it is not possible to verify whether such a pod was ever created.
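
One way to check for pods stuck in an init state across all namespaces is to filter on the STATUS column printed by kubectl (a sketch; the Init: prefix also covers states such as Init:Error or Init:CrashLoopBackOff):

# List pods whose status is still an init phase, e.g. Init:0/1, Init:Error, Init:CrashLoopBackOff
kubectl get pods --all-namespaces --no-headers | awk '$4 ~ /^Init:/ {print $1"/"$2"\t"$4}'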
If pods are in an init state during the upgrade process, consider doing one of the following:

  • Restart the pod
  • Scale the deployment to 0 
  • Schedule the pod to move to a worker running 1.29 or later

Verify if a Postgres operator is running on each service instance in the foundation:

for i in $(bosh deployments --column=name | grep service-instance); do echo "cluster $i result"; bosh -d $i ssh master/0 -c 'sudo /var/vcap/packages/kubernetes/bin/kubectl --kubeconfig=/var/vcap/jobs/smoke-tests/config/kubeconfig api-resources | grep postgresql.cnpg.io | grep clusters' | grep stdout; done;

On an already upgraded cluster, sidecar containers can be found using the method below; however, such containers will not be created (and cannot be found) prior to K8s 1.29:

kubectl get pods --all-namespaces -o json | jq -r '.items[]| select(.spec.initContainers != null)| select(.spec.initContainers[]? | has("restartPolicy"))| "\(.metadata.namespace)/\(.metadata.name)"'

Last but not least, no matter the root cause, a high volume of pod creations can be checked on all clusters with the below command:

for i in $(bosh deployments  --column=name | grep service-instance); do echo "cluster $i result"; bosh -d $i ssh master/0 -c 'if sudo /var/vcap/jobs/ncp/bin/nsxcli -c get ncp-master status | grep -q "This instance is the NCP master"; then sudo  /var/vcap/jobs/ncp/bin/nsxcli -c get ncp-watcher pod; fi' | grep stdout; done;

Results like the below indicate a high volume of operations:

cluster service-instance_50764bd7-83dc-4e91-b040-4139fa9885d4 result
master/104b43d6-6701-473b-88f2-d9bb162b07f7: stdout | Wed Jul 09 2025 UTC 09:07:22.067
master/104b43d6-6701-473b-88f2-d9bb162b07f7: stdout | Wed Jul 09 2025 UTC 09:07:24.356
master/104b43d6-6701-473b-88f2-d9bb162b07f7: stdout |     Average event processing time: 1 msec (in past 3600-sec window)
master/104b43d6-6701-473b-88f2-d9bb162b07f7: stdout |     Current watcher started time: Jul 09 2025 08:58:37 UTC
master/104b43d6-6701-473b-88f2-d9bb162b07f7: stdout |     Number of events processed: 7550 (in past 3600-sec window)
master/104b43d6-6701-473b-88f2-d9bb162b07f7: stdout |     Total events processed by current watcher: 3590
master/104b43d6-6701-473b-88f2-d9bb162b07f7: stdout |     Total events processed since watcher thread created: 36205
master/104b43d6-6701-473b-88f2-d9bb162b07f7: stdout |     Total watcher recycle count: 62
master/104b43d6-6701-473b-88f2-d9bb162b07f7: stdout |     Watcher thread created time: Jul 07 2025 11:25:10 UTC
master/104b43d6-6701-473b-88f2-d9bb162b07f7: stdout |     Watcher thread status: Up
master/104b43d6-6701-473b-88f2-d9bb162b07f7: stdout |