StatefulSet pods are stuck in Terminating status while draining Kubernetes nodes.

Article ID: 317809

Updated On: 08-21-2024

Products

VMware NSX

Issue/Introduction

Symptoms:

This issue occurs only in Kubernetes environments that run StatefulSet pods.
The nsx-node-agent DaemonSet has no NoExecute toleration.
A StatefulSet pod has been deleted but remains stuck in Terminating status.
The issue has been seen while draining a Kubernetes node (evicting its pods).
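To check whether a cluster matches these symptoms, something like the following sketch may help. The helper name `filter_terminating` is hypothetical and not part of any NSX or Kubernetes tooling. It assumes the default column layout of `kubectl get pods --all-namespaces` (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: print "namespace/name" for every pod whose STATUS
# column is Terminating. Reads `kubectl get pods --all-namespaces` output
# on stdin and skips the header row.
filter_terminating() {
  awk 'NR > 1 && $4 == "Terminating" { print $1 "/" $2 }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get pods --all-namespaces | filter_terminating
#
# Show the tolerations currently set on the node agent DaemonSet, to see
# whether a NoExecute entry is present:
#   kubectl get daemonset nsx-node-agent -n nsx-system \
#     -o jsonpath='{.spec.template.spec.tolerations}'
```

If the first command lists StatefulSet pods and the second shows no toleration with effect NoExecute, the environment likely matches this article.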

The following messages appear in the NCP log on the master node:

{"log":"2021-04-08T14:16:20.975Z Node01 NSX 13 - [nsx@6876 comp=\"nsx-container-ncp\" subcomp=\"ncp\" level=\"WARNING\"] vmware_nsxlib.v3.client The HTTP request returned error code 409, whereas 201/200 response codes were expected. Response body {'details': 'Operation failed because of conflicting transaction. Transaction ID: 61bf44a1-1234-####-####-########e66 Address: 618535031', 'httpStatus': 'CONFLICT', 'error_code': 603, 'module_name': 'common-services', 'error_message': 'The object was modified by somebody else. Please retry.', 'error_data': {'STREAM_ID': '3f6a4c95-c4a8-####-####-########8df', 'CONFLICT_VALUE': 'CommunicationMap [DisplayName =ds-tmbdevasx, precedence=13000099, category=APPLICATION, tcpStrict=true, stateful=true, anyScope=true, isDefault=false, connectivityStrategy=null, defaultRuleId=null, schedulerPath=null]', 'CONFLICT_KEY_HASH': '-1648387927382211737', 'CONFLICT_KEY': 
...
{"log":"2021-04-08T14:16:20.975Z Node01 NSX 13 - [nsx@6876 comp=\"nsx-container-ncp\" subcomp=\"ncp\" level=\"ERROR\" security=\"True\" errorCode=\"NCP00117\"] nsx_ujo.ncp.nsx.policy.nsxapi update_security_policy_rule failed, cause: Unexpected error from backend manager (['10.10.254.10']) for PATCH policy/api/v1/infra/domains/k8sdomain1/security-policies/ks81/rules/ir_1234a15a-d9ec-12ec-1234-fd12e523a781: The object was modified by somebody else. Please retry.
...

Environment

VMware NSX-T Data Center

Resolution

This issue is fixed in the NSX-T 3.1.2 release.

Workaround:
Manually add a NoExecute toleration to the nsx-node-agent DaemonSet.

Edit the DaemonSet: kubectl edit daemonset.apps/nsx-node-agent -n nsx-system
Change the tolerations section to the following:

      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      - effect: NoSchedule
        key: node.kubernetes.io/not-ready
      - effect: NoSchedule
        key: node.kubernetes.io/unreachable
      - effect: NoExecute
        operator: Exists
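As a non-interactive alternative to `kubectl edit`, the same toleration could be appended with a JSON patch. This is a minimal sketch, not a procedure taken from the NSX documentation. The names `NOEXECUTE_PATCH` and `add_noexecute_toleration` are hypothetical. It assumes the DaemonSet already has a `tolerations` list (as shown above) and uses a RollingUpdate strategy so that `kubectl rollout status` can track the change.

```shell
# JSON Patch that appends a catch-all NoExecute toleration to the pod
# template. The trailing "/-" in the path means "append to the existing list".
NOEXECUTE_PATCH='[{"op":"add","path":"/spec/template/spec/tolerations/-","value":{"effect":"NoExecute","operator":"Exists"}}]'

# Hypothetical wrapper: apply the patch, then wait for the DaemonSet pods
# to be rolled out with the new toleration.
add_noexecute_toleration() {
  kubectl patch daemonset nsx-node-agent -n nsx-system \
    --type=json -p "$NOEXECUTE_PATCH"
  kubectl rollout status daemonset/nsx-node-agent -n nsx-system
}

# Usage (requires kubectl access to the cluster):
#   add_noexecute_toleration
```

The patch is not idempotent: running it a second time would append another identical entry, so it is worth checking the current tolerations first.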

Additional Information

Impact/Risks:
The worker node cannot be fully drained because the StatefulSet pods remain stuck in Terminating status.
