StatefulSet Pods are stuck in Terminating status while draining Kubernetes Nodes

Article ID: 317809

Products

VMware NSX

Issue/Introduction

Symptoms:

  • This issue occurs only in Kubernetes environments that run StatefulSet pods.
  • The node agent DaemonSet has no NoExecute toleration.
  • The StatefulSet pod has been deleted but remains stuck in the Terminating state.
  • This issue has been seen while draining a Kubernetes node (evicting its pods).

The following entries appear in the NCP log on the master node:

{"log":"2021-04-08T14:16:20.975Z Node01 NSX 13 - [nsx@6876 comp=\"nsx-container-ncp\" subcomp=\"ncp\" level=\"WARNING\"] vmware_nsxlib.v3.client The HTTP request returned error code 409, whereas 201/200 response codes were expected. Response body {'details': 'Operation failed because of conflicting transaction. Transaction ID: 61bf44a1-1234-####-####-########e66 Address: 618535031', 'httpStatus': 'CONFLICT', 'error_code': 603, 'module_name': 'common-services', 'error_message': 'The object was modified by somebody else. Please retry.', 'error_data': {'STREAM_ID': '3f6a4c95-c4a8-####-####-########8df', 'CONFLICT_VALUE': 'CommunicationMap [DisplayName =ds-tmbdevasx, precedence=13000099, category=APPLICATION, tcpStrict=true, stateful=true, anyScope=true, isDefault=false, connectivityStrategy=null, defaultRuleId=null, schedulerPath=null]', 'CONFLICT_KEY_HASH': '-1648387927382211737', 'CONFLICT_KEY': 
...
{"log":"2021-04-08T14:16:20.975Z Noed01 NSX 13 - [nsx@6876 comp=\"nsx-container-ncp\" subcomp=\"ncp\" level=\"ERROR\" security=\"True\" errorCode=\"NCP00117\"] nsx_ujo.ncp.nsx.policy.nsxapi update_security_policy_rule failed, cause: Unexpected error from backend manager (['10.10.254.10']) for PATCH policy/api/v1/infra/domains/k8sdomain1/security-policies/ks81/rules/ir_1234a15a-d9ec-12ec-1234-fd12e523a781: The object was modified by somebody else. Please retry.
...



Environment

VMware NSX-T Data Center

Resolution

This issue is fixed in the NSX-T 3.1.2 release.

Workaround:
Manually add a NoExecute toleration to the node agent DaemonSet.

  1. Edit the node agent DaemonSet: kubectl edit daemonset.apps/nsx-node-agent -n nsx-system
  2. Change the tolerations section to the following:
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      - effect: NoSchedule
        key: node.kubernetes.io/not-ready
      - effect: NoSchedule
        key: node.kubernetes.io/unreachable
      - effect: NoExecute
        operator: Exists
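
After the edit, it can help to confirm that the DaemonSet really carries a toleration matching every NoExecute taint. The following is a minimal sketch, not part of the official workaround. It assumes the DaemonSet has been exported as JSON (for example with kubectl get daemonset nsx-node-agent -n nsx-system -o json). The helper names daemonset_tolerations and tolerates_all_noexecute are hypothetical and introduced only for illustration.

```python
import json


def daemonset_tolerations(ds_json):
    """Extract the pod template tolerations from a DaemonSet JSON document."""
    ds = json.loads(ds_json)
    return ds["spec"]["template"]["spec"].get("tolerations", [])


def tolerates_all_noexecute(tolerations):
    """Return True if at least one toleration matches every NoExecute taint.

    Kubernetes matching rules relied on here:
      - operator "Exists" with an empty key matches any taint key;
      - an empty effect matches any effect, including NoExecute.
    """
    for t in tolerations or []:
        effect_ok = t.get("effect") in (None, "", "NoExecute")
        key_ok = t.get("operator") == "Exists" and not t.get("key")
        if effect_ok and key_ok:
            return True
    return False


# Hypothetical sample mirroring the tolerations shown in step 2.
SAMPLE = json.dumps({
    "spec": {"template": {"spec": {"tolerations": [
        {"effect": "NoSchedule", "key": "node-role.kubernetes.io/master"},
        {"effect": "NoSchedule", "key": "node.kubernetes.io/not-ready"},
        {"effect": "NoSchedule", "key": "node.kubernetes.io/unreachable"},
        {"effect": "NoExecute", "operator": "Exists"},
    ]}}}
})

print(tolerates_all_noexecute(daemonset_tolerations(SAMPLE)))
```

The check deliberately ignores tolerations that name a specific key, because those cover only one NoExecute taint and not all of them.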

Additional Information

Impact/Risks:
The worker node cannot be fully drained because the StatefulSet pods remain stuck in the Terminating status.