Dataplane (DP) service restarts intermittently, which might cause VM communication issues

Article ID: 314002

Updated On:

Products

VMware NSX
VMware vDefend Firewall

Issue/Introduction

Symptoms:

  • BFD sessions of an Edge node go down intermittently, which results in Edge failover (T0/T1 failover).
  • LACP PDUs cannot be processed before the timeout, which causes LACP flaps.
  • Get commands fail on the Edge CLI while the issue is occurring.
  • In the UI, the status of Edge nodes is intermittently Down or Unknown.

Environment

VMware NSX-T Data Center 4.1 or below

Cause

The problem is caused by a scalability limitation in the Edge firewall/NAT. When firewall/NAT configuration changes are made on a highly scaled Edge node, the change takes too long to realize, which causes the dataplane (DP) service to restart.

Resolution

A new feature was added in NSX 4.2.0 that optimizes the dp-ipc thread, which handles firewall configuration changes, BFD, and CLI.


Workaround:

Optimize firewall rules and firewall configuration changes to reduce container updates on the Edges.
The following workarounds can be applied to reduce the number of container updates on the Edge until the fix (optimization of the dp-ipc thread) is available:

  • Reduce the frequency of firewall configuration changes whenever possible.
  • Use Large form factor Edges.
  • Add more Edges to the Edge cluster and distribute the load equally among the Edge VMs.
  • Limit the scale of firewall/NAT and T1s on an Edge node.

Note: Gateway Firewall configuration changes can be triggered by any of the following:

  • vMotion of virtual machines
  • Virtual machine creation/deletion
  • Virtual machine power up/power down
  • Changes to the rules
  • Changes to the ns-groups
  • Changes to the ip-sets
  • Changes to load-balancer components like virtual servers, VIPs, and pools

 

 

Additional Information

While the issue is occurring, running any get command on the Edge CLI returns the following error:
% An unexpected error occurred: The dataplane service is in error state, has failed or is disabled
aggrtr4>

The syslog on the Edge VM shows errors similar to the following:
2023-08-14T18:07:14.532Z edgenode.fqdn NSX 5038 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="dp-si-purge5" level="WARN" eventId="vmwNSXRCUBlockStatus"] {"event_state":0,"event_external_reason":"dp-ipc18 thread blocked to enter RCU quiesce state","event_src_comp_id":"6b7ae106-3aed-45d6-9735-d4be90b7e815","event_sources":{"process_name":"dp-fp:0#012","thread_id":"dp-ipc18","quiesce_blocked_time":"128000"}}
2023-08-14T18:07:22.432Z edgenode.fqdn NSX 5038 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="urcu1" level="WARN"] blocked 256000 ms waiting for dp-ipc18 to quiesce
2023-08-14T18:09:22.532Z edgenode.fqdn NSX 5038 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="dp-si-purge5" level="WARN"] blocked 256000 ms waiting for dp-ipc18 to quiesce
2023-08-14T18:10:30.053Z edgenode.fqdn NSX 5038 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="urcu1" level="INFO" eventId="vmwNSXRCUBlockStatus"] {"event_state":1,"event_external_reason":"all threads exited RCU quiesce blocked state","event_src_comp_id":"6b7ae106-3aed-45d6-9735-d4be90b7e815","event_sources":{"process_name":"dp-fp:0#012","thread_id":"all-threads","quiesce_blocked_time":"0"}}
2023-08-14T18:10:30.058Z edgenode.fqdn NSX 5038 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="dp-si-purge5" level="INFO" eventId="vmwNSXRCUBlockStatus"] {"event_state":1,"event_external_reason":"all threads exited RCU quiesce blocked state","event_src_comp_id":"6b7ae106-3aed-45d6-9735-d4be90b7e815","event_sources":{"process_name":"dp-fp:0#012","thread_id":"all-threads","quiesce_blocked_time":"0"}}
2023-08-14T18:10:31.052Z edgenode.fqdn NSX 5038 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="urcu1" level="WARN"] blocked 1000 ms waiting for dp-ipc18 to quiesce
2023-08-14T18:10:32.052Z edgenode.fqdn NSX 5038 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="urcu1" level="WARN"] blocked 2000 ms waiting for dp-ipc18 to quiesce
2023-08-14T18:10:34.052Z edgenode.fqdn NSX 5038 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="urcu1" level="WARN"] blocked 4000 ms waiting for dp-ipc18 to quiesce
2023-08-14T18:10:38.052Z edgenode.fqdn NSX 5038 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="urcu1" level="WARN"] blocked 8000 ms waiting for dp-ipc18 to quiesce
2023-08-14T18:10:46.052Z edgenode.fqdn NSX 5038 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="urcu1" level="WARN"] blocked 16000 ms waiting for dp-ipc18 to quiesce
2023-08-14T18:11:02.052Z edgenode.fqdn NSX 5038 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="urcu1" level="WARN"] blocked 32000 ms waiting for dp-ipc18 to quiesce
2023-08-14T18:11:34.052Z edgenode.fqdn NSX 5038 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="urcu1" level="WARN"] blocked 64000 ms waiting for dp-ipc18 to quiesce

2023-08-14T17:10:35.198Z edgenode.fqdn NSX 5038 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="timeval" tname="dp-ipc18" level="WARN"] Unreasonably long 437740ms poll interval (436096ms user, 76ms system)
2023-08-14T17:18:04.917Z edgenode.fqdn NSX 5038 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="timeval" tname="dp-ipc18" level="WARN"] Unreasonably long 449719ms poll interval (437555ms user, 100ms system)
2023-08-14T17:25:23.293Z edgenode.fqdn NSX 5038 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="timeval" tname="dp-ipc18" level="WARN"] Unreasonably long 438375ms poll interval (438252ms user, 48ms system)
2023-08-14T17:33:00.841Z edgenode.fqdn NSX 5038 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="timeval" tname="dp-ipc18" level="WARN"] Unreasonably long 453069ms poll interval (437248ms user, 160ms system)
2023-08-14T17:40:17.176Z edgenode.fqdn NSX 5038 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="timeval" tname="dp-ipc18" level="WARN"] Unreasonably long 436336ms poll interval (436176ms user, 52ms system)
2023-08-14T17:47:57.270Z edgenode.fqdn NSX 5038 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="timeval" tname="dp-ipc18" level="WARN"] Unreasonably long 449515ms poll interval (438190ms user, 104ms system)
2023-08-14T17:55:13.882Z edgenode.fqdn NSX 5038 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="timeval" tname="dp-ipc18" level="WARN"] Unreasonably long 436612ms poll interval (436405ms user, 48ms system)
2023-08-14T18:03:06.432Z edgenode.fqdn NSX 5038 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="timeval" tname="dp-ipc18" level="WARN"] Unreasonably long 469795ms poll interval (442090ms user, 268ms system)
2023-08-14T18:10:30.052Z edgenode.fqdn NSX 5038 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="timeval" tname="dp-ipc18" level="WARN"] Unreasonably long 443620ms poll interval (440017ms user, 140ms system)
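
To check a saved copy of the Edge syslog for the two message patterns shown above, a small script along the following lines may help. This is a minimal sketch, not a supported tool: the function name summarize and the 60000 ms threshold are assumptions chosen for illustration, and the regular expressions are derived only from the sample lines in this article, so other releases may format these messages differently.

```python
import re

# Patterns derived from the sample log lines in this article (assumption:
# other NSX releases may word these messages differently).
BLOCKED_RE = re.compile(
    r"\] blocked (?P<ms>\d+) ms waiting for (?P<thread>\S+) to quiesce"
)
POLL_RE = re.compile(
    r'tname="(?P<thread>[^"]+)".*?Unreasonably long (?P<ms>\d+)ms poll interval'
)


def summarize(lines, threshold_ms=60000):
    """Return (timestamp, kind, thread, ms) for stalls >= threshold_ms.

    threshold_ms is a hypothetical cut-off for illustration only; it is
    not a value documented by the product.
    """
    hits = []
    for line in lines:
        # The timestamp is the first whitespace-delimited field.
        timestamp = line.split(" ", 1)[0]
        match = BLOCKED_RE.search(line)
        kind = "rcu_blocked"
        if match is None:
            match = POLL_RE.search(line)
            kind = "long_poll"
        if match is not None and int(match.group("ms")) >= threshold_ms:
            hits.append(
                (timestamp, kind, match.group("thread"), int(match.group("ms")))
            )
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python summarize_stalls.py <path-to-saved-syslog-copy>
    with open(sys.argv[1], errors="replace") as handle:
        for hit in summarize(handle):
            print(*hit)
```

Repeated hits that name a dp-ipc thread, with blocked times that keep doubling (1000, 2000, 4000 ms and so on, as in the sample above), are consistent with the behavior described in this article.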

Impact/Risks:

Customers experience intermittent Edge failover and datapath issues.
LACP PDUs are dropped.