Traffic going through Edge Node is interrupted and dataplane service (dp-fp) crashes and generates core dumps

Article ID: 318417

Products

VMware NSX Networking

Issue/Introduction

Symptoms:
  • Traffic going through Edge Node is interrupted and dataplane service (dp-fp) crashes and generates core dumps.
  • Edge Node logs (syslog.log) display message(s) similar to:
#get log-file syslog.log | find core.dp-fp
<12>1 2019-12-02T05:15:36.635451+00:00 at101esg02.vmw.arka.run NSX - - - Core file generated: /var/log/core//core.dp-fp:1.1575263711.2688.0.11.g
  • Edge Node logs (kern.log) display message(s) similar to:
#get log-file kern.log | find "Segmentation fault occurred"
kern.log:<5>1 2019-08-28T01:00:29.224169+00:00 edge01.vmware.com kernel - - - [3198001.445096] grsec: Segmentation fault occurred at 0000000000000066 in /opt/vmware/nsx-edge/sbin/datapathd[dp-fp:1:1697] uid/euid:0/0 gid/egid:124/124, parent /lib/systemd/systemd[systemd:1] uid/euid:0/0 gid/egid:0/0
  • When BGP is used, BGP neighbor sessions go down and come back up when the dataplane service crashes.
  • When static routes are used, Logical Router routing tables display "unknown" interfaces and static routes do not work as expected after the dataplane service crashes.
Example:
#nsxt-edge01(tier0_sr)> get route
...
t0s> * 172.16.0.0/12 [1/0] via 10.10.31.1, unknown, 05w2d02h
t0c * 10.10.22.0/27 is directly connected, uplink-277, 12:52:30
t0c> * 10.10.22.0/27 is directly connected, uplink-277, 12:52:30
t0s> * 192.168.240.0/24 [1/0] via 10.10.31.1, unknown, 05w2d02h
t0s> * 192.168.241.0/24 [1/0] via 10.10.31.1, unknown, 05w2d02h
t0s> * 192.168.242.0/24 [1/0] via 10.10.31.1, unknown, 05w2d02h
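
The symptoms above can be checked against saved CLI output with a short script. The following is a minimal sketch in Python, assuming the log and route output have been copied into text by hand. The regular expressions are derived only from the sample lines in this article, so message formats on other releases may differ. The function names find_crash_lines and find_unknown_static_routes are hypothetical helpers, not part of any NSX tooling.

```python
import re

# Patterns derived from the sample lines in this article (assumption:
# other releases may word these messages differently).
CORE_DUMP_RE = re.compile(r"Core file generated: \S*core\.dp-fp")
SEGFAULT_RE = re.compile(r"Segmentation fault occurred .*datapathd\[dp-fp")

# Static route line: t0s> * <prefix> [1/0] via <next hop>, <interface>, <age>
ROUTE_RE = re.compile(
    r"^t0s>?\s*\*\s+(\S+)\s+\[\d+/\d+\]\s+via\s+([^,\s]+),\s+([^,\s]+),"
)


def find_crash_lines(log_text):
    """Return log lines that report a dp-fp core dump or segmentation fault."""
    return [
        line
        for line in log_text.splitlines()
        if CORE_DUMP_RE.search(line) or SEGFAULT_RE.search(line)
    ]


def find_unknown_static_routes(route_text):
    """Return (prefix, next_hop) for static routes whose interface is 'unknown'."""
    hits = []
    for line in route_text.splitlines():
        match = ROUTE_RE.match(line.strip())
        if match and match.group(3) == "unknown":
            hits.append((match.group(1), match.group(2)))
    return hits
```

Any non-empty result from either function matches a symptom described in this article and is a prompt for further investigation, not a confirmed diagnosis.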

Environment

VMware NSX-T Data Center
VMware NSX-T Data Center 2.x
VMware NSX-T

Cause

This issue is caused by a defect in the Flow Cache feature, which caches forwarding decisions on Edge Nodes to improve performance. A segmentation fault occurs when a NULL flow entry is added to the Flow Cache tables. As a result, the dataplane service on the Edge Node crashes, causing the dataplane impact described in the Symptoms.

Resolution

This issue is resolved in:

VMware NSX-T Data Center 2.4.3, available at VMware Downloads.

VMware NSX-T Data Center 2.5.1, available at VMware Downloads.

VMware NSX-T Data Center 3.0, available at VMware Downloads.
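
To compare a deployed version against the fixed releases listed above, a small check like the following may help. It is a sketch under two assumptions that this article does not state: later maintenance releases in the 2.4 and 2.5 trains, and all releases from 3.0 onward, retain the fix. Release trains not named here (for example 2.3) are treated as not fixed. The names parse_version and has_fix are hypothetical.

```python
# Fixed releases named in this article, keyed by (major, minor) release train.
FIXED_IN = {(2, 4): (2, 4, 3), (2, 5): (2, 5, 1)}


def parse_version(text):
    """Turn a dotted version such as '2.5.1' into a (major, minor, patch) tuple."""
    parts = [int(part) for part in text.strip().split(".")]
    while len(parts) < 3:
        parts.append(0)  # '3.0' is treated as 3.0.0
    return tuple(parts[:3])


def has_fix(version_text):
    """Return True if the version is at or beyond a fixed release.

    Assumption: later releases in the same train keep the fix, and every
    release from 3.0 onward includes it.
    """
    version = parse_version(version_text)
    if version >= (3, 0, 0):
        return True
    fixed = FIXED_IN.get(version[:2])
    return fixed is not None and version >= fixed
```

For example, has_fix("2.4.2") returns False and has_fix("2.5.1") returns True.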

Workaround:
To work around this issue, disable Flow Cache on all Edge Nodes by following the steps below:
1. Log in to the Edge Node using the admin account.
2. Disable Flow Cache:
> set dataplane flow-cache disabled
3. Restart the dataplane service:
> restart service dataplane
Note: Restarting the dataplane service causes a brief interruption to dataplane traffic.
Note: This setting persists across Edge Node reboots, but Flow Cache must also be disabled on any newly deployed Edge Nodes.

Confirm Flow cache is disabled:
> get dataplane flow-cache config
Example of expected output:
Enabled            : false
Mega_hard_timeout_ms: 0
Mega_size          : 0
Mega_soft_timeout_ms: 0
Micro_size         : 0
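
When checking several Edge Nodes, the confirmation step can be scripted against saved output. The following is a minimal sketch in Python, assuming the output of the command above has been captured as text and uses the "Key : value" layout shown in the example. It makes no NSX API or CLI calls, and the function names parse_flow_cache_config and flow_cache_disabled are hypothetical.

```python
def parse_flow_cache_config(output):
    """Parse 'Key : value' lines into a dict of stripped strings."""
    config = {}
    for line in output.splitlines():
        key, separator, value = line.partition(":")
        if separator:  # skip lines that contain no colon
            config[key.strip()] = value.strip()
    return config


def flow_cache_disabled(output):
    """Return True only if the output reports Enabled as false."""
    enabled = parse_flow_cache_config(output).get("Enabled", "")
    return enabled.lower() == "false"
```

A missing Enabled line yields False, so incomplete or unexpected output is never reported as disabled.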


To work around static routes showing as "unknown", remove and re-add the static route, or edit the route and publish it again. This is only a temporary workaround and the issue may recur, so Flow Cache should be disabled to prevent recurrence.