NSX-T Edge node crashes with the following "[WARN] unix:/var/run/vmware/edge/dpd.ctl: receive error: Connection reset by peer"

Article ID: 345835


Products

VMware NSX

Issue/Introduction

  • You are running NSX-T 3.2.0.x, 3.2.1.x, 3.2.2, or 4.0.x.
  • You are using DFW rules, and some of the rules have "Applied To" set to a group that contains overlay segments with logical router switch ports.
  • On the Edge node, running the command get logical-routers returns the following WARN alert:
An unexpected error occurred: <date-time> edge-appctl 18819 jsonrpc [WARN] unix:/var/run/vmware/edge/dpd.ctl: receive error: Connection reset by peer
  • SSH access to the Edge node still works.
  • The dataplane service is stopped on the Edge node:
>get service dataplane
<date-time>
Service name:      dataplane
Service state:     stopped
  • The Edge node log /var/log/kern.log contains an entry similar to:
datapathd[32578]: segfault at 8 ip <hex-address> sp <hex-uuid> error 4 in datapathd[<hex-address>+15ed000]
  • The Edge node log /var/log/syslog contains entries similar to:

NSXT-E1 NSX 5266 FIREWALL [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="firewallcp" level="ERROR"] DfwChannel: Failed to update dfw cache due to exception: too many TCP/UDP port: 16

NSXT-E1 datapath-systemd-helper 5197 - - <date-time-1> datapathd 5266 firewallcp [ERROR] DfwChannel: Failed to update dfw cache due to exception: too many TCP/UDP port: 16

NSXT-E1 95ddcdc5d374 3459 - - <date-time-2> datapathd 5266 firewallcp [ERROR] DfwChannel: Failed to update dfw cache due to exception: too many TCP/UDP port: 16

...

NSXT-E1 NSX 3548 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="INFO"] Service datapathd coredump at <date-time-3> file /var/log/core/core.datapathd.<epoch-time>.20851.0.9.gz

  • Core dumps for the dataplane service are present on the Edge node in /var/log/core:
core.datapathd.<epoch-time>.20851.0.9.gz
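The log symptoms above can be checked together with a short script. The following is a minimal sketch, assuming you have copied /var/log/syslog and /var/log/kern.log from the Edge node to a workstation. The function name find_symptoms and the signature labels are hypothetical, and the patterns are derived only from the log excerpts in this article, so other builds may word these messages differently.

```python
import re

# Patterns derived from the log excerpts in this article (assumption:
# the message wording is the same on the affected build).
SIGNATURES = {
    "dfw_cache_error": re.compile(
        r"Failed to update dfw cache due to exception: too many TCP/UDP port: \d+"
    ),
    "datapathd_segfault": re.compile(r"datapathd\[\d+\]: segfault"),
    "datapathd_coredump": re.compile(r"Service datapathd coredump"),
}


def find_symptoms(lines):
    """Count how many lines match each known symptom signature."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python find_symptoms.py syslog kern.log
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            print(path, find_symptoms(handle))
```

A non-zero count for dfw_cache_error together with datapathd_segfault or datapathd_coredump is consistent with the symptoms described in this article, but it is not a definitive diagnosis.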

Environment

VMware NSX

Cause

The Central Control Plane (CCP) includes the downlink port in the span of a Distributed Firewall (DFW) rule when a logical switch or Logical Switch Port (LSP) is used in the rule's "Applied To" field.

As a result, DFW rules are sent to the Edge node in error. If one of the rules pushed to the Edge node has invalid parameters (for example, too many TCP/UDP ports, as in the syslog error above), it may cause the dataplane to crash repeatedly.

Resolution

This issue is resolved in NSX 4.1.x and 3.2.3 onwards.

In NSX-T 3.2.2, a validation has been added that checks that the port count in a rule is smaller than or equal to 15. If the port count is greater than 15, the operation fails and returns an error message similar to:
"Number of values (ranges count as 2 values) in a source/destination ports {port count}. It should not exceed 15."

To avoid the Edge dataplane crash, verify that no firewall rule contains more than 15 port values. A single port counts as 1 value and a range counts as 2 values. For example, if the port range 1-3 is specified, the rule has 2 port values. If a rule requires more than 15 port values, divide it into multiple rules.
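The counting rule can be illustrated with a small script. The following is a minimal sketch, assuming port entries are given as strings such as "443" or "8000-8100". The helper names and the input format are hypothetical and do not correspond to an NSX API. It counts a range as 2 values, as the validation message states, and greedily groups entries so that each group stays within the limit of 15.

```python
MAX_PORT_VALUES = 15  # limit quoted in the validation message


def value_cost(entry):
    """A single port counts as 1 value; a range such as '1-3' counts as 2."""
    return 2 if "-" in entry else 1


def count_port_values(entries):
    """Total number of port values used by a list of port entries."""
    return sum(value_cost(entry) for entry in entries)


def split_port_entries(entries, limit=MAX_PORT_VALUES):
    """Greedily group entries so that each group uses at most `limit` values."""
    groups, current, cost = [], [], 0
    for entry in entries:
        entry_cost = value_cost(entry)
        # Start a new group when adding this entry would exceed the limit.
        if current and cost + entry_cost > limit:
            groups.append(current)
            current, cost = [], 0
        current.append(entry)
        cost += entry_cost
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    ports = [str(port) for port in range(8000, 8016)]  # 16 single ports
    print(count_port_values(ports))
    for group in split_port_entries(ports):
        print(count_port_values(group), group)
```

Each resulting group corresponds to the port list of one firewall rule. The rules themselves still have to be created in NSX with the same source, destination, and action as the original rule.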

Additional Information

Impact: While the dataplane service is stopped, the dataplane on the Edge node is not functional, so services that depend on that Edge node are impacted.