Dataplane service crash after upgrade to NSX-T 3.1.2

Article ID: 317185

Products

VMware NSX

Issue/Introduction

Symptoms:
- The environment was recently upgraded to NSX-T 3.1.2.
- The Edge status intermittently flaps between Up and Down.
- Messages similar to the following may be seen in the syslog of an Edge Node (/var/log/syslog):

2021-06-18T08:47:36.988Z edge01.corp.local NSX 3355 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING" eventFeatureName="infrastructure_service" eventType="edge_service_status_changed" eventSev="warning" eventState="On"] The service dataplane changed from STARTED to CRASHED.
2023-03-27T01:50:42.512059+00:00 hostname kernel - - - [23306278.252844] grsec: Segmentation fault occurred at      (nil) in /opt/vmware/nsx-edge/sbin/datapathd[dp-fp:20:30778] uid/euid:0/0 gid/egid:124/124, parent /opt/vmware/edge/dpd/entrypoint.sh[entrypoint.sh:30580] uid/euid:0/0 gid/egid:124/124

2023-03-27T01:50:42.542Z hostname NSX 14567 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING"] Core file generated: /var/log/core/core.dp-fp:20.1679881842.30647.0.11.gz



- Recent core files are present in /var/log/core on the Edges, similar to the following:

-rw-rw-r-- 1 svc.datamover support 33G Jun 18 12:40 core.dp-fp:0.1624004285.3814.0.11
-rw-rw-r-- 1 svc.datamover support 32G Jun 18 12:13 core.dp-fp:1.1624003725.28476.0.11
-rw-rw-r-- 1 svc.datamover support 32G Jun 18 12:11 core.dp-fp:1.1624005420.19033.0.11
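
The log symptoms above can be checked for with a small script. The following is a minimal sketch, not a supported tool: the patterns are taken from the sample messages in this article, and the function name `scan_syslog` and the labels are hypothetical. Message wording may differ between releases, so treat a missing match as inconclusive.

```python
import re

# Patterns copied from the sample syslog messages in this article.
# The labels on the left are illustrative names, not NSX terminology.
CRASH_PATTERNS = {
    "service_crashed": re.compile(
        r"The service dataplane changed from STARTED to CRASHED"
    ),
    "segfault": re.compile(
        r"Segmentation fault occurred at .* in \S*datapathd"
    ),
    "core_file": re.compile(r"Core file generated: (\S+)"),
}


def scan_syslog(lines):
    """Return (line_number, label, detail) for every line matching a pattern.

    `lines` is any iterable of strings, for example an open file object.
    `detail` is the core file path for "core_file" hits, otherwise "".
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        for label, pattern in CRASH_PATTERNS.items():
            match = pattern.search(line)
            if match:
                detail = match.group(1) if match.groups() else ""
                hits.append((number, label, detail))
    return hits


# Usage with two shortened sample lines:
sample = [
    "... The service dataplane changed from STARTED to CRASHED.",
    "... Core file generated: /var/log/core/core.dp-fp:20.1679881842.30647.0.11.gz",
]
for number, label, detail in scan_syslog(sample):
    print(number, label, detail)
```

To scan a real file, pass an open handle, for example `scan_syslog(open("/var/log/syslog", errors="replace"))`.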

Environment

VMware NSX-T Data Center 3.x
VMware NSX-T Data Center

Cause

- An L3 routing loop is identified in the Logical Router forwarding tables.
- The ttl_dec value is set to 0 on the logical router forwarding interface for the route that is looping.

For illustration, below is one example topology that could lead to such a loop; the issue is not exclusive to this topology:

- Two T1 routers connected to the same segment on service interfaces:

- In this example, packets destined for network ###.###.###.0/24 are forwarded over service interface ########-####-####-####-##########83 from T1_Dev to gateway interface 169.254.1.2 on router T1_Prod.
- T1_Prod then routes the packet back to T1_Dev from its service interface ########-####-####-####-##########94 to the gateway IP 169.254.1.1.
- On the service interface the ttl_decrement value should be 1, but in the example below it is 0. The packet's TTL is therefore never decremented between hops, so the packet keeps looping.
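
The effect of a zero TTL decrement can be illustrated with a toy model. This is a simplified sketch and not how the NSX datapath is implemented: the function name `forward_in_loop`, the `max_hops` cap, and the drop rule (drop once the TTL reaches 0) are assumptions made for illustration. It models only the loop, not the crash.

```python
def forward_in_loop(initial_ttl, ttl_decrement, max_hops=1000):
    """Bounce one packet between two routers that route it back to each other.

    Returns (hops_taken, outcome). The outcome is "dropped" when the TTL
    expires, or "still looping" when the max_hops cap is reached first.
    """
    ttl = initial_ttl
    for hop in range(max_hops):
        # Each forwarding hop subtracts ttl_decrement from the TTL.
        ttl -= ttl_decrement
        if ttl <= 0:
            # TTL expired: the router drops the packet and the loop ends.
            return hop + 1, "dropped"
    # The cap was reached without the TTL ever expiring.
    return max_hops, "still looping"


print(forward_in_loop(64, 1))  # decrement of 1: the loop is bounded by the TTL
print(forward_in_loop(64, 0))  # decrement of 0: the TTL never expires
```

With a decrement of 1 the loop ends after at most `initial_ttl` hops; with a decrement of 0 nothing in the model ever ends it, which is why the cap is needed.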

The TTL decrement (ttl_dec) value for a service port should be 1, not 0. The example below shows the incorrect value:

    Interface     : ########-####-####-####-##########83
    Ifuid         : 1073
    Name          : t1-4e93e857-####-####-####-##f7
    Fwd-mode      : IPV4_ONLY
    Mode          : lif
    Port-type     : service  
    IP/Mask       : 169.254.1.1/24
    MAC           : 02:50:##:##:##:16
    VNI           : 73819
    Access-VLAN   : untagged
    LS port       : ########-####-####-####-##########16
    Urpf-mode     : STRICT_MODE
    DAD-mode      : LOOSE
    RA-mode       : DISABLED
    Admin         : up
    Op_state      : down
    MTU           : 1500

  "ttl": 0, 
  "type": "lif",
  "urpf-mode": "STRICT_MODE",
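
If the interface data is available as JSON, entries like the one above can be flagged programmatically. The following is a rough sketch under stated assumptions: the keys `"ttl"` and `"type"` come from the fragment above, but the overall shape of the data (a list of interface objects) and the function name `find_zero_ttl_lifs` are hypothetical, since this article shows only a fragment of the output.

```python
import json


def find_zero_ttl_lifs(interfaces):
    """Return the entries whose "type" is "lif" and whose "ttl" is 0.

    `interfaces` is assumed to be a list of dicts parsed from JSON.
    Entries without a "ttl" key are not flagged.
    """
    return [
        entry
        for entry in interfaces
        if entry.get("type") == "lif" and entry.get("ttl") == 0
    ]


# Usage with made-up data that reuses only the keys shown in the fragment:
raw = """
[
  {"ttl": 0, "type": "lif", "urpf-mode": "STRICT_MODE"},
  {"ttl": 1, "type": "lif", "urpf-mode": "STRICT_MODE"}
]
"""
for entry in find_zero_ttl_lifs(json.loads(raw)):
    print("ttl is 0 on:", entry)
```

A flagged entry is only a candidate for review; confirm it against the interface's Port-type before concluding that it matches this issue.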


Resolution

- A fix is available in NSX-T 3.1.3, which corrects the 'dec-ttl' setting that was incorrect for some lrouter ports.
- The fix prevents datapath crashes, but it is still recommended to identify the source of the L3 routing loop and change the configuration to remove the loop, even after applying the fix.

If you are not sure whether this is the cause, upload the Edge log bundles, including the core files from the crash, to a Support Request. The core files will be analyzed to confirm the issue.



Workaround:
Not applicable

Additional Information

Impact/Risks:
The datapath will crash and restart. Traffic may be impacted until HA takes over, or until the datapath has restarted if there is no HA.