Symptoms:
- Environment recently updated to NSX-T 3.1.2.
- The Edge status intermittently flaps between Up and Down.
- Messages similar to the following may be seen in the syslog of an Edge Node (/var/log/syslog):
2021-06-18T08:47:36.988Z edge01.corp.local NSX 3355 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING" eventFeatureName="infrastructure_service" eventType="edge_service_status_changed" eventSev="warning" eventState="On"] The service dataplane changed from STARTED to CRASHED.
2023-03-27T01:50:42.512059+00:00 hostname kernel - - - [23306278.252844] grsec: Segmentation fault occurred at (nil) in /opt/vmware/nsx-edge/sbin/datapathd[dp-fp:20:30778] uid/euid:0/0 gid/egid:124/124, parent /opt/vmware/edge/dpd/entrypoint.sh[entrypoint.sh:30580] uid/euid:0/0 gid/egid:124/124
2023-03-27T01:50:42.542Z hostname NSX 14567 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING"] Core file generated: /var/log/core/core.dp-fp:20.1679881842.30647.0.11.gz
- Recent core files are present in /var/log/core on the Edges, like the following:
-rw-rw-r-- 1 svc.datamover support 33G Jun 18 12:40 core.dp-fp:0.1624004285.3814.0.11
-rw-rw-r-- 1 svc.datamover support 32G Jun 18 12:13 core.dp-fp:1.1624003725.28476.0.11
-rw-rw-r-- 1 svc.datamover support 32G Jun 18 12:11 core.dp-fp:1.1624005420.19033.0.11
- An L3 routing loop is identified in the Logical Router forwarding tables.
- The ttl_dec value is set to 0 on the logical router forwarding interface for the route that is looping.
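The log signatures listed in the symptoms above can be checked for with a short script. The following is a minimal sketch, not an NSX-T tool: the patterns are copied from the sample messages in this article, while the function name scan_syslog_lines and the shape of the returned dictionary are assumptions made for illustration.

```python
import re

# Patterns copied from the sample syslog messages in this article.
# The function name and return structure are illustrative assumptions.
CRASH_RE = re.compile(r"The service dataplane changed from STARTED to CRASHED")
SEGV_RE = re.compile(r"Segmentation fault occurred at .* in \S*datapathd")
CORE_RE = re.compile(r"Core file generated: (\S+)")


def scan_syslog_lines(lines):
    """Count crash indicators and collect reported core file paths."""
    result = {"dataplane_crashed": 0, "datapathd_segfault": 0, "core_files": []}
    for line in lines:
        if CRASH_RE.search(line):
            result["dataplane_crashed"] += 1
        if SEGV_RE.search(line):
            result["datapathd_segfault"] += 1
        # A single line may report a core file path after the marker text.
        result["core_files"].extend(CORE_RE.findall(line))
    return result
```

The function accepts any iterable of strings, so an open file handle for a copy of /var/log/syslog can be passed directly. A match only indicates that the dataplane crashed; it does not by itself confirm that a routing loop is the cause.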
For illustration purposes, below is one example topology that could lead to such a loop. The issue is not exclusive to this topology:
- Two T1 routers connected to the same segment on service interfaces:
- In the example above, packets destined for network ###.###.###.0/24 will be forwarded over service interface ########-####-####-####-##########83 from T1_Dev to gateway interface 169.254.1.2 on router T1_Prod.
- T1_Prod will then route the packet back to T1_Dev from its service interface ########-####-####-####-##########94 to the gateway IP 169.254.1.1.
- On the service interface, the ttl_decrement value should be 1, but in the example below it is 0. Because the TTL is never decremented between hops, the packet is never dropped and keeps looping.
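The effect of the decrement value on a looping packet can be shown with a small model. This is a simplified sketch and not how the Edge datapath is implemented: it assumes two routers that always forward the packet back to each other, and the function name hops_until_drop and the max_hops cutoff are hypothetical.

```python
def hops_until_drop(initial_ttl, ttl_decrement, max_hops=1000):
    """Model a packet bouncing between two routers in a routing loop.

    Each router subtracts ttl_decrement from the TTL and drops the packet
    when the TTL reaches zero; otherwise it forwards the packet again.
    Returns the number of forwards before the drop, or None if the packet
    is still circulating after max_hops forwards.
    """
    ttl = initial_ttl
    hops = 0
    while hops < max_hops:
        ttl -= ttl_decrement
        if ttl <= 0:
            return hops  # TTL expired, the loop ends here
        hops += 1
    return None  # TTL never expired within the cutoff


# With a decrement of 1 the loop is bounded by the initial TTL.
print(hops_until_drop(64, 1))
# With a decrement of 0 the TTL never changes, so the packet is never dropped.
print(hops_until_drop(64, 0))
```

In this model a decrement of 1 ends the loop once the TTL is used up, while a decrement of 0 lets the packet circulate until the cutoff, which matches the looping behavior described above.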
The TTL decrement for a service port should be 1, not 0. The example below is incorrect:
Interface : ########-####-####-####-##########83
Ifuid : 1073
Name : t1-4e93e857-####-####-####-##f7
Fwd-mode : IPV4_ONLY
Mode : lif
Port-type : service
IP/Mask : 169.254.1.1/24
MAC : 02:50:##:##:##:16
VNI : 73819
Access-VLAN : untagged
LS port : ########-####-####-####-##########16
Urpf-mode : STRICT_MODE
DAD-mode : LOOSE
RA-mode : DISABLED
Admin : up
Op_state : down
MTU : 1500
"ttl": 0,
"type": "lif",
"urpf-mode": "STRICT_MODE",