NSX data plane disconnects impacting entire environment

Article ID: 312641

Products

VMware NSX Networking

Issue/Introduction

Symptoms:
  • Host VDR (DLR instance on the host) routes are deleted.
  • Workload VMs experience a network outage.
  • The following entries are observed in netcpa.log on the ESXi host:
2021-03-26T04:16:02.646Z [ 4117700 info ] Received vdr instance message numVdrId 12 ...
2021-03-26T04:16:02.646Z [ 4117700 info ] Full Sync flag is true
2021-03-26T04:16:02.646Z [ 4117700 info ] Updated vdr instance vdr name = edge-117, vdr id = 8016, auth token = eca0c5d5-4753-4934-8a55-bf0fca5b897c, universal = false, localEgress = false
2021-03-26T04:16:02.646Z [ 4117700 info ] Updated vdr instance vdr name = edge-111, vdr id = 8018, auth token = afcbf14e-9598-419f-9d74-878c0ff722a6, universal = false, localEgress = false
2021-03-26T04:16:02.646Z [ 4117700 info ] No flap edge CP link for vdr id 8016
2021-03-26T04:16:02.646Z [ 4117700 info ] No flap edge CP link for vdr id 8018
2021-03-26T04:16:02.779Z [ 3F94700 info ] Vxlan: Change sid 2370391895 to controller 10.123.111.12:0

 
  • A sequence number mismatch can be observed in netcpa.log:
2021-03-22T06:43:47.186Z [ 52C81700 error ] ConfigManager::CheckSequenceNumber get mismatched sequence number, host seqNum 29129, vsm seqNum 28447
 
  • The following entries are observed in vmkernel.log on the ESXi host:
2021-03-22T06:43:47.229Z [ 52D83700 error ] Failed to parse message payload
2021-03-22T06:43:56.079Z [ 52C00700 info ] Purged stale vdrId 8016, vdrName edge-117, universal: false, localEgress: false
2021-03-22T06:43:56.079Z [ 52C00700 info ] Purged stale vdrId 8020, vdrName edge-120, universal: false, localEgress: false
2021-03-22T06:43:56.079Z [ 52C00700 info ] Purged stale vdrId 8002, vdrName edge-84, universal: false, localEgress: false

2021-03-22T06:43:56.079Z cpu43:42393155)vdrb: VdrCPProcessVdrInstanceMsg:2646: CP:Received Instance message VdrName = edge-117, [I:0x1f50], Type = DEL
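
The symptoms above can be checked for on a copy of the host logs with a short script. The following is a minimal sketch, not an official tool: the regular expressions are derived only from the log excerpts in this article, so other builds may word these messages differently, and the names `PATTERNS` and `scan_lines` are hypothetical.

```python
import re
from collections import defaultdict

# Patterns copied from the log excerpts in this article (assumption:
# the wording is the same on the host being checked).
PATTERNS = {
    "seq_mismatch": re.compile(
        r"CheckSequenceNumber get mismatched sequence number, "
        r"host seqNum (?P<host>\d+), vsm seqNum (?P<vsm>\d+)"
    ),
    "parse_failure": re.compile(r"Failed to parse message payload"),
    "purged_vdr": re.compile(
        r"Purged stale vdrId (?P<vdr_id>\d+), vdrName (?P<vdr_name>[^,\s]+)"
    ),
    "vdr_delete": re.compile(
        r"Received Instance message VdrName = (?P<vdr_name>[^,\s]+),.*Type = DEL"
    ),
}


def scan_lines(lines):
    """Return {symptom: [(timestamp, fields), ...]} for every matching line."""
    hits = defaultdict(list)
    for line in lines:
        # Both log formats shown above start with an ISO 8601 timestamp.
        timestamp = line.split(" ", 1)[0]
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                hits[name].append((timestamp, match.groupdict()))
    return dict(hits)
```

To use it, open a copy of netcpa.log or vmkernel.log and pass the file object to `scan_lines`. A `seq_mismatch` hit followed shortly by `purged_vdr` hits on the same host matches the pattern described in this article.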


Environment

VMware NSX Data Center for vSphere 6.4.x

Cause

This issue is host-specific: for a given DLR / VDR instance, not every host may exhibit it.

Example of a scenario in which this can be encountered:
  • Hosts are disconnected from NSX Manager, and the purge tasks created as a result delete the VDR routes.
  • The sequence number mismatch shows that messages from the Management Plane to NETCPA (control plane service), a path that leverages VSFWD (Management Plane / DFW service), were lost.
  • On detecting the sequence number mismatch, NETCPA marks the VDR instance as stale without stopping the purge timer.

Resolution

This issue is resolved in NSX 6.4.10 and later.
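
When checking many environments against the fixed release, compare versions numerically rather than as strings, because "6.4.6" sorts after "6.4.10" in a plain string comparison. The following is a minimal sketch; the helper names `parse_version` and `has_fix` are hypothetical, and it assumes the version is already available as a dotted string such as "6.4.6" (how it is obtained is outside the scope of this article).

```python
# Fixed release taken from the Resolution statement above.
FIXED_IN = (6, 4, 10)


def parse_version(text):
    """Turn '6.4.10' into (6, 4, 10) so the comparison is numeric."""
    return tuple(int(part) for part in text.strip().split("."))


def has_fix(version_text):
    """Return True if the given dotted version is 6.4.10 or later."""
    return parse_version(version_text) >= FIXED_IN
```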

Workaround:
  • Execute a Force Sync of routing on the impacted cluster.
  • Ensure there are no underlying issues with the hosts that could cause a disconnect from the NSX Manager.