NSX data plane disconnects impacting entire environment

Article ID: 312641

Updated On:

Products

VMware NSX

Issue/Introduction

  • Host VDR (DLR instance on the host) routes are deleted.
  • Workload VMs experience a network outage.
  • The below entries are observed in netcpa.log on the ESXi host:
2021-03-26T04:16:02.646Z [ 4117700 info ] Received vdr instance message numVdrId 12 ...
2021-03-26T04:16:02.646Z [ 4117700 info ] Full Sync flag is true
2021-03-26T04:16:02.646Z [ 4117700 info ] Updated vdr instance vdr name = edge-117, vdr id = 8016, auth token = eca0c5d5-####-####-####-bf0fca5b897c, universal = false, localEgress = false
2021-03-26T04:16:02.646Z [ 4117700 info ] Updated vdr instance vdr name = edge-111, vdr id = 8018, auth token = afcbf14e-####-####-####-878c0ff722a6, universal = false, localEgress = false
2021-03-26T04:16:02.646Z [ 4117700 info ] No flap edge CP link for vdr id 8016
2021-03-26T04:16:02.646Z [ 4117700 info ] No flap edge CP link for vdr id 8018
2021-03-26T04:16:02.779Z [ 3F94700 info ] Vxlan: Change sid 2370391895 to controller 10.123.111.12:0

 
  • A sequence number mismatch can be observed in netcpa.log:
2021-03-22T06:43:47.186Z [ 52C81700 error ] ConfigManager::CheckSequenceNumber get mismatched sequence number, host seqNum 29129, vsm seqNum 28447
  • The following entries are observed in vmkernel.log on the ESXi host:
2021-03-22T06:43:47.229Z [ 52D83700 error ] Failed to parse message payload
2021-03-22T06:43:56.079Z [ 52C00700 info ] Purged stale vdrId 8016, vdrName edge-117, universal: false, localEgress: false
2021-03-22T06:43:56.079Z [ 52C00700 info ] Purged stale vdrId 8020, vdrName edge-120, universal: false, localEgress: false
2021-03-22T06:43:56.079Z [ 52C00700 info ] Purged stale vdrId 8002, vdrName edge-84, universal: false, localEgress: false

2021-03-22T06:43:56.079Z cpu43:42393155)vdrb: VdrCPProcessVdrInstanceMsg:2646: CP:Received Instance message VdrName = edge-117, [I:0x1f50], Type = DEL
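The log signatures above can be checked for with a small script. The following is a minimal sketch, not a VMware-provided tool: the function name scan_log_lines and the shape of the returned dictionary are assumptions made for illustration, and the patterns are derived only from the sample entries shown in this article, so they may need adjusting for other builds.

```python
import re

# Patterns derived from the sample log entries in this article (assumption:
# the message text is the same on the affected build).
SEQ_MISMATCH = re.compile(
    r"CheckSequenceNumber get mismatched sequence number, "
    r"host seqNum (?P<host>\d+), vsm seqNum (?P<vsm>\d+)"
)
PURGED = re.compile(
    r"Purged stale vdrId (?P<vdr_id>\d+), vdrName (?P<name>[^,\s]+)"
)
INSTANCE_DEL = re.compile(
    r"Received Instance message VdrName = (?P<name>[^,\s]+),.*Type = DEL"
)


def scan_log_lines(lines):
    """Collect the three symptoms described above from an iterable of log lines.

    Hypothetical helper: returns a dict of lists, one list per symptom.
    """
    findings = {"seq_mismatches": [], "purged_vdrs": [], "deleted_instances": []}
    for line in lines:
        match = SEQ_MISMATCH.search(line)
        if match:
            findings["seq_mismatches"].append(
                (int(match["host"]), int(match["vsm"]))
            )
            continue
        match = PURGED.search(line)
        if match:
            findings["purged_vdrs"].append((int(match["vdr_id"]), match["name"]))
            continue
        match = INSTANCE_DEL.search(line)
        if match:
            findings["deleted_instances"].append(match["name"])
    return findings
```

To use it, copy netcpa.log and vmkernel.log from the host (their location varies, so no path is assumed here), open each file, and pass the file object to scan_log_lines. A non-empty result for all three keys on the same host is consistent with the symptoms described in this article.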

Environment

VMware NSX Data Center for vSphere 6.4.x

Cause

This issue is host-specific: for a given DLR / VDR instance, only some hosts may be affected.

Example of a scenario in which this can be encountered:
  • Hosts are disconnected from NSX Manager, and the purge tasks created as a result delete the VDR routes.
  • The sequence number mismatch indicates that messages from the Management Plane to NETCPA (control plane service), which leverages VSFWD (Management Plane / DFW service), were lost.
  • On detecting the sequence number mismatch, NETCPA marks the VDR instance as stale without stopping the purge timer.

Resolution

This issue is resolved in VMware NSX Data Center for vSphere 6.4.10.

Workaround:
Execute a Force Sync of routing on the impacted cluster.
Ensure there are no underlying issues with the hosts that could cause a disconnect from NSX Manager.