Creating vdrPort, Pushing DLR LIFs to the ESXi host in an NSX prepared cluster fails

Article ID: 321412

Products

VMware NSX for vSphere

Issue/Introduction

  • Pushing DLR LIFs to the ESXi host in an NSX prepared cluster fails.
  • Adding a distributed router fails, or the router is created but does not pass traffic. (For example: connect a logical switch to a DLR, add a VM to the logical switch, and then try to ping the VM's default gateway, which resides on the DLR.)
  • On the ESXi host, when you check for the vdrPort in the esxtop network view (press n), it is not present. (A quick check sequence is sketched after the command output below.)
  • Checking for vdr instances displays an instance created but no LIFs or routes:

net-vdr -I -l

VDR Instance Information :
---------------------------
Vdr Name:                   default+edge-14
Vdr Id:                     0x00001388
Number of Lifs:             0
Number of Routes:           0
Number of Hold Pkts:        0
Number of Neighbors:        0
State:                      Enabled
Controller IP:              10.95.10.173
Control Plane IP:           10.95.10.115
Control Plane Active:       Yes
Num unique nexthops:        0
Generation Number:          0
Edge Active:                No
Pmac:                       00:00:00:00:00:00

net-vdr -C -l default+edge-14

Host locale Id:             423bccb4-####-####-####-9e29bfc6e607
Connection Information:
-----------------------
DvsName           VdrPort           NumLifs  VdrVmac
-------           -------           -------  -------

The net-vdr -C -l output contains no connection information.
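
A quick way to confirm the missing vdrPort from the ESXi shell is sketched below. The net-stats command is included only as an assumed non-interactive alternative to esxtop (it lists the host's ports, where the vdrPort normally appears when it exists); the net-vdr commands are the same ones shown above.

# Interactive check: run esxtop, then press n to switch to the network view
esxtop

# Non-interactive alternative: list host ports and filter for the vdrPort
net-stats -l | grep -i vdr

# List the VDR instance and its connection information (as shown above)
net-vdr -I -l
net-vdr -C -l default+edge-14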

 

Environment

VMware NSX for vSphere 6.0.x
VMware NSX for vSphere 6.1.x
VMware NSX for vSphere 6.2.x
VMware NSX for vSphere 6.3.x
VMware NSX for vSphere 6.4.x

Cause

This issue occurs because the vSphere Distributed Switch (vDS) used for VXLAN has more than one Link Aggregation Group (LAG) configured, which is an unsupported configuration.
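
To confirm how many LAGs the host actually sees on the vDS, the LACP configuration can be queried from the ESXi shell. This is a minimal check sketch; the esxcli lacp namespace is available on ESXi 5.5 and later, and more than one LAG reported for the vDS used for VXLAN indicates the unsupported configuration described above.

# Show the LACP (LAG) configuration the host has received for its distributed switches
esxcli network vswitch dvs vmware lacp config get

# Show the runtime status of each LAG and its member uplinks
esxcli network vswitch dvs vmware lacp status get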

Resolution

VMware recommends using two separate vSphere Distributed Switches, each with its own LAG.

Note: Only one LAG is supported on the vSphere Distributed Switch (vDS) used for the VXLAN configuration. If this vDS has more than one LAG, creation of the vdrPort fails.

For example, use one LAG (on the VXLAN vDS) for NSX Distributed Logical Router (DLR) traffic, and another LAG (on a separate vDS) for non-NSX traffic such as storage and management.

Workaround:
To work around this issue:

  1. Remove all but one LAG from the vDS used for VXLAN, so that only a single LAG remains on that switch.
  2. From the NSX Manager user interface (UI), unconfigure VXLAN and then reconfigure it. This requires deleting the logical switches (after disconnecting them from the NSX Edges) and removing the cluster from the transport zone (if it is the last cluster in the transport zone, delete the transport zone). A quick verification from the ESXi shell is sketched after these steps.
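
After VXLAN is reconfigured, the result can be verified from the ESXi shell by re-running the commands from the symptoms above; the instance name default+edge-14 is only an example and will differ in your environment.

# The VDR instance should now report its LIFs and routes
net-vdr -I -l

# The connection table should now list the vDS, the vdrPort, and the LIF count
net-vdr -C -l default+edge-14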

Additional Information

  • In the /var/log/vmkernel.log file of the affected ESXi host, you see entries similar to:

    2016-09-30T20:16:09.077Z cpu0:35425)vdrb: VdrCPProcessVdrInstanceMsg:2376: CP:Received Instance message VdrName = default+edge-5, [I:0x1388], Type = ADD
    2016-09-30T20:16:09.077Z cpu0:35425)vdrb: VDRCreateInstance:1032: SYS:VDR instance 5000 already available
    2016-09-30T20:16:09.077Z cpu0:35425)WARNING: vdrb: VdrCPProcessVdrInstanceMsg:2382: CP:Instance Create: Failed for [I:0x1388] status: Address already in use
    2016-09-30T20:16:09.077Z cpu0:35425)vdrb: VdrCPProcessLinkChangeMsg:2212: CP:Link Change: [I:0x1388], controller IP = 0xe0415ac, change = UP
    2016-09-30T20:16:09.077Z cpu0:35425)WARNING: vdrb: VdrCPProcessLinkChangeMsg:2221: CP:Link Change: Received Link UP for already active instance [I:0x1388], controller IP = 0xe0415ac

    2016-09-30T20:16:41.960Z cpu0:35425)WARNING: vdrb: VdrCpProcessLifUpdateMessage:1226: CP:Lif Update: Not able to find Active connection

    2016-09-30T20:16:43.627Z cpu12:61588)Uplink: 1360: lag1: not found
  • Also in the vmkernel.log file on the ESXi host, you see entries similar to:

    WARNING: vdrb: VdrInitConnectionTeamInfo:2002: CNXN:[C:,P:50331663] VM_NSX: Invalid LAG v2 configuration(2 2).

  • In the /var/log/netcpa.log file of the affected ESXi host, you see entries similar to:

    2016-09-30T21:29:14.639Z error netcpa[5092CB70] [Originator@6876 sub=Default] Cannot connect to the server 172.21.4.12:0
    2016-09-30T21:29:23.887Z info netcpa[5096DB70] [Originator@6876 sub=Default] Connected to controller 172.21.4.14:0, using source port 38466
    2016-09-30T21:29:23.887Z info netcpa[5096DB70] [Originator@6876 sub=Default] Core: Hello sent: 172.21.4.14:0
    2016-09-30T21:29:23.888Z info netcpa[5096DB70] [Originator@6876 sub=Default] Vxlan: received freqCtrlPeriod 1000 freqCtrlQuery 100 freqCtrlUpdate 20
    2016-09-30T21:29:23.888Z info netcpa[5096DB70] [Originator@6876 sub=Default] Vxlan: received bteAgeingTime 300
    2016-09-30T21:29:23.888Z info netcpa[5096DB70] [Originator@6876 sub=Default] Vxlan: received arpAgeingTime 300
    2016-09-30T21:29:23.888Z info netcpa[5096DB70] [Originator@6876 sub=Default] Core: Max Pkt Len of peer 172.21.4.14: 4096
    2016-09-30T21:29:23.888Z info netcpa[5096DB70] [Originator@6876 sub=Default] Core: KeepAlive Interval of peer 172.21.4.14: 10
    2016-09-30T21:29:23.888Z info netcpa[5096DB70] [Originator@6876 sub=Default] Core: Msg Frequency of peer 172.21.4.14: 200
    2016-09-30T21:29:23.888Z info netcpa[5096DB70] [Originator@6876 sub=Default] Core: ShardingSlice length of peer 172.21.4.14: 4194304
    2016-09-30T21:29:23.888Z error netcpa[5096DB70] [Originator@6876 sub=Default] Core: unknown data type for hello message: 172.21.4.14: 6
    2016-09-30T21:29:23.888Z info netcpa[5096DB70] [Originator@6876 sub=Default] Vxlan: core app ready on 172.21.4.14:0
    2016-09-30T21:29:23.888Z info netcpa[5096DB70] [Originator@6876 sub=Default] Vxlan: send session control(1) message to the controller: VNI 5000 controller 172.21.4.14
    2016-09-30T21:29:23.888Z info netcpa[5096DB70] [Originator@6876 sub=Default] Vdrb: core app ready on 172.21.4.14:0
    2016-09-30T21:29:23.888Z info netcpa[5096DB70] [Originator@6876 sub=Default] Core: Controller is ready: 172.21.4.14:0
    2016-09-30T21:29:23.895Z info netcpa[5092CB70] [Originator@6876 sub=Default] Vxlan: receive VNI Session Control(2) message from controller: VNI 5000 controller 172.21.4.14
    2016-09-30T21:29:24.726Z info netcpa[5088AB70] [Originator@6876 sub=Default] Connected to controller 172.21.4.12:0, using source port 38722
    2016-09-30T21:29:24.726Z info netcpa[5088AB70] [Originator@6876 sub=Default] Core: Hello sent: 172.21.4.12:0
    2016-09-30T21:29:24.728Z info netcpa[5088AB70] [Originator@6876 sub=Default] Vxlan: received freqCtrlPeriod 1000 freqCtrlQuery 100 freqCtrlUpdate 20
    2016-09-30T21:29:24.728Z info netcpa[5088AB70] [Originator@6876 sub=Default] Vxlan: received bteAgeingTime 300
    2016-09-30T21:29:24.728Z info netcpa[5088AB70] [Originator@6876 sub=Default] Core: Max Pkt Len of peer 172.21.4.12: 4096
    2016-09-30T21:29:24.728Z info netcpa[5088AB70] [Originator@6876 sub=Default] Core: KeepAlive Interval of peer 172.21.4.12: 10
    2016-09-30T21:29:24.728Z info netcpa[5088AB70] [Originator@6876 sub=Default] Core: Msg Frequency of peer 172.21.4.12: 200
    2016-09-30T21:29:24.728Z info netcpa[5088AB70] [Originator@6876 sub=Default] Core: ShardingSlice length of peer 172.21.4.12: 4194304
    2016-09-30T21:29:24.728Z error netcpa[5088AB70] [Originator@6876 sub=Default] Core: unknown data type for hello message: 172.21.4.12: 6
    2016-09-30T21:29:24.728Z info netcpa[5088AB70] [Originator@6876 sub=Default] Core: Controller is ready: 172.21.4.12:0
    2016-09-30T21:29:24.729Z info netcpa[5088AB70] [Originator@6876 sub=Default] Core: Sharding Segment Update message: server 172.21.4.14 startSliceId 0 numSlices 342


    Note: The preceding log excerpts are only examples. Dates, times, and environment-specific values vary depending on your environment.
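
To locate entries like the ones above on an affected host, the log files can be searched from the ESXi shell. The grep patterns below are only examples based on these excerpts.

# DLR control-plane and LAG-related messages in the VMkernel log
grep -iE "vdrb|lag1" /var/log/vmkernel.log

# Controller connection messages in the netcpa log
grep -i controller /var/log/netcpa.log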