NSX-v Controller(s) rejecting the requests from hosts to join VNI(s)

Article ID: 315194

Products

VMware NSX

Issue/Introduction

Symptoms:

  • NSX Data Center for vSphere version is 6.3.x, or 6.4.x earlier than 6.4.5
  • ESXi host logs (netcpa.log) may display message(s) similar to:

#grep -hE "notification from controller for invalid|notification from controller error" /var/run/log/netcpa.log
2019-12-07T11:43:03.643Z error netcpa[2104030] [Originator@6876 sub=Default] Vxlan: notification from controller error/sub:4/1
2019-12-07T11:43:03.643Z error netcpa[2104030] [Originator@6876 sub=Default] Vxlan: notification from controller for invalid VNI 15212, switchID 0
2019-12-07T11:43:08.645Z error netcpa[2104030] [Originator@6876 sub=Default] Vxlan: notification from controller error/sub:4/1
2019-12-07T11:43:08.645Z error netcpa[2104030] [Originator@6876 sub=Default] Vxlan: notification from controller for invalid VNI 15211, switchID 0
2019-12-07T11:43:08.645Z error netcpa[2104030] [Originator@6876 sub=Default] Vxlan: notification from controller error/sub:4/1
2019-12-07T11:43:08.645Z error netcpa[2104030] [Originator@6876 sub=Default] Vxlan: notification from controller for invalid VNI 15212, switchID 0

  • NSX Controller logs (syslog) display message(s) similar to:

#show log syslog filtered-by "Try to join VNI.* not assigned to this node by TransportSwitch"
2018-12-12T11:15:18.226817+00:00 2018-12-12 11: 15:18,226 2513561004 [vxlan worker 3] WARN com.vmware.controller.apps.vxlan.VxlanService  - Try to join VNI 5008 not assigned to this node by TransportSwitch [Connection [ip=192.168.1.11:21641, cnnId=46], swId=0]
2018-12-12T11:15:18.227796+00:00 2018-12-12 11: 15:18,227 2513561005 [vxlan worker 2] WARN com.vmware.controller.apps.vxlan.VxlanService  - Try to join VNI 5021 not assigned to this node by TransportSwitch [Connection [ip=192.168.1.11:21641, cnnId=46], swId=0]
2018-12-12T11:15:18.755600+00:00 2018-12-12 11: 15:18,755 2513561533 [vxlan worker 2] WARN com.vmware.controller.apps.vxlan.VxlanService  - Try to join VNI 5000 not assigned to this node by TransportSwitch [Connection [ip=10.139.211.138:42621, cnnId=50], swId=0]
2018-12-12T11:15:18.756518+00:00 2018-12-12 11: 15:18,756 2513561534 [vxlan worker 1] WARN com.vmware.controller.apps.vxlan.VxlanService  - Try to join VNI 5018 not assigned to this node by TransportSwitch [Connection [ip=192.168.1.14.138:42621, cnnId=50], swId=0]

  • ESXi host shows the NSX Controller responsible for the impacted VXLAN as "down"

#net-vdl2 -l

Example of output:
(...) output omitted
VXLAN network:  15212
                        Multicast IP:   N/A (headend replication)
                        Control plane:  Enabled ()
                        Controller:     10.10.10.10 (down) <<--------------
                        Controller Disconnected Mode: no
                        Multicast Routing Domain ID: -N/A-
                        MAC entry count:        26
                        ARP entry count:        0
                        Port count:     1
                VXLAN network:  15211
                        Multicast IP:   N/A (headend replication)
                        Control plane:  Enabled ()
                        Controller:     10.10.10.10 (down) <<--------------
                        Controller Disconnected Mode: no
                        Multicast Routing Domain ID: -N/A-
                        MAC entry count:        1
                        ARP entry count:        0
(...) output omitted

  • ESXi host has an established TCP connection on port 1234 to the NSX Controller marked as "down":

#esxcli network ip connection list | grep "1234 "
Example of output:
tcp         0       0  10.200.19.27:20306   10.10.10.10:1234     ESTABLISHED   2103863  host1  netcpa-worker
tcp         0       0  10.200.19.27:20305   10.10.10.9:1234      ESTABLISHED   2103839  host1  netcpa-worker
tcp         0       0  10.200.19.27:20304   10.10.10.8:1234      ESTABLISHED   2103842  host1  netcpa-worker
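
The host-side symptoms above can be cross-checked against each other. The following is a minimal Python sketch, not a VMware tool: it assumes the netcpa.log lines and the net-vdl2 output have been saved as text, and that their format matches the samples shown in this article. The function names and regular expressions are illustrative assumptions.

```python
import re

# Patterns derived from the sample output in this article (assumption:
# other builds may format these lines differently).
INVALID_VNI = re.compile(r"notification from controller for invalid VNI (\d+)")
VXLAN_HEADER = re.compile(r"^\s*VXLAN network:\s+(\d+)")
CONTROLLER = re.compile(r"^\s*Controller:\s+(\S+)\s+\((\w+)\)")


def rejected_vnis(netcpa_log):
    """Return the set of VNIs that netcpa.log reports as invalid."""
    return {int(m.group(1)) for m in INVALID_VNI.finditer(netcpa_log)}


def down_vnis(net_vdl2_output):
    """Map each VNI whose controller is marked 'down' to that controller's IP."""
    result = {}
    current_vni = None
    for line in net_vdl2_output.splitlines():
        header = VXLAN_HEADER.match(line)
        if header:
            current_vni = int(header.group(1))
            continue
        ctrl = CONTROLLER.match(line)
        if ctrl and current_vni is not None and ctrl.group(2) == "down":
            result[current_vni] = ctrl.group(1)
    return result


# Sample data copied (abridged) from the output shown in this article.
NETCPA_SAMPLE = """\
2019-12-07T11:43:03.643Z error netcpa[2104030] [Originator@6876 sub=Default] Vxlan: notification from controller error/sub:4/1
2019-12-07T11:43:03.643Z error netcpa[2104030] [Originator@6876 sub=Default] Vxlan: notification from controller for invalid VNI 15212, switchID 0
2019-12-07T11:43:08.645Z error netcpa[2104030] [Originator@6876 sub=Default] Vxlan: notification from controller for invalid VNI 15211, switchID 0
"""

VDL2_SAMPLE = """\
VXLAN network:  15212
        Control plane:  Enabled ()
        Controller:     10.10.10.10 (down) <<--------------
        Controller Disconnected Mode: no
VXLAN network:  15211
        Control plane:  Enabled ()
        Controller:     10.10.10.10 (down) <<--------------
"""

rejected = rejected_vnis(NETCPA_SAMPLE)
# VNIs that are both rejected in netcpa.log and shown with a "down" controller.
matches = {vni: ip for vni, ip in down_vnis(VDL2_SAMPLE).items() if vni in rejected}
print(matches)
```

With the sample data, both VNI 15212 and VNI 15211 appear in the result, each mapped to controller 10.10.10.10. A VNI that is rejected in netcpa.log but not shown with a "down" controller, or the reverse, suggests that a different problem may be involved.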

 

 

Environment

VMware NSX for vSphere 6.4.x
VMware NSX for vSphere 6.3.x

Cause

A race condition can occur between two cluster events (Cluster down and Sharding Update) in the Controller. When this race occurs, the Controller may reject VNI join requests from ESXi hosts.

Resolution

This issue is resolved in VMware NSX for vSphere 6.4.5, available at VMware Downloads.


Workaround:
To work around the issue, restart the impacted NSX Controller. To identify the impacted NSX Controller, follow the steps below:

1.  On the ESXi host, identify the NSX Controller(s) marked as "down":
# net-vdl2 -l

Example of output:
(...) output omitted
VXLAN network:  15212
                        Multicast IP:   N/A (headend replication)
                        Control plane:  Enabled ()
                        Controller:     10.10.10.10 (down) <<--------------
                        Controller Disconnected Mode: no
                        Multicast Routing Domain ID: -N/A-
                        MAC entry count:        26
                        ARP entry count:        0
                        Port count:     1
                VXLAN network:  15211
                        Multicast IP:   N/A (headend replication)
                        Control plane:  Enabled ()
                        Controller:     10.10.10.10 (down) <<--------------
                        Controller Disconnected Mode: no
                        Multicast Routing Domain ID: -N/A-
                        MAC entry count:        1
                        ARP entry count:        0
(...) output omitted

2.  Confirm the ESXi host has a TCP connection open on port 1234 to the NSX Controller(s) marked as "down":
#esxcli network ip connection list | grep "1234 "

Example of output:
tcp         0       0  10.200.19.27:20306   10.10.10.10:1234     ESTABLISHED   2103863  host1  netcpa-worker
tcp         0       0  10.200.19.27:20305   10.10.10.9:1234      ESTABLISHED   2103839  host1  netcpa-worker
tcp         0       0  10.200.19.27:20304   10.10.10.8:1234      ESTABLISHED   2103842  host1  netcpa-worker

 
3.  Review the NSX Controller logs and confirm that they contain the message "Try to join VNI XXX not assigned to this node by TransportSwitch":
#show log syslog filtered-by "Try to join VNI.* not assigned to this node by TransportSwitch"

Example of output:
2018-12-12T11:15:18.226817+00:00 2018-12-12 11: 15:18,226 2513561004 [vxlan worker 3] WARN com.vmware.controller.apps.vxlan.VxlanService  - Try to join VNI 5008 not assigned to this node by TransportSwitch [Connection [ip=192.168.1.11:21641, cnnId=46], swId=0]
2018-12-12T11:15:18.227796+00:00 2018-12-12 11: 15:18,227 2513561005 [vxlan worker 2] WARN com.vmware.controller.apps.vxlan.VxlanService  - Try to join VNI 5021 not assigned to this node by TransportSwitch [Connection [ip=192.168.1.11:21641, cnnId=46], swId=0]
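
The three steps above can be sketched as a small script that works on saved command output. This is a hedged illustration only: the function names are hypothetical, the parsing assumes the column layout and message format shown in the examples in this article, and the script does not restart anything.

```python
import re

# Patterns derived from the sample output in this article (assumption).
CONTROLLER = re.compile(r"^\s*Controller:\s+(\S+)\s+\((\w+)\)", re.MULTILINE)
JOIN_REJECTED = re.compile(
    r"Try to join VNI (\d+) not assigned to this node by TransportSwitch "
    r"\[Connection \[ip=([^:\]]+):"
)


def controllers_marked_down(net_vdl2_output):
    """Step 1: controller IPs that net-vdl2 -l reports as 'down'."""
    return {ip for ip, state in CONTROLLER.findall(net_vdl2_output) if state == "down"}


def established_controller_ips(connection_list, port=1234):
    """Step 2: remote IPs with an ESTABLISHED TCP connection on the given port."""
    ips = set()
    for line in connection_list.splitlines():
        fields = line.split()
        # Columns per the sample: proto, recv-q, send-q, local, foreign, state, ...
        if len(fields) >= 6 and fields[0] == "tcp" and fields[5] == "ESTABLISHED":
            ip, _, remote_port = fields[4].rpartition(":")
            if remote_port == str(port):
                ips.add(ip)
    return ips


def rejected_joins(controller_syslog):
    """Step 3: (VNI, host IP) pairs for each rejected join in the controller log."""
    return [(int(vni), ip) for vni, ip in JOIN_REJECTED.findall(controller_syslog)]


def restart_candidates(net_vdl2_output, connection_list):
    """Controllers that are marked 'down' yet still hold a TCP session."""
    return controllers_marked_down(net_vdl2_output) & established_controller_ips(connection_list)


# Sample data copied (abridged) from the output shown in this article.
VDL2_SAMPLE = """\
VXLAN network:  15212
        Controller:     10.10.10.10 (down) <<--------------
        Controller Disconnected Mode: no
"""
CONN_SAMPLE = """\
tcp         0       0  10.200.19.27:20306   10.10.10.10:1234     ESTABLISHED   2103863  host1  netcpa-worker
tcp         0       0  10.200.19.27:20305   10.10.10.9:1234      ESTABLISHED   2103839  host1  netcpa-worker
"""
SYSLOG_SAMPLE = """\
2018-12-12T11:15:18.226817+00:00 2018-12-12 11: 15:18,226 2513561004 [vxlan worker 3] WARN com.vmware.controller.apps.vxlan.VxlanService  - Try to join VNI 5008 not assigned to this node by TransportSwitch [Connection [ip=192.168.1.11:21641, cnnId=46], swId=0]
2018-12-12T11:15:18.227796+00:00 2018-12-12 11: 15:18,227 2513561005 [vxlan worker 2] WARN com.vmware.controller.apps.vxlan.VxlanService  - Try to join VNI 5021 not assigned to this node by TransportSwitch [Connection [ip=192.168.1.11:21641, cnnId=46], swId=0]
"""

candidates = restart_candidates(VDL2_SAMPLE, CONN_SAMPLE)
joins = rejected_joins(SYSLOG_SAMPLE)
print(candidates)
print(joins)
```

With the sample data, only 10.10.10.10 is both marked "down" and still connected on port 1234, so it is the only candidate. The rejected joins from step 3 should be checked in the logs of that same controller before it is restarted.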


Additional Information

Impact/Risks:
This issue causes a network outage for VMs on the affected ESXi host that are connected to the impacted Logical Switch.
This is a known issue affecting:  
  • VMware NSX for vSphere 6.3.3 to NSX for vSphere 6.3.7
  • VMware NSX for vSphere 6.4.0 to NSX for vSphere 6.4.4.
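
The affected ranges listed above can be expressed as a simple version check. This is a minimal Python sketch under the assumption that the version is available as a dotted string such as "6.4.4"; the function names are hypothetical and are not part of any NSX API.

```python
# Affected ranges as listed in this article (inclusive on both ends).
AFFECTED_RANGES = [
    ((6, 3, 3), (6, 3, 7)),
    ((6, 4, 0), (6, 4, 4)),
]


def parse_version(text):
    """Turn '6.4.4' into (6, 4, 4); missing components are treated as 0."""
    parts = [int(p) for p in text.strip().split(".")[:3]]
    return tuple(parts + [0] * (3 - len(parts)))


def is_affected(version):
    """Return True if the version falls inside one of the affected ranges."""
    v = parse_version(version)
    return any(low <= v <= high for low, high in AFFECTED_RANGES)


for candidate in ("6.3.7", "6.4.4", "6.4.5"):
    print(candidate, is_affected(candidate))
```

For these inputs the check reports 6.3.7 and 6.4.4 as affected and 6.4.5, the release that contains the fix, as not affected.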