Netcpa issues in VMware NSX for vSphere 6.x

Products

VMware NSX

Issue/Introduction

Symptoms:

Virtual machines running on the same ESXi host fail to communicate with each other
Virtual machines fail to communicate with the NSX Edge Gateway (ESG)
Routing does not appear to be functioning despite having a defined route for the NSX Edge Gateway
Rebooting the NSX Edge does not resolve the issue
Running the esxcli network vswitch dvs vmware vxlan network list --vds-name=Name_VDS command on the ESXi host displays the VNIs as down

For example:

~ # esxcli network vswitch dvs vmware vxlan network list --vds-name=Compute_VDS
VXLAN ID  Multicast IP               Control Plane                        Controller Connection  Port Count  MAC Entry Count  ARP Entry Count
--------  -------------------------  -----------------------------------  ---------------------  ----------  ---------------  ---------------
5001      N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  x.x.x.x (down)         1           1                0
5000      N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  x.x.x.x (down)         1           0                0
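
If the command output has been saved to a file or captured by a script, the check for down VNIs can be automated. The following Python sketch is one possible way to do this. It assumes the row layout shown in the example above and assumes that a healthy connection is reported as "(up)", which this article does not show. The function name down_vnis is hypothetical.

```python
import re

# Matches a data row of the "vxlan network list" output, for example:
# "5001 N/A (headend replication) Enabled (...) x.x.x.x (down) 1 1 0"
# Assumption: a healthy connection is shown as "(up)".
ROW = re.compile(
    r"^\s*(?P<vni>\d+)\s.*\s(?P<controller>\S+)\s+\((?P<state>up|down)\)"
    r"\s+\d+\s+\d+\s+\d+\s*$"
)


def down_vnis(esxcli_output):
    """Return (VNI, controller) pairs whose controller connection is down."""
    result = []
    for line in esxcli_output.splitlines():
        match = ROW.match(line)  # header and separator lines do not match
        if match and match.group("state") == "down":
            result.append((int(match.group("vni")), match.group("controller")))
    return result


sample = """\
VXLAN ID Multicast IP Control Plane Controller Connection Port Count MAC Entry Count ARP Entry Count
-------- ------------------------- ----------------------------------- --------------------- ---------- --------------- ---------------
5001 N/A (headend replication) Enabled (multicast proxy,ARP proxy) x.x.x.x (down) 1 1 0
5000 N/A (headend replication) Enabled (multicast proxy,ARP proxy) x.x.x.x (down) 1 0 0
"""

for vni, controller in down_vnis(sample):
    print(f"VNI {vni}: connection to controller {controller} is down")
```

The pattern anchors on the three trailing counters and the parenthesized state, so it works for both the single-space and the column-aligned form of the output.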

In the /var/log/netcpa.log file on the ESXi host, you see entries similar to:

2015-07-16T16:18:58.340Z [FFC97B70 info 'Default'] Vdrb: core app ready on x.x.x.x:0
2015-07-16T16:18:58.341Z [FFC97B70 info 'Default'] Core: Controller is ready: x.x.x.x:0
2015-07-16T16:19:27.112Z [FFDBBB70 error 'Default'] Async read callback failed, connection x.x.x.x:0 was shutdown by peer.
2015-07-16T16:19:27.113Z [FFD7AB70 info 'Default'] Vxlan: ctrl connection x.x.x.x:0 down
2015-07-16T16:19:27.113Z [FFD7AB70 info 'Default'] Vdrb: ctrl connection x.x.x.x:0 down
2015-07-16T16:19:28.350Z [FFD7AB70 info 'Default'] Core: Hello sent: x.x.x.x:0
2015-07-16T16:19:28.350Z [FFC97B70 info 'Default'] Vxlan: received freqCtrlPeriod 1000 freqCtrlQuery 100 freqCtrlUpdate 20
2015-07-16T16:19:28.350Z [FFC97B70 info 'Default'] Vxlan: received bteAgeingTime 300
2015-07-16T16:19:28.350Z [FFC97B70 info 'Default'] Vxlan: received arpAgeingTime 300
2015-07-16T16:19:28.350Z [FFC97B70 info 'Default'] Core: Max Pkt Len of peer x.x.x.x: 4096

The /var/log/netcpa.log file may also show the Controller is ready entry repeating at roughly 30-second intervals, similar to:

2015-11-02T14:36:10.341Z [5BB13B70 info 'Default'] Core: Controller is ready: x.x.x.x:0
2015-11-02T14:36:40.443Z [5B9EFB70 info 'Default'] Core: Controller is ready: x.x.x.x:0
2015-11-02T14:37:10.364Z [5BB13B70 info 'Default'] Core: Controller is ready: x.x.x.x:0
2015-11-02T14:37:40.385Z [5BA91B70 info 'Default'] Core: Controller is ready: x.x.x.x:0

The /var/log/netcpa.log file may also show VNI Membership Update (Join) messages being sent to the controller, similar to:

netcpa.log:2015-11-02T14:39:40.380Z [5B96DB70 info 'Default'] Vxlan: send VNI Membership Update(Join) to the controller: VNI 10119 controller x.x.x.x
netcpa.log:2015-11-02T14:39:40.380Z [5B96DB70 info 'Default'] Vxlan: send VNI Membership Update(Join) to the controller: VNI 10122 controller x.x.x.x
netcpa.log:2015-11-02T14:39:40.380Z [5B96DB70 info 'Default'] Vxlan: send VNI Membership Update(Join) to the controller: VNI 10124 controller x.x.x.x
netcpa.log:2015-11-02T14:39:40.380Z [5B96DB70 info 'Default'] Vxlan: send VNI Membership Update(Join) to the controller: VNI 10127 controller x.x.x.x

The /var/log/netcpa.log file may also show the controller connection being shut down by the peer at roughly 30-second intervals, similar to:

2015-11-02T14:38:09.152Z [5B8EBB70 error 'Default'] Async read callback failed, connection x.x.x.x:0 was shutdown by peer.
2015-11-02T14:38:39.154Z [5B8EBB70 error 'Default'] Async read callback failed, connection x.x.x.x:0 was shutdown by peer.
2015-11-02T14:39:09.157Z [5B8EBB70 error 'Default'] Async read callback failed, connection x.x.x.x:0 was shutdown by peer.
2015-11-02T14:39:39.159Z [5B9EFB70 error 'Default'] Async read callback failed, connection x.x.x.x:0 was shutdown by peer.

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
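
To check a copy of netcpa.log for the recurring disconnect pattern shown above, a small script can extract the timestamps of the "was shutdown by peer" entries and test whether they repeat in quick succession. The following Python sketch is only an illustration. The function names and the thresholds (at least three events, at most 60 seconds apart) are assumptions chosen for this example, not values defined by NSX.

```python
import re
from datetime import datetime

# Matches the disconnect entries quoted in this article. A leading
# "netcpa.log:" prefix (as left by grep) is tolerated because search() is used.
DISCONNECT = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z .*"
    r"Async read callback failed, connection \S+ was shutdown by peer"
)


def disconnect_times(log_text):
    """Return the timestamps of all 'shutdown by peer' entries, in file order."""
    times = []
    for line in log_text.splitlines():
        match = DISCONNECT.search(line)
        if match:
            times.append(
                datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
            )
    return times


def looks_like_reconnect_loop(times, min_events=3, max_gap_seconds=60.0):
    """Heuristic: several disconnects, each soon after the previous one.

    Both thresholds are assumptions for this sketch.
    """
    if len(times) < min_events:
        return False
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return all(0 < gap <= max_gap_seconds for gap in gaps)


sample_log = """\
2015-11-02T14:38:09.152Z [5B8EBB70 error 'Default'] Async read callback failed, connection x.x.x.x:0 was shutdown by peer.
2015-11-02T14:38:39.154Z [5B8EBB70 error 'Default'] Async read callback failed, connection x.x.x.x:0 was shutdown by peer.
2015-11-02T14:39:09.157Z [5B8EBB70 error 'Default'] Async read callback failed, connection x.x.x.x:0 was shutdown by peer.
2015-11-02T14:39:39.159Z [5B9EFB70 error 'Default'] Async read callback failed, connection x.x.x.x:0 was shutdown by peer.
"""

times = disconnect_times(sample_log)
print(len(times), "disconnects; reconnect loop suspected:",
      looks_like_reconnect_loop(times))
```

A positive result only indicates that the log matches the symptom. It does not by itself confirm the cause described in this article.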

Environment

VMware NSX for vSphere 6.1.x

Cause

This issue occurs because netcpa fails to reset the _txInProgress flag of its msgScheduler when it tries to reconnect to the NSX Controller.

While this flag is set to true, msgScheduler treats a message as still being sent, so netcpa does not send any new messages to the controller.

An example of a live netcpa core dump:

(gdb) p (MsgScheduler *)0x1f0339a8->_txInProgress
Attempt to extract a component of a value that is not a structure pointer.
(gdb) p ((MsgScheduler *)0x1f0339a8)->_txInProgress
$5 = true =========================================> this should be false
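
The effect of a stale in-progress flag can be illustrated with a toy model. The following Python sketch is not netcpa source code. The class ToyMsgScheduler and all of its methods are hypothetical and exist only to show why a flag that is not cleared on reconnect blocks every later send.

```python
class ToyMsgScheduler:
    """Hypothetical stand-in for a message scheduler; not netcpa code."""

    def __init__(self, reset_on_reconnect):
        self.tx_in_progress = False
        self.reset_on_reconnect = reset_on_reconnect
        self.sent = []

    def send(self, message):
        # A new message is transmitted only when no send is pending.
        if self.tx_in_progress:
            return False
        self.tx_in_progress = True
        self.sent.append(message)
        return True

    def on_tx_complete(self):
        # Normal path: the pending send finished, so the flag is cleared.
        self.tx_in_progress = False

    def on_reconnect(self):
        # Fixed behavior clears the flag; the faulty behavior leaves it stale.
        if self.reset_on_reconnect:
            self.tx_in_progress = False


for reset in (False, True):
    scheduler = ToyMsgScheduler(reset_on_reconnect=reset)
    scheduler.send("hello")   # the connection drops before completion
    scheduler.on_reconnect()  # the agent reconnects to the controller
    accepted = scheduler.send("VNI join")
    print(f"reset_on_reconnect={reset}: next message sent = {accepted}")
```

In this model, the scheduler that does not reset the flag on reconnect rejects every message after the dropped connection, which corresponds to the value true seen in the core dump above.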

Resolution

This issue is resolved in:

VMware NSX for vSphere 6.1.5.
VMware NSX for vSphere 6.2.

If you are unable to upgrade, follow the workaround.

To work around the issue, restart the netcpa service on the affected ESXi host(s).

Log in as root to the ESXi host through SSH or through the console.
Run the following command to restart the netcpa agent on the ESXi host:

/etc/init.d/netcpad restart

VMware NSX for vSphere 6.2 introduces a proactive health check that periodically reports the status of the connection between the central control plane and the local control plane to NSX Manager, where it is displayed in the NSX Manager UI. This report also serves as a heartbeat for detecting the operational status of the channel between NSX Manager and netcpa on the ESXi host.