Unable to vMotion VMs due to error: "Currently connected network interface"

Products

VMware NSX

Issue/Introduction

Symptoms:

Workloads cannot be migrated with vMotion to specific hosts where the lcp-ccp session is down.
vMotion to those hosts fails with the error:

"Currently connected network interface" "Network Adapter 1" uses network 'DVSwitch[50 29 dd 1a c9 58 df 20-a6 c1 5a 82 a4 d2 21 32} NSX Port Group {dvportgroup-2003}(lcp.ccpSession down)'. Which is not accessible

NOTE: The DVS UUID and dvportgroup ID will differ in your environment.

In /var/run/log/nsx-syslog.log, entries similar to the following are seen:

2021-11-08T15:45:33Z cfgAgent: NSX 2480940 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="A5FE0700" level="warn"] DaemonHealthMonitor: nsx-proxy echo timeout (60 sec)
2021-11-08T15:45:38Z cfgAgent: NSX 2480940 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="A5FE0700" level="info"] DaemonHealthMonitor: nsx-proxy connected
2021-11-08T15:46:38Z cfgAgent: NSX 2480940 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="A5FE0700" level="warn"] DaemonHealthMonitor: nsx-proxy echo timeout (60 sec)
2021-11-08T15:46:38Z cfgAgent: NSX 2480940 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="A5FE0700" level="warn"] DaemonHealthMonitor: nsx-proxy disconnected
2021-11-08T15:46:43Z cfgAgent: NSX 2480940 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="A5FE0700" level="warn"] DaemonHealthMonitor: nsx-proxy echo timeout (60 sec)
2021-11-08T15:46:48Z cfgAgent: NSX 2480940 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="A5FE0700" level="warn"] DaemonHealthMonitor: nsx-proxy echo timeout (60 sec)
......
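
The pattern in these entries (repeated echo timeouts followed by a disconnect) can be summarized with a small filter. The following is a minimal sketch, not a supported tool: the function name summarize_proxy_health is hypothetical, and it assumes the message text matches the entries shown above and that awk is available in the host shell.

```shell
#!/bin/sh
# Hypothetical helper: count nsx-proxy health events reported by
# DaemonHealthMonitor. Reads the files given as arguments, or stdin
# when no file is given.
summarize_proxy_health() {
    awk '
        /DaemonHealthMonitor: nsx-proxy echo timeout/ { timeouts++ }
        /DaemonHealthMonitor: nsx-proxy disconnected/ { disconnects++ }
        /DaemonHealthMonitor: nsx-proxy connected/    { connects++ }
        END {
            # Unset counters print as 0 with %d.
            printf "timeouts=%d disconnects=%d connects=%d\n",
                   timeouts, disconnects, connects
        }
    ' "$@"
}
```

Example use on a host: summarize_proxy_health /var/run/log/nsx-syslog.log. For the six entries shown above, it would print timeouts=4 disconnects=1 connects=1.
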
The following command shows the lcp-ccp session as down on the host where vMotion is failing:

net-dvs -l | grep -i down
com.vmware.common.opaqueDvs.status.component.lcp.ccpSession = down , propType = RUNTIME
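
When checking several hosts, the same test can be wrapped in a function that prints a single word. This is a sketch under assumptions: the function name ccp_session_status is hypothetical, and only the "down" value is taken from the output above. The format of a healthy session line is not shown in this article, so any other value is reported as "not-down".

```shell
#!/bin/sh
# Hypothetical helper: classify the lcp.ccpSession status found in
# "net-dvs -l" output supplied on stdin.
#   down     - at least one lcp.ccpSession line reports "down"
#   not-down - lcp.ccpSession lines exist, none report "down"
#   unknown  - no lcp.ccpSession line was found
ccp_session_status() {
    awk '
        /lcp\.ccpSession/ {
            seen = 1
            if (tolower($0) ~ /= *down/) down = 1
        }
        END {
            if (down)      print "down"
            else if (seen) print "not-down"
            else           print "unknown"
        }
    '
}
```

Example use on a host: net-dvs -l | ccp_session_status
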

Environment

VMware NSX-T Data Center

Cause

The default L2 Any Any rule applies to all traffic traversing virtual machine workloads.

When the default L2 Any Any rule is set to log events and traffic volume is high, the rule's hit count rises sharply, and more CPU threads are needed to process those hits.
This resource-intensive processing leaves fewer CPU threads available for nsx-proxy and other internal daemons.
Without enough CPU threads, the daemons fail to respond to inter-process communication, and the affected process is reported as down.

A bug in the affected versions prevents cfg-agent from recovering from the down state on its own, which causes vMotion to fail with the error:

"Currently connected network interface" "Network Adapter 1" uses network 'DVSwitch[50 29 dd 1a c9 58 df 20-a6 c1 5a 82 a4 d2 21 32} NSX Port Group {dvportgroup-2003}(lcp.ccpSession down)'. Which is not accessible

Resolution

This issue is resolved in versions 3.1.4 and 3.2.1.

Workaround:
It is possible to work around this issue by carrying out either of the following:

Disable logging for the L2 catch-all Any Any rule.

Restart nsx-proxy and nsx-cfgagent on all affected hosts. This re-establishes connectivity between the components and resolves the vMotion issue.

Run the following commands from the host CLI:

/etc/init.d/nsx-opsagent restart
/etc/init.d/nsx-cfgagent restart
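
After the restarts, the session state can be re-checked with the net-dvs command from the Symptoms section. The following is a minimal sketch of such a check, not part of the documented workaround: the function name wait_for_ccp_session, the DVS_LIST_CMD override, and the default retry count and delay are all assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: poll until no lcp.ccpSession line reports "down".
# Arguments: number of checks (default 12), seconds between checks (default 5).
# DVS_LIST_CMD defaults to "net-dvs -l"; it can be overridden for testing.
wait_for_ccp_session() {
    tries=${1:-12}
    delay=${2:-5}
    attempt=1
    while [ "$attempt" -le "$tries" ]; do
        # Capture the output first so a failing command is not mistaken
        # for a healthy session.
        if ! output=$(${DVS_LIST_CMD:-net-dvs -l}); then
            echo "could not run ${DVS_LIST_CMD:-net-dvs -l}" >&2
            return 2
        fi
        if ! printf '%s\n' "$output" | grep -i 'lcp\.ccpSession' | grep -qi '= *down'; then
            echo "lcp.ccpSession not reported as down (check $attempt of $tries)"
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    echo "lcp.ccpSession still reported as down after $tries checks" >&2
    return 1
}
```

Example use on a host after the restarts: wait_for_ccp_session 12 5. A non-zero return means the session was still reported as down, or the status command could not be run.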