Ports connected to VDS become disconnected after power off and power on of the VM

Products

VMware NSX

Issue/Introduction

Workload VMs connected to NSX Segment in disconnected state after powered down and powered back on.

# Get the timestamp of Power ON and vm name :

# On esx nsxa logs find the vif details in attach port operation as below and tracking id to validate the request on MP:

2022-05-04T10:57:41.870Z nsx-opsagent[2103060]: NSX 2103060 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsxa" tid="48287277" level="INFO"] [DoVifPortOperation] request=[opId:[########-##-##-####-####] op:[HOSTD_ATTACH_PORT(1)] vif:[########-####-####-####-########73a4] ls:[########-####-####-####-########3702] vmx:[/vmfs/volumes/########-####-########a1d0/vmname/vmname.vmx] lp:[]]

2022-05-04T10:57:41.872Z nsx-opsagent[2103060]: NSX 2103060 - [nsx@6876 comp="nsx-esx" subcomp="mpa-client" tid="48287277" level="INFO"] [SwitchingVertical] SendRequest: To Master APH, type (com.vmware.nsx.switching.VifMsg) correlationId () trackingIdStr (########-####-####-####-########6870) Success.

# On same esx nsxa logs we would also see these call timing out as below :

2022-05-04T10:58:36.876Z nsx-opsagent[2103060]: NSX 2103060 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsxa" tid="48287277" level="ERROR" errorCode="MPA41542"] [MP_AddVnicAttachment] RPC call [########-##-##-####-####-4945] to NSX management plane timout

2022-05-04T10:58:36.876Z nsx-opsagent[2103060]: NSX 2103060 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsxa" tid="48287277" level="ERROR" errorCode="MPA42003"] [DoMpVifAttachRpc] MP_AddVnicAttachment() failed: RPC call to NSX management plane timeout

grep -a "to NSX management plane timout" nsx-syslog.0 | wc -l
196

# Now to trace the call above in MP, check /var/log/proton/message_trace.log for the tracking id from Host side
"########-####-####-####-########6870":

nsx_manager_########-####-####-####-########ef4_20220504_115945/var/log/proton/message_trace.log:54404:

2022-05-04T10:57:41.872Z INFO nsx rpc:RPC_PROXY_CONN_PROVIDER:user-executor-5 InboundMessageRouter 4625 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Message received: client ########-####-####-####-########d0ea, clientType cvn-hv, application SwitchingVertical, messageType com.vmware.nsx.switching.VifMsg, correlationId ########-####-####-####-########6870, rpc true

# Look for below message in /var/log/proton/nsxapi logs to identify if L2TaskExecutor is discarding tasks as below during above call :

2022-05-04T10:56:23.367Z WARN nsx-rpc:RPC_PROXY_CONN_PROVIDER:user-executor-3 L2DiscardOldestPolicy 4625 FABRIC [nsx@6876 comp="nsx-manager" level="WARNING" subcomp="manager"] Discarded 33201 tasks for thread L2TaskExecutor

2022-05-04T11:00:10.479Z WARN nsx-rpc:RPC_PROXY_CONN_PROVIDER:user-executor-9 L2DiscardOldestPolicy 4625 FABRIC [nsx@6876 comp="nsx-manager" level="WARNING" subcomp="manager"] Discarded 33217 tasks for thread L2TaskExecutor

# we can also check if below is this issue is happening very frequently for all the VMs :

gss-prd-csp-2:proton$ zgrep L2DiscardOldestPolicy nsxapi.* | grep 2022-05-04 | wc -l

170

gss-prd-csp-2:proton$ zgrep L2DiscardOldestPolicy nsxapi.* | grep 2022-05-03 | wc -l

298

gss-prd-csp-2:proton$ zgrep L2DiscardOldestPolicy nsxapi.* | grep 2022-05-02 | wc -l

339

gss-prd-csp-2:proton$ zgrep L2DiscardOldestPolicy nsxapi.* | grep 2022-05-01 | wc -l

279

Environment

VMware NSX-T Data Center 3.x
VMware NSX 4.x

Cause

L2TaskExecutor is dropping messaging from hosts frequently. The mechanism which invokes the VifMsg callbacks to deliver the message to L2 vertical is java ThreadPool specific. InboundMessageRouter calls the vertical specific handleMessage callback. In the logs we see many instances of such Rejected tasks ie., a thread was not available in the thread pool to perform the task.

Resolution

This issue is resolved in VMware NSX 3.2.2.0
This issue is resolved in VMware NSX 4.2.0

Workaround:

Contact Broadcom Support if you encounter this issue and cannot upgrade.

Additional Information

Impact/Risks:

Complete DP impact on the VMs showing the issue.