Workload VMs connected to NSX Segment in disconnected state after powered down and powered back on.
# Get the timestamp of Power ON and vm name :
# On esx nsxa logs find the vif details in attach port operation as below and tracking id to validate the request on MP:
2022-05-04T10:57:41.870Z nsx-opsagent[2103060]: NSX 2103060 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsxa" tid="48287277" level="INFO"] [DoVifPortOperation] request=[opId:[########-##-##-####-####] op:[HOSTD_ATTACH_PORT(1)] vif:[########-####-####-####-########73a4] ls:[########-####-####-####-########3702] vmx:[/vmfs/volumes/########-####-########a1d0/vmname/vmname.vmx] lp:[]]
2022-05-04T10:57:41.872Z nsx-opsagent[2103060]: NSX 2103060 - [nsx@6876 comp="nsx-esx" subcomp="mpa-client" tid="48287277" level="INFO"] [SwitchingVertical] SendRequest: To Master APH, type (com.vmware.nsx.switching.VifMsg) correlationId () trackingIdStr (########-####-####-####-########6870) Success.
# On same esx nsxa logs we would also see these call timing out as below :
2022-05-04T10:58:36.876Z nsx-opsagent[2103060]: NSX 2103060 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsxa" tid="48287277" level="ERROR" errorCode="MPA41542"] [MP_AddVnicAttachment] RPC call [########-##-##-####-####-4945] to NSX management plane timout
2022-05-04T10:58:36.876Z nsx-opsagent[2103060]: NSX 2103060 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsxa" tid="48287277" level="ERROR" errorCode="MPA42003"] [DoMpVifAttachRpc] MP_AddVnicAttachment() failed: RPC call to NSX management plane timeout
grep -a "to NSX management plane timout" nsx-syslog.0 | wc -l
196
# Now to trace the call above in MP, check /var/log/proton/message_trace.log for the tracking id from Host side
"########-####-####-####-########6870":
nsx_manager_########-####-####-####-########ef4_20220504_115945/var/log/proton/message_trace.log:54404:
2022-05-04T10:57:41.872Z INFO nsx rpc:RPC_PROXY_CONN_PROVIDER:user-executor-5 InboundMessageRouter 4625 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Message received: client ########-####-####-####-########d0ea, clientType cvn-hv, application SwitchingVertical, messageType com.vmware.nsx.switching.VifMsg, correlationId ########-####-####-####-########6870, rpc true
# Look for below message in /var/log/proton/nsxapi logs to identify if L2TaskExecutor is discarding tasks as below during above call :
2022-05-04T10:56:23.367Z WARN nsx-rpc:RPC_PROXY_CONN_PROVIDER:user-executor-3 L2DiscardOldestPolicy 4625 FABRIC [nsx@6876 comp="nsx-manager" level="WARNING" subcomp="manager"] Discarded 33201 tasks for thread L2TaskExecutor
2022-05-04T11:00:10.479Z WARN nsx-rpc:RPC_PROXY_CONN_PROVIDER:user-executor-9 L2DiscardOldestPolicy 4625 FABRIC [nsx@6876 comp="nsx-manager" level="WARNING" subcomp="manager"] Discarded 33217 tasks for thread L2TaskExecutor
# we can also check if below is this issue is happening very frequently for all the VMs :
gss-prd-csp-2:proton$ zgrep L2DiscardOldestPolicy nsxapi.* | grep 2022-05-04 | wc -l
170
gss-prd-csp-2:proton$ zgrep L2DiscardOldestPolicy nsxapi.* | grep 2022-05-03 | wc -l
298
gss-prd-csp-2:proton$ zgrep L2DiscardOldestPolicy nsxapi.* | grep 2022-05-02 | wc -l
339
gss-prd-csp-2:proton$ zgrep L2DiscardOldestPolicy nsxapi.* | grep 2022-05-01 | wc -l
279
L2TaskExecutor is dropping messaging from hosts frequently. The mechanism which invokes the VifMsg callbacks to deliver the message to L2 vertical is java ThreadPool specific. InboundMessageRouter calls the vertical specific handleMessage callback. In the logs we see many instances of such Rejected tasks ie., a thread was not available in the thread pool to perform the task.
This issue is resolved in VMware NSX 3.2.2.0
This issue is resolved in VMware NSX 4.2.0
Workaround:
Contact Broadcom Support if you encounter this issue and cannot upgrade.
Complete DP impact on the VMs showing the issue.