Symptoms :
1. VMs migrated by DRS lose connectivity.
2. The logical ports of the affected VMs go into a blocked state.
Environment :
NSX-T
NSX Data Center
This issue occurs because :
1. While a vMotion is in progress, one of the NSX Manager nodes restarts.
2. The ATTACH_VIF request sent by opsagent times out.
3. A DETACH_VIF request is received and the port is deleted on the management plane (MP) side.
4. The port is blocked on the host.
Relevant logs :
1. ESXi support bundle : /var/run/log/nsx-syslog.log :
2024-08-01T00:29:28.870Z nsx-opsagent[2110371]: NSX 2110371 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsxa" tid="2111144" level="ERROR" errorCode="MPA42003"] [DoMpVifAttachRpc] MP_AddVnicAttachment() failed: RPC call to NSX management plane timeout
2024-08-01T00:28:33.865Z nsx-opsagent[2110371]: NSX 2110371 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsxa" tid="2111144" level="INFO"] [DoVifPortOperation] request=[opId:[########-##-##-##-##-####-##] op:[HOSTD_ATTACH_PORT(1)] vif:[########-####-####-####-############] ls:[########-####-####-####-############] vmx:[/vmfs/volumes/vsan:################-################/########-####-####-####-############/VM_name.vmx] lp:[]]
2. Port deletion logs in NSX manager support bundle : /var/log/proton/nsxapi.log :
2024-08-01T00:31:17.105Z INFO L2TaskExecutor4 LogicalPortServiceImpl 6607 SWITCHING [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Schedule a delayed deletion task for logical port LogicalPort [id=<logical_port_uuid>, logicalPortState=UP, ephemeral=true, logicalSwitchId=LogicalSwitch/########-####-####-####-############, transportZoneId=TransportZone/########-####-####-####-############, transportZoneType=OVERLAY, vif=########-####-####-####-############, vifType=vif, switchingProfileIds=null, switchMode=STANDARD, extraConfigs=null, internalId=<logical_port_uuid>, initState=null, tags=null, pendingConfigFromHostd=true], expected starting time = 1722472307104.
2024-08-01T00:31:17.110Z INFO L2TaskExecutor4 LogicalPortServiceImpl 6607 SWITCHING [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] dvport [logical_port_uuid] to be detached on host: [host-uuid]
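The log signatures above can be searched for in extracted support bundles. The following is a minimal Python sketch, not a supported tool. The function names are hypothetical, and the regular expressions are derived only from the sample lines shown here, so other NSX builds may word these messages differently. It finds ATTACH_VIF RPC timeouts (errorCode MPA42003) in nsx-syslog.log lines and delayed port deletion entries in nsxapi.log lines, converting the "expected starting time" value from epoch milliseconds to UTC.

```python
import re
from datetime import datetime, timedelta, timezone

# Patterns are based only on the sample log lines in this article
# (assumption: the message wording is the same in your NSX build).
ATTACH_TIMEOUT_RE = re.compile(
    r'^(?P<ts>\S+)\s+nsx-opsagent\[\d+\]:.*errorCode="MPA42003".*\[DoMpVifAttachRpc\]'
)
DELAYED_DELETE_RE = re.compile(
    r'^(?P<ts>\S+)\s.*Schedule a delayed deletion task for logical port'
    r'.*expected starting time = (?P<epoch_ms>\d+)'
)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_ts(ts):
    """Parse a log timestamp such as 2024-08-01T00:29:28.870Z as UTC."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)


def find_attach_timeouts(lines):
    """Return the timestamps of ATTACH_VIF RPC timeouts in nsx-syslog.log lines."""
    return [
        parse_ts(m.group("ts"))
        for line in lines
        if (m := ATTACH_TIMEOUT_RE.match(line))
    ]


def find_delayed_deletions(lines):
    """Return (logged_at, scheduled_at) pairs for delayed port deletions in nsxapi.log lines."""
    results = []
    for line in lines:
        m = DELAYED_DELETE_RE.match(line)
        if m:
            logged = parse_ts(m.group("ts"))
            # "expected starting time" is treated as epoch milliseconds.
            scheduled = EPOCH + timedelta(milliseconds=int(m.group("epoch_ms")))
            results.append((logged, scheduled))
    return results
```

Each function accepts any iterable of lines, for example an open file handle on the log extracted from the bundle. For the sample nsxapi.log entry above, the scheduled time converts to 2024-08-01T00:31:47.104Z, about 30 seconds after the entry was logged. That figure comes from this one sample and should not be read as a documented constant.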
A resync task re-triggers the ATTACH_VIF request with the same port ID as the deleted port, and the port is unblocked. When this issue occurs, the resync is triggered after 5 minutes and the issue resolves without intervention.
Versions where this is a known issue : NSX 3.2.*, 4.0.*, 4.1.*
Version where this issue is fixed : NSX 4.2.0
Workaround :
Perform a manual vMotion of the affected VMs.