The following log entries help identify this issue:
NSX-T manager: /var/log/proton/nsxapi.log
While processing a hostd-initiated VIF detach, opsagent sends a VIF attach message (MP_AddVnicAttachment) to the MP, as shown below. This creates a stale LogicalPort on the NSX MP but not on the host.
2024-10-07T16:32:05.226Z nsx-opsagent[2219843]: NSX 2219843 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsxa" tid="2220371" level="INFO"] [DoVifPortOperation] request=[opId:[5063] op:[HOSTD_DETACH_PORT(2)] vif:[########-####-####-####-############] ls:[########-####-####-####-############] vmx:[/vmfs/volumes/vsan:5268df41ee63a87b-################/########-####-####-####-############/wiwk22jboxdr07_replica.vmx] lp:[]]
ESXi host: /var/log/nsx-syslog.log
2024-10-07T16:32:05.290Z nsx-opsagent[2219843]: NSX 2219843 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsxa" tid="2220371" level="INFO"] [PortOp] Cleared external id from port [########-####-####-####-############] successfully
2024-10-07T16:32:05.342Z nsx-opsagent[2219843]: NSX 2219843 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsxa" tid="2220374" level="INFO"] [NsxaAppRxCallback] Got Message in app_type:[SwitchingVertical]
2024-10-07T16:32:05.342Z nsx-opsagent[2219843]: NSX 2219843 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsxa" tid="2220371" level="INFO"] [MP_AddVnicAttachment] RPC call [5063-5070] to NSX management plane completed in [0] sec
2024-10-07T16:55:28.803Z nsx-opsagent[3265162]: NSX 3265162 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsxa" tid="22921014" level="INFO"] [HandlePriorAttachedPort] handling prior attachment for vif: ls:[########-####-####-####-############] lp:[########-####-####-####-############] tz:[########-####-####-####-############]
2024-10-07T16:55:28.803Z nsx-opsagent[3265162]: NSX 3265162 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsxa" tid="22921014" level="WARNING"] [PortOp] Port [########-####-####-####-############] DVSPROP_PORT_VNIC_EXTERNAL_ID not found ... already cleared on previous vif event, error code [bad0003]
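The signature in the entries above (a HOSTD_DETACH_PORT request followed by an MP_AddVnicAttachment RPC) can be searched for with a short script. The following is a minimal sketch, not a supported tool. The regular expressions are derived only from the sample lines shown here, and it assumes that the first number in "RPC call [5063-5070]" is the opId of the originating request, as the sample suggests. The function name is hypothetical.

```python
import re

# Patterns copied from the sample log lines; other builds may format them differently.
DETACH_RE = re.compile(
    r"\[DoVifPortOperation\] request=\[opId:\[(\d+)\] op:\[HOSTD_DETACH_PORT"
)
# Assumption: the first number in the RPC call range is the originating opId.
ATTACH_RPC_RE = re.compile(r"\[MP_AddVnicAttachment\] RPC call \[(\d+)-\d+\]")


def find_detach_triggered_attaches(lines):
    """Return opIds whose HOSTD_DETACH_PORT request is followed by an attach RPC to MP."""
    detach_ops = set()
    suspicious = []
    for line in lines:
        match = DETACH_RE.search(line)
        if match:
            detach_ops.add(match.group(1))
            continue
        match = ATTACH_RPC_RE.search(line)
        if match and match.group(1) in detach_ops:
            suspicious.append(match.group(1))
    return suspicious
```

The function accepts any iterable of lines, so an open file handle for a copy of /var/log/nsx-syslog.log can be passed directly. A match is only a hint that the host should be examined further, not proof of a stale port.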
VMware NSX
Stale LogicalPorts, along with stale LogicalPortAttachers, have an incorrect host-VMX path mapping. They are created while processing a hostd-initiated VIF detach, which relies on an internal in-memory cache (_lsSwapMap) maintained by opsagent.
The stale port is then used for subsequent VIF attach/detach requests, which always results in one extra, incorrect attacher entry. This leads to incorrect behavior during these VIF requests (for example, the MP returns a new LogicalPort/VIF to opsagent/host even though the VIF was already specified by opsagent/host).
This eventually causes connectivity issues for the impacted VMs.
This situation arises only when a hostd-initiated VIF detach is missing for the VIF/VM, and the opsagent service on the transport node receives two successive VIF attach requests from hostd for the same VIF, each connecting it to a different LogicalSwitch.
The cache (_lsSwapMap) is populated during a network change: the VIF is first attached on logical switch 1 (LS1), and a VIF attach is then received to connect it to logical switch 2 (LS2), creating an entry of the form <Key=VIF:vmxPath, Value={LS1, LS2}>. This entry is never cleared, because the corresponding VIF detach on LS1 that would clear it never arrives.
On the subsequent VIF detach from this host for this VIF, a VIF attach is sent to the MP, creating the stale LogicalPort.
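The sequence described above can be illustrated with a small toy model. This is a sketch for reasoning about the event order only. It is not opsagent code, and the class, method, and attribute names are hypothetical. The only behavior it encodes is what the description states: a second attach on a different switch populates the cache, a detach on the first switch clears the entry, and a detach that finds a leftover entry results in an attach being sent to the MP.

```python
class OpsAgentToyModel:
    """Toy model of the swap-cache behavior described in this article (hypothetical)."""

    def __init__(self):
        self.attached = {}     # (vif, vmx_path) -> logical switch of the last attach
        self.ls_swap_map = {}  # (vif, vmx_path) -> (first_ls, second_ls)
        self.mp_attaches = []  # attach messages sent to MP while handling a detach

    def vif_attach(self, vif, vmx_path, ls):
        key = (vif, vmx_path)
        previous = self.attached.get(key)
        if previous is not None and previous != ls:
            # Second attach on a different switch with no detach in between.
            self.ls_swap_map[key] = (previous, ls)
        self.attached[key] = ls

    def vif_detach(self, vif, vmx_path, ls):
        key = (vif, vmx_path)
        entry = self.ls_swap_map.get(key)
        if entry is not None and ls == entry[0]:
            # The expected detach on the first switch clears the cache entry.
            del self.ls_swap_map[key]
            return
        if entry is not None:
            # Leftover entry: an attach is sent to MP during detach processing.
            # What happens to the entry afterwards is not modeled here.
            self.mp_attaches.append((vif, entry[1]))
            return
        self.attached.pop(key, None)
```

In this model, the sequence attach on LS1, attach on LS2, detach on LS1, detach on LS2 sends nothing to the MP, whereas the same sequence with the detach on LS1 missing leaves one entry in mp_attaches, which corresponds to the stale LogicalPort.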
Workaround:
1. Find all stale LogicalPorts and LogicalPortAttachers as follows:
a) Get all LogicalPorts and LogicalPortAttachers using the following APIs:
GET /api/v1/logical-ports
GET /api/v1/logical-ports/<lport-id>/state - (Fields of importance are 'transport_node_ids' and 'attachment.attachers.host')
2. For each LogicalPort, read the corresponding LogicalPortAttacher and check whether the LogicalPort is present on the host (transport node ID) named in the attacher entry.
This can be done by running the command 'net-dvs -l | less' on the host and checking whether the port exists.
If the port does not exist on the host, the attacher entry can be deleted.
Alternatively, if the hosts are vCenter-managed, the presence of the LogicalPorts (as DVPorts in vCenter) can be confirmed directly from vCenter.
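The two steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a supported tool. It assumes that list responses carry their items in 'results' and an optional 'cursor' for paging, that 'attachment.attachers' in the state response is a list of objects with a 'host' field, and that HTTP basic authentication is accepted. All function names are hypothetical. The set of ports present on each host must be gathered separately (from net-dvs or vCenter) and keyed by the same host identifier that appears in 'attachment.attachers.host'.

```python
import base64
import json
import urllib.request
from urllib.parse import quote


def make_get_json(manager_host, user, password):
    """Build a GET helper for the NSX Manager (assumes basic auth and a trusted certificate)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()

    def get_json(path):
        request = urllib.request.Request(
            f"https://{manager_host}{path}",
            headers={"Authorization": f"Basic {token}"},
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)

    return get_json


def list_logical_ports(get_json):
    """Yield every LogicalPort, following 'cursor' if the response includes one."""
    cursor = None
    while True:
        path = "/api/v1/logical-ports"
        if cursor:
            path += "?cursor=" + quote(cursor, safe="")
        page = get_json(path)
        yield from page.get("results", [])
        cursor = page.get("cursor")
        if not cursor:
            break


def collect_attachers(get_json):
    """Map each LogicalPort id to its transport node ids and attacher hosts."""
    summary = {}
    for port in list_logical_ports(get_json):
        state = get_json(f"/api/v1/logical-ports/{port['id']}/state")
        attachers = (state.get("attachment") or {}).get("attachers") or []
        summary[port["id"]] = {
            "transport_node_ids": state.get("transport_node_ids", []),
            "attacher_hosts": [attacher.get("host") for attacher in attachers],
        }
    return summary


def find_stale_attachers(summary, ports_present_by_host):
    """Return (port_id, host) pairs whose port was not found on the named host."""
    stale = []
    for port_id, info in sorted(summary.items()):
        for host in info["attacher_hosts"]:
            present = ports_present_by_host.get(host)
            if present is None:
                continue  # host not inspected; make no claim about it
            if port_id not in present:
                stale.append((port_id, host))
    return stale
```

The HTTP helper is passed in as a parameter so that the collection and comparison logic can be exercised against saved API responses before it is pointed at a live manager. The script only reports candidates and does not modify anything.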
File an SR with VCF NSX, including the information gathered above, to have any stale logical ports or logical port attacher entries removed.
If this issue is suspected, please open an SR with the VCF NSX team for further investigation and remediation.