Symptoms:
- VMs in a specific transport node cluster may experience partial or complete loss of network connectivity.
- Regardless of the L2/L3 domain, some IP addresses may be reachable while others are not.
- A packet capture on the ESXi host at the VnicTx capture point shows that packets are not leaving the VM.
- A vMotion of the VM, or disconnecting and reconnecting its network adapter, temporarily remediates the issue.
On the NSX-T Manager:
In /var/log/proton/nsxapi.log, the log message "Created/updated realized TransportNodeCollection" repeats every 5 minutes:
2022-xx-xxTxx:xx:12.023Z INFO providerTaskExecutor-70 TransportNodeCollectionProvider 7392 FABRIC [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Created/updated realized TransportNodeCollection: GenericPolicyRealizedResource{path=/infra/realized-state/enforcement-points/default/transport-node-collections/xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx, realizationObjectId=xxxxxx, realizationState=REALIZED, intentVersion=xxxx, realizedVersionOnEnforcement=xxxx, realizationAPI=null, entityType=RealizedTransportNodeCollection, readBeforeWriteRequired=false, extendedAttributes={}, intentPaths=[/infra/sites/default/enforcement-points/default/transport-node-collections/xxxx-xxxx-xxxxx-xxxxxx}
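To confirm the 5-minute repetition without reading the log by eye, the check can be scripted. The following is a minimal sketch in Python, assuming the log lines start with an unredacted ISO 8601 timestamp in the same layout as the sample above. The function names, the 30-second tolerance, and the minimum of three consecutive gaps are illustrative assumptions and are not part of NSX-T.

```python
import re
from datetime import datetime, timedelta

# Message text taken from the nsxapi.log sample above.
MARKER = "Created/updated realized TransportNodeCollection"

# Assumption: each line begins with a timestamp such as 2022-01-01T10:00:12.023Z.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z")


def realization_times(lines):
    """Return the timestamps of every line that contains MARKER."""
    times = []
    for line in lines:
        if MARKER not in line:
            continue
        match = TS_RE.match(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f"))
    return times


def repeats_every(times, period=timedelta(minutes=5),
                  tolerance=timedelta(seconds=30), min_repeats=3):
    """True if at least min_repeats consecutive gaps are all close to period."""
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    if len(gaps) < min_repeats:
        return False
    return all(abs(gap - period) <= tolerance for gap in gaps)


if __name__ == "__main__":
    # Path as given in this article; run on a copy of the log if preferred.
    with open("/var/log/proton/nsxapi.log", errors="replace") as log:
        print(repeats_every(realization_times(log)))
```

If more than one TransportNodeCollection exists, the lines would need to be grouped by the path value first; this sketch treats all matching lines as one series.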
On the ESXi host:
In /var/run/log/vmkernel.log, the "Hang detected" message is logged:
2022-xx-xxTxx:xx:xx.968Z cpu127:21505815)Vmxnet3: 21129: <vm_name>.ethx,00:50:56:xx:xx:xx, portID(xxxxxxxx): Hang detected,numHangQ: x, enableGen: xxx
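To see which vNICs are affected, the "Hang detected" entries can be tallied per adapter. This is a rough sketch in Python, assuming the entries follow the layout of the sample line above with a real MAC address in place of the redacted one. The regular expression and the function name are illustrative assumptions derived only from that sample.

```python
import re
from collections import Counter

# Pattern derived from the single sample line in this article; other ESXi
# builds may format the message differently.
HANG_RE = re.compile(
    r"Vmxnet3: \d+: (?P<vnic>[^,]+),(?P<mac>[0-9A-Fa-f:]{17}), "
    r"portID\((?P<port>[^)]+)\): Hang detected"
)


def count_hangs(lines):
    """Count 'Hang detected' entries per (vNIC name, MAC address)."""
    hits = Counter()
    for line in lines:
        match = HANG_RE.search(line)
        if match:
            hits[(match.group("vnic"), match.group("mac"))] += 1
    return hits


if __name__ == "__main__":
    # Path as given in this article; run on a copy of the log if preferred.
    with open("/var/run/log/vmkernel.log", errors="replace") as log:
        for (vnic, mac), count in count_hangs(log).most_common():
            print(f"{vnic} {mac}: {count} hang(s)")
```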
Steps to reproduce the issue:
1. Create a transport node profile (TNP).
2. Apply the TNP to the cluster. This creates a TransportNodeCollection (TNC).
3. Apply another TNP to the cluster without detaching the existing TNP. This updates the TNC.