VMs experience intermittent connectivity issues after a transport node profile is applied to a cluster without first detaching the pre-existing TNP

Article ID: 318326

Products

VMware NSX

Issue/Introduction

Symptoms:
- VMs in a specific transport node cluster may experience partial or complete loss of network connectivity.
- Regardless of the L2/L3 domain, some IP addresses may be reachable while others are not.
- A packet capture on the ESXi host at the VnicTx capture point shows that packets are not leaving the VM.
- A vMotion of the VM, or disconnecting and reconnecting its network adapter, temporarily remediates the issue.

On the NSX-T Manager, the log message "Created/updated realized TransportNodeCollection" repeats every 5 minutes in /var/log/proton/nsxapi.log:
2022-xx-xxTxx:xx:12.023Z INFO providerTaskExecutor-70 TransportNodeCollectionProvider 7392 FABRIC [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Created/updated realized TransportNodeCollection: GenericPolicyRealizedResource{path=/infra/realized-state/enforcement-points/default/transport-node-collections/xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx, realizationObjectId=xxxxxx, realizationState=REALIZED, intentVersion=xxxx, realizedVersionOnEnforcement=xxxx, realizationAPI=null, entityType=RealizedTransportNodeCollection, readBeforeWriteRequired=false, extendedAttributes={}, intentPaths=[/infra/sites/default/enforcement-points/default/transport-node-collections/xxxx-xxxx-xxxxx-xxxxxx}
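
The recurrence described above can be checked with a short script rather than by eye. The following is a minimal sketch, not a supported tool: it assumes each nsxapi.log entry starts with an ISO-8601 UTC timestamp as in the sample line, and the function names, the 30-second tolerance and the threshold of three consecutive repeats are illustrative choices, not values from this article.

```python
import re
from datetime import datetime, timedelta

# Message quoted in this article.
MARKER = "Created/updated realized TransportNodeCollection"

# Assumption: entries begin with a timestamp such as 2022-05-01T10:00:12.023Z
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d+Z")


def marker_times(lines):
    """Return the timestamp of every log line that contains MARKER."""
    times = []
    for line in lines:
        if MARKER not in line:
            continue
        match = TS_RE.match(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S"))
    return times


def looks_periodic(times,
                   period=timedelta(minutes=5),
                   tolerance=timedelta(seconds=30),
                   min_repeats=3):
    """True if min_repeats consecutive gaps are each within tolerance of period."""
    run = 0
    for earlier, later in zip(times, times[1:]):
        if abs((later - earlier) - period) <= tolerance:
            run += 1
            if run >= min_repeats:
                return True
        else:
            run = 0  # gap did not match; start counting again
    return False
```

To use it, open a copy of nsxapi.log, pass the file object to marker_times(), and pass the result to looks_periodic(). A True result is consistent with the 5-minute realization loop described in this article, but it is not proof of it.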

On the ESXi host, a "Hang detected" message appears in /var/run/log/vmkernel.log:
2022-xx-xxTxx:xx:xx.968Z cpu127:21505815)Vmxnet3: 21129: <vm_name>.ethx,00:50:56:xx:xx:xx, portID(xxxxxxxx): Hang detected,numHangQ: x, enableGen: xxx
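
To see which vNICs are affected on a host, the "Hang detected" entries can be tallied per vNIC. This is a minimal sketch that assumes the vmkernel.log line layout matches the sample above; the regular expression and the function name count_hangs are illustrative and may need adjusting for other ESXi builds.

```python
import re
from collections import Counter

# Assumption: layout follows the sample line in this article:
#   Vmxnet3: <n>: <vm_name>.ethX,<mac>, portID(<id>): Hang detected,...
HANG_RE = re.compile(
    r"Vmxnet3: \d+: (?P<vnic>.+?),"
    r"(?P<mac>(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}),\s*"
    r"portID\((?P<port>\w+)\):\s*Hang detected"
)


def count_hangs(lines):
    """Count 'Hang detected' entries, keyed by (vNIC name, MAC address)."""
    hangs = Counter()
    for line in lines:
        match = HANG_RE.search(line)
        if match:
            hangs[(match.group("vnic"), match.group("mac"))] += 1
    return hangs
```

Passing an open vmkernel.log file object to count_hangs() returns a Counter; its most_common() method lists the vNICs with the most hang entries first.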
 
Steps to reproduce the issue:
1. Create a Transport Node Profile (TNP).
2. Apply the TNP to the cluster. This creates a TransportNodeCollection (TNC).
3. Apply another TNP to the cluster without detaching the pre-existing TNP. This updates the TNC.

Environment

VMware NSX-T Data Center 3.x
VMware NSX-T Data Center

Cause

Applying a Transport Node Profile (TNP) to a cluster creates a Transport Node Collection (TNC). If another TNP is applied to the cluster without first detaching the existing one, the revision number of the realized TNC is not updated. The provider is then invoked every 5 minutes to realize the TNC with the new revision number, but because the TNC is already realized with the older revision number, the new realization does not go through. Since each attempt fails, the provider keeps retrying every 5 minutes indefinitely.

As a cascading effect of the update triggered every 5 minutes, some packets are leaked because of a race condition during VDR connection message processing. VMware Engineering is aware of this issue and has addressed it in NSX-T 3.2.2 and subsequent releases.

Resolution

This issue is resolved in NSX-T 3.2.2 and all subsequent releases.

Workaround:
1) For the impacted cluster, confirm in the NSX-T GUI (System > Fabric > Nodes > Host Transport Nodes) that none of the hosts shows a mismatch with the applied TN profile.
2) If there is no mismatch, detach the TN profile (select the cluster > Actions > Detach Transport Node Profile).
3) Wait for the detachment to complete.
4) Attach the TN profile back to the cluster (select the cluster > Configure NSX > select the Transport Node Profile).
Additional Info:
The workaround remains effective only until the next update to the cluster. Any subsequent update to the cluster, such as applying a new TNP, must follow the same sequence: first detach the existing TNP, then apply the new one.
                
To verify the workaround, search for "Created/updated realized TransportNodeCollection" in /var/log/proton/nsxapi.log on the NSX Manager. After the workaround is applied, the string should no longer be printed every 5 minutes.
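
The verification can be scripted by listing any occurrences of the message logged after the time the TN profile was re-attached. This is a minimal sketch under the same assumption about the log format (a leading ISO-8601 UTC timestamp); occurrences_after and its cutoff parameter are illustrative names. Entries written while the profile is being re-attached may be normal (an assumption, not stated in this article), so choose a cutoff a few minutes after the re-attach completes.

```python
from datetime import datetime

# Message quoted in this article.
MARKER = "Created/updated realized TransportNodeCollection"


def occurrences_after(lines, cutoff):
    """Return timestamps of MARKER lines logged strictly after cutoff.

    cutoff is a naive datetime in UTC, matching the 'Z' timestamps in the log.
    """
    found = []
    for line in lines:
        if MARKER not in line:
            continue
        try:
            # First 19 characters: YYYY-MM-DDTHH:MM:SS
            stamp = datetime.strptime(line[:19], "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            continue  # line does not start with a parseable timestamp
        if stamp > cutoff:
            found.append(stamp)
    return found
```

An empty list is consistent with the workaround having taken effect. Several results spaced about 5 minutes apart suggest the loop is still running.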

Additional Information

Impact/Risks:
Intermittent connectivity issues and intermittent packet drops for workload VMs.