In this case, pings from a workload VM to the Tier-0 Edge uplink interface also showed drops, indicating that the issue lies within the NSX/physical underlay transport.
The issue may appear after recent host upgrades (e.g., to ESXi 8.0u3g) or host re-preparation tasks.
VMware NSX
The root cause is the presence of duplicate Tunnel Endpoint (TEP) IP addresses in the environment.
When multiple ESXi hosts claim the same TEP IP address, the physical network cannot correctly route the return traffic encapsulated in GENEVE packets.
In this case, when the Edge node replies to the workload VM, the physical switch may forward the packet to the wrong host (the duplicate holder) instead of the host actually running the VM. Connectivity is therefore intermittent, depending on how the physical network hashes traffic or updates its ARP tables for that IP.
This duplication often occurs if:
TEP IP Pools are configured with overlapping ranges.
Decommissioned hosts failed to gracefully release their TEP IPs back to the pool, leaving "stale" allocations that are subsequently reassigned to new or upgraded hosts.
To resolve this issue, we must identify and rectify the duplicate TEP IP assignments.
Perform a detailed check of the TEP IP assignments across the Transport Nodes and ensure that every host has a unique TEP IP.
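The uniqueness check can be sketched in a few lines of Python. This is a minimal illustration, not an NSX tool: it assumes the host-to-TEP mapping has already been collected by some other means (for example, copied from the NSX Manager UI), and the host names, the addresses, and the function `find_duplicate_teps` are hypothetical.

```python
from collections import defaultdict


def find_duplicate_teps(tep_by_host):
    """Return {tep_ip: sorted host names} for every TEP IP claimed by more than one host."""
    hosts_by_ip = defaultdict(set)
    for host, ips in tep_by_host.items():
        for ip in ips:
            hosts_by_ip[ip].add(host)
    # Keep only the addresses that appear on two or more distinct hosts.
    return {ip: sorted(hosts) for ip, hosts in hosts_by_ip.items() if len(hosts) > 1}


# Hypothetical inventory (documentation address range), gathered manually.
inventory = {
    "esxi-01": ["192.0.2.11", "192.0.2.12"],
    "esxi-02": ["192.0.2.13", "192.0.2.14"],
    "esxi-03": ["192.0.2.11", "192.0.2.15"],
}

duplicates = find_duplicate_teps(inventory)
for ip, hosts in duplicates.items():
    print(f"{ip} is claimed by: {', '.join(hosts)}")
```

With the sample inventory above, only 192.0.2.11 is reported, because it is the only address held by two different hosts.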
Review the NSX IP Pool configurations to ensure there are no overlapping IP ranges defined across different pools used by the same transport zone.
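The overlap review can be illustrated with Python's standard `ipaddress` module. This is a sketch under stated assumptions: the pool names and start/end addresses are hypothetical and entered by hand, all ranges are assumed to be of the same IP family, and `find_overlaps` is an illustrative helper rather than part of any NSX tooling.

```python
import ipaddress
from itertools import combinations


def parse_range(start, end):
    """Convert a start/end pair of strings into comparable address objects."""
    first, last = ipaddress.ip_address(start), ipaddress.ip_address(end)
    if first > last:
        raise ValueError(f"range start {start} is after range end {end}")
    return first, last


def find_overlaps(pools):
    """Return (name_a, name_b, overlap_start, overlap_end) for each pair of intersecting ranges."""
    spans = [
        (name, *parse_range(start, end))
        for name, ranges in pools.items()
        for start, end in ranges
    ]
    overlaps = []
    for (n1, s1, e1), (n2, s2, e2) in combinations(spans, 2):
        # Two inclusive ranges intersect when each one starts before the other ends.
        if s1 <= e2 and s2 <= e1:
            overlaps.append((n1, n2, str(max(s1, s2)), str(min(e1, e2))))
    return overlaps


# Hypothetical pools, each with a list of (start, end) ranges.
pools = {
    "TEP-Pool-A": [("192.0.2.10", "192.0.2.50")],
    "TEP-Pool-B": [("192.0.2.40", "192.0.2.80")],
    "TEP-Pool-C": [("198.51.100.10", "198.51.100.20")],
}

overlaps = find_overlaps(pools)
for name_a, name_b, start, end in overlaps:
    print(f"{name_a} and {name_b} overlap on {start} - {end}")
```

In this sample, TEP-Pool-A and TEP-Pool-B share 192.0.2.40 through 192.0.2.50, so either pool could hand out an address in that span to a different host.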
If we identify IPs that belong to decommissioned hosts but are still marked as "Allocated" in the NSX Manager:
Refer to "TEP IP Addresses Not Released After FORCE Deleting Host/Edge Transport Node in NSX-T UI" or "Stale Edge Node TEP IP Addresses Not Released After Deletion from vCenter Causing IP Pool Exhaustion" for steps to manually release stale IP allocations.
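Before following those articles, it can help to list which allocations are candidates for release. The sketch below simply compares two hand-collected lists: the addresses the pool shows as "Allocated" and the TEP IPs actually configured on live Transport Nodes. Both lists and the helper `find_stale_allocations` are hypothetical, and the result is only a list of candidates to verify, not a confirmation that an address is safe to release.

```python
import ipaddress


def find_stale_allocations(allocated_ips, live_tep_ips):
    """Return allocated addresses that no live Transport Node is using, in numeric order."""
    stale = set(allocated_ips) - set(live_tep_ips)
    return sorted(stale, key=ipaddress.ip_address)


# Hypothetical data: what the pool reports as allocated vs. what live hosts use.
allocated = ["192.0.2.11", "192.0.2.12", "192.0.2.13", "192.0.2.9"]
live = ["192.0.2.11", "192.0.2.13"]

stale_candidates = find_stale_allocations(allocated, live)
print("Candidate stale allocations:", stale_candidates)
```

Each candidate should be checked against the inventory of decommissioned hosts before it is released using the procedures in the referenced articles.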
Once the pool is cleaned up, we may need to put the affected host into Maintenance Mode and "re-sync" or reconfigure the Transport Node so that it acquires a fresh, unique TEP IP.