Stale data from deleted Transport Nodes can cause Transport Nodes to incorrectly appear Degraded in the NSX UI

Article ID: 318617

Products

VMware NSX

Issue/Introduction

  • Transport Nodes appear Degraded with BFD tunnels down. The remote endpoints of the down tunnels are TEP IPs that once belonged to Transport Nodes which have since been deleted.
  • On a host that appears Degraded, get logical-switch <LS VNI> vtep-table shows stale, deleted TEP IPs in the VTEP table of at least one Logical Switch.
  • "Control Channel to transport Node Down" alarms may appear in the NSX UI for the stale Transport Node.
  • Central Control Plane (CCP) replication data shows the stale Transport Node, which was deleted and is no longer listed in get nodes output:

<HOSTNAME>> get nodes | ignore mgr
UUID                                   Type     IP Address      IPv6 Address          Hostname/FQDN                    Display Name
<UUID1>                                edg      X.X.X.X         N/A                   nsx-edge                         nsx-edge
<UUID2>                                edg      X.X.X.X         N/A                   nsx-edge-2                       nsx-edge-2
<UUID3>                                esx      X.X.X.X         N/A                   nsx-host                         nsx-host
<UUID4>                                esx      X.X.X.X         N/A                   nsx-host-2                       nsx-host-2

  • Comparing the above output with the CCP replication data shows the stale Transport Node (TN) in the CCP:

<hostname>> set debug
<hostname>> get replication all-transport-node-data dump dump.txt

  • The output shows the name of the file that the CCP replication data is dumped to:

/var/tmp/ccp/TransportNodeDataXXXXXXXXXXXXXXXXXX.tmp

  • Elevate to root and search the dump file for Transport Node entries:

<hostname>> st en
<user>@<hostname>:~# grep "Transport node" /var/tmp/ccp/TransportNodeDataXXXXXXXXXXXXXXXXXX.tmp
Transport node: <UUID1>
Transport node: <UUID2>
Transport node: <UUID3>
Transport node: <UUID4>
Transport node: <UUID5>       <---- Example of stale TN which was deleted but still exists in CCP
<user>@<hostname>:~#
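
The comparison above can also be done offline once both outputs are saved as text. The following is a minimal sketch, not a supported tool. It assumes that UUIDs appear in the canonical 8-4-4-4-12 hexadecimal form, that the UUID is the first column of each get nodes row, and that the dump uses the "Transport node: <UUID>" line format shown in the grep output. The function names and the sample UUIDs and IP addresses are hypothetical placeholders.

```python
import re

# Assumption: UUIDs use the canonical 8-4-4-4-12 hexadecimal layout.
UUID_RE = re.compile(r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def uuids_from_get_nodes(text):
    """Collect UUIDs from saved 'get nodes' output (first column of each row)."""
    found = set()
    for line in text.splitlines():
        fields = line.split()
        # The header row and blank lines are skipped because their first
        # field is not a UUID.
        if fields and UUID_RE.fullmatch(fields[0]):
            found.add(fields[0].lower())
    return found


def uuids_from_ccp_dump(text):
    """Collect UUIDs from 'Transport node: <UUID>' lines of the CCP dump."""
    found = set()
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Transport node:"):
            match = UUID_RE.search(line)
            if match:
                found.add(match.group(0).lower())
    return found


def stale_transport_nodes(get_nodes_text, ccp_dump_text):
    """Return UUIDs present in the CCP dump but absent from 'get nodes'."""
    stale = uuids_from_ccp_dump(ccp_dump_text) - uuids_from_get_nodes(get_nodes_text)
    return sorted(stale)


if __name__ == "__main__":
    # Placeholder data for illustration only; not taken from a real system.
    sample_get_nodes = """\
UUID                                   Type     IP Address      IPv6 Address
11111111-1111-1111-1111-111111111111   edg      192.0.2.10      N/A
22222222-2222-2222-2222-222222222222   esx      192.0.2.20      N/A
"""
    sample_ccp_dump = """\
Transport node: 11111111-1111-1111-1111-111111111111
Transport node: 22222222-2222-2222-2222-222222222222
Transport node: 33333333-3333-3333-3333-333333333333
"""
    for uuid in stale_transport_nodes(sample_get_nodes, sample_ccp_dump):
        print("Stale Transport Node in CCP:", uuid)
```

Any UUID the sketch reports exists in the CCP replication data but not in the get nodes listing, which matches the stale TN pattern described above. Confirm any result manually before acting on it.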

Environment

VMware NSX-T Data Center 
VMware NSX

Cause

If the Management Plane removes a Transport Node while the Central Control Plane (CCP) cluster is unavailable, the Transport Node is not removed from the LCP Replicator data store when the CCP comes back up. As a result, the stale Transport Node data remains in the CCP.

Resolution

This issue is resolved in VMware NSX 4.1.0.2, available at Broadcom downloads.

If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.


Workaround:
For versions 3.2.0 and later, the stale Transport Node data can be cleaned up by modifying the VTEP and MAC table timeout values:

1) Modify VTEP and MAC timeout values on all three Manager nodes:


<hostname>> set debug
<hostname>> get vtep-table timeout
<hostname>> set vtep-table timeout 1 day

<hostname>> get mac-table timeout
<hostname>> set mac-table timeout 1 day


2) Wait until the stale entries expire (at least 1 day).

3) Set the VTEP and MAC table timeout values back to their default values.

To remove the stale Transport Node data directly on 3.2.0+, or for versions before 3.2.0, please open a Service Request with Broadcom Support.


Additional Information

Impact/Risks:

Transport Nodes incorrectly appear Degraded in the NSX UI when they try to form tunnels with old TEP IPs.