Datapath outage triggered after reducing the number of uplinks in a host uplink profile



Article ID: 322552


Products

VMware NSX Networking

Issue/Introduction

Symptoms:
  • NSX 3.x and 4.x
  • The ESXi Uplink Profile has been edited and the number of uplinks has been reduced
  • Virtual Machines have lost network connectivity
  • On an ESXi host, validate the TEP mapping using /usr/lib/vmware/vm-support/bin/dump-vdl2-info.py
  • In this example the Uplink Profile had 4 uplinks and accordingly the host had 4 TEPs: vmk10, vmk11, vmk12 and vmk13.
The Uplink Profile was changed from 4 to 2 uplinks.
The host now has 2 TEPs: vmk10 and vmk12.
vmk10 is mapped to Endpoint 0 and vmk12 is mapped to Endpoint 2.
However, logical switches are mapped to Endpoints 0 and 1, so traffic on logical switches mapped to Endpoint 1 has no backing TEP.
Output from /usr/lib/vmware/vm-support/bin/dump-vdl2-info.py

        VTEP Count:     2
        CDO status:     enabled (deactivated)
                VTEP Interface: vmk10
                        DVPort ID:      a760bff7-3dec-4e11-9078-71592ef259d1
                        Switch Port ID: 67108881
                        Endpoint ID:    0

                VTEP Interface: vmk12
                        DVPort ID:      2906f7dd-7f81-4a21-8b1f-2bf2a8fa4332
                        Switch Port ID: 67108883
                        Endpoint ID:    2


Sample of logical switch mapping in the same command output
 
Logical Network:        69633
                VTEP Endpoint ID:       1
Logical Network:        68609
                VTEP Endpoint ID:       1
Logical Network:        73760
                VTEP Endpoint ID:       1
Logical Network:        73752
                VTEP Endpoint ID:       0
Logical Network:        73736
                VTEP Endpoint ID:       1
Logical Network:        68608
                VTEP Endpoint ID:       0
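A quick way to see why the datapath breaks is to compare the endpoint IDs that are backed by VTEPs against the endpoint IDs the logical switches reference. A minimal Python sketch, using the values from the sample output above (the function name and input lists are illustrative, not part of any NSX tool):

```python
def unmapped_endpoints(vtep_endpoint_ids, logical_endpoint_ids):
    # Endpoint IDs referenced by logical switches but not backed by any
    # VTEP; traffic on those logical switches is black-holed.
    return sorted(set(logical_endpoint_ids) - set(vtep_endpoint_ids))

# From the sample output: vmk10 -> Endpoint 0, vmk12 -> Endpoint 2,
# while the logical networks reference Endpoints 1, 1, 1, 0, 1, 0.
missing = unmapped_endpoints([0, 2], [1, 1, 1, 0, 1, 0])
print(missing)  # [1] -> logical switches on Endpoint 1 have no TEP
```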


Environment

VMware NSX-T
VMware NSX 4.1.0.2

Cause

Normally, TEP configuration changes are last in, first out: when reducing from 4 to 2 uplinks, the TEP configuration should change from vmk10, vmk11, vmk12, vmk13 to vmk10, vmk11.
On NSX releases prior to 3.2.2 and 4.0.x, TEPs could be created with incorrect ordering.
This had no functional impact at the time.
However, if the TEP/uplink configuration is later reduced, the wrong TEPs can be removed from the hosts, leaving logical switches mapped to endpoints that no longer have a backing TEP.

Resolution

This is a known issue impacting NSX.

Workaround:
To resolve the unordered VTEPs issue, use the following process; it will then allow you to reduce the number of VTEPs while avoiding incorrect endpoint mapping. This workaround applies to VMware NSX versions 3.2.2 and above.

1. Identify the hosts with unordered VTEPs:
  • On each host, run /usr/lib/vmware/vm-support/bin/dump-vdl2-info.py.
  • Find the VDS used by NSX and look for the VTEP interfaces, such as vmk10, vmk11, etc.
  • Check whether they are ordered incorrectly; for example, the output may list vmk10 and vmk12, with vmk11 missing.
Or
  • From NSX Manager, use the API call:
GET /api/v1/transport-nodes/<TN-UUID>/state -> this lists the host switch endpoints vmk10, vmk11, etc.
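The ordering check in step 1 can be sketched in Python. This is a hypothetical helper, assuming the TEP vmk names as reported by dump-vdl2-info.py or the state API:

```python
def teps_are_ordered(vtep_interfaces):
    # A healthy host has consecutively numbered TEP vmks starting from
    # the lowest one (e.g. vmk10, vmk11). A gap such as vmk10, vmk12
    # means vmk11 is missing and the host has unordered VTEPs.
    nums = sorted(int(name[3:]) for name in vtep_interfaces)
    return nums == list(range(nums[0], nums[0] + len(nums)))

print(teps_are_ordered(["vmk10", "vmk11"]))  # True  - healthy
print(teps_are_ordered(["vmk10", "vmk12"]))  # False - vmk11 missing
```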

2. In the NSX UI, create a new host uplink profile with only 1 uplink in it.
3. For each impacted host, do the following one at a time:
  1. In vCenter, place the host in maintenance mode.
  2. In the NSX UI, navigate to System - Fabric - Hosts, expand the cluster and click the host.
  3. Click Configure NSX and click NEXT.
  4. Click the 3 dots on the host switch and click Edit.
  5. Change the Uplink Profile to the newly created uplink profile and ensure you select the VDS uplink to be used.
  6. Click ADD and FINISH.
  7. Use the API GET /api/v1/transport-nodes/<TN-UUID>/state to verify only one VTEP exists on the host.
  8. You should now see a Mismatch alert under NSX Configuration for the host. This is expected; it is caused by the difference between the host configuration and the Transport Node Profile (TNP) applied to the cluster.
  9. Click the word Mismatch, click MATCH CLUSTER CONFIGURATION, then click YES to proceed.
  10. This detaches the single-uplink profile and reapplies the TNP to the host.
  11. Once this is complete, use GET /api/v1/transport-nodes/<TN-UUID>/state again to verify that the correct number and order of VTEPs are now configured on the host.
  12. Exit maintenance mode in vCenter.
4. Repeat steps 3.1 to 3.12 for each impacted host, one at a time.

5. Once all hosts have the workaround applied and all VTEPs are in the correct order, you can reduce the number of VTEPs and uplinks without hitting the incorrect endpoint mapping issue.

6.A. In NSX-T 3.2.2, to reduce the number of VTEPs and uplinks, for example from 4 to 2:
  • In the NSX UI, edit the host uplink profile used by the TNP, reducing the number of uplinks from 4 to 2.
  • Then use the following API call to identify the TNPs:
GET /api/v1/transport-node-profiles
  • Identify the TNP used by the cluster and run:
GET /api/v1/transport-node-profiles/<TNP-UUID>
  • In the returned body, reduce the number of uplinks from 4 to 2; this edited body will then be used as the body of the PUT call. For example:
  • Uplinks before:
"uplinks": [
                    {
                        "vds_uplink_name": "Uplink 1",
                        "uplink_name": "Uplink-1"
                    },
                    {
                        "vds_uplink_name": "Uplink 2",
                        "uplink_name": "Uplink-2"
                    },
                    {
                        "vds_uplink_name": "Uplink 3",
                        "uplink_name": "Uplink3"
                    },
                    {
                        "vds_uplink_name": "Uplink 4",
                        "uplink_name": "Uplink4"
                    }
                ]
  • Uplinks after:
"uplinks": [
                    {
                        "vds_uplink_name": "Uplink 1",
                        "uplink_name": "Uplink-1"
                    },
                    {
                        "vds_uplink_name": "Uplink 2",
                        "uplink_name": "Uplink-2"
                    }
                ]
  • Use the complete body received from the GET, with the edited uplinks, as the body of:
PUT /api/v1/transport-node-profiles/<TNP-UUID>
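The body edit described above can be sketched in Python. This is a minimal illustration only: `trim_uplinks` is a hypothetical helper, and the `host_switch_spec`/`host_switches` nesting is an assumption about where the `uplinks` list sits in your TNP body; check it against your actual GET response before building the PUT body.

```python
import copy

def trim_uplinks(tnp_body, keep=2):
    # Return a copy of the TNP body with each host switch's "uplinks"
    # list reduced to the first `keep` entries. The nesting below is an
    # assumption; adjust the path if your GET response differs.
    body = copy.deepcopy(tnp_body)
    for hs in body.get("host_switch_spec", {}).get("host_switches", []):
        hs["uplinks"] = hs.get("uplinks", [])[:keep]
    return body

# Fragment mirroring the "Uplinks before" sample above
sample = {
    "host_switch_spec": {
        "host_switches": [{
            "uplinks": [
                {"vds_uplink_name": "Uplink 1", "uplink_name": "Uplink-1"},
                {"vds_uplink_name": "Uplink 2", "uplink_name": "Uplink-2"},
                {"vds_uplink_name": "Uplink 3", "uplink_name": "Uplink3"},
                {"vds_uplink_name": "Uplink 4", "uplink_name": "Uplink4"},
            ]
        }]
    }
}
trimmed = trim_uplinks(sample, keep=2)  # use as the PUT body
```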
 
6.B. In NSX-T 3.2.3.1, to reduce the number of VTEPs and uplinks, for example from 4 to 2:
  • Use the following API call to identify the TNPs:
GET /api/v1/transport-node-profiles
  • Identify the TNP used by the cluster and run:
GET /api/v1/transport-node-profiles/<TNP-UUID>
  • In the returned body, reduce the number of uplinks from 4 to 2; this edited body will then be used as the body of the PUT call. For example:
  • Uplinks before:
"uplinks": [
                    {
                        "vds_uplink_name": "Uplink 1",
                        "uplink_name": "Uplink-1"
                    },
                    {
                        "vds_uplink_name": "Uplink 2",
                        "uplink_name": "Uplink-2"
                    },
                    {
                        "vds_uplink_name": "Uplink 3",
                        "uplink_name": "Uplink3"
                    },
                    {
                        "vds_uplink_name": "Uplink 4",
                        "uplink_name": "Uplink4"
                    }
                ]
  • Uplinks after:
"uplinks": [
                    {
                        "vds_uplink_name": "Uplink 1",
                        "uplink_name": "Uplink-1"
                    },
                    {
                        "vds_uplink_name": "Uplink 2",
                        "uplink_name": "Uplink-2"
                    }
                ]
  • Use the complete body received from the GET, with the edited uplinks, as the body of:
PUT /api/v1/transport-node-profiles/<TNP-UUID>
  • Verify the number of uplinks:
GET /api/v1/transport-nodes/<TN-UUID>
  • Verify the number of VTEPs:
GET /api/v1/transport-nodes/<TN-UUID>/state
  • If the number of VTEPs is still incorrect, edit the host uplink profile used by the TNP from 4 to 2 uplinks.
  • Verify the number of VTEPs again:
GET /api/v1/transport-nodes/<TN-UUID>/state
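As a rough way to eyeball the VTEP list from the raw /state response, the sketch below simply scans the response text for vmk names rather than relying on exact field names (illustrative only; the sample string is not the real state schema):

```python
import re

def list_vteps(state_response_text):
    # Schema-agnostic check: pull every vmk interface name out of the
    # raw JSON text of the /state response and return them de-duplicated
    # in numeric order, so gaps (e.g. vmk10, vmk12) are easy to spot.
    names = set(re.findall(r"vmk\d+", state_response_text))
    return sorted(names, key=lambda n: int(n[3:]))

# Hypothetical response fragment for demonstration
sample = '{"endpoints": [{"device_name": "vmk10"}, {"device_name": "vmk11"}]}'
print(list_vteps(sample))  # ['vmk10', 'vmk11']
```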

Note: The transport node UUID <TN-UUID> can be found in the NSX UI under System - Fabric - Hosts: expand the cluster, click the 3 dots beside the host, and click Copy ID to Clipboard.

If the above workaround fails, or if you are on a VMware NSX version earlier than 3.2.2 and still have incorrect endpoint mapping, you can reboot the host. The reboot remaps the TEPs to the correct endpoints and resolves the datapath issue.

Note: In the example given above (in Symptoms) of reducing from 4 to 2 uplinks, the host will continue to use vmk10 and vmk12; however, the reboot resolves the endpoint mapping issue and there is no further functional impact.