Edge node configuration is incorrect after upgrading to 3.2.x versions

Products

VMware NSX

Issue/Introduction

After upgrading from NSX-T 3.1.x to 3.2.0, 3.2.1, or 3.2.3, some users may experience issues with the Edge node configuration. These appear as incorrect entries in the vm_deployment_config section, particularly in fields such as vc_id, compute_id, storage_id, and host_id.
To view the Edge configuration, use the following API request:
curl -v -k -u admin -H "Content-Type:application/json" -X GET https://<MGR_IP>/api/v1/transport-nodes/<Edge_UUID>

Note: Edge UUID can be obtained from the Edge Transport Nodes pane in the NSX UI, or from 'get nodes' output in the Manager admin shell.
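
As a minimal sketch, the same request can be issued from Python using only the standard library. The function names `transport_node_url` and `fetch_transport_node` are illustrative and not part of any NSX SDK. Skipping certificate verification mirrors the `-k` flag in the curl command above and is assumed to be acceptable only in environments where that flag would be.

```python
import base64
import json
import ssl
import urllib.request


def transport_node_url(mgr_ip: str, edge_uuid: str) -> str:
    """Build the transport-node URL used by the curl command above."""
    return f"https://{mgr_ip}/api/v1/transport-nodes/{edge_uuid}"


def fetch_transport_node(mgr_ip: str, edge_uuid: str, user: str, password: str) -> dict:
    """GET the Edge transport node and return the parsed JSON body."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        transport_node_url(mgr_ip, edge_uuid),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="GET",
    )
    # Equivalent of curl -k: certificate verification is skipped (assumption).
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context) as response:
        return json.load(response)
```

Calling `fetch_transport_node("<MGR_IP>", "<Edge_UUID>", "admin", "<password>")` would return the same JSON document that the curl command prints.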

When this issue occurs, at least one value in the "vm_deployment_config" section of the output from the above API is incorrect:

 "vm_deployment_config" : {
        "vc_id" : "<UUID>",
        "compute_id" : "resgroup-XXXX",
        "storage_id" : "datastore-XXXX",
        "management_network_id" : "dvportgroup-XXXX",
        "management_port_subnets" : [ {
          "ip_addresses" : [ "X.X.X.X" ],
          "prefix_length" : XX
        } ],
        "default_gateway_addresses" : [ "X.X.X.X" ],
        "data_network_ids" : [ X, X ],
        "reservation_info" : {
          "memory_reservation" : {
            "reservation_percentage" : X
          },
          "cpu_reservation" : {
            "reservation_in_shares" : "X_PRIORITY",
            "reservation_in_mhz" : X
          }
        },
        "resource_allocation" : {
          "cpu_count" : X,
          "memory_allocation_in_mb" : XXXXX
        },
        "placement_type" : "VsphereDeploymentConfig"
      },
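
To review only the relevant section of a saved API response, a small helper along the following lines may be useful. It is a sketch: the helper names are hypothetical, and the search is recursive because no assumption is made here about where "vm_deployment_config" sits inside the response.

```python
# Identifier fields called out in this article as commonly affected.
ID_FIELDS = ("vc_id", "compute_id", "storage_id", "host_id")


def find_vm_deployment_config(node):
    """Return the first "vm_deployment_config" object found, or None."""
    if isinstance(node, dict):
        if "vm_deployment_config" in node:
            return node["vm_deployment_config"]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_vm_deployment_config(child)
        if found is not None:
            return found
    return None


def id_summary(cfg: dict) -> dict:
    """Collect the identifier fields; absent fields are reported as None."""
    return {field: cfg.get(field) for field in ID_FIELDS}
```

Passing the parsed JSON response to `find_vm_deployment_config` and the result to `id_summary` gives a compact view of the values to compare against vCenter.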

Examples of incorrect configurations in "vm_deployment_config" section:

1. "compute_folder_id" is an optional field which may be missing in VMC environments

Affected Edge nodes will appear in an unexpected folder in vCenter.
This issue is cosmetic, and there is no dataplane impact.

2. "data_network_ids" may contain the Managed Object Reference ID (MoRef) of a non-existent port group

Attempts to re-deploy the Edge node will fail. BGP and BFD tunnels may also be affected in this scenario.
Logging example in this scenario:

20xx-xx-xxTxx:xx:xx.xxxZ ERROR ActivityWorkerPool-1-6 OvfDeploy 7135 SYSTEM [nsx@6876 comp="nsx-manager" errorCode="MP40409" level="ERROR" subcomp="manager"] Ovf deploy failed for vm <Edge hostname> on vc <VC hostname> with error 'The provided network mapping between OVF networks and the system network is not supported by any host.'
com.vmware.nsx.management.lcm.vc.soap.exceptions.CmFabricOvfDeployFailedException: The provided network mapping between OVF networks and the system network is not supported by any host.
        at com.vmware.nsx.management.lcm.vc.soap.service.ovf.OvfDeploy.createImportSpec(OvfDeploy.java:790) ~[?:?]
        at com.vmware.nsx.management.lcm.vc.soap.service.ovf.OvfDeploy.installOvf(OvfDeploy.java:373) ~[?:?]

Example alarm logging in this scenario:

20xx-xx-xxTxx:xx:xx.xxxZ INFO pool-61-thread-1 MonitoringEventInstanceProcessor 8406 MONITORING [nsx@6876 comp="nsx-manager" level="INFO" subcomp="monitoring"] Context for alarm with eventid edge.edge_vm_vsphere_settings_mismatch and entity id <UUID> is {"entity_id":"<UUID>","edge_vm_vsphere_settings_mismatch_reason":" configuration on vSphere : {\"Storage Id\":\"datastore-XXXX\",\"Data Networks Ids\":[\"dvportgroup-XXXX\",\"\"]}, intent vSphere configuration :{\"Storage Id\":\"datastore-XXXX\",\"Data Networks Ids\":[\"dvportgroup-XXXX\",\"dvportgroup-XXXX\"]}"}

20xx-xx-xxTxx:xx:xx.xxxZ FATAL pool-61-thread-1 MonitoringServiceImpl 8406 MONITORING [nsx@6876 alarmId="<UUID>" alarmState="OPEN" comp="nsx-manager" entId="<UUID>" errorCode="MP701099" eventFeatureName="edge" eventSev="CRITICAL" eventState="On" eventType="edge_vm_vsphere_settings_mismatch" level="FATAL" nodeId="<UUID>" subcomp="monitoring"] The Edge node <UUID> configuration on vSphere does not match the policy intent configuration. The Edge node configuration visible to user on UI or API is not same as what is realized. The realized Edge node changes made by user outside of NSX Manager are shown in the details of this alarm and any edits in UI or API will overwrite the realized configuration. Fields that differ for the Edge node are listed in runtime data configuration on vSphere : {"Storage Id":"datastore-XXXX","Data Networks Ids":["dvportgroup-XXXX",""]}, intent vSphere configuration   {"Storage Id":"datastore-XXXX","Data Networks Ids":["dvportgroup-XXXX","dvportgroup-XXXX"]}
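
The alarm text carries two JSON objects: the configuration realized on vSphere and the intent configuration. Once both have been copied out of the log and parsed (for example with `json.loads`), the differing fields can be listed as in the sketch below. The function name is hypothetical and the identifiers in the usage note are placeholders, not values from a real system.

```python
def differing_fields(realized: dict, intent: dict) -> dict:
    """Map each field whose value differs to a (realized, intent) pair."""
    keys = sorted(set(realized) | set(intent))
    return {
        key: (realized.get(key), intent.get(key))
        for key in keys
        if realized.get(key) != intent.get(key)
    }
```

With a realized value of `["dvportgroup-1", ""]` and an intent value of `["dvportgroup-1", "dvportgroup-2"]` for "Data Networks Ids", and identical "Storage Id" values, only "Data Networks Ids" is reported.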

3. host_id might be incorrectly populated due to data migration issues. Post-upgrade to NSX-T 3.2.x from 3.1.x versions, there may be discrepancies with the host_id in the Edge node's configuration.

host_id is excluded from the vSphere Location mismatch alarm as the vMotion of Edge VMs is common and expected.
Conditions under which the Host column is populated:
- The Host column for Edge VMs will be populated if a host was specified at the time of deployment; otherwise, it will remain blank.
- In two specific upgrade paths, the host_id may become visible:
  1. Upgrading from 3.1.x to 3.2.1: host_id may appear even if the customer chose a cluster at deployment time (not a specific Host), due to edge data migration issues.
  2. Upgrading from 3.1.x to 3.2.x: host_id may show even if the customer chose a cluster at deployment time, if the Edge intent was updated before the upgrade. The 3.1.x PUT operation could have inadvertently added it without user intention.
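
The three examples above can be turned into a rough triage check, sketched below. This is an illustration under stated assumptions, not an official diagnostic: the function name and finding labels are hypothetical, the set of existing port group MoRefs must be collected from vCenter by the operator, and whether a host was chosen at deployment time is supplied by the caller because it cannot be derived from the configuration itself.

```python
def triage(cfg: dict, existing_portgroups: set, host_chosen_at_deploy: bool) -> list:
    """Return labels for the suspicious values described in this article."""
    findings = []
    # Example 1: optional folder field is absent.
    if "compute_folder_id" not in cfg:
        findings.append("compute_folder_id_missing")
    # Example 2: a data network refers to a port group that does not exist.
    for network_id in cfg.get("data_network_ids", []):
        if network_id not in existing_portgroups:
            findings.append(f"data_network_id_unknown:{network_id}")
    # Example 3: host_id is set although a cluster, not a host, was chosen.
    if cfg.get("host_id") and not host_chosen_at_deploy:
        findings.append("host_id_unexpected")
    return findings
```

An empty list means none of the three patterns was detected; it does not prove that every other value in the section is correct.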

Environment

VMware NSX-T Data Center

Cause

During the upgrade to 3.2.x, some Corfu tables for object creation intent and object realization are merged.
If there are stale or missing values in the pre-upgrade Edge configuration tables, it can result in incorrect information in the resulting merged tables after upgrading.

Resolution

This issue is resolved in VMware NSX-T 3.2.2, available at Broadcom downloads.

If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.

Workaround:

Deploying a new Edge VM and replacing it in the Edge cluster can be used as a workaround on affected versions. As this is an issue with existing Edge data migration during upgrade to 3.2.x, new Edge VMs are not affected.
Workaround for Stale host_id:
- Identify the Edge nodes with stale host_id information.
- In the UI, select the Edge node, then choose Actions > Sync Edge Node Configuration to initiate an Edge configuration re-sync.
- Utilize the Edge refresh API endpoint to update the Edge node configuration.

> For versions 3.2.0/3.2.1: curl -v -k -u admin -H "Content-Type:application/json" -X POST "https://<MGR_IP>/api/v1/transport-nodes/<Edge_UUID>/refresh"

> For version 3.2.3 (the URL is quoted so that the shell does not interpret "?" and "&"): curl -v -k -u admin -H "Content-Type:application/json" -X POST "https://<MGR_IP>/api/v1/transport-nodes/<Edge_UUID>?action=refresh_node_configuration&resource_type=EdgeNode"
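
When scripting the refresh across several Edge nodes, the version-specific endpoints listed above can be captured in a small helper. The sketch below only builds the URL. The function name is hypothetical, and versions other than those named in this article are rejected rather than guessed.

```python
def refresh_url(mgr_ip: str, tn_id: str, version: str) -> str:
    """Return the refresh URL for the given NSX-T version, per this article."""
    base = f"https://{mgr_ip}/api/v1/transport-nodes/{tn_id}"
    # Compare on major.minor.maintenance only, so "3.2.1.2" matches "3.2.1".
    release = ".".join(version.split(".")[:3])
    if release in ("3.2.0", "3.2.1"):
        return f"{base}/refresh"
    if release == "3.2.3":
        return f"{base}?action=refresh_node_configuration&resource_type=EdgeNode"
    raise ValueError(f"no refresh endpoint documented here for version {version}")
```

The returned URL is intended for a POST request with the same credentials and headers as in the curl commands above.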

Additional Information

Impact/Risks:
Depending on which value in "vm_deployment_config" is incorrect, the impact ranges from purely cosmetic to BGP going down and affected Edge VMs failing to re-deploy after the upgrade.