NSX-T Edge Pre-check Errors and Upgrade is Stuck at 'switch

Products

VMware NSX

Issue/Introduction

During an NSX upgrade, after the pre-check stage, a message similar to the following is displayed for an Edge Node.

Edge node "UUID" vmId is not found on NSX Manager. Please refer to https://kb.vmware.com/s/article/90072

The Edge node remains stuck at the 'switch_os' step of the upgrade for a long time.
If the upgrade is retried, it then gets stuck at the 'download_os' step. This can also be seen by running get upgrade progress-status from the NSX admin CLI:

> get upgrade progress-status

***************************************************************************
Node Upgrade has been started. Please do not make any changes, until
the upgrade operation is complete. Run "get upgrade progress-status"
to show the progress of last upgrade step.
***************************************************************************
Upgrade steps:
download_os [2022-09-16 17:35:52 - 2022-09-16 17:36:21] SUCCESS
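
When several Edge nodes have to be checked, the step lines in this output can be read programmatically. The following is a minimal sketch in Python, assuming each finished step is printed as "name [start - end] STATUS" exactly as in the sample above. The function name parse_upgrade_steps is hypothetical, and the format of in-progress or failed steps is not shown in this article, so such lines may not match.

```python
import re

# Matches step lines such as:
#   download_os [2022-09-16 17:35:52 - 2022-09-16 17:36:21] SUCCESS
STEP_LINE = re.compile(
    r"^(?P<step>\w+)\s+\[(?P<start>.+?)\s-\s(?P<end>.+?)\]\s+(?P<status>\w+)$"
)

SAMPLE = """\
Upgrade steps:
download_os [2022-09-16 17:35:52 - 2022-09-16 17:36:21] SUCCESS
"""


def parse_upgrade_steps(cli_output):
    """Return a list of (step, status) tuples found in the CLI output."""
    steps = []
    for line in cli_output.splitlines():
        match = STEP_LINE.match(line.strip())
        if match:
            steps.append((match["step"], match["status"]))
    return steps


def step_reported(cli_output, step_name):
    """True if the named step appears in the output with any status."""
    return any(step == step_name for step, _ in parse_upgrade_steps(cli_output))


if __name__ == "__main__":
    print(parse_upgrade_steps(SAMPLE))
    print("switch_os reported:", step_reported(SAMPLE, "switch_os"))
```

With the sample above, only download_os is reported, which is consistent with a node that has not yet completed 'switch_os'.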

Environment

VMware NSX-T Data Center

VMware NSX

Cause

The 'vmMoref id' of the Edge virtual machine is incorrectly populated in Edge upgrade unit metadata.

Resolution

This issue is resolved in VMware NSX 4.1.0, available at Broadcom downloads.

This precheck error can be ignored if the Edges were deployed on an ESXi host that is not prepared for NSX and the Compute Manager is up and reachable.

If the Compute Manager which was used to create the Edge Transport Node is no longer registered with NSX, the error cannot be resolved by following the workarounds below. The Edges would need to be redeployed on a new Compute Manager in that case.

This behavior can result from either of two underlying issues, listed below. In the first, a warning is issued for Edge Nodes following the pre-checks. In the second, the Edge node upgrade is stuck indefinitely at the 'switch_os' step.

1. Vmid attribute is missing from EdgeNodeExternalconfig
2. EntityId (VmMoreF) is stale in the DeploymentUnitInstance database table

Workaround for Issue 1: "Vmid attribute is missing from EdgeNodeExternalconfig"

API Method (available only in NSX versions 3.2.3 through 4.1.0)

Note: If the API fix is not applicable to your version, a workaround that involves database changes is available. In that case, contact Broadcom Support, reference this KB article, and include these details along with the rest of the issue description.

Important: Ensure that a viable backup is available before performing the steps below.

1. Collect the Edge virtual machine vmID (VmMoreF Id) from vCenter:

a. Open the Edge virtual machine console from vCenter.
b. Obtain the vmID (VmMoreF Id) from the URL of the page that opens. It has the form vm-##, as in the "vm_id" value of the example reference output below.

2. Get the UUID of the Edge from the manager node CLI as the admin user:

nsxmgr> get nodes
3. Use an API call similar to the following to get the payload of the Edge:

GET https://<nsxMgrIp>/api/v1/transport-nodes/<edgeTnId>
Note: Replace <nsxMgrIp> with the FQDN or IP address of an NSX Manager node and <edgeTnId> with the node UUID value obtained in Step 2.
4. Create the payload for the POST API in the format below, using the output collected in Step 3 and the vmID collected in Step 1b.

{
  "vm_deployment_config": {
  },
  "node_user_settings": {
  },
  "node_settings": {
  },
  "vm_id": " "
}

Example reference output:

{
  "vm_deployment_config": {
    "vc_id": "c47f70db-####-####-####-###########",
    "compute_id": "domain-##",
    "storage_id": "datastore-###",
    "host_id": "host-##",
    "management_network_id": "dvportgroup-##",
    "management_port_subnets": [
      {
        "ip_addresses": [
          "192.168.#.#"
        ],
        "prefix_length": 24
      }
    ],
    "default_gateway_addresses": [
      "192.168.#.#"
    ],
    "data_network_ids": [
      "dvportgroup-###",
      "dvportgroup-###"
    ],
    "reservation_info": {
      "memory_reservation": {
        "reservation_percentage": 100
      },
      "cpu_reservation": {
        "reservation_in_shares": "HIGH_PRIORITY",
        "reservation_in_mhz": 0
      }
    },
    "resource_allocation": {
      "cpu_count": 4,
      "memory_allocation_in_mb": 8192
    },
    "placement_type": "VsphereDeploymentConfig"
  },
  "node_user_settings": {
    "cli_username": "admin"
  },
  "node_settings": {
    "hostname": "######",
    "search_domains": [
      "example.com"
    ],
    "dns_servers": [
      "192.168.#.#"
    ],
    "enable_ssh": true,
    "allow_ssh_root_login": true
  },
  "vm_id": "vm-##"
}
5. Execute an API call similar to the following:

POST https://<nsxMgrIp>/api/v1/transport-nodes/<edgeTnId>?action=addOrUpdatePlacementReferences

In the request body, use the payload drafted in Step 4, and replace <nsxMgrIp> and <edgeTnId> in the URL as in Step 3.
6. Retry the upgrade.
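
The GET, payload, and POST steps above can be scripted. The sketch below uses only the Python standard library and rests on several assumptions that are not stated in this article: HTTP basic authentication is used, the NSX Manager certificate is trusted by the client, and the three payload sections are located by searching the GET output for their key names, because the article does not say where they sit in the response. All function names are hypothetical. Review the generated payload before sending it, and take a backup first.

```python
import base64
import json
import urllib.request

# Sections that the POST body must carry, per the payload format above.
SECTIONS = ("vm_deployment_config", "node_user_settings", "node_settings")


def find_section(obj, key):
    """Return the first value stored under `key` anywhere in nested JSON data."""
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        children = list(obj.values())
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_section(child, key)
        if found is not None:
            return found
    return None


def build_placement_payload(transport_node, vm_id):
    """Build the POST body from the GET output and the vmID from vCenter."""
    payload = {}
    for key in SECTIONS:
        section = find_section(transport_node, key)
        if section is None:
            raise KeyError(f"{key} not found in the transport node output")
        payload[key] = section
    payload["vm_id"] = vm_id
    return payload


def nsx_request(method, url, username, password, body=None):
    """Send one JSON request using basic authentication (an assumption)."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    data = json.dumps(body).encode() if body is not None else None
    request = urllib.request.Request(url, data=data, method=method)
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(request) as response:
        text = response.read().decode()
    return json.loads(text) if text else None


def update_placement_references(nsx_mgr, edge_tn_id, vm_id, username, password):
    """GET the Edge transport node, build the payload, then POST it."""
    base = f"https://{nsx_mgr}/api/v1/transport-nodes/{edge_tn_id}"
    node = nsx_request("GET", base, username, password)
    payload = build_placement_payload(node, vm_id)
    url = f"{base}?action=addOrUpdatePlacementReferences"
    return nsx_request("POST", url, username, password, payload)
```

Keeping build_placement_payload separate from the network calls makes it possible to print and inspect the payload offline before the POST is issued.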

Workaround for Issue 2: "EntityId (VmMoreF) is stale in DeploymentUnitInstance table"

If you have identified the problem as Issue 2 above and the NSX version is between 3.2.3 and 4.1.0, the POST API workaround described under Workaround for Issue 1 also resolves it.

Note:

This API fix is not available in the following versions: 3.2.0.1, 3.2.1, 3.2.2, 4.0.1
If the API fix is not applicable to your version, a workaround that involves database changes is available. Contact Broadcom Support for assistance with it.
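
The version conditions above can be expressed as a small check. This is only a sketch in Python that mechanically encodes the statements in this article (available from 3.2.3 through 4.1.0, not available in 3.2.0.1, 3.2.1, 3.2.2, and 4.0.1). It assumes that only the first three components of the version string matter, which the article does not state, and the function name api_fix_available is hypothetical. When in doubt, confirm with Broadcom Support.

```python
# Versions listed in this article as lacking the API fix, reduced to
# their first three components. 3.2.0.1 is covered by the lower bound.
EXCLUDED = {(3, 2, 1), (3, 2, 2), (4, 0, 1)}
LOWEST = (3, 2, 3)
HIGHEST = (4, 1, 0)


def normalize(version):
    """Turn '3.2.3.1' into (3, 2, 3); pad short versions with zeros."""
    parts = tuple(int(part) for part in version.strip().split("."))
    return (parts + (0, 0, 0))[:3]


def api_fix_available(version):
    """True if the article's stated conditions allow the API workaround."""
    key = normalize(version)
    if key in EXCLUDED:
        return False
    return LOWEST <= key <= HIGHEST


if __name__ == "__main__":
    for candidate in ("3.2.0.1", "3.2.2", "3.2.3", "4.0.1", "4.1.0"):
        print(candidate, api_fix_available(candidate))
```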

Additional Information

If you are contacting Broadcom support about this issue, please provide the following:

NSX Edge log bundles for affected Edges in the Edge Cluster
Ensure log date range covers the full date of the event(s) being investigated. When in doubt, retrieve logs for all time.
NSX Manager log bundles
ESXi host log bundles for all hosts supporting affected Edge VMs
Text of any error messages seen in NSX GUI or command lines pertinent to the investigation

Handling Log Bundles for offline review with Broadcom support