NVDS to VDS migration fails in VM

Products

VMware NSX

Issue/Introduction

You are migrating from NVDS to VDS and the migration fails for some hosts.
The CLI output shows the affected host stuck in the VM_RETRIVAL upgrade stage:

{"host": "<UUID>", "overall_state": "UPGRADE_IN_PROGRESS", "ip_address": "***.***.***.***", "upgrade_stage": "VM_RETRIVAL", "_protection": "NOT_PROTECTED"},
Some hosts might complete the migration successfully.
The /var/log/nsx-syslog.log file on the NSX Manager contains a timeout warning for the migration task:

/var/log/nsx-syslog.log
<Timestamp> cli.commands.manager.node_services INFO Tn [<TN_ID>] successfully entered VCMmode, start migrating...
<Timestamp> cli.commands.manager.node_services WARNING GetVDSMigrationStatus timeout, tn id: <TN_ID>, expected: SUCCESS
<Timestamp> cli.commands.manager.node_services INFO NVDS Migration successful TNs: []
<Timestamp> cli.commands.manager.node_services INFO NVDS Migration Failed TNs: {'<TN_ID>': 'GetVDSMigrationStatus: ...

The migration task for the ESXi host failed with "Failed to get the VMs on host host-****":

/var/log/proton/nsxapi.log
<Timestamp> INFO MigrateToCvdsTaskExecutor3 VMOperationImpl 12117 FABRIC [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Getting list of VMs in compute manager <CM UUID>
<Timestamp> WARN MigrateToCvdsTaskExecutor3 VMOperationImpl 12117 FABRIC [nsx@6876 comp="nsx-manager" level="WARNING" subcomp="manager"] Failed to get the VMs on host host-*******
<Timestamp> ERROR MigrateToCvdsTaskExecutor3 MigrateToCvdsTask 12117 FABRIC [nsx@6876 comp="nsx-manager" errorCode="PM100" level="ERROR" subcomp="manager"] MigrateToCvdsTask on host [<Transport node ID>] failed. Current stage VM_RETRIVAL, Aborting all remaining stages.
java.lang.NullPointerException: null
        at com.vmware.nsx.management.policy.migration.util.MigrateToCvdsTask.run(MigrateToCvdsTask.java:518) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_352]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_352]
        at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_352]
VMs, templates, or vSAN objects are still registered on the ESXi host after it has entered Maintenance Mode.
- In the vSphere Client, you can find the templates and the VMs suffixed with '(inaccessible)'.
  Select the ESXi host in [Inventory] and navigate to [VMs]. Check both [Virtual Machines] and [VM Templates].
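
To check quickly whether a log file contains the symptoms described above, a short script can scan for the three messages. The following is a minimal sketch, not a supported tool: the regular expressions are derived only from the log excerpts in this article, and the names SIGNATURES and find_signatures are hypothetical.

```python
import re

# Patterns derived from the log excerpts in this article (illustrative only).
SIGNATURES = {
    # /var/log/proton/nsxapi.log: the VM list could not be retrieved
    "vm_list_failure": re.compile(r"Failed to get the VMs on host (\S+)"),
    # /var/log/proton/nsxapi.log: the migration task aborted in VM_RETRIVAL
    "task_aborted": re.compile(
        r"MigrateToCvdsTask on host \[([^\]]+)\] failed\. Current stage VM_RETRIVAL"
    ),
    # /var/log/nsx-syslog.log: the status poll timed out
    "status_timeout": re.compile(r"GetVDSMigrationStatus timeout, tn id: ([^,]+),"),
}


def find_signatures(lines):
    """Return (signature name, captured host or TN id, line number) per match."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for name, pattern in SIGNATURES.items():
            match = pattern.search(line)
            if match:
                hits.append((name, match.group(1), number))
    return hits
```

For example, passing the lines of /var/log/proton/nsxapi.log (opened with errors="replace" to tolerate odd bytes) returns one tuple per matching line. An empty list suggests that the failure has a different cause.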

Environment

VMware NSX-T Data Center 3.x

Cause

A NullPointerException is thrown while NSX retrieves the list of VMs from an ESXi host. Templates and inaccessible VMs on the host are known to cause this exception.

Resolution

This is a known issue impacting VMware NSX.

Workaround:

1. Take the host out of Maintenance Mode (MM), then migrate all VM templates on the host to other hosts or remove them. Remove all inaccessible VMs from the inventory.

2. Clean up the old topology by calling the following REST API:
POST https://<nsx_manager_ip>/api/v1/nvds-urt?action=cleanup

3. Create a new precheck with the following API and note the precheck ID from the output:
POST https://<nsx_manager_ip>/api/v1/nvds-urt/precheck

4. Generate the URT topology with the following API, using the precheck ID from step 3:
GET https://<nsx_manager_ip>/api/v1/nvds-urt/topology/<precheck_id>

5. Apply the topology with the following API, using the output received from step 4 as the payload:
POST https://<nsx_manager_ip>/api/v1/nvds-urt/topology?action=apply

6. Retrigger the migration for the host with the following API:
POST https://<nsx_manager_ip>/api/v1/transport-nodes/<tn_id>?action=migrate_to_vds
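
The API calls in the workaround can be scripted. The sketch below is illustrative only and makes several assumptions that this article does not confirm: HTTP basic authentication, a "precheck_id" field in the precheck response, and a precheck that is already complete when the topology is requested (no polling is done). The helper names build_steps, call, and run_workaround are hypothetical. The manual cleanup of templates and inaccessible VMs must be completed first.

```python
import base64
import json
import urllib.request


def build_steps(manager, tn_id, precheck_id="<precheck_id>"):
    """Return the (method, URL) pairs for the API calls, in workaround order."""
    base = f"https://{manager}/api/v1"
    return [
        ("POST", f"{base}/nvds-urt?action=cleanup"),
        ("POST", f"{base}/nvds-urt/precheck"),
        ("GET", f"{base}/nvds-urt/topology/{precheck_id}"),
        ("POST", f"{base}/nvds-urt/topology?action=apply"),
        ("POST", f"{base}/transport-nodes/{tn_id}?action=migrate_to_vds"),
    ]


def call(method, url, user, password, payload=None, context=None):
    """Send one request and return the decoded JSON body, or None if empty."""
    # ASSUMPTION: basic authentication is enabled for the account used.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    data = json.dumps(payload).encode() if payload is not None else None
    request = urllib.request.Request(url, data=data, method=method)
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Content-Type", "application/json")
    # context is an optional ssl.SSLContext, e.g. one that trusts the manager CA.
    with urllib.request.urlopen(request, context=context) as response:
        body = response.read()
    return json.loads(body) if body else None


def run_workaround(manager, tn_id, user, password, context=None):
    """Run the cleanup, precheck, topology, apply, and migrate calls in order."""
    cleanup_step, precheck_step, _, apply_step, migrate_step = build_steps(manager, tn_id)
    call(*cleanup_step, user, password, context=context)
    precheck = call(*precheck_step, user, password, context=context)
    # ASSUMPTION: the response carries the ID in a "precheck_id" field.
    precheck_id = precheck["precheck_id"]
    topology_step = build_steps(manager, tn_id, precheck_id)[2]
    topology = call(*topology_step, user, password, context=context)
    # The topology output is sent back unchanged as the apply payload.
    call(*apply_step, user, password, payload=topology, context=context)
    call(*migrate_step, user, password, context=context)
```

Calling build_steps alone prints nothing and sends nothing, so it can be used to review the exact URLs before running anything against a production NSX Manager.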

Note: All versions up to and including 3.2.5 could potentially hit this issue.