Symptoms:
The upgrade of one of the local NSX clusters fails during the BridgeEndpointProfileRelationsMigrationTask data migration task.
The 'get upgrade progress-status' command shows the failure as below:
Upgrade steps:
download_os [2022-11-28 13:29:36 - 2022-11-28 13:29:56] SUCCESS
shutdown_manager [2022-11-28 13:30:00 - 2022-11-28 13:32:11] SUCCESS
install_os [2022-11-28 13:32:11 - 2022-11-28 13:33:02] SUCCESS
migrate_manager_config [2022-11-28 13:33:02 - 2022-11-28 13:33:07] SUCCESS
switch_os [2022-11-28 13:33:07 - 2022-11-28 13:33:12] SUCCESS
reboot [2022-11-28 13:33:12 - 2022-11-28 13:33:46] SUCCESS
run_migration_tool [2022-11-28 13:35:02 - 2022-11-28 13:36:43] FAILED
------ Output of last step start ------
Status:
2022-11-28 13:35:03.454191 Deleting datastore files
2022-11-28 13:35:03.514168 Copying old datastore files
2022-11-28 13:35:04.823565 Done copying old datastore files
2022-11-28 13:35:06.493375 Start Corfu server
2022-11-28 13:35:10.574856 Process corfu-server started
2022-11-28 13:36:40.295315 Error running logical data migration tool. return value 1, log file /var/log/proton/logical-migration.log
Overall Progress: (3/6)
---- (1) CCP: Completed [1 object(s)] (2022-11-28 13:35:10 - 2022-11-28 13:35:16) ----
--------------------------------------------------------------------------------------------
---- (2) Proton: Completed [52040 object(s)] (2022-11-28 01:35:31 - 2022-11-28 01:35:45) ----
--------------------------------------------------------------------------------------------
---- (3) Policy: Completed [38034 object(s)] (2022-11-28 01:35:57 - 2022-11-28 01:36:21) ----
--------------------------------------------------------------------------------------------
---- (4) Logical: 41% [10773 of 25803 object(s)] (2022-11-28 01:36:35 - ) ----
Currently Migrating: BridgeEndpointProfileRelationsMigrationTask 0% [0 of 0 objects] (2022-11-28 01:36:39 - )
--------------------------------------------------------------------------------------------
---- (5) CBM: Pending
--------------------------------------------------------------------------------------------
---- (6) UFO Checkpointing: Pending
--------------------------------------------------------------------------------------------
Stdout:
Starting Manager run_migration_tool script
Ending Manager run_migration_tool script
Troubleshooting: Upgrade has failed and retry may not work. Appliance OS is of a new version; however, UI will not be available. Please contact GSS to rollback the system to the previous version.
------ Output of last step end ------
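When the status output has been captured to a file or ticket, the failed step can be picked out programmatically. The following is a minimal Python sketch, assuming each step appears as "name [start - end] STATUS" as in the output above. The function names parse_steps and first_failed_step are hypothetical helpers and are not part of NSX.

```python
import re

# Matches "step_name [YYYY-MM-DD HH:MM:SS - YYYY-MM-DD HH:MM:SS] STATUS".
# Assumption: the status is a single upper-case word such as SUCCESS or FAILED.
STEP_RE = re.compile(
    r"(\w+) \[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - "
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] ([A-Z]+)"
)


def parse_steps(status_text):
    """Return (step_name, status) pairs in the order they appear."""
    return [(m.group(1), m.group(4)) for m in STEP_RE.finditer(status_text)]


def first_failed_step(status_text):
    """Return the first step whose status is not SUCCESS, or None."""
    for name, status in parse_steps(status_text):
        if status != "SUCCESS":
            return name
    return None
```

For the output shown above, first_failed_step would return run_migration_tool, which is the step whose detailed output needs to be checked.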
The dry-run migration tool also reports a failure in data migration.
In the logical-migration log of the NSX Manager or of the dry-run tool, the following entries show that no Edge Node with index 1 was found:
2022-11-29T07:29:45.790Z INFO main MigrationTask 3014 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Edge Cluster ID in use: 25ff79ee-####-####-####-########d65
2022-11-29T07:29:45.790Z INFO main MigrationTask 3014 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Edge Cluster path: /global-infra/sites/PA-INFRA/enforcement-points/default/edge-clusters/25ff79ee-####-####-####-########d65
2022-11-29T07:29:45.790Z INFO main MigrationTask 3014 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Edge Node index in use: 1
2022-11-29T07:29:45.790Z WARN main UfoCorfuTableMigrator 3014 - [nsx@6876 comp="nsx-manager" level="WARNING" subcomp="manager"] ERROR while running logical migration MappingDetails{modelName='null', migrationType=null, reason='This task will fix/add BridgeEndpointProfile relationships with Edge Cluster and Edge Node', customMigratorClassName='com.vmware.nsx.management.migration.impl.BridgeEndpointProfileRelationsMigrationTask', fieldMappings=null, targetProtoName='null', requiresCustomCode='false', owner='null', apiToTest='null'}
java.lang.RuntimeException: No Edge Node found with index 1
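To confirm that a given log matches this signature, a copy of /var/log/proton/logical-migration.log can be scanned offline. The Python sketch below is illustrative only: it assumes the exception text and task name appear exactly as in the excerpt above, and find_signature and scan_file are hypothetical helper names.

```python
import re

TASK = "BridgeEndpointProfileRelationsMigrationTask"
ERROR_RE = re.compile(r"No Edge Node found with index (\d+)")


def find_signature(log_text):
    """Return the sorted Edge Node indexes reported missing.

    Returns an empty list when the failing task name is not in the log,
    so an unrelated "No Edge Node found" message is not counted as a match.
    """
    if TASK not in log_text:
        return []
    return sorted({int(m.group(1)) for m in ERROR_RE.finditer(log_text)})


def scan_file(path):
    """Read a log file copy and apply find_signature to its contents."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return find_signature(fh.read())
```

A non-empty result from scan_file on the copied log indicates that the log contains both the task name and the missing-index error described in this article.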
Cause:
During Proton migration, NSX creates PolicyEdgeNodes with an index in the path without checking for existing entries.
During policy migration, NSX checks whether any PolicyEdgeNodes with a UUID in the path exist. If so, it copies the data from the corresponding PolicyEdgeNode with an index and then deletes that index-based copy. The two are matched on the edgeTnId stored with the PolicyEdgeNode. With this logic, NSX may end up deleting the Global Manager (GM) copy instead of the Local Manager (LM) copy.
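The effect of matching on edgeTnId alone can be illustrated with a toy model. This Python sketch is not the NSX implementation: the PolicyEdgeNode fields, the origin label, and the merge_by_tn_id function are all assumptions made for illustration. It only shows that when a GM copy and an LM copy carry the same edgeTnId, a match on that ID cannot tell them apart.

```python
from dataclasses import dataclass


@dataclass
class PolicyEdgeNode:
    path_id: str      # last path segment: an index such as "1", or a UUID
    edge_tn_id: str   # transport node ID used for matching
    origin: str       # "GM" or "LM"; hypothetical label for this illustration


def merge_by_tn_id(uuid_node, index_nodes):
    """Remove the first index-based node whose edge_tn_id matches uuid_node.

    Returns (remaining_nodes, removed_node). The match ignores origin,
    which is the weakness this toy model is meant to show.
    """
    for i, node in enumerate(index_nodes):
        if node.edge_tn_id == uuid_node.edge_tn_id:
            return index_nodes[:i] + index_nodes[i + 1:], node
    return index_nodes, None


# Two index-based copies share one edge_tn_id (hypothetical data).
index_nodes = [
    PolicyEdgeNode("1", "tn-a", "GM"),
    PolicyEdgeNode("1", "tn-a", "LM"),
]
uuid_node = PolicyEdgeNode("uuid-a", "tn-a", "LM")
remaining, removed = merge_by_tn_id(uuid_node, index_nodes)
```

In this toy data the GM copy is the one removed, because it is encountered first and nothing in the match distinguishes it from the LM copy.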
Resolution:
This issue is resolved in NSX-T Data Center 3.2.3 and 4.0.2 onwards.
Workaround:
Reach out to VMware NSX support for verification.