After replacing Managers or while running Upgrade prechecks, Repo_Sync is Failed

Article ID: 322436

Products

VMware NSX, VMware Avi Load Balancer

Issue/Introduction

  • After one or more NSX Managers are deployed or redeployed, REPO_SYNC is in a Failed state.
  • Entries similar to the below are observed in the NSX Manager log /var/log/proton/nsxapi.log:

    2024-02-24T12:00:26.882Z  INFO RepoSyncThread-1707748646882 RepoSyncServiceImpl 4841 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Starting Repo sync thread RepoSyncThread-12345678964321

    2024-02-24T12:00:32.208Z  INFO RepoSyncThread-1707748646882 RepoSyncFileHelper 4841 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Command to get server info for https://#.#.#.#:443/repository/4.1.1.0.0.22224312/HostComponents/rhel77_x86_64_baremetal_server/upgrade.sh returned result CommandResultImpl [commandName=null, pid=2227086, status=SUCCESS, errorCode=0, errorMessage=null, commandOutput=HTTP/1.1 404 Not Found

    2024-02-24T12:00:11.583Z  INFO RepoSyncThread-1707748646882 RepoSyncFileHelper 4841 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Command to check if remote file exists for https://#.#.#.#:443/repository/4.1.1.0.0.22224312/Manager/vmware-mount/libvixMntapi.so.1 returned result CommandResultImpl [commandName=null, pid=2228965, status=SUCCESS, errorCode=0, errorMessage=null, commandOutput=HTTP/1.1 404 Not Found

    2024-02-24T12:00:11.583Z ERROR RepoSyncThread-1707748646882 RepoSyncServiceImpl 4841 SYSTEM [nsx@6876 comp="nsx-manager" errorCode="MP21057" level="ERROR" subcomp="manager"] Unable to start repository sync operation. See logs for more details.
  • While preparing for an upgrade, the Check Upgrade Readiness UI shows errors:
    "Upgrade-coordinator upgrade failed. Error - Repository Sync status is not success on node <node IP>."
    "Repository sync is not complete"

  • Entries similar to the below are observed in the NSX Manager log /var/log/syslog:
    2024-02-24T12:00:52.800Z NSX_Manager NSX 98866 SYSTEM [nsx@6876 comp="nsx-manager" errorCode="MP30487" level="ERROR" subcomp="upgrade-coordinator"] Repository sync is not successful on <Managers IPs>. Please ensure Repository Sync Status is successful on all MP cluster nodes.
    2024-02-24T12:00:52.800Z NSX_Manager NSX 98866 SYSTEM [nsx@6876 comp="nsx-manager" errorCode="MP30040" level="ERROR" subcomp="upgrade-coordinator"] Error while updating upgrade-coordinator due to error Repository Sync status is not success on node <Managers IPs>. Please ensure Repository Sync status is success on all MP nodes before proceeding.
  • After replacing an NSX Manager, entries similar to the below are observed:
    2025-01-07T21:34:50.640Z  INFO RepoSyncResultTsdbListener-2-1 RepoSyncResultTsdbListener 5032 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] perform FullSync in RepoSyncResultTsdbListener, repoSyncResultMsg managed_resource {
    }
    status: REPO_SYNC_STATUS_FAILED
    status_message {
    }failure_message {
      value: "Unable to connect to File /repository/4.2.1.0.0.24304122/Manager/dry-run/dry_run.py on source <Manager IP>. Please verify that file exists on source and install-upgrade service is up."
    }
    error_code: 21057

Environment

VMware NSX 4.1.0
VMware NSX 4.2
VMware NSX-T Data Center 3.2.x

Cause

This is a known issue impacting VMware NSX. It is caused by missing files in the /repository directory on each NSX Manager.

Resolution

This issue is resolved in VMware NSX 4.2.0.

Workaround:

Warning: this procedure involves the use of the "rm" command, which irreversibly removes files from the system.
Ensure backups are taken and the restore passphrase is known before proceeding.


Identifying the issue:

On each VMware NSX Manager appliance, check which directories are present in the /repository directory.
As root user, run: ls -l /repository
You may see one of the three results below:

  • If the environment has been upgraded, you expect to see both a from and a to version directory: one named for the previous VMware NSX version and one named for the current VMware NSX version, for example:
    • drwxrwx--- 7 uuc grepodir 4096 <date> 4.1.0.0.0.21332672
    • drwxrwx--- 7 uuc grepodir 4096 <date> 4.1.1.0.0.22224312
       
  • If the environment has not been upgraded, you expect to see only a from version directory, that is, a single directory named for the current VMware NSX version, for example:
    • drwxrwx--- 7 uuc grepodir 4096 <date> 4.1.0.0.0.21332672
  • In some instances, there may be no VMware NSX version directory in the repository at all. A quick way to compare all three managers is shown in the sketch below.
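
A minimal sketch for comparing /repository contents across the managers, assuming root SSH access between the appliances; the manager IPs are placeholders:

    # list /repository on each manager; replace MGR1_IP etc. with real addresses
    for mgr in MGR1_IP MGR2_IP MGR3_IP; do
      echo "== ${mgr} =="
      ssh root@"${mgr}" 'ls -l /repository'
    done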


Based on the above results, you will then need to complete one or more of the options below:

  1. If the environment was freshly deployed (not upgraded) and the from VMware NSX directory is missing, complete the steps in 'Option: Deploy OVA file in /repository' below.
  2. If the environment was upgraded and the from version directory is missing, use the steps in 'Option: Deploy MUB file in /repository' below.
  3. If the environment was upgraded and the to VMware NSX directory is missing, use the steps in 'Option: Deploy MUB file in /repository' below.
  4. If the environment was upgraded and both the to and from VMware NSX directories are missing, complete the steps in both 'Option: Deploy MUB file in /repository' and 'Option: Deploy OVA file in /repository' below.
  5. If the required files are present for both the to and from versions but were copied over incorrectly, the correct permissions may simply be missing; in this case, follow the 'Option: Deploy MUB file in /repository' guide below from step 8 onwards.

Option: Correcting user and group permissions recursively for the /repository directory after copying (scp) it from a known good source manager.

The user and group for the whole of the /repository directory should be user uuc and group grepodir, for the directory and all subdirectories and files.
The permissions should be rwx for both user and group (mode 770).
This was not the case when the directory was copied with scp to the newly replaced manager(s).
To ensure the correct user, group, and permissions, run the following commands at the CLI of each replacement manager.

Copy the /repository directory to the new manager.

Open an SSH session to the known good manager.
#scp -r /repository <remote User>@<IP of Remote Server>:/
Example command (the root user and the address A.B.C.D are placeholders):
#scp -r /repository root@A.B.C.D:/

This command copies the /repository directory recursively to the root directory (/) of host A.B.C.D.
Now the user, group, and permissions will need to be checked and corrected.

This will recursively set the user and group:
#chown -R uuc:grepodir /repository

This will recursively set the required permissions:
#chmod -R 770 /repository
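
A quick verification sketch: after running the chown and chmod above, the command below should produce no output; any paths printed still have the wrong owner or group.

    # list anything under /repository not owned by uuc:grepodir
    find /repository \( ! -user uuc -o ! -group grepodir \) -print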


Example:

In one case, the "Unable to connect to File .../dry_run.py" error shown in the Issue/Introduction section was corrected by setting these attributes.
Check that the REPO_SYNC Failed state has been cleared.

 

Option: Deploy MUB file in /repository:

  1. Download the VMware-NSX-upgrade-bundle-<version>.mub MUB file following these instructions: Download Broadcom products and software
       The downloaded version should match the version reported as NOT found in the logs; in this example, 4.1.1.0.0.22224312.
  2. To identify the Orchestrator node, log into any Manager as admin and run: 

    nsx-mngr> get service install-upgrade
    Service name:      install-upgrade
    Service state:     stopped
    Enabled on:        #.#.#.#   <<< orchestrator node
  3. Copy the downloaded MUB file to the /image directory of the orchestrator node.
  4. As root user, extract MUB file on the orchestrator node:

    # cd /image
    # tar -xf VMware-NSX-upgrade-bundle-<version>.mub
  5. This will create a new file with the same name and .tar.gz extension.
  6. Delete the folder for your current version under /repository.
     For example, in this case the system runs 4.1.1:

    # rm -rf /repository/4.1.1.0.0.22224312

  7. Extract tar.gz to /repository

    # tar -xzf /image/VMware-NSX-upgrade-bundle-<version>.tar.gz -C /repository

  8. Set proper permissions and ownership of the /repository files by executing the following:

    /opt/vmware/proton-tomcat/bin/reposync_helper.sh

  9. From the UI, resolve REPO_SYNC on the orchestrator node: navigate to System -> Appliances -> View Details, click Resolve for REPO_SYNC, and wait for this to complete.
  10. Once completed, click Resolve for each of the other 2 Managers.
  11. Clean up the downloaded MUB file and extracted tar.gz files from /image:

    rm -f /image/VMware-NSX-upgrade-bundle-<version>.mub
    rm -f /image/VMware-NSX-upgrade-bundle-<version>.tar.gz
    rm -f /image/VMware-NSX-upgrade-bundle-<version>.tar.gz.sig
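
As an optional sanity check of the extracted repository, the paths below use this article's example version and a file the logs reported as missing:

    # the version directory and its contents should be owned by uuc:grepodir
    ls -ld /repository/4.1.1.0.0.22224312
    ls -l /repository/4.1.1.0.0.22224312/Manager/vmware-mount/libvixMntapi.so.1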


Option: Deploy OVA file in /repository:

  1. Download the nsx-unified-appliance-<version>.ova file following these instructions: Download Broadcom products and software. The downloaded version should match the version identified as missing in the 'Identifying the issue' section above.
  2. Deploy this manager as a separate appliance in vCenter and do not connect to the cluster.
  3. From this newly deployed manager, copy the /repository/<version> directory to all 3 existing managers that are missing the directory (see the sketch after this list).
  4. As root user, run the command /opt/vmware/proton-tomcat/bin/reposync_helper.sh on all 3 existing managers, not the newly deployed one.
  5. From the UI, resolve REPO_SYNC on the orchestrator node: navigate to System -> Appliances -> View Details, click Resolve for REPO_SYNC, and wait for this to complete.
  6. Now resolve the repo-sync failure on the other 2 nodes from the System -> Appliances page and wait for this to complete.
  7. The newly deployed manager can now be powered off and deleted once REPO_SYNC is working.
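
A minimal sketch for step 3, run from the temporary appliance; root SSH access and the manager addresses are assumptions:

    # replace VER and the manager IPs with real values
    VER=4.1.0.0.0.21332672
    for mgr in MGR1_IP MGR2_IP MGR3_IP; do
      scp -r /repository/${VER} root@"${mgr}":/repository/
    done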
     

Option: Advanced LB (AVI):

  • It is possible for this same issue to be caused by NSX ALB files missing from the repository.
    This typically occurs if NSX ALB was at one time deployed but later removed. If a user manually deletes the ALB files from the repository, for example to free disk space, it can cause this sync failure. The logs will explicitly refer to ALB files, e.g.:

2024-03-19T09:41:34.557Z  INFO RepoSyncThread-1710841232019 RepoSyncFileHelper 85527 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Command to get server info for https://#.#.#.#:443/repository/21.1.2-9124/Alb_controller/ovf/controller.cert returned result CommandResultImpl [commandName=null, pid=1677285, status=SUCCESS, errorCode=0, errorMessage=null, commandOutput=HTTP/1.1 404 Not Found
2024-03-19T09:42:08.746Z  INFO RepoSyncThread-1710841232019 RepoSyncFileHelper 85527 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Command to get server info for https://#.#.#.#:443/repository/22.1.6-9191/Alb_controller/ovf/controller-disk1.vmdk returned result CommandResultImpl [commandName=null, pid=1677876, status=SUCCESS, errorCode=0, errorMessage=null, commandOutput=HTTP/1.1 404 Not Found

/var/log/proton/nsxapi.log

2024-05-29T14:32:15.898Z INFO http-nio-127.0.0.1-7440-exec-23 RepoSyncServiceImpl 117206 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" reqId="<UUID>" subcomp="manager" username="uproton"] Starting Repository sync process, current result is RepoSyncResult [nodeId=<NODE UUID>, status=FAILED, statusMessage=, failureMessage=Unable to connect to File /repository/21.1.2-9124/Alb_controller/ovf/controller.ovf on source #.#.#.#. Please verify that file exists on source and install-upgrade service is up., errorCode=21057, percentage=0.0]

  1. Identify the NSX ALB version; in the example above it is 21.1.2.
  2. Download the NSX ALB Controller OVA following these instructions: Download Broadcom products and software, and copy it to the orchestrator node.
  3. Create the directory if it does not exist:

    # mkdir -p /repository/21.1.2-9124/Alb_controller/ovf

  4. Extract the OVA files into the ovf directory created above (so the files land where the logged URLs expect them):

    # tar -xvf /image/Controller.ova -C /repository/21.1.2-9124/Alb_controller/ovf

  5. Ensure the following 4 files are present in /repository/21.1.2-9124/Alb_controller/ovf (a quick check is sketched after this list):

     controller.ovf
     controller.mf
     controller.cert
     controller-disk1.vmdk

  6. Set proper permissions and ownership of the /repository files by executing the following:

    /opt/vmware/proton-tomcat/bin/reposync_helper.sh

  7. From the UI, resolve REPO_SYNC on the orchestrator node: navigate to System -> Appliances -> View Details and click Resolve for REPO_SYNC.
  8. Once completed, repeat for each of the other 2 Managers.
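
A minimal check for step 5, using the example version 21.1.2-9124:

    # report any of the four expected controller files that are missing
    for f in controller.ovf controller.mf controller.cert controller-disk1.vmdk; do
      [ -f "/repository/21.1.2-9124/Alb_controller/ovf/${f}" ] || echo "MISSING: ${f}"
    done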

Alternate option for the ALB controller OVA file if the customer does not intend to use ALB:

The ALB controller file check can be bypassed during repo sync by resetting the AlbControllerVmFabricModule values to default, following the steps below:

  1. Remove the Alb directory from /repository using:

        # rm -rf /repository/21.1.2-9124

  2. Get the ALB fabric ID with the below API call, noting the ID in the entry where "fabric_module_name" : "AlbControllerVmFabricModule":
    • GET https://<nsx-manager-ip>/api/v1/fabric/modules
  3. Get the ALB details with the below API call:
     
    • GET https://<nsx-manager-ip>/api/v1/fabric/modules/<alb_fabric_id>
  4. Reset the values of 'AlbControllerVmFabricModule' using the below PUT API call:

    • PUT https://<nsx-manager-ip>/api/v1/fabric/modules/<alb-fabric-id> with the header "Content-Type: application/json" and the following request body:
      {
        "fabric_module_name" : "AlbControllerVmFabricModule",
        "current_version" : "1.0",
        "deployment_specs" : [ {
          "fabric_module_version" : "1.0",
          "versioned_deployment_specs" : [ {
            "host_version" : "",
            "service_vm_ovf_url" : [ "ALB_CONTROLLER_OVF" ],
            "host_type" : "ESXI"
          } ]
        } ],
        "source_authentication_mode" : "NO_AUTHENTICATION",
        "disk_provisioning" : "THIN",
        "resource_type" : "FabricModule",
        "id" : "######-####-####-####-##########",
        "display_name" : "######-####-####-####-##########",
        "_revision" : 1
      }
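
A hedged curl sketch of the GET and PUT calls above; the admin credential, manager address, and body file name (alb_fabric_reset.json) are placeholders:

    # list fabric modules and note the AlbControllerVmFabricModule id
    curl -k -u admin "https://<nsx-manager-ip>/api/v1/fabric/modules"
    # reset the module; alb_fabric_reset.json holds the body shown above,
    # with "_revision" set to the value returned by the GET in step 3
    curl -k -u admin -X PUT "https://<nsx-manager-ip>/api/v1/fabric/modules/<alb-fabric-id>" \
      -H "Content-Type: application/json" -d @alb_fabric_reset.json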

If all of the above options still fail to bring REPO_SYNC to a SUCCESS state, try the workaround below.

1. Reset Upgrade Plan on ALL 3 Manager nodes

  • Log in to manager as admin
  • > set debug-mode
  • > rollback upgrade-coordinator

2. Check the upgrade status using the API

  •  GET https://<NSX_MGR>/api/v1/upgrade/summary

    The output should show the status as NOT STARTED, with the system version and target version matching the current running version of the manager nodes.
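
A hedged example of this call with curl; the admin credential and manager address are placeholders:

    curl -k -u admin "https://<NSX_MGR>/api/v1/upgrade/summary"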

3. Delete the upgrade plan from ALL 3 managers and the VIP IP

  • Log into all 3 managers as admin
  • Stop the install-upgrade service on ALL Managers
  • > stop service install-upgrade
  • Run the DELETE API call to remove the upgrade plan on ALL managers and the VIP IP

    DELETE https://<NSX_MGR1>/api/v1/upgrade-mgmt/plan

    DELETE https://<NSX_MGR2>/api/v1/upgrade-mgmt/plan

    DELETE https://<NSX_MGR3>/api/v1/upgrade-mgmt/plan

    DELETE https://<NSX_MGR(VIP-IP)>/api/v1/upgrade-mgmt/plan

  • Start install-upgrade service on Orchestrator node only
    > start service install-upgrade
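
A minimal sketch of the DELETE calls with curl; the admin credential and node addresses are placeholders:

    for node in <NSX_MGR1> <NSX_MGR2> <NSX_MGR3> <NSX_VIP>; do
      curl -k -u admin -X DELETE "https://${node}/api/v1/upgrade-mgmt/plan"
    done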

4. Confirm /repository only contains the current version of NSX on each manager node

5. Copy (for example, with WinSCP) the target version MUB file to the /image directory on ALL Managers

6. Extract MUB files on ALL Manager nodes

# cd /image
# tar -xf VMware-NSX-upgrade-bundle-<version>.mub

This will create a new file with the same name and .tar.gz extension.

7. Extract tar.gz to /repository

# tar -xzf /image/VMware-NSX-upgrade-bundle-<version>.tar.gz -C /repository

8. Change permissions of the extracted bundle in /repository on ALL manager nodes (the version directory shown is this article's example)

# chmod -R 777 /repository/4.1.1.0.0.22224312

9. Set proper permissions and ownership of the /repository files by executing the following:

/opt/vmware/proton-tomcat/bin/reposync_helper.sh

10. From the UI, resolve REPO_SYNC on the orchestrator node: navigate to System -> Appliances -> View Details, click Resolve for REPO_SYNC, and wait for this to complete.

11. Once completed, click Resolve for each of the other 2 Managers.

12. Clean up the downloaded MUB file and extracted tar.gz files from /image:

rm -f /image/VMware-NSX-upgrade-bundle-<version>.mub
rm -f /image/VMware-NSX-upgrade-bundle-<version>.tar.gz
rm -f /image/VMware-NSX-upgrade-bundle-<version>.tar.gz.sig
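
To confirm the sync is progressing after clicking Resolve, repo sync activity can be followed in the manager log referenced earlier in this article:

    # watch repo sync messages on the manager
    tail -f /var/log/proton/nsxapi.log | grep -i reposync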

 

Additional Information

If you are contacting Broadcom support about this issue, please provide the following:

 

  • The current version of NSX.
  • The version being upgraded to.
  • The state of REPO_SYNC on all three managers.

Handling Log Bundles for offline review with Broadcom support