After replacing Managers or while running Upgrade prechecks, Repo_Sync is Failed

Article ID: 322436

Updated On:

Products

VMware NSX Networking

Issue/Introduction

Symptoms:

  • NSX 4.1.x
  • After 1 or more NSX Managers are deployed/redeployed, REPO_SYNC is in Failed state
  • NSX Manager log /var/log/proton/nsxapi.log
2024-02-24T12:00:26.882Z  INFO RepoSyncThread-1707748646882 RepoSyncServiceImpl 4841 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Starting Repo sync thread RepoSyncThread-12345678964321
2024-02-24T12:00:32.208Z  INFO RepoSyncThread-1707748646882 RepoSyncFileHelper 4841 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Command to get server info for https://xxx.xxx.xxx.xxx:443/repository/4.1.1.0.0.22224312/HostComponents/rhel77_x86_64_baremetal_server/upgrade.sh returned result CommandResultImpl [commandName=null, pid=2227086, status=SUCCESS, errorCode=0, errorMessage=null, commandOutput=HTTP/1.1 404 Not Found
2024-02-24T12:00:11.583Z  INFO RepoSyncThread-1707748646882 RepoSyncFileHelper 4841 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Command to check if remote file exists for https://xxx.xxx.xxx.xxx:443/repository/4.1.1.0.0.22224312/Manager/vmware-mount/libvixMntapi.so.1 returned result CommandResultImpl [commandName=null, pid=2228965, status=SUCCESS, errorCode=0, errorMessage=null, commandOutput=HTTP/1.1 404 Not Found
2024-02-24T12:00:11.583Z ERROR RepoSyncThread-1707748646882 RepoSyncServiceImpl 4841 SYSTEM [nsx@6876 comp="nsx-manager" errorCode="MP21057" level="ERROR" subcomp="manager"] Unable to start repository sync operation. See logs for more details.
  • While preparing for an upgrade, the Check Upgrade Readiness UI shows an error:
"Upgrade-coordinator upgrade failed. Error - Repository Sync status is not success on node <node IP>."
"Repository sync is not complete"
  • NSX Manager log /var/log/syslog
2024-02-24T12:00:52.800Z NSX_Manager NSX 98866 SYSTEM [nsx@6876 comp="nsx-manager" errorCode="MP30487" level="ERROR" subcomp="upgrade-coordinator"] Repository sync is not successful on <Managers IPs>. Please ensure Repository Sync Status is successful on all MP cluster nodes.
2024-02-24T12:00:52.800Z NSX_Manager NSX 98866 SYSTEM [nsx@6876 comp="nsx-manager" errorCode="MP30040" level="ERROR" subcomp="upgrade-coordinator"] Error while updating upgrade-coordinator due to error Repository Sync status is not success on node <Managers IPs>. Please ensure Repository Sync status is success on all MP nodes before proceeding..
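
These log entries can be located quickly from the CLI. A minimal sketch, run as root on each NSX Manager, searching for the error codes shown in the excerpts above:

# grep -E "MP21057|Unable to start repository sync" /var/log/proton/nsxapi.log
# grep -E "MP30487|MP30040" /var/log/syslog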



Environment

VMware NSX 4.1.0

Cause

This is a known issue impacting VMware NSX. It is caused by missing files in the /repository directory on each NSX Manager.

Resolution

Workaround:

Warning: this procedure involves the use of the "rm" command, which irreversibly removes files from the system.
Ensure backups are taken and the restore passphrase is known before proceeding.


Identifying the issue:

On each VMware NSX Manager Appliance, check which directories are present in the /repository directory:
As root user run: ls -l /repository
One of the three scenarios below may be seen:

  • If the environment has been upgraded, then we expect to see a from and to version directory structure, that is a directory with the previous VMware NSX version as the name and a directory with the current VMware NSX version as the name, for example:
    • drwxrwx--- 7 uuc grepodir 4096 <date> 4.1.0.0.0.21332672
    • drwxrwx--- 7 uuc grepodir 4096 <date> 4.1.1.0.0.22224312
       
  • If the environment has not been upgraded, then we expect to see a from version directory structure, that is a directory with the current VMware NSX version as the name, for example:
    • drwxrwx--- 7 uuc grepodir 4096 <date> 4.1.0.0.0.21332672
  • In some instances, there may be no VMware NSX version directory in the repository.
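
As a quick cross-check, the specific file reported as "404 Not Found" in nsxapi.log can be looked for directly on each Manager. A minimal sketch, assuming the path from the log excerpt above (substitute the path from your own logs):

# ls -l /repository
# ls -l /repository/4.1.1.0.0.22224312/Manager/vmware-mount/libvixMntapi.so.1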


Based on the above results, you will then need to complete one or more of the options below:

  1. If the environment was freshly deployed (not upgraded) and the from VMware NSX directory is missing, complete the steps in 'Option: Deploy OVA file in /repository' below.
  2. If the environment was upgraded and the from version directory is missing, use the steps in 'Option: Deploy MUB file in /repository' below.
  3. If the environment was upgraded and the to version directory is missing, use the steps in 'Option: Deploy MUB file in /repository' below.
  4. If the environment was upgraded and both the to and from VMware NSX directories are missing, complete the steps in 'Option: Deploy MUB file in /repository' and 'Option: Deploy OVA file in /repository' below.

Option: Deploy MUB file in /repository:

  1. Download the VMware-NSX-upgrade-bundle-<version>.mub file following these instructions: Download Broadcom products and software
       The downloaded version should match the version reported as NOT found in the logs; in this example, 4.1.1.0.0.22224312.
  2. To identify the Orchestrator node, log into any Manager as admin and run: 
       nsx-mngr> get service install-upgrade
       Service name:      install-upgrade
       Service state:     stopped
       Enabled on:        xxx.xxx.xxx.xxx   <<< orchestrator node
  3. Copy the downloaded MUB file to the /image directory of the orchestrator node.
  4. As root user, extract the MUB file on the orchestrator node (steps 4 to 8 are consolidated in a sketch after this list):
       # cd /image
       # tar -xf VMware-NSX-upgrade-bundle-<version>.mub
  5. This creates a new file with the same name and a .tar.gz extension.
  6. Delete the folder for your current version under /repository. For example, if the system runs 4.1.1:
       # rm -rf /repository/4.1.1.0.0.22224312
  7. Extract tar.gz to /repository
         # tar -xzf /image/VMware-NSX-upgrade-bundle-<version>.tar.gz -C /repository
  8. Set proper permissions and ownership of the /repository files by executing the following:
         /opt/vmware/proton-tomcat/bin/reposync_helper.sh
  9. From the UI, resolve REPO_SYNC on the orchestrator node: System -> Appliances -> View Details, click Resolve for REPO_SYNC and wait for it to complete.
  10. Once completed, repeat for each of the other 2 Managers.
  11. Clean up the downloaded MUB file and extracted tar.gz file from /image:
       rm -f /image/VMware-NSX-upgrade-bundle-<version>.mub
       rm -f /image/VMware-NSX-upgrade-bundle-<version>.tar.gz
       rm -f /image/VMware-NSX-upgrade-bundle-<version>.tar.gz.sig
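
For reference, a minimal consolidated sketch of steps 4 to 8 above, run as root on the orchestrator node and assuming the 4.1.1.0.0.22224312 build from the log excerpts (substitute your own version strings):

# cd /image
# tar -xf VMware-NSX-upgrade-bundle-<version>.mub
# rm -rf /repository/4.1.1.0.0.22224312
# tar -xzf /image/VMware-NSX-upgrade-bundle-<version>.tar.gz -C /repository
# /opt/vmware/proton-tomcat/bin/reposync_helper.sh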


Option: Deploy OVA file in /repository:

  1. Download the nsx-unified-appliance-<version>.ova OVA file following these instructions: Download Broadcom products and software. The downloaded version should match the version missing in the repository, as identified in the 'Identifying the issue' section above.
  2. Deploy this Manager as a separate standalone appliance in vCenter and do not join it to the existing cluster.
  3. From this newly deployed Manager, copy the /repository/<version> directory to all 3 existing Managers missing the directory (see the sketch after this list).
  4. As root user, run the command "/opt/vmware/proton-tomcat/bin/reposync_helper.sh" on all 3 existing Managers, not the newly deployed one.
  5. From the UI, resolve REPO_SYNC on the orchestrator node: System -> Appliances -> View Details, click Resolve for REPO_SYNC and wait for it to complete.
  6. Now resolve the repo-sync failure on the other 2 nodes from the System -> Appliances page and wait for this to complete.
  7. The newly deployed Manager can be powered off and deleted once REPO_SYNC is working.
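
The copy in step 3 can be done over the network. A minimal sketch, assuming root SSH login is enabled on the appliances; <version> and the Manager IP are placeholders to replace with your own values:

# scp -r /repository/<version> root@<existing-manager-ip>:/repository/
# ssh root@<existing-manager-ip> /opt/vmware/proton-tomcat/bin/reposync_helper.sh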
     

Option: Advanced LB (AVI):

It is possible for this same issue to be caused by NSX ALB files missing from the repository.
This typically occurs if NSX ALB was deployed at some point but later removed. If a user manually deletes the ALB files from the repository, for example to free disk space, this can cause the sync failure. The logs will explicitly refer to ALB files, e.g.

2024-03-19T09:41:34.557Z  INFO RepoSyncThread-1710841232019 RepoSyncFileHelper 85527 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Command to get server info for https://xxx.xxx.xxx.xxx:443/repository/22.1.6-9191/Alb_controller/ovf/controller.cert returned result CommandResultImpl [commandName=null, pid=1677285, status=SUCCESS, errorCode=0, errorMessage=null, commandOutput=HTTP/1.1 404 Not Found
2024-03-19T09:42:08.746Z  INFO RepoSyncThread-1710841232019 RepoSyncFileHelper 85527 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Command to get server info for https://xxx.xxx.xxx.xxx:443/repository/22.1.6-9191/Alb_controller/ovf/controller-disk1.vmdk returned result CommandResultImpl [commandName=null, pid=1677876, status=SUCCESS, errorCode=0, errorMessage=null, commandOutput=HTTP/1.1 404 Not Found

/var/log/proton/nsxapi.log

2024-05-29T14:32:15.898Z INFO http-nio-127.0.0.1-7440-exec-23 RepoSyncServiceImpl 117206 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" reqId="<UUID>" subcomp="manager" username="uproton"] Starting Repository sync process, current result is RepoSyncResult [nodeId=<NODE UUID>, status=FAILED, statusMessage=, failureMessage=Unable to connect to File /repository/21.1.2-9124/Alb_controller/ovf/controller.ovf on source xxx.xxx.xxx.yyy. Please verify that file exists on source and install-upgrade service is up., errorCode=21057, percentage=0.0]
  1. Identify the NSX ALB version from the repository path in the log messages; in the example above it is 22.1.6-9191.
  2. Download the NSX ALB Controller OVA following these instructions: Download Broadcom products and software, and copy it to the /image directory of the orchestrator node.
  3. Create the directory if it does not exist:
     # mkdir -p /repository/22.1.6-9191/Alb_controller/ovf
  4. Extract the OVA files into it:
          # tar -xvf /image/Controller.ova -C /repository/22.1.6-9191/Alb_controller/ovf
  5. Ensure there are 4 files

     controller.ovf
     controller.mf
     controller.cert
     controller-disk1.vmdk
  6. Set proper permissions and ownership of the /repository files by executing the following (steps 3 to 6 are consolidated in a sketch after this list):
         /opt/vmware/proton-tomcat/bin/reposync_helper.sh
  7. From the UI, resolve REPO_SYNC on the orchestrator node: System -> Appliances -> View Details, click Resolve for REPO_SYNC and wait for it to complete.
  8. Once completed, repeat for each of the other 2 Managers.
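
A minimal consolidated sketch of steps 3 to 6 above, run as root on the orchestrator node and assuming the 22.1.6-9191 version directory and the Controller.ova filename used in this example (substitute your own values):

# mkdir -p /repository/22.1.6-9191/Alb_controller/ovf
# tar -xvf /image/Controller.ova -C /repository/22.1.6-9191/Alb_controller/ovf
# ls -l /repository/22.1.6-9191/Alb_controller/ovf
# /opt/vmware/proton-tomcat/bin/reposync_helper.sh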

Alternate option for ALB controller ova file if the customer does not intend to use ALB:

The ALB controller file check can be bypassed during Repo sync by resetting the AlbControllerVmFabricModule values to defaults, following the steps below:

  1. Remove the Alb directory from /repository using:
        # rm -rf /repository/22.1.2-9086
  2. Get the current ALB fabric module details with the below API call (the <alb_fabric_id> is obtained in the next step):
        GET https://<nsx-manager-ip>/api/v1/fabric/modules/<alb_fabric_id>
  3. The ALB fabric ID can be obtained with the API:
        GET https://<nsx-manager-ip>/api/v1/fabric/modules
        Note the "id" of the entry where "fabric_module_name" : "AlbControllerVmFabricModule".
  4. Reset the values of 'AlbControllerVmFabricModule' using the below PUT API call, with the header "Content-Type: application/json" and the following body (a curl sketch of these calls follows the body below):
        PUT https://<nsx-manager-ip>/api/v1/fabric/modules/<alb-fabric-id>
    {
      "fabric_module_name" : "AlbControllerVmFabricModule",
      "current_version" : "1.0",
      "deployment_specs" : [ {
        "fabric_module_version" : "1.0",
        "versioned_deployment_specs" : [ {
          "host_version" : "",
          "service_vm_ovf_url" : [ "ALB_CONTROLLER_OVF" ],
          "host_type" : "ESXI"
        } ]
      } ],
      "source_authentication_mode" : "NO_AUTHENTICATION",
      "disk_provisioning" : "THIN",
      "resource_type" : "FabricModule",
      "id" : "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
      "display_name" : "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
      "_revision" : 1
    }
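
A minimal curl sketch of the API calls in steps 2 to 4, assuming admin credentials (curl prompts for the password) and that the JSON body above has been saved locally as alb_fabric_module.json; the filename is a placeholder:

# curl -k -u admin -X GET "https://<nsx-manager-ip>/api/v1/fabric/modules"
# curl -k -u admin -X GET "https://<nsx-manager-ip>/api/v1/fabric/modules/<alb-fabric-id>"
# curl -k -u admin -X PUT "https://<nsx-manager-ip>/api/v1/fabric/modules/<alb-fabric-id>" -H "Content-Type: application/json" -d @alb_fabric_module.json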