After replacing Managers or while running Upgrade prechecks, Repo_Sync is Failed

Article ID: 322436

Updated On:

Products

VMware NSX Networking

Issue/Introduction

Symptoms:

  • NSX 4.1.x
  • After 1 or more NSX Managers are deployed/redeployed, REPO_SYNC is in Failed state
  • NSX Manager log /var/log/proton/nsxapi.log
2024-02-24T12:00:26.882Z  INFO RepoSyncThread-1707748646882 RepoSyncServiceImpl 4841 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Starting Repo sync thread RepoSyncThread-12345678964321
2024-02-24T12:00:32.208Z  INFO RepoSyncThread-1707748646882 RepoSyncFileHelper 4841 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Command to get server info for https://xxx.xxx.xxx.xxx:443/repository/4.1.1.0.0.22224312/HostComponents/rhel77_x86_64_baremetal_server/upgrade.sh returned result CommandResultImpl [commandName=null, pid=2227086, status=SUCCESS, errorCode=0, errorMessage=null, commandOutput=HTTP/1.1 404 Not Found
2024-02-24T12:00:11.583Z  INFO RepoSyncThread-1707748646882 RepoSyncFileHelper 4841 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Command to check if remote file exists for https://xxx.xxx.xxx.xxx:443/repository/4.1.1.0.0.22224312/Manager/vmware-mount/libvixMntapi.so.1 returned result CommandResultImpl [commandName=null, pid=2228965, status=SUCCESS, errorCode=0, errorMessage=null, commandOutput=HTTP/1.1 404 Not Found
2024-02-24T12:00:11.583Z ERROR RepoSyncThread-1707748646882 RepoSyncServiceImpl 4841 SYSTEM [nsx@6876 comp="nsx-manager" errorCode="MP21057" level="ERROR" subcomp="manager"] Unable to start repository sync operation. See logs for more details.
  • While preparing for an upgrade, the Check Upgrade Readiness UI shows an error:
"Upgrade-coordinator upgrade failed. Error - Repository Sync status is not success on node <node IP>."
"Repository sync is not complete"
  • NSX Manager log /var/log/syslog
2024-02-24T12:00:52.800Z NSX_Manager NSX 98866 SYSTEM [nsx@6876 comp="nsx-manager" errorCode="MP30487" level="ERROR" subcomp="upgrade-coordinator"] Repository sync is not successful on <Managers IPs>. Please ensure Repository Sync Status is successful on all MP cluster nodes.
2024-02-24T12:00:52.800Z NSX_Manager NSX 98866 SYSTEM [nsx@6876 comp="nsx-manager" errorCode="MP30040" level="ERROR" subcomp="upgrade-coordinator"] Error while updating upgrade-coordinator due to error Repository Sync status is not success on node <Managers IPs>. Please ensure Repository Sync status is success on all MP nodes before proceeding..
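
These log entries can be located quickly from the CLI. A minimal sketch, run as root on each NSX Manager, searching for the error codes shown in the excerpts above:

# grep -E "MP21057|Unable to start repository sync" /var/log/proton/nsxapi.log
# grep -E "MP30487|MP30040" /var/log/syslog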



Environment

VMware NSX 4.1.0

Cause

This is a known issue impacting VMware NSX. It is caused by missing files in the /repository directory on each NSX Manager.

Resolution

Workaround:

Warning: this procedure involves the use of the "rm" command, which irreversibly removes files from the system.
Ensure backups are taken and the restore passphrase is known before proceeding.


Identifying the issue:

On each VMware NSX Manager Appliance, check which directories are present in the /repository directory:
As root user run: ls -l /repository
One of the three scenarios below may be seen:

  • If the environment has been upgraded, then we expect to see a from and to version directory structure, that is a directory with the previous VMware NSX version as the name and a directory with the current VMware NSX version as the name, for example:
    • drwxrwx--- 7 uuc grepodir 4096 <date> 4.1.0.0.0.21332672
    • drwxrwx--- 7 uuc grepodir 4096 <date> 4.1.1.0.0.22224312
       
  • If the environment has not been upgraded, then we expect to see a from version directory structure, that is a directory with the current VMware NSX version as the name, for example:
    • drwxrwx--- 7 uuc grepodir 4096 <date> 4.1.0.0.0.21332672
  • In some instances, there may be no VMware NSX version directory in the repository.
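
As a quick cross-check, the specific file reported as "404 Not Found" in nsxapi.log can be looked for directly on each Manager. A minimal sketch, assuming the path from the log excerpt above (substitute the path from your own logs):

# ls -l /repository
# ls -l /repository/4.1.1.0.0.22224312/Manager/vmware-mount/libvixMntapi.so.1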


Based on the above results, you will then need to complete one or more of the options below:

  1. If the environment was freshly deployed (not upgraded) and the from VMware NSX directory is missing, complete the steps in 'Option: Deploy OVA file in /repository' below.
  2. If the environment was upgraded and the from version directory is missing, use the steps in 'Option: Deploy MUB file in /repository' below.
  3. If the environment was upgraded and the to version directory is missing, use the steps in 'Option: Deploy MUB file in /repository' below.
  4. If the environment was upgraded and both the to and from VMware NSX directories are missing, complete the steps in 'Option: Deploy MUB file in /repository' and 'Option: Deploy OVA file in /repository' below.

Option: Deploy MUB file in /repository:

  1. Download the VMware-NSX-upgrade-bundle-<version>.mub file following these instructions: Download Broadcom products and software
       The downloaded version should match the version reported as NOT found in the logs; in this example, 4.1.1.0.0.22224312.
  2. To identify the Orchestrator node, log into any Manager as admin and run: 
       nsx-mngr> get service install-upgrade
       Service name:      install-upgrade
       Service state:     stopped
       Enabled on:        xxx.xxx.xxx.xxx   <<< orchestrator node
  3. Copy the downloaded MUB file to the /image directory of the orchestrator node.
  4. As root user, extract the MUB file on the orchestrator node (steps 4 to 8 are consolidated in a sketch after this list):
       # cd /image
       # tar -xf VMware-NSX-upgrade-bundle-<version>.mub
  5. This creates a new file with the same name and a .tar.gz extension.
  6. Delete the folder for your current version under /repository. For example, if the system runs 4.1.1:
       # rm -rf /repository/4.1.1.0.0.22224312
  7. Extract tar.gz to /repository
         # tar -xzf /image/VMware-NSX-upgrade-bundle-<version>.tar.gz -C /repository
  8. Set proper permissions and ownership of the /repository files by executing the following:
         /opt/vmware/proton-tomcat/bin/reposync_helper.sh
  9. From the UI, resolve REPO_SYNC on the orchestrator node: System -> Appliances -> View Details, click Resolve for REPO_SYNC and wait for it to complete.
  10. Once completed, repeat for each of the other 2 Managers.
  11. Clean up the downloaded MUB file and extracted tar.gz file from /image:
       rm -f /image/VMware-NSX-upgrade-bundle-<version>.mub
       rm -f /image/VMware-NSX-upgrade-bundle-<version>.tar.gz
       rm -f /image/VMware-NSX-upgrade-bundle-<version>.tar.gz.sig
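
For reference, a minimal consolidated sketch of steps 4 to 8 above, run as root on the orchestrator node and assuming the 4.1.1.0.0.22224312 build from the log excerpts (substitute your own version strings):

# cd /image
# tar -xf VMware-NSX-upgrade-bundle-<version>.mub
# rm -rf /repository/4.1.1.0.0.22224312
# tar -xzf /image/VMware-NSX-upgrade-bundle-<version>.tar.gz -C /repository
# /opt/vmware/proton-tomcat/bin/reposync_helper.sh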


Option: Deploy OVA file in /repository:

  1. Download the nsx-unified-appliance-<version>.ova OVA file following these instructions: Download Broadcom products and software. The downloaded version should match the version missing in the repository, as identified in the 'Identifying the issue' section above.
  2. Deploy this Manager as a separate standalone appliance in vCenter and do not join it to the existing cluster.
  3. From this newly deployed Manager, copy the /repository/<version> directory to all 3 existing Managers missing the directory (see the sketch after this list).
  4. As root user, run the command "/opt/vmware/proton-tomcat/bin/reposync_helper.sh" on all 3 existing Managers, not the newly deployed one.
  5. From the UI, resolve REPO_SYNC on the orchestrator node: System -> Appliances -> View Details, click Resolve for REPO_SYNC and wait for it to complete.
  6. Now resolve the repo-sync failure on the other 2 nodes from the System -> Appliances page and wait for this to complete.
  7. The newly deployed Manager can be powered off and deleted once REPO_SYNC is working.
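
The copy in step 3 can be done over the network. A minimal sketch, assuming root SSH login is enabled on the appliances; <version> and the Manager IP are placeholders to replace with your own values:

# scp -r /repository/<version> root@<existing-manager-ip>:/repository/
# ssh root@<existing-manager-ip> /opt/vmware/proton-tomcat/bin/reposync_helper.sh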
     

Option: Advanced LB (AVI):

It is possible for this same issue to be caused by NSX ALB files missing from the repository.
This typically occurs if NSX ALB was deployed at some point but later removed. If a user manually deletes the ALB files from the repository, for example to free disk space, this can cause the sync failure. The logs will explicitly refer to ALB files, e.g.

2024-03-19T09:41:34.557Z  INFO RepoSyncThread-1710841232019 RepoSyncFileHelper 85527 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Command to get server info for https://xxx.xxx.xxx.xxx:443/repository/22.1.6-9191/Alb_controller/ovf/controller.cert returned result CommandResultImpl [commandName=null, pid=1677285, status=SUCCESS, errorCode=0, errorMessage=null, commandOutput=HTTP/1.1 404 Not Found
2024-03-19T09:42:08.746Z  INFO RepoSyncThread-1710841232019 RepoSyncFileHelper 85527 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Command to get server info for https://xxx.xxx.xxx.xxx:443/repository/22.1.6-9191/Alb_controller/ovf/controller-disk1.vmdk returned result CommandResultImpl [commandName=null, pid=1677876, status=SUCCESS, errorCode=0, errorMessage=null, commandOutput=HTTP/1.1 404 Not Found

/var/log/proton/nsxapi.log

2024-05-29T14:32:15.898Z INFO http-nio-127.0.0.1-7440-exec-23 RepoSyncServiceImpl 117206 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" reqId="<UUID>" subcomp="manager" username="uproton"] Starting Repository sync process, current result is RepoSyncResult [nodeId=<NODE UUID>, status=FAILED, statusMessage=, failureMessage=Unable to connect to File /repository/21.1.2-9124/Alb_controller/ovf/controller.ovf on source xxx.xxx.xxx.yyy. Please verify that file exists on source and install-upgrade service is up., errorCode=21057, percentage=0.0]
  1. Identify the NSX ALB version from the repository path in the log messages; in the example above it is 22.1.6-9191.
  2. Download the NSX ALB Controller OVA following these instructions: Download Broadcom products and software, and copy it to the /image directory of the orchestrator node.
  3. Create the directory if it does not exist:
     # mkdir -p /repository/22.1.6-9191/Alb_controller/ovf
  4. Extract the OVA files into it:
          # tar -xvf /image/Controller.ova -C /repository/22.1.6-9191/Alb_controller/ovf
  5. Ensure there are 4 files

     controller.ovf
     controller.mf
     controller.cert
     controller-disk1.vmdk
  6. Set proper permissions and ownership of the /repository files by executing the following (steps 3 to 6 are consolidated in a sketch after this list):
         /opt/vmware/proton-tomcat/bin/reposync_helper.sh
  7. From the UI, resolve REPO_SYNC on the orchestrator node: System -> Appliances -> View Details, click Resolve for REPO_SYNC and wait for it to complete.
  8. Once completed, repeat for each of the other 2 Managers.
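
A minimal consolidated sketch of steps 3 to 6 above, run as root on the orchestrator node and assuming the 22.1.6-9191 version directory and the Controller.ova filename used in this example (substitute your own values):

# mkdir -p /repository/22.1.6-9191/Alb_controller/ovf
# tar -xvf /image/Controller.ova -C /repository/22.1.6-9191/Alb_controller/ovf
# ls -l /repository/22.1.6-9191/Alb_controller/ovf
# /opt/vmware/proton-tomcat/bin/reposync_helper.sh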

Alternate option for ALB controller ova file if the customer does not intend to use ALB:

The ALB controller file check can be bypassed during Repo sync by resetting the AlbControllerVmFabricModule values to defaults, following the steps below:

  1. Remove the Alb directory from /repository using:
        # rm -rf /repository/22.1.2-9086
  2. Get the current ALB fabric module details with the below API call (the <alb_fabric_id> is obtained in the next step):
        GET https://<nsx-manager-ip>/api/v1/fabric/modules/<alb_fabric_id>
  3. The ALB fabric ID can be obtained with the API:
        GET https://<nsx-manager-ip>/api/v1/fabric/modules
        Note the "id" of the entry where "fabric_module_name" : "AlbControllerVmFabricModule".
  4. Reset the values of 'AlbControllerVmFabricModule' using the below PUT API call, with the header "Content-Type: application/json" and the following body (a curl sketch of these calls follows the body below):
        PUT https://<nsx-manager-ip>/api/v1/fabric/modules/<alb-fabric-id>
    {
      "fabric_module_name" : "AlbControllerVmFabricModule",
      "current_version" : "1.0",
      "deployment_specs" : [ {
        "fabric_module_version" : "1.0",
        "versioned_deployment_specs" : [ {
          "host_version" : "",
          "service_vm_ovf_url" : [ "ALB_CONTROLLER_OVF" ],
          "host_type" : "ESXI"
        } ]
      } ],
      "source_authentication_mode" : "NO_AUTHENTICATION",
      "disk_provisioning" : "THIN",
      "resource_type" : "FabricModule",
      "id" : "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
      "display_name" : "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
      "_revision" : 1
    }
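
A minimal curl sketch of the API calls in steps 2 to 4, assuming admin credentials (curl prompts for the password) and that the JSON body above has been saved locally as alb_fabric_module.json; the filename is a placeholder:

# curl -k -u admin -X GET "https://<nsx-manager-ip>/api/v1/fabric/modules"
# curl -k -u admin -X GET "https://<nsx-manager-ip>/api/v1/fabric/modules/<alb-fabric-id>"
# curl -k -u admin -X PUT "https://<nsx-manager-ip>/api/v1/fabric/modules/<alb-fabric-id>" -H "Content-Type: application/json" -d @alb_fabric_module.json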