Global Manager Deployment fails with Installation Failed: "The cluster node VM failed to register itself with the MP within the allotted wait time"

Products

VMware NSX

Issue/Introduction

You are operating in a Federated NSX environment where the CARR script has been executed to remediate expired certificates across all Local Managers (LMs) and Global Managers (GMs).

After successfully applying all certificate fixes, it is observed that all Manager nodes (LMs and GMs) have unusually high uptimes.

When proceeding with a controlled reboot of the first Global Manager in the cluster, all services return to a running state except for the DATASTORE service. Despite waiting an extended period of time, the DATASTORE service does not start, and the Manager cluster remains in a DEGRADED state.

Re-running the CARR script in dry-run mode does not identify any additional certificate issues or remediation actions.

An attempt to delete and redeploy the affected third Manager node through the NSX UI fails at approximately 55% completion after around 40 minutes, returning the error: Installation Failed: "The cluster node VM failed to register itself with the MP within the allotted wait time"

A manual redeployment using the OVA template also fails. When deploying the Manager manually and executing the join-cluster command via CLI, the process fails with the same error, preventing the node from successfully joining the cluster.

When running get cluster status as admin on one of the healthy GMs, the failed Manager may show a status of UNKNOWN for the DATASTORE service while being absent from all other services.

Environment

VMware NSX Federation 4.1.2.1

Cause

In NSX 4.1, there is a known issue during certificate replacement caused by a race condition between the Certificate Replacement Task and the Periodic Sync Task. This condition was resolved in NSX 4.2.

Due to this race condition, the certificate may be successfully updated in the local keystore but fail to update in the Cluster Boot Manager (CBM) certificate table. This creates an inconsistent state where the keystore reflects the new certificate, while the cluster-level certificate metadata does not. As a result, cluster services may rely on outdated certificate information even though the keystore has been updated.

The CARR script mitigates this condition by synchronizing certificates from the keystore to all truststores across the cluster. In most scenarios, this restores consistency and resolves the issue. However, this mitigation is not fully resilient in certain workflows.

If a node is detached from the cluster and a new node is later introduced, the issue may reoccur. Truststore synchronization is triggered only when the cluster configuration version changes. Although a join operation updates the cluster configuration and revision, it does not always increment the configuration version. Because of this, the mechanism responsible for distributing peer certificates from the cluster configuration into each node’s local truststore may not execute.

When peer certificates are not properly propagated, nodes may not trust each other during inter-node communication. This results in TLS handshake failures, commonly observed as PKIX path building or certificate validation errors. Consequently, the join operation may fail, or the node may remain stuck in a JOINING or Install Failed state, leading to a DEGRADED cluster condition.
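To check whether a node is hitting this kind of trust failure, one option is to count matching lines in a log file. The sketch below is illustrative only: the function name count_tls_trust_errors is hypothetical, the log file to inspect is left to the operator, and the search patterns are common Java/TLS error strings rather than messages confirmed for any specific NSX log.

```shell
#!/bin/sh
# Hypothetical helper: count log lines that look like TLS trust failures.
# Usage: count_tls_trust_errors <log-file>
count_tls_trust_errors() {
    log_file=$1
    # grep -c prints the number of matching lines. It exits non-zero when
    # that number is 0, so "|| true" keeps the function's status clean.
    grep -c -i -E \
        'PKIX path building failed|unable to find valid certification path|certificate_unknown' \
        "$log_file" || true
}
```

A non-zero count on the joining node or on a healthy peer is consistent with the missing peer certificates described above, but it is not proof on its own.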

Resolution

Workaround

If the failed Global Manager (GM) deployment is part of the ACTIVE cluster, first perform a failover to the Standby site (if operationally possible). This stabilizes the ACTIVE GM cluster and allows the remediation steps below to be performed while the affected cluster is in STANDBY state.

Step 1: Remove the Failed Manager Node

From the NSX UI, select the failed third Manager node deployment.
Click DELETE Appliance and allow the removal process to complete.

Step 2: Redeploy a New Manager Node

Deploy a new NSX Manager VM using the OVA template from vCenter.
Power on the VM after successful deployment.
Do not attempt to join the cluster yet.

Step 3: Export Corfu Certificates from Healthy Managers

Perform the following steps on each healthy GM node in the cluster.

On Manager 1:

ssh root@<manager1-IP>

cd /config/cluster-manager/corfu/private/

keytool -export -rfc -alias "self" -keystore keystore.jks -storepass `cat keystore.password` 2>/dev/null > cert_manager1.pem

scp cert_manager1.pem root@<new-manager-IP>:/tmp/

On Manager 2:

ssh root@<manager2-IP>

cd /config/cluster-manager/corfu/private/

keytool -export -rfc -alias "self" -keystore keystore.jks -storepass `cat keystore.password` 2>/dev/null > cert_manager2.pem

scp cert_manager2.pem root@<new-manager-IP>:/tmp/
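Because the export command discards stderr, a failed export can leave an empty or non-certificate file behind without any visible error. A minimal sanity check before copying the file could look like the following; the function name is_pem_cert is hypothetical and the check only looks for PEM markers, it does not validate the certificate itself.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if the file looks like a PEM certificate.
# Usage: is_pem_cert <pem-file>
is_pem_cert() {
    pem_file=$1
    # Reject missing or empty files first.
    [ -s "$pem_file" ] || return 1
    # Require both PEM markers; -e keeps grep from parsing the
    # leading dashes as options.
    grep -q -e '-----BEGIN CERTIFICATE-----' "$pem_file" &&
        grep -q -e '-----END CERTIFICATE-----' "$pem_file"
}
```

For example, is_pem_cert cert_manager1.pem && scp cert_manager1.pem root@<new-manager-IP>:/tmp/ copies the file only when the check passes. The certificate details can also be inspected with keytool -printcert -file cert_manager1.pem.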

Step 4: Import Certificates into the Joining Node Truststores

SSH into the newly deployed Manager:

ssh root@<new-manager-IP>

For each of the following directories under /config/cluster-manager/, navigate to the public folder and import both certificates:

corfu
gm
messaging-manager
mp
upgrade-coordinator
ar
cluster-manager
idps-reporting
monitoring
site-manager
cm-inventory
ccp

Example (repeat per service directory):

cd /config/cluster-manager/<service>/public/

keytool -import -alias "manager1-cert" -file /tmp/cert_manager1.pem -keystore truststore.jks -storepass `cat truststore.password` -noprompt

keytool -import -alias "manager2-cert" -file /tmp/cert_manager2.pem -keystore truststore.jks -storepass `cat truststore.password` -noprompt
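Repeating the import by hand across twelve directories is error-prone, so the per-directory commands can be wrapped in a loop. The sketch below assumes every service directory uses the same truststore.jks and truststore.password file names shown in the example; the function name import_peer_certs, the SERVICES variable, and the choice to skip directories without a truststore are illustrative and not part of the documented procedure.

```shell
#!/bin/sh
# Service directories listed in this article, relative to the base directory.
SERVICES="corfu gm messaging-manager mp upgrade-coordinator ar cluster-manager idps-reporting monitoring site-manager cm-inventory ccp"

# Hypothetical helper: import each PEM file into every service truststore.
# Usage: import_peer_certs <base-dir> <pem-file> [<pem-file> ...]
import_peer_certs() {
    base_dir=$1
    shift
    for service in $SERVICES; do
        public_dir="$base_dir/$service/public"
        # Assumption: skip (with a warning) when no truststore is present.
        if [ ! -f "$public_dir/truststore.jks" ]; then
            echo "skip: no truststore in $public_dir" >&2
            continue
        fi
        store_pass=$(cat "$public_dir/truststore.password")
        for pem in "$@"; do
            # Derive the alias from the file name:
            # cert_manager1.pem becomes manager1-cert.
            name=$(basename "$pem" .pem)
            alias_name="${name#cert_}-cert"
            keytool -import -alias "$alias_name" -file "$pem" \
                -keystore "$public_dir/truststore.jks" \
                -storepass "$store_pass" -noprompt
        done
    done
}
```

On the new Manager this would be called as import_peer_certs /config/cluster-manager /tmp/cert_manager1.pem /tmp/cert_manager2.pem. Review any "skip" messages before continuing, and confirm an import with keytool -list -alias "manager1-cert" -keystore truststore.jks -storepass `cat truststore.password` in the relevant public folder.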

Step 5: Disable Certificate Expiry Check (If Required)

On all GM nodes, verify whether the following file exists:

/usr/share/corfu/conf/DISABLE_CERT_EXPIRY_CHECK

If it exists → No action required.
If it does not exist → Disable expiry check:

touch /usr/share/corfu/conf/DISABLE_CERT_EXPIRY_CHECK
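The check and the touch can be combined into one idempotent step that is safe to run on every GM node. This is a small sketch; the function name ensure_marker_file is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: create the marker file only if it is missing,
# and report which case applied.
# Usage: ensure_marker_file <path>
ensure_marker_file() {
    marker=$1
    if [ -e "$marker" ]; then
        echo "exists: $marker"
    else
        touch "$marker" && echo "created: $marker"
    fi
}
```

On each node it would be run as ensure_marker_file /usr/share/corfu/conf/DISABLE_CERT_EXPIRY_CHECK.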

Step 6: Restart Required Services

Restart Corfu on all Manager nodes:

/etc/init.d/corfu-server restart

On the newly deployed Manager, restart the following services:

/etc/init.d/proton restart
/etc/init.d/async-replicator-service restart
/etc/init.d/site-manager-service restart
/etc/init.d/global-manager restart

Step 7: Perform Manual Cluster Join

From the API Leader GM:

ssh root@<API_Leader_GM-IP>

su admin -c "get certificate api thumbprint"
su admin -c "get cluster status"

Collect:

API thumbprint
Cluster ID

From the Newly Deployed Node:

ssh root@<new-manager-IP>
su admin

Run:

join <API_Leader_GM-IP> cluster-id <cluster-id> username admin password <password> thumbprint <thumbprint>

After the manual join completes successfully, verify:

get cluster status shows STABLE
All services (including DATASTORE) are running
The cluster state is no longer DEGRADED
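The verification above can be approximated with a text check on saved command output. The sketch below assumes only that the status words STABLE, DEGRADED, and UNKNOWN appear literally in the output, as they are named in this article; the exact output layout is not assumed, and the function name cluster_output_looks_stable is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "get cluster status" output on stdin.
# Exit 0 only if the word STABLE appears and neither DEGRADED nor
# UNKNOWN appears anywhere in the text.
cluster_output_looks_stable() {
    status_text=$(cat)
    # -w matches whole words, so UNSTABLE does not count as STABLE.
    printf '%s\n' "$status_text" | grep -q -w 'STABLE' || return 1
    if printf '%s\n' "$status_text" | grep -q -w -E 'DEGRADED|UNKNOWN'; then
        return 1
    fi
    return 0
}
```

From a root shell this could be used as su admin -c "get cluster status" | cluster_output_looks_stable && echo "cluster looks stable". It is a coarse filter and does not replace reading the full output.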

This procedure restores truststore consistency and resolves the TLS handshake failure that prevents the Manager node from successfully joining the cluster.

Step 8: Perform Failover to ACTIVE GM cluster

In the Standby GM UI, under Location Manager, select Action > Make Active.
Wait until the failover succeeds and all Locations show as Synced and Successful.

Note:

After all services show as UP and the cluster status is STABLE for the newly joined Manager node, the REPO_SYNC service may appear in a Failed state. If this occurs, simply select Resolve to manually initiate repository synchronization.