You are operating in a Federated NSX environment where the CARR script has been executed to remediate expired certificates across all Local Managers (LMs) and Global Managers (GMs).
After successfully applying all certificate fixes, it is observed that all Manager nodes (LMs and GMs) have unusually high uptimes.
When proceeding with a controlled reboot of the first Global Manager in the cluster, all services return to a running state except for the DATASTORE service. Despite waiting an extended period of time, the DATASTORE service does not start, and the Manager cluster remains in a DEGRADED state.
Re-running the CARR script in dry-run mode does not identify any additional certificate issues or remediation actions.
An attempt to delete and redeploy the affected third Manager node through the NSX UI fails during the redeployment process at approximately 55% completion after around 40 minutes, returning the error below:
A manual redeployment using the OVA template also fails. When deploying the Manager manually and executing the join-cluster command via CLI, the process fails with the same error, preventing the node from successfully joining the cluster.
When running get cluster status as admin on one of the healthy GMs, the failed Manager may appear with a status of UNKNOWN under the DATASTORE service while not being listed under any of the other services.
VMware NSX Federation 4.1.2.1
In NSX 4.1, there is a known issue during certificate replacement caused by a race condition between the Certificate Replacement Task and the Periodic Sync Task. This condition was resolved in NSX 4.2.
Due to this race condition, the certificate may be successfully updated in the local keystore but fail to update in the Cluster Boot Manager (CBM) certificate table. This creates an inconsistent state where the keystore reflects the new certificate, while the cluster-level certificate metadata does not. As a result, cluster services may rely on outdated certificate information even though the keystore has been updated.
The CARR script mitigates this condition by synchronizing certificates from the keystore to all truststores across the cluster. In most scenarios, this restores consistency and resolves the issue. However, this mitigation is not fully resilient in certain workflows.
If a node is detached from the cluster and a new node is later introduced, the issue may reoccur. Truststore synchronization is triggered only when the cluster configuration version changes. Although a join operation updates the cluster configuration and revision, it does not always increment the configuration version. Because of this, the mechanism responsible for distributing peer certificates from the cluster configuration into each node’s local truststore may not execute.
When peer certificates are not properly propagated, nodes may not trust each other during inter-node communication. This results in TLS handshake failures, commonly observed as PKIX path building or certificate validation errors. Consequently, the join operation may fail, or the node may remain stuck in a JOINING or Install Failed state, leading to a DEGRADED cluster condition.
If the failed Global Manager (GM) deployment is part of the ACTIVE cluster, first perform a failover to the Standby site (if operationally possible). This stabilizes the ACTIVE GM cluster and allows the remediation steps below to be performed while the affected cluster is in STANDBY state.
From the NSX UI, select the failed third Manager node deployment.
Click DELETE Appliance and allow the removal process to complete.
Deploy a new NSX Manager VM using the OVA template from vCenter.
Power on the VM after successful deployment.
Do not attempt to join the cluster yet.
Perform the following steps on each healthy GM node in the cluster.
On Manager 1:
ssh root@<manager1-IP>
cd /config/cluster-manager/corfu/private/
keytool -export -rfc -alias "self" -keystore keystore.jks -storepass `cat keystore.password` 2>/dev/null > cert_manager1.pem
scp cert_manager1.pem root@<new-manager-IP>:/tmp/
On Manager 2:
ssh root@<manager2-IP>
cd /config/cluster-manager/corfu/private/
keytool -export -rfc -alias "self" -keystore keystore.jks -storepass `cat keystore.password` 2>/dev/null > cert_manager2.pem
scp cert_manager2.pem root@<new-manager-IP>:/tmp/
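Because the export commands above redirect stderr to /dev/null, a failed export can silently leave an empty .pem file. The following is a minimal sketch of a sanity check to run before copying each file; check_pem is a hypothetical helper name introduced here, not part of NSX, and it only confirms that the file is non-empty and carries PEM certificate markers.

```shell
# Hypothetical helper (not an NSX command): confirm an exported file
# is non-empty and contains PEM certificate begin/end markers.
check_pem() {
    f="$1"
    # -s is true only when the file exists and has a size greater than zero
    if [ ! -s "$f" ]; then
        echo "empty or missing: $f"
        return 1
    fi
    # "--" stops grep from treating the leading dashes as options
    if grep -q -- '-----BEGIN CERTIFICATE-----' "$f" &&
       grep -q -- '-----END CERTIFICATE-----' "$f"; then
        echo "ok: $f"
    else
        echo "not PEM: $f"
        return 1
    fi
}
```

For example, check_pem cert_manager1.pem && scp cert_manager1.pem root@<new-manager-IP>:/tmp/ copies the file only when the check passes. For a fuller inspection of the certificate contents, keytool -printcert -file cert_manager1.pem can be used.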
SSH into the newly deployed Manager:
ssh root@<new-manager-IP>
For each of the following directories under /config/cluster-manager/, navigate to the public folder and import both certificates:
corfu
gm
messaging-manager
mp
upgrade-coordinator
ar
cluster-manager
idps-reporting
monitoring
site-manager
cm-inventory
ccp
Example (repeat per service directory):
cd /config/cluster-manager/<service>/public/
keytool -import -alias "manager1-cert" -file /tmp/cert_manager1.pem -keystore truststore.jks -storepass `cat truststore.password` -noprompt
keytool -import -alias "manager2-cert" -file /tmp/cert_manager2.pem -keystore truststore.jks -storepass `cat truststore.password` -noprompt
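Repeating the two import commands across twelve directories by hand is error-prone, so the example above can be sketched as a loop. This is an illustration under assumptions, not a vendor-supplied script: the function name import_peer_certs and the overridable BASE and KEYTOOL variables are introduced here, and skipping directories that have no truststore.jks is a defensive choice rather than documented behavior.

```shell
# Sketch only: import both peer certificates into every service truststore.
# The service list and paths are taken from the steps above; BASE and KEYTOOL
# are hypothetical variables added so the loop can be pointed elsewhere.
BASE="${BASE:-/config/cluster-manager}"
KEYTOOL="${KEYTOOL:-keytool}"
SERVICES="corfu gm messaging-manager mp upgrade-coordinator ar cluster-manager idps-reporting monitoring site-manager cm-inventory ccp"

import_peer_certs() {
    for svc in $SERVICES; do
        dir="$BASE/$svc/public"
        # Assumption: a directory without a truststore is skipped, not created
        if [ ! -f "$dir/truststore.jks" ]; then
            echo "skip: $dir (no truststore.jks)"
            continue
        fi
        for n in 1 2; do
            # Same keytool invocation as the manual example, one per peer;
            # a non-zero exit (for instance, alias already present) is reported
            "$KEYTOOL" -import -alias "manager${n}-cert" \
                -file "/tmp/cert_manager${n}.pem" \
                -keystore "$dir/truststore.jks" \
                -storepass "$(cat "$dir/truststore.password")" -noprompt \
                || echo "failed: $svc manager${n}-cert"
        done
    done
}
```

Run import_peer_certs as root on the newly deployed Manager after both .pem files are in /tmp/. Review any skip: or failed: lines before continuing, since a truststore that did not receive both certificates can still cause the join to fail.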
On all GM nodes, verify whether the following file exists:
/usr/share/corfu/conf/DISABLE_CERT_EXPIRY_CHECK
If it exists → No action required.
If it does not exist → Disable expiry check:
touch /usr/share/corfu/conf/DISABLE_CERT_EXPIRY_CHECK
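The check-then-create logic above can be wrapped so it is safe to run repeatedly on every GM node. The sketch below assumes nothing beyond the file path given in this step; ensure_marker is a hypothetical helper name, and the optional argument exists only so the function can be tried against a scratch path first.

```shell
# Hypothetical helper: create the marker file only when it is absent.
# With no argument it uses the path from the step above.
ensure_marker() {
    marker="${1:-/usr/share/corfu/conf/DISABLE_CERT_EXPIRY_CHECK}"
    if [ -e "$marker" ]; then
        echo "exists: $marker"
    else
        touch "$marker" && echo "created: $marker"
    fi
}
```

Calling ensure_marker with no arguments as root on each GM node reports whether the file was already present or has just been created.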
Restart Corfu on all Manager nodes:
/etc/init.d/corfu-server restart
On the newly deployed Manager, restart the following services:
/etc/init.d/proton restart
/etc/init.d/async-replicator-service restart
/etc/init.d/site-manager-service restart
/etc/init.d/global-manager restart
From the API Leader GM:
ssh root@<API_Leader_GM-IP>
su admin -c "get certificate api thumbprint"
su admin -c "get cluster status"
Collect:
API thumbprint
Cluster ID
From the Newly Deployed Node:
ssh root@<new-manager-IP>
su admin
Run:
join <API_Leader_GM-IP> cluster-id <cluster-id> username admin password <password> thumbprint <thumbprint>
After the manual join completes successfully, verify:
get cluster status shows STABLE
All services (including DATASTORE) are running
The cluster state is no longer DEGRADED
This procedure restores truststore consistency and resolves the TLS handshake failure that prevents the Manager node from successfully joining the cluster.
Note:
After all services show as UP and the cluster status is STABLE for the newly joined Manager node, the REPO_SYNC service may appear in a Failed state. If this occurs, simply select Resolve to manually initiate repository synchronization.