During an upgrade of VMware NSX, the upgrade process fails on one or more NSX Manager nodes at the start_manager step.
Symptoms include:
Host and Edge nodes may upgrade successfully, but the NSX Manager cluster upgrade stalls or fails.
The Management Plane runs in a degraded state on the partially upgraded nodes.
Checking /var/log/cbm/tanuki.log or cbm.log reveals Netty exceptions and indicates services are down due to an inability to connect to the Corfu server.
Running keytool -list against the truststore.jks and keystore.jks reveals a SHA-256 certificate fingerprint mismatch between nodes.
VMware NSX 4.1.x
VMware NSX 4.2.x
This issue occurs due to a certificate mismatch between the keystore and truststore on the affected NSX Manager nodes. A race condition during a previous mass certificate update can prevent the new certificate from successfully committing to the local truststore.jks on specific nodes.
This mismatch remains dormant until a Cluster Boot Manager (CBM) restart is triggered (which occurs during the upgrade process). Upon restart, the SSL handshake between the manager services and the Corfu database fails, preventing services from initializing.
Workaround:
To resolve the issue, manually export the correct certificate from the functional node and import it into the truststore of the problematic nodes.
SSH into the functional NSX Manager node (the node with the correct, updated certificates) as root.
Export the correct certificate from the keystore to a temporary file:cd /config/cluster-manager/mp/privatekeytool -export -rfc -alias "self" -keystore keystore.jks -storepass $(cat keystore.password) 2>/dev/null > /tmp/cert_79.pem
Securely copy the /tmp/cert_79.pem file to the /tmp/ directory of the problematic NSX Manager nodes.
SSH into the problematic NSX Manager node(s) as root.
Import the exported certificate into the local truststore: keytool -import -alias "-79-cert" -file /tmp/cert_79.pem -keystore truststore.jks -storepass $(cat truststore.password) -noprompt 2>/dev/null
Verify the certificate fingerprints now match across the cluster.
Retry the NSX Manager upgrade from the NSX UI. All nodes should now successfully complete the start_manager step and bring services online.
Note:
To prevent this issue proactively, run the Certificate Alignment and Reporting/Remediation (CARR) script prior to initiating an upgrade. This script validates and remediates certificate synchronization issues across the cluster.
Note: The CARR script execution is integrated as a mandatory Pre-Upgrade Check (PUB) step in NSX versions 4.2.3.3, 4.2.4, and later.
No Data Plane (DP) impact occurs while applying the certificate import workaround.