NSX Manager upgrade fails at "start_manager" step due to certificate mismatch

Products

VMware NSX

Issue/Introduction

During an upgrade of VMware NSX, the upgrade process fails on one or more NSX Manager nodes at the start_manager step.

Symptoms include:

Host and Edge nodes may upgrade successfully, but the NSX Manager cluster upgrade stalls or fails.
The Management Plane runs in a degraded state on the partially upgraded nodes.
Checking /var/log/cbm/tanuki.log or cbm.log reveals Netty exceptions and indicates services are down due to an inability to connect to the Corfu server.
Running keytool -list against the truststore.jks and keystore.jks reveals a SHA-256 certificate fingerprint mismatch between nodes.
The following error may appear while one of the Managers is being upgraded, after which the upgrade does not continue:

error while upgrading upgrade unit: [MPP] Node upgrade failed : MP upgrade failed at start_manager step with error : /image/VMware-NSX-unified-appliance-4.2.3.0.0.24866352/scripts/start_manager.py:12: DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. Use setuptools or check PEP 632 for potential alternatives from distutils.version import LooseVersion ..
Following the steps in KB https://knowledge.broadcom.com/external/article?articleNumber=415679 and retrying the upgrade fails with the same error.
One or more Managers may have upgraded successfully before the error occurs on the next Manager scheduled for upgrade.
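
The fingerprint comparison mentioned in the symptoms can be scripted. The following is a minimal sketch and not part of the official procedure: the helper name extract_sha256_fingerprints and the /tmp/*_fp.txt file names are hypothetical, and it assumes keytool -list prints each SHA-256 fingerprint as 32 colon-separated hex octets. The block only defines a function; nothing runs until it is called.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with NSX): reads `keytool -list` output on
# stdin and prints each distinct SHA-256 fingerprint, upper-cased and sorted.
# Assumption: a SHA-256 fingerprint appears as 32 colon-separated hex octets.
extract_sha256_fingerprints() {
    grep -Eo '([0-9A-Fa-f]{2}:){31}[0-9A-Fa-f]{2}' \
        | tr 'abcdef' 'ABCDEF' \
        | sort -u
}

# Example usage on each NSX Manager node, from the directory holding the stores
# (output file names are illustrative only):
#   keytool -list -keystore keystore.jks -storepass "$(cat keystore.password)" 2>/dev/null \
#       | extract_sha256_fingerprints > /tmp/keystore_fp.txt
#   keytool -list -keystore truststore.jks -storepass "$(cat truststore.password)" 2>/dev/null \
#       | extract_sha256_fingerprints > /tmp/truststore_fp.txt
```

Comparing the resulting files between nodes (for example with diff) should show which node holds a fingerprint that the others do not.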

Environment

VMware NSX 4.1.x
VMware NSX 4.2.x

Cause

This issue occurs due to a certificate mismatch between the keystore and truststore on the affected NSX Manager nodes. A race condition during a previous mass certificate update can prevent the new certificate from being committed to the local truststore.jks on specific nodes.

This mismatch remains dormant until a Cluster Boot Manager (CBM) restart is triggered (which occurs during the upgrade process). Upon restart, the SSL handshake between the manager services and the Corfu database fails, preventing services from initializing.

Resolution

Workaround:

To resolve the issue, manually export the correct certificate from the functional node and import it into the truststore of the problematic nodes.

SSH into the functional NSX Manager node (the node with the correct, updated certificates) as root.
Export the correct certificate from the keystore to a temporary file:

cd /config/cluster-manager/mp/private

keytool -export -rfc -alias "self" -keystore keystore.jks -storepass "$(cat keystore.password)" 2>/dev/null > /tmp/cert_79.pem
Securely copy the /tmp/cert_79.pem file to the /tmp/ directory of the problematic NSX Manager nodes.
SSH into the problematic NSX Manager node(s) as root.
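Import the exported certificate into the local truststore. Run the command from the directory that contains truststore.jks and truststore.password, because both are referenced by relative path: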

keytool -import -alias "-79-cert" -file /tmp/cert_79.pem -keystore truststore.jks -storepass "$(cat truststore.password)" -noprompt 2>/dev/null
Verify that the certificate fingerprints now match across the cluster by running keytool -list against keystore.jks and truststore.jks on each node.
Retry the NSX Manager upgrade from the NSX UI. All nodes should now successfully complete the start_manager step and bring services online.
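
Two optional sanity checks around the copy and import steps can be sketched as follows. Because the export and import commands discard stderr, a failed export can leave an empty /tmp/cert_79.pem and a failed import can go unnoticed. The function names pem_has_certificate and truststore_lists_fingerprint are hypothetical and not part of NSX. The sketch assumes the expected SHA-256 fingerprint was read from the functional node's keystore.jks with keytool -list, and that keytool prints fingerprints as colon-separated hex.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the file given as $1 is non-empty and
# contains both PEM certificate markers, as written by `keytool -export -rfc`.
pem_has_certificate() {
    [ -s "$1" ] \
        && grep -q -- '-----BEGIN CERTIFICATE-----' "$1" \
        && grep -q -- '-----END CERTIFICATE-----' "$1"
}

# Hypothetical helper: succeeds (exit 0) when the expected fingerprint, passed
# as $1, appears in the `keytool -list` output supplied on stdin.
# The match is case-insensitive and treats the fingerprint as a fixed string.
truststore_lists_fingerprint() {
    expected=$1
    [ -n "$expected" ] || return 2   # refuse an empty fingerprint
    grep -qiF -- "$expected"
}

# Example usage (the fingerprint placeholder must be filled in by the operator):
#   pem_has_certificate /tmp/cert_79.pem || echo "export produced no certificate"
#   keytool -list -keystore truststore.jks -storepass "$(cat truststore.password)" 2>/dev/null \
#       | truststore_lists_fingerprint "<SHA-256 fingerprint from the functional node>" \
#       || echo "certificate not found in truststore"
```

Running the first check on the functional node before copying the file, and the second on each problematic node after the import, gives a quick confirmation before the upgrade is retried.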

Note:

To prevent this issue proactively, run the Certificate Alignment and Reporting/Remediation (CARR) script prior to initiating an upgrade. This script validates and remediates certificate synchronization issues across the cluster.

Note: The CARR script execution is integrated as a mandatory Pre-Upgrade Check (PUB) step in NSX versions 4.2.3.3, 4.2.4, and later.

Additional Information

No Data Plane (DP) impact occurs while applying the certificate import workaround.