Certificate replacement script throws error while executing: Keystore is not updated post replacement of 'CBM_X' cert on node 'IP'


Article ID: 375422


Updated On:

Products

VMware NSX

Issue/Introduction

  • Running the certificate replacement script replace_certs.py (v1.7 or below) fails with the following errors:
    Keystore is not updated post replacement of CBM_MP cert on node ##.##.##.##. [Expected thumbprint : ###################, Keystore thumbprint : ###################]. Retrying in 5 secs
    [...]
    Keystore is not updated post replacement of CBM_MP cert on node ##.##.##.##. [Expected thumbprint : ###################, Keystore thumbprint : ###################]. Exiting after retrying for 15 minutes
  • Multiple entries for clustering reconfiguration are found under /config/corfu/ (or in system/ls_-althR_config from the support bundle):
    # ls -1 /config/corfu/LAYOUT*
    /config/corfu/LAYOUTS_0.ds
    /config/corfu/LAYOUTS_1.ds
    /config/corfu/LAYOUTS_2.ds
    /config/corfu/LAYOUTS_3.ds
    /config/corfu/LAYOUTS_4.ds
    /config/corfu/LAYOUT_CURRENT.ds
  • The log /var/log/proton/nsxapi.log contains entries confirming that the certificate in question was applied successfully:
    INFO http-nio-127.0.0.1-7440-exec-32 TrustStoreServiceImpl 2546967 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" reqId="########-####-####-####-############" subcomp="manager" username="admin"] Apply certificate ########-####-####-####-############ for service-type CBM_###### and nodeId ########-####-####-####-############
    INFO http-nio-127.0.0.1-7440-exec-38 TrustStoreServiceImpl 2546967 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" reqId="########-####-####-####-############" subcomp="manager" username="admin"] Apply certificate ########-####-####-####-############ for service-type CBM_###### and nodeId ########-####-####-####-############
  • Around the same time as the entries above, JVM SIGKILL events are seen for multiple services:
    # grep "JVM received a signal SIGKILL" /var/log/*/*tomcat-wrapper.log
    /var/log/upgrade-coordinator/upgrade-coordinator-tomcat-wrapper.log:54:STATUS | wrapper  | 2023/11/07 15:50:44 | JVM received a signal SIGKILL (9).
    /var/log/upgrade-coordinator/upgrade-coordinator-tomcat-wrapper.log:72:STATUS | wrapper  | 2023/11/14 19:19:43 | JVM received a signal SIGKILL (9).
    /var/log/proton/proton-tomcat-wrapper.log.1:27805:STATUS | wrapper  | 2023/11/14 19:19:05 | JVM received a signal SIGKILL (9).
    /var/log/proton/proton-tomcat-wrapper.log.1:41989:STATUS | wrapper  | 2024/05/12 02:54:36 | JVM received a signal SIGKILL (9).
    /var/log/proton/proton-tomcat-wrapper.log.1:42643:STATUS | wrapper  | 2024/05/12 02:56:49 | JVM received a signal SIGKILL (9).
    /var/log/cm-inventory/cm-inventory-tomcat-wrapper.log:32:STATUS | wrapper  | 2023/11/14 19:19:43 | JVM received a signal SIGKILL (9).
    /var/log/cm-inventory/cm-inventory-tomcat-wrapper.log:81:STATUS | wrapper  | 2024/05/12 03:01:22 | JVM received a signal SIGKILL (9).
    /var/log/cm-inventory/cm-inventory-tomcat-wrapper.log:97:STATUS | wrapper  | 2024/05/12 03:04:35 | JVM received a signal SIGKILL (9).
    /var/log/proxy/proxy-tomcat-wrapper.log:981:STATUS | wrapper  | 2023/04/11 20:08:14 | JVM received a signal SIGKILL (9).
    /var/log/proxy/proxy-tomcat-wrapper.log:2037:STATUS | wrapper  | 2023/04/11 20:26:10 | JVM received a signal SIGKILL (9).
  • The log /var/log/cbm/tanuki.log shows that CBM was not restarted along with the other services:
    # grep ' Launching a JVM' /var/log/cbm/tanuki.log
    STATUS | wrapper  | 2022/05/05 16:24:27 | Launching a JVM...
    STATUS | wrapper  | 2023/04/11 20:24:34 | Launching a JVM...
    STATUS | wrapper  | 2023/11/14 19:31:33 | Launching a JVM...
  • The log /var/log/cbm/cbm.log also shows a Corfu shutdown exception:
    Caused by: com.vmware.nsx.platform.clustering.persistence.exceptions.CorfuShutdownException: Disconnected from database. Terminating thread.
    at com.vmware.nsx.cbm.factory.CorfuSystemDownHandler.run(CorfuSystemDownHandler.java:12) ~[libcbm.jar:?]
    at org.corfudb.runtime.view.AbstractView.layoutHelper(AbstractView.java:176) ~[runtime-4.1.20230509211150.7973.1.jar:?]
    at org.corfudb.runtime.view.AbstractView.layoutHelper(AbstractView.java:61) ~[runtime-4.1.20230509211150.7973.1.jar:?]
    at org.corfudb.runtime.view.SequencerView.lambda$query$4(SequencerView.java:59) ~[runtime-4.1.20230509211150.7973.1.jar:?]
    at io.micrometer.core.instrument.composite.CompositeTimer.record(CompositeTimer.java:57) ~[libmicrometer.jar:?]
    at org.corfudb.common.metrics.micrometer.MicroMeterUtils.lambda$time$6(MicroMeterUtils.java:121) ~[corfudb-common-4.1.20230509211150.7973.1.jar:?]
    at java.util.Optional.map(Optional.java:215) ~[?:1.8.0_362]
Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
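
The checks above can be combined into a single quick pass on an affected Manager node. The following is a minimal sketch built only from the paths and patterns shown above; run it as root, and note that the grep pattern for the Corfu exception is an assumption based on the excerpt in this article:

    # ls -1 /config/corfu/LAYOUT*
    # grep "JVM received a signal SIGKILL" /var/log/*/*tomcat-wrapper.log
    # grep ' Launching a JVM' /var/log/cbm/tanuki.log
    # grep 'CorfuShutdownException' /var/log/cbm/cbm.log

Multiple LAYOUTS_*.ds files, SIGKILL events for several services around the time of the certificate replacement, and no matching 'Launching a JVM' entry for CBM at that time together match the symptom described in this article.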

Environment

VMware NSX

Cause

Multiple cluster reconfigurations resulted in Corfu connection issues that invoked the systemDownHandler. Multiple JVMs were subsequently restarted, but the CBM (Cluster Boot Manager) process was not. As a result, CBM missed the DCN for the certificate update, causing the certificate replacement to fail.

Resolution

The fix for this issue will be added in future NSX releases.

Workaround:

  1. Manually restart the CBM process on all nodes, one at a time, as root: /etc/init.d/nsx-cluster-boot-manager restart (see the example sequence after this list).
  2. Rerun the certificate replacement script.
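
For example, the following sequence could be run on each node in turn, completing the restart on one node before moving to the next. This is a minimal sketch built only from the commands already referenced in this article; output and timestamps will differ in your environment:

    # /etc/init.d/nsx-cluster-boot-manager restart
    # grep ' Launching a JVM' /var/log/cbm/tanuki.log | tail -1

A 'Launching a JVM' entry with a current timestamp confirms that CBM actually restarted on that node. Once every node shows a fresh launch, rerun replace_certs.py.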