replace_certs.py (v1.7 or earlier) fails with the following errors:
Keystore is not updated post replacement of CBM_MP cert on node ##.##.##.##. [Expected thumbprint : ###################, Keystore thumbprint : ###################]. Retrying in 5 secs
[...]
Keystore is not updated post replacement of CBM_MP cert on node ##.##.##.##. [Expected thumbprint : ###################, Keystore thumbprint : ###################]. Exiting after retrying for 15 minutes
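When the script output has been saved to a file, the failure lines above can be picked out programmatically. The following is a minimal sketch, not part of replace_certs.py: the regular expression is modeled only on the two messages quoted above, and the names KEYSTORE_ERROR, parse_keystore_errors and gave_up are hypothetical. The wording of the message may differ between script versions.

```python
import re

# Pattern modeled on the error lines quoted in this article (assumption:
# other script versions may word the message differently).
KEYSTORE_ERROR = re.compile(
    r"Keystore is not updated post replacement of (?P<service>\S+) cert "
    r"on node (?P<node>\S+)\. "
    r"\[Expected thumbprint : (?P<expected>[^,]+), "
    r"Keystore thumbprint : (?P<actual>[^\]]+)\]\. "
    r"(?P<outcome>Retrying in \d+ secs|Exiting after retrying for \d+ minutes)"
)


def parse_keystore_errors(output):
    """Return one dict per 'Keystore is not updated' message found in output."""
    return [m.groupdict() for m in KEYSTORE_ERROR.finditer(output)]


def gave_up(output):
    """True if the script stopped retrying (the 'Exiting after retrying' message)."""
    return any(e["outcome"].startswith("Exiting") for e in parse_keystore_errors(output))
```

If gave_up() returns True for the captured output and the expected and keystore thumbprints differ, the output matches the symptom described here; the remaining checks in this article are still needed to confirm the cause.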
Listing /config/corfu/ (or checking system/ls_-althR_config from the support bundle) shows multiple layout files:
# ls -1 /config/corfu/LAYOUT*
/config/corfu/LAYOUTS_0.ds
/config/corfu/LAYOUTS_1.ds
/config/corfu/LAYOUTS_2.ds
/config/corfu/LAYOUTS_3.ds
/config/corfu/LAYOUTS_4.ds
/config/corfu/LAYOUT_CURRENT.ds
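For a support bundle, the same check can be done against the saved directory listing. The sketch below is illustrative only: layout_suffixes is a hypothetical helper, and it simply collects the numeric suffixes of the LAYOUTS_<n>.ds names. This article does not document a threshold; several such files are read here as consistent with the repeated cluster reconfiguration named in the cause.

```python
import re

# Matches LAYOUTS_<n>.ds but not LAYOUT_CURRENT.ds.
LAYOUT_FILE = re.compile(r"LAYOUTS_(\d+)\.ds\b")


def layout_suffixes(listing):
    """Return the sorted numeric suffixes of LAYOUTS_<n>.ds entries in a listing."""
    return sorted({int(n) for n in LAYOUT_FILE.findall(listing)})
```

The listing argument can be the output of the ls command above or the text of system/ls_-althR_config from the bundle.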
In /var/log/proton/nsxapi.log, log entries confirm that the certificate in question was applied successfully:
INFO http-nio-127.0.0.1-7440-exec-32 TrustStoreServiceImpl 2546967 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" reqId="########-####-####-####-############" subcomp="manager" username="admin"] Apply certificate ########-####-####-####-############ for service-type CBM_###### and nodeId ########-####-####-####-############
INFO http-nio-127.0.0.1-7440-exec-38 TrustStoreServiceImpl 2546967 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" reqId="########-####-####-####-############" subcomp="manager" username="admin"] Apply certificate ########-####-####-####-############ for service-type CBM_###### and nodeId ########-####-####-####-############
In the tomcat wrapper logs, multiple JVMs received SIGKILL:
# grep "JVM received a signal SIGKILL" /var/log/*/*tomcat-wrapper.log
/var/log/upgrade-coordinator/upgrade-coordinator-tomcat-wrapper.log:54:STATUS | wrapper | 2023/11/07 15:50:44 | JVM received a signal SIGKILL (9).
/var/log/upgrade-coordinator/upgrade-coordinator-tomcat-wrapper.log:72:STATUS | wrapper | 2023/11/14 19:19:43 | JVM received a signal SIGKILL (9).
/var/log/proton/proton-tomcat-wrapper.log.1:27805:STATUS | wrapper | 2023/11/14 19:19:05 | JVM received a signal SIGKILL (9).
/var/log/proton/proton-tomcat-wrapper.log.1:41989:STATUS | wrapper | 2024/05/12 02:54:36 | JVM received a signal SIGKILL (9).
/var/log/proton/proton-tomcat-wrapper.log.1:42643:STATUS | wrapper | 2024/05/12 02:56:49 | JVM received a signal SIGKILL (9).
/var/log/cm-inventory/cm-inventory-tomcat-wrapper.log:32:STATUS | wrapper | 2023/11/14 19:19:43 | JVM received a signal SIGKILL (9).
/var/log/cm-inventory/cm-inventory-tomcat-wrapper.log:81:STATUS | wrapper | 2024/05/12 03:01:22 | JVM received a signal SIGKILL (9).
/var/log/cm-inventory/cm-inventory-tomcat-wrapper.log:97:STATUS | wrapper | 2024/05/12 03:04:35 | JVM received a signal SIGKILL (9).
/var/log/proxy/proxy-tomcat-wrapper.log:981:STATUS | wrapper | 2023/04/11 20:08:14 | JVM received a signal SIGKILL (9).
/var/log/proxy/proxy-tomcat-wrapper.log:2037:STATUS | wrapper | 2023/04/11 20:26:10 | JVM received a signal SIGKILL (9).
In /var/log/cbm/tanuki.log, CBM was not restarted when the other services were:
# grep ' Launching a JVM' /var/log/cbm/tanuki.log
STATUS | wrapper | 2022/05/05 16:24:27 | Launching a JVM...
STATUS | wrapper | 2023/04/11 20:24:34 | Launching a JVM...
STATUS | wrapper | 2023/11/14 19:31:33 | Launching a JVM...
In /var/log/cbm/cbm.log, a Corfu shutdown exception is logged:
Caused by: com.vmware.nsx.platform.clustering.persistence.exceptions.CorfuShutdownException: Disconnected from database. Terminating thread.
at com.vmware.nsx.cbm.factory.CorfuSystemDownHandler.run(CorfuSystemDownHandler.java:12) ~[libcbm.jar:?]
at org.corfudb.runtime.view.AbstractView.layoutHelper(AbstractView.java:176) ~[runtime-4.1.20230509211150.7973.1.jar:?]
at org.corfudb.runtime.view.AbstractView.layoutHelper(AbstractView.java:61) ~[runtime-4.1.20230509211150.7973.1.jar:?]
at org.corfudb.runtime.view.SequencerView.lambda$query$4(SequencerView.java:59) ~[runtime-4.1.20230509211150.7973.1.jar:?]
at io.micrometer.core.instrument.composite.CompositeTimer.record(CompositeTimer.java:57) ~[libmicrometer.jar:?]
at org.corfudb.common.metrics.micrometer.MicroMeterUtils.lambda$time$6(MicroMeterUtils.java:121) ~[corfudb-common-4.1.20230509211150.7973.1.jar:?]
at java.util.Optional.map(Optional.java:215) ~[?:1.8.0_362]
VMware NSX
Multiple cluster reconfigurations caused Corfu connection issues, which invoked systemDownHandler. Several JVMs were then restarted, but the CBM JVM was not. As a result, CBM missed the DCN, which caused the certificate replacement to fail.
A fix for this issue will be included in a future NSX release.
Workaround:
/etc/init.d/nsx-cluster-boot-manager restart
Note: The CARR script can be used to resolve this issue. See "Using Certificate Analyzer Resolver (CARR) Script to fix certificate related issues in NSX".