replace_certs.py (v1.7 or below) fails with the following errors:
Keystore is not updated post replacement of CBM_MP cert on node ##.##.##.##. [Expected thumbprint : ###################, Keystore thumbprint : ###################]. Retrying in 5 secs
[...]
Keystore is not updated post replacement of CBM_MP cert on node ##.##.##.##. [Expected thumbprint : ###################, Keystore thumbprint : ###################]. Exiting after retrying for 15 minutes
Multiple Corfu layout files are present in /config/corfu/ (or in system/ls_-althR_config from the support bundle):
# ls -1 /config/corfu/LAYOUT*
/config/corfu/LAYOUTS_0.ds
/config/corfu/LAYOUTS_1.ds
/config/corfu/LAYOUTS_2.ds
/config/corfu/LAYOUTS_3.ds
/config/corfu/LAYOUTS_4.ds
/config/corfu/LAYOUT_CURRENT.ds
In /var/log/proton/nsxapi.log, we see logs confirming that the certificate in question was applied successfully:
INFO http-nio-127.0.0.1-7440-exec-32 TrustStoreServiceImpl 2546967 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" reqId="########-####-####-####-############" subcomp="manager" username="admin"] Apply certificate ########-####-####-####-############ for service-type CBM_###### and nodeId ########-####-####-####-############
INFO http-nio-127.0.0.1-7440-exec-38 TrustStoreServiceImpl 2546967 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" reqId="########-####-####-####-############" subcomp="manager" username="admin"] Apply certificate ########-####-####-####-############ for service-type CBM_###### and nodeId ########-####-####-####-############
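The nsxapi.log check above can be sketched as a small parser. This is a minimal illustration, not part of replace_certs.py or any NSX tooling: the function name `applied_certificates`, the regular expression, and the placeholder IDs in the sample line are assumptions made for the example. Only the "Apply certificate ... for service-type ... and nodeId ..." message shape comes from the log excerpt above.

```python
import re

# Message shape taken from the nsxapi.log excerpt; the field names are ours.
APPLY_CERT = re.compile(
    r"Apply certificate (?P<cert_id>\S+) "
    r"for service-type (?P<service_type>\S+) "
    r"and nodeId (?P<node_id>\S+)"
)


def applied_certificates(lines, service_prefix="CBM_"):
    """Return one dict per 'Apply certificate' entry whose service type
    starts with service_prefix (hypothetical helper)."""
    hits = []
    for line in lines:
        match = APPLY_CERT.search(line)
        if match and match.group("service_type").startswith(service_prefix):
            hits.append(match.groupdict())
    return hits


# Placeholder IDs: the real values are masked in the excerpt above.
sample_lines = [
    'INFO http-nio-127.0.0.1-7440-exec-32 TrustStoreServiceImpl 2546967 SYSTEM '
    '[nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager" username="admin"] '
    'Apply certificate cert-id-placeholder for service-type CBM_MP '
    'and nodeId node-id-placeholder',
    'INFO some unrelated log line',
]

for entry in applied_certificates(sample_lines):
    print(entry["service_type"], entry["cert_id"], entry["node_id"])
```

In practice the lines would be read from /var/log/proton/nsxapi.log. An empty result for the CBM service type would suggest the certificate was never applied at the API layer, which is a different problem from the one described here.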
The Tomcat wrapper logs show that the JVMs of several services were killed with SIGKILL:
# grep "JVM received a signal SIGKILL" /var/log/*/*tomcat-wrapper.log
/var/log/upgrade-coordinator/upgrade-coordinator-tomcat-wrapper.log:54:STATUS | wrapper | 2023/11/07 15:50:44 | JVM received a signal SIGKILL (9).
/var/log/upgrade-coordinator/upgrade-coordinator-tomcat-wrapper.log:72:STATUS | wrapper | 2023/11/14 19:19:43 | JVM received a signal SIGKILL (9).
/var/log/proton/proton-tomcat-wrapper.log.1:27805:STATUS | wrapper | 2023/11/14 19:19:05 | JVM received a signal SIGKILL (9).
/var/log/proton/proton-tomcat-wrapper.log.1:41989:STATUS | wrapper | 2024/05/12 02:54:36 | JVM received a signal SIGKILL (9).
/var/log/proton/proton-tomcat-wrapper.log.1:42643:STATUS | wrapper | 2024/05/12 02:56:49 | JVM received a signal SIGKILL (9).
/var/log/cm-inventory/cm-inventory-tomcat-wrapper.log:32:STATUS | wrapper | 2023/11/14 19:19:43 | JVM received a signal SIGKILL (9).
/var/log/cm-inventory/cm-inventory-tomcat-wrapper.log:81:STATUS | wrapper | 2024/05/12 03:01:22 | JVM received a signal SIGKILL (9).
/var/log/cm-inventory/cm-inventory-tomcat-wrapper.log:97:STATUS | wrapper | 2024/05/12 03:04:35 | JVM received a signal SIGKILL (9).
/var/log/proxy/proxy-tomcat-wrapper.log:981:STATUS | wrapper | 2023/04/11 20:08:14 | JVM received a signal SIGKILL (9).
/var/log/proxy/proxy-tomcat-wrapper.log:2037:STATUS | wrapper | 2023/04/11 20:26:10 | JVM received a signal SIGKILL (9).
In /var/log/cbm/tanuki.log, we see that CBM was not restarted like the other services:
# grep ' Launching a JVM' /var/log/cbm/tanuki.log
STATUS | wrapper | 2022/05/05 16:24:27 | Launching a JVM...
STATUS | wrapper | 2023/04/11 20:24:34 | Launching a JVM...
STATUS | wrapper | 2023/11/14 19:31:33 | Launching a JVM...
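The comparison between the two greps above (other JVMs killed more recently than the last CBM launch) can be sketched in code. This is a heuristic illustration only: `latest_event` and `cbm_missed_restart` are hypothetical helper names, and the rule "last CBM launch is older than the last SIGKILL elsewhere" is an assumption based on the symptoms described here, not an official diagnostic.

```python
import re
from datetime import datetime

# Matches wrapper log lines such as:
#   STATUS | wrapper | 2023/11/14 19:31:33 | Launching a JVM...
# It also works on grep output, where a "file:line:" prefix precedes STATUS.
WRAPPER_LINE = re.compile(
    r"STATUS\s*\|\s*wrapper\s*\|\s*"
    r"(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s*\|\s*(.*)"
)


def latest_event(lines, needle):
    """Return the newest timestamp among wrapper lines whose message
    contains needle, or None if there is no such line."""
    stamps = []
    for line in lines:
        match = WRAPPER_LINE.search(line)
        if match and needle in match.group(2):
            stamps.append(datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S"))
    return max(stamps, default=None)


def cbm_missed_restart(other_wrapper_lines, cbm_tanuki_lines):
    """True if other JVMs were SIGKILLed after the last CBM JVM launch."""
    last_kill = latest_event(other_wrapper_lines, "JVM received a signal SIGKILL")
    last_cbm_launch = latest_event(cbm_tanuki_lines, "Launching a JVM")
    if last_kill is None:
        return False
    return last_cbm_launch is None or last_cbm_launch < last_kill


# A few lines copied from the outputs above.
other_lines = [
    "/var/log/proton/proton-tomcat-wrapper.log.1:42643:STATUS | wrapper | "
    "2024/05/12 02:56:49 | JVM received a signal SIGKILL (9).",
    "/var/log/cm-inventory/cm-inventory-tomcat-wrapper.log:97:STATUS | wrapper | "
    "2024/05/12 03:04:35 | JVM received a signal SIGKILL (9).",
]
cbm_lines = [
    "STATUS | wrapper | 2023/04/11 20:24:34 | Launching a JVM...",
    "STATUS | wrapper | 2023/11/14 19:31:33 | Launching a JVM...",
]

print(cbm_missed_restart(other_lines, cbm_lines))
```

With the sample lines, the last SIGKILL elsewhere (2024/05/12) is newer than the last CBM launch (2023/11/14), which matches the pattern described in this article.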
In /var/log/cbm/cbm.log, we see a Corfu shutdown exception:
Caused by: com.vmware.nsx.platform.clustering.persistence.exceptions.CorfuShutdownException: Disconnected from database. Terminating thread.
at com.vmware.nsx.cbm.factory.CorfuSystemDownHandler.run(CorfuSystemDownHandler.java:12) ~[libcbm.jar:?]
at org.corfudb.runtime.view.AbstractView.layoutHelper(AbstractView.java:176) ~[runtime-4.1.20230509211150.7973.1.jar:?]
at org.corfudb.runtime.view.AbstractView.layoutHelper(AbstractView.java:61) ~[runtime-4.1.20230509211150.7973.1.jar:?]
at org.corfudb.runtime.view.SequencerView.lambda$query$4(SequencerView.java:59) ~[runtime-4.1.20230509211150.7973.1.jar:?]
at io.micrometer.core.instrument.composite.CompositeTimer.record(CompositeTimer.java:57) ~[libmicrometer.jar:?]
at org.corfudb.common.metrics.micrometer.MicroMeterUtils.lambda$time$6(MicroMeterUtils.java:121) ~[corfudb-common-4.1.20230509211150.7973.1.jar:?]
at java.util.Optional.map(Optional.java:215) ~[?:1.8.0_362]
VMware NSX
Multiple cluster reconfigurations caused Corfu connection issues that invoked systemDownHandler, after which multiple JVMs were restarted but CBM was not. As a result, CBM missed a DCN, which caused the certificate replacement to fail.
A fix for this issue will be included in a future NSX release.
Workaround: restart the cluster boot manager (CBM) service:
/etc/init.d/nsx-cluster-boot-manager restart