VCF Operations for Logs UI is not available after upgrade to 9.0.2

Products

VCF Operations VMware Cloud Foundation

Issue/Introduction

One or more Aria Operations for Logs nodes appear in Unknown status on the Cluster Management page.
Running nodetool-no-pass status on an affected node shows that Cassandra is down.
"Page Not Found" or "Service Unavailable" errors occur when you access the UI.
The User Interface (UI) is inaccessible after you upgrade to VCF Operations for Logs 9.0.2. The Cassandra service fails to start on one or more nodes, resulting in a degraded cluster state.

systemctl status reports a degraded state:
```
State: degraded
```

nodetool-no-pass status shows Cassandra on one or more nodes is down (DN):

Status=Up/Down|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens  Owns (effective)  Host ID            Rack
UN  ##.##.##.##  18.32 MiB  256     100.0%            [UUID]             rack1
DN  ##.##.##.##  ?          256     100.0%            [UUID]             rack1
DN  ##.##.##.##  ?          256     100.0%            [UUID]             rack1

or

Cassandra is not running
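
As a quick way to pick the affected nodes out of the status table above, the following is a minimal sketch. The function name list_down_nodes is hypothetical, and it assumes the column layout shown in the sample output (status/state code in the first column, address in the second).

```shell
# Reads `nodetool-no-pass status` output on stdin and prints the address of
# every node whose first column marks it as Down (DN, DL, DJ, or DM).
# Header lines and nodes in an Up state are skipped.
list_down_nodes() {
    awk '$1 ~ /^D[NLJM]$/ { print $2 }'
}

# Example (run on a cluster node):
#   nodetool-no-pass status | list_down_nodes
```

An empty result means no node is reported as down in the table. It does not cover the case where the command reports that Cassandra is not running at all.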

Inventory Sync for VCF Operations for Logs through Fleet Manager fails with one of the following errors:

Error Code: LCMVRLICONFIG40100 

or 

Error Code: LCMVRLICONFIG40119

or

Error Code: LCMVRLISYSTEM45034

Operations-logs host is unreachable. Either the host name is incorrect or the virtual machine is not reachable.
Unable to connect to host. Check host details and retry.

You see an exception similar to one of the following in /storage/var/loginsight/cassandra.log:

ERROR [Messaging-EventLoop-#-#] ####-##-##T##:##,OutboundConnectionInitiator.java:### - Failed to handshake with peer /<VCFOperationsForLogs_WorkerIp>:7000(/<VCFOperationsForLogs_WorkerIp>:7000)
at io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown

or

ERROR [Messaging-EventLoop-3-3] ####-##-##T##:##:##, InboundConnectionInitiator.java:### - Failed to properly handshake with peer /##.###.##.##:39412. Closing the channel.
io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: PKIX path validation failed: java.security.cert.CertPathValidatorException: Path does not chain with any of the trust anchors

or 

INFO: [client ### @#######] raised fatal (2) certificate_unknown (46) alert: Failed to process record
org.bouncycastle.tls.TlsFatalAlert: certificate_unknown (46)
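
To check whether a node's log contains any of the signatures quoted above, a small shell sketch such as the following may help. The function name count_handshake_errors is hypothetical, and the pattern list is taken only from the sample messages in this article, so it is not exhaustive.

```shell
# Prints the number of lines in the given log file that contain one of the
# TLS failure signatures quoted above. Prints 0 when the file has no match.
# `|| true` keeps the function from failing when grep finds nothing.
count_handshake_errors() {
    grep -c -E 'SSLHandshakeException|certificate_unknown|Path does not chain with any of the trust anchors' "$1" || true
}

# Example (run on a cluster node):
#   count_handshake_errors /storage/var/loginsight/cassandra.log
```

A non-zero count on a node whose Cassandra instance is down is consistent with the certificate mismatch described in this article, but it is not proof on its own.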

When you run systemctl status loginsight, you may see the following in the output:
```
JENTROPY-ERROR: OSSL_provider_init(): 610
```
During an upgrade to VCF 9.1.0.x, after VCF Management Services are deployed successfully, the "Import VCF Operations in Fleet Lifecycle" stage fails or completes with warnings, which prevents the upgrade from progressing.

Environment

VCF Operations for Logs 9.0.2
VCF Operations 9.1.0.x

Cause

This issue occurs because the keystore and truststore on the primary node do not match those on the worker nodes, which prevents secure communication between the Cassandra instances.

Resolution

To resolve this issue, you must synchronize the certificates across the cluster nodes:

Log in to the primary node via SSH as root.

Determine if FIPS is enabled by running:

/usr/lib/loginsight/application/sbin/fips.sh --all --status

Follow the steps below that match the FIPS status of the cluster.

For FIPS Enabled Clusters

Run the following command on the primary node and on every worker node to get the keystore password:

pw=$(grep 'syslog-ssl-keystore-password' $(ls -1 /storage/core/loginsight/config/loginsight-config* | tail -n 1) | cut -d\" -f2)

Run the following commands on each node and compare the keystore and truststore listings between nodes to confirm the mismatch:

keytool -list -storetype bcfks -providerpath /usr/lib/loginsight/application/lib/lib/bc-fips-*.jar -provider org.bouncycastle.jcajce.provider.BouncyCastleFipsProvider -storepass $pw -keystore /usr/lib/loginsight/application/etc/3rd_config/keystore.bcfks

keytool -list -storetype bcfks -providerpath /usr/lib/loginsight/application/lib/lib/bc-fips-*.jar -provider org.bouncycastle.jcajce.provider.BouncyCastleFipsProvider -storepass $pw -keystore /usr/lib/loginsight/application/etc/truststore.bcfks

Copy the following certificate files from the primary node to each worker node using a file transfer utility such as WinSCP, replacing the existing files:

/usr/lib/loginsight/application/etc/3rd_config/keystore.bcfks

/usr/lib/loginsight/application/etc/truststore.bcfks

/storage/core/loginsight/cidata/cassandra/config/cacert.pem

Restart the Log Insight service on all nodes:
```
systemctl restart loginsight
```
Run nodetool-no-pass status and verify that every node shows UN in the first column.
Verify that the UI is accessible and check the cluster status at Management > Cluster.
If this issue caused upgrade failures in VCF 9.1, return to the Build > Tasks > VCF Instances page and confirm that the task completes.
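
As an optional check on the copy step above, the digests of the three files can be compared between the primary node and each worker node. The following is a minimal sketch, not part of the documented procedure: the function name print_digests and the /tmp output path are hypothetical, and it assumes sha256sum is available on the nodes.

```shell
# Prints "<sha256>  <path>" for each file given as an argument, or
# "MISSING  <path>" when a file does not exist.
print_digests() {
    for f in "$@"; do
        if [ -f "$f" ]; then
            sha256sum "$f"
        else
            echo "MISSING  $f"
        fi
    done
}

# Example (run on every node, then compare the saved files, e.g. with diff):
#   print_digests \
#       /usr/lib/loginsight/application/etc/3rd_config/keystore.bcfks \
#       /usr/lib/loginsight/application/etc/truststore.bcfks \
#       /storage/core/loginsight/cidata/cassandra/config/cacert.pem \
#       > /tmp/cert-digests.txt   # hypothetical output path
```

After a successful copy, the digest lines on each worker node should match those on the primary node. A differing or MISSING line points to the file that still needs to be replaced.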

For Non-FIPS Enabled Clusters

Run the following command on the primary node and on every worker node to get the keystore password:

pw=$(grep 'syslog-ssl-keystore-password' $(ls -1 /storage/core/loginsight/config/loginsight-config* | tail -n 1) | cut -d\" -f2)

Run the following commands on each node and compare the keystore and truststore listings between nodes to confirm the mismatch:

keytool -list -storepass $pw -keystore /usr/lib/loginsight/application/etc/3rd_config/keystore

keytool -list -storepass $pw -keystore /usr/lib/loginsight/application/etc/truststore

Copy the following certificate files from the primary node to each worker node using a file transfer utility such as WinSCP, replacing the existing files:

/usr/lib/loginsight/application/etc/3rd_config/keystore

/usr/lib/loginsight/application/etc/truststore

/storage/core/loginsight/cidata/cassandra/config/cacert.pem

Restart the Log Insight service on all nodes:
```
systemctl restart loginsight
```
Run nodetool-no-pass status and verify that every node shows UN in the first column.
Verify that the UI is accessible and check the cluster status at Management > Cluster.
If this issue caused upgrade failures in VCF 9.1, return to the Build > Tasks > VCF Instances page and confirm that the task completes.
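
If a graphical utility such as WinSCP is not at hand, the copy step above could also be done with scp from the primary node. The following is a sketch under assumptions: the function name print_scp_commands is hypothetical, it assumes root SSH access from the primary node to the worker nodes is permitted, and it only prints the commands so they can be reviewed before being run. It uses the non-FIPS file names; on a FIPS-enabled cluster the keystore and truststore files carry the .bcfks extension instead.

```shell
# Prints (does not run) one scp command per certificate file for the given
# worker node. Each file is copied to the same path on the worker, and
# -p preserves modification times and file modes.
print_scp_commands() {
    worker=$1
    for f in \
        /usr/lib/loginsight/application/etc/3rd_config/keystore \
        /usr/lib/loginsight/application/etc/truststore \
        /storage/core/loginsight/cidata/cassandra/config/cacert.pem
    do
        printf 'scp -p %s root@%s:%s\n' "$f" "$worker" "$f"
    done
}

# Example (run on the primary node; the host name is a placeholder):
#   print_scp_commands worker-1.example.com
```

Printing the commands rather than executing them keeps the sketch free of side effects. Run the printed lines manually once the target host and paths look correct.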

Additional Information

Replace a corrupted truststore in VCF/Aria Operations for Logs

"Import VCF Operations in Fleet Lifecycle" stage of "Deploy VCF Management Components" workflow fails or completes with warnings