Symptoms:
The NSX Manager is unable to connect to the datastore, which produces log entries similar to the following:
2022-07-14T18:05:06.429Z WARN pool-32-thread-1 DataStoreDisconnectHandler 26600 - [nsx@6876 comp="global-manager" level="WARNING" subcomp="global-manager"] Disconnected from the database, restarting the service
2022-07-14T18:05:06.429Z INFO pool-32-thread-1 ContainerConfigServiceImpl 26600 - [nsx@6876 comp="global-manager" level="INFO" subcomp="global-manager"] Restart application after 0 ms.
2022-07-14T18:05:06.719Z ERROR localhost-startStop-1 CorfuRuntime 26600 connect: Couldn't connect to server. java.util.concurrent.TimeoutException: null
These entries are found in /var/log/gmanager/gmanager.log.
Viewing corfu/LAYOUT_CURRENT.ds within the global manager logs by running
cat config/corfu/LAYOUT_CURRENT.ds
shows unresponsiveServers entries similar to:
"unresponsiveServers": [
    "manager-ip-address:9000"    <<<<<<
],
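The check above can be sketched in Python. This is only an illustration: it assumes that LAYOUT_CURRENT.ds holds a single JSON document, which the excerpt suggests but this article does not confirm. The function name and the sample layout are hypothetical.

```python
import json


def unresponsive_servers(layout_text):
    """Return the entries under "unresponsiveServers", or an empty list if the key is absent."""
    # Assumption: the layout file is one JSON object.
    layout = json.loads(layout_text)
    return layout.get("unresponsiveServers", [])


# Hypothetical sample; a real layout file contains additional fields.
sample = '{"unresponsiveServers": ["manager-ip-address:9000"]}'
print(unresponsive_servers(sample))
```

Against a real log bundle, the text of config/corfu/LAYOUT_CURRENT.ds would be read from disk and passed in instead of the sample string. A non-empty result matches the symptom described above.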
In the corfu.9000.log file, the following warning appears:
2022-07-14T21:19:00.213Z | WARN | worker-0 | i.n.c.DefaultChannelPipeline | An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
along with the error:
java.nio.file.FileSystemException: /config/cluster-manager/corfu/private/keystore.password: Too many open files
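A minimal Python sketch for counting these two messages in a corfu log is shown here. The marker strings are copied from the excerpts above. The function name and the sample lines are hypothetical, and the sketch is not an official diagnostic tool.

```python
# Marker strings taken from the log excerpts in this article.
MARKERS = (
    "An exceptionCaught() event was fired",
    "Too many open files",
)


def count_markers(lines, markers=MARKERS):
    """Count how many lines contain each marker string."""
    counts = {marker: 0 for marker in markers}
    for line in lines:
        for marker in markers:
            if marker in line:
                counts[marker] += 1
    return counts


# Hypothetical sample lines modeled on the excerpts above.
sample_lines = [
    "2022-07-14T21:19:00.213Z | WARN | worker-0 | i.n.c.DefaultChannelPipeline | "
    "An exceptionCaught() event was fired, and it reached at the tail of the pipeline.",
    "java.nio.file.FileSystemException: "
    "/config/cluster-manager/corfu/private/keystore.password: Too many open files",
]
print(count_markers(sample_lines))
```

To scan a real file, an open file handle for corfu.9000.log can be passed as the lines argument, since iterating over a text file yields one line at a time.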
Another way to verify that this issue is occurring is to look for
Corfu Compactor Out of Memory Error for AlarmMsg
nsx_global_manager_########-####-####-####-########7a60_20220714_222106/var/log/corfu$ less corfu-compactor-audit.log.gz
which looks like this:
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="gzip -f /image/core/compactor_oom.hprof"
# Executing /bin/sh -c "gzip -f /image/core/compactor_oom.hprof"...
Aborting due to java.lang.OutOfMemoryError: Java heap space
#
# A fatal error has been detected by the Java Runtime Environment:
#
# INVALID (0xe0000000) at pc=0x0000000000000000, pid=14350, tid=0x000075b304a11700
# fatal error: OutOfMemory encountered: Java heap space
#
#
# JRE version: OpenJDK Runtime Environment (Zulu 8.55.0.14-SA-linux64) (8.0_301-b02) (build 1.8.0_301-b02)
# Java VM: OpenJDK 64-Bit Server VM (25.301-b02 mixed mode linux-amd64 compressed oops)
# Core dump written. Default location: //core or core.14350
This causes the compactor to fail and prevents the NSX upgrade from continuing.
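The compactor out-of-memory check can also be scripted. The Python sketch below reads a gzip-compressed log and reports whether it contains the heap-space error string shown above. The function name is hypothetical, and the sketch assumes the audit log is a gzip-compressed text file, as its .gz extension suggests.

```python
import gzip

# Error string copied from the compactor output shown in this article.
OOM_MARKER = "java.lang.OutOfMemoryError: Java heap space"


def has_heap_oom(path):
    """Return True if any line of the gzip-compressed log contains the OOM marker."""
    # errors="replace" keeps the scan going if the log contains undecodable bytes.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return any(OOM_MARKER in line for line in handle)
```

For example, has_heap_oom("corfu-compactor-audit.log.gz") could be run from the var/log/corfu directory of the extracted log bundle.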