Unable to continue with NSX Manager upgrade related to Corfu compaction failure due to Alarm table



Article ID: 336797


Updated On:

Products

VMware NSX

Issue/Introduction

Symptoms:
The NSX Manager is unable to connect to the datastore and logs messages similar to the following:

2022-07-14T18:05:06.429Z WARN pool-32-thread-1 DataStoreDisconnectHandler 26600 - [nsx@6876 comp="global-manager" level="WARNING" subcomp="global-manager"] Disconnected from the database, restarting the service
2022-07-14T18:05:06.429Z INFO pool-32-thread-1 ContainerConfigServiceImpl 26600 - [nsx@6876 comp="global-manager" level="INFO" subcomp="global-manager"] Restart application after 0 ms.
2022-07-14T18:05:06.719Z ERROR localhost-startStop-1 CorfuRuntime 26600 connect: Couldn't connect to server.
java.util.concurrent.TimeoutException: null

This log is found in /var/log/gmanager/gmanager.log.

Inspect corfu/LAYOUT_CURRENT.ds in the Global Manager logs by running:

cat config/corfu/LAYOUT_CURRENT.ds

The output lists unresponsiveServers entries similar to:

    "unresponsiveServers": [
      "manager-ip-address:9000" <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    ],
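The layout check above can also be scripted. The following is a minimal sketch, assuming the layout file is valid JSON (the excerpt above suggests this, but the full file format is not shown in this article). The function name `find_unresponsive_servers` is hypothetical, and the search is recursive because the position of the `unresponsiveServers` key within the file is not confirmed here.

```python
import json


def find_unresponsive_servers(layout_text):
    """Return every entry found under any "unresponsiveServers" key.

    Assumption: layout_text is JSON. The key is searched for at any
    depth because its exact location in the layout is not confirmed.
    """
    found = []

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "unresponsiveServers" and isinstance(value, list):
                    found.extend(value)
                else:
                    walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(json.loads(layout_text))
    return found


if __name__ == "__main__":
    # Illustrative input only; a real layout file contains more fields.
    sample = '{"unresponsiveServers": ["manager-ip-address:9000"]}'
    for server in find_unresponsive_servers(sample):
        print("unresponsive:", server)
```

An empty result means no server is marked unresponsive in that layout; any entry returned corresponds to the line highlighted in the excerpt above.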

The corfu.9000.log file shows:

2022-07-14T21:19:00.213Z | WARN | worker-0 | i.n.c.DefaultChannelPipeline | An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.

It also shows the following error:

java.nio.file.FileSystemException: /config/cluster-manager/corfu/private/keystore.password: Too many open files
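To check a large corfu.9000.log for the two messages quoted above, a small scan can help. This is a sketch only: the signature strings are copied from the excerpts in this article, while the function name `match_signatures` and the dictionary keys are hypothetical labels chosen for illustration.

```python
# Substrings taken from the log excerpts in this article.
SIGNATURES = {
    "too_many_open_files": "Too many open files",
    "unhandled_pipeline_exception": "An exceptionCaught() event was fired",
}


def match_signatures(lines):
    """Map each signature label to the 1-based line numbers where it appears."""
    hits = {name: [] for name in SIGNATURES}
    for number, line in enumerate(lines, start=1):
        for name, needle in SIGNATURES.items():
            if needle in line:
                hits[name].append(number)
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python scan_corfu_log.py /path/to/corfu.9000.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for name, numbers in match_signatures(handle).items():
            print(name, "matches:", len(numbers))
```

Matches for both labels in the same log are consistent with the symptoms described here, but they do not confirm the root cause on their own.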

Another way to verify this issue is to look for a Corfu compactor out-of-memory error for AlarmMsg in corfu-compactor-audit.log.gz within the support bundle:

nsx_global_manager_########-####-####-####-########7a60_20220714_222106/var/log/corfu$ less corfu-compactor-audit.log.gz

The error looks like this:

# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="gzip -f /image/core/compactor_oom.hprof"
#   Executing /bin/sh -c "gzip -f /image/core/compactor_oom.hprof"...
Aborting due to java.lang.OutOfMemoryError: Java heap space
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  INVALID (0xe0000000) at pc=0x0000000000000000, pid=14350, tid=0x000075b304a11700
#  fatal error: OutOfMemory encountered: Java heap space
#
#
# JRE version: OpenJDK Runtime Environment (Zulu 8.55.0.14-SA-linux64) (8.0_301-b02) (build 1.8.0_301-b02)
# Java VM: OpenJDK 64-Bit Server VM (25.301-b02 mixed mode linux-amd64 compressed oops)
# Core dump written. Default location: //core or core.14350

This causes the compactor to fail and prevents the NSX upgrade from continuing.
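The compactor audit log is gzip-compressed, so a plain text search will not find the error. The sketch below reads the archive directly and reports whether the heap-space error shown above is present. The marker string is taken from the excerpt; the function name `compactor_hit_oom` is hypothetical, and the path passed to it depends on where the support bundle was extracted.

```python
import gzip

# Marker copied from the corfu-compactor-audit.log.gz excerpt in this article.
OOM_MARKER = "java.lang.OutOfMemoryError: Java heap space"


def compactor_hit_oom(gz_path):
    """Return True if the gzip-compressed log contains the heap-space OOM marker."""
    with gzip.open(gz_path, "rt", encoding="utf-8", errors="replace") as handle:
        return any(OOM_MARKER in line for line in handle)


if __name__ == "__main__":
    import sys

    # Usage: python check_compactor_oom.py var/log/corfu/corfu-compactor-audit.log.gz
    print("OOM found" if compactor_hit_oom(sys.argv[1]) else "OOM not found")
```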

Environment

VMware NSX-T Data Center
VMware NSX-T Data Center 3.x

Cause

In NSX-T 3.2.0, GPRR objects do not have a realized object ID, which leads to this condition.

Resolution

This issue is fixed in version 3.2.2.

Workaround:
If you see these symptoms, contact GSS for assistance with confirming the issue and applying the necessary workaround. Reference this article when opening the request.

Additional Information

Impact/Risks:
Unable to continue with NSX Manager upgrade