NSX Manager cluster is intermittently degraded

Article ID: 320300

Updated On:

Products

VMware NSX

Issue/Introduction

When using NSX, you may see the following error in vCenter:

  • NSX cluster is degraded

You may also observe nfnic driver aborts in the host vmkernel log:

/var/run/log/vmkernel.log

[TIMESTAMP] WARNING: nfnic: <#>: fnic_abort_cmd: 3889: Abort for cmd tag: 0x83 in pending state
[TIMESTAMP] WARNING: nfnic: <#>: fnic_abort_cmd: 4027: Tag: 0x83 ABTS_STATUS: 0x47, FLAGS: 0x273, STATE: 0x3
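
To gauge how often these aborts occur, the log can be scanned for the "Abort for cmd tag" signature shown above. The following is a minimal sketch, assuming a copy of vmkernel.log is available wherever Python runs; the function name count_nfnic_aborts and the regular expression are illustrative and are not part of any VMware tool.

```python
import re
from collections import Counter

# Matches the "Abort for cmd tag" warning from the excerpt above and
# captures the command tag (for example 0x83). The pattern is an assumption
# based only on the two sample lines in this article.
ABORT_RE = re.compile(
    r"nfnic: .*fnic_abort_cmd: \d+: Abort for cmd tag: (0x[0-9a-fA-F]+)"
)


def count_nfnic_aborts(lines):
    """Return a Counter mapping command tag -> number of abort warnings."""
    tags = Counter()
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            tags[match.group(1)] += 1
    return tags


# The two sample lines from this article; only the first is an abort request,
# the second reports the ABTS status for the same tag.
sample = [
    "[TIMESTAMP] WARNING: nfnic: <#>: fnic_abort_cmd: 3889: "
    "Abort for cmd tag: 0x83 in pending state",
    "[TIMESTAMP] WARNING: nfnic: <#>: fnic_abort_cmd: 4027: "
    "Tag: 0x83 ABTS_STATUS: 0x47, FLAGS: 0x273, STATE: 0x3",
]

aborts = count_nfnic_aborts(sample)
print(aborts)  # Counter({'0x83': 1})
```

To run it against a real log, pass an open file instead of the sample list, for example `count_nfnic_aborts(open("vmkernel.log", errors="replace"))`.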

You may also observe the following on the NSX Manager:

  • The NSX UI is not accessible while the issue is occurring.
  • The Corfu compactor fails to run:

/var/log/corfu/corfu-compactor-audit.log

[TIMESTAMP] INFO main AddressSpaceView - PrefixTrim[Token(epoch=1198, sequence=[SEQ])]
[TIMESTAMP] WARN main AbstractView - Timeout executing remote call, invalidating view and retrying in PT1Ss
[TIMESTAMP] ERROR main UfoCompactor - - [nsx@6876 comp="nsx-manager" errorCode="MP2" level="ERROR" subcomp="corfu-compactor"] UFO: Trim failed for ufo data in namespace ufo
java.lang.RuntimeException: java.util.concurrent.TimeoutException
        at org.corfudb.util.CFUtils.getUninterruptibly(CFUtils.java:70) ~[runtime-3.2.20220427021900.4538.1.jar:?]
        at org.corfudb.util.CFUtils.getUninterruptibly(CFUtils.java:104) ~[runtime-3.2.20220427021900.4538.1.jar:?]
        at java.util.ArrayList.forEach(ArrayList.java:1259) ~[?:1.8.0_322]
        at org.corfudb.util.Utils.prefixTrim(Utils.java:201) ~[runtime-3.2.20220427021900.4538.1.jar:?]
        at org.corfudb.runtime.view.AddressSpaceView.lambda$prefixTrim$16(AddressSpaceView.java:658) ~[runtime-3.2.20220427021900.4538.1.jar:?]
        at org.corfudb.runtime.view.AbstractView.layoutHelper(AbstractView.java:136) ~[runtime-3.2.20220427021900.4538.1.jar:?]
        at org.corfudb.runtime.view.AddressSpaceView.prefixTrim(AddressSpaceView.java:657) ~[runtime-3.2.20220427021900.4538.1.jar:?]
        at org.corfudb.runtime.view.AddressSpaceView.prefixTrim(AddressSpaceView.java:627) ~[runtime-3.2.20220427021900.4538.1.jar:?]
        at com.vmware.nsx.platform.ufo.UfoCompactor.trimLog(UfoCompactor.java:304) ~[libufo-tools.jar:?]
        at com.vmware.nsx.platform.ufo.UfoCompactor.trim(UfoCompactor.java:271) [libufo-tools.jar:?]
        at com.vmware.nsx.platform.ufo.UfoCompactorMain.trimAndUpdateToken(UfoCompactorMain.java:268) [libufo-tools.jar:?]
        at com.vmware.nsx.platform.ufo.UfoCompactorMain.main(UfoCompactorMain.java:136) [libufo-tools.jar:?]

...

[TIMESTAMP] INFO Runner - Failed to run compactor tool: Command 'MALLOC_TRIM_THRESHOLD_=1310720 nice -n -10 java -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:/var/log/corfu/compactor-gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M -XX:+UseStringDeduplication -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/image/core/compactor_oom.hprof -XX:OnOutOfMemoryError="gzip -f /image/core/compactor_oom.hprof" -XX:+CrashOnOutOfMemoryError -Xms1931m -Xmx1931m -Djava.io.tmpdir=/image/corfu-tools/temp -Djdk.nio.maxCachedBufferSize=1048576 -Dio.netty.recycler.maxCapacityPerThread=0 -DlogFilePrefix=/var/log/corfu/corfu-compactor-audit -Dlog4j.configurationFile=/opt/vmware/ufo-tools/corfu-compactor-log4j2.xml -Dcorfu-property-file-path=/opt/vmware/cbm/etc/ufo-factory.properties -cp "/opt/vmware/ufo-tools/*" com.vmware.nsx.platform.ufo.UfoCompactorMain -hostname [IP] -hostname [IP] -hostname [IP] -port [PORT] -trim -useDistributedLock -lockCorfuHostname [IP] -lockCorfuPort [PORT] -bulkReadSize 50' returned non-zero exit status 1.

  • cbm.log reports that the datastore could not be read:

/var/log/cbm/cbm.log

[TIMESTAMP] ERROR NotificationThread Step 2948 - [nsx@6876 comp="nsx-manager" errorCode="CBM38" level="ERROR" subcomp="cbm"] [CBM38] Unable to read from the datastore. Datastore may be non-operational
[TIMESTAMP] ERROR NotificationThread Step 2948 - [nsx@6876 comp="nsx-manager" errorCode="CBM38" level="ERROR" subcomp="cbm"] [CBM38] Unable to read from the datastore. Datastore may be non-operational
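
The NSX Manager entries above carry a structured block of the form [nsx@6876 comp=... errorCode=... level=... subcomp=...], so the error codes can be tallied per subcomponent to confirm that both signatures (MP2 from corfu-compactor and CBM38 from cbm) are present. The following is a minimal sketch, assuming the log lines follow the format of the excerpts in this article; the function name error_codes and the regular expressions are illustrative only.

```python
import re
from collections import Counter

# Structured block as it appears in the excerpts, e.g.
# [nsx@6876 comp="nsx-manager" errorCode="CBM38" level="ERROR" subcomp="cbm"]
SD_RE = re.compile(r"\[nsx@\d+ ([^\]]*)\]")
FIELD_RE = re.compile(r'(\w+)="([^"]*)"')


def error_codes(lines):
    """Count ERROR-level entries per (subcomp, errorCode) pair."""
    counts = Counter()
    for line in lines:
        block = SD_RE.search(line)
        if not block:
            continue  # line has no structured block (e.g. a stack frame)
        fields = dict(FIELD_RE.findall(block.group(1)))
        if fields.get("level") == "ERROR" and "errorCode" in fields:
            counts[(fields.get("subcomp"), fields["errorCode"])] += 1
    return counts


# Sample lines taken from the excerpts in this article.
sample = [
    '[TIMESTAMP] ERROR main UfoCompactor - - [nsx@6876 comp="nsx-manager" '
    'errorCode="MP2" level="ERROR" subcomp="corfu-compactor"] '
    'UFO: Trim failed for ufo data in namespace ufo',
    'java.lang.RuntimeException: java.util.concurrent.TimeoutException',
    '[TIMESTAMP] ERROR NotificationThread Step 2948 - [nsx@6876 '
    'comp="nsx-manager" errorCode="CBM38" level="ERROR" subcomp="cbm"] '
    '[CBM38] Unable to read from the datastore. Datastore may be non-operational',
    '[TIMESTAMP] ERROR NotificationThread Step 2948 - [nsx@6876 '
    'comp="nsx-manager" errorCode="CBM38" level="ERROR" subcomp="cbm"] '
    '[CBM38] Unable to read from the datastore. Datastore may be non-operational',
]

counts = error_codes(sample)
for (subcomp, code), n in counts.most_common():
    print(subcomp, code, n)
```

Comparing the timestamps of these entries with the nfnic aborts in vmkernel.log can help establish whether the Manager-side errors coincide with the storage aborts on the host.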

Environment

  • VMware NSX 
  • VMware vCenter 7.0.x
  • VMware vCenter 8.0.x

Cause

This issue can occur when the storage underlying the NSX Manager nodes is unhealthy. Check the storage fabric and the array for dropped frames and hardware errors.

Resolution

Workaround:
Redeploy the affected NSX Manager node on a healthy datastore.