NSX manager services intermittently down and cluster degraded

Products

VMware NSX

VMware NSX-T Data Center

Issue/Introduction

Symptom:

- The NSX Manager cluster is intermittently reported as degraded in the NSX Manager GUI alarms.

- In the NSX Manager GUI, under System > Appliances > Manager XXX > VIEW DETAILS, the "MANAGER" and "HTTPS" services alternate between down and up.

- /var/log/proton/proton-tomcat-wrapper.log on the NSX Manager contains an error similar to this example:

INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | java.lang.NullPointerException
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at com.vmware.nsx.management.switching.infrastructure.ufodao.TransportZoneDaoHelper.convertToManagedTransportZone(TransportZoneDaoHelper.java:153)
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at com.vmware.nsx.management.switching.infrastructure.dao.TransportZoneQueryDao.getTransportZonesForAType_aroundBody16(TransportZoneQueryDao.java:431)
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at com.vmware.nsx.management.switching.infrastructure.dao.TransportZoneQueryDao$AjcClosure17.run(TransportZoneQueryDao.java:1)
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at org.aspectj.runtime.reflect.JoinPointImpl.proceed(JoinPointImpl.java:149)
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at io.micrometer.core.aop.TimedAspect.processWithTimer(TimedAspect.java:119)
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at io.micrometer.core.aop.TimedAspect.ajc$inlineAccessMethod$io_micrometer_core_aop_TimedAspect$io_micrometer_core_aop_TimedAspect$processWithTimer(TimedAspect.java:1)
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at io.micrometer.core.aop.TimedAspect.timedMethod(TimedAspect.java:97)
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at com.vmware.nsx.management.switching.infrastructure.dao.TransportZoneQueryDao.getTransportZonesForAType(TransportZoneQueryDao.java:417)
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at com.vmware.nsx.management.switching.infrastructure.service.DefaultTransportZones.init(DefaultTransportZones.java:59)
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at com.vmware.nsx.management.policy.policyframework.sharding.PolicyInitializer.lambda$processInitializerQueue$0(PolicyInitializer.java:141)
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown Source)
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at com.vmware.nsx.util.concurrent.Executors$MeteredRunnable.run(Executors.java:353)
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at java.base/java.lang.Thread.run(Unknown Source)
ERROR | wrapper | YYYY/MM/DD HH:MM:SS | Shutdown failed: Timed out waiting for signal from JVM.
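
To check a copy of the log for this signature, a small script along the following lines may help. It is only a sketch: the function name count_tz_npe and the three-line look-ahead window are illustrative choices, not part of any NSX tooling, and it assumes the stack trace is laid out as in the example above.

```python
NPE = "java.lang.NullPointerException"
FRAME = "TransportZoneDaoHelper.convertToManagedTransportZone"


def count_tz_npe(log_text: str, window: int = 3) -> int:
    """Count NullPointerException lines that are followed, within
    `window` lines, by the TransportZoneDaoHelper stack frame."""
    lines = log_text.splitlines()
    hits = 0
    for i, line in enumerate(lines):
        if NPE in line:
            # Look only at the next few lines of the stack trace.
            following = lines[i + 1 : i + 1 + window]
            if any(FRAME in nxt for nxt in following):
                hits += 1
    return hits


if __name__ == "__main__":
    # Path taken from the symptom description; run against a copy if preferred.
    with open("/var/log/proton/proton-tomcat-wrapper.log", errors="replace") as fh:
        print(count_tz_npe(fh.read()))
```

A non-zero count suggests the log contains the exception shown above. It does not by itself confirm the cause.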

Environment

VMware NSX-T Data Center

VMware NSX

Cause

A transport zone references a transport zone profile that is not present in the system. This causes a NullPointerException, which leads to the proton service on the manager restarting.

Comparing the Corfu printouts of the transport zone and transport zone profile tables shows that the referenced transport zone profile does not exist.

Use the commands below to dump the transport zone and transport zone profile tables:

/opt/vmware/bin/corfu_tool_runner.py --tool corfu-browser -n nsx -t PolicyTransportZone -o showTable > PolicyTransportZone.txt
/opt/vmware/bin/corfu_tool_runner.py --tool corfu-editor -n nsx -t PolicyTransportZoneProfile -o showTable > PolicyTransportZoneProfile.txt

- Then check the generated PolicyTransportZone.txt for a transport zone that refers to a stale transportZoneProfile, for example:
Key:
{
"stringId": "/infra/sites/default/enforcement-points/default/transport-zones/xxxx-xxxx-xxxx-xxxx"
}
"transportZoneProfilePath": ["/infra/transport-zone-profiles/vRNI-BFD_Profile_yyyy\yyyy\yyyy"],

The stale transportZoneProfile above, vRNI-BFD_Profile_yyyy\yyyy\yyyy, does not exist in PolicyTransportZoneProfile.txt.
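
The manual comparison of the two files can be sketched as a script. This is a minimal illustration under stated assumptions: the function name find_stale_profile_refs is hypothetical, profile paths are assumed to appear in PolicyTransportZone.txt as quoted strings beginning with /infra/transport-zone-profiles/ (as in the excerpt above), and an existing profile is assumed to appear in PolicyTransportZoneProfile.txt as the same quoted path. The real showTable layout may differ, so treat the result as a hint to verify by hand.

```python
import re

# Quoted key path of a table entry, e.g. "stringId": "/infra/sites/..."
STRING_ID = re.compile(r'"stringId":\s*"([^"]+)"')
# Any quoted transport zone profile path.
PROFILE_REF = re.compile(r'"(/infra/transport-zone-profiles/[^"]+)"')


def find_stale_profile_refs(tz_dump: str, profile_dump: str):
    """Return (transport_zone_path, profile_path) pairs for profile paths
    referenced in the transport zone dump whose quoted path is absent
    from the profile dump."""
    stale = []
    current_tz = None
    for line in tz_dump.splitlines():
        key = STRING_ID.search(line)
        if key:
            # Remember the most recent transport zone key seen.
            current_tz = key.group(1)
        for path in PROFILE_REF.findall(line):
            # Match the quoted path so that a longer profile name sharing
            # the same prefix is not mistaken for this one.
            if f'"{path}"' not in profile_dump:
                stale.append((current_tz, path))
    return stale


if __name__ == "__main__":
    with open("PolicyTransportZone.txt") as tz, open("PolicyTransportZoneProfile.txt") as prof:
        for tz_path, profile_path in find_stale_profile_refs(tz.read(), prof.read()):
            print(f"{tz_path} -> {profile_path}")
```

Each printed line pairs a transport zone with a profile path that was not found in the profile dump.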

Resolution

This issue will be resolved in NSX 4.2.2.

Workaround:

Delete the stale transportZoneProfile reference from the transport zone, or replace it with an existing transportZoneProfile.

If you believe you have encountered this issue, please open a support case with Broadcom Support and refer to this KB article.
For more information, see Creating and managing Broadcom support cases.