Symptom:
-The NSX Manager cluster is intermittently reported as degraded in the NSX Manager GUI alarms.
-In the NSX Manager GUI, under System ---> Appliances ---> Manager XXX ---> VIEW DETAILS, the "MANAGER" and "HTTPS" services alternate between down and up.
-On the NSX Manager, /var/log/proton/proton-tomcat-wrapper.log contains an error similar to this example:
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | java.lang.NullPointerException
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at com.vmware.nsx.management.switching.infrastructure.ufodao.TransportZoneDaoHelper.convertToManagedTransportZone(TransportZoneDaoHelper.java:153)
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at com.vmware.nsx.management.switching.infrastructure.dao.TransportZoneQueryDao.getTransportZonesForAType_aroundBody16(TransportZoneQueryDao.java:431)
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at com.vmware.nsx.management.switching.infrastructure.dao.TransportZoneQueryDao$AjcClosure17.run(TransportZoneQueryDao.java:1)
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at org.aspectj.runtime.reflect.JoinPointImpl.proceed(JoinPointImpl.java:149)
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at io.micrometer.core.aop.TimedAspect.processWithTimer(TimedAspect.java:119)
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at io.micrometer.core.aop.TimedAspect.ajc$inlineAccessMethod$io_micrometer_core_aop_TimedAspect$io_micrometer_core_aop_TimedAspect$processWithTimer(TimedAspect.java:1)
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at io.micrometer.core.aop.TimedAspect.timedMethod(TimedAspect.java:97)
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at com.vmware.nsx.management.switching.infrastructure.dao.TransportZoneQueryDao.getTransportZonesForAType(TransportZoneQueryDao.java:417)
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at com.vmware.nsx.management.switching.infrastructure.service.DefaultTransportZones.init(DefaultTransportZones.java:59)
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at com.vmware.nsx.management.policy.policyframework.sharding.PolicyInitializer.lambda$processInitializerQueue$0(PolicyInitializer.java:141)
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown Source)
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at com.vmware.nsx.util.concurrent.Executors$MeteredRunnable.run(Executors.java:353)
INFO | jvm 590 | YYYY/MM/DD HH:MM:SS | at java.base/java.lang.Thread.run(Unknown Source)
ERROR | wrapper | YYYY/MM/DD HH:MM:SS | Shutdown failed: Timed out waiting for signal from JVM.
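To check a copy of the log for this signature, a minimal Python sketch such as the following may help. It assumes the wrapper log keeps the pipe-separated layout shown in the example (level | source | timestamp | message); the function name npe_hits is hypothetical and not part of any NSX tooling.

```python
# Frame that identifies this issue in the example stack trace.
SIGNATURE = "TransportZoneDaoHelper.convertToManagedTransportZone"


def npe_hits(log_lines):
    """Return timestamps where the signature frame directly follows a NullPointerException."""
    hits = []
    previous_message = ""
    for line in log_lines:
        # Assumed layout: level | source | timestamp | message
        fields = line.rstrip("\n").split("|", 3)
        message = fields[-1].strip()
        if SIGNATURE in message and "NullPointerException" in previous_message:
            hits.append(fields[2].strip() if len(fields) == 4 else "")
        previous_message = message
    return hits
```

The function accepts any iterable of lines, so an open file handle for a copy of proton-tomcat-wrapper.log can be passed directly.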
VMware NSX-T Data Center
VMware NSX
The transport zone references a transport zone profile that is not present in the system. This causes a null pointer exception, which leads to the manager proton service restarting.
Comparing the Corfu printouts of the transport zone and transport zone profile tables shows that the referenced transport zone profile does not exist:
/opt/vmware/bin/corfu_tool_runner.py --tool corfu-browser -n nsx -t PolicyTransportZone -o showTable > PolicyTransportZone.txt
/opt/vmware/bin/corfu_tool_runner.py --tool corfu-editor -n nsx -t PolicyTransportZoneProfile -o showTable > PolicyTransportZoneProfile.txt
-Then check the generated PolicyTransportZone.txt for a transport zone that refers to a stale transportZoneProfile, that is, one that does not appear in PolicyTransportZoneProfile.txt. For example:
Key:
{
"stringId": "/infra/sites/default/enforcement-points/default/transport-zones/xxxx-xxxx-xxxx-xxxx"
}
"transportZoneProfilePath": ["/infra/transport-zone-profiles/vRNI-BFD_Profile_yyyy\yyyy\yyy"],
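The comparison of the two printouts can be sketched in Python as follows. This is an illustrative sketch, not a supported tool. It assumes that each entry in both files carries its key as "stringId": "<path>" and that each transport zone lists its profiles on one line as "transportZoneProfilePath": [...], as in the excerpt above. The actual showTable layout may differ between releases, and all function names here are hypothetical.

```python
import re

# Assumed patterns, based only on the excerpt shown in this article.
STRING_ID_RE = re.compile(r'"stringId"\s*:\s*"([^"]+)"')
PROFILE_LIST_RE = re.compile(r'"transportZoneProfilePath"\s*:\s*\[([^\]]*)\]')
QUOTED_RE = re.compile(r'"([^"]+)"')


def referenced_profiles(zone_dump):
    """Map each transport zone path to the profile paths it references."""
    refs = {}
    current_zone = None
    for line in zone_dump.splitlines():
        key = STRING_ID_RE.search(line)
        if key and "/transport-zones/" in key.group(1):
            current_zone = key.group(1)
        listed = PROFILE_LIST_RE.search(line)
        if listed and current_zone is not None:
            refs.setdefault(current_zone, []).extend(QUOTED_RE.findall(listed.group(1)))
    return refs


def existing_profiles(profile_dump):
    """Collect the profile paths that are present as keys in the profile table."""
    return {p for p in STRING_ID_RE.findall(profile_dump)
            if "/transport-zone-profiles/" in p}


def stale_references(zone_dump, profile_dump):
    """Return {transport zone path: [referenced profile paths missing from the profile table]}."""
    known = existing_profiles(profile_dump)
    stale = {}
    for zone, profiles in referenced_profiles(zone_dump).items():
        missing = [p for p in profiles if p not in known]
        if missing:
            stale[zone] = missing
    return stale


def report(zone_file="PolicyTransportZone.txt",
           profile_file="PolicyTransportZoneProfile.txt"):
    """Print every transport zone that refers to a profile absent from the profile printout."""
    with open(zone_file) as zones, open(profile_file) as profiles:
        for zone, missing in stale_references(zones.read(), profiles.read()).items():
            print(zone, "->", missing)
```

Calling report() in the directory that holds the two generated files prints any transport zone whose referenced profile is missing. Treat an empty result with caution and confirm it by inspecting the files manually, since the parsing relies on the assumed layout.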
This issue will be resolved in NSX 4.2.2.
Workaround:
Delete the stale transportZoneProfile reference from the transport zone, or replace it with an existing transportZoneProfile.
If you believe you have encountered this issue, please open a support case with Broadcom Support and refer to this KB article.
For more information, see Creating and managing Broadcom support cases.