NSX Manager fails and disconnects from the vCenter Server after a CPU spike

Article ID: 321110

Updated On:

Products

VMware NSX for vSphere

Issue/Introduction

Symptoms:
  • NSX Manager fails and disconnects from the vCenter Server after a CPU spike.
  • The NSX Manager User Interface (UI) shows high CPU alerts, with Firewall among the top consumers.
  • NSX Manager restarts automatically and is temporarily disconnected from the vCenter Server during the restart.
  • While NSX Manager is disconnected from the vCenter Server, the NGC UI displays an error similar to:

    No NSX Manager available
     
  • In the vsm.log file, you see entries similar to:

    SimpleTaskManager:171 - Error during publish Task AppNotificationHandler.
    java.lang.OutOfMemoryError: GC overhead limit exceeded
            at org.hibernate.type.descriptor.sql.DecimalTypeDescriptor.getExtractor(DecimalTypeDescriptor.java:60)
            at org.hibernate.type.AbstractStandardBasicType.nullSafeGet(AbstractStandardBasicType.java:260)
            at org.hibernate.type.AbstractStandardBasicType.nullSafeGet(AbstractStandardBasicType.java:256)
            at org.hibernate.type.AbstractStandardBasicType.nullSafeGet(AbstractStandardBasicType.java:246)
            at org.hibernate.loader.hql.QueryLoader.getResultRow(QueryLoader.java:453)
            at org.hibernate.loader.hql.QueryLoader.getResultColumnOrRow(QueryLoader.java:436)
            at org.hibernate.loader.Loader.getRowFromResultSet(Loader.java:769)
            at org.hibernate.loader.Loader.processResultSet(Loader.java:985)
            at org.hibernate.loader.Loader.doQuery(Loader.java:943)
            at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:349)
            at org.hibernate.loader.Loader.doList(Loader.java:2615)
            at org.hibernate.loader.Loader.doList(Loader.java:2598)
            at org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2430)
            at org.hibernate.loader.Loader.list(Loader.java:2425)
            at org.hibernate.loader.hql.QueryLoader.list(QueryLoader.java:502)
            at org.hibernate.hql.internal.ast.QueryTranslatorImpl.list(QueryTranslatorImpl.java:379)
            at org.hibernate.engine.query.spi.HQLQueryPlan.performList(HQLQueryPlan.java:216)
            at org.hibernate.internal.SessionImpl.list(SessionImpl.java:1488)
            at org.hibernate.query.internal.AbstractProducedQuery.doList(AbstractProducedQuery.java:1445)
            at org.hibernate.query.internal.AbstractProducedQuery.list(AbstractProducedQuery.java:1414)
            at org.hibernate.query.Query.getResultList(Query.java:146)
            at com.vmware.vshield.vsm.dao.AbstractTranslationDao.getTargetNodesByDynamicCriteria_aroundBody8(AbstractTranslationDao.java:147)
            at com.vmware.vshield.vsm.dao.AbstractTranslationDao$AjcClosure9.run(AbstractTranslationDao.java:1)
            at org.aspectj.runtime.reflect.JoinPointImpl.proceed(JoinPointImpl.java:149)
            at com.vmware.vshield.vsm.aspects.elapsedtime.ElapsedTimeAspect.doPerformance(ElapsedTimeAspect.java:61)
            at com.vmware.vshield.vsm.aspects.elapsedtime.ElapsedTimeAspect.ajc$inlineAccessMethod$com_vmware_vshield_vsm_aspects_elapsedtime_ElapsedTimeAspect$com_vmware_vshield_vsm_aspects_elapsedtime_ElapsedTimeAspect$doPerformance(ElapsedTimeAspect.java:1)
            at com.vmware.vshield.vsm.aspects.elapsedtime.ElapsedTimeAspect.annotatedMethod(ElapsedTimeAspect.java:56)
            at com.vmware.vshield.vsm.dao.AbstractTranslationDao.getTargetNodesByDynamicCriteria(AbstractTranslationDao.java:139)
            at com.vmware.vshield.vsm.dynamicmembership.service.translate.DynamicSetTranslator.translateCriteria(DynamicSetTranslator.java:151)
            at com.vmware.vshield.vsm.dynamicmembership.service.translate.DynamicSetTranslator.translateInternal(DynamicSetTranslator.java:131)
            at com.vmware.vshield.vsm.securitygroup.service.translate.AbstractTranslator.translate(AbstractTranslator.java:83)
            at com.vmware.vshield.vsm.securitygroup.service.translate.TranslationServiceVersion2.translate(TranslationServiceVersion2.java:63)...

     
  • In the nsx-wrapper logs, you see entries similar to:

    INFO | jvm 38 | 2018/01/04 10:41:34 | java.lang.OutOfMemoryError: GC overhead limit exceeded
    STATUS | wrapper | 2018/01/04 10:41:34 | The JVM has run out of memory. Requesting thread dump.
    STATUS | wrapper | 2018/01/04 10:41:34 | Dumping JVM state.
    STATUS | wrapper | 2018/01/04 10:41:34 | The JVM has run out of memory. Restarting JVM.

     
  • In the nsx-wrapper logs, the thread dump shows a thread in the runnable state whose stack trace contains lines similar to:

    Line 10060: INFO | jvm 38 | 2018/01/04 10:35:08 | at com.vmware.vshield.app.firewall.containers.IPListCollecter.getIpList(IPListCollecter.java:101)
    Line 10061: INFO | jvm 38 | 2018/01/04 10:35:08 | at com.vmware.vshield.app.firewall.containers.IPListCollecter.getIpList(IPListCollecter.java:90)
    Line 10062: INFO | jvm 38 | 2018/01/04 10:35:08 | at com.vmware.vshield.app.firewall.containers.IPListHandler.collectIpListForNode_aroundBody2(IPListHandler.java:77)
    Line 10063: INFO | jvm 38 | 2018/01/04 10:35:08 | at com.vmware.vshield.app.firewall.containers.IPListHandler$AjcClosure3.run(IPListHandler.java:1)
    Line 10068: INFO | jvm 38 | 2018/01/04 10:35:08 | at com.vmware.vshield.app.firewall.containers.IPListHandler.collectIpListForNode(IPListHandler.java:76)
    Line 10069: INFO | jvm 38 | 2018/01/04 10:35:08 | at com.vmware.vshield.firewall.messaging.service.FirewallProtobufConverter.collectIpListForNode_aroundBody0(FirewallProtobufConverter.java:550)
    Line 10070: INFO | jvm 38 | 2018/01/04 10:35:08 | at com.vmware.vshield.firewall.messaging.service.FirewallProtobufConverter$AjcClosure1.run(FirewallProtobufConverter.java:1)
    Line 10075: INFO | jvm 38 | 2018/01/04 10:35:08 | at com.vmware.vshield.firewall.messaging.service.FirewallProtobufConverter.toContainerList(FirewallProtobufConverter.java:550)
    Line 10076: INFO | jvm 38 | 2018/01/04 10:35:08 | at com.vmware.vshield.firewall.messaging.service.FirewallProtobufConverter.toRuleSet(FirewallProtobufConverter.java:635)
    Line 10077: INFO | jvm 38 | 2018/01/04 10:35:08 | at com.vmware.vshield.firewall.messaging.service.FirewallMessagingManager.publishRuleSetToCluster(FirewallMessagingManager.java:116)
    Line 10078: INFO | jvm 38 | 2018/01/04 10:35:08 | at com.vmware.vshield.firewall.service.impl.ConfigurationPublisher.readAndSendRules_aroundBody0(ConfigurationPublisher.java:135)
    Line 10079: INFO | jvm 38 | 2018/01/04 10:35:08 | at com.vmware.vshield.firewall.service.impl.ConfigurationPublisher$AjcClosure1.run(Co...
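
The log symptoms above can be checked mechanically. The following is a minimal sketch, not a supported tool: it counts three substrings copied from the excerpts in this article in whatever log lines it is given. The function name `count_signatures` and the signature labels are hypothetical names chosen for illustration.

```python
# Substrings copied from the vsm.log and nsx-wrapper excerpts in this article.
# The dictionary keys are illustrative labels, not NSX terminology.
SIGNATURES = {
    "gc_overhead": "GC overhead limit exceeded",
    "sg_translation": "DynamicSetTranslator.translateCriteria",
    "jvm_out_of_memory": "The JVM has run out of memory",
}


def count_signatures(lines):
    """Count how many of the given log lines contain each known signature."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, needle in SIGNATURES.items():
            if needle in line:
                counts[name] += 1
    return counts
```

To use it, pass an open log file, for example `count_signatures(open(path, errors="replace"))`, where `path` points to a copy of vsm.log or an nsx-wrapper log taken from a support bundle. Non-zero counts for both `gc_overhead` and `sg_translation` in vsm.log suggest, but do not prove, that the symptoms match this article.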


Environment

VMware NSX for vSphere 6.4.x

Cause

This issue occurs when multiple virtual machines, each configured with multiple IP addresses, are added to a Security Group (SG) through static or dynamic membership. When IP translation of such an SG is triggered, the NSX Manager JVM runs out of memory, as the OutOfMemoryError entries in the logs show, and NSX Manager restarts.

Resolution

This is a known issue affecting VMware NSX for vSphere 6.4.x.

Currently, there is no resolution.

Workaround:
To work around this issue:
  1. Redefine each SG that contains VMs configured with multiple IP addresses by splitting it into smaller SGs.
  2. Modify the firewall rules that used the original SGs so that they use the newly created smaller SGs.

To recover from this issue:

After NSX Manager restarts (this happens automatically once the issue occurs), delete the SGs that contain VMs configured with multiple IP addresses, and then follow the workaround steps.

Note: When this issue occurs, these virtual machines are not secured until the recovery is complete.
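
The splitting described in workaround step 1 can be planned offline before any SG is changed. The following is a minimal sketch under stated assumptions: it does not call any NSX API, the function names are hypothetical, and `max_ips_per_group` is an illustrative budget, since this article does not state a safe limit. It greedily packs VMs into groups so that the total number of IP addresses per group stays within the budget, then derives names for the smaller SGs.

```python
def split_by_ip_budget(vm_ips, max_ips_per_group):
    """Greedily pack VMs into groups whose total IP count stays within the budget.

    vm_ips maps a VM name to its list of IP addresses.
    A VM that alone exceeds the budget is placed in a group of its own.
    """
    if max_ips_per_group < 1:
        raise ValueError("max_ips_per_group must be at least 1")
    groups = []
    current, current_ips = [], 0
    for vm, ips in vm_ips.items():
        # Start a new group when adding this VM would exceed the budget.
        if current and current_ips + len(ips) > max_ips_per_group:
            groups.append(current)
            current, current_ips = [], 0
        current.append(vm)
        current_ips += len(ips)
    if current:
        groups.append(current)
    return groups


def name_groups(base_name, groups):
    """Map hypothetical new SG names (base_name-part1, ...) to their member VMs."""
    return {f"{base_name}-part{i}": members
            for i, members in enumerate(groups, start=1)}
```

The resulting name-to-members mapping doubles as a checklist for step 2: every firewall rule that used the original SG needs to reference all of the new, smaller SGs listed for it.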