NSX Manager with VIP becomes unresponsive and/or "Application has crashed" alarms are generated, with a proxy_oom core dump file on the VIP NSX Manager

Article ID: 389498

Updated On:

Products

VMware NSX

Issue/Introduction

  • vIDM authentication is used for NSX Manager by automation and other clients.
  • The alarm "Application on NSX node <node_id> has crashed" may be triggered on NSX Manager.
  • The NSX Manager that is the VIP leader may have a core dump located in /image/core:
    -rw------- 1 uproxy uproxy 453M Feb 13 10:40 proxy_oom.hprof
  • In the impacted Manager's log /var/log/proxy/proxy-tomcat-wrapper.log, entries similar to the following may be observed:
    INFO | jvm 1 | 2025/02/13 10:40:25 | "grpc-default-executor-735532" #1144511 daemon prio=5 os_prio=0 tid=0x00001b82248e0000 nid=0xd9f44 waiting on condition [0x##########ca4000]
    INFO | jvm 1 | 2025/02/13 10:40:25 | java.lang.Thread.State: WAITING (parking)
    INFO | jvm 1 | 2025/02/13 10:40:25 | at sun.misc.Unsafe.park(Native Method)
    INFO | jvm 1 | 2025/02/13 10:40:25 | - parking to wait for <0x############e158> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    INFO | jvm 1 | 2025/02/13 10:40:25 | at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    INFO | jvm 1 | 2025/02/13 10:40:25 | at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
  • These appear along with entries similar to the following:
    INFO | jvm 1 | 2025/02/13 10:40:31 | "Processing request ########-1d4f-4883-b4a3-############" #1132391 daemon prio=5 os_prio=0 tid=0x##########e3b000 nid=0x3f455d runnable [0x##########389000]
    INFO | jvm 1 | 2025/02/13 10:40:31 | java.lang.Thread.State: RUNNABLE
    INFO | jvm 1 | 2025/02/13 10:40:31 | at java.net.SocketInputStream.socketRead0(Native Method)
    INFO | jvm 1 | 2025/02/13 10:40:31 | at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    ...
    INFO | jvm 1 | 2025/02/13 10:40:31 | at org.springframework.web.client.RestTemplate.execute(RestTemplate.java:717)
    INFO | jvm 1 | 2025/02/13 10:40:31 | at org.springframework.web.client.RestTemplate.exchange(RestTemplate.java:608)
    INFO | jvm 1 | 2025/02/13 10:40:31 | at com.vmware.nsx.management.rp.security.oauth2.VidmTokenServices.initDiscoveryEndPoint(VidmTokenServices.java:234)
  • The logs may show a high number of threads busy processing authentication:
    └─$ grep "Processing request" /var/log/proxy/proxy-tomcat-wrapper.log | wc -l
    99
  • Many authentication requests may also be in a pending state:
    └─$ zgrep "grpc-default-" /var/log/proxy/proxy-tomcat-wrapper.log* | wc -l
    12546
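
The two counts above can be collected in one pass. The following is a minimal sketch, not a supported tool: the function names count_matches and summarise_proxy_threads are hypothetical, and the default file glob is assumed to cover the live log and its rotated copies. The counts are line counts, so a thread that appears in several thread dumps is counted once per dump.

```shell
# Sketch (hypothetical helper names): count thread-dump lines in the proxy wrapper logs.

# count_matches PATTERN FILE...
# Prints the total number of lines matching PATTERN across all given files.
count_matches() {
    cm_pattern=$1
    shift
    if command -v zgrep >/dev/null 2>&1; then
        # zgrep also reads gzip-compressed rotated logs.
        zgrep -h -e "$cm_pattern" "$@" 2>/dev/null | wc -l | tr -d ' '
    else
        # Fallback: plain grep only reads uncompressed logs correctly.
        grep -h -e "$cm_pattern" "$@" 2>/dev/null | wc -l | tr -d ' '
    fi
}

# summarise_proxy_threads [FILE...]
# With no arguments, reads the live log and its rotated copies (assumed location).
summarise_proxy_threads() {
    if [ "$#" -eq 0 ]; then
        set -- /var/log/proxy/proxy-tomcat-wrapper.log*
    fi
    echo "busy_auth_threads=$(count_matches 'Processing request' "$@")"
    echo "pending_grpc_threads=$(count_matches 'grpc-default-' "$@")"
}
```

Running summarise_proxy_threads as root on the impacted Manager prints one key=value line per pattern, which can be attached to a support request alongside the log bundle.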

Environment

VMware NSX

Cause

This issue is caused by missing or slow responses from the vIDM server (for example, due to a slow network or a busy vIDM server). Authentication requests then pile up on NSX Manager until the JVM of the proxy service is exhausted and the proxy service runs out of memory.
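
To check whether vIDM is responding slowly, its response time can be measured with curl. This is only a sketch: the function name probe_vidm_latency is hypothetical, the URL is a placeholder to be replaced with an endpoint of your own vIDM server, and this article does not define what response time counts as too slow.

```shell
# Sketch (hypothetical helper name): time a single HTTPS request to a vIDM URL.

# probe_vidm_latency URL [MAX_SECONDS]
# Prints the total request time in seconds, or fails if curl reports an error
# (timeout, DNS failure, TLS failure, and so on).
probe_vidm_latency() {
    pv_url=$1
    pv_max=${2:-10}
    # -o /dev/null discards the body; -w '%{time_total}' prints the elapsed time.
    if pv_elapsed=$(curl -sS -o /dev/null --max-time "$pv_max" -w '%{time_total}' "$pv_url"); then
        echo "response_seconds=$pv_elapsed"
    else
        echo "request to $pv_url failed or timed out (limit ${pv_max}s)" >&2
        return 1
    fi
}

# Example call; vidm.example.com is a placeholder, not a real endpoint:
#   probe_vidm_latency https://vidm.example.com/ 10
```

Repeating the probe a few times from the NSX Manager node gives a rough picture of whether the slowness is constant or intermittent.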

Resolution

VMware NSX 4.2.1, available at Broadcom downloads, introduced improvements to prevent the proxy service from running out of memory due to slowness on the vIDM side. Further improvements are planned for a future version of NSX.

If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.

 

Workaround

  • To avoid this issue, use a local user or a principal identity for authentication to NSX Manager.

  • If you are already impacted and the NSX UI is unresponsive:

    1. Log in as the root user on the impacted Manager.
    2. Restart the proxy/envoy service using the command below and confirm that its status is active:
      # systemctl restart envoy; systemctl --no-pager status envoy | grep Active

           Active: active (running) since Mon 2025-03-17 12:52:37 UTC; 12ms ago
    3. Log in as the admin user.
    4. Confirm that the service is back up:
      > get service http
    5. Confirm that the cluster is healthy again after the service restart:
      > get cluster status
    6. If the service does not restart, reboot the NSX Manager node.
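
The restart and status check in step 2 can be wrapped in a small root-shell helper that waits for the service to report active. This is a sketch under assumptions: the function name restart_and_verify and the 30-second default timeout are illustrative choices, not values from this article. It relies only on systemctl restart and systemctl is-active.

```shell
# Sketch (hypothetical helper name): restart a systemd service and wait until it is active.

# restart_and_verify SERVICE [TIMEOUT_SECONDS]
restart_and_verify() {
    rv_service=$1
    rv_timeout=${2:-30}   # assumed default; adjust as needed
    rv_waited=0
    systemctl restart "$rv_service" || return 1
    while [ "$rv_waited" -lt "$rv_timeout" ]; do
        # systemctl is-active prints the unit state, for example "active" or "failed".
        if [ "$(systemctl is-active "$rv_service")" = "active" ]; then
            echo "$rv_service is active"
            return 0
        fi
        sleep 1
        rv_waited=$((rv_waited + 1))
    done
    echo "$rv_service is not active after ${rv_timeout}s; a reboot of the node may be required" >&2
    return 1
}

# Example call as root on the impacted Manager:
#   restart_and_verify envoy 30
```

A non-zero exit status corresponds to step 6: the service did not come back, so a reboot of the NSX Manager node is the next step.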

 

Additional Information

If you are contacting Broadcom support about this issue, please provide the following:

  • NSX Manager support bundles
  • Text of any error messages seen in NSX GUI or command lines pertinent to the investigation

Handling Log Bundles for offline review with Broadcom support

If the steps here have not resolved the issue for you, you can refer to the following KB which can provide further troubleshooting steps:

Troubleshooting NSX issues

Application on NSX node has crashed alarm