Proton service repeatedly crashing on NSX after NSX onboarded with large Directory Users on SSP

Article ID: 403824

Updated On:

Products

VMware vDefend Firewall with Advanced Threat Prevention, VMware vDefend Firewall

Issue/Introduction

After Security Services Platform (SSP) 5.0 is onboarded to an NSX setup (version 4.2.1 and above), the Proton service on the NSX Manager nodes may crash or restart frequently in environments where LDAP is configured within SSP and the number of Directory Users exceeds 100,000.

This behaviour is caused by streaming of Directory User information from NSX to SSP during initial configuration sync. In setups without LDAP integration, this issue does not occur.

In some cases, Proton instability may persist even after offboarding SSP, due to residual memory pressure or lingering config states. The issue typically manifests within minutes or a few hours of onboarding SSP.

Symptoms:

  • Repeated Proton restarts on NSX Manager nodes.
  • Memory-related errors or OOM alerts in /var/log/proton/proton-tomcat-wrapper.log.
  • Large .hprof heap dump files generated (location: /image/core/proton_oom.hprof).

Environment

SSP 5.0 onboarded to NSX 4.2.1 and above.

Cause

When SSP is onboarded, it initiates a full config sync from NSX Manager. This sync includes the DirectoryUser entries synced from Active Directory (AD). If your AD contains hundreds of thousands of users, Proton attempts to ingest them all at once.
This causes a sudden memory spike in the Proton service, which triggers a Java heap OutOfMemoryError and eventually crashes the service.

To auto-heal the system, the Proton service is restarted each time an OOM is observed. However, after each restart the same large set of user information is streamed again, which causes another memory spike and another OOM. The cycle can repeat continuously, leading to frequent restarts and system instability.

To check the number of Directory Users in the system, on an NSX Manager node, navigate to:

cd /opt/vmware/bin

From this path, run the following command, which gives the total count of Directory Users in the Corfu table:

corfu_tool_runner.py -n nsx -t DirectoryUser -o showTable

To confirm that the OOM condition was hit, SSH to the NSX Manager CLI and check the following log file for entries like the ones below:

PATH: /var/log/proton

FILE: proton-tomcat-wrapper.log

STATUS | wrapper  | 2025/07/15 19:46:40 | The JVM has run out of memory.  Requesting thread dump.
STATUS | wrapper  | 2025/07/15 19:46:40 | Dumping JVM state.
STATUS | wrapper  | 2025/07/15 19:46:41 | JVM process exited with a code of 3, setting the Wrapper exit code to 3.
ERROR  | wrapper  | 2025/07/15 19:46:41 | JVM exited unexpectedly.
STATUS | wrapper  | 2025/07/15 19:46:47 | Launching a JVM...
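
The wrapper log can also be scanned for OOM events instead of being read by eye. The following is a minimal sketch, assuming the message text matches the sample above; the function name count_proton_oom is illustrative and not part of NSX.

```shell
# Count JVM out-of-memory events recorded by the Proton wrapper.
# The default path and the message text come from the log sample in this
# article; the function name is hypothetical.
count_proton_oom() {
  log_file="${1:-/var/log/proton/proton-tomcat-wrapper.log}"
  # grep -c prints the number of matching lines (0 when there are none).
  grep -c 'The JVM has run out of memory' "$log_file"
}
```

Running count_proton_oom with no argument reads the default log path. A count that keeps growing between runs suggests the crash and restart loop described in the Cause section.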

Resolution

STEP 1: SSH to the SSPi CLI and edit the nsx-config config map:

k edit cm nsx-config -n nsxi-platform
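
Before editing, it may be worth saving a copy of the config map so that the removed field can be looked up later. The following is a minimal sketch, assuming that k on the SSPi CLI is an alias for kubectl (the article does not state this); the function name, the default file name, and the KUBECTL override variable are all illustrative.

```shell
# Save the current nsx-config config map to a YAML file before editing it.
# Assumption: "k" in this article is an alias for kubectl. KUBECTL is a
# hypothetical override for the binary to call; it defaults to kubectl.
backup_nsx_config() {
  backup_file="${1:-nsx-config-backup.yaml}"
  "${KUBECTL:-kubectl}" get cm nsx-config -n nsxi-platform -o yaml > "$backup_file"
}
```

Calling backup_nsx_config with no argument writes nsx-config-backup.yaml to the current directory. The saved file shows the original DirectoryUser entry, including the verticals field, in case the edit needs to be reviewed or reverted.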

STEP 2: Filter out streaming of DirectoryUsers:

Search for POV. There will be two matches; find the entry for DirectoryUser and remove its verticals field, as highlighted in the attached screenshot.

STEP 3: Restart the nsx-config pods:

k rollout restart sts nsx-config-0 -n nsxi-platform

k rollout restart sts nsx-config-1 -n nsxi-platform

Wait for the nsx-config pods to come up:

k get pod -n nsxi-platform -w | grep nsx-config
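
The pod check can also be scripted rather than watched. The following is a minimal sketch, assuming the standard kubectl pod listing columns (NAME, READY, STATUS, RESTARTS, AGE); the function name nsx_config_pods_ready is illustrative.

```shell
# Read pod listing output on stdin and succeed only when at least one
# nsx-config pod is listed and every such pod is Running with all of its
# containers ready (READY column of the form n/n).
# The function name is hypothetical.
nsx_config_pods_ready() {
  awk '
    $1 ~ /^nsx-config/ {
      seen = 1
      split($2, ready, "/")
      if ($3 != "Running" || ready[1] != ready[2]) bad = 1
    }
    END { exit (seen && !bad) ? 0 : 1 }
  '
}
```

For example, piping the pod listing without the -w flag into the function, as in k get pod -n nsxi-platform | nsx_config_pods_ready, returns exit status 0 once all nsx-config pods are ready and a non-zero status otherwise.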

After the pods are up, monitor the Proton service on the NSX Manager nodes to confirm that it remains stable.

This is a known issue and will be fixed in an upcoming SSP version.

Additional Information

If the resolution mentioned in this KB does not address your issue, refer to the Master KB for NSX Onboarding Issues, which lists all known onboarding scenarios, causes, and troubleshooting methods.