Proton service restarts intermittently due to state sync timeouts with large NSGroups

Article ID: 442303

Updated On:

Products

VMware NSX

Issue/Introduction

  • The proton service on NSX-T Manager restarts intermittently.
  • Alarms for "Application on NSX node has crashed" may be observed in the NSX UI.
  • Users may experience temporary management plane instability or UI inaccessibility.
  • In the state sync logs, you will see a request for a delta sync that does not receive a response, followed by a proton restart.


    /var/log/proton/proton_restart.log

    The log entries below show that proton restarted because delta sync processing made no progress for 960000 ms (16 minutes):


    2021-02-01T17:07:07.631Z INFO application-restartor restartor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] ===== APPLICATION IS GOING RESTART (No progress in delta processing was made in 960000) =====
    2021-02-08T08:40:39.702Z INFO application-restartor restartor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] ===== APPLICATION IS GOING RESTART (No progress in delta processing was made in 960000) =====
    2021-02-09T21:30:33.484Z INFO application-restartor restartor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] ===== APPLICATION IS GOING RESTART (Fatal error in Delta State Sync.) =====
    2021-02-10T17:56:52.951Z INFO application-restartor restartor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] ===== APPLICATION IS GOING RESTART (No progress in delta processing was made in 960000) =====
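To see how often each restart reason occurs, the restart banners can be extracted from the log. The following is a minimal Python sketch, assuming the line layout matches the excerpt above; the function names `parse_restarts` and `count_by_reason` are illustrative and not part of NSX.

```python
import re
from datetime import datetime, timezone

# Pattern for the restart banner, based only on the log excerpt above.
# Assumption: other NSX versions may format this line differently.
RESTART_RE = re.compile(
    r"^\s*(?P<ts>\S+)\s+INFO application-restartor .*"
    r"===== APPLICATION IS GOING RESTART \((?P<reason>.*)\) =====\s*$"
)


def parse_restarts(lines):
    """Return a list of (UTC timestamp, reason) for each restart banner."""
    events = []
    for line in lines:
        match = RESTART_RE.match(line)
        if match is None:
            continue  # not a restart banner
        ts = datetime.strptime(
            match.group("ts"), "%Y-%m-%dT%H:%M:%S.%fZ"
        ).replace(tzinfo=timezone.utc)
        events.append((ts, match.group("reason")))
    return events


def count_by_reason(events):
    """Count restart events per reason string."""
    counts = {}
    for _, reason in events:
        counts[reason] = counts.get(reason, 0) + 1
    return counts
```

To use it, pass the lines of a copy of /var/log/proton/proton_restart.log to `parse_restarts` and the result to `count_by_reason`. A high count for the "No progress in delta processing" reason would be consistent with the symptom described in this article.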

Environment

VMware NSX-T 2.5.1.1

Cause

  • The issue is caused by a state sync timeout within the Proton service.
  • The state sync thread requests a delta sync from the NSGroupDeltaSyncMessageProvider.
  • If no response is received within the timeout window (e.g., 16 minutes, which matches the 960000 ms value in the restart log), the system automatically restarts the proton service to recover.
  • This delay occurs when the environment contains NSGroups with a very large number of members (exceeding 8,000 members). Processing these large groups causes the provider to exceed the allowed response time for sync requests.
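The timeout behavior described above can be modeled as a simple progress watchdog. This is only a conceptual Python sketch and not the actual proton implementation: the function names, the millisecond clock values, and the choice of `>=` for the comparison are assumptions. Only the 960000 value comes from the restart log.

```python
# Threshold taken from the restart log message; the unit is assumed to be
# milliseconds because 960000 ms equals the 16 minutes stated in this article.
RESTART_THRESHOLD_MS = 960_000


def threshold_minutes(threshold_ms=RESTART_THRESHOLD_MS):
    """Convert the threshold from milliseconds to minutes."""
    return threshold_ms / 60_000


def should_restart(last_progress_ms, now_ms, threshold_ms=RESTART_THRESHOLD_MS):
    """Return True when no delta sync progress was seen for threshold_ms.

    Hypothetical model: the real service may track progress differently.
    """
    return now_ms - last_progress_ms >= threshold_ms
```

Under this model, a provider that needs longer than the threshold to answer a delta sync request for a very large NSGroup would cause `should_restart` to return True, which corresponds to the restart banner in the log.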

Resolution

This issue is addressed in NSX-T 2.5.3 and later releases, which improve processing efficiency for large dynamic groups.

Workaround

To mitigate the frequent restarts, consider the following:

  1. Reduce NSGroup Size: Where possible, break down extremely large NSGroups into smaller, more manageable groups to reduce the sync processing load.
  2. Manual Proton Restart: If the service remains in a degraded state, perform a rolling restart of the proton service on the NSX Manager nodes.

    /etc/init.d/proton restart

    ⚠️ IMPORTANT: Perform this during a maintenance window and take a backup before proceeding.
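For the first workaround item (reducing NSGroup size), it can help to identify which groups exceed the member threshold and to plan how their members would be divided. The following is a minimal Python sketch that works on plain data only. It does not call any NSX API. The input dictionary, the function names, and the default limit of 8000 (taken from the Cause section) are assumptions for illustration.

```python
def oversized_groups(member_counts, limit=8000):
    """Return names of groups with more than `limit` members, largest first.

    `member_counts` is a hypothetical mapping of group name to member count,
    for example built from an inventory export.
    """
    large = [name for name, count in member_counts.items() if count > limit]
    return sorted(large, key=lambda name: member_counts[name], reverse=True)


def split_members(members, max_size=8000):
    """Split a list of members into consecutive chunks of at most max_size."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    return [members[i:i + max_size] for i in range(0, len(members), max_size)]
```

Each chunk returned by `split_members` could become the membership of one smaller group. Whether a simple split like this is appropriate depends on how the original group is defined (static members versus dynamic criteria) and on which firewall rules reference it.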