NSX-Config Pods Enter CrashLoopBackOff Due to Prolonged SSP Connectivity Disruption Causing Full Sync Backlog

Article ID: 441086

Updated On:

Products

VMware vDefend Firewall with Advanced Threat Prevention
VMware vDefend Firewall

Issue/Introduction

The following symptoms are observed in the SSP cluster:

  • nsx-config pods enter the CrashLoopBackOff state.

  • nsx-config pods show a high restart count.

  • The k describe pod output for an nsx-config pod shows the following exit code (137 means the container was terminated with SIGKILL):

    Exit Code: 137
    Reason: Error

  • The pod events show readiness and liveness probe timeouts:

    Readiness probe failed: Get "http://<pod-ip>:8080/actuator/health/readiness": context deadline exceeded
    Liveness probe failed: Get "http://<pod-ip>:8080/actuator/health/liveness": context deadline exceeded
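As a small aid for reading the exit code above: by shell convention, an exit code greater than 128 means the process was ended by signal number (code - 128). The helper below is only an illustrative sketch, and the function name exit_code_signal is hypothetical.

```shell
# Print the signal number encoded in a container exit code.
# Codes above 128 follow the convention 128 + signal number;
# anything else is a normal exit, reported here as 0 (no signal).
exit_code_signal() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    echo $((code - 128))
  else
    echo 0
  fi
}
```

For the value shown in this article, exit_code_signal 137 prints 9, which is SIGKILL.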

 

The nsx-config pod logs repeatedly show Full Sync checks without a successful completion:

 

2026-05-20T12:20:12,459 INFO  [00000000-0000-0000-0000-be63ac903dbc] c.v.n.p.n.c.FullSyncCompletionChecker$1: INTELLIGENCE [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Full Sync check Task performed on: Wed May 20 12:20:12 GMT 2026nfor site: 00000000-0000-0000-0000-be63ac903dbc

2026-05-20T12:20:12,505 INFO  [00000000-0000-0000-0000-be63ac903dbc] c.v.n.p.n.c.FullSyncCompletionChecker$1: INTELLIGENCE [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Full Sync check Task performed on: Wed May 20 12:20:12 GMT 2026nfor site: 00000000-0000-0000-0000-be63ac903dbc

2026-05-20T12:20:12,519 INFO  [00000000-0000-0000-0000-be63ac903dbc] c.v.n.p.n.c.FullSyncCompletionChecker$1: INTELLIGENCE [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Full Sync check Task performed on: Wed May 20 12:20:12 GMT 2026nfor site: 00000000-0000-0000-0000-be63ac903dbc

2026-05-20T12:20:12,521 INFO  [00000000-0000-0000-0000-be63ac903dbc] c.v.n.p.n.c.FullSyncCompletionChecker$1: INTELLIGENCE [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Full Sync check Task performed on: Wed May 20 12:20:12 GMT 2026nfor site: 00000000-0000-0000-0000-be63ac903dbc

2026-05-20T12:20:12,533 INFO  [00000000-0000-0000-0000-be63ac903dbc] c.v.n.p.n.c.FullSyncCompletionChecker$1: INTELLIGENCE [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Full Sync check Task performed on: Wed May 20 12:20:12 GMT 2026nfor site: 00000000-0000-0000-0000-be63ac903dbc

2026-05-20T12:20:12,540 INFO  [00000000-0000-0000-0000-be63ac903dbc] c.v.n.p.n.c.FullSyncCompletionChecker$1: INTELLIGENCE [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Full Sync check Task performed on: Wed May 20 12:20:12 GMT 2026nfor site: 00000000-0000-0000-0000-be63ac903dbc

2026-05-20T12:20:12,576 INFO  [00000000-0000-0000-0000-be63ac903dbc] c.v.n.p.n.c.FullSyncCompletionChecker$1: INTELLIGENCE [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Full Sync check Task performed on: Wed May 20 12:20:12 GMT 2026nfor site: 00000000-0000-0000-0000-be63ac903dbc

2026-05-20T12:20:12,595 INFO  [00000000-0000-0000-0000-be63ac903dbc] c.v.n.p.n.c.FullSyncCompletionChecker$1: INTELLIGENCE [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Full Sync check Task performed on: Wed May 20 12:20:12 GMT 2026nfor site: 00000000-0000-0000-0000-be63ac903dbc

Environment

SSP 5.x

Cause

A prolonged SSP connectivity disruption can destabilize synchronization between NSX and SSP.

If the disruption lasts for an extended period (for example, during upgrades or prolonged SSP unavailability), the following conditions may occur:

  • NSX ↔ SSP synchronization remains disrupted.

  • Full synchronization workflows cannot complete successfully.

The resulting Full Sync backlog leaves the nsx-config consumer far behind on old events, and the nsx-config pods enter CrashLoopBackOff.

Resolution

SSH to the SSP-Installer as sysadmin or root and run the following commands to debug the issue:

 

(1) Log in to the cluster-api pod

(a) k get pods -A | grep cluster-api

Note the cluster-api pod name.

(b) k exec -it <cluster-api-pod-name> -c cluster-api -n nsxi-platform -- /bin/bash

Run the Kafka commands in steps (2) and (3) from this pod shell.
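To avoid copying the pod name by hand, the lookup in step (a) can be scripted. This is a minimal sketch that assumes the standard kubectl get pods -A column layout (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE); the function name cluster_api_pod is hypothetical.

```shell
# Print the name of the first pod whose NAME column starts with "cluster-api".
# Reads `kubectl get pods -A` output on stdin, where NAME is the second column.
cluster_api_pod() {
  awk '$2 ~ /^cluster-api/ { print $2; exit }'
}

# Example use on the SSP-Installer (assumes k is its alias for kubectl):
#   POD=$(kubectl get pods -A | cluster_api_pod)
#   kubectl exec -it "$POD" -c cluster-api -n nsxi-platform -- /bin/bash
```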

(2) Check Kafka consumer lag/groups

/opt/kafka/bin/kafka-consumer-groups.sh \
--bootstrap-server kafka:9092 \
--offsets \
--describe \
--all-groups \
--command-config /root/adminclient.props

 

Focus on consumer group:

intelligence-nsx-config-update

and topics:

nsx2pace-config-group1
nsx2pace-config-group2

Example problematic output:

GROUP                           TOPIC                   CURRENT-OFFSET  LOG-END-OFFSET  LAG
intelligence-nsx-config-update  nsx2pace-config-group1  100             250000          249900
intelligence-nsx-config-update  nsx2pace-config-group2  200             300000          299800

 

Large LAG values indicate that the consumer is heavily behind and still attempting to process old backlog events.

Proceed to step (3) if you see large LAG values.
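With --all-groups the describe output can be long, so it may help to filter it down to the rows with a large LAG. The filter below is a sketch only: the function name high_lag and the default threshold of 10000 are hypothetical, and what counts as a "large" lag depends on the environment. It locates the GROUP, TOPIC and LAG columns from the header row, so it should work whether or not the tool also prints columns such as PARTITION or CONSUMER-ID.

```shell
# Print "group topic lag" for every row whose LAG exceeds a threshold.
# Reads kafka-consumer-groups.sh --describe output on stdin.
high_lag() {
  threshold="${1:-10000}"
  awk -v max="$threshold" '
    $1 == "GROUP" {                      # header row: remember column positions
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    # Data row: skip blank lines and non-numeric LAG values such as "-"
    ("LAG" in col) && NF >= col["LAG"] && $(col["LAG"]) ~ /^[0-9]+$/ && $(col["LAG"]) + 0 > max {
      print $(col["GROUP"]), $(col["TOPIC"]), $(col["LAG"])
    }
  '
}

# Example use inside the cluster-api pod:
#   /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
#     --describe --all-groups --command-config /root/adminclient.props | high_lag 10000
```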

 

(3) Reset Kafka consumer offsets to latest

To skip the old, stale synchronization backlog and move the intelligence-nsx-config-update consumer group to the latest available messages on both topics:

/opt/kafka/bin/kafka-consumer-groups.sh \
--bootstrap-server kafka:9092 \
--command-config /root/adminclient.props \
--group intelligence-nsx-config-update \
--topic nsx2pace-config-group1 \
--reset-offsets \
--to-latest \
--execute

 

/opt/kafka/bin/kafka-consumer-groups.sh \
--bootstrap-server kafka:9092 \
--command-config /root/adminclient.props \
--group intelligence-nsx-config-update \
--topic nsx2pace-config-group2 \
--reset-offsets \
--to-latest \
--execute

 

This operation skips all old unread Kafka messages for the specified consumer group and starts consumption from the newest available events.
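Because the reset discards unread messages, it can be worth previewing it first. The Kafka tool accepts --dry-run in place of --execute for --reset-offsets, which prints the offsets that would be set without changing anything. The wrapper below is a sketch under assumptions: the function name reset_to_latest and the KAFKA_CG variable are hypothetical conveniences, and the remaining flags are copied from the commands above.

```shell
# Path to the Kafka tool; overridable through the (hypothetical) KAFKA_CG variable.
KAFKA_CG="${KAFKA_CG:-/opt/kafka/bin/kafka-consumer-groups.sh}"

# reset_to_latest MODE TOPIC...
# MODE is --dry-run (preview only) or --execute (apply the reset).
reset_to_latest() {
  mode="$1"
  shift
  for topic in "$@"; do
    "$KAFKA_CG" \
      --bootstrap-server kafka:9092 \
      --command-config /root/adminclient.props \
      --group intelligence-nsx-config-update \
      --topic "$topic" \
      --reset-offsets --to-latest "$mode"
  done
}

# Example use inside the cluster-api pod:
#   reset_to_latest --dry-run nsx2pace-config-group1 nsx2pace-config-group2
#   reset_to_latest --execute nsx2pace-config-group1 nsx2pace-config-group2
```

Kafka only applies a reset when the consumer group has no active members, so an error at this step may mean that an nsx-config consumer is still connected.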

Example:

Before reset:

Topic                   Current Offset  Latest Offset  Lag
nsx2pace-config-group1  100             250000         249900

 

After reset:

Topic                   Current Offset  Latest Offset  Lag
nsx2pace-config-group1  250000          250000         0

This is equivalent to marking all old backlog messages as read.

 

(4) Verify Stale Epoch Records in Postgres

Connect to the Postgres database and check unpublished/stale epochs:

(a) export KUBECONFIG=/config/clusterctl/1/workload.kubeconfig

(b) alias pg='kubectl exec -it postgresql-ha-postgresql-0 -n nsxi-platform -- /bin/bash -c "PGPASSWORD=$(kubectl get secret postgresql-password -o jsonpath={.data.postgresql-password} -n nsxi-platform | base64 -d) psql -d pace"'

(c) Run pg to open a psql session on the pace database, then execute the following query:

select count(*) from nsx_config.policynsxconfigepoch
where published=false;

Large counts indicate stale synchronization state accumulation.

Example:
24851

(d) Identify Latest Healthy Epochs

Check the latest epochs:

select * from nsx_config.policynsxconfigepoch
order by epoch desc limit 5;

Example output:

epoch  fullsynccomplete  published
30089  t                 t
30088  t                 t
30087  f                 f
30086  f                 f

Interpretation:

Latest healthy/completed epochs:
30088
30089


Older incomplete/stale epochs:
30087 and below
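The interpretation above can be expressed as a small filter: walking the rows from the newest epoch downward, the cutoff is the lowest epoch in the unbroken run of rows where both flags are t. The sketch below assumes its input is in psql's unaligned, tuples-only form (psql -A -t, fields separated by |) with the columns in the order epoch, fullsynccomplete, published, as in the example output. The function name healthy_cutoff is hypothetical.

```shell
# Print the lowest epoch in the leading run of healthy rows.
# Input: lines of "epoch|fullsynccomplete|published", newest epoch first.
healthy_cutoff() {
  awk -F'|' '
    $2 == "t" && $3 == "t" { cutoff = $1; next }  # still inside the healthy run
    { exit }                                      # first unhealthy row ends the run
    END { if (cutoff != "") print cutoff }        # prints nothing if the newest row is unhealthy
  '
}

# Example use, adapting the psql invocation to your environment:
#   psql -At -d pace -c "select epoch, fullsynccomplete, published
#     from nsx_config.policynsxconfigepoch order by epoch desc limit 50" | healthy_cutoff
```

With the sample rows from this article the filter prints 30088. If it prints nothing, the newest epoch is not yet healthy and there is no safe cutoff to use.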

(e) Delete Older Stale Epoch Records

Delete stale epochs older than the latest known healthy epoch.

Example:

delete from nsx_config.policynsxconfigepoch
where epoch < 30088;

Sample explanation:

If epochs 30088 and 30089 are healthy (published=true and fullsynccomplete=true), then older epochs are considered stale historical backlog and can be removed.

Expected result:

DELETE <number of rows deleted>

psql reports how many rows were removed, which confirms that the old stale synchronization records were cleaned up.
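Since the delete is irreversible once committed, one cautious option is to run it inside a transaction and check the row count before committing. The helper below is only a sketch: the function name stale_epoch_sql is hypothetical, and it merely prints SQL to paste into the psql session after validating that the cutoff is a plain number.

```shell
# Print a transaction that previews and then deletes epochs below a cutoff.
# The transaction is left open so the operator can COMMIT or ROLLBACK.
stale_epoch_sql() {
  cutoff="$1"
  case "$cutoff" in
    ''|*[!0-9]*) echo "cutoff must be a non-negative integer" >&2; return 1 ;;
  esac
  cat <<EOF
BEGIN;
SELECT count(*) FROM nsx_config.policynsxconfigepoch WHERE epoch < $cutoff;
DELETE FROM nsx_config.policynsxconfigepoch WHERE epoch < $cutoff;
-- Compare the count with the DELETE result, then run COMMIT; to keep it or ROLLBACK; to undo it.
EOF
}
```

For the example in this article, stale_epoch_sql 30088 prints the statements for epochs below 30088.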

(5) Validate Full Sync Recovery

Monitor nsx-config logs:

k logs -f nsx-config-0-0 -n nsxi-platform

Expected healthy messages:

Received FullSyncEnd message
FULL_SYNC END in nsx-config

for:

nsx2pace-config-group1
nsx2pace-config-group2

This confirms that the synchronization workflow completed successfully.
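The nsx-config log is verbose, so filtering for the two completion messages can make the check quicker. This sketch assumes the messages appear in the log exactly as quoted above; the function name full_sync_end_lines is hypothetical.

```shell
# Keep only the log lines that report a completed Full Sync.
# Reads nsx-config log output on stdin.
full_sync_end_lines() {
  grep -E 'Received FullSyncEnd message|FULL_SYNC END in nsx-config'
}

# Example use on the SSP-Installer (assumes k is its alias for kubectl):
#   kubectl logs nsx-config-0-0 -n nsxi-platform | full_sync_end_lines
```

No output means that neither message has been logged yet.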