NSX-Config Pods Enter CrashLoopBackOff Due to Prolonged SSP Connectivity Disruption Causing Full Sync Backlog

Products

VMware vDefend Firewall with Advanced Threat Prevention
VMware vDefend Firewall

Issue/Introduction

The following symptoms were observed in the SSP cluster:

The nsx-config pods entered the CrashLoopBackOff state.
A high restart count was observed for the nsx-config pods.
The output of k describe pod for the affected pods showed the following.

Exit code observed:

Exit Code: 137
Reason: Error

Events observed:

Readiness probe failed: Get "http://<pod-ip>:8080/actuator/health/readiness": context deadline exceeded 
Liveness probe failed: Get "http://<pod-ip>:8080/actuator/health/liveness": context deadline exceeded
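
The symptom check above can be scripted. The following is a minimal sketch, not a product tool: the function name crashloop_pods is hypothetical, and it assumes the default plain-text column order of a pod listing (NAME, READY, STATUS, RESTARTS, AGE). It reads the listing on standard input and prints the name and restart count of every pod in CrashLoopBackOff.

```shell
# Hypothetical helper: list pods in CrashLoopBackOff with their restart count.
# Assumes the default "get pods" columns: NAME READY STATUS RESTARTS AGE.
crashloop_pods() {
  # Skip the header row, keep rows whose STATUS column is CrashLoopBackOff.
  awk 'NR > 1 && $3 == "CrashLoopBackOff" { print $1, $4 }'
}

# Usage on the SSP-Installer (not executed here):
#   k get pods -n nsxi-platform | crashloop_pods
```

A pod that appears in this output with a steadily growing restart count matches the symptom described in this article.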

The nsx-config pod logs repeatedly showed Full Sync checks without a successful completion:

2026-05-20T12:20:12,459 INFO  [00000000-0000-0000-0000-be63ac903dbc] c.v.n.p.n.c.FullSyncCompletionChecker$1: INTELLIGENCE [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Full Sync check Task performed on: Wed May 20 12:20:12 GMT 2026nfor site: 00000000-0000-0000-0000-be63ac903dbc

2026-05-20T12:20:12,505 INFO  [00000000-0000-0000-0000-be63ac903dbc] c.v.n.p.n.c.FullSyncCompletionChecker$1: INTELLIGENCE [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Full Sync check Task performed on: Wed May 20 12:20:12 GMT 2026nfor site: 00000000-0000-0000-0000-be63ac903dbc

2026-05-20T12:20:12,519 INFO  [00000000-0000-0000-0000-be63ac903dbc] c.v.n.p.n.c.FullSyncCompletionChecker$1: INTELLIGENCE [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Full Sync check Task performed on: Wed May 20 12:20:12 GMT 2026nfor site: 00000000-0000-0000-0000-be63ac903dbc

2026-05-20T12:20:12,521 INFO  [00000000-0000-0000-0000-be63ac903dbc] c.v.n.p.n.c.FullSyncCompletionChecker$1: INTELLIGENCE [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Full Sync check Task performed on: Wed May 20 12:20:12 GMT 2026nfor site: 00000000-0000-0000-0000-be63ac903dbc

2026-05-20T12:20:12,533 INFO  [00000000-0000-0000-0000-be63ac903dbc] c.v.n.p.n.c.FullSyncCompletionChecker$1: INTELLIGENCE [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Full Sync check Task performed on: Wed May 20 12:20:12 GMT 2026nfor site: 00000000-0000-0000-0000-be63ac903dbc

2026-05-20T12:20:12,540 INFO  [00000000-0000-0000-0000-be63ac903dbc] c.v.n.p.n.c.FullSyncCompletionChecker$1: INTELLIGENCE [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Full Sync check Task performed on: Wed May 20 12:20:12 GMT 2026nfor site: 00000000-0000-0000-0000-be63ac903dbc

2026-05-20T12:20:12,576 INFO  [00000000-0000-0000-0000-be63ac903dbc] c.v.n.p.n.c.FullSyncCompletionChecker$1: INTELLIGENCE [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Full Sync check Task performed on: Wed May 20 12:20:12 GMT 2026nfor site: 00000000-0000-0000-0000-be63ac903dbc

2026-05-20T12:20:12,595 INFO  [00000000-0000-0000-0000-be63ac903dbc] c.v.n.p.n.c.FullSyncCompletionChecker$1: INTELLIGENCE [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Full Sync check Task performed on: Wed May 20 12:20:12 GMT 2026nfor site: 00000000-0000-0000-0000-be63ac903dbc

Environment

SSP 5.x

Cause

A prolonged SSP connectivity disruption can lead to synchronization instability between NSX and SSP.

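If the disruption persists for an extended duration (for example, during upgrades or prolonged SSP unavailability), the following conditions may occur:

NSX ↔ SSP synchronization remains disrupted.

Full synchronization workflows are unable to complete successfully.

A large backlog of unprocessed synchronization messages builds up for the nsx-config consumer. While the pods work through this backlog, they can fail to answer the readiness and liveness probes in time and are then restarted, which results in the CrashLoopBackOff state.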

Resolution

SSH to the SSP-Installer as sysadmin/root and run the commands below to debug the issue:

(1) Log in to the cluster-api pod. The Kafka commands in steps (2) and (3) are run from inside this pod.

(a) k get pods -A | grep cluster-api

Note down the cluster-api pod name.

(b) k exec -it <cluster-api_podname> -c cluster-api -n nsxi-platform -- /bin/bash
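
Steps (a) and (b) can be combined so that the pod name does not have to be copied by hand. This is a minimal sketch under stated assumptions: the function name cluster_api_pod is hypothetical, and it assumes the all-namespaces listing prints NAMESPACE in the first column and NAME in the second.

```shell
# Hypothetical helper: print the first pod whose name starts with "cluster-api".
# Assumes "get pods -A" columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
cluster_api_pod() {
  awk '$2 ~ /^cluster-api/ { print $2; exit }'
}

# Usage on the SSP-Installer (not executed here):
#   pod=$(k get pods -A | cluster_api_pod)
#   k exec -it "$pod" -c cluster-api -n nsxi-platform -- /bin/bash
```

If more than one pod matches, only the first is returned, so it is worth checking the value of pod before running the exec command.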

(2) Check Kafka consumer lag/groups

/opt/kafka/bin/kafka-consumer-groups.sh \
--bootstrap-server kafka:9092 \
--offsets \
--describe \
--all-groups \
--command-config /root/adminclient.props

Focus on consumer group:

intelligence-nsx-config-update

and topics:

nsx2pace-config-group1
nsx2pace-config-group2

Example problematic output:

GROUP	TOPIC	CURRENT-OFFSET	LOG-END-OFFSET	LAG
intelligence-nsx-config-update	nsx2pace-config-group1	100	250000	249900
intelligence-nsx-config-update	nsx2pace-config-group2	200	300000	299800

Large LAG values indicate that the consumer is far behind and is still attempting to process old backlog events.

Proceed to step (3) if you see large LAG values.
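
Because the describe output covers every consumer group, it can help to filter it down to the group of interest. The sketch below is illustrative only: the function name lag_report and the threshold value are assumptions, not product defaults. It finds the TOPIC and LAG columns from the header row, sums the lag per topic for one group, and returns a non-zero exit status if any topic exceeds the threshold.

```shell
# Hypothetical helper: print "topic total_lag" for one consumer group and
# exit non-zero if any topic exceeds the threshold.
# Columns are located from the header row, so extra columns such as
# PARTITION do not matter. The threshold is an illustrative assumption.
lag_report() {
  group="$1"
  threshold="$2"
  awk -v group="$group" -v threshold="$threshold" '
    $1 == "GROUP" {                       # header row: locate TOPIC and LAG
      for (i = 1; i <= NF; i++) {
        if ($i == "TOPIC") t = i
        if ($i == "LAG")   l = i
      }
      next
    }
    t && l && $1 == group && $l ~ /^[0-9]+$/ {
      if (!($t in lag)) order[++n] = $t   # remember first-seen topic order
      lag[$t] += $l
    }
    END {
      bad = 0
      for (i = 1; i <= n; i++) {
        print order[i], lag[order[i]]
        if (lag[order[i]] > threshold + 0) bad = 1
      }
      exit bad
    }
  '
}

# Usage inside the cluster-api pod (not executed here):
#   /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
#     --describe --all-groups --command-config /root/adminclient.props \
#     | lag_report intelligence-nsx-config-update 10000
```

Rows whose LAG value is not numeric (for example a "-" placeholder) are ignored rather than counted as zero.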

(3) Reset Kafka consumer offsets to latest

To skip the old/stale synchronization backlog and move the intelligence-nsx-config-update consumer group to the latest available messages on both topics:

/opt/kafka/bin/kafka-consumer-groups.sh \
--bootstrap-server kafka:9092 \
--command-config /root/adminclient.props \
--group intelligence-nsx-config-update \
--topic nsx2pace-config-group1 \
--reset-offsets \
--to-latest \
--execute

/opt/kafka/bin/kafka-consumer-groups.sh \
--bootstrap-server kafka:9092 \
--command-config /root/adminclient.props \
--group intelligence-nsx-config-update \
--topic nsx2pace-config-group2 \
--reset-offsets \
--to-latest \
--execute

This operation skips all old unread Kafka messages for the specified consumer group and starts consumption from the newest available events.

Example:

Before reset:

Topic	Current Offset	Latest Offset	Lag
nsx2pace-config-group1	100	250000	249900

After reset:

Topic	Current Offset	Latest Offset	Lag
nsx2pace-config-group1	250000	250000	0

This is equivalent to marking all old backlog messages as read.
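
The two reset commands differ only in the topic name, so they can be wrapped in a loop. The sketch below is a hedged convenience wrapper, not part of the product: the function name reset_nsx_config_offsets and the KAFKA_CG variable are hypothetical. It also accepts --dry-run, a standard option of kafka-consumer-groups.sh that prints the proposed offsets without changing them, so the result can be reviewed before --execute is used.

```shell
# Path and connection settings are taken from the commands above.
KAFKA_CG="${KAFKA_CG:-/opt/kafka/bin/kafka-consumer-groups.sh}"

# Hypothetical wrapper: reset both topics for the nsx-config consumer group.
# $1 is either --dry-run (preview only) or --execute (apply the reset).
reset_nsx_config_offsets() {
  mode="$1"
  case "$mode" in
    --dry-run|--execute) ;;
    *) echo "usage: reset_nsx_config_offsets --dry-run|--execute" >&2
       return 2 ;;
  esac
  for topic in nsx2pace-config-group1 nsx2pace-config-group2; do
    "$KAFKA_CG" \
      --bootstrap-server kafka:9092 \
      --command-config /root/adminclient.props \
      --group intelligence-nsx-config-update \
      --topic "$topic" \
      --reset-offsets \
      --to-latest \
      "$mode" || return 1
  done
}

# Usage inside the cluster-api pod (not executed here):
#   reset_nsx_config_offsets --dry-run   # review the proposed offsets
#   reset_nsx_config_offsets --execute   # apply them
```

Note that the tool reports an error instead of resetting if the consumer group still has active members.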

(4) Verify Stale Epoch Records in Postgres

Exit the cluster-api pod to return to the SSP-Installer shell, then connect to the Postgres database and check for unpublished/stale epochs:

(a) export KUBECONFIG=/config/clusterctl/1/workload.kubeconfig

(b) alias pg='kubectl exec -it postgresql-ha-postgresql-0 -n nsxi-platform -- /bin/bash -c "PGPASSWORD=$(kubectl get secret postgresql-password -o jsonpath={.data.postgresql-password} -n nsxi-platform | base64 -d) psql -d pace"'

(c) Run pg to open a psql session on the pace database, then execute the below query:

select count(*) from nsx_config.policynsxconfigepoch
where published=false;

A large count indicates that stale synchronization state has accumulated.

Example:
24851

(d) Identify Latest Healthy Epochs

Check the latest epochs:

select * from nsx_config.policynsxconfigepoch
order by epoch desc limit 5;

Example output:

epoch	fullsynccomplete	published
30089	t	t
30088	t	t
30087	f	f
30086	f	f

Interpretation:

Latest healthy/completed epochs:
30088
30089

Older incomplete/stale epochs:
30087 and below
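
The interpretation above amounts to finding the lowest epoch in the unbroken run of healthy rows at the top of the list. The sketch below shows that rule under stated assumptions: the function name healthy_cutoff is hypothetical, and the input is expected to be the three columns epoch, fullsynccomplete and published in psql's unaligned, tuples-only format (one row per line, fields separated by "|", newest epoch first).

```shell
# Hypothetical helper: read "epoch|fullsynccomplete|published" rows, newest
# first, and print the lowest epoch of the leading run of healthy rows.
# Exits non-zero if the newest row is not healthy.
healthy_cutoff() {
  awk -F'|' '
    $2 == "t" && $3 == "t" { cutoff = $1; found = 1; next }
    { exit }                        # first unhealthy row ends the run
    END { if (found) print cutoff; else exit 1 }
  '
}

# Expected input (for example from psql started with -At and the query
#   select epoch, fullsynccomplete, published
#   from nsx_config.policynsxconfigepoch order by epoch desc limit 5;):
#   30089|t|t
#   30088|t|t
#   30087|f|f
```

For the example output shown above, the helper prints 30088, which is the value used in the delete statement of the next step.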

(e) Delete Older Stale Epoch Records

Delete the stale epochs that are older than the latest known healthy epochs.

Example:

delete from nsx_config.policynsxconfigepoch
where epoch < 30088;

Sample explanation:

If epochs 30088 and 30089 are healthy (published=true and fullsynccomplete=true), then older epochs are considered stale historical backlog and can be removed.

Expected result example:

DELETE 30089

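The number after DELETE is the count of rows removed, which varies by environment. A non-zero count indicates that the old stale synchronization records were cleaned up successfully.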
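
Because the delete is irreversible once committed, one cautious option is to run it inside an explicit transaction and review the row counts first. The sketch below only generates the SQL text and does not connect to the database. The function name stale_epoch_sql is hypothetical, and the transaction wrapper is a suggestion rather than part of the documented procedure.

```shell
# Hypothetical helper: print SQL that counts, then deletes, the rows below
# the given cutoff epoch inside an open transaction.
# Rejects any cutoff that is not a non-negative integer.
stale_epoch_sql() {
  cutoff="$1"
  case "$cutoff" in
    ''|*[!0-9]*) echo "cutoff must be a non-negative integer" >&2
                 return 1 ;;
  esac
  cat <<EOF
begin;
select count(*) from nsx_config.policynsxconfigepoch where epoch < $cutoff;
delete from nsx_config.policynsxconfigepoch where epoch < $cutoff;
-- review the counts above, then run: commit;  (or rollback;)
EOF
}

# Usage (not executed here): print the statements, then paste them into the
# psql session opened with pg.
#   stale_epoch_sql 30088
```

Leaving the final commit to the operator means an unexpected row count can still be undone with rollback.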

(5) Validate Full Sync Recovery

Monitor nsx-config logs:

k logs -f nsx-config-0-0 -n nsxi-platform

Expected healthy messages:

Received FullSyncEnd message
FULL_SYNC END in nsx-config

for:

nsx2pace-config-group1
nsx2pace-config-group2

This confirms that the synchronization workflow completed successfully.
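
Instead of watching the log stream by eye, the check and completion messages can be counted. This is a minimal sketch: the function name full_sync_summary is hypothetical, and it assumes the message text appears in the log exactly as quoted in this article. It does not attribute completion messages to individual topics.

```shell
# Hypothetical helper: count Full Sync check lines and completion lines in
# nsx-config log text read from standard input.
# Exits 0 if at least one completion message was seen, 1 otherwise.
full_sync_summary() {
  awk '
    /Full Sync check Task performed/                           { checks++ }
    /Received FullSyncEnd message|FULL_SYNC END in nsx-config/ { ends++ }
    END {
      printf "checks=%d ends=%d\n", checks, ends
      exit (ends > 0 ? 0 : 1)
    }
  '
}

# Usage on the SSP-Installer (not executed here):
#   k logs --since=10m nsx-config-0-0 -n nsxi-platform | full_sync_summary
```

A result with many checks and zero ends corresponds to the failing state shown in the Issue/Introduction section.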