The following symptoms were observed in the SSP cluster:
nsx-config pods entered CrashLoopBackOff state.
k describe pod output for the nsx-config pods showed:
Exit Code: 137
Reason: Error
Readiness probe failed: Get "http://<pod-ip>:8080/actuator/health/readiness": context deadline exceeded
Liveness probe failed: Get "http://<pod-ip>:8080/actuator/health/liveness": context deadline exceeded
nsx-config pod logs repeatedly showed Full Sync checks that never completed successfully:
2026-05-20T12:20:12,459 INFO [00000000-0000-0000-0000-be63ac903dbc] c.v.n.p.n.c.FullSyncCompletionChecker$1: INTELLIGENCE [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Full Sync check Task performed on: Wed May 20 12:20:12 GMT 2026nfor site: 00000000-0000-0000-0000-be63ac903dbc
2026-05-20T12:20:12,505 INFO [00000000-0000-0000-0000-be63ac903dbc] c.v.n.p.n.c.FullSyncCompletionChecker$1: INTELLIGENCE [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Full Sync check Task performed on: Wed May 20 12:20:12 GMT 2026nfor site: 00000000-0000-0000-0000-be63ac903dbc
2026-05-20T12:20:12,519 INFO [00000000-0000-0000-0000-be63ac903dbc] c.v.n.p.n.c.FullSyncCompletionChecker$1: INTELLIGENCE [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Full Sync check Task performed on: Wed May 20 12:20:12 GMT 2026nfor site: 00000000-0000-0000-0000-be63ac903dbc
2026-05-20T12:20:12,521 INFO [00000000-0000-0000-0000-be63ac903dbc] c.v.n.p.n.c.FullSyncCompletionChecker$1: INTELLIGENCE [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Full Sync check Task performed on: Wed May 20 12:20:12 GMT 2026nfor site: 00000000-0000-0000-0000-be63ac903dbc
2026-05-20T12:20:12,533 INFO [00000000-0000-0000-0000-be63ac903dbc] c.v.n.p.n.c.FullSyncCompletionChecker$1: INTELLIGENCE [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Full Sync check Task performed on: Wed May 20 12:20:12 GMT 2026nfor site: 00000000-0000-0000-0000-be63ac903dbc
2026-05-20T12:20:12,540 INFO [00000000-0000-0000-0000-be63ac903dbc] c.v.n.p.n.c.FullSyncCompletionChecker$1: INTELLIGENCE [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Full Sync check Task performed on: Wed May 20 12:20:12 GMT 2026nfor site: 00000000-0000-0000-0000-be63ac903dbc
2026-05-20T12:20:12,576 INFO [00000000-0000-0000-0000-be63ac903dbc] c.v.n.p.n.c.FullSyncCompletionChecker$1: INTELLIGENCE [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Full Sync check Task performed on: Wed May 20 12:20:12 GMT 2026nfor site: 00000000-0000-0000-0000-be63ac903dbc
2026-05-20T12:20:12,595 INFO [00000000-0000-0000-0000-be63ac903dbc] c.v.n.p.n.c.FullSyncCompletionChecker$1: INTELLIGENCE [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Full Sync check Task performed on: Wed May 20 12:20:12 GMT 2026nfor site: 00000000-0000-0000-0000-be63ac903dbc
SSP 5.x
A prolonged SSP connectivity disruption can lead to synchronization instability between NSX and SSP.
If the issue persists for an extended duration (for example, during upgrades or prolonged SSP unavailability), the following conditions may occur:
NSX ↔ SSP synchronization remains disrupted.
Full synchronization workflows are unable to complete successfully.
SSH to the SSP-Installer as sysadmin/root and run the following commands to debug the issue:
(1) Log in to the cluster-api pod
(a) k get pods -A | grep cluster-api
Note the cluster-api pod name; it replaces cluster-api_podname in the next command.
(b) k exec -it cluster-api_podname -c cluster-api -n nsxi-platform -- /bin/bash
(2) Check Kafka consumer lag/groups
/opt/kafka/bin/kafka-consumer-groups.sh \
--bootstrap-server kafka:9092 \
--offsets \
--describe \
--all-groups \
--command-config /root/adminclient.props
Focus on consumer group:
intelligence-nsx-config-update
and topics:
nsx2pace-config-group1 nsx2pace-config-group2
Example problematic output:
| GROUP | TOPIC | CURRENT-OFFSET | LOG-END-OFFSET | LAG |
|---|---|---|---|---|
| intelligence-nsx-config-update | nsx2pace-config-group1 | 100 | 250000 | 249900 |
| intelligence-nsx-config-update | nsx2pace-config-group2 | 200 | 300000 | 299800 |
Large LAG values indicate that the consumer is heavily behind and still attempting to process old backlog events.
Proceed to step (3) if you see large LAG values.
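With --all-groups the output can be long, so a small filter helps pick out the rows that matter. The following is a minimal sketch, not part of the product tooling: high_lag is a hypothetical helper name, the default threshold of 1000 is arbitrary, and it assumes awk is available in the shell where it runs and that each header row starts with GROUP and contains TOPIC and LAG columns.

```shell
# Hypothetical helper: print group, topic, and lag for rows whose LAG
# exceeds a threshold (default 1000, an arbitrary example value).
# Column positions are read from the header row, so extra columns
# such as PARTITION do not affect the result.
high_lag() {
  awk -v min="${1:-1000}" '
    # Header row: remember the position of every column name.
    $1 == "GROUP" { for (i = 1; i <= NF; i++) col[$i] = i; next }
    # Data row: report it when LAG is numeric and above the threshold.
    ("LAG" in col) {
      lag = $(col["LAG"])
      if (lag ~ /^[0-9]+$/ && lag + 0 > min + 0)
        print $(col["GROUP"]), $(col["TOPIC"]), lag
    }
  '
}
```

Usage: append `| high_lag 1000` to the kafka-consumer-groups.sh command from step (2). Rows where LAG is shown as "-" are skipped.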
(3) Reset Kafka consumer offsets to latest
To skip the old/stale synchronization backlog and move the intelligence-nsx-config-update consumer group to the latest available messages on both topics:
/opt/kafka/bin/kafka-consumer-groups.sh \
--bootstrap-server kafka:9092 \
--command-config /root/adminclient.props \
--group intelligence-nsx-config-update \
--topic nsx2pace-config-group1 \
--reset-offsets \
--to-latest \
--execute
/opt/kafka/bin/kafka-consumer-groups.sh \
--bootstrap-server kafka:9092 \
--command-config /root/adminclient.props \
--group intelligence-nsx-config-update \
--topic nsx2pace-config-group2 \
--reset-offsets \
--to-latest \
--execute
This operation skips all old unread Kafka messages for the specified consumer group and starts consumption from the newest available events.
Example:
Before reset:
| Topic | Current Offset | Latest Offset | Lag |
|---|---|---|---|
| nsx2pace-config-group1 | 100 | 250000 | 249900 |
After reset:
| Topic | Current Offset | Latest Offset | Lag |
|---|---|---|---|
| nsx2pace-config-group1 | 250000 | 250000 | 0 |
This is equivalent to marking all old backlog messages as read.
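Because the reset discards the backlog, it can be worth previewing it first. kafka-consumer-groups.sh accepts --dry-run in place of --execute, which prints the offsets that would be set without changing anything. The sketch below wraps the two reset commands in one function; reset_offsets and the KAFKA_CG override variable are hypothetical names introduced here for illustration, and every other value is taken from the commands in step (3).

```shell
# Hypothetical helper: run the reset for both topics in one mode.
# Pass --dry-run to preview the new offsets, or --execute to apply them.
# KAFKA_CG is an illustrative override for the tool path.
reset_offsets() {
  mode="$1"
  case "$mode" in
    --dry-run|--execute) ;;
    *) echo "usage: reset_offsets --dry-run|--execute" >&2; return 1 ;;
  esac
  for topic in nsx2pace-config-group1 nsx2pace-config-group2; do
    "${KAFKA_CG:-/opt/kafka/bin/kafka-consumer-groups.sh}" \
      --bootstrap-server kafka:9092 \
      --command-config /root/adminclient.props \
      --group intelligence-nsx-config-update \
      --topic "$topic" \
      --reset-offsets \
      --to-latest \
      "$mode" || return 1
  done
}
```

Usage: run `reset_offsets --dry-run`, review the reported offsets, then run `reset_offsets --execute`. Note that Kafka applies an offset reset only while the consumer group has no active members.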
(4) Verify Stale Epoch Records in Postgres
Connect to the Postgres database and check unpublished/stale epochs:
(a) export KUBECONFIG=/config/clusterctl/1/workload.kubeconfig
(b) alias pg='kubectl exec -it postgresql-ha-postgresql-0 -n nsxi-platform -- /bin/bash -c "PGPASSWORD=$(kubectl get secret postgresql-password -o jsonpath={.data.postgresql-password} -n nsxi-platform | base64 -d) psql -d pace"'
(c) Execute the below query
select count(*) from nsx_config.policynsxconfigepoch
where published=false;
Large counts indicate stale synchronization state accumulation.
Example:
24851
(d) Identify Latest Healthy Epochs
Check the latest epochs:
select * from nsx_config.policynsxconfigepoch
order by epoch desc limit 5;
Example output:
| epoch | fullsynccomplete | published |
|---|---|---|
| 30089 | t | t |
| 30088 | t | t |
| 30087 | f | f |
| 30086 | f | f |
Interpretation:
Latest healthy/completed epochs:
30088
30089
Older incomplete/stale epochs:
30087 and below
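The same reading of the table can be automated. The sketch below is one interpretation of it, not an official procedure: healthy_cutoff is a hypothetical helper that takes rows ordered newest first and prints the lowest epoch in the unbroken run of healthy rows at the top. It assumes the input is in psql unaligned, tuples-only form (psql -A -t), one epoch|fullsynccomplete|published row per line, produced by selecting those three columns explicitly.

```shell
# Hypothetical helper: read rows "epoch|fullsynccomplete|published",
# newest first, and print the lowest epoch in the unbroken run of
# healthy rows (both flags "t") at the top of the list.
# Prints nothing when the newest row is not healthy.
healthy_cutoff() {
  awk -F'|' '
    $2 == "t" && $3 == "t" { cutoff = $1; next }  # still in the healthy run
    { exit }                                       # first unhealthy row ends the run
    END { if (cutoff != "") print cutoff }
  '
}
```

Usage, with the example rows above: `printf '30089|t|t\n30088|t|t\n30087|f|f\n30086|f|f\n' | healthy_cutoff` prints 30088, the lowest of the latest healthy epochs. Empty output means the newest epoch is itself incomplete, in which case no cutoff should be assumed.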
(e) Delete Older Stale Epoch Records
Delete stale epochs older than the latest known healthy epoch.
Example:
delete from nsx_config.policynsxconfigepoch
where epoch < 30088;
Explanation:
If epochs 30088 and 30089 are healthy (published=true and fullsynccomplete=true), then older epochs are considered stale historical backlog and can be removed.
Expected result example (the number after DELETE is the count of rows removed):
DELETE 30089
This indicates that the old stale synchronization records were cleaned up successfully.
(5) Validate Full Sync Recovery
Monitor nsx-config logs:
k logs -f nsx-config-0-0 -n nsxi-platform
Expected healthy messages:
Received FullSyncEnd message
FULL_SYNC END in nsx-config
for:
nsx2pace-config-group1
nsx2pace-config-group2
This confirms that the synchronization workflow completed successfully.
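When the log is busy, the completion messages are easier to spot with a filter. This is a minimal sketch: full_sync_end_lines is a hypothetical helper name, and it matches only the two message strings quoted above, without assuming anything else about the log line format.

```shell
# Hypothetical helper: keep only the full-sync completion lines
# from log text supplied on stdin.
full_sync_end_lines() {
  grep -E 'Received FullSyncEnd message|FULL_SYNC END in nsx-config'
}
```

Usage: `k logs nsx-config-0-0 -n nsxi-platform --since=30m | full_sync_end_lines`, then confirm that the matching lines cover both nsx2pace-config-group1 and nsx2pace-config-group2. No output means no completion message was logged in that window.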