Reconnecting NSX manager post restore failure, Infrastructure sync is DOWN

Products

VMware vDefend Firewall
VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

Problem:

After refreshing the Platform CA certificate and Ingress certificate, a backup of the SSP was taken.

Before removing the SSP, the Platform CA was refreshed again, generating a new internal certificate chain.

The SSP was then force-deleted without NSX cleanup and redeployed as a new instance.

When attempting to restore the backup (taken prior to the second CA refresh), the restore process failed during the NSX reconnect phase.

When NSX Manager was then reconnected manually, the Infrastructure Sync status remained DOWN and the site status reported an error.

Symptom: Infrastructure sync is DOWN in the SSP UI.

From the SSP-I CLI, run:

k -n nsxi-platform get sites -oyaml

And look for the error:

COMMON_FULLSYNC failed due to: java.lang.Exception: produceCertMsgs
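The check above can be scripted so the error string is not missed in long YAML output. This is a minimal sketch: the function name has_fullsync_cert_error is a hypothetical helper, not part of SSP, and the usage line assumes the same "k" alias shown in the command above.

```shell
# Hypothetical helper (not an SSP command): reads site status YAML on stdin
# and returns 0 only if the produceCertMsgs full-sync error is present.
has_fullsync_cert_error() {
  # -F treats the pattern as a literal string, -q suppresses output.
  grep -qF 'COMMON_FULLSYNC failed due to: java.lang.Exception: produceCertMsgs'
}

# Usage on the SSP-I CLI (assumes the "k" alias used in this article):
#   k -n nsxi-platform get sites -oyaml | has_fullsync_cert_error \
#     && echo "produceCertMsgs error present in site status"
```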

Environment

Security Services Platform (SSP) 5.0 and 5.1 with onboarded NSX Manager versions 4.2.0, 4.2.1, 4.2.2, 4.2.3, or 9.0

Cause

The Common Agent status API reports that Full Sync has failed even though the synchronization actually completed successfully.

Data exchange is not functionally affected, but the reported status remains incorrect.

Stale background threads remain active after the certificate refresh operations.

These threads continue to report outdated Common Agent status to the status API, which produces the misleading “COMMON_FULLSYNC failed due to: java.lang.Exception: produceCertMsgs” message.

Resolution

Because this issue occurs after a Platform CA certificate and Ingress certificate refresh, verify the Kafka server and client certificates before proceeding with the workaround.

Step 1: Identify the Messaging FQDN

Run the following command on the SSP-I CLI to retrieve the messaging configuration:

k -n nsxi-platform get infra -o yaml

Example Output:

formFactor: Advanced
helmRepo: oci://<repo-path>
ingressFQDN: <ingress-fqdn.example.com>
messagingFQDN: <messaging-fqdn.example.com>
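To avoid copying the value by hand, the messaging FQDN can be pulled out of that YAML with a small filter. This is only a sketch: extract_messaging_fqdn is a hypothetical helper name, and it assumes the key appears as an unquoted "messagingFQDN: value" pair on a single line, as in the example output.

```shell
# Hypothetical helper: print the value of the first "messagingFQDN:" key
# found in YAML text on stdin. Assumes an unquoted "key: value" on one line.
extract_messaging_fqdn() {
  awk '$1 == "messagingFQDN:" { print $2; exit }'
}

# Usage on the SSP-I CLI (assumes the "k" alias used in this article):
#   fqdn=$(k -n nsxi-platform get infra -o yaml | extract_messaging_fqdn)
#   echo "Messaging FQDN: $fqdn"
```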

Ensure that the messaging FQDN (for example, messaging-fqdn.example.com) is reachable from the NSX Manager.

From the NSX Manager CLI, test connectivity to port 9092:

nc -vz <messaging-fqdn.example.com> 9092

Expected output:

Connection to <messaging-fqdn.example.com> (<resolved-ip-address>) 9092 port [tcp/*] succeeded!

Step 2: Verify Kafka Server Certificate Fingerprint

Run the command below on the NSX Manager:

openssl s_client -showcerts -connect <messaging-fqdn.example.com>:9092 < /dev/null 2>/dev/null | openssl x509 -fingerprint -sha256 -noout

Example (masked) output:

SHA256 Fingerprint=XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX

Step 3: Verify Kafka Client Truststore

List the certificates in the client truststore and verify that the Kafka certificate fingerprint matches the one from Step 2:

keytool -list -keystore /home/secureall/secureall/.store/.client_truststore -storepass $(cat /config/http/.http_cert_pw)
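Comparing 64 hex digits by eye is error-prone, so the match between Step 2 and Step 3 can be checked mechanically. The sketch below uses two hypothetical helpers (extract_sha256_fps and truststore_has_fp, neither part of NSX or SSP). It assumes that keytool prints SHA-256 fingerprints as 32 colon-separated hex bytes, which depends on the Java version in use; if keytool prints only SHA-1 fingerprints, no match will be found.

```shell
# Hypothetical helper: print every SHA-256 fingerprint (32 colon-separated
# hex bytes) found on stdin, uppercased, one per line.
extract_sha256_fps() {
  grep -oE '([0-9A-Fa-f]{2}:){31}[0-9A-Fa-f]{2}' | tr 'a-f' 'A-F'
}

# Hypothetical helper: reads "keytool -list" output on stdin and returns 0
# only if the fingerprint passed as $1 appears among the listed entries.
truststore_has_fp() {
  extract_sha256_fps | grep -qxF "$1"
}

# Usage on the NSX Manager, reusing the commands from Steps 2 and 3:
#   server_fp=$(openssl s_client -showcerts -connect <messaging-fqdn.example.com>:9092 \
#       < /dev/null 2>/dev/null | openssl x509 -fingerprint -sha256 -noout | extract_sha256_fps)
#   keytool -list -keystore /home/secureall/secureall/.store/.client_truststore \
#       -storepass $(cat /config/http/.http_cert_pw) | truststore_has_fp "$server_fp" \
#       && echo "Kafka server certificate is trusted"
```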

The next step is to ensure that the certificates for napp-common-agent and napp-pace-agent are present and valid in the Kafka truststore on SSP.

Step 4: Validate Certificates Between NSX Manager and SSP

On NSX Manager:

Navigate to System → Certificates.
Check for certificates where “Issued By” and “Issued To” are napp-common-agent and napp-pace-agent.
Ensure the “Used By” field is non-zero.
Copy the UUID shown in the certificate details.

On SSP:

Navigate to System → Certificates.
Verify that the corresponding certificates show matching UUIDs for both napp-common-agent and napp-pace-agent in the format:

NSX_UA_KAFKA_CLIENT_<UUID_FROM_NSX_MANAGER>
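The naming rule above can be expressed as a small check when the values are copied out of both UIs. This is an illustrative sketch only: expected_ssp_cert_name and ssp_cert_matches are hypothetical helper names, and the UUID and certificate name are assumed to be pasted in by hand from the two Certificates pages.

```shell
# Hypothetical helper: build the certificate name expected on SSP from a
# certificate UUID copied from NSX Manager.
expected_ssp_cert_name() {
  printf 'NSX_UA_KAFKA_CLIENT_%s\n' "$1"
}

# Hypothetical helper: succeed only if the name seen on SSP ($2) matches
# the name expected for the NSX Manager certificate UUID ($1).
ssp_cert_matches() {
  [ "$(expected_ssp_cert_name "$1")" = "$2" ]
}

# Usage (values pasted from the NSX Manager and SSP UIs):
#   ssp_cert_matches "<UUID_FROM_NSX_MANAGER>" "<name-shown-on-SSP>" \
#     && echo "certificate names match"
```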

Once Kafka certificates are confirmed to match, proceed with the workaround below.

Workaround Steps

1. Identify the NSX Manager leader for the Common Agent Service:

su admin -c "get cluster status verbose" | grep COMMON_AGENT_SERVICE

2. Get the Manager IP using the UUID from step 1:

su admin -c "get cluster status" | grep <uuid-of-manager-from-step-1>

3. SSH into the identified NSX Manager node:

ssh root@<nsx-manager-ip>

4. Restart the NSX Manager service:

systemctl restart proton
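The leader lookup in the first two workaround commands can be chained by extracting the UUID from the filtered line. The sketch below is hedged: first_uuid is a hypothetical helper, and it assumes that the leader UUID is the first UUID-shaped token on the COMMON_AGENT_SERVICE line, which should be confirmed against the actual output before relying on it.

```shell
# Hypothetical helper: print the first UUID-shaped token found on stdin.
# Assumption: on the COMMON_AGENT_SERVICE line, that token is the leader UUID.
first_uuid() {
  grep -oiE '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' | head -n 1
}

# Usage on an NSX Manager node, reusing the commands from the workaround:
#   leader=$(su admin -c "get cluster status verbose" | grep COMMON_AGENT_SERVICE | first_uuid)
#   su admin -c "get cluster status" | grep "$leader"
```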

Additional Information

This issue typically occurs after certificate replacement (Platform CA or Ingress certificate) when stale threads persist in the agent service.
Restarting the NSX Manager service (proton) on the leader node for the Common Agent Service refreshes the internal thread state and corrects the reported status.
Verify the Kafka certificates before restarting to avoid messaging handshake errors.