Duplicate NSX Manager node FQDNs stamped in the vCenter Extension MOB cause NCP pods to go down

Products

VMware NSX

Issue/Introduction

Tanzu deployment running NCP 4.x.
NCP pods are constantly restarting:
# kubectl get pods -A | grep ncp
vmware-system-nsx nsx-ncp-<id> 1/2 CrashLoopBackOff 6 (42s ago) 7m22s
vmware-system-nsx nsx-ncp-<id> 1/2 CrashLoopBackOff 6 (57s ago) 7m23s
NCP logs indicate SSL thumbprint mismatch:
# kubectl logs -n vmware-system-nsx nsx-ncp-<id> | tail -n100 | grep "Fingerprints did not match"
[...]
[ncp GreenThread-9 W] vmware_nsxlib.v3.cluster Failed to validate API cluster endpoint '[DOWN] https://<NSX-MGR>.domain:443' due to: HTTPSConnectionPool(host='<NSX-MGR>', port=443): Max retries exceeded with url: /api/v1/reverse-proxy/node/health (Caused by SSLError('Fingerprints did not match. Expected "ABC", got "XYZ".'))
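As a minimal sketch, the expected and received fingerprints can be pulled out of the NCP log for comparison. The helper name `extract_fingerprint_mismatches` is hypothetical, and the pattern assumes the log message keeps the `Expected "...", got "..."` wording shown above.

```shell
# Hypothetical helper: reads NCP log lines on stdin and prints each distinct
# expected/got fingerprint pair found in "Fingerprints did not match" errors.
# Assumes the message format shown in the log excerpt above.
extract_fingerprint_mismatches() {
  sed -n 's/.*Expected "\([^"]*\)", got "\([^"]*\)".*/expected=\1 got=\2/p' \
    | sort -u
}
```

For example: kubectl logs -n vmware-system-nsx nsx-ncp-<id> | extract_fingerprint_mismatches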
One of the NSX Manager nodes was, at some point, detached from and re-joined to the cluster.
NCP Config Map shows duplicate NSX node names:
# kubectl get configmaps nsx-ncp-config -n vmware-system-nsx -o yaml | grep -v "apiVersion" | grep -E "nsx_api_managers|thumbprint" -A1
= True\nncp_enforced_pool_member_limit = ACTIVATE\nnsx_api_managers = nsx-00.domain.tld:443,nsx-01.domain.tld:443,nsx-02.domain.tld:443,nsx-02.domain.tld:443\nthumbprint
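The duplicate entry can also be found mechanically. The following is a sketch only: the function name `find_duplicate_managers` is hypothetical, and it assumes the `nsx_api_managers` value is a comma-separated list terminated by an escaped `\n` or a real line break, as in the output above.

```shell
# Hypothetical helper: reads the ConfigMap text on stdin and prints every
# nsx_api_managers entry that appears more than once.
find_duplicate_managers() {
  grep -o 'nsx_api_managers = [^\\]*' \
    | sed 's/^nsx_api_managers = //' \
    | tr '[:upper:]' '[:lower:]' \
    | tr ',' '\n' \
    | sort \
    | uniq -d
}
```

For example: kubectl get configmaps nsx-ncp-config -n vmware-system-nsx -o yaml | find_duplicate_managers
With the output shown above, this would print nsx-02.domain.tld:443. Empty output means no duplicates were found.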
In addition, the vCenter Extension MOB shows duplicate URL/FQDN strings associated with two different 'serverThumbprint' values. To check:

1. Open the vCenter MOB at https://<vCenter Name/IP address>/mob
2. Authenticate with SSO administrator credentials.
3. Click "content".
4. Search for "Extension Manager" and click it.
5. Click "more" under the extension list to show all extensions.
6. Look for extensionList["com.vmware.nsx.management.nsxt"] and click it.
7. Click "server".
8. Look for the "serverThumbprint" and "url" fields.

The vCenter Extension MOB shows entries similar to the following:

(1) ExtensionServerInfo
NAME               TYPE         VALUE
adminEmail         string[]     "[email protected]"
company            string       "VMware"
description        Description
  label            string       "NSX Compute Manager Id"
  summary          string       "ABC"
serverCertificate  string       Unset
serverThumbprint   string       "ABC..."
type               string       ""
url                string       "https://NSX-02.domain:443"

(2) ExtensionServerInfo
NAME               TYPE         VALUE
adminEmail         string[]     "[email protected]"
company            string       "VMware"
description        Description
  label            string       "NSX Compute Manager Id"
  summary          string       "ABC"
serverCertificate  string       Unset
serverThumbprint   string       "XYZ..."
type               string       ""
url                string       "https://NSX-02.domain:443"
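If the ExtensionServerInfo entries are copied from the MOB page into a text file, a short script can flag any URL that is paired with more than one thumbprint. This is a sketch under assumptions: the function name `find_conflicting_thumbprints` is hypothetical, and it expects whitespace-separated NAME TYPE VALUE rows in which each entry lists serverThumbprint before url, as in the listing above.

```shell
# Hypothetical helper: reads pasted ExtensionServerInfo rows on stdin and
# prints "<url> <count>" for every URL seen with more than one distinct
# serverThumbprint value. URLs are compared case-insensitively.
find_conflicting_thumbprints() {
  awk '
    $1 == "serverThumbprint" { thumb = $3 }
    $1 == "url" {
      url = tolower($3)
      gsub(/"/, "", url)                # drop the surrounding quotes
      key = url SUBSEP thumb
      if (!(key in seen)) { seen[key] = 1; count[url]++ }
      thumb = ""                        # do not carry over to the next entry
    }
    END {
      for (u in count) if (count[u] > 1) print u, count[u]
    }
  '
}
```

For example, with the rows saved to a file (the file name is illustrative): find_conflicting_thumbprints < mob-entries.txt
Any line of output identifies a URL that vCenter holds with conflicting thumbprints.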

Environment

VMware NSX with NCP version 4.1.x and 4.2.x

Cause

NSX Manager nodes are configured with mixed certificate types, for example:
- NSX-01: CA signed certificate
- NSX-02: CA signed certificate
- NSX-03: Self-Signed Certificate
With mixed certificate types, NSX skips the DNS lookup for NSX-03 because a self-signed certificate does not require the DNS name/FQDN to be filled in (fqdnRequired='false'). For a CA-signed certificate, the DNS server must provide forward and reverse lookups of the Manager's IP address and hostname (fqdnRequired='true'). As a result, duplicate NSX Manager node FQDNs are stamped in the vCenter Extension.
By design, Tanzu pulls the thumbprints from vCenter rather than directly from the NSX Managers. NCP therefore sends API calls to the duplicated NSX FQDN pulled from the vCenter MOB, the thumbprint validation fails, and the NCP pods crash.

Resolution

This issue is resolved in VMware NSX 4.2.1.3 and 4.2.2 onwards, available at Broadcom downloads.

If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.

Workaround:

Apply a CA-signed certificate on all three NSX Manager nodes, or use a self-signed certificate on all nodes. Do not mix certificate types.
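To confirm that the nodes no longer mix certificate types, the subject and issuer of each node certificate can be compared. The sketch below is illustrative only: the function names and the "node|subject|issuer" input format are hypothetical, and treating "subject equals issuer" as self-signed is a common heuristic, not necessarily the check NSX itself performs.

```shell
# Hypothetical helper: classify a certificate from its subject and issuer.
# Heuristic (assumption): identical subject and issuer means self-signed.
cert_kind() {
  if [ "$1" = "$2" ]; then echo "self-signed"; else echo "ca-signed"; fi
}

# Hypothetical helper: reads "node|subject|issuer" lines on stdin and prints
# MIXED if more than one certificate kind is present, otherwise CONSISTENT.
check_cert_mix() {
  kinds=$(while IFS='|' read -r node subject issuer; do
    [ -n "$node" ] || continue          # skip blank lines
    cert_kind "$subject" "$issuer"
  done | sort -u | wc -l | tr -d ' ')
  if [ "$kinds" -gt 1 ]; then echo "MIXED"; else echo "CONSISTENT"; fi
}
```

The subject and issuer for each node can be read from an exported certificate with openssl x509 -in <cert.pem> -noout -subject -issuer and assembled into one line per node before being passed to check_cert_mix.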