Duplicate NSX manager node FQDNs are getting stamped in vCenter Extension MOB causing the NCP pods to go down

Article ID: 386410

Updated On:

Products

VMware NSX

Issue/Introduction

  • Tanzu deployment running NCP 4.x.
  • NCP pods are constantly restarting: 
    # kubectl get pods -A | grep ncp
    vmware-system-nsx                 nsx-ncp-<id>                       1/2     CrashLoopBackOff   6 (42s ago)       7m22s
    vmware-system-nsx                 nsx-ncp-<id>                       1/2     CrashLoopBackOff   6 (57s ago)       7m23s

  • NCP logs indicate an SSL thumbprint mismatch: 
    # kubectl logs -n vmware-system-nsx nsx-ncp-<id> | tail -n100 | grep "Fingerprints did not match" 
    [...] 
    [ncp GreenThread-9 W] vmware_nsxlib.v3.cluster Failed to validate API cluster endpoint '[DOWN] https://<NSX-MGR>.domain:443' due to: HTTPSConnectionPool(host='<NSX-MGR>', port=443): Max retries exceeded with url: /api/v1/reverse-proxy/node/health (Caused by SSLError('Fingerprints did not match. Expected "ABC", got "XYZ".'))

  • One of the NSX Manager nodes was, at some point, detached from the cluster and re-joined to it.

  • NCP Config Map shows duplicate NSX node names:
    # kubectl get configmaps nsx-ncp-config -n vmware-system-nsx -o yaml | grep -v "apiVersion" | grep -E "nsx_api_managers|thumbprint" -A1 
        = True\nncp_enforced_pool_member_limit = ACTIVATE\nnsx_api_managers = nsx-00.domain.tld:443,nsx-01.domain.tld:443,nsx-02.domain.tld:443,nsx-02.domain.tld:443\nthumbprint 

  • The vCenter Extension MOB also shows the same URL/FQDN string associated with two different 'serverThumbprint' values. To check:

1. Open the vCenter MOB at https://<vCenter Name/IP address>/mob
2. Authenticate with SSO administrator credentials.
3. Click "content".
4. Search for "Extension Manager" and click it.
5. Click "more" under the extension list to show all extensions.
6. Look for extensionList["com.vmware.nsx.management.nsxt"] and click it.
7. Click "server".
8. Look for the "serverThumbprint" and "url" strings.

  • The vCenter Extension MOB shows entries similar to the following:

(1) ExtensionServerInfo
NAME               TYPE         VALUE
adminEmail         string[]     "[email protected]"
company            string       "VMware"
description        Description
    NAME           TYPE         VALUE
    label          string       "NSX Compute Manager Id"
    summary        string       "ABC"
serverCertificate  string       Unset
serverThumbprint   string       "ABC..."   <<<<<<<<<<<
type               string       ""
url                string       "https://NSX-02.domain:443"   <<<<<<<<<<<

(2) ExtensionServerInfo
NAME               TYPE         VALUE
adminEmail         string[]     "[email protected]"
company            string       "VMware"
description        Description
    NAME           TYPE         VALUE
    label          string       "NSX Compute Manager Id"
    summary        string       "ABC"
serverCertificate  string       Unset
serverThumbprint   string       "XYZ..."   <<<<<<<<<<<
type               string       ""
url                string       "https://NSX-02.domain:443"   <<<<<<<<<<<
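
The duplicate entry in nsx_api_managers can also be found mechanically instead of by reading the YAML. The following is a minimal sketch, assuming the ConfigMap text has the layout shown earlier (key = value pairs separated by literal \n sequences). The function name find_duplicate_managers is hypothetical and is not part of any VMware tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads ConfigMap text on stdin and prints every
# NSX Manager endpoint that is listed more than once in nsx_api_managers.
find_duplicate_managers() {
    # Capture the value up to the next space or backslash (the literal "\n"
    # that separates settings in the YAML string), one endpoint per line.
    sed -n 's/.*nsx_api_managers *= *\([^\\ ]*\).*/\1/p' \
        | tr ',' '\n' \
        | tr '[:upper:]' '[:lower:]' \
        | sort \
        | uniq -d    # keep only entries that occur at least twice
}

# Usage against a live cluster (not executed here):
#   kubectl get configmaps nsx-ncp-config -n vmware-system-nsx -o yaml \
#       | find_duplicate_managers
```

Host names are lower-cased before comparison because DNS names are case-insensitive. No output means no endpoint is listed twice.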

Environment

VMware NSX with NCP versions 4.1.x and 4.2.x

Cause

  • This is a known issue impacting VMware NSX.
  • NSX Manager nodes are configured with mixed certificate types, for example:
    • NSX-01: CA signed certificate
    • NSX-02: CA signed certificate
    • NSX-03: Self-Signed Certificate
  • With mixed certificate types, NSX skips the DNS lookup for NSX-03 because a self-signed certificate does not require the DNS/FQDN to be filled in (fqdnRequired='false'). A CA-signed certificate, however, requires the DNS server to provide forward and reverse lookups for the Manager's IP address and hostname (fqdnRequired='true'). As a result, duplicate NSX Manager node FQDNs are stamped in the vCenter Extension.
  • By design, Tanzu pulls the thumbprints from vCenter rather than directly from the NSX Managers. NCP therefore sends API calls to the duplicated NSX FQDN pulled from the vCenter MOB, and the NCP pods crash.
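
To see which of the two stored thumbprints is stale, the certificate that an NSX Manager currently serves can be compared with each 'serverThumbprint' value from the MOB. The following is a minimal sketch. It assumes the stored thumbprint is a SHA-256 digest (swap -sha256 for -sha1 if it is not) and that the openssl CLI is available. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Strip any "... Fingerprint=" prefix, colons, and spaces, then upper-case,
# so that "sha256 Fingerprint=AB:CD" and "abcd" compare as equal.
normalize_fp() {
    printf '%s' "$1" | sed 's/^.*=//' | tr -d ': \n' | tr '[:lower:]' '[:upper:]'
}

# Succeeds (exit status 0) when both fingerprints are the same digest.
fingerprints_match() {
    [ "$(normalize_fp "$1")" = "$(normalize_fp "$2")" ]
}

# Print the fingerprint of the certificate served on HOST:443.
# ASSUMPTION: SHA-256 is the digest stored in the vCenter extension.
live_fingerprint() {
    openssl s_client -connect "$1:443" -servername "$1" </dev/null 2>/dev/null \
        | openssl x509 -noout -fingerprint -sha256
}

# Usage (not executed here); paste the MOB value as the second argument:
#   fingerprints_match "$(live_fingerprint nsx-02.domain.tld)" "ABC..." \
#       && echo "thumbprint is current" || echo "thumbprint is stale"
```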

Resolution

  • Apply a CA-signed certificate on all three NSX Manager nodes, or use self-signed certificates on all nodes. Do not mix certificate types.
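
Before and after changing certificates, it can help to confirm that every node uses the same certificate type. The following is a rough sketch using a common heuristic: a certificate whose subject equals its issuer is treated as self-signed. The function names are hypothetical, and the heuristic does not verify the signature itself.

```shell
#!/usr/bin/env bash
# Classify a certificate from its subject and issuer strings.
# Heuristic only: identical subject and issuer is treated as self-signed.
classify_cert() {
    if [ "$1" = "$2" ]; then
        echo "self-signed"
    else
        echo "ca-signed"
    fi
}

# Reads one certificate type per line on stdin and succeeds (exit status 0)
# when more than one distinct type is present, i.e. the cluster is mixed.
has_mixed_types() {
    [ "$(sort -u | grep -c .)" -gt 1 ]
}

# Usage (not executed here), with each node's certificate saved as a PEM file:
#   for pem in nsx-01.pem nsx-02.pem nsx-03.pem; do
#       classify_cert "$(openssl x509 -in "$pem" -noout -subject | sed 's/^subject= *//')" \
#                     "$(openssl x509 -in "$pem" -noout -issuer  | sed 's/^issuer= *//')"
#   done | has_mixed_types && echo "mixed certificate types detected"
```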