NSX-NCP pods in CrashLoopBackOff state after NSX compute manager unable to trust VCenter certificates.
Article ID: 408756

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms: 

  • vCenter machine certificates expired and were replaced.
  • Describing the crashing nsx-ncp pod shows events similar to the following:
    Events:
    Type     Reason     Age                   From     Message
    ---            ------     ----                  ----     -------
    Warning  Unhealthy  6m58s (x235 over 9h)  kubelet  Liveness probe failed: CLI server is not ready
    Warning  BackOff    112s (x509 over 9h)   kubelet  Back-off restarting failed container nsx-ncp in pod nsx-ncp-xxxxxxxxxxx-xxxx_vmware-system-nsx(xxx-xxxxx-xxxxx-xxxx-xxx)
  • The NSX-NCP pod logs show entries similar to:

    [ncp MainThread W] vmware_nsxlib.v3.cluster [7f1af4b7c730] Request failed due to: Certificate not trusted
    [ncp MainThread W] vmware_nsxlib.v3.cluster [7f1af4b7c730] Request failed due to an exception that calls for regeneration. Re-generating pool.
    [ncp MainThread I] nsx_ujo.ncp.vc.session Refreshing token and re-instantiating TESSession
    [ncp MainThread I] nsx_ujo.ncp.vc.session VC credentials were not changed
    [ncp MainThread I] nsx_ujo.ncp.vc.session Successfully retrieved JWT token:

    OR

    kubectl logs -n vmware-system-nsx -l component=nsx-ncp -c nsx-operator --follow shows the following:

    YYYY-MM-DD HH:MM:SS.899 ERROR   util/utils.go:245       handle http response    {"status": 401, "requestUrl": "https://vcsa-fqdn:443/rest//vcenter/tokenservice/token-exchange", "responseError": "json: unsupported type: func() (io.ReadCloser, error)", "error": "received HTTP Error"}
    YYYY-MM-DD HH:MM:SS.899 ERROR   jwt/tesclient.go:75 failed to exchange JWT  {"error": "received HTTP Error"}
    YYYY-MM-DD HH:MM:SS.899 ERROR   jwt/jwtcache.go:78      JWT cache failed to refresh JWT         {"error": "failed to exchange JWT due to error :received HTTP Error"}

  • Editing the NSX compute manager settings to trust the thumbprint of the new vCenter machine SSL certificate fails with the following error:

    Failed to enable trust on Compute Manager due to error There already exists an OIDC end-point with Issuer https://vCenter_FQDN/openidconnect/vsphere.local. Please check https://vCenter_FQDN/openidconnect/vsphere.local/.well-known/openid-configuration is reachable from NSX manager nodes. (Error code: 90011)

  • /var/log/vmware/wcp/wcpsvc.log

    A general system error occurred. Error message: failed to create WCP Service Principal Identity: NSX Principal Identity creation failed: error sending HTTP request: Post "http://localhost:1080/external-cert/http1/NSXT_FQDN/443/api/v1/trust-management/token-principal-identities": context deadline exceeded (Client.Timeout exceeded while awaiting headers) error sending HTTP request: Post "http://localhost:1080/external-cert/http1/NSXT_FQDN/443/api/v1/trust-management/token-principal-identities": context deadline exceeded (Client.Timeout exceeded while awaiting headers) error sending HTTP request: Post "http://localhost:1080/external-cert/http1/NSXT_FQDN/443/api/v1/trust-management/token-principal-identities": context deadline exceeded (Client.Timeout exceeded while awaiting headers) error sending HTTP request:Post "http://localhost:1080/external-cert/http1/NSXT_FQDN/443/api/v1/trust-management/token-principal-identities": context deadline exceeded (Client.Timeout exceeded while awaiting headers).
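The compute manager error above asks you to verify that the OIDC discovery endpoint is reachable from the NSX Manager nodes. A minimal reachability sketch (vCenter_FQDN is a placeholder for your vCenter):

```shell
# Minimal reachability check for the OIDC discovery endpoint; run it from
# each NSX Manager node. vCenter_FQDN is a placeholder.
# -k skips TLS verification so this tests reachability, not trust.
curl -k -s -o /dev/null -w '%{http_code}\n' \
  "https://vCenter_FQDN/openidconnect/vsphere.local/.well-known/openid-configuration"
# 200 means the endpoint is reachable; 000 or a curl error points to DNS or
# network problems between the NSX Manager node and vCenter.
```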

Environment

vSphere Kubernetes Service

vCenter Server 8.x

vCenter Server 9.x

NSX-T Manager 4.x

NSX-T Manager 9.x

Cause

The thumbprint of the newly generated vCenter machine SSL certificate no longer matches the thumbprint saved in the NSX compute manager. As a result, API calls from the NSX Container Plugin (NCP) to NSX Manager fail TLS validation, leaving the pods in CrashLoopBackOff.

  • To review the current thumbprint, run the following command in the vCenter Server Appliance shell; it prints the latest thumbprint:

# echo | openssl s_client -connect localhost:443 2>/dev/null | openssl x509 -noout -fingerprint -sha256

  • From an SSH session on the NSX Manager, the following curl command shows the OIDC endpoint still registered with the existing (old) thumbprint:

# curl -k -u admin -X GET 'https://localhost/api/v1/trust-management/oidc-uris'

Enter host password for user 'admin':

{
  "results" : [ {
    "oidc_uri" : "https://vCenter_FQDN/openidconnect/.well-known/openid-configuration",
    "thumbprint" : "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "oidc_type" : "vcenter",
    "scim_endpoints" : [ ],
    "claim_map" : [ ],
    "serviced_domains" : [ ],
    "restrict_scim_search" : false,
    "end_session_endpoint_uri" : "https://vCenter_FQDN/openidconnect/logout/vsphere.local",
    "issuer" : "https://vCenter_FQDN/openidconnect/vsphere.local",
    "jwks_uri" : "https://vCenter_FQDN/openidconnect/jwks/vsphere.local",
    "token_endpoint" : "https://vCenter_FQDN/openidconnect/token/vsphere.local",
    "claims_supported" : [ ],
    "override_roles" : [ ],
    "csp_config" : {
      "customer_org_id" : "",
      "additional_org_ids" : [ ]
    },
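The two checks above boil down to comparing one string: the live certificate's SHA-256 fingerprint versus the "thumbprint" field NSX returns. A small sketch; the helper names (strip_prefix, extract_thumbprint) are introduced here for illustration and are not vCenter or NSX tools:

```shell
# Hedged sketch: normalize both sides of the thumbprint comparison.

# Strip openssl's "sha256 Fingerprint=" prefix, leaving the hex digest.
strip_prefix() {
    sed 's/^.*Fingerprint=//'
}

# Pull the "thumbprint" value out of the oidc-uris JSON response.
extract_thumbprint() {
    sed -n 's/.*"thumbprint"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Live usage (on the respective appliances):
#   echo | openssl s_client -connect localhost:443 2>/dev/null \
#     | openssl x509 -noout -fingerprint -sha256 | strip_prefix
#   curl -k -u admin 'https://localhost/api/v1/trust-management/oidc-uris' \
#     | extract_thumbprint

# Demo with sample data in the shapes shown above:
live=$(printf 'sha256 Fingerprint=AB:CD:EF\n' | strip_prefix)
stored=$(printf '    "thumbprint" : "AB:CD:EF",\n' | extract_thumbprint)
[ "$live" = "$stored" ] && echo "thumbprints match" || echo "MISMATCH: NSX holds a stale thumbprint"
# prints: thumbprints match
```

A mismatch here confirms the stale thumbprint described in the Cause section.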

Resolution

To address the issue, follow the steps below:

Complete the steps outlined in the KB article: Failed to enable trust on Compute Manager in NSX


After completing the steps in the KB article, restart the NSX-NCP pods by scaling the nsx-ncp deployment in the vmware-system-nsx namespace down to 0 replicas and then back to the desired number (e.g., 1 or 2 replicas):

  1. Check the deployment status (e.g., the current number of replicas):

    kubectl get deployments.apps -n vmware-system-nsx

  2. Scale the deployment to 0 replicas and then back to the desired number:

    kubectl scale deployment nsx-ncp --replicas=0 -n vmware-system-nsx
    kubectl scale deployment nsx-ncp --replicas=1 -n vmware-system-nsx

  3. Check the deployment status to confirm the pods are running:

    kubectl get deployments.apps -n vmware-system-nsx
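The three steps above can be sketched as one reusable function. restart_ncp is a helper introduced here, not part of any product: its first argument is the command to execute (kubectl for a real run, echo for a dry run) and its second is the replica count you recorded in step 1:

```shell
# Hedged sketch of the restart steps as a function. Pass the command to run
# as the first argument (kubectl for real, echo to dry-run) and the desired
# replica count as the second.
restart_ncp() {
    cmd="$1"
    replicas="$2"
    "$cmd" scale deployment nsx-ncp --replicas=0 -n vmware-system-nsx
    "$cmd" scale deployment nsx-ncp --replicas="$replicas" -n vmware-system-nsx
    # rollout status blocks until the pods report Ready (or the wait times out).
    "$cmd" rollout status deployment nsx-ncp -n vmware-system-nsx --timeout=300s
}

# Dry run: print the commands instead of executing them.
restart_ncp echo 1
# Real run (uncomment): restart_ncp kubectl 1
```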

If the issue persists after restarting the pods, involve Broadcom support for further assistance. Refer to Creating and managing Broadcom support cases for guidance on opening a support case.