After vCenter certificate replacement, reconciliation of TKGS cluster is failing

Article ID: 326429

Products

VMware vCenter Server

Issue/Introduction

Symptoms:
  • Certificates of the vCenter Server were changed (e.g. machine certificate or STS certificate).
  • Reconciliation of the TKGS cluster is failing.
  • The nsx-ncp pods remain stuck in "Init" state:
# kubectl get pods -n vmware-system-nsx

output:
NAME                       READY   STATUS     RESTARTS         AGE
nsx-ncp-6b975548cb-jwdxv   0/2     Init:0/1   14 (8m58s ago)   3h51m
  • The NSX-NCP pods might also crash and restart continuously and end up in the CrashLoopBackOff state.
    • This can be verified via kubectl, e.g. kubectl describe pod/nsx-ncp-[UNIQUE-ID] -n vmware-system-nsx
  • Logs on the Supervisor Cluster in /var/log/pods/ indicate that a certificate is not trusted. The log entries might look similar to the following:
    • (Logs can also be checked via kubectl, e.g. kubectl logs pod/nsx-ncp-[UNIQUE-ID] -n vmware-system-nsx)
[wcp-migrator MainThread I] nsx_ujo.ncp.vc.session Refreshing token and re-instantiating TESSession
[wcp-migrator MainThread I] nsx_ujo.ncp.vc.session VC credentials were not changed
[wcp-migrator MainThread I] nsx_ujo.ncp.vc.session Successfully retrieved JWT token: eyJraWQi[...]w1nO
[wcp-migrator MainThread W] vmware_nsxlib.v3.utils Finished retry of vmware_nsxlib.v3.cluster.ClusteredAPI._proxy.<locals>._proxy_internal for the 10th time after 31.602 (s) with args: Unknown
[wcp-migrator MainThread E] vmware_nsxlib.v3.lib Unable to read maximum tags. Reason: Certificate not trusted
 
...OR...

[wcp-migrator MainThread W] vmware_nsxlib.v3.cluster [7f0bbb44af50] Request failed due to: Certificate not trusted
[wcp-migrator MainThread W] vmware_nsxlib.v3.cluster [7f0bbb44af50] Request failed due to an exception that calls for regeneration. Re-generating pool.
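
On a busy pod, the trust failures shown above can be buried among unrelated entries. The following is a minimal sketch of a filter for them. The function name filter_cert_errors is hypothetical and not part of any VMware tooling. It matches only the literal message text quoted in this article, so other failure wording would be missed.

```shell
#!/bin/sh
# Hypothetical helper: reads log text on stdin and prints only the lines
# that contain the literal message "Certificate not trusted".
filter_cert_errors() {
    grep -F "Certificate not trusted"
}

# Example usage (the pod name is a placeholder, as elsewhere in this article):
#   kubectl logs pod/nsx-ncp-[UNIQUE-ID] -n vmware-system-nsx --all-containers | filter_cert_errors
```

An empty result does not prove that trust is intact. It only means that this exact message was not logged in the output that was inspected.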


Environment

VMware vCenter Server 8.0.x
VMware vCenter Server 7.0.x

Cause

After vCenter certificates are replaced (especially the machine certificate and the STS certificate), NSX Manager loses trust with vCenter Server. This is expected behavior. When the certificate changes, NSX Manager cannot differentiate between a legitimate certificate change and a malicious attempt (e.g. a man-in-the-middle attack), so it refuses further communication with the vCenter API for security reasons.

The trust relationship must be re-established manually, with validation of the new certificate thumbprint, to restore connectivity between both components.
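
As an illustration of thumbprint validation, the sketch below reads the SHA-256 fingerprint of the certificate that a host presents on port 443 and compares it with a value obtained from a trusted source. It rests on several assumptions: that a SHA-256 thumbprint is the value being compared, that the openssl CLI is available, and that the function names and the host name vcenter.example.com are hypothetical placeholders. It is not the procedure from the referenced KB article.

```shell
#!/bin/sh
# Print the SHA-256 fingerprint of the certificate presented on <host>:443.
# Needs network access to the host and the openssl command-line tool.
fetch_thumbprint() {
    openssl s_client -connect "$1:443" -servername "$1" </dev/null 2>/dev/null \
        | openssl x509 -noout -fingerprint -sha256 \
        | cut -d= -f2
}

# Normalize a thumbprint: drop colons and whitespace, upper-case the hex digits.
normalize_thumbprint() {
    printf '%s' "$1" | tr -d ': \n' | tr 'a-f' 'A-F'
}

# Succeed (exit status 0) only if both thumbprints match after normalization.
same_thumbprint() {
    [ "$(normalize_thumbprint "$1")" = "$(normalize_thumbprint "$2")" ]
}

# Example usage (host name and expected value are placeholders):
#   same_thumbprint "$(fetch_thumbprint vcenter.example.com)" "<thumbprint from a trusted source>" \
#       && echo "thumbprint matches"
```

The comparison is only meaningful if the expected value comes from a trusted channel, for example from the vCenter Server itself rather than from the same network path that is being verified.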

Resolution

To re-establish trust between NSX Manager and its Compute Manager (vCenter Server), follow the 'Resolution' section of this KB article: After VMware vCenter Server certificate is replaced, compute manager connection is "Down" on NSX UI.

Once this is done, the NSX-NCP pods on the Supervisor Cluster should re-establish connectivity automatically within a few minutes. If they do not, contact VMware Support and reference this KB article.
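
One way to check whether the pods have recovered is to list any pod whose STATUS column is not Running. This is a minimal sketch with a hypothetical function name, pods_not_running. It assumes the default column layout of kubectl get pods (NAME, READY, STATUS, ...) shown in the Symptoms section. Any status other than Running is reported, including Completed.

```shell
#!/bin/sh
# Hypothetical helper: reads "kubectl get pods" table output on stdin and
# prints the names of pods whose STATUS column is anything other than Running.
pods_not_running() {
    awk 'NR > 1 && $3 != "Running" { print $1 }'
}

# Example usage:
#   kubectl get pods -n vmware-system-nsx | pods_not_running
```

No output means that every listed pod reports Running. A Running pod can still have containers that are not ready, so the READY column should be checked as well.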

Additional Information

Impact/Risks:
The reconciliation of the TKGS cluster is failing. Because this is usually caused by a broken trust relationship between NSX Manager and vCenter Server, the impact might be broader and affect other operations that involve both components, such as pod creation or changes to network policies.