After vCenter certificate replacement, reconciliation of TKGS cluster is failing
Article ID: 326429
Products
VMware vCenter Server
Issue/Introduction
Symptoms:
Certificates of the vCenter Server were changed (e.g. machine certificate or STS certificate).
Reconciliation of the TKGS cluster fails.
The nsx-ncp pods remain stuck in "Init" state:
# kubectl get pods -n vmware-system-nsx
Output:
NAME                       READY   STATUS     RESTARTS         AGE
nsx-ncp-6b975548cb-jwdxv   0/2     Init:0/1   14 (8m58s ago)   3h51m
The nsx-ncp pods might also crash and restart repeatedly and show the CrashLoopBackOff state.
This can be verified with kubectl, e.g. kubectl describe pod/nsx-ncp-[UNIQUE-ID] -n vmware-system-nsx
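The pod check above can be sketched in Python. This is a minimal illustration, not an official tool: the function names find_unhealthy_pods and get_pods_output are hypothetical, and the sketch assumes the default column layout of kubectl get pods (NAME, READY, STATUS as the first three columns).

```python
import subprocess

NAMESPACE = "vmware-system-nsx"  # namespace as given in this article


def find_unhealthy_pods(kubectl_output: str) -> list[tuple[str, str]]:
    """Return (name, status) for nsx-ncp pods that are not Running or Completed.

    Assumes the default `kubectl get pods` layout, where NAME is the first
    column and STATUS is the third.
    """
    unhealthy = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, status = fields[0], fields[2]
        if name.startswith("nsx-ncp-") and status not in ("Running", "Completed"):
            unhealthy.append((name, status))
    return unhealthy


def get_pods_output() -> str:
    """Hypothetical helper: run the kubectl command from this article."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", NAMESPACE],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Calling find_unhealthy_pods(get_pods_output()) on a Supervisor control plane node would list pods stuck in states such as Init:0/1 or CrashLoopBackOff. The parsing function can also be fed saved command output.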
Logs on the Supervisor Cluster in /var/log/pods/ indicate that the certificate is not trusted.
(Logs can also be checked via kubectl, e.g. kubectl logs pod/nsx-ncp-[UNIQUE-ID] -n vmware-system-nsx)
Log entries might look similar to:
[wcp-migrator MainThread I] nsx_ujo.ncp.vc.session Refreshing token and re-instantiating TESSession
[wcp-migrator MainThread I] nsx_ujo.ncp.vc.session VC credentials were not changed
[wcp-migrator MainThread I] nsx_ujo.ncp.vc.session Successfully retrieved JWT token: eyJraWQi[...]w1nO
[wcp-migrator MainThread W] vmware_nsxlib.v3.utils Finished retry of vmware_nsxlib.v3.cluster.ClusteredAPI._proxy.<locals>._proxy_internal for the 10th time after 31.602 (s) with args: Unknown
[wcp-migrator MainThread E] vmware_nsxlib.v3.lib Unable to read maximum tags. Reason: Certificate not trusted
...OR...
[wcp-migrator MainThread W] vmware_nsxlib.v3.cluster [7f0bbb44af50] Request failed due to: Certificate not trusted
[wcp-migrator MainThread W] vmware_nsxlib.v3.cluster [7f0bbb44af50] Request failed due to an exception that calls for regeneration. Re-generating pool.
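To confirm that a captured log contains this signature, the search can be sketched as follows. This is an illustrative helper only: find_trust_errors is a hypothetical name, and the match relies solely on the "Certificate not trusted" string shown in the excerpts above.

```python
TRUST_ERROR = "Certificate not trusted"  # message text taken from the log excerpts above


def find_trust_errors(log_text: str) -> list[str]:
    """Return every log line that reports the certificate trust failure."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if TRUST_ERROR in line
    ]
```

The log text can come from a file under /var/log/pods/ or from saved kubectl logs output. An empty result suggests the pod problem has a different cause than the one described in this article.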
Environment
VMware vCenter Server 8.0.x
VMware vCenter Server 7.0.x
Cause
After vCenter certificates are replaced (especially the machine certificate and the STS certificate), NSX Manager loses trust with vCenter Server. This is expected behavior: because the certificate changed, NSX Manager cannot distinguish an expected certificate change from a malicious attempt (e.g. a man-in-the-middle attack) and refuses further communication with the vCenter API for security reasons.
To restore the trust relationship and connectivity between both components, trust must be re-established manually, including validation of the certificate thumbprint.
Once this is done, the nsx-ncp pods on the Supervisor Cluster should re-establish connectivity automatically within a few minutes. If they do not, contact VMware Support and reference this KB article.
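As an illustration of what thumbprint validation involves, the sketch below computes a certificate thumbprint with the Python standard library. It assumes the thumbprint in question is the SHA-256 digest of the DER-encoded certificate, written as colon-separated uppercase hex; the article does not state the format. The function names are hypothetical.

```python
import hashlib
import ssl


def sha256_thumbprint(der_bytes: bytes) -> str:
    """Format the SHA-256 digest of a DER certificate as AA:BB:CC:..."""
    digest = hashlib.sha256(der_bytes).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def fetch_thumbprint(host: str, port: int = 443) -> str:
    """Fetch the certificate a server presents and return its thumbprint.

    ssl.get_server_certificate does not validate the certificate when no
    ca_certs argument is given, so the result only shows what the server
    currently presents over the network.
    """
    pem = ssl.get_server_certificate((host, port))
    return sha256_thumbprint(ssl.PEM_cert_to_DER_cert(pem))
```

A value fetched over the network is itself exposed to interception, so it should be compared against a thumbprint obtained from a trusted source, such as the vCenter Server appliance itself, before trust is accepted.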
Additional Information
Impact/Risks: The reconciliation of the TKGS cluster fails. Because this is usually caused by a broken trust relationship between NSX Manager and vCenter Server, the impact can be broader and affect other operations that involve both components, such as pod creation or changes to network policy.