After Repointing vCenter Single Sign-On (SSO) Domain, vSphere Supervisor Cluster Stuck Configuring, Error or Removing State



Article ID: 393264


Products

VMware vSphere Kubernetes Service
vSphere with Tanzu
Tanzu Kubernetes Runtime
VMware NSX for vSphere
VMware NSX Networking
VMware NSX

Issue/Introduction

After performing a vCenter SSO domain repoint in an NSX networking setup, the vSphere Supervisor Cluster is stuck in a Configuring, Removing, or Error state.

From the vSphere web client, under Workload Management, the Supervisor cluster shows a Configuring, Removing, or Error state.

While connected to the vCenter Server Appliance (VCSA), the wcpsvc log shows multiple errors similar to the following:

cat /var/log/wcp/wcpsvc.log

Sending HTTP request 'GET' to NSX managers for Principal Identities
Error sending HTTP request to NSX Manager

http request failed. URL: http://localhost:1080/external-cert/http1/<NSX manager IP>/443/api/v1/trust-management/token-principal-identities. 
Status Code: 403. Status: 403 Forbidden
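To confirm this symptom from the VCSA shell, the log can be filtered for the failing Principal Identity requests. The sketch below runs the filter over a sample line so it is self-contained; on the VCSA, run the grep against the real log file instead:

```shell
# Filter for the failing Principal Identity requests. On the VCSA, run:
#   grep -E "principal-identities|403 Forbidden" /var/log/wcp/wcpsvc.log
# The echo below stands in for the real log content.
echo "Status Code: 403. Status: 403 Forbidden" \
  | grep -E "principal-identities|403 Forbidden"
```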

The wcpsvc log may also show entries only for the previous LOCAL domain, indicating that the SSO domain has not been updated:

INFO  AuthorizationService.AuditLog  opId=<opId>] Action performed by principal(name=<old LOCAL domain>\vpxd-extension-<id>,isGroup=false):Add global access [ Principal=Name=<old LOCAL domain>\NsxAdministrators,isGroup=true,roles=[<id>],propagating=true ]

While connected to the Supervisor cluster context, all NSX-NCP pods are in CrashLoopBackOff state, where the status values may vary depending on the state of the container restarts:

kubectl get pods -n vmware-system-nsx

NAME                     READY       STATUS             RESTARTS
<nsx-ncp-pod-name-a>     0/2         CrashLoopBackOff   ###(MmSSs ago)
<nsx-ncp-pod-name-b>     1/2         CrashLoopBackOff   ###(MmSSs ago)
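The crashing pods can also be counted non-interactively, which is convenient for checking progress after the fix. This sketch runs awk over sample output so it is self-contained; on the Supervisor, pipe the real kubectl command in instead:

```shell
# Count pods whose STATUS column is CrashLoopBackOff. On the Supervisor:
#   kubectl get pods -n vmware-system-nsx --no-headers \
#     | awk '$3 == "CrashLoopBackOff" {n++} END {print n+0}'
# The printf below stands in for the real kubectl output.
printf 'nsx-ncp-pod-a 0/2 CrashLoopBackOff 12\nnsx-ncp-pod-b 1/2 CrashLoopBackOff 9\n' \
  | awk '$3 == "CrashLoopBackOff" {n++} END {print n+0}'
```

A count of 0 after the resolution steps indicates the NCP pods have stabilized.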

This issue can also cause deactivating a Supervisor cluster to become stuck in the Removing state with the following error:

Unable to disable cluster domain-c<id>. Err failed cleaning NSX-T resources due to failure to fetch supervisor <supervisor cluster id> principal identity: failed to get Principal Identity 'wcp-cluster-user-domain-c<id>-<supervisor cluster id>' for Supervisor '<supervisor cluster id>': error listing Principal Identities from NSX managers: error listing Principal Identities: GET http request failed. URL: http://localhost:1080/external-cert/http1/<NSX Manager IP>/443/api/v1/trust-management/token-principal-identities. Status Code: 403. Status: 403 Forbidden

Environment

vSphere Supervisor 8.0

vSphere Supervisor 9.0

NSX-T 4.X

Cause

After the repoint, the NSX-NCP pods in the Supervisor cluster still reference the old SSO domain. The SSO domain change needs to be applied manually to their configuration.

Resolution

Confirm that the NSX Groups have the appropriate permissions

  1. From the vSphere web client, navigate to Administration.
  2. Locate the following NSX groups under your local domain (default: vsphere.local):
    NsxAdministrators
    NsxViAdministrators
    NsxAuditors
  3. If there are no specific NSX roles in the environment, CREATE them accordingly under Administration -> Roles:
    Role name             Description                                               Privileges
    NSX Administrator     Allows vSphere user to view and modify NSX configuration  NSX - Modify NSX configuration
    NSX Auditor           Allows vSphere user to view NSX configuration             NSX - Read NSX Configuration
    NSX VI Administrator  Allows vSphere user to manage NSX                         NSX - Modify NSX configuration
  4. If any of the above NSX groups do not have Global Permissions, ADD them accordingly:
    User/Group                         Role                  Defined in         Propagate to children
    VSPHERE.LOCAL\NsxAdministrators    NSX Administrator     Global Permission  Yes
    VSPHERE.LOCAL\NsxAuditors          NSX Auditor           Global Permission  Yes
    VSPHERE.LOCAL\NsxViAdministrators  NSX VI Administrator  Global Permission  Yes
  5. If any changes were needed for the NSX groups above, restart wcpsvc service from the vCenter Server Appliance (VCSA):
    service-control --restart wcp

Update the NSX-NCP configmap object to use the repointed SSO domain

  1. Connect to the Supervisor cluster context.
  2. Take a backup of the nsx-ncp-config configmap:
    kubectl get configmap nsx-ncp-config -n vmware-system-nsx -o yaml > nsx-ncp-config-backup.yaml
  3. Edit the configmap to update the "sso_domain" value, found alongside vc_endpoint, to the <repointed SSO domain>:
    kubectl edit configmap nsx-ncp-config -n vmware-system-nsx
    
    Within the configmap's data, the relevant settings appear inside an escaped string similar to:
    
    apiVersion: v1
    data:
      ...
      "...\nvc_endpoint = <vcenter.FQDN>\nsso_domain = <repointed sso domain>\nhttps_port = <port>\n..."
  4. Save the changes.
  5. Restart the NCP pods to pick up the change:
    kubectl rollout restart deploy -n vmware-system-nsx nsx-ncp
  6. Confirm that the NCP pods are no longer crashing:
    kubectl get pods -n vmware-system-nsx
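The substitution made in step 3 can also be sketched non-interactively with sed. Because the settings live inside one escaped string in the configmap, the pattern below stops at the literal "\n" that follows the value. This is an untested sketch: "vsphere.local" stands in for the old domain, and <repointed sso domain> remains a placeholder to fill in:

```shell
# The sso_domain substitution from step 3, shown on a sample line.
# On the Supervisor, the same expression could be applied to the output of:
#   kubectl get configmap nsx-ncp-config -n vmware-system-nsx -o yaml
# [^\\]* stops at the literal "\n" escape inside the configmap's data string.
printf '%s\n' 'sso_domain = vsphere.local\nhttps_port = <port>' \
  | sed 's/sso_domain = [^\\]*/sso_domain = <repointed sso domain>/'
```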