After Repointing vCenter Single Sign-On (SSO) Domain, vSphere Supervisor Cluster Stuck Configuring, Error or Removing State



Article ID: 393264


Products

VMware vSphere Kubernetes Service
vSphere with Tanzu
Tanzu Kubernetes Runtime
VMware NSX for vSphere
VMware NSX Networking
VMware NSX

Issue/Introduction

After performing a vCenter SSO domain repoint in an NSX networking setup, the vSphere Supervisor Cluster is stuck in a Configuring, Removing, or Error state.

From the vSphere web client, under Workload Management, the Supervisor cluster shows a Configuring, Removing, or Error state.

While connected to the vCenter Server Appliance (VCSA), the wcpsvc log shows multiple errors similar to the following:

cat /var/log/wcp/wcpsvc.log

Sending HTTP request 'GET' to NSX managers for Principal Identities
Error sending HTTP request to NSX Manager

http request failed. URL: http://localhost:1080/external-cert/http1/<NSX manager IP>/443/api/v1/trust-management/token-principal-identities. 
Status Code: 403. Status: 403 Forbidden
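To confirm this symptom from the VCSA shell, the log can be filtered for the failing Principal Identity requests. The sketch below runs the filter over a sample line so it is self-contained; on the VCSA, run the grep against the real log file instead:

```shell
# Filter for the failing Principal Identity requests. On the VCSA, run:
#   grep -E "principal-identities|403 Forbidden" /var/log/wcp/wcpsvc.log
# The echo below stands in for the real log content.
echo "Status Code: 403. Status: 403 Forbidden" \
  | grep -E "principal-identities|403 Forbidden"
```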

The wcpsvc log may also show entries only for the previous LOCAL domain, indicating that the SSO domain has not been updated:

INFO  AuthorizationService.AuditLog  opId=<opId>] Action performed by principal(name=<old LOCAL domain>\vpxd-extension-<id>,isGroup=false):Add global access [ Principal=Name=<old LOCAL domain>\NsxAdministrators,isGroup=true,roles=[<id>],propagating=true ]

While connected to the Supervisor cluster context, all NSX-NCP pods are in CrashLoopBackOff state, where the status values may vary depending on the state of the container restarts:

kubectl get pods -n vmware-system-nsx

NAME                     READY       STATUS             RESTARTS
<nsx-ncp-pod-name-a>     0/2         CrashLoopBackOff   ###(MmSSs ago)
<nsx-ncp-pod-name-b>     1/2         CrashLoopBackOff   ###(MmSSs ago)
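The crashing pods can also be counted non-interactively, which is convenient for checking progress after the fix. This sketch runs awk over sample output so it is self-contained; on the Supervisor, pipe the real kubectl command in instead:

```shell
# Count pods whose STATUS column is CrashLoopBackOff. On the Supervisor:
#   kubectl get pods -n vmware-system-nsx --no-headers \
#     | awk '$3 == "CrashLoopBackOff" {n++} END {print n+0}'
# The printf below stands in for the real kubectl output.
printf 'nsx-ncp-pod-a 0/2 CrashLoopBackOff 12\nnsx-ncp-pod-b 1/2 CrashLoopBackOff 9\n' \
  | awk '$3 == "CrashLoopBackOff" {n++} END {print n+0}'
```

A count of 0 after the resolution steps indicates the NCP pods have stabilized.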

This issue can also cause deactivating a Supervisor cluster to become stuck in the Removing state with the following error:

Unable to disable cluster domain-c<id>. Err failed cleaning NSX-T resources due to failure to fetch supervisor <supervisor cluster id> principal identity: failed to get Principal Identity 'wcp-cluster-user-domain-c<id>-<supervisor cluster id>' for Supervisor '<supervisor cluster id>': error listing Principal Identities from NSX managers: error listing Principal Identities: GET http request failed. URL: http://localhost:1080/external-cert/http1/<NSX Manager IP>/443/api/v1/trust-management/token-principal-identities. Status Code: 403. Status: 403 Forbidden

Environment

vSphere Supervisor 8.0

vSphere Supervisor 9.0

NSX-T 4.X

Cause

After the repoint, the NSX-NCP pods in the Supervisor cluster still reference the old SSO domain. The SSO domain change needs to be applied manually to their configuration.

Resolution

Confirm that the NSX Groups have the appropriate permissions

  1. From the vSphere web client, navigate to Administration.
  2. Locate the following NSX groups under your local domain (default: vsphere.local):
    NsxAdministrators
    NsxViAdministrators
    NsxAuditors
  3. If there are no specific NSX roles in the environment, CREATE them accordingly under Administration -> Roles:
    Role name             Description                                               Privileges
    NSX Administrator     Allows vSphere user to view and modify NSX configuration  NSX - Modify NSX configuration
    NSX Auditor           Allows vSphere user to view NSX configuration             NSX - Read NSX Configuration
    NSX VI Administrator  Allows vSphere user to manage NSX                         NSX - Modify NSX configuration
  4. If any of the above NSX groups do not have Global Permissions, ADD them accordingly:
    User/Group                         Role                  Defined in         Propagate to children
    VSPHERE.LOCAL\NsxAdministrators    NSX Administrator     Global Permission  Yes
    VSPHERE.LOCAL\NsxAuditors          NSX Auditor           Global Permission  Yes
    VSPHERE.LOCAL\NsxViAdministrators  NSX VI Administrator  Global Permission  Yes
  5. If any changes were needed for the NSX groups above, restart wcpsvc service from the vCenter Server Appliance (VCSA):
    service-control --restart wcp

Update the NSX-NCP configmap object to use the repointed SSO domain

  1. Connect to the Supervisor cluster context.
  2. Take a backup of the nsx-ncp-config configmap:
    kubectl get configmap nsx-ncp-config -n vmware-system-nsx -o yaml > nsx-ncp-config-backup.yaml
  3. Edit the configmap to update the "sso_domain" value, found alongside vc_endpoint, to the <repointed SSO domain>:
    kubectl edit configmap nsx-ncp-config -n vmware-system-nsx
    
    Within the configmap's data, the relevant settings appear inside an escaped string similar to:
    
    apiVersion: v1
    data:
      ...
      "...\nvc_endpoint = <vcenter.FQDN>\nsso_domain = <repointed sso domain>\nhttps_port = <port>\n..."
  4. Save the changes.
  5. Restart the NCP pods to pick up the change:
    kubectl rollout restart deploy -n vmware-system-nsx nsx-ncp
  6. Confirm that the NCP pods are no longer crashing:
    kubectl get pods -n vmware-system-nsx
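The substitution made in step 3 can also be sketched non-interactively with sed. Because the settings live inside one escaped string in the configmap, the pattern below stops at the literal "\n" that follows the value. This is an untested sketch: "vsphere.local" stands in for the old domain, and <repointed sso domain> remains a placeholder to fill in:

```shell
# The sso_domain substitution from step 3, shown on a sample line.
# On the Supervisor, the same expression could be applied to the output of:
#   kubectl get configmap nsx-ncp-config -n vmware-system-nsx -o yaml
# [^\\]* stops at the literal "\n" escape inside the configmap's data string.
printf '%s\n' 'sso_domain = vsphere.local\nhttps_port = <port>' \
  | sed 's/sso_domain = [^\\]*/sso_domain = <repointed sso domain>/'
```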