VKS Supervisor stuck in "Removing" state at "Supervisor deleted- Waiting for workloads in the Supervisor to be deleted"

Article ID: 437907


Products

VMware vSphere Kubernetes Service, VMware vCenter Server

Issue/Introduction

  • When deactivating the Supervisor cluster, it remains stuck in the "Removing" state.
  • All Supervisor namespaces, Supervisor VMs, and guest clusters still exist in the environment.
  • The user can still log in to the Supervisor cluster and check the status of pods, namespaces, and other objects associated with it.
  • The vCenter Server log /var/log/vmware/wcp/wcpsvc.log shows multiple instances of the events below.

     "Failed to destroy the object resgroup-####1: ServerFaultCode: Permission to perform this operation was denied"
     "Failed to destroy the object resgroup-####2: ServerFaultCode: Permission to perform this operation was denied"
     "Failed to destroy the object resgroup-####3: ServerFaultCode: Permission to perform this operation was denied"

          error wcp [workload/vcinvnt.go:294] [opID=svc-velero-domain-c<ID>] Failed to delete ResourcePool resgroup-#####: ServerFaultCode: Permission to perform this operation was denied.
          debug wcp [logger/trace.go:77] [BEGIN] [workload.(*Workload).updateStatePhase:1754] setting workload svc-velero-domain-c<ID> phase to PENDING
          error wcp [vc#####/client.go:1072] [opID=svc-tkg-domain-c<ID>] Failed to destroy the object resgroup-#####: ServerFaultCode: Permission to perform this operation was denied.
          error wcp [workload/vcinvnt.go:294] [opID=svc-tkg-domain-c<ID>] Failed to delete ResourcePool resgroup-#####: ServerFaultCode: Permission to perform this operation was denied.
          debug wcp [logger/trace.go:77] [BEGIN] [workload.(*Workload).updateStatePhase:1754] setting workload svc-tkg-domain-c<ID> phase to PENDING

  • Running the solution_users_fixer.py script as described in the KB "Fixing missing SSO Group Memberships for vSphere Solution Users with the solution_users_fixer script" reports that the group memberships were updated correctly. However, the error "Failed to destroy the object resgroup-####1: ServerFaultCode: Permission to perform this operation was denied" continues to appear in /var/log/vmware/wcp/wcpsvc.log. Example output of the solution_users_fixer.py script:

          root@vCenter [~]# python solution_users_fixer.py --fix
          Enter your sso administrator password:
          Have you taken a snapshot of this vCenter and all other vCenters in its ELM group? (yes/no): yes
          Checking group memberships for vsphere-ui-bb2#####-####-####-####-###########
              Removing vsphere-ui-bb2#####-####-####-####-########### from group: cn=systemconfiguration.administrators, dc=vsphere, dc=local
          Checking group memberships for topologysvc-bb2#####-####-####-####-###########
          Checking group memberships for wcp-storage-user-ea0a869b-df16-46bc-8cc6-cbf5af5cd2f7-bb2#####-####-####-####-###########
          Checking group memberships for observability-vapi-bb2#####-####-####-####-###########
          Checking group memberships for vmware-vsm-bb2#####-####-####-####-###########
          Checking group memberships for vmware-applmgmtservice-bb2#####-####-####-####-###########
          Checking group memberships for certificateauthority-bb2#####-####-####-####-###########
          Checking group memberships for vsphere-webclient-bb2#####-####-####-####-###########
              Removing vsphere-webclient-bb2#####-####-####-####-########### from group: cn=systemconfiguration.administrators, dc=vsphere, dc=local
          Checking group memberships for machine-bb2#####-####-####-####-###########
          Checking group memberships for vpxd-svcs-user-bb2#####-####-####-####-###########
          Checking group memberships for perfcharts-bb2#####-####-####-####-###########
          Checking group memberships for vmware-scaservice-bb2#####-####-####-####-###########
          Checking group memberships for content-library-user-bb2#####-####-####-####-###########
          Checking group memberships for hvc-bb2#####-####-####-####-###########
          Checking group memberships for trustmanagement-bb2#####-####-####-####-###########
          Checking group memberships for sps-bb2#####-####-####-####-###########
          Checking group memberships for cms-bb2#####-####-####-####-###########
          Checking group memberships for vpxd-svc-acct-bb2#####-####-####-####-###########
          Checking group memberships for serviceaccountmgmt-bb2#####-####-####-####-###########
          Checking group memberships for sts-bb2#####-####-####-####-###########
          Checking group memberships for hvc-svc-bb2#####-####-####-####-###########
          Checking group memberships for vpxd-extension-bb2#####-####-####-####-###########
          Checking group memberships for vpxd-bb2#####-####-####-####-###########
          Group memberships updated. Please restart services

  • The "authz-doctor" tool, used to identify vCenter permission issues, reports that some vpxd solution users are "direct or indirect members of Administrators group and should be fixed". Running python authz-doctor.py solution_users --action fix removes the direct members of the Administrators group, but the indirect members remain and must be removed manually. Example output of the authz-doctor.py script:

          root@vCenter [ /usr/lib/vmware-vpx/scripts/authz-doctor ]# python authz-doctor.py solution_users --action check
          authz-doctor version: 9.0.0.0-14454563
          Following users are direct or indirect members of Administrators group and should be fixed
          vpxd-bb2#####-####-####-####-###########: Administrators
          vpxd-bb2#####-####-####-####-###########: SystemConfiguration.Administrators => Administrators
          vpxd-svc-acct-bb2#####-####-####-####-###########: Administrators
          vpxd-svc-acct-bb2#####-####-####-####-###########: SystemConfiguration.Administrators => Administrators
          vpxd-extension-bb2#####-####-####-####-###########: Administrators
          vpxd-extension-bb2#####-####-####-####-###########: SystemConfiguration.Administrators => Administrators

          root@vCenter [ /usr/lib/vmware-vpx/scripts/authz-doctor ]# python authz-doctor.py solution_users --action fix
          authz-doctor version: 9.0.0.0-14454563
          -- Checking direct members of Administrators group ...
          Removing direct members of Administrators group
          Fix Administrators group: True
          -- Checking indirect members of Administrators group ...
          vpxd-bb2#####-####-####-####-########### is indirect member of group Administrators
          vpxd-bb2#####-####-####-####-###########: SystemConfiguration.Administrators => Administrators
          vpxd-svc-acct-bb2#####-####-####-####-########### is indirect member of group Administrators
          vpxd-svc-acct-bb2#####-####-####-####-###########: SystemConfiguration.Administrators => Administrators
          vpxd-extension-bb2#####-####-####-####-########### is indirect member of group Administrators
          vpxd-extension-bb2#####-####-####-####-###########: SystemConfiguration.Administrators => Administrators
          -- Checking vpxd-extension-XXXX user
          vpxd-extension-XXXX user is OK
          -- Result:
          Group membership changed, please restart VCSA services.
            # service-control --stop --all
            # service-control --start --all
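
The check output above mixes two kinds of entries: a user listed directly against Administrators, and a user that reaches Administrators through SystemConfiguration.Administrators. The following is a minimal Python sketch for separating the two, assuming the output format matches the sample shown above. The function name classify_memberships is hypothetical and is not part of authz-doctor.

```python
def classify_memberships(check_output):
    """Split 'check' output lines into direct and indirect memberships.

    Assumes membership lines look like '<user>: <group> [=> <group> ...]',
    as in the sample output. Every other line is ignored.
    """
    direct, indirect = [], []
    for raw in check_output.splitlines():
        user, sep, chain = raw.strip().partition(": ")
        if not sep:
            continue  # no '<user>: ' prefix, so not a membership entry
        groups = [g.strip() for g in chain.split("=>")]
        if groups[-1] != "Administrators":
            continue  # banner or status line such as the version string
        if len(groups) == 1:
            direct.append(user)
        else:
            # Keep the intermediate groups that lead to Administrators.
            indirect.append((user, groups[:-1]))
    return direct, indirect
```

Entries returned in the second list are the ones that, per the behavior described above, the fix action reports but leaves in place.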

Environment

  • vCenter 8.x
  • vCenter 9.x
  • VMware vSphere Kubernetes Service

Cause

The vpxd-extension solution user must not be a member of the Administrators group, directly or indirectly, because the Administrators group has only read-only permissions on the VKS Supervisor inventory. In addition, the SystemConfiguration.Administrators group must not be a member of the Administrators group.

Because the vpxd service accounts are direct or indirect members of the Administrators group, their privileges are reduced, which prevents the wcp workflow from modifying vSphere objects.
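
To illustrate how an indirect membership arises, the sketch below walks a toy group model and lists every chain of groups that leads from a principal to a target group. It does not query vCenter SSO. The membership data and the function name membership_chains are hypothetical and exist only for illustration.

```python
def membership_chains(members, principal, target, _seen=frozenset()):
    """Return every chain of groups leading from principal up to target.

    members maps a group name to the set of its immediate members
    (users or other groups). _seen guards against membership cycles.
    """
    chains = []
    for group, group_members in members.items():
        if principal not in group_members or group in _seen:
            continue
        if group == target:
            chains.append([group])
        # Follow the group upward in case it is itself nested in another group.
        for rest in membership_chains(members, group, target, _seen | {group}):
            chains.append([group] + rest)
    return chains


# Hypothetical model mirroring the situation described in this article.
example_members = {
    "Administrators": {"SystemConfiguration.Administrators"},
    "SystemConfiguration.Administrators": {"vpxd-extension-XXXX"},
}
```

With this model, membership_chains(example_members, "vpxd-extension-XXXX", "Administrators") yields a single chain through SystemConfiguration.Administrators, which is the indirect membership that has to be broken.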

Resolution

     1. Remove the "direct" members from the Administrators group:

              1. From the /usr/lib/vmware-vpx/scripts/authz-doctor directory, run "python authz-doctor.py solution_users --action check" to check the group membership.

              2. From the same directory, run "python authz-doctor.py solution_users --action fix" to remove the direct members.

                   Example output when the group membership is already correct:

          root@vCenter [ /usr/lib/vmware-vpx/scripts/authz-doctor ]# python authz-doctor.py solution_users --action check
          authz-doctor version: 9.0.0.0-14454563
          Group membership looks OK

          root@vCenter [ /usr/lib/vmware-vpx/scripts/authz-doctor ]# python authz-doctor.py solution_users --action fix
          authz-doctor version: 9.0.0.0-14454563
          -- Checking direct members of Administrators group ...
          Administrators group is OK
          -- Checking indirect members of Administrators group ...
          -- Checking vpxd-extension-XXXX user
          vpxd-extension-XXXX user is OK
          -- Result:
          Nothing to do. Your environment is OK

     2. Remove the "indirect" member of the Administrators group:

               1. Using the CLI:

                    Run the command below to confirm that SystemConfiguration.Administrators is a member of the Administrators group.

          /usr/lib/vmware-vmafd/bin/dir-cli group list --name Administrators

               2. Using the vCenter UI:

                   The same can be confirmed in the vCenter Server UI. Navigate to Menu > Administration > Users and Groups > Groups, click the Administrators group, remove the member "SystemConfiguration.Administrators", and click OK.
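
If the group listing has been saved to a file or captured as text, the check for the nested group can be sketched in Python as below. The exact output format of the dir-cli group list command is not shown in this article, so the sketch accepts either bare names or DN-style entries beginning with "CN="; that tolerance is an assumption. The function names are hypothetical.

```python
def listed_members(listing):
    """Return member names from a group listing, one per non-empty line.

    Assumption: each line is either a bare name or a DN whose first
    component is 'CN=<name>'. Other formats are returned unchanged.
    """
    names = []
    for raw in listing.splitlines():
        entry = raw.strip()
        if not entry:
            continue
        first_rdn = entry.split(",", 1)[0]
        if first_rdn.lower().startswith("cn="):
            entry = first_rdn[3:]
        names.append(entry)
    return names


def has_member(listing, name="SystemConfiguration.Administrators"):
    """Case-insensitive test for a member name in the listing text."""
    wanted = name.lower()
    return any(member.lower() == wanted for member in listed_members(listing))
```

A result of True from has_member means the nested group is still present and the removal in the UI step above has not yet taken effect.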

Note: It is not required to restart the wcp service after performing the resolution steps.

After completing the steps above, the VKS Supervisor object removal and cleanup process should complete as expected.
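
One way to confirm that the cleanup is progressing is to check that the permission-denied events stop accumulating in wcpsvc.log. The sketch below counts those events per operation ID and resource pool. It relies only on the message text and the "[opID=...]" and "resgroup-..." tokens seen in the log excerpt in this article. The function name summarize_denied is hypothetical.

```python
import re
from collections import Counter

DENIED = "Permission to perform this operation was denied"
OPID_RE = re.compile(r"\[opID=([^\]]+)\]")
RESGROUP_RE = re.compile(r"(resgroup-[^\s:]+)")


def summarize_denied(lines):
    """Count permission-denied events per (opID, resource pool) pair.

    Lines without an opID or resource pool token are counted under '-'.
    """
    counts = Counter()
    for line in lines:
        if DENIED not in line:
            continue
        op = OPID_RE.search(line)
        pool = RESGROUP_RE.search(line)
        key = (op.group(1) if op else "-", pool.group(1) if pool else "-")
        counts[key] += 1
    return counts
```

For example, opening /var/log/vmware/wcp/wcpsvc.log, passing the file object to summarize_denied, and comparing the totals from two runs a few minutes apart shows whether new events are still being logged.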

Additional Information

Using the "authz-doctor" tool to identify vCenter permission issues

Fixing missing SSO Group Memberships for vSphere Solution Users with the solution_users_fixer script