After performing vCenter SSO Domain resync operation, PVCs fail to attach to VKS Cluster Pods

Products

VMware vSphere Kubernetes Service

Issue/Introduction

PVCs fail to attach to VKS guest cluster pods after a resync operation is performed for a vCenter hosting a Tanzu Supervisor cluster and guest clusters.

Describing an affected pod using "kubectl describe pod <podname> -n <namespace>" shows events similar to the following:

Events:
  Type     Reason              Age                   From                     Message
  ----     ------              ----                  ----                     -------
  Warning  FailedScheduling    53m                   #######-#########        0/7 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/7 nodes are available: 7 No preemption victims found for incoming pod..
  Warning  FailedAttachVolume  49m (x2 over 51m)     ############-##########  AttachVolume.Attach failed for volume "###-########-####-####-####-############" : timed out waiting for ########-######## of csi.vsphere.vmware.com CSI driver to attach volume ########-####-####-####-############-########-####-####-####-############
  Warning  FailedMount         19m (x3 over 44m)     kubelet                  Unable to attach or mount volumes: unmounted volumes=[##########-##########-####### ##########-##########-########], unattached volumes=[####-###-######-##### ##########-##########-####### ##########-##########-########]: timed out waiting for the condition
  Warning  FailedAttachVolume  3m13s (x9 over 5m26s)  ############-##########  AttachVolume.Attach failed for volume "###-########-####-####-####-############" : rpc error: code = Internal desc = observed Error: "Post \"https://##########.##########.domain.com:443/sdk\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" is set on the volume "########-####-####-####-############-########-####-####-####-############" on virtualmachine "###-##-###-#######-##-#-####-######-########-##-#####-#########"
  Warning  FailedAttachVolume  3m13s (x9 over 5m25s)  ############-##########  AttachVolume.Attach failed for volume "###-########-####-####-####-############" : rpc error: code = Internal desc = observed Error: "Post \"https://##########.##########.domain.com:443/sdk\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" is set on the volume "########-####-####-####-############-########-####-####-####-############" on virtualmachine "###-##-###-#######-##-#-####-######-########-##-#####-#########"
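
When the describe output is long, the volume-related warnings can be isolated with a small filter. The following is only a sketch: the function name filter_volume_events is hypothetical, and it matches nothing more than the event reasons shown in the sample above.

```shell
# Hypothetical helper: keep only the volume-related warning events
# (FailedAttachVolume, FailedMount) from text read on standard input.
filter_volume_events() {
  grep -E 'Warning[[:space:]]+(FailedAttachVolume|FailedMount)[[:space:]]'
}

# Example usage: pipe the output of the kubectl describe command
# shown above into the filter, for example:
#   kubectl describe ... | filter_volume_events
```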

On the vCenter, wcpsvc.log shows permission errors similar to the following:

YYYY-MM-DDTHH:MM:SS debug wcp [workload/controller.go:938] [opID=######-##-########=######-##] Setting permissions for workload: ######-##
YYYY-MM-DDTHH:MM:SS debug wcp [workload/controller.go:997] [opID=######-##-########=######-##] Setting permissions on entity: ########-#####
YYYY-MM-DDTHH:MM:SS error wcp [workload/controller.go:999] [opID=######-##-########=######-##] Failed to set roles for administrators in workload: ######-##, entity: ########-#####, err: ServerFaultCode: Permission to perform this operation was denied.
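
To see which workloads and entities are affected, the failing entries can be pulled out of the log. This is a minimal sketch under assumptions: the function name wcp_role_failures is hypothetical, the line format is assumed to match the excerpt above, and the log path in the usage comment is the usual location on the vCenter appliance and should be verified before use.

```shell
# Hypothetical helper: print unique "workload entity" pairs taken from
# "Failed to set roles" errors in the log file given as the first argument.
wcp_role_failures() {
  sed -n 's/.*Failed to set roles for administrators in workload: \([^,]*\), entity: \([^,]*\), err: .*/\1 \2/p' "$1" \
    | sort -u
}

# Example usage (path is an assumption, confirm it on your appliance):
#   wcp_role_failures /var/log/vmware/wcp/wcpsvc.log
```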

vmdird.log contains wcp or workload-storage-management credential errors:

YYYY-MM-DDTHH:MM:SS err vmdird  t@###############: SASLSessionStep: sasl error (-13)(SASL(-13): authentication failure: client evidence does not match what we calculated. Probably a password error)
YYYY-MM-DDTHH:MM:SS warning vmdird  t@###############: Lockout policy check - account lockout. (cn=workload_storage_management-########-####-####-####-############,cn=serviceprincipals,dc=##########,dc=##########)
YYYY-MM-DDTHH:MM:SS err vmdird  t@###############: VdirPasswordFailEvent from user(cn=workload_storage_management-########-####-####-####-############,cn=serviceprincipals,dc=##########,dc=##########), error(0)()
YYYY-MM-DDTHH:MM:SS err vmdird  t@###############: VmDirSendLdapResult: Request (Bind), Error (LDAP_INVALID_CREDENTIALS(49)), Message ((49)(SASL step failed.)), (0) socket (127.0.0.1)
YYYY-MM-DDTHH:MM:SS err vmdird  t@###############: Bind Request Failed (127.0.0.1) error 49: Protocol version: 3, Bind DN: "CN=workload_storage_management-########-####-####-####-############,cn=ServicePrincipals,dc=##########,dc=##########", Method: SASL
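
The locked-out service principals can be listed directly from the log rather than read line by line. The sketch below assumes the "account lockout" line format shown in the excerpt above; the function name locked_service_principals is hypothetical, and the log path in the usage comment is the usual location on the vCenter appliance and should be verified.

```shell
# Hypothetical helper: print the unique service principal names that
# triggered an account lockout in the log file given as the first argument.
locked_service_principals() {
  sed -n 's/.*account lockout\. (cn=\([^,]*\),cn=serviceprincipals.*/\1/p' "$1" \
    | sort -u
}

# Example usage (path is an assumption, confirm it on your appliance):
#   locked_service_principals /var/log/vmware/vmdird/vmdird.log
```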

Environment

vCenter 8.X
vSphere with Tanzu

Cause

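During the SSO domain resync operation, the solution users for Tanzu and WCP experienced errors. As the log excerpts above show, this leaves them failing authentication against vmdird and being denied permission when WCP sets roles.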

Resolution

Solution user permissions can be confirmed using the authz-doctor.py tool.
See the following KB: Using the "authz-doctor" tool to identify vCenter permission issues

NOTE: Please make sure proper backups and snapshots of the vCenter are in place before making any changes to the environment.

Run the following command from the linked KB to confirm if there are solution-user errors on the vCenter:

/usr/lib/vmware-vpx/scripts/authz-doctor/authz-doctor.py solution_users # --check

The command produces output similar to the following:

root@############  [~ ]# /usr/lib/vmware-vpx/scripts/authz-doctor/authz-doctor.py solution_users # --check
authz-doctor version: 9.0.0.0-14454563
Following users are direct or indirect members of Administrators group and should be fixed
vpxd-extension-########-####-####-####-############: Administrators
vpxd-########-####-####-####-############: Administrators

If the output is similar, follow the steps in the "Fixing solution users group membership" section of the KB linked above to repair the solution users.
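
When recording which accounts need repair, the flagged user names can be extracted from the tool's output. This sketch relies only on the sample output shown above (one "name: Administrators" line per flagged user); the function name flagged_solution_users is hypothetical, and other output formats of the tool are not handled.

```shell
# Hypothetical helper: read authz-doctor output on standard input and
# print the name of each user reported as a member of Administrators.
flagged_solution_users() {
  sed -n 's/^\([^[:space:]:][^[:space:]:]*\): Administrators[[:space:]]*$/\1/p'
}

# Example usage: save the output of the command from the KB to a file,
# then run:
#   flagged_solution_users < authz-doctor-output.txt
```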