PVCs fail to attach to pods in VKS guest clusters after a resync operation is performed on a vCenter that hosts a Tanzu Supervisor cluster and its guest clusters.
Describing an affected cluster with "kubectl describe cluster <clustername> -n <namespace>" shows events similar to the following:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 53m #######-######### 0/7 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/7 nodes are available: 7 No preemption victims found for incoming pod..
Warning FailedAttachVolume 49m (x2 over 51m) ############-########## AttachVolume.Attach failed for volume "###-########-####-####-####-############" : timed out waiting for ########-######## of csi.vsphere.vmware.com CSI driver to attach volume ########-####-####-####-############-########-####-####-####-############
Warning FailedMount 19m (x3 over 44m) kubelet Unable to attach or mount volumes: unmounted volumes=[##########-##########-####### ##########-##########-########], unattached volumes=[####-###-######-##### ##########-##########-####### ##########-##########-########]: timed out waiting for the condition
Warning FailedAttachVolume 3m13s (x9 over 5m26s) ############-########## AttachVolume.Attach failed for volume "###-########-####-####-####-############" : rpc error: code = Internal desc = observed Error: "Post \"https://##########.##########.domain.com:443/sdk\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" is set on the volume "########-####-####-####-############-########-####-####-####-############" on virtualmachine "###-##-###-#######-##-#-####-######-########-##-#####-#########"
Warning FailedAttachVolume 3m13s (x9 over 5m25s) ############-########## AttachVolume.Attach failed for volume "###-########-####-####-####-############" : rpc error: code = Internal desc = observed Error: "Post \"https://##########.##########.domain.com:443/sdk\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" is set on the volume "########-####-####-####-############-########-####-####-####-############" on virtualmachine "###-##-###-#######-##-#-####-######-########-##-#####-#########"
On the vCenter, wcpsvc.log shows permission errors similar to the following:
YYYY-MM-DDTHH:MM:SS debug wcp [workload/controller.go:938] [opID=######-##-########=######-##] Setting permissions for workload: ######-##
YYYY-MM-DDTHH:MM:SS debug wcp [workload/controller.go:997] [opID=######-##-########=######-##] Setting permissions on entity: ########-#####
YYYY-MM-DDTHH:MM:SS error wcp [workload/controller.go:999] [opID=######-##-########=######-##] Failed to set roles for administrators in workload: ######-##, entity: ########-#####, err: ServerFaultCode: Permission to perform this operation was denied.
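To see how many workloads are affected, a copy of wcpsvc.log can be scanned for the "Failed to set roles" message shown above. The following is a minimal sketch, not a supported tool: the function name role_failures is hypothetical, and the regular expression assumes the message layout matches the sample lines in this article.

```python
import re

# Layout assumed from the sample wcpsvc.log error line in this article.
ROLE_FAILURE_RE = re.compile(
    r"Failed to set roles for administrators in workload: (?P<workload>[^,]+), "
    r"entity: (?P<entity>[^,]+), err: (?P<err>.*)"
)


def role_failures(lines):
    """Return (workload, entity, error) tuples for each role-assignment failure."""
    failures = []
    for line in lines:
        match = ROLE_FAILURE_RE.search(line)
        if match:
            failures.append(
                (match.group("workload"), match.group("entity"), match.group("err").strip())
            )
    return failures
```

Passing an open file handle for the log (for example, role_failures(open(path)) with path set to the location of wcpsvc.log on the appliance) returns one tuple per failed workload/entity pair.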
vmdird.log contains credential errors for the wcp or workload_storage_management service principals, similar to the following:
YYYY-MM-DDTHH:MM:SS err vmdird t@###############: SASLSessionStep: sasl error (-13)(SASL(-13): authentication failure: client evidence does not match what we calculated. Probably a password error)
YYYY-MM-DDTHH:MM:SS warning vmdird t@###############: Lockout policy check - account lockout. (cn=workload_storage_management-########-####-####-####-############,cn=serviceprincipals,dc=##########,dc=##########)
YYYY-MM-DDTHH:MM:SS err vmdird t@###############: VdirPasswordFailEvent from user(cn=workload_storage_management-########-####-####-####-############,cn=serviceprincipals,dc=##########,dc=##########), error(0)()
YYYY-MM-DDTHH:MM:SS err vmdird t@###############: VmDirSendLdapResult: Request (Bind), Error (LDAP_INVALID_CREDENTIALS(49)), Message ((49)(SASL step failed.)), (0) socket (127.0.0.1)
YYYY-MM-DDTHH:MM:SS err vmdird t@###############: Bind Request Failed (127.0.0.1) error 49: Protocol version: 3, Bind DN: "CN=workload_storage_management-########-####-####-####-############,cn=ServicePrincipals,dc=##########,dc=##########", Method: SASL
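The vmdird.log entries above can be summarized per service principal to show which accounts are failing authentication. This is a hedged sketch only: the helper name failing_principals and the list of failure markers are assumptions derived from the sample lines in this article, and other vCenter builds may log different text.

```python
import re

# Substrings taken from the sample vmdird.log lines in this article.
FAILURE_MARKERS = (
    "account lockout",
    "VdirPasswordFailEvent",
    "Bind Request Failed",
)

# Captures the leading CN of a service principal DN, ignoring case.
PRINCIPAL_RE = re.compile(r'cn=([^,"]+),cn=serviceprincipals', re.IGNORECASE)


def failing_principals(lines):
    """Map each service principal name to the number of failure lines naming it."""
    counts = {}
    for line in lines:
        if not any(marker in line for marker in FAILURE_MARKERS):
            continue  # not one of the failure messages we look for
        match = PRINCIPAL_RE.search(line)
        if match:
            name = match.group(1)
            counts[name] = counts.get(name, 0) + 1
    return counts
```

Lines that report a failure without naming a principal (such as the SASLSessionStep and VmDirSendLdapResult entries) are skipped, so the result lists only accounts that can be identified from the log.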
vCenter 8.X
vSphere with Tanzu
During the SSO domain resync operation, the solution users for Tanzu and WCP experienced errors.
Solution user permissions can be confirmed using the authz-doctor.py tool.
See the following KB: Using the "authz-doctor" tool to identify vCenter permission issues
NOTE: Please make sure proper backups and snapshots of the vCenter are in place before making any changes to the environment.
Run the following command from the linked KB to confirm if there are solution-user errors on the vCenter:
/usr/lib/vmware-vpx/scripts/authz-doctor/authz-doctor.py solution_users # --check
The command produces output similar to the following:
root@############ [~ ]# /usr/lib/vmware-vpx/scripts/authz-doctor/authz-doctor.py solution_users # --check
authz-doctor version: 9.0.0.0-14454563
Following users are direct or indirect members of Administrators group and should be fixed
vpxd-extension-########-####-####-####-############: Administrators
vpxd-########-####-####-####-############: Administrators
If the output is similar, follow the steps in the "Fixing solution users group membership" section of the KB linked above to repair the solution users.
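If the authz-doctor output has been saved to a file, the flagged users can be extracted with a short script before and after the repair to confirm that the list is empty afterward. The following is a sketch under stated assumptions: the function name flagged_solution_users is hypothetical, and the "<user>: <group>[, <group>...]" line format is inferred from the sample output in this article rather than from the tool's documentation.

```python
def flagged_solution_users(output):
    """Return user names reported as members of the Administrators group."""
    users = []
    for line in output.splitlines():
        name, sep, groups = line.strip().partition(": ")
        if not sep:
            continue  # not a "<user>: <groups>" line
        # Assumed format: one or more group names separated by commas.
        group_names = [group.strip() for group in groups.split(",")]
        if "Administrators" in group_names:
            users.append(name)
    return users
```

The version banner and the explanatory sentence in the output are ignored because they do not list "Administrators" as a group after a "user: " prefix.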