Manual or Velero-initiated volume snapshots fail for a TKr workload cluster in a vSphere with Tanzu environment.
Errors similar to those below are observed:
For Manual Backup - Failed to check and update snapshot content: failed to take snapshot of the volume ########-####-####-####-############-########-####-####-####-############: "rpc error: code = Internal desc = failed to get volumesnapshot with name: ########-####-####-####-############-########-####-####-####-############a on namespace: <NAMESPACE> in supervisorCluster. Error: volumesnapshots.snapshot.storage.k8s.io \"########-####-####-####-############-########-####-####-####-############\" is forbidden: User \"system:serviceaccount:<NAMESPACE>:<ACCOUNT NAME>\" cannot get resource \"volumesnapshots\" in API group \"snapshot.storage.k8s.io\" in the namespace \"<NAMESPACE>\""
For Velero Backup - Volumesnapshotcontent snapcontent-########-####-####-####-############ has error: Failed to check and update snapshot content: failed to take snapshot of the volume ########-####-####-####-############-########-####-####-####-############: \"rpc error: code = Internal desc = failed to get volumesnapshot with name: ########-####-####-####-############-########-####-####-####-############ on namespace: <NAMESPACE> in supervisorCluster. Error: volumesnapshots.snapshot.storage.k8s.io \\\"########-####-####-####-############-########-####-####-####-############\\\" is forbidden: User \\\"system:serviceaccount:<CLUSTER>:<PROVIDER-SERVICE-ACCOUNT>-pvcsi\\\" cannot get resource \\\"volumesnapshots\\\" in API group \\\"snapshot.storage.k8s.io\\\" in the namespace \\\"<NAMESPACE>\\\"\"" backup=velero/<backup> cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/util/util.go:###" pluginName=velero-plugin-for-csi
vSphere with Tanzu
Permissions did not fully propagate to the provider service account during its creation.
Use the following steps and commands to correct any issues with permission propagation to the provider service account.
Update the placeholder values in the commands below to match the affected environment.
1) Access the supervisor cluster using the following documentation:
Connect to Supervisor Using the Tanzu CLI and vCenter SSO Authentication
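- As an example only (the exact command and flags depend on the environment and on whether the vSphere Plugin for kubectl is in use): kubectl vsphere login --server=<SUPERVISOR ADDRESS> --vsphere-username <USER>@<SSO DOMAIN>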
2) Within the supervisor cluster, review the provider service account permissions
- The provider service account has a name similar to <CLUSTER NAME>-pvcsi in the cluster namespace
- Use the following command to list all provider service accounts: kubectl get providerserviceaccount.vmware.infrastructure.cluster.x-k8s.io -n <NAMESPACE>
- Use the following command to review the configuration of the provider service account in question: kubectl describe providerserviceaccount.vmware.infrastructure.cluster.x-k8s.io <CLUSTER NAME>-pvcsi -n <NAMESPACE>
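- Optionally, check whether the account currently has the permission named in the error. A quick check, assuming the service account name matches the one reported in the error message: kubectl auth can-i get volumesnapshots.snapshot.storage.k8s.io -n <NAMESPACE> --as=system:serviceaccount:<NAMESPACE>:<CLUSTER NAME>-pvcsi
- A response of "no" confirms the permissions did not propagate.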
3) Take a backup of the provider service account using the following command: kubectl get providerserviceaccount.vmware.infrastructure.cluster.x-k8s.io <CLUSTER NAME>-pvcsi -n <NAMESPACE> -o yaml > <CLUSTER NAME>-PSA.yaml
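- Before proceeding, review the saved file (for example: cat <CLUSTER NAME>-PSA.yaml) to confirm it contains the complete object definition.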
4) Delete the provider service account using the following command: kubectl delete providerserviceaccount.vmware.infrastructure.cluster.x-k8s.io <CLUSTER NAME>-pvcsi -n <NAMESPACE>
5) Annotate or update the TKC (Tanzu Kubernetes cluster) in question in order to trigger a reconciliation. This should be done with an innocuous, non-functional change. See the example below:
kubectl patch vspherecsiconfig <CLUSTER>-vsphere-pv-csi-package -n <NAMESPACE> --type='merge' -p '{"metadata":{"labels":{"tkg.tanzu.vmware.com/force-reconcile":"true"}}}'
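- To confirm the patch was applied, the labels on the object can be reviewed (example only): kubectl get vspherecsiconfig <CLUSTER>-vsphere-pv-csi-package -n <NAMESPACE> --show-labels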
6) Validate the provider service account was recreated and appears in the list of provider service accounts: kubectl get providerserviceaccount.vmware.infrastructure.cluster.x-k8s.io -n <NAMESPACE>
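If the provider service account is not recreated automatically, the backup taken in step 3 can be re-applied as a fallback after removing server-generated fields such as metadata.resourceVersion and metadata.uid from the YAML: kubectl apply -f <CLUSTER NAME>-PSA.yaml
Once the account is present again, re-running the permission check from step 2 should return "yes", and the previously failing backup should complete successfully.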