kapp-controller fails to reconcile vsphere-csi app after upgrading TKG from v1.4.0 to v1.5.4
Article ID: 345701

Products

VMware

Issue/Introduction

Symptoms:
After upgrading TKG from 1.4.x to 1.5.x, the vsphere-csi app fails to reconcile on dev plan clusters:
ubuntu@jumpbox:~$ kubectl get apps -A
NAMESPACE    NAME                                DESCRIPTION                                                                       SINCE-DEPLOY   AGE
default      tkg-workload-kasi-kapp-controller   Reconcile succeeded                                                               59s            125m
tkg-system   antrea                              Reconcile succeeded                                                               2m38s          161m
tkg-system   metrics-server                      Reconcile succeeded                                                               2m55s          161m
tkg-system   tanzu-addons-manager                Reconcile succeeded                                                               3m20s          164m
tkg-system   tanzu-core-management-plugins       Reconcile succeeded                                                               31s            110m
tkg-system   tanzu-featuregates                  Reconcile succeeded                                                               118s           110m
tkg-system   vsphere-cpi                         Reconcile succeeded                                                               2m50s          161m
tkg-system   vsphere-csi                         Reconcile failed: Deploying: Error (see .status.usefulErrorMessage for details)   4m             161m

The upgrade command completes, but prints the following warning:

Warning: Cluster is upgraded successfully, but some packages are failing. Failure while waiting for packages to be installed: package reconciliation failed: I0302 07:39:43.498254     335 request.go:665] Waited for 1.028652051s due to client-side throttling, not priority and fairness, request: GET:https://100.64.0.1:443/apis/cluster.x-k8s.io/v1alpha4?timeout=32s
kapp: Error: waiting on reconcile deployment/vsphere-csi-controller (apps/v1) namespace: kube-system:
  Finished unsuccessfully (Deployment is not progressing: ProgressDeadlineExceeded (message: ReplicaSet "vsphere-csi-controller-74fbc44755" has timed out progressing.))
Management cluster 'tkg-mgmt-kasi' successfully upgraded to TKG version 'v1.5.4' with kubernetes version 'v1.22.9+vmware.1'
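On clusters with many apps, a small filter over the kubectl get apps -A output can list only the apps whose description starts with "Reconcile failed". This is a minimal sketch: the function name failed_apps is hypothetical, and it assumes the default column layout shown above (NAMESPACE and NAME first, then the DESCRIPTION text).

```shell
#!/usr/bin/env bash

# failed_apps (hypothetical helper): reads `kubectl get apps -A` output on
# stdin, skips the header row, and prints "<namespace>/<name>" for every app
# whose DESCRIPTION begins with "Reconcile failed".
failed_apps() {
  awk 'NR > 1 && $3 == "Reconcile" && $4 ~ /^failed/ { print $1 "/" $2 }'
}
```

For the output above, kubectl get apps -A | failed_apps prints tkg-system/vsphere-csi. The DESCRIPTION column points to .status.usefulErrorMessage for details, which can be read with kubectl get app vsphere-csi -n tkg-system -o jsonpath='{.status.usefulErrorMessage}'.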

Environment

VMware Tanzu Kubernetes Grid 1.x

Cause

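In vSphere CSI v2.4, the default number of CSI controller replicas changed from 1 to 3. Clusters with fewer than 3 control plane nodes (in practice a single node, because control plane node counts are kept odd to preserve etcd quorum) cannot bring the vsphere-csi-controller deployment to 3 ready replicas, which is the ProgressDeadlineExceeded error shown above. As a result, kapp-controller cannot match the current number of CSI replicas to the desired state, and the vsphere-csi app fails to reconcile.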
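The mismatch described above can be checked directly before applying the workaround. This is a sketch under stated assumptions: the helper names replicas_fit and check_csi_replicas are hypothetical, control plane nodes are assumed to carry the standard node-role.kubernetes.io/control-plane label, the deployment name and namespace are taken from the error message above, and, as the Cause implies, the desired replica count is assumed to be satisfiable only when it does not exceed the number of control plane nodes.

```shell
#!/usr/bin/env bash

# replicas_fit (hypothetical): succeeds when the desired replica count ($1)
# is no larger than the number of control plane nodes ($2).
replicas_fit() {
  local desired=$1 control_plane=$2
  [ "$desired" -le "$control_plane" ]
}

# check_csi_replicas (hypothetical): compares the deployment's desired
# replicas with the control plane node count. Run it with kubectl pointed
# at the cluster whose vsphere-csi app fails to reconcile.
check_csi_replicas() {
  local desired control_plane
  desired=$(kubectl get deployment vsphere-csi-controller -n kube-system \
    -o jsonpath='{.spec.replicas}') || return 1
  # Assumed label; arithmetic expansion strips padding some wc versions add.
  control_plane=$(( $(kubectl get nodes -l node-role.kubernetes.io/control-plane \
    --no-headers | wc -l) ))
  if replicas_fit "$desired" "$control_plane"; then
    echo "OK: ${desired} replica(s), ${control_plane} control plane node(s)"
  else
    echo "Mismatch: ${desired} replica(s), only ${control_plane} control plane node(s)"
    return 1
  fi
}
```

On an affected dev plan cluster, check_csi_replicas would be expected to report a mismatch of 3 replicas against 1 control plane node.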

Resolution

Workaround: Set deployment_replicas in the vsphere-csi addon data values to match the number of control plane nodes.

1. Make sure your current kubectl context points to the management cluster (kubectl config use-context <management cluster context name>).

2. Export the data values stored in the <CLUSTER_NAME>-vsphere-csi-addon secret in the management cluster to a local values.yaml file. For example:
kubectl get secrets <CLUSTER_NAME>-vsphere-csi-addon -n tkg-system -o jsonpath={.data.values\\.yaml} | base64 -d > values.yaml

3. Add the following line to the end of values.yaml, setting the value to the number of control plane nodes (1 for a dev plan cluster):
deployment_replicas: 1

4. Replace the <CLUSTER_NAME>-vsphere-csi-addon secret with the edited values. For example:
kubectl create secret generic <CLUSTER_NAME>-vsphere-csi-addon -n tkg-system --type=tkg.tanzu.vmware.com/addon --from-file=values.yaml=values.yaml --dry-run=client -o yaml | kubectl replace -f -

5. kubectl replace removes the existing labels from the secret, so add them back to <CLUSTER_NAME>-vsphere-csi-addon. For example:
kubectl label secret <CLUSTER_NAME>-vsphere-csi-addon -n tkg-system tkg.tanzu.vmware.com/cluster-name=<CLUSTER_NAME>
kubectl label secret <CLUSTER_NAME>-vsphere-csi-addon -n tkg-system tkg.tanzu.vmware.com/addon-name=vsphere-csi

6. Wait about 10 minutes for kapp-controller to reconcile the change, then confirm that vsphere-csi runs 1 replica and that its status has changed to Reconcile succeeded. For example:
ubuntu@jumpbox:~$ kubectl get pkgi vsphere-csi -n tkg-system
NAME          PACKAGE NAME                   PACKAGE VERSION        DESCRIPTION           AGE
vsphere-csi   vsphere-csi.tanzu.vmware.com   2.4.1+vmware.1-tkg.1   Reconcile succeeded   10d
Note: Replace <CLUSTER_NAME> with the name of the cluster on which vsphere-csi fails to reconcile.
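Steps 2 through 5 can be combined into one script. This is a sketch, not an official tool: the function names patch_csi_replicas and ensure_replicas_line are hypothetical, kubectl is assumed to already point at the management cluster (step 1), the secret is assumed to live in tkg-system as in the commands above, and deployment_replicas is assumed to be a top-level key as in step 3. Unlike a plain append, it replaces an existing deployment_replicas line, so rerunning it does not add a duplicate entry.

```shell
#!/usr/bin/env bash

# ensure_replicas_line (hypothetical): sets "deployment_replicas: N" in the
# given values file, replacing an existing top-level entry or appending one.
ensure_replicas_line() {
  local file=$1 replicas=$2 tmp
  tmp=$(mktemp) || return 1
  if grep -q '^deployment_replicas:' "$file"; then
    sed "s/^deployment_replicas:.*/deployment_replicas: ${replicas}/" "$file" > "$tmp"
  else
    cat "$file" > "$tmp"
    # If the file lacks a trailing newline, add one so the entry starts a new line.
    [ -n "$(tail -c 1 "$file")" ] && echo >> "$tmp"
    echo "deployment_replicas: ${replicas}" >> "$tmp"
  fi
  mv "$tmp" "$file"
}

# patch_csi_replicas (hypothetical): steps 2-5 above for one cluster.
patch_csi_replicas() {
  local cluster=$1 replicas=${2:-1} encoded
  local secret="${cluster}-vsphere-csi-addon"
  # Step 2: export the current data values (fail early if the secret is missing).
  encoded=$(kubectl get secret "$secret" -n tkg-system \
    -o jsonpath='{.data.values\.yaml}') || return 1
  [ -n "$encoded" ] || { echo "no values.yaml in ${secret}" >&2; return 1; }
  printf '%s' "$encoded" | base64 -d > values.yaml || return 1
  # Step 3: set the replica count.
  ensure_replicas_line values.yaml "$replicas" || return 1
  # Step 4: replace the secret with the edited values.
  kubectl create secret generic "$secret" -n tkg-system \
    --type=tkg.tanzu.vmware.com/addon \
    --from-file=values.yaml=values.yaml \
    --dry-run=client -o yaml | kubectl replace -f - || return 1
  # Step 5: restore the labels that kubectl replace removed.
  kubectl label secret "$secret" -n tkg-system --overwrite \
    "tkg.tanzu.vmware.com/cluster-name=${cluster}" \
    tkg.tanzu.vmware.com/addon-name=vsphere-csi
}
```

For the management cluster in the example above, this would be run as patch_csi_replicas tkg-mgmt-kasi 1, followed by the check in step 6. The script writes values.yaml to the current directory; that file contains the addon's data values, so delete it once the secret is updated.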