CSI: x509 untrusted certificate errors cause CSI pods to hang in CrashLoopBackOff state in Supervisor Cluster

Article ID: 319398

Products

VMware vSphere ESXi
VMware vSphere with Tanzu

Issue/Introduction

Symptoms:
TKGS Guest Clusters or Supervisor Clusters fail to create or attach PersistentVolumes or PersistentVolumeClaims to new nodes or pods.

When reviewing CSI logging on the Guest Cluster and Supervisor Cluster, you see errors similar to:
 
vsphere-csi-controller log: Found in /var/log/pods/vmware-system-csi_vsphere-csi-controller-<ID>/vsphere-csi-controller/#.log:
 
2023-03-05T15:36:51.911510192Z stderr F {"level":"error","time":"2023-03-05T15:36:51.911423721Z","caller":"wcp/controller.go:112","msg":"failed to get vcenter. err=Post \"https://vcenter.domain.com:443/sdk\": x509: certificate signed by unknown authority"
2023-03-05T15:36:51.911570549Z stderr F time="2023-03-05T15:36:51Z" level=fatal msg="grpc failed" error="Post \"https://vcenter.domain.com:443/sdk\": x509: certificate signed by unknown authority"


vsphere-syncer log: Found in /var/log/pods/vmware-system-csi_vsphere-csi-controller-<ID>/vsphere-syncer/#.log
 
2023-03-05T15:37:09.059962556Z stderr F {"level":"error","time":"2023-03-05T15:37:09.05981786Z","caller":"vsphere/virtualcenter.go:536","msg":"failed to connect to VirtualCenter host: \"vcenter.domain.com\". Err: Post \"https://vcenter.domain.com:443/sdk\": x509: certificate signed by unknown authority"
2023-03-05T15:37:09.061305211Z stderr F {"level":"error","time":"2023-03-05T15:37:09.059849923Z","caller":"syncer/main.go:180","msg":"Error initializing Cns Operator. Error: Post \"https://vcenter.domain.com:443/sdk\": x509: certificate signed by unknown authority
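
To gauge how often the error occurs, the log lines can be filtered for the x509 message. The following is a minimal sketch; the function name count_x509_errors is hypothetical and not part of any VMware tooling.

```shell
# Count log lines on stdin that report the untrusted-certificate error.
# Prints 0 (and still returns success) when no line matches.
count_x509_errors() {
  grep -c 'x509: certificate signed by unknown authority' || true
}

# Example (pod name is a placeholder):
#   kubectl logs -n vmware-system-csi vsphere-csi-controller-<ID> -c vsphere-syncer | count_x509_errors
```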
 
When listing the CSI pod on the Supervisor Cluster or Guest Cluster, you see it in a CrashLoopBackOff state with numerous restarts flagged and READY state showing fewer than 6/6:
 
kubectl get pods -A | egrep "NAME|csi"
NAMESPACE               NAME                               READY   STATUS               RESTARTS   AGE
vmware-system-csi     vsphere-csi-controller-<ID>     5/6     CrashLoopBackOff   6294      103d
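
The READY check above can also be done mechanically. The sketch below assumes the default column layout of "kubectl get pods -A" (NAMESPACE, NAME, READY, STATUS, RESTARTS); the function name csi_pods_not_ready is hypothetical.

```shell
# Print CSI pods whose READY count is below the expected container count.
# Reads `kubectl get pods -A` output (including the header line) on stdin.
csi_pods_not_ready() {
  awk 'NR > 1 && ($1 ~ /csi/ || $2 ~ /csi/) {
         split($3, r, "/")                 # READY column, e.g. "5/6"
         if (r[1] + 0 < r[2] + 0)
           print $1 "/" $2, $3, $4, "restarts=" $5
       }'
}

# Example:
#   kubectl get pods -A | csi_pods_not_ready
```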


Environment

VMware vSphere 8.0 with Tanzu
VMware vSphere 7.0 with Tanzu

Cause

The CSI pods on the TKGS Guest Clusters pass requests for provisioning, attaching, and syncing operations to the CSI pod on the Supervisor Cluster nodes. The Supervisor Cluster CSI pods authenticate to CNS (running on vCenter) using a certificate signed by the vCenter Certificate Authority. Once authentication is verified, the Supervisor VMs pass their CSI operations through to vCenter for action.

If the CSI controller is configured to verify the vCenter certificate (insecure-flag = "false") but its configuration does not reference the vCenter root certificate on the Supervisor node's local filesystem, verification of the certificate chain fails with the x509 error shown above.
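
The chain check described here can be reproduced by hand with openssl, assuming openssl is available on the node. This is an illustrative sketch only, not how the CSI driver performs verification internally; the function name verify_against_ca and the /tmp/vc.pem path are hypothetical.

```shell
# Return 0 if the certificate in $2 chains to the CA bundle in $1.
# Both arguments are paths to PEM files.
verify_against_ca() {
  openssl verify -CAfile "$1" "$2" >/dev/null 2>&1
}

# Example: save the certificate presented by vCenter, then test it against
# the root certificate referenced later in this article.
#   openssl s_client -connect vcenter.domain.com:443 </dev/null 2>/dev/null | openssl x509 -out /tmp/vc.pem
#   verify_against_ca /etc/vmware/wcp/tls/vmca.pem /tmp/vc.pem && echo trusted || echo untrusted
```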

Resolution


See resolution details here: https://github.com/kubernetes-sigs/vsphere-csi-driver/pull/1731

Workaround:
CAUTION: The steps below should be performed with a VMware Support Engineer.


Scoping: determine the values of "insecure-flag" and "ca-file" in the Supervisor Cluster CSI secret:
 

1. Check the configuration stored in vsphere-config-secret:
 
  • # kubectl get secrets vsphere-config-secret -n vmware-system-csi -o jsonpath='{.data.vsphere-cloud-provider\.conf}' | base64 -d
 
Example output:
[Global]
insecure-flag = "false"
ca-file = ""  ----------> This should not be empty if insecure-flag is false
cluster-id = "domain-c8"
cnsregistervolumes-cleanup-intervalinmin = 720
cluster-distribution = "SupervisorCluster"
[VirtualCenter "vcenter.domain.com"]
user = "workload_storage_management-2927599b-1e8a-453c-a5d2-3871cbda9671@vsphere.local"
password = "@#$srfed$%s-gh"
datacenters = "datacenter-1"
port = "443"
targetvSANFileShareClusters = "" 

In the above example, insecure-flag is set to "false", so CSI must verify the certificate presented by vCenter. Because ca-file is empty, CSI has no root CA to verify against and the check fails. The ca-file must reference a valid file for verification to succeed.

Copy the output of the above command for reference in the below steps.
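
The scoping check can be automated by piping the decoded configuration through a small filter. This is a minimal sketch that only looks at the insecure-flag and ca-file keys; the function name check_csi_ca_config and its output strings are hypothetical.

```shell
# Inspect a decoded vsphere-cloud-provider.conf on stdin.
# Prints "MISSING_CA_FILE" when insecure-flag is "false" and ca-file is empty,
# otherwise prints "OK".
check_csi_ca_config() {
  awk -F'=' '
    # Strip surrounding spaces, tabs and double quotes from a value.
    function val(s) { gsub(/^[ \t"]+|[ \t"]+$/, "", s); return s }
    /^[ \t]*insecure-flag[ \t]*=/ { insecure = val($2) }
    /^[ \t]*ca-file[ \t]*=/       { cafile = val($2) }
    END {
      if (insecure == "false" && cafile == "") print "MISSING_CA_FILE"
      else print "OK"
    }'
}

# Example:
#   kubectl get secrets vsphere-config-secret -n vmware-system-csi \
#     -o jsonpath='{.data.vsphere-cloud-provider\.conf}' | base64 -d | check_csi_ca_config
```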

 

If the ca-file is blank when the insecure-flag is set to "false", the secret will need to be updated with the following procedure to add the ca-file path:

1. Back up the secret on the Supervisor Cluster so that the change can be reverted if required:
 
  • # kubectl get secrets vsphere-config-secret -n vmware-system-csi -o jsonpath='{.data.vsphere-cloud-provider\.conf}' |base64 -d > /root/vsphere-config-secret_orig.bak

2. Generate a new vsphere-cloud-provider base64 encoded secret with the correct ca-file path referencing /etc/vmware/wcp/tls/vmca.pem:
 
  • To modify the ca-file, enter the entire command below at the command prompt on the Supervisor Cluster. Change the environment-specific values (vCenter FQDN, cluster-id, user, password, and datacenters) to match the output of the scoping command above:
 
# echo '[Global]
insecure-flag = "false"
ca-file = "/etc/vmware/wcp/tls/vmca.pem"
cluster-id = "domain-c8"
cnsregistervolumes-cleanup-intervalinmin = 720
cluster-distribution = "SupervisorCluster"
[VirtualCenter "VCENTER_FQDN"]
user = "workload_storage_management-2927599b-1e8a-453c-a5d2-3871cbda9671@vsphere.local"
password = "@#$srfed$%s-gh"
datacenters = "datacenter-1"
port = "443"
targetvSANFileShareClusters = ""' | base64 | tr -d '\n'
 
For vSphere 8.x, supervisor-id must also be added:

 # echo '[Global]
insecure-flag = "false"
ca-file = "/etc/vmware/wcp/tls/vmca.pem"
cluster-id = "domain-c8"
supervisor-id = "supervisor-<id>"
cnsregistervolumes-cleanup-intervalinmin = 720
cluster-distribution = "SupervisorCluster"
[VirtualCenter "<VCENTER_FQDN>"]
user = "workload_storage_management-<id>@<domain>"
password = "<password>"
datacenters = "datacenter-<id>"
port = "443"
targetvSANFileShareClusters = ""' | base64 | tr -d '\n'

 

  • This base64-encodes the configuration and prints the encoded string on a single line, ready to be entered into data.vsphere-cloud-provider.conf.
 
3. Run the following to edit the secret:
 
  • # kubectl edit secrets vsphere-config-secret -n vmware-system-csi

4. Delete the existing base64 value after vsphere-cloud-provider.conf and paste the new one created in step 2. Ensure the value is on a single line.

5. Use :wq to write and quit the file, which will save the new secret.

6. Delete the CSI pod so that it is recreated and picks up the updated secret:
 
  • # kubectl delete pod vsphere-csi-controller-<ID> -n vmware-system-csi
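
As an alternative to retyping the configuration in step 2, the encoded value can be derived from the backup taken in step 1, which avoids transcription errors in the user and password fields. The sketch below is not part of the documented procedure and is subject to the same caution as the steps above. The function names are hypothetical, and the sed expression only rewrites an existing ca-file line; all other fields (including supervisor-id on 8.x) pass through unchanged and should still be reviewed by hand.

```shell
# Read a decoded vsphere-cloud-provider.conf on stdin, point ca-file at the
# path given as $1, and print the result base64-encoded on a single line.
encode_conf_with_ca() {
  sed "s|^ca-file[[:space:]]*=.*|ca-file = \"$1\"|" | base64 | tr -d '\n'
}

# Wrap an encoded value ($1) in a JSON merge patch for the secret's data key.
build_secret_patch() {
  printf '{"data":{"vsphere-cloud-provider.conf":"%s"}}' "$1"
}

# Example, using the backup file from step 1:
#   enc=$(encode_conf_with_ca /etc/vmware/wcp/tls/vmca.pem < /root/vsphere-config-secret_orig.bak)
#   printf '%s' "$enc" | base64 -d        # review the decoded result first
#   kubectl patch secret vsphere-config-secret -n vmware-system-csi \
#     --type merge -p "$(build_secret_patch "$enc")"
```

If the patch is applied this way, steps 3 to 5 are not needed; the CSI pod still has to be deleted as in step 6.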



Additional Information

https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/2.0/vmware-vsphere-csp-getting-started/GUID-C754A510-40BC-47E5-B222-1FEE40CB8186.html