TKGi cluster vSphere Cloud Provider dynamic storage provisioning is very slow

Article ID: 298633

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

When provisioning a StatefulSet, or even creating a standalone PVC (PersistentVolumeClaim), on standard vSAN storage from vSphere with a storage class of the following type, PVC creation takes up to 30 minutes to complete:

storage-class.yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: demo
provisioner: kubernetes.io/vsphere-volume
parameters:
  diskformat: thin

PVC creation is handled by the kube-controller-manager, which relies on vSphere for provisioning. When a request for a new PVC is received, the kube-controller-manager sends a provisioning request to vSphere and waits for the operation to complete.
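As a minimal sketch of a claim that exercises this provisioning path, the function below prints a PVC manifest that references the demo storage class shown above. The claim name demo-pvc, the 1Gi size, and the helper name demo_pvc_manifest are assumptions for illustration and are not part of the original procedure.

```shell
# demo_pvc_manifest [NAME] [SIZE]
# Print a PersistentVolumeClaim manifest that uses the "demo" storage class.
# The NAME and SIZE defaults are illustrative assumptions.
demo_pvc_manifest() {
  local name="${1:-demo-pvc}" size="${2:-1Gi}"
  cat <<EOF
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: ${name}
spec:
  storageClassName: demo
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: ${size}
EOF
}
```

Piping the output into kubectl apply -f - creates the claim. The claim should stay in the Pending phase until the kube-controller-manager has finished provisioning the disk in vSphere.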

Symptoms:
Creating a PVC normally takes a few seconds, but in some cases it can take up to 15 minutes. Several causes are possible; this article covers slow disk creation on the vSphere backend.
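To put a number on the delay, a small polling loop can time how long a claim takes to reach the Bound phase. This is a sketch only, assuming kubectl is configured against the affected cluster; the function name, the 5-second poll interval, and the 1800-second default timeout are illustrative choices.

```shell
# wait_for_pvc_bound NAME [NAMESPACE] [TIMEOUT_SECONDS]
# Poll a PVC until its phase is Bound and report the elapsed time.
wait_for_pvc_bound() {
  local name="$1" ns="${2:-default}" timeout="${3:-1800}"
  local start phase elapsed
  start="$(date +%s)"
  while :; do
    # .status.phase is Pending until a volume has been provisioned and bound.
    phase="$(kubectl -n "$ns" get pvc "$name" -o jsonpath='{.status.phase}')"
    elapsed=$(( $(date +%s) - start ))
    if [ "$phase" = "Bound" ]; then
      echo "PVC $name bound after ${elapsed}s"
      return 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "PVC $name still ${phase:-unknown} after ${elapsed}s" >&2
      return 1
    fi
    sleep 5
  done
}
```

For example, wait_for_pvc_bound demo-pvc default 1800 waits up to 30 minutes for a hypothetical claim named demo-pvc in the default namespace.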

Environment

Product Version: 1.7

Resolution

Checklist:
In TKGi, the default log verbosity is 2, which might be insufficient to trace the complete process. Increasing the verbosity might be required to see in detail what is happening. This step has to be completed on all masters.

Identify the kube-controller-manager process on the master nodes:

ps -ef | grep kube-controller

vcap     13554 13533  1 Dec08 ?        00:20:42 /var/vcap/packages/kubernetes/bin/kube-controller-manager ...


Identify the path:

/var/vcap/jobs/kube-controller-manager/config


Find and modify the file:

/var/vcap/jobs/kube-controller-manager/config/bpm.yml


Update the verbosity option --v:

processes:
- name: kube-controller-manager
  executable: /var/vcap/packages/kubernetes/bin/kube-controller-manager
  args:
  - "--cluster-name=demo"
  - "--cluster-signing-cert-file=/var/vcap/jobs/kube-controller-manager/config/cluster-signing-ca.pem"
  - "--cluster-signing-key-file=/var/vcap/jobs/kube-controller-manager/config/cluster-signing-key.pem"
  - "--kubeconfig=/var/vcap/jobs/kube-controller-manager/config/kubeconfig"
  - "--root-ca-file=/var/vcap/jobs/kube-controller-manager/config/ca.pem"
  - "--service-account-private-key-file=/var/vcap/jobs/kube-controller-manager/config/service-account-private-key.pem"
  - "--terminated-pod-gc-threshold=100"
  - "--tls-cert-file=/var/vcap/jobs/kube-controller-manager/config/kube-controller-manager-cert.pem"
  - "--tls-private-key-file=/var/vcap/jobs/kube-controller-manager/config/kube-controller-manager-private-key.pem"
  - "--use-service-account-credentials=true"
  - "--v=5"
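The edit can also be scripted. The following is a sketch only, assuming GNU sed (for the -i option) and that bpm.yml already contains a --v=N argument as in the listing above; the function name set_kcm_verbosity and the .bak backup copy are illustrative.

```shell
# set_kcm_verbosity LEVEL [FILE]
# Rewrite the --v=N argument in bpm.yml, keeping a backup of the original file.
set_kcm_verbosity() {
  local level="$1"
  local file="${2:-/var/vcap/jobs/kube-controller-manager/config/bpm.yml}"
  # Accept only a non-negative integer as the verbosity level.
  case "$level" in
    ''|*[!0-9]*) echo "usage: set_kcm_verbosity LEVEL [FILE]" >&2; return 2 ;;
  esac
  cp -p "$file" "$file.bak" || return 1
  sed -i "s/--v=[0-9][0-9]*/--v=$level/" "$file"
  # Show the resulting line so the change can be checked by eye.
  grep -- '--v=' "$file"
}
```

Running set_kcm_verbosity 5 on each master raises the level, and set_kcm_verbosity 2 restores the default afterwards. The job still has to be restarted for the change to take effect.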


Restart the service:

monit restart kube-controller-manager


Verify the service is restarted:

monit summary 


Complete the same process on all masters. Then identify the primary (leader) kube-controller-manager. Create the PVC, check its status, collect the logs, and review the PVC creation offline. Afterwards, repeat the same procedure to revert the verbosity to 2, because the log files grow very quickly at higher verbosity levels.
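One way to identify the leader is to read the holderIdentity field of the leader-election record. This sketch assumes that this Kubernetes version stores the record as the control-plane.alpha.kubernetes.io/leader annotation on the kube-controller-manager Endpoints object in the kube-system namespace; newer Kubernetes releases use a Lease object instead. The function names are illustrative.

```shell
# holder_identity
# Extract the holderIdentity value from a leader-election JSON record on stdin.
holder_identity() {
  sed -n 's/.*"holderIdentity":"\([^"]*\)".*/\1/p'
}

# current_kcm_leader
# Ask the API server which kube-controller-manager instance holds the lock.
# Assumes the Endpoints-based lock described in the text above.
current_kcm_leader() {
  kubectl -n kube-system get endpoints kube-controller-manager \
    -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}' |
    holder_identity
}
```

The value printed by current_kcm_leader normally starts with the host name of the master that is currently acting as leader, which is the node whose logs are most relevant for the PVC request.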

Check the Prometheus metrics on the masters:

curl -s localhost:10252/metrics | grep "cloudprovider_vsphere"
# HELP cloudprovider_vsphere_api_request_duration_seconds [ALPHA] Latency of vsphere api call
# TYPE cloudprovider_vsphere_api_request_duration_seconds histogram
cloudprovider_vsphere_api_request_duration_seconds_bucket{request="CreateVolume",le="0.005"} 0
cloudprovider_vsphere_api_request_duration_seconds_bucket{request="CreateVolume",le="0.01"} 0
cloudprovider_vsphere_api_request_duration_seconds_bucket{request="CreateVolume",le="0.025"} 0
cloudprovider_vsphere_api_request_duration_seconds_bucket{request="CreateVolume",le="0.05"} 0
cloudprovider_vsphere_api_request_duration_seconds_bucket{request="CreateVolume",le="0.1"} 0
cloudprovider_vsphere_api_request_duration_seconds_bucket{request="CreateVolume",le="0.25"} 2
cloudprovider_vsphere_api_request_duration_seconds_bucket{request="CreateVolume",le="0.5"} 5
cloudprovider_vsphere_api_request_duration_seconds_bucket{request="CreateVolume",le="1"} 5
cloudprovider_vsphere_api_request_duration_seconds_bucket{request="CreateVolume",le="2.5"} 5
cloudprovider_vsphere_api_request_duration_seconds_bucket{request="CreateVolume",le="5"} 5
cloudprovider_vsphere_api_request_duration_seconds_bucket{request="CreateVolume",le="10"} 5
cloudprovider_vsphere_api_request_duration_seconds_bucket{request="CreateVolume",le="+Inf"} 5
cloudprovider_vsphere_api_request_duration_seconds_sum{request="CreateVolume"} 1.2529264489999998
cloudprovider_vsphere_api_request_duration_seconds_count{request="CreateVolume"} 5
# HELP cloudprovider_vsphere_operation_duration_seconds [ALPHA] Latency of vsphere operation call
# TYPE cloudprovider_vsphere_operation_duration_seconds histogram
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="CreateVolumeOperation",le="0.005"} 0
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="CreateVolumeOperation",le="0.01"} 0
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="CreateVolumeOperation",le="0.025"} 0
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="CreateVolumeOperation",le="0.05"} 0
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="CreateVolumeOperation",le="0.1"} 0
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="CreateVolumeOperation",le="0.25"} 0
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="CreateVolumeOperation",le="0.5"} 4
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="CreateVolumeOperation",le="1"} 5
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="CreateVolumeOperation",le="2.5"} 5
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="CreateVolumeOperation",le="5"} 5
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="CreateVolumeOperation",le="10"} 5
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="CreateVolumeOperation",le="+Inf"} 5
cloudprovider_vsphere_operation_duration_seconds_sum{operation="CreateVolumeOperation"} 2.0968729269999997
cloudprovider_vsphere_operation_duration_seconds_count{operation="CreateVolumeOperation"} 5
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="DisksAreAttachedOperation",le="0.005"} 0
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="DisksAreAttachedOperation",le="0.01"} 0
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="DisksAreAttachedOperation",le="0.025"} 0
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="DisksAreAttachedOperation",le="0.05"} 1958
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="DisksAreAttachedOperation",le="0.1"} 3640
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="DisksAreAttachedOperation",le="0.25"} 3698
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="DisksAreAttachedOperation",le="0.5"} 3705
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="DisksAreAttachedOperation",le="1"} 3705
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="DisksAreAttachedOperation",le="2.5"} 3705
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="DisksAreAttachedOperation",le="5"} 3705
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="DisksAreAttachedOperation",le="10"} 3705
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="DisksAreAttachedOperation",le="+Inf"} 3705
cloudprovider_vsphere_operation_duration_seconds_sum{operation="DisksAreAttachedOperation"} 207.47464091999979
cloudprovider_vsphere_operation_duration_seconds_count{operation="DisksAreAttachedOperation"} 3705


These metrics show how long specific vCenter operations took and can help pinpoint the issue.
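Because each histogram exposes a _sum and a _count series, the mean duration per request or operation is simply sum divided by count. The function below is a minimal sketch of that calculation with awk; the function name vsphere_mean_latency is illustrative.

```shell
# vsphere_mean_latency
# Read Prometheus text-format metrics on stdin and print the mean duration
# (sum / count, in seconds) for each vSphere request or operation series.
vsphere_mean_latency() {
  awk '
    $1 ~ /^cloudprovider_vsphere_.*_duration_seconds_(sum|count)[{]/ {
      key = $1
      sub(/_(sum|count)[{]/, "{", key)   # series name without the suffix
      if ($1 ~ /_sum[{]/) sum[key] = $2
      else                cnt[key] = $2
    }
    END {
      for (k in cnt)
        if (cnt[k] + 0 > 0)
          printf "%s %.3f\n", k, sum[k] / cnt[k]
    }
  ' | sort
}
```

Run it as curl -s localhost:10252/metrics | vsphere_mean_latency. For the sample output above, the means are about 0.251 s for CreateVolume, 0.419 s for CreateVolumeOperation, and 0.056 s for DisksAreAttachedOperation, so every call in that sample finished in under a second. Means of several seconds, or counts accumulating only in the le="+Inf" bucket, would suggest that the time is being spent on the vCenter side.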