When provisioning a StatefulSet, or even a standalone PersistentVolumeClaim (PVC), on the standard vSAN storage from vSphere using a StorageClass of the following type, PVC creation can take up to 30 minutes to complete:
storage-class.yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: demo
provisioner: kubernetes.io/vsphere-volume
parameters:
  diskformat: thin
PVC creation is handled by kube-controller-manager, with vSphere performing the actual disk provisioning. When a request for a new PVC is received, kube-controller-manager sends a provisioning request to vSphere and waits for the operation to complete.
Symptoms:
PVC creation normally takes a few seconds, but in some cases it can take up to 15 minutes. Several causes are possible; this article covers slow disk creation on the vSphere backend.
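To confirm and quantify the symptom, you can measure the time between creating a claim and the claim reaching the Bound phase from any workstation with kubectl access. The following is a minimal sketch. It assumes kubectl is on the PATH and pointed at the affected cluster. `wait_for_bound`, `kubectl_pvc_phase` and the claim name `demo-pvc` are hypothetical names used for illustration. The phase lookup, clock and sleep are passed in as parameters, so the polling loop can be exercised without a cluster.

```python
import subprocess
import time
from typing import Callable


def kubectl_pvc_phase(name: str, namespace: str = "default") -> str:
    """Return the PVC's .status.phase (e.g. "Pending" or "Bound") using kubectl."""
    result = subprocess.run(
        ["kubectl", "get", "pvc", name, "-n", namespace,
         "-o", "jsonpath={.status.phase}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def wait_for_bound(
    get_phase: Callable[[], str],
    timeout_s: float = 1800.0,  # 30 minutes, the worst case described above
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll until the claim reports "Bound" and return the elapsed seconds.

    Raises TimeoutError if the claim is still not Bound after timeout_s.
    """
    start = clock()
    while True:
        if get_phase() == "Bound":
            return clock() - start
        if clock() - start >= timeout_s:
            raise TimeoutError(f"PVC not Bound after {timeout_s:.0f} s")
        sleep(interval_s)
```

Run it immediately after creating the claim, for example `wait_for_bound(lambda: kubectl_pvc_phase("demo-pvc"))`. It returns the number of seconds until the claim was bound, or raises `TimeoutError` after the default 1800 s.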
Product Version: 1.7
Checklist:
In TKGi, the default kube-controller-manager verbosity is 2, which might be insufficient to trace the complete provisioning process. You may need to raise the verbosity to see in detail what is happening. This step must be completed on all master nodes.
Identify the kube-controller-manager process on the master nodes:
ps -ef | grep kube-controller
vcap 13554 13533 1 Dec08 ? 00:20:42 /var/vcap/packages/kubernetes/bin/kube-controller-manager ...
Locate the job configuration directory:
/var/vcap/jobs/kube-controller-manager/config
Find and modify the file:
/var/vcap/jobs/kube-controller-manager/config/bpm.yml
Update the verbosity flag --v (set to 5 in this example):
processes:
  - name: kube-controller-manager
    executable: /var/vcap/packages/kubernetes/bin/kube-controller-manager
    args:
      - "--cluster-name=demo"
      - "--cluster-signing-cert-file=/var/vcap/jobs/kube-controller-manager/config/cluster-signing-ca.pem"
      - "--cluster-signing-key-file=/var/vcap/jobs/kube-controller-manager/config/cluster-signing-key.pem"
      - "--kubeconfig=/var/vcap/jobs/kube-controller-manager/config/kubeconfig"
      - "--root-ca-file=/var/vcap/jobs/kube-controller-manager/config/ca.pem"
      - "--service-account-private-key-file=/var/vcap/jobs/kube-controller-manager/config/service-account-private-key.pem"
      - "--terminated-pod-gc-threshold=100"
      - "--tls-cert-file=/var/vcap/jobs/kube-controller-manager/config/kube-controller-manager-cert.pem"
      - "--tls-private-key-file=/var/vcap/jobs/kube-controller-manager/config/kube-controller-manager-private-key.pem"
      - "--use-service-account-credentials=true"
      - "--v=5"
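Editing the --v value by hand on every master is error-prone, so you can script the change. Below is a minimal Python sketch. It assumes the argument appears as a quoted "--v=N" list item, as in the file above. `set_verbosity` and `update_bpm_file` are hypothetical helpers. The .bak backup is an illustrative precaution, not part of the documented procedure. Calling the same helper with level 2 reverts the change. Files under /var/vcap/jobs are rendered by BOSH, so a manual edit is likely to be overwritten the next time the job templates are re-rendered (for example, on a redeploy).

```python
import re
from pathlib import Path

# Path taken from the procedure above; must be run as a user allowed to write it.
BPM_PATH = Path("/var/vcap/jobs/kube-controller-manager/config/bpm.yml")

# Matches the quoted list item "--v=<digits>" and nothing else.
_VERBOSITY_ARG = re.compile(r'("--v=)\d+(")')


def set_verbosity(bpm_text: str, level: int) -> str:
    """Return bpm.yml content with the "--v=N" argument set to `level`."""
    new_text, count = _VERBOSITY_ARG.subn(rf"\g<1>{level}\g<2>", bpm_text)
    if count == 0:
        raise ValueError('no "--v=N" argument found in bpm.yml')
    return new_text


def update_bpm_file(level: int, path: Path = BPM_PATH) -> None:
    """Copy bpm.yml to bpm.yml.bak, then write it back with the new verbosity."""
    original = path.read_text()
    path.with_name(path.name + ".bak").write_text(original)
    path.write_text(set_verbosity(original, level))
```

For example, run `update_bpm_file(5)` on each master, then `monit restart kube-controller-manager`. Afterwards, run `update_bpm_file(2)` and restart again to revert.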
Restart the service:
monit restart kube-controller-manager
Verify that the service has restarted:
monit summary
Complete the same process on all masters. Then identify which master runs the active (leader) kube-controller-manager. Create the PVC, check its status, collect the logs from the leader, and analyze the PVC creation offline. Afterwards, repeat the same procedure to revert the verbosity to 2, because the log files grow rapidly at higher verbosity levels.
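Only one kube-controller-manager instance is active at a time. The others wait on a leader-election lock, so the provisioning logs are written on the leader's master. Below is a minimal sketch for finding that master. It assumes the cluster uses the Endpoints-based lock of Kubernetes releases from this period. With that lock, the lock record is stored as JSON in the `control-plane.alpha.kubernetes.io/leader` annotation of the `kube-controller-manager` Endpoints object in `kube-system`. The `holderIdentity` field normally starts with the hostname of the master that holds the lock. On later releases that use a Lease object instead, the same value is in `spec.holderIdentity` of `kubectl -n kube-system get lease kube-controller-manager`. The function names below are hypothetical.

```python
import json
import subprocess

# Annotation used by client-go leader election for Endpoints/ConfigMap locks.
LEADER_ANNOTATION = "control-plane.alpha.kubernetes.io/leader"


def leader_from_endpoints(endpoints_json: str) -> str:
    """Return holderIdentity from the leader-election record on an Endpoints object."""
    endpoints = json.loads(endpoints_json)
    record = json.loads(endpoints["metadata"]["annotations"][LEADER_ANNOTATION])
    return record["holderIdentity"]


def current_controller_manager_leader() -> str:
    """Query the cluster with kubectl and return the current leader's identity."""
    result = subprocess.run(
        ["kubectl", "-n", "kube-system", "get", "endpoints",
         "kube-controller-manager", "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    return leader_from_endpoints(result.stdout)
```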
Check the Prometheus metrics exposed by kube-controller-manager on the masters:
curl -s localhost:10252/metrics | grep "cloudprovider_vsphere"
# HELP cloudprovider_vsphere_api_request_duration_seconds [ALPHA] Latency of vsphere api call
# TYPE cloudprovider_vsphere_api_request_duration_seconds histogram
cloudprovider_vsphere_api_request_duration_seconds_bucket{request="CreateVolume",le="0.005"} 0
cloudprovider_vsphere_api_request_duration_seconds_bucket{request="CreateVolume",le="0.01"} 0
cloudprovider_vsphere_api_request_duration_seconds_bucket{request="CreateVolume",le="0.025"} 0
cloudprovider_vsphere_api_request_duration_seconds_bucket{request="CreateVolume",le="0.05"} 0
cloudprovider_vsphere_api_request_duration_seconds_bucket{request="CreateVolume",le="0.1"} 0
cloudprovider_vsphere_api_request_duration_seconds_bucket{request="CreateVolume",le="0.25"} 2
cloudprovider_vsphere_api_request_duration_seconds_bucket{request="CreateVolume",le="0.5"} 5
cloudprovider_vsphere_api_request_duration_seconds_bucket{request="CreateVolume",le="1"} 5
cloudprovider_vsphere_api_request_duration_seconds_bucket{request="CreateVolume",le="2.5"} 5
cloudprovider_vsphere_api_request_duration_seconds_bucket{request="CreateVolume",le="5"} 5
cloudprovider_vsphere_api_request_duration_seconds_bucket{request="CreateVolume",le="10"} 5
cloudprovider_vsphere_api_request_duration_seconds_bucket{request="CreateVolume",le="+Inf"} 5
cloudprovider_vsphere_api_request_duration_seconds_sum{request="CreateVolume"} 1.2529264489999998
cloudprovider_vsphere_api_request_duration_seconds_count{request="CreateVolume"} 5
# HELP cloudprovider_vsphere_operation_duration_seconds [ALPHA] Latency of vsphere operation call
# TYPE cloudprovider_vsphere_operation_duration_seconds histogram
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="CreateVolumeOperation",le="0.005"} 0
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="CreateVolumeOperation",le="0.01"} 0
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="CreateVolumeOperation",le="0.025"} 0
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="CreateVolumeOperation",le="0.05"} 0
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="CreateVolumeOperation",le="0.1"} 0
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="CreateVolumeOperation",le="0.25"} 0
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="CreateVolumeOperation",le="0.5"} 4
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="CreateVolumeOperation",le="1"} 5
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="CreateVolumeOperation",le="2.5"} 5
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="CreateVolumeOperation",le="5"} 5
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="CreateVolumeOperation",le="10"} 5
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="CreateVolumeOperation",le="+Inf"} 5
cloudprovider_vsphere_operation_duration_seconds_sum{operation="CreateVolumeOperation"} 2.0968729269999997
cloudprovider_vsphere_operation_duration_seconds_count{operation="CreateVolumeOperation"} 5
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="DisksAreAttachedOperation",le="0.005"} 0
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="DisksAreAttachedOperation",le="0.01"} 0
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="DisksAreAttachedOperation",le="0.025"} 0
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="DisksAreAttachedOperation",le="0.05"} 1958
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="DisksAreAttachedOperation",le="0.1"} 3640
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="DisksAreAttachedOperation",le="0.25"} 3698
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="DisksAreAttachedOperation",le="0.5"} 3705
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="DisksAreAttachedOperation",le="1"} 3705
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="DisksAreAttachedOperation",le="2.5"} 3705
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="DisksAreAttachedOperation",le="5"} 3705
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="DisksAreAttachedOperation",le="10"} 3705
cloudprovider_vsphere_operation_duration_seconds_bucket{operation="DisksAreAttachedOperation",le="+Inf"} 3705
cloudprovider_vsphere_operation_duration_seconds_sum{operation="DisksAreAttachedOperation"} 207.47464091999979
cloudprovider_vsphere_operation_duration_seconds_count{operation="DisksAreAttachedOperation"} 3705
These histograms show how long specific vCenter calls take and can help pinpoint where the time is spent.
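Each histogram carries a `_sum` (total seconds) and a `_count` (number of calls), so sum divided by count gives the mean latency per call. The cumulative `_bucket` counters also show the smallest `le` bound that already contains every observation, which is an upper bound on the slowest call. The sketch below computes both from the text returned by the curl command above. It assumes the standard Prometheus text exposition format, with labels present on each sample. `average_latencies` and `smallest_full_bucket` are hypothetical helper names.

```python
import re

# e.g. name_sum{request="CreateVolume"} 1.25
_SUM_COUNT = re.compile(
    r'^(?P<name>\w+?)_(?P<kind>sum|count)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
# e.g. name_bucket{request="CreateVolume",le="0.5"} 5
_BUCKET = re.compile(
    r'^(?P<name>\w+)_bucket\{(?P<labels>.*),le="(?P<le>[^"]+)"\}\s+(?P<value>\S+)$')


def average_latencies(metrics_text: str) -> dict[str, float]:
    """Mean seconds per call for each labelled histogram series, keyed by 'name{labels}'."""
    sums: dict[str, float] = {}
    counts: dict[str, float] = {}
    for line in metrics_text.splitlines():
        m = _SUM_COUNT.match(line.strip())
        if not m:
            continue  # skip comments, _bucket lines and any unlabelled samples
        key = f'{m["name"]}{{{m["labels"]}}}'
        (sums if m["kind"] == "sum" else counts)[key] = float(m["value"])
    return {k: sums[k] / counts[k] for k in sums if counts.get(k)}


def smallest_full_bucket(metrics_text: str) -> dict[str, float]:
    """Per series, the smallest 'le' bound whose cumulative count equals the total."""
    buckets: dict[str, list[tuple[float, float]]] = {}
    for line in metrics_text.splitlines():
        m = _BUCKET.match(line.strip())
        if m:
            key = f'{m["name"]}{{{m["labels"]}}}'
            # float() parses "+Inf" as infinity, so the +Inf bucket sorts last.
            buckets.setdefault(key, []).append((float(m["le"]), float(m["value"])))
    result: dict[str, float] = {}
    for key, pairs in buckets.items():
        pairs.sort()
        total = pairs[-1][1]  # the +Inf bucket holds the total number of calls
        result[key] = next(le for le, c in pairs if c >= total)
    return result
```

Applied to the sample output above, the mean is about 0.25 s for the CreateVolume API request, 0.42 s for CreateVolumeOperation and 0.056 s for DisksAreAttachedOperation. Every observation falls within the 0.5 s, 1 s and 0.5 s buckets, respectively.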