"opening storage failed: open <device> no space left on device" error when starting a pod on a cluster deployed by Container Service Extension.

Products

VMware Cloud Director

Issue/Introduction

Symptoms:

You cannot start a Pod on a Kubernetes cluster deployed by CSE.
The Pod logs show errors relating to disk usage.

$ kubectl logs <POD_NAME>

ts=2024-01-24T18:51:43.125Z caller=main.go:1166 level=error err="opening storage failed: open <device>: no space left on device"

Inspection of the Pod shows a full disk.

$ kubectl exec <POD_NAME> -- df -h

Filesystem      Size  Used Avail Use% Mounted on
overlay          19G   13G  5.3G  71% /
tmpfs            64M     0   64M   0% /dev
tmpfs            16G     0   16G   0% /sys/fs/cgroup
/dev/sda4        19G   13G  5.3G  71% /tmp
/dev/sdb         10G   10G    0M 100% /mnt
shm              64M     0   64M   0% /dev/shm
tmpfs            32G   12K   32G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs            16G     0   16G   0% /proc/acpi
tmpfs            16G     0   16G   0% /proc/scsi
tmpfs            16G     0   16G   0% /sys/firmware
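
On a Pod with many mounts, the full filesystem can be picked out of the df output mechanically. The following is a minimal sketch, not part of the original procedure: full_mounts is a hypothetical helper name, and the 90% default threshold is an assumption. It reads df output on standard input and prints each mount point whose Use% is at or above the threshold.

```shell
# Hypothetical helper: list mount points at or above a usage threshold.
# $1 = threshold in percent (assumed default: 90). Reads df output on stdin.
full_mounts() {
  awk -v limit="${1:-90}" 'NF >= 6 && $(NF - 1) ~ /^[0-9]+%$/ {
    use = $(NF - 1)          # the Use% column is second to last
    sub(/%/, "", use)        # strip the percent sign for numeric comparison
    if (use + 0 >= limit + 0) print $NF, use "%"
  }'
}

# Usage (requires access to the cluster):
#   kubectl exec <POD_NAME> -- df -h | full_mounts 95
```

The helper assumes mount points contain no spaces, which holds for the output shown above.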

Environment

VMware Cloud Director 10.x

Cause

This issue occurs because the default size with which PVCs are created may not be adequate for the long-term storage needs of the service.

Resolution

This is a known issue affecting clusters created using Container Service Extension (CSE) up to and including version 4.2, and the Container Storage Interface (CSI) driver up to and including version 1.5.

Because these versions of the CSI driver do not support resizing a PVC while it is attached to a Pod, use the workaround below.

Workaround:
NOTE:
Before proceeding, you will need a KUBECONFIG file to access the affected cluster and VCD credentials for the cluster author.
See Manage Clusters for more information on obtaining a copy of the current kubeconfig file.

The high-level process includes four phases:

Shut down the affected Pod.
Increase the size of the volume in VCD.
Use a temporary Pod to resize the filesystem on the volume.
Restart the affected Pod.

Identify which Deployment or StatefulSet controls the affected Pod. You will use this resource to control the affected Pod.
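
One way to find the controlling resource is to follow the Pod's ownerReferences. The sketch below is illustrative only: owner_of and controller_of_pod are hypothetical helper names, and it assumes the Pod lives in the current namespace and has a single owner. A Pod created by a Deployment is owned by a ReplicaSet, so the sketch follows one extra hop in that case.

```shell
# Hypothetical helper: print KIND/NAME of the first owner of a resource.
# $1 = resource in TYPE/NAME form, for example pod/<POD_NAME>
owner_of() {
  kubectl get "$1" \
    -o jsonpath='{.metadata.ownerReferences[0].kind}/{.metadata.ownerReferences[0].name}'
}

# Hypothetical helper: print the Deployment or StatefulSet controlling a Pod.
# $1 = Pod name
controller_of_pod() {
  owner=$(owner_of "pod/$1") || return 1
  case "$owner" in
    # Deployment Pods are owned through an intermediate ReplicaSet
    ReplicaSet/*) owner_of "replicaset/${owner#ReplicaSet/}" ;;
    # No owner reference: the Pod is not managed by a controller
    /|'') return 1 ;;
    *) printf '%s\n' "$owner" ;;
  esac
}

# Usage (requires access to the cluster):
#   controller_of_pod <POD_NAME>
```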

Shut down the affected workloads

The Pod must be shut down before making any changes.

If the controlling resource is managed by kapp-controller, the PackageInstall object must be paused or changes to the resource will be automatically overwritten.

$ kctrl package installed list
$ kctrl package installed pause -i <PACKAGE_INSTALL_NAME>

The Pods for the resource may now be terminated by scaling it down to zero replicas.

$ kubectl get <RESOURCE_TYPE>/<RESOURCE_NAME>
# Record the desired number of replicas for the resource
$ kubectl scale <RESOURCE_TYPE>/<RESOURCE_NAME> --replicas=0
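
Recording the replica count and scaling down can be combined so the count is not lost. This is a sketch under assumptions: scale_to_zero is a hypothetical helper name, and the resource is assumed to expose .spec.replicas, as Deployments and StatefulSets do. It prints the previous replica count on standard output and sends the kubectl scale message to standard error.

```shell
# Hypothetical helper: scale a resource to zero and print its previous
# replica count so that it can be restored later.
# $1 = resource in TYPE/NAME form, for example statefulset/<RESOURCE_NAME>
scale_to_zero() {
  resource=$1
  count=$(kubectl get "$resource" -o jsonpath='{.spec.replicas}') || return 1
  # Send the confirmation message to stderr so stdout carries only the count
  kubectl scale "$resource" --replicas=0 >&2 || return 1
  printf '%s\n' "$count"
}

# Usage (requires access to the cluster):
#   COUNT=$(scale_to_zero <RESOURCE_TYPE>/<RESOURCE_NAME>)
```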

Use kubectl to retrieve information about the affected PVC.

$ kubectl get pvc -o=custom-columns=NAME:.metadata.name,VOLUME:.spec.volumeName

NAME                         VOLUME
data-kafka-controller-0      pvc-4556b190-a4f7-####-####-########47b
data-postgres-postgresql-0   pvc-ab50999f-06ac-####-####-########630
grafana-pvc                  pvc-9736b721-c2c7-####-####-########33a
minio                        pvc-cd616d40-25d7-####-####-########762

Record the NAME and VOLUME of the affected PVC for later steps.
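
When several PVCs are affected, for example one per StatefulSet replica, the name-to-volume pairs can be collected in one pass. The following is a minimal sketch: pvc_volumes is a hypothetical helper name, and the PVCs are assumed to be in the current namespace.

```shell
# Hypothetical helper: print "PVC_NAME VOLUME_NAME" for each PVC given.
# $@ = one or more PVC names
pvc_volumes() {
  for pvc in "$@"; do
    volume=$(kubectl get pvc "$pvc" -o jsonpath='{.spec.volumeName}') || return 1
    printf '%s %s\n' "$pvc" "$volume"
  done
}

# Usage (requires access to the cluster):
#   pvc_volumes <PVC_NAME_1> <PVC_NAME_2>
```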

Increase the volume size

Note:
This process may need to be repeated multiple times if there are multiple replicas in the StatefulSet.
Repeat this process for each Named Disk before continuing to the next step.

This process will use the VCD UI to resize the Named Disk associated with the PVC.

Log in to the VCD Tenant UI as the cluster author.
Browse to the Organization VDC hosting the CSE cluster.
Click on Storage -> Named Disks.
Filter the list of the Named Disks using the PVC Volume identified earlier.
Select the disk and click Edit.
Enter a new size for the Named Disk that will satisfy your requirements.
Click Save.
Wait for the associated resize task to finish.

The underlying volume for the PVC has now been increased in size, but the filesystem has not been expanded.

Resize the filesystem

Note:
This process may need to be repeated multiple times if there are multiple replicas in the StatefulSet.
Repeat this process for each PVC before continuing to the next step.

This process will use a temporary Pod to mount the PVC so you may resize the filesystem to consume the expanded capacity.

$ kubectl run -it --attach --rm reformat --overrides='
{
  "spec": {
    "containers": [
      {
        "name": "reformat",
        "image": "ubuntu:14.04",
        "args": [
          "bash"
        ],
        "stdin": true,
        "stdinOnce": true,
        "tty": true,
        "securityContext": {
          "privileged": true
        },
        "volumeMounts": [{
          "mountPath": "/mnt",
          "name": "data"
        }]
      }
    ],
    "volumes": [{
      "name": "data",
      "persistentVolumeClaim": {
        "claimName": "<PVC_NAME>"
      }
    }]
  }
}
' --image=ubuntu:14.04

The prompt will pause while the Pod is scheduled and started.
Press Enter a couple of times if you think it is ready but don’t see a command prompt.

Run df /mnt to identify the device associated with the mounted PVC. Record the value of Filesystem.

$ df /mnt
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/sdc         5074592 4796064         0 100% /mnt

Run resize2fs to resize the filesystem to consume the expanded capacity.

$ resize2fs <FILESYSTEM>
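
Inside the temporary Pod, the device lookup and the resize can be chained so the device name is not copied by hand. This is a sketch, not part of the original procedure: device_for_mount is a hypothetical helper name. resize2fs handles ext2, ext3, and ext4 filesystems only, so the sketch assumes the volume uses one of these.

```shell
# Hypothetical helper: read `df -P <MOUNT_POINT>` output on stdin and print
# the Filesystem column of the first data row.
device_for_mount() {
  awk 'NR == 2 { print $1 }'
}

# Usage inside the temporary Pod:
#   dev=$(df -P /mnt | device_for_mount)
#   resize2fs "$dev" && df -h /mnt
```

The final df -h /mnt shows whether the filesystem now reports the expanded size.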

Exit the shell. The Pod will be removed.

$ exit

The filesystem on the PVC has now been updated to consume the expanded capacity of the underlying volume.

Restart the affected workloads

If the controlling resource is managed by kapp-controller, then you can unpause the PackageInstall.
The package will reconcile and update the resource to the desired number of replicas.

$ kctrl package installed kick -i <PACKAGE_INSTALL_NAME>

Otherwise, use kubectl to scale the resource back to the initial number of replicas.

$ kubectl scale <RESOURCE_TYPE>/<RESOURCE_NAME> --replicas=<COUNT>

Monitor the status of the Pods to ensure they start.
Restart the troubleshooting process if they continue to fail.
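
For resources that are scaled manually rather than reconciled by kapp-controller, scaling up and monitoring can be combined. The sketch below is illustrative: scale_up_and_wait is a hypothetical helper name, and the 300s default timeout is an assumption. It relies on kubectl rollout status, which supports Deployments and StatefulSets.

```shell
# Hypothetical helper: restore the replica count and wait for the rollout.
# $1 = resource in TYPE/NAME form
# $2 = desired replica count
# $3 = timeout passed to kubectl rollout status (assumed default: 300s)
scale_up_and_wait() {
  resource=$1
  count=$2
  timeout=${3:-300s}
  kubectl scale "$resource" --replicas="$count" || return 1
  # Returns non-zero if the Pods are not ready before the timeout expires
  kubectl rollout status "$resource" --timeout="$timeout"
}

# Usage (requires access to the cluster):
#   scale_up_and_wait <RESOURCE_TYPE>/<RESOURCE_NAME> <COUNT>
```

A non-zero exit status indicates the Pods did not become ready in time, in which case kubectl logs <POD_NAME> is the place to look first.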

Additional Information

Impact/Risks:
Any operation impacting persistent storage should be tested before it is used in production. Improper steps may lead to the loss of production data.