Pod ephemeral storage utilisation

Products

VMware Tanzu Kubernetes Grid Integrated Edition 1.x
VMware Tanzu Kubernetes Grid Integrated (TKGi)

Issue/Introduction

During cluster creation, the size of the worker disks can be adjusted to suit different use cases, either by updating the plan or by creating a compute profile.
Two types of disk are created:
Ephemeral disks - lost when the worker is recreated
Persistent disk - retained across worker recreation and stemcell updates

Mount structure:
Ephemeral disks, mounted at root (/) and /var/vcap/data:
/dev/sda1 4.9G 2.7G 1.9G 59% /
/dev/sdb1 32G 4.1G 26G 14% /var/vcap/data
Persistent disk, mounted at /var/vcap/store:
/dev/sdc1 32G 2.9G 27G 10% /var/vcap/store

As the worker nodes do not use swap, there is no swap partition.

The persistent disk stores the containerd filesystem. The description below refers to Docker, but containerd uses the same layered image model.

The Docker filesystem is organised into layers, each representing a filesystem change. Here's a summary of Docker filesystem layers:

Base Image Layer: The base image layer is the foundation of a Docker image. It contains the initial filesystem state and typically includes the operating system and basic software packages needed to run an application.
Intermediate Layers: Intermediate layers represent changes made to the base image. These changes could include installing software, updating configurations, or adding files. Each intermediate layer is created based on the changes made relative to the previous layer.
Top Read-Write Layer: The top layer is a read-write layer that sits on top of the intermediate layers. It captures any changes made to the filesystem during container runtime, such as modifications to files or directories by running processes. This layer is ephemeral and discarded when the container is deleted.
Union File System (UnionFS): Docker uses UnionFS to implement its layered filesystem. UnionFS allows multiple filesystems to be layered on top of each other, providing a unified view of the filesystem while minimising duplication of data. Common UnionFS drivers used in Docker include OverlayFS and AUFS.

These layers are stacked on top of each other to form the complete filesystem of a Docker container. They enable Docker's lightweight and efficient approach to containerisation by allowing images to be built incrementally and shared efficiently across different containers. Additionally, Docker's copy-on-write mechanism ensures that changes to the filesystem are minimal, reducing storage overhead and improving performance.

The ephemeral disk mounted at /var/vcap/data serves several purposes, such as storing logs and other data that can and will be recreated during the VM lifecycle.
This disk also holds pod ephemeral volumes such as emptyDir, and it is the disk reported as ephemeral-storage on the node:
kubectl describe node <NAME>
...
Capacity:
  ephemeral-storage: 32845584Ki
Allocatable:
  ephemeral-storage: 30270490165
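
The two figures use different units: Capacity is reported in Ki (units of 1024 bytes) while Allocatable is in plain bytes. A minimal Python sketch of the conversion using the values above follows. The helper name quantity_to_bytes is hypothetical, and it handles only the binary suffixes listed, not the full Kubernetes quantity grammar.

```python
# Binary suffixes used by Kubernetes quantities.
# Assumption: only this subset is needed; decimal and exponent forms are not handled.
BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def quantity_to_bytes(quantity: str) -> int:
    """Convert a quantity such as '32845584Ki' or '30270490165' to bytes."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: the value is already in bytes


capacity = quantity_to_bytes("32845584Ki")
allocatable = quantity_to_bytes("30270490165")

print(f"capacity:    {capacity} bytes ({capacity / 1024**3:.1f} GiB)")
print(f"allocatable: {allocatable} bytes ({allocatable / 1024**3:.1f} GiB)")
print(f"allocatable share: {allocatable / capacity:.1%}")
```

With these values the allocatable share is roughly 90% of capacity. A likely explanation, assuming the kubelet's default hard eviction threshold for the node filesystem and no further reservations, is that the remaining 10% is held back by the kubelet.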

To confirm where the space is used, we can run the following tests.
Confirm emptyDir usage:
Create a pod:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx
    resources:
      requests:
        ephemeral-storage: "20Gi"
      limits:
        ephemeral-storage: "20Gi"
    volumeMounts:
    - name: storage
      mountPath: /data
  volumes:
  - name: storage
    emptyDir:
      sizeLimit: "20Gi"

Once the pod is running, exec into it and use dd to write data to the /data mount:
kubectl exec -it pod/my-pod -- bash
root@my-pod:/# cd data/
root@my-pod:/data# dd if=/dev/urandom of=random_data_file bs=1M count=10240 status=progress

This creates a 10 GiB file (about 11 GB in decimal units), which is stored on the ephemeral disk of the worker.
On the worker hosting the pod, the allocation is visible under /var/vcap/data/kubelet/pods:
/var/vcap/data/kubelet/pods# du -h -d 1
11G ./dc60953a-42a1-4394-b88d-4b7fd1052967
72K ./7fc33177-2103-4aee-8d05-f5eaf9f52f36
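
The file size follows directly from the dd parameters. A short sketch of the arithmetic, assuming GNU dd semantics where bs=1M means 1 MiB (1048576 bytes); the constant names are illustrative only.

```python
# dd if=/dev/urandom of=random_data_file bs=1M count=10240
BLOCK_SIZE = 1024**2   # bs=1M: GNU dd treats M as MiB (1048576 bytes)
BLOCK_COUNT = 10240    # count=10240

total_bytes = BLOCK_SIZE * BLOCK_COUNT
size_gib = total_bytes / 1024**3   # binary units (GiB)
size_gb = total_bytes / 1000**3    # decimal units (GB)

# The emptyDir sizeLimit from the pod manifest ("20Gi"), for comparison.
SIZE_LIMIT = 20 * 1024**3

print(f"{total_bytes} bytes = {size_gib:.0f} GiB = {size_gb:.2f} GB")
print(f"fits within sizeLimit: {total_bytes <= SIZE_LIMIT}")
```

The file is 10 GiB, or about 10.7 GB in decimal units, which dd reports as 11 GB. It is well below the 20Gi sizeLimit set on the emptyDir volume.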

Comparing the volume as seen from the pod and from the worker shows the same size and utilisation.
From the pod:
/dev/sdb1 32G 15.1G 16.9G 47% /data
From the worker:
/dev/sdb1 32G 15.1G 16.9G 47% /var/vcap/data

After deleting the file we created, usage drops back. From the pod:
/dev/sdb1 32G 4.1G 26G 14% /data
From the worker:
/dev/sdb1 32G 4.1G 26G 14% /var/vcap/data

Confirm containerd persistent disk usage:
Use the same pod as above, but write the file into the /tmp folder instead of /data:
cd tmp/
root@my-pod:/tmp# dd if=/dev/urandom of=random_data_file bs=1M count=10240 status=progress
10716446720 bytes (11 GB, 10 GiB) copied, 61 s, 176 MB/s

Because this write goes to the containerd filesystem (the container's read-write layer), usage of the persistent disk increases.
The mount seen from the pod is now 44% utilised:
/dev/sdc1 32G 13G 17G 44% /etc/hostname
and the persistent disk on the worker shows the same usage:
/dev/sdc1 32G 13G 17G 44% /var/vcap/store

Another option is to mount a host folder directly into the container with a hostPath volume. In this case any of the disks can be used, as in the example below:

cat diskmount.yaml
apiVersion: v1
kind: Pod
metadata:
  name: diskpod
spec:
  containers:
  - name: my-container
    image: nginx
    volumeMounts:
    - mountPath: /var/lib/twistlock
      name: data-folder
    - mountPath: /var/vcap/sys/run/containerd
      name: docker-sock-folder
    - mountPath: /var/lib/containerd
      name: cri-data
    - mountPath: /run
      name: runc-proxy-sock-folder
  volumes:
  - hostPath:
      path: /var/vcap/store/twistlock-data
      type: ""
    name: data-folder
  - hostPath:
      path: /var/vcap/sys/run/containerd
      type: ""
    name: docker-sock-folder
  - hostPath:
      path: /var/lib/containerd
      type: ""
    name: cri-data
  - hostPath:
      path: /run
      type: ""
    name: runc-proxy-sock-folder

Once the pod has started, three mount points backed by the worker disks are visible inside it:
/dev/sdb1 32G 4.1G 26G 14% /etc/hosts
/dev/sdc1 32G 2.9G 27G 10% /etc/hostname
/dev/sda1 4.9G 2.7G 1.9G 59% /var/lib/containerd

They correspond to the disks mounted on the worker OS:
/dev/sda1 4.9G 2.7G 1.9G 59% /
/dev/sdb1 32G 4.1G 26G 14% /var/vcap/data
/dev/sdc1 32G 2.9G 27G 10% /var/vcap/store
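
The correspondence can be read off by matching device names. A minimal sketch that joins the two listings above on the device column follows; parse_df is a hypothetical helper that assumes the six-column layout shown here (device, size, used, available, use%, mount point), with no header row and no spaces in mount points.

```python
def parse_df(text: str) -> dict[str, str]:
    """Map device -> mount point from df-style lines with six columns."""
    mounts = {}
    for line in text.strip().splitlines():
        device, _size, _used, _avail, _pct, mount = line.split()
        mounts[device] = mount
    return mounts


# Listing taken from inside the pod.
POD_DF = """
/dev/sdb1 32G 4.1G 26G 14% /etc/hosts
/dev/sdc1 32G 2.9G 27G 10% /etc/hostname
/dev/sda1 4.9G 2.7G 1.9G 59% /var/lib/containerd
"""

# Listing taken on the worker.
WORKER_DF = """
/dev/sda1 4.9G 2.7G 1.9G 59% /
/dev/sdb1 32G 4.1G 26G 14% /var/vcap/data
/dev/sdc1 32G 2.9G 27G 10% /var/vcap/store
"""

pod_mounts = parse_df(POD_DF)
worker_mounts = parse_df(WORKER_DF)

# Pair each mount seen inside the pod with the worker mount on the same device.
pairs = {pod_mounts[dev]: worker_mounts[dev] for dev in pod_mounts}
for pod_path, worker_path in pairs.items():
    print(f"{pod_path:<22} -> {worker_path}")
```

Matching on the device name is enough here because each disk appears exactly once in both listings.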

In conclusion
Ephemeral space for a pod, if defined as an emptyDir, is stored on the /var/vcap/data mount.
Temporary files, and anything else that accumulates without being directed to ephemeral storage, are written under /var/vcap/store, where containerd stores its filesystem layers.
The exception is a directly specified host folder: in that case the data is written to that folder, on whichever disk the folder resides.
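
The host folder exception can be sketched as a longest-prefix lookup: the disk that receives the data is the one whose mount point is the deepest ancestor of the host path. The mount table below lists only the three disks discussed in this article (a real worker has additional mounts, such as tmpfs), and backing_mount is a hypothetical helper. It needs Python 3.9 or later.

```python
from pathlib import PurePosixPath

# Only the three worker disks from this article; a real node has more mounts.
WORKER_MOUNTS = {
    "/": "/dev/sda1",
    "/var/vcap/data": "/dev/sdb1",
    "/var/vcap/store": "/dev/sdc1",
}


def backing_mount(host_path: str) -> str:
    """Return the deepest mount point that contains the absolute host_path."""
    path = PurePosixPath(host_path)
    candidates = [m for m in WORKER_MOUNTS if path.is_relative_to(m)]
    return max(candidates, key=len)


for host_path in (
    "/var/vcap/store/twistlock-data",  # hostPath from diskmount.yaml
    "/var/lib/containerd",             # hostPath from diskmount.yaml
    "/var/vcap/data/kubelet/pods",     # where emptyDir volumes are kept
):
    mount = backing_mount(host_path)
    print(f"{host_path} -> {mount} ({WORKER_MOUNTS[mount]})")
```

The comparison is done per path component, so a sibling such as /var/vcap/storefoo would resolve to / rather than to /var/vcap/store.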

Environment

VMware Tanzu Kubernetes Grid Integrated Edition 1.x

Resolution

Ephemeral storage