Scenarios where Pods remain in CrashLoopBackOff with TKG 1.2.1 using PersistentVolumeClaim and securityContext.fsGroup

Article ID: 317050


Products

VMware

Issue/Introduction

Symptoms:
Grafana Pod Scenario:
  • After deploying Grafana in Tanzu Kubernetes Grid (TKG) 1.2.1, the UI is inaccessible.
  • You might see an error stating "no healthy upstream" when accessing the Grafana UI.
  • You see that the grafana pod in the tanzu-system-monitoring namespace has a status of CrashLoopBackOff:
kubectl -n tanzu-system-monitoring get pod

NAME                                             READY   STATUS             RESTARTS   AGE
grafana-857b868b67-7jkch                         1/2     CrashLoopBackOff   1          2m20s
  • You see messages similar to the following when examining the logs for the grafana container in the failing grafana pod:
kubectl -n tanzu-system-monitoring logs grafana-857b868b67-7jkch -c grafana

GF_PATHS_DATA='/var/lib/grafana' is not writable.
You may have issues with file permissions, more information here: http://docs.grafana.org/installation/docker/#migration-from-a-previous-version-of-the-docker-container-to-5-1-or-later
mkdir: cannot create directory '/var/lib/grafana/plugins': Permission denied
t=2021-01-04T18:42:03+0000 lvl=info msg="Starting Grafana" logger=server version=7.0.3 commit=unknown-dev branch=master compiled=2020-09-07T23:00:02+0000
t=2021-01-04T18:42:03+0000 lvl=info msg="Config loaded from" logger=settings file=/usr/share/grafana/conf/defaults.ini
t=2021-01-04T18:42:03+0000 lvl=info msg="Config loaded from" logger=settings file=/etc/grafana/grafana.ini
t=2021-01-04T18:42:03+0000 lvl=info msg="Config overridden from command line" logger=settings arg="default.paths.data=/var/lib/grafana"
t=2021-01-04T18:42:03+0000 lvl=info msg="Config overridden from command line" logger=settings arg="default.paths.logs=/var/log/grafana"
t=2021-01-04T18:42:03+0000 lvl=info msg="Config overridden from command line" logger=settings arg="default.paths.plugins=/var/lib/grafana/plugins"
t=2021-01-04T18:42:03+0000 lvl=info msg="Config overridden from command line" logger=settings arg="default.paths.provisioning=/etc/grafana/provisioning"
t=2021-01-04T18:42:03+0000 lvl=info msg="Config overridden from command line" logger=settings arg="default.log.mode=console"
t=2021-01-04T18:42:03+0000 lvl=info msg="Config overridden from Environment variable" logger=settings var="GF_PATHS_DATA=/var/lib/grafana"
t=2021-01-04T18:42:03+0000 lvl=info msg="Config overridden from Environment variable" logger=settings var="GF_PATHS_LOGS=/var/log/grafana"
t=2021-01-04T18:42:03+0000 lvl=info msg="Config overridden from Environment variable" logger=settings var="GF_PATHS_PLUGINS=/var/lib/grafana/plugins"
t=2021-01-04T18:42:03+0000 lvl=info msg="Config overridden from Environment variable" logger=settings var="GF_PATHS_PROVISIONING=/etc/grafana/provisioning"
t=2021-01-04T18:42:03+0000 lvl=info msg="Config overridden from Environment variable" logger=settings var="GF_SECURITY_ADMIN_USER=admin"
t=2021-01-04T18:42:03+0000 lvl=info msg="Config overridden from Environment variable" logger=settings var="GF_SECURITY_ADMIN_PASSWORD=*********"
t=2021-01-04T18:42:03+0000 lvl=info msg="Path Home" logger=settings path=/usr/share/grafana
t=2021-01-04T18:42:03+0000 lvl=info msg="Path Data" logger=settings path=/var/lib/grafana
t=2021-01-04T18:42:03+0000 lvl=info msg="Path Logs" logger=settings path=/var/log/grafana
t=2021-01-04T18:42:03+0000 lvl=info msg="Path Plugins" logger=settings path=/var/lib/grafana/plugins
t=2021-01-04T18:42:03+0000 lvl=info msg="Path Provisioning" logger=settings path=/etc/grafana/provisioning
t=2021-01-04T18:42:03+0000 lvl=info msg="App mode production" logger=settings
t=2021-01-04T18:42:03+0000 lvl=info msg="Connecting to DB" logger=sqlstore dbtype=sqlite3
t=2021-01-04T18:42:03+0000 lvl=info msg="Starting DB migration" logger=migrator
t=2021-01-04T18:42:03+0000 lvl=eror msg="Server shutdown" logger=server reason="Service init failed: Migration failed err: unable to open database file"

Note: The preceding log excerpts are only examples. Dates, times, and environment variables may vary depending on your environment.



Harbor Registry as Shared Service scenario; harbor-trivy-0 Pod:
  • After deploying Harbor as a Shared Service in Tanzu Kubernetes Grid (TKG) 1.2.1, the harbor-trivy-0 Pod remains in CrashLoopBackOff.
  • You see that the harbor-trivy-0 pod in the tanzu-system-registry namespace has a status of CrashLoopBackOff:
kubectl get pod harbor-trivy-0 -n tanzu-system-registry

NAME             READY   STATUS             RESTARTS   AGE
harbor-trivy-0   0/1     CrashLoopBackOff   1          2m20s
  • You see messages similar to the following when examining the logs for the harbor-trivy-0 pod:
kubectl logs harbor-trivy-0 -n tanzu-system-registry
 
{"level":"warning","msg":"trivy cache dir does not exist","path":"/home/scanner/.cache/trivy","time":"2021-01-17T17:01:13Z"}
{"level":"fatal","msg":"Error: checking config: creating trivy cache dir: mkdir /home/scanner/.cache/trivy: permission denied","time":"2021-01-17T17:01:13Z"}


Environment

VMware Tanzu Kubernetes Grid 1.x
VMware Tanzu Kubernetes Grid Plus 1.x

Cause

This is a known issue in TKG 1.2.1.
Refer to the "vSphere Issues" section of the TKG 1.2 Release Notes:

"Pods using PersistentVolumeClaim do not start or remain in the CrashLoopBackOff status, and Grafana and Harbor extension deployments fail"


Resolution

This is a known issue affecting Tanzu Kubernetes Grid 1.2.1. There is currently no resolution. 

Workaround:
To work around this issue in a TKG cluster that has already been deployed, edit the vsphere-csi-controller deployment in the kube-system namespace using the following steps. This is equivalent to running the commands to "patch" the deployment that are laid out in the TKG 1.2 Release Notes.
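If you prefer to apply the change non-interactively, the same edit can be sketched as a kubectl JSON patch. The container index within the deployment varies between releases, so this is a sketch: list the container names first, then substitute the index of csi-provisioner for the placeholder N before running the patch.

```shell
# List the container names in order to find the index of csi-provisioner
kubectl -n kube-system get deployment vsphere-csi-controller \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\n"}{end}'

# Replace N below with the zero-based index of csi-provisioner from the list above,
# then append the flag to that container's args:
kubectl -n kube-system patch deployment vsphere-csi-controller --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/N/args/-","value":"--default-fstype=ext4"}]'
```

Both approaches result in the same deployment spec; use whichever fits your change process.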

kubectl -n kube-system edit deployment vsphere-csi-controller

Look for the section related to the csi-provisioner, similar to the following excerpt:

      - args:
        - --v=4
        - --timeout=300s
        - --csi-address=$(ADDRESS)
        - --leader-election
        env:
        - name: ADDRESS
          value: /csi/csi.sock
        image: registry.tkg.vmware.run/csi/csi-provisioner:v2.0.0_vmware.1
        imagePullPolicy: IfNotPresent
        name: csi-provisioner
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /csi
          name: socket-dir


Add the following to the list of args:

- --default-fstype=ext4
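After the edit, the args list for the csi-provisioner container should look similar to the following (the remaining fields are unchanged):

```yaml
      - args:
        - --v=4
        - --timeout=300s
        - --csi-address=$(ADDRESS)
        - --leader-election
        - --default-fstype=ext4
```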

After you save the change, the vsphere-csi-controller pod in the kube-system namespace is recreated automatically.

If Grafana or Harbor are already deployed, you will need to remove and redeploy them so that their persistent volumes are created properly.

To work around this issue for newly created clusters, make the same change noted previously in the .tkg/providers/infrastructure-vsphere/v0.7.1/ytt/csi.lib.yaml file prior to deploying your clusters.