vSphere with Tanzu Supervisor certificates expired and "./certmgr certificates rotate" reports "[Process exited with status 1]"

Products

VMware vSphere ESXi
VMware vSphere Kubernetes Service

Issue/Introduction

Symptoms:

On vSphere with Tanzu, the Supervisor cluster certificates have expired.
When attempting to correct expired certificates with the certmgr utility, it reports:

# ./certmgr certificates rotate --debug

2023/02/03 14:03:45 [Process exited with status 1]

Further analysis of /var/log/vmware/cert/certmgr.log shows nothing significant.

Environment

VMware vSphere 7.0 with Tanzu

Cause

If the Supervisor certificates have been expired for some time, the fluent-bit or registry-agent pod logs and the journalctl logs can grow substantially on one or more of the Supervisor nodes. This can fill the root (/) filesystem.

If this happens, one or more of the services may also fail to respond. Corrective action on the Supervisor node filesystem(s) may be needed, followed by a reboot of the nodes to clean up the kernel and operating system.

Resolution

The following steps require SSH access to the Supervisor Control Plane VMs. Carry out this workaround with VMware support engineers to ensure that system-critical resources are not adversely impacted. For specifics on this process, see the following KB: Troubleshooting vSphere with Tanzu (TKGS) Supervisor Control Plane VMs

1. Check the root filesystem of all Supervisor nodes:

SSH into each of the Supervisor nodes via their management network IP address. Use the following commands to check for filesystem consumption on the nodes:

# df -h
# df -h | grep /dev/sda
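
When several nodes need to be checked, the usage figure can be extracted with a small helper. This is a minimal sketch rather than part of the official procedure: the function name root_usage_pct and the 90 percent threshold are illustrative assumptions. In this sketch, lines starting with # are comments, not root prompts.

```shell
# Print the usage percentage (digits only) of the filesystem holding the given path.
# Hypothetical helper; defaults to the root filesystem.
root_usage_pct() {
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Example threshold of 90 percent (an assumption, adjust as needed).
pct=$(root_usage_pct /)
if [ "$pct" -ge 90 ]; then
    echo "root filesystem is ${pct}% full"
else
    echo "root filesystem usage is ${pct}%"
fi
```

The -P option makes df print one line per filesystem in the POSIX format, so the fifth field on the second line is always the capacity column.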

If the root (/) filesystem is full, use the following command to identify the 10 largest files:

# find / -path /proc -prune -o -type f -exec du -Sh {} + | sort -rh | head -n 10
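
A variant of the search above that stays on a single filesystem can be sketched as follows. The function name largest_files is hypothetical, and the -printf action assumes GNU find; sizes are printed in bytes rather than in human-readable units.

```shell
# List the N largest regular files under DIR by apparent size (bytes),
# without descending into other mounted filesystems.
# Hypothetical helper; -printf assumes GNU find.
largest_files() {
    dir=${1:-/}
    count=${2:-10}
    find "$dir" -xdev -type f -printf '%s\t%p\n' 2>/dev/null | sort -rn | head -n "$count"
}
```

Calling largest_files / 10 would list the ten largest files on the root filesystem. Because -xdev keeps find off other mounts such as /proc, every file listed is one that occupies space on the filesystem being inspected.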

The most common causes of root filesystem consumption after Supervisor Cluster certificate expiration are the fluent-bit and registry-agent workloads and the journalctl logs. These appear in the following directories:

# /var/log/pods/vmware-system-logging_fluentbit-*/fluentbit/
# /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/<#>/fs/tmp/
# /var/log/journal/<UNIQUE_JOURNAL_ID>

NOTE: These paths contain identifiers that are unique to each environment, hence the * in the fluent-bit path, the <#> in the registry-agent path and the <UNIQUE_JOURNAL_ID> in the journalctl path.

Example:
/var/log/pods/vmware-system-logging_fluentbit-dc4j8_13c857bf-380d-4913-b2b5-31b0d4a93130/fluentbit/0.log
/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/135/fs/tmp/registry-agent.4201a556ae70b2ca7876eccd985f1b01.root.log.ERROR.20230915-163105.1
/var/log/journal/a963ac261cb04fa3a8b7221989281ca4/system.journal

Change directory into one of the above file paths and list files to confirm their size in preparation for removal. Use the ls -lah command to show all files in the directory with their file size.
- fluent-bit will show a 0.log and potentially 0.log.<date> in the above directory
- registry-agent will show files named registry-agent.<NODE_NAME>.root.log.<SEVERITY>.<timestamp>, where the severity is INFO, WARNING or ERROR
- journalctl logs will show as system.journal and system@<UNIQUE_ID>.journal

2. Clear space on the root filesystem

Clear the current fluent-bit log, and any other fluent-bit logs in the directory if needed:

# cd /var/log/pods/vmware-system-logging_fluentbit-*/fluentbit/

# echo > 0.log
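
If the wildcard matches more than one fluent-bit pod directory, the cd command above may fail. One way around this is to truncate each matching file directly, sketched below with a hypothetical helper named truncate_files. Truncating in place, rather than deleting, keeps the file available to the process that is still writing to it.

```shell
# Truncate every file passed as an argument to zero bytes.
# Paths that do not exist (for example an unmatched wildcard) are skipped.
# Hypothetical helper name.
truncate_files() {
    for f in "$@"; do
        [ -f "$f" ] || continue
        : > "$f"
        echo "truncated $f"
    done
}
```

It could be called as truncate_files /var/log/pods/vmware-system-logging_fluentbit-*/fluentbit/0.log. Unlike echo > 0.log, which leaves a single newline in the file, the : > redirection leaves the file completely empty.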

Clear the registry-agent .INFO .WARNING and .ERROR logs if they are consuming more than 100 MB of space:

# cd /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/<#>/fs/tmp/
# echo > registry-agent.4201a556ae70b2ca7876eccd985f1b01.root.log.ERROR.20230915-163105.1
# echo > registry-agent.4201a556ae70b2ca7876eccd985f1b01.root.log.INFO.20230915-163105.1
# echo > registry-agent.4201a556ae70b2ca7876eccd985f1b01.root.log.WARNING.20230915-163105.1
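
Because the registry-agent file names differ per node and per timestamp, the size check and the truncation can be combined. The following is a sketch under stated assumptions: the helper name truncate_large_agent_logs is hypothetical, the name pattern is inferred from the example file names in this article, and the -maxdepth option and the truncate command assume GNU findutils and coreutils.

```shell
# Truncate registry-agent logs in DIR that are larger than MIN_SIZE.
# MIN_SIZE uses find's -size syntax (default 100M, per the 100 MB guidance).
# Hypothetical helper; the name pattern follows the example file names.
truncate_large_agent_logs() {
    dir=$1
    min_size=${2:-100M}
    find "$dir" -maxdepth 1 -type f -name 'registry-agent.*.root.log.*' \
        -size "+${min_size}" -exec truncate -s 0 {} +
}
```

The directory argument would be the snapshots/<#>/fs/tmp/ path shown above, with <#> replaced by the number found on the node. Files at or below the threshold, and files that do not match the registry-agent name pattern, are left untouched.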

Clear the journalctl log so it retains only the most recent 500MB of logging:

# journalctl --vacuum-size=500M

3. Reboot the Supervisor node:

# reboot

4. After the Supervisor node finishes restarting, confirm that kubelet.service and all the pods are running:

# systemctl status kubelet.service

# crictl ps -a
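
Since crictl ps -a lists containers in every state, the output can be long. The filter below is a rough sketch for narrowing it down: the helper name non_running_containers is hypothetical, and it assumes the state appears in each row as the whitespace-delimited word Running.

```shell
# Read `crictl ps -a` output on stdin and print only the rows
# whose state is not Running. The header row is dropped.
# Hypothetical helper name.
non_running_containers() {
    awk 'NR > 1 && $0 !~ /[ \t]Running[ \t]/'
}
```

It would be used as crictl ps -a | non_running_containers. A row in another state is not necessarily a failure, because earlier attempts of a restarted container are also listed with -a, so treat the result as a starting point for inspection.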

5. Re-run the certmgr utility.