Symptoms:
# ./certmgr certificates rotate --debug
2023/02/03 14:03:45 [Process exited with status 1]
VMware vSphere 7.0 with Tanzu
If the Supervisor certificates have been expired for some time, the fluent-bit or registry-agent pod logs and the systemd journal logs can grow substantially on one or more of the Supervisor nodes. This growth can fill the root (/) filesystem.
If this happens, one or more of the services may also stop responding. Corrective action on the Supervisor node filesystem(s) may be needed, followed by a reboot of the nodes to clean up the kernel and operating system.
The following steps require SSH access to the Supervisor Control Plane VMs. This workaround should be carried out with VMware support engineers to ensure that system-critical resources are not adversely impacted. For specifics on this process, see the following KB: Troubleshooting vSphere with Tanzu (TKGS) Supervisor Control Plane VMs
1. Check the root filesystem of all Supervisor nodes:
# df -h
# df -h | grep /dev/sda
# find / -path /proc -prune -o -type f -exec du -Sh {} + | sort -rh | head -n 10
Check the size of the following log locations:
/var/log/pods/vmware-system-logging_fluentbit-*/fluentbit/
/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/<#>/fs/tmp/
/var/log/journal/<UNIQUE_JOURNAL_ID>
NOTE: The fluent-bit, registry-agent, and journal paths contain identifiers that are unique to each environment, hence the * in the fluent-bit path, the <#> in the registry-agent path, and <UNIQUE_JOURNAL_ID> in the journal path.
Example:
/var/log/pods/vmware-system-logging_fluentbit-dc4j8_13c857bf-380d-4913-b2b5-31b0d4a93130/fluentbit/0.log
/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/135/fs/tmp/registry-agent.4201a556ae70b2ca7876eccd985f1b01.root.log.ERROR.20230915-163105.1
/var/log/journal/a963ac261cb04fa3a8b7221989281ca4/system.journal
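The check in step 1 can be repeated on each node with a small helper. The following is a minimal sketch, not part of the documented procedure: the function names usage_pct and warn_if_full and the 90 percent threshold are assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helpers; names and threshold are illustrative assumptions.

# usage_pct MOUNTPOINT
# Prints the used percentage of the filesystem that holds MOUNTPOINT as a
# bare integer, parsed from POSIX-format `df -P` output.
usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# warn_if_full MOUNTPOINT THRESHOLD
# Prints a warning and returns 1 when usage is at or above THRESHOLD percent;
# otherwise prints an OK line and returns 0.
warn_if_full() {
    pct=$(usage_pct "$1")
    if [ "$pct" -ge "$2" ]; then
        echo "WARNING: $1 is ${pct}% full"
        return 1
    fi
    echo "OK: $1 is ${pct}% full"
}

# Example (assumed threshold of 90 percent):
#   warn_if_full / 90
```

If the helper reports a warning, the find command from step 1 identifies which of the three log locations is consuming the space.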
2. Clear space on the root filesystem:
# cd /var/log/pods/vmware-system-logging_fluentbit-*/fluentbit/
# echo > 0.log
Clear the registry-agent .INFO, .WARNING, and .ERROR logs if they are consuming more than 100 MB of space. The file names below are taken from the example above and will differ in each environment:
# cd /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/<#>/fs/tmp/
# echo > registry-agent.4201a556ae70b2ca7876eccd985f1b01.root.log.ERROR.20230915-163105.1
# echo > registry-agent.4201a556ae70b2ca7876eccd985f1b01.root.log.INFO.20230915-163105.1
# echo > registry-agent.4201a556ae70b2ca7876eccd985f1b01.root.log.WARNING.20230915-163105.1
# journalctl --vacuum-size=500M
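The size check and truncation in step 2 can be combined so that only oversized files are emptied. This is a minimal sketch under stated assumptions: truncate_if_larger is a hypothetical helper name, and the 100 MB limit from this step is interpreted as 104857600 bytes. Unlike `echo > file`, which leaves a single newline, `: > file` leaves the file at zero bytes; both keep the file in place so the writing process can continue to use it.

```shell
#!/bin/sh
# Hypothetical helper; the name is an illustrative assumption.

# truncate_if_larger FILE LIMIT_BYTES
# Empties FILE in place when it is larger than LIMIT_BYTES. Files at or
# below the limit, and paths that are not regular files, are left untouched.
truncate_if_larger() {
    [ -f "$1" ] || return 0
    # Arithmetic expansion strips any padding that wc adds on some systems.
    size=$(($(wc -c < "$1")))
    if [ "$size" -gt "$2" ]; then
        : > "$1"
        echo "truncated $1 (was $size bytes)"
    fi
}

# Example, run from inside the registry-agent tmp directory located in
# step 1 (100 MB assumed to mean 100 * 1024 * 1024 bytes):
#   for f in registry-agent.*.log.*; do
#       truncate_if_larger "$f" 104857600
#   done
```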
3. Reboot the Supervisor node:
# reboot
4. After the Supervisor node finishes restarting, confirm that kubelet.service and all the pods are running:
# systemctl status kubelet.service
# crictl ps -a
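On a node with many containers, the `crictl ps -a` listing can be summarized rather than read line by line. The sketch below is an illustration only: count_not_running is a hypothetical helper, and it assumes the default table output of crictl, in which the STATE column shows the word Running for running containers.

```shell
#!/bin/sh
# Hypothetical helper; the name and the parsing approach are assumptions.

# count_not_running
# Reads `crictl ps -a` table output on stdin, skips the header row and blank
# lines, and prints how many container rows do not show the state "Running".
count_not_running() {
    awk 'NR > 1 && NF > 0 && $0 !~ /[[:space:]]Running[[:space:]]/ { n++ }
         END { print n + 0 }'
}

# Example use on a Supervisor node:
#   systemctl is-active kubelet.service
#   crictl ps -a | count_not_running
```

A non-zero count is a prompt to inspect the full listing, not a failure in itself, because containers that have completed their work can legitimately appear as Exited.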
5. Re-run the certmgr utility.