Kubernetes pods across multiple namespaces (including kube-system, prelude, and vmsp-) fail to start properly and remain in an inconsistent or defunct state. Additionally, the admin@system and vmware-system-user accounts may appear disconnected.
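A quick way to spot the affected pods is to filter `kubectl get pods -A` for anything not Running or Completed. The sample output below is inlined as an assumption (not captured from a real VCF Automation appliance) so the filter can be shown without a live cluster; on the appliance you would pipe the real command instead.

```shell
# Stand-in for `kubectl get pods -A` output on an affected appliance.
sample='NAMESPACE     NAME          READY   STATUS             RESTARTS
kube-system   coredns-abc   1/1     Running            0
prelude       vco-app-0     0/1     CrashLoopBackOff   12'

# Keep only unhealthy pods (and drop the header row).
printf '%s\n' "$sample" | grep -Ev 'Running|Completed|STATUS'
```

On a live system, replace the `printf` with `kubectl get pods -A` ahead of the same `grep` filter.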
When attempting to collect logs, the process fails with the following API error:
com.vmware.vrealize.lcm.vmsp.common.exception.RestClientException: API failed for {}:VCF Automation
The /dev/sda3 partition filled to 100% capacity, causing a severe storage outage. The lack of disk space prevented the system from writing changes and led to unrecoverable corruption of the Kubernetes cluster state. As a result, even after disk space is freed and the partition is expanded, the pods are unable to recover and start successfully.
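The disk-full condition can be confirmed with standard tools. This is a minimal sketch; /var/log is only an example directory and is not necessarily the largest consumer on a given appliance.

```shell
# Show usage of all mounted filesystems; look for the partition at 100%.
df -h

# List the ten largest entries under an example directory, largest first.
# -x keeps du on one filesystem; errors from unreadable paths are discarded.
du -xh /var/log 2>/dev/null | sort -rh | head -n 10
```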
Because the Kubernetes state cannot be recovered by clearing space or restarting the pods, you must redeploy the environment and restore from a known-good backup.
1. Verify that a valid SFTP backup taken prior to the storage outage exists.
2. Delete the current VCF Automation deployment.
3. Redeploy the VCF Automation environment.
4. Perform a restore operation using the verified SFTP backup to bring the system back online.
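The backup-verification step above amounts to checking that the newest backup file predates the outage. The sketch below is hypothetical: BACKUP_DIR stands in for a local copy (or mount) of the SFTP backup directory, and the file name and dates are made up for illustration.

```shell
# Stand-in backup directory with one backup file dated before the outage.
BACKUP_DIR=$(mktemp -d)
touch -d '2024-01-01' "$BACKUP_DIR/vcfa-backup-20240101.tar.gz"
OUTAGE_EPOCH=$(date -d '2024-01-15' +%s)   # when /dev/sda3 filled up (example)

# Find the most recently modified backup and compare its mtime to the outage.
newest=$(ls -t "$BACKUP_DIR" | head -n 1)
newest_epoch=$(stat -c %Y "$BACKUP_DIR/$newest")
if [ "$newest_epoch" -lt "$OUTAGE_EPOCH" ]; then
  echo "backup $newest predates the outage: candidate for restore"
else
  echo "newest backup postdates the outage: inspect before restoring"
fi
```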
Additional Information:
Truncating apiserver logs or clearing /var/log/messages (e.g., cat /dev/null > /var/log/messages) can temporarily free up space and allow basic commands like passwd to succeed, but it does not resolve the underlying Kubernetes corruption caused by the disk-full event.
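Truncating in place matters because deleting a log file a daemon still holds open does not release the space until the daemon exits, whereas truncation frees the blocks immediately. The demonstration below runs on a temporary file rather than the live /var/log/messages:

```shell
# Create a throwaway log file with some data in it.
log=$(mktemp)
printf 'stale log line\n%.0s' 1 2 3 > "$log"

# Truncate to zero bytes without deleting the file;
# equivalent to: cat /dev/null > "$log"
: > "$log"

wc -c < "$log"   # prints 0
```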
Ensure that incremental backups are rotating correctly on your SFTP server so that the backup destination does not itself become full.
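One way to keep the destination from filling is to list backups older than the retention window so they can be pruned. This sketch is hypothetical: the directory, naming pattern, and 14-day retention are assumptions, shown here on a stand-in directory.

```shell
# Stand-in for the real backup directory on the SFTP server.
BACKUP_DIR=$(mktemp -d)
RETENTION_DAYS=14
touch -d '30 days ago' "$BACKUP_DIR/vcfa-backup-old.tar.gz"
touch "$BACKUP_DIR/vcfa-backup-new.tar.gz"

# List backups older than the retention window (candidates for deletion).
find "$BACKUP_DIR" -name 'vcfa-backup-*.tar.gz' -mtime +"$RETENTION_DAYS"
```

Adding `-delete` to the `find` command would prune the listed files; keep it off until the listing has been reviewed.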