Kubernetes pods fail to start and remain in a defunct state after /dev/sda3 reaches 100% capacity in VCF Automation


Article ID: 434827

Updated On:

Products

VCF Automation

Issue/Introduction

Kubernetes pods across multiple namespaces (including kube-system, prelude, and namespaces prefixed with vmsp-) fail to start and remain in an inconsistent or defunct state. Additionally, the admin@system and vmware-system-user accounts may appear to be disconnected.

When attempting to collect logs, the process fails with the following API error:

com.vmware.vrealize.lcm.vmsp.common.exception.RestClientException: API failed for {}:
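The pod symptoms can be confirmed from a shell on the VCF Automation appliance. The commands below are a general sketch (standard kubectl and df usage, not specific to this article); exact pod and namespace names vary per environment:

```shell
# Pods stuck in a defunct state appear with statuses other than
# Running/Completed (e.g. CrashLoopBackOff, Error, ContainerCreating).
kubectl get pods -A --no-headers 2>/dev/null | grep -Ev 'Running|Completed' || true

# Confirm whether the affected partition is at or near 100% capacity.
df -h /dev/sda3 2>/dev/null || true
```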

Environment

VCF Automation

Cause

The /dev/sda3 partition filled to 100% capacity, causing a severe storage outage. The lack of disk space prevented the system from writing state changes and led to unrecoverable corruption of the Kubernetes cluster state. As a result, the pods cannot recover and start successfully even after disk space is freed and the partition is expanded.
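To see what consumed the space, the largest directories on the partition can be listed with du. This is a sketch using standard GNU coreutils; the assumption that log growth under /var is the culprit is for illustration only:

```shell
# Show current usage of the affected partition.
df -h /dev/sda3 2>/dev/null || true

# List the largest directories under /var (log growth is a common cause);
# -x stays on one filesystem, sort -rh puts the biggest entries first.
du -xh --max-depth=2 /var 2>/dev/null | sort -rh | head -10
```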

Resolution

Because the Kubernetes state cannot be recovered by clearing space or restarting the pods, you must redeploy the environment and restore from a known-good backup.

  1. Verify that a valid SFTP backup exists prior to the storage outage.

  2. Delete the current VCF Automation deployment.

  3. Deploy a new VCF Automation environment.

  4. Perform a restore operation using the verified SFTP backup to bring the system back online.
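Because a backup taken after the disk-full event may already contain the corrupted state, it is worth comparing timestamps before restoring. A minimal sketch; the dates below are placeholders, and the comparison logic is an illustration rather than a supported validation step:

```shell
# Placeholder timestamps: replace with the actual outage time and the
# modification time of the candidate SFTP backup.
OUTAGE_TIME=$(date -d '2024-06-01 00:00' +%s)   # when /dev/sda3 hit 100%
BACKUP_TIME=$(date -d '2024-05-31 22:00' +%s)   # candidate backup timestamp

if [ "$BACKUP_TIME" -lt "$OUTAGE_TIME" ]; then
    echo "Backup predates the outage; candidate for restore"
else
    echo "Backup was taken after the outage; choose an earlier one" >&2
fi
```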

Additional Information

  • Truncating apiserver logs or clearing /var/log/messages (e.g., cat /dev/null > /var/log/messages) can temporarily free up space to allow basic commands like passwd to succeed, but it does not resolve the underlying Kubernetes corruption caused by the disk full event.

  • Ensure that incremental backups are rolling over correctly on your SFTP server to prevent the backup destination from becoming full.
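As a preventive measure, partition usage can be checked periodically so that action is taken before a full-disk event corrupts the Kubernetes state again. A minimal sketch using GNU df; the 90% threshold and the choice of /dev/sda3 are assumptions to adapt to your environment:

```shell
#!/bin/sh
# Warn when /dev/sda3 usage crosses a threshold (both are assumptions).
THRESHOLD=90
USAGE=$(df --output=pcent /dev/sda3 2>/dev/null | tail -1 | tr -dc '0-9')
if [ "${USAGE:-0}" -ge "$THRESHOLD" ]; then
    echo "WARNING: /dev/sda3 at ${USAGE}% capacity" >&2
fi
```

A check like this can be scheduled via cron and pointed at an alerting channel of your choice.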