Guest cluster is failing its health checks and indicates node disk pressure.

Article ID: 414488

Updated On:

Products

  • VMware vSphere Kubernetes Service
  • VCF Private AI Services

Issue/Introduction

  • The VMware Private AI Service (PAIS) configuration waits indefinitely for its health checks to pass.
  • Describing the PAIS cluster shows:
    - lastTransitionTime: "YYYY-MM-DDTHH:MM:SSZ"
        reason: StorageProviderNotInstalled
        severity: Warning
        status: "False"
  • Disk usage on the cluster's control plane VMs is between 80% and 100%.
  • Describing the Tanzu guest cluster shows:
    Message:               * Machine <machine name>:
      * BootstrapConfigReady:
        * DataSecretAvailable: Condition not yet reported
        * CertificatesAvailable: Condition not yet reported
      * NodeHealthy:
        * Node.DiskPressure: kubelet has disk pressure

Environment

  • VMware vSphere Kubernetes Service
  • VMware Private AI Service

Cause

Once disk usage on a control plane node reaches the kubelet's eviction threshold, the node reports disk pressure: pods start getting evicted and no new pods can be scheduled onto that node. If PAIS pods are evicted, the PAIS health checks fail.
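
To confirm which nodes are reporting disk pressure, the node conditions can be queried with kubectl. The following is a minimal sketch, assuming kubectl access to the guest cluster; the function names are hypothetical helpers introduced for illustration and are not part of any product tooling.

```shell
#!/usr/bin/env bash

# Print one "<node-name><TAB><DiskPressure status>" line per node.
# Assumes the current kubectl context points at the guest cluster.
list_disk_pressure_status() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}'
}

# Read "<node-name><TAB><status>" lines on stdin and print only the
# names of nodes whose DiskPressure status is "True".
nodes_with_disk_pressure() {
  awk -F'\t' '$2 == "True" { print $1 }'
}
```

Running `list_disk_pressure_status | nodes_with_disk_pressure` should print only the nodes that need attention; empty output suggests no node currently reports disk pressure.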

Resolution

SSH to each control plane node and bring the root filesystem's usage below 80%. This can typically be done by reducing the volume of logs on the system:

  1. Check the current disk usage of the systemd journal:
      • journalctl --disk-usage
    • To free up space by deleting archived journal files until they occupy no more than 500 MB:
      • journalctl --vacuum-size=500M
  2. Navigate to the /var/log directory and run the following command to see which subdirectories, such as the journal directory, consume the most space:
      • du -h --max-depth=1
  3. If the root filesystem usage remains above 80%, run the following command to identify the largest files on the affected node. Review and remove any unnecessary files such as outdated logs or manual backups as appropriate:
      • find / -path /proc -prune -o -type f -exec du -Sh {} + | sort -rh | head -n 10
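
After each cleanup step, it helps to re-check the root filesystem against the 80% target. The following is a minimal sketch of such a check, assuming a POSIX-compatible `df` on the node; the function names and the `THRESHOLD` variable are hypothetical and introduced only for illustration.

```shell
#!/usr/bin/env bash

# Target from the steps above: root filesystem usage should end up below 80%.
THRESHOLD=80

# Print the used percentage of the filesystem holding the given path
# as a bare integer (for example, 83).
usage_pct() {
  df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Succeed (exit status 0) when the given percentage is at or above the threshold.
needs_cleanup() {
  [ "$1" -ge "$THRESHOLD" ]
}

# Report whether the root filesystem still needs cleanup.
report_root_usage() {
  local pct
  pct=$(usage_pct /)
  if needs_cleanup "$pct"; then
    echo "Root filesystem at ${pct}%: cleanup still needed"
  else
    echo "Root filesystem at ${pct}%: below ${THRESHOLD}%"
  fi
}
```

Calling `report_root_usage` after vacuuming the journal or removing files shows whether further cleanup is required before the node can be expected to clear its disk pressure condition.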

If the issue persists, contact Broadcom Support for further assistance.

Additional Information