Guest Cluster Upgrade Stalls with Control Plane Node stuck in "Provisioning" State


Article ID: 429661

Products

Tanzu Kubernetes Runtime

Issue/Introduction

  • When performing a rolling upgrade of a Tanzu Kubernetes Grid (TKG) Guest Cluster, the process may stall after partially completing.
  • One of the Control Plane nodes remains in a Provisioning or Pending state indefinitely.
  • The virtual machine for the new node is powered on, but logging into the guest OS and running crictl ps -a returns an empty list, indicating the container runtime has not started any system pods.
  • kubectl get machine -n <namespace> shows that the Machine object for the new node is not progressing.

Environment

2.3.x, 2.4.x, 2.5.x

Cause

The Management Cluster control plane has insufficient disk space, which prevents the controllers from reconciling the Guest Cluster and provisioning the new node.

Resolution

  1.  Verify and Clean Management Cluster Disk Space

    1. SSH into the Management Cluster Control Plane nodes

    2. Check disk utilization: 

      df -h

    3. If the root (/) or log (/var/log) partitions are near 100% utilization, identify and remove old logs or temporary files to bring utilization below 80%.

      Note: Photon OS uses /var/log/messages, while Ubuntu uses /var/log/syslog.

      1. Truncate the messages or syslog log file, based on the OS in use:

        truncate -s 0 /var/log/messages
        truncate -s 0 /var/log/syslog

      2. Remove compressed log files:

        rm -f /var/log/messages.*.gz
        rm -f /var/log/syslog.*.gz
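
      Before deleting anything, it can help to see which files are actually consuming the space. A minimal sketch using standard coreutils (the /var/log path follows the note above; run as root on the node for a complete picture):

```shell
# Show overall filesystem utilization (same check as step 2 above)
df -h /
# List the ten largest entries under /var/log; permission errors are
# suppressed with 2>/dev/null, so run as root for complete results.
largest=$(du -xh /var/log 2>/dev/null | sort -rh | head -10)
echo "$largest"
```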

  2.  Reconcile the Guest Cluster State

    1. Identify the name of the stuck Machine in the Guest Cluster namespace (on the Management Cluster):

      kubectl get machine -n <guest-cluster-namespace>

    2. Delete the stuck Machine object:

      kubectl delete machine <stuck-machine-name> -n <guest-cluster-namespace>

    3. Monitor the recreation process. The Cluster API controller will detect the missing replica (count dropping from 3 to 2) and automatically provision a new VM.
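
The monitoring step can be sketched as follows; the namespace name below is a placeholder, and the guarded check simply keeps the snippet runnable on hosts without kubectl:

```shell
# Sketch: list Machine objects after deleting the stuck one. Substitute
# your guest cluster's namespace for the placeholder below.
if command -v kubectl >/dev/null 2>&1; then
  status=$(kubectl get machine -n guest-cluster-ns 2>&1)
else
  status="kubectl not available on this host"
fi
echo "$status"
# To follow the rollout live (Ctrl-C to stop watching):
#   kubectl get machine -n <guest-cluster-namespace> -w
```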

Additional Information

You can also reclaim space by vacuuming the systemd journal logs.

Review journal disk usage:

journalctl --disk-usage

Vacuum the journal logs to a target size:

journalctl --vacuum-size=100M
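
The two commands above can be combined into a short sketch; the 100M size is an example threshold, and --vacuum-time is a standard journalctl alternative for age-based cleanup:

```shell
# Report current journal disk usage; stderr is captured too, so the
# command degrades gracefully on hosts without a readable journal.
usage=$(journalctl --disk-usage 2>&1 || true)
echo "$usage"
# Reclaim space (as root) by size or by age, for example:
#   journalctl --vacuum-size=100M   # keep at most ~100 MB of journal
#   journalctl --vacuum-time=7d     # or: keep only the last 7 days
```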