In the vSphere web UI under Workload Management, the Supervisor Cluster may be in an Error or Configuring state with errors similar to the following:
Initialized vSphere resources
Deployed Control Plane VMs
Configured Control Plane VMs
Configured Load Balancer fronting the kubernetes API Server
Configured Core Supervisor Services
Service: velero.vsphere.vmware.com. Status: Configuring
Service: tkg.vsphere.vmware.com. Reason: Reconciling. Message: Reconciling.
Initialized vSphere resources
Deployed Control Plane VMs
Configured Control Plane VMs
Configured Load Balancer fronting the kubernetes API Server
Configured Core Supervisor Services
Service: tkg.vsphere.vmware.com. Reason: ReconcileFailed. Message: kapp: Error: waiting on reconcile packageinstall/tkg-controller (packaging.carvel.dev/v1alpha1) namespace: <namespace>:
Finished unsuccessfully (Reconcile failed: (message: kapp: Error: Timed out waiting after 15m0s for resources: [deployment/tkgs-plugin-server (apps/v1) namespace: <namespace>])).
Service: velero.vsphere.vmware.com. Reason: Reconciling. Message: Reconciling.
Customized guest of Supervisor Control plane VM
Configuration error (since DD/M/YYYY, H:MM:SS XM)
System error occurred on Master node with identifier <supervisor node dns name>. Details: Log forwarding sync update failed: Command '['/usr/bin/kubectl', '--kubeconfig', '/etc/kubernetes/admin.conf', 'get', 'configmap', 'fluentbit-config-system', '--namespace', 'vmware-system-logging', '--ignore-not-found=true', '-o', 'json']' returned non-zero exit status 1.
When SSH'd into a Supervisor Control Plane VM, the root disk usage is above 80% or at 100%:
df -h /dev/root
Filesystem Size Used Avail Use% Mounted on
/dev/root ##G ##G ##G 100% /
Many system processes will fail and crash repeatedly while root disk usage on any Supervisor Control Plane VM is above 80% or full.
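The 80% threshold above can also be checked programmatically by parsing `df` output. A minimal sketch, assuming the standard coreutils `df -h` column layout; the function name and sample values are illustrative:

```python
# Parse `df -h /dev/root` output and flag usage above the 80% threshold.
# The sample output format is an assumption based on typical `df -h` columns.

def root_disk_usage_percent(df_output: str) -> int:
    """Return the Use% value for /dev/root from `df -h /dev/root` output."""
    for line in df_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "/dev/root":
            # Use% is the fifth column, e.g. "100%"
            return int(fields[4].rstrip("%"))
    raise ValueError("/dev/root not found in df output")

sample = """Filesystem      Size  Used Avail Use% Mounted on
/dev/root        30G   30G     0 100% /
"""

usage = root_disk_usage_percent(sample)
if usage > 80:
    print(f"WARNING: root disk at {usage}% - clean up before proceeding")
```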
vSphere Supervisor 7
vSphere Supervisor 8
Old, unused images and ReplicaSets left over from previous Supervisor cluster upgrades are not automatically cleaned up.
WARNING: DO NOT RUN THESE SCRIPTS IF THERE IS AN ONGOING SUPERVISOR, TKC, OR SUPERVISOR SERVICE UPGRADE
---------------------------------
Prior to running the below scripts, critical system processes ETCD and kube-apiserver must be healthy.
ETCD and kube-apiserver will experience issues when disk space is above 80% on any of the Supervisor control plane nodes.
Please see the below KB article for steps on cleaning up disk space in the Supervisor cluster:
etcdctl member list -w table
+------------------+---------+------------------------------+--------------------------------+--------------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+------------------------------+--------------------------------+--------------------------------+------------+
| <etcd-member-a> | started | <supervisor node dns name 1> | https://<supervisor-ip-1>:2380 | https://<supervisor-ip-1>:2379 | false |
| <etcd-member-b> | started | <supervisor node dns name 2> | https://<supervisor-ip-2>:2380 | https://<supervisor-ip-2>:2379 | false |
| <etcd-member-c> | started | <supervisor node dns name 3> | https://<supervisor-ip-3>:2380 | https://<supervisor-ip-3>:2379 | false |
+------------------+---------+------------------------------+--------------------------------+--------------------------------+------------+
etcdctl --cluster=true endpoint health -w table
+--------------------------------+--------+-------------+-------+
| ENDPOINT | HEALTH | TOOK | ERROR |
+--------------------------------+--------+-------------+-------+
| https://<supervisor-ip-1>:2379 | true | ##.##ms | |
| https://<supervisor-ip-2>:2379 | true | ##.##ms | |
| https://<supervisor-ip-3>:2379 | true | ##.##ms | |
+--------------------------------+--------+-------------+-------+
etcdctl --cluster=true endpoint status -w table
All ETCD members should be on the same version and the same DB size. Only one member should be considered the ETCD leader.
+--------------------------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://<supervisor-ip-1>:2379 | <etcd-member-a> | #.#.# | ### MB | true | false | ## | ######## | ######## | |
| https://<supervisor-ip-2>:2379 | <etcd-member-b> | #.#.# | ### MB | false | false | ## | ######## | ######## | |
| https://<supervisor-ip-3>:2379 | <etcd-member-c> | #.#.# | ### MB | false | false | ## | ######## | ######## | |
+--------------------------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
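The "one leader, same version" check above can also be scripted against `etcdctl --cluster=true endpoint status -w json`. A hedged sketch; the JSON field names (Status.header.member_id, Status.leader, Status.version) are assumptions based on etcdctl v3 output and should be verified against your etcdctl release:

```python
import json

# Verify exactly one ETCD leader and a single version across all members,
# given the JSON form of `etcdctl --cluster=true endpoint status -w json`.

def check_etcd_status(status_json: str) -> bool:
    members = json.loads(status_json)
    # A member is the leader when its own member_id matches the leader field.
    leaders = [m for m in members
               if m["Status"]["header"]["member_id"] == m["Status"]["leader"]]
    versions = {m["Status"]["version"] for m in members}
    return len(leaders) == 1 and len(versions) == 1

# Illustrative two-member sample (IDs, endpoints, and versions are made up).
sample = json.dumps([
    {"Endpoint": "https://10.0.0.1:2379",
     "Status": {"header": {"member_id": 1}, "leader": 1, "version": "3.5.6"}},
    {"Endpoint": "https://10.0.0.2:2379",
     "Status": {"header": {"member_id": 2}, "leader": 1, "version": "3.5.6"}},
])
print("healthy" if check_etcd_status(sample) else "investigate etcd")
```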
It is expected that there is one healthy instance of ETCD and kube-apiserver per Supervisor control plane VM in the cluster.
kubectl get pods -A | egrep "etcd|kube-api"
kube-system etcd-<supervisor-node-dns-name-1> 1/1 Running
kube-system etcd-<supervisor-node-dns-name-2> 1/1 Running
kube-system etcd-<supervisor-node-dns-name-3> 1/1 Running
kube-system kube-apiserver-<supervisor-node-dns-name-1> 1/1 Running
kube-system kube-apiserver-<supervisor-node-dns-name-2> 1/1 Running
kube-system kube-apiserver-<supervisor-node-dns-name-3> 1/1 Running
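The pod health expectation above can be checked in one pass over the `kubectl get pods -A` output. A minimal sketch, assuming the usual NAMESPACE/NAME/READY/STATUS column layout; the node names in the sample are illustrative:

```python
# Return the names of any etcd or kube-apiserver pods that are not Running,
# given plain `kubectl get pods -A` text output.

def unhealthy_system_pods(kubectl_output: str) -> list:
    bad = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        # fields: NAMESPACE, NAME, READY, STATUS, ...
        if len(fields) >= 4 and ("etcd-" in fields[1]
                                 or "kube-apiserver-" in fields[1]):
            if fields[3] != "Running":
                bad.append(fields[1])
    return bad

sample = """kube-system  etcd-node-1            1/1  Running
kube-system  etcd-node-2            1/1  Running
kube-system  kube-apiserver-node-1  1/1  Running
kube-system  kube-apiserver-node-2  0/1  CrashLoopBackOff
"""
print(unhealthy_system_pods(sample))
```

An empty list means the prerequisite is met and the cleanup scripts below can be run.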
python cleanup_stale_replicasets.py --run
Note: The argument "--run" is required to perform the cleanup. Without the "--run" argument, the script performs a dry-run.
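For context, stale ReplicaSets from old rollouts are typically the ones scaled to zero replicas. The sketch below illustrates that criterion against `kubectl get rs -A -o json` style output; it is an assumption about what the dry-run reports, not the script's actual implementation:

```python
import json

# List ReplicaSets whose desired replica count is zero - the usual leftovers
# from superseded Deployment revisions. Input mimics `kubectl ... -o json`.

def stale_replicasets(rs_json: str) -> list:
    items = json.loads(rs_json)["items"]
    return [rs["metadata"]["name"] for rs in items
            if rs["spec"].get("replicas", 0) == 0]

# Illustrative sample; names are made up.
sample = json.dumps({"items": [
    {"metadata": {"name": "app-7c9d"}, "spec": {"replicas": 3}},
    {"metadata": {"name": "app-5b2f"}, "spec": {"replicas": 0}},  # stale
]})
print(stale_replicasets(sample))
```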
python clean_stale_images.py --run
Note: The argument "--run" is required to perform the cleanup. Without the "--run" argument, the script performs a dry-run.
Log rotation and disk space improvements will be available in vSphere 9, and in vSphere 8.0u3F and later versions.