In the vSphere web UI under Workload Management, the Supervisor Cluster may be in an Error or Configuring state with errors similar to the following:
Initialized vSphere resources
Deployed Control Plane VMs
Configured Control Plane VMs
Configured Load Balancer fronting the kubernetes API Server
Configured Core Supervisor Services
Service: velero.vsphere.vmware.com. Status: Configuring
Service: tkg.vsphere.vmware.com. Reason: Reconciling. Message: Reconciling.
Initialized vSphere resources
Deployed Control Plane VMs
Configured Control Plane VMs
Configured Load Balancer fronting the kubernetes API Server
Configured Core Supervisor Services
Service: tkg.vsphere.vmware.com. Reason: ReconcileFailed. Message: kapp: Error: waiting on reconcile packageinstall/tkg-controller (packaging.carvel.dev/v1alpha1) namespace: <namespace>: Finished unsuccessfully (Reconcile failed: (message: kapp: Error: Timed out waiting after 15m0s for resources: [deployment/tkgs-plugin-server (apps/v1) namespace: <namespace>])).
Service: velero.vsphere.vmware.com. Reason: Reconciling. Message: Reconciling.
Customized guest of Supervisor Control plane VM
Configuration error (since DD/M/YYYY, H:MM:SS XM)
System error occurred on Master node with identifier <supervisor node dns name>. Details: Log forwarding sync update failed: Command '['/usr/bin/kubectl', '--kubeconfig', '/etc/kubernetes/admin.conf', 'get', 'configmap', 'fluentbit-config-system', '--namespace', 'vmware-system-logging', '--ignore-not-found=true', '-o', 'json']' returned non-zero exit status 1.
While SSH'd into a Supervisor Control Plane VM, root disk usage is above 80% or at 100%:
df -h /dev/root
Filesystem Size Used Avail Use% Mounted on
/dev/root ##G ##G ##G 100% /
Many system processes will fail and continue to crash while any Supervisor Control Plane VM's root disk usage is above 80% or the disk is full.
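As an illustrative aid (not a VMware tool), the 80% threshold above can be checked with a short Python sketch using only the standard library; the function name and output format are hypothetical:

```python
# Minimal sketch, assuming a Linux node: report root filesystem usage and
# flag it against the 80% threshold described above.
import shutil

def root_disk_usage_percent(path="/"):
    """Return the used percentage of the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

pct = root_disk_usage_percent("/")
print(f"root disk usage: {pct:.1f}% ({'WARNING' if pct > 80 else 'OK'})")
```

The same information is what `df -h /` reports; the sketch only makes the threshold comparison explicit.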
vSphere Supervisor 7
vSphere Supervisor 8
Old, unused images and replicasets left over from previous Supervisor cluster upgrades are not being automatically cleaned up.
WARNING: DO NOT RUN THESE SCRIPTS IF THERE IS AN ONGOING SUPERVISOR, TKC, OR SUPERVISOR SERVICE UPGRADE
The CPVM replica set sync issue has been fixed in vCenter Server 8.0 Update 3e; therefore, this KB is no longer applicable starting from the vCenter Server 8.0 Update 3e release and should not be followed.
---------------------------------
Prior to running the below scripts, critical system processes ETCD and kube-apiserver must be healthy.
ETCD and kube-apiserver will experience issues when disk space is above 80% on any of the Supervisor control plane nodes.
Please see the below KB article for steps on cleaning up disk space in the Supervisor cluster:
etcdctl member list -w table
+------------------+---------+------------------------------+--------------------------------+--------------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+------------------------------+--------------------------------+--------------------------------+------------+
| <etcd-member-a> | started | <supervisor node dns name 1> | https://<supervisor-ip-1>:2380 | https://<supervisor-ip-1>:2379 | false |
| <etcd-member-b> | started | <supervisor node dns name 2> | https://<supervisor-ip-2>:2380 | https://<supervisor-ip-2>:2379 | false |
| <etcd-member-c> | started | <supervisor node dns name 3> | https://<supervisor-ip-3>:2380 | https://<supervisor-ip-3>:2379 | false |
+------------------+---------+------------------------------+--------------------------------+--------------------------------+------------+
etcdctl --cluster=true endpoint health -w table
+--------------------------------+--------+-------------+-------+
| ENDPOINT | HEALTH | TOOK | ERROR |
+--------------------------------+--------+-------------+-------+
| https://<supervisor-ip-1>:2379 | true | ##.##ms | |
| https://<supervisor-ip-2>:2379 | true | ##.##ms | |
| https://<supervisor-ip-3>:2379 | true | ##.##ms | |
+--------------------------------+--------+-------------+-------+
etcdctl --cluster=true endpoint status -w table
+--------------------------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://<supervisor-ip-1>:2379 | <etcd-member-a> | #.#.# | ### MB | true | false | ## | ######## | ######## | |
| https://<supervisor-ip-2>:2379 | <etcd-member-b> | #.#.# | ### MB | false | false | ## | ######## | ######## | |
| https://<supervisor-ip-3>:2379 | <etcd-member-c> | #.#.# | ### MB | false | false | ## | ######## | ######## | |
+--------------------------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
All ETCD members should be on the same version and the same DB size. Only one member should be considered the ETCD leader.
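As a hedged illustration (not a shipped tool), the single-leader and matching-version expectations can be scripted once the status rows are parsed; the sample rows and function name below are invented stand-ins for real `etcdctl endpoint status` output:

```python
# Hedged illustration: check that exactly one ETCD member reports IS LEADER
# and that all members run the same version. The rows are invented samples.
SAMPLE_ROWS = [
    # (endpoint, version, is_leader)
    ("https://10.0.0.1:2379", "3.5.6", True),
    ("https://10.0.0.2:2379", "3.5.6", False),
    ("https://10.0.0.3:2379", "3.5.6", False),
]

def check_etcd_status(rows):
    """Return (single_leader, versions_match) for the parsed status rows."""
    leaders = sum(1 for _, _, is_leader in rows if is_leader)
    versions = {version for _, version, _ in rows}
    return leaders == 1, len(versions) == 1

single_leader, versions_match = check_etcd_status(SAMPLE_ROWS)
print("single leader:", single_leader, "| versions match:", versions_match)
```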
kubectl get pods -A | egrep "etcd|kube-api"
kube-system etcd-<supervisor-node-dns-name-1> 1/1 Running
kube-system etcd-<supervisor-node-dns-name-2> 1/1 Running
kube-system etcd-<supervisor-node-dns-name-3> 1/1 Running
kube-system kube-apiserver-<supervisor-node-dns-name-1> 1/1 Running
kube-system kube-apiserver-<supervisor-node-dns-name-2> 1/1 Running
kube-system kube-apiserver-<supervisor-node-dns-name-3> 1/1 Running
It is expected for there to be one healthy instance of ETCD and kube-apiserver per Supervisor control plane VM in the cluster.
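The one-pod-per-node expectation can be checked programmatically; this is an illustrative sketch (not a VMware tool), with invented sample text standing in for live `kubectl get pods -A` output:

```python
# Illustrative sketch: count Running etcd and kube-apiserver pods in
# `kubectl get pods -A` output; each count should equal the number of
# Supervisor control plane VMs. SAMPLE is invented example output.
import re

SAMPLE = """\
kube-system etcd-node-1 1/1 Running
kube-system etcd-node-2 1/1 Running
kube-system etcd-node-3 1/1 Running
kube-system kube-apiserver-node-1 1/1 Running
kube-system kube-apiserver-node-2 1/1 Running
kube-system kube-apiserver-node-3 1/1 Running
"""

def count_running(output, prefix):
    """Count Running pods whose name begins with `prefix`-."""
    pattern = rf"\S+\s+{re.escape(prefix)}-\S+\s+1/1\s+Running"
    return sum(1 for line in output.splitlines() if re.match(pattern, line))

# With three control plane VMs, both counts should equal three.
print("etcd:", count_running(SAMPLE, "etcd"),
      "kube-apiserver:", count_running(SAMPLE, "kube-apiserver"))
```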
python cleanup_stale_replicasets.py --run
Note: The argument "--run" is required to perform the clean up. Without the "--run" argument, the script performs a dry-run.
python clean_stale_images.py --run
Note: The argument "--run" is required to perform the clean up. Without the "--run" argument, the script performs a dry-run.
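The dry-run-by-default behavior described in the notes above follows a common pattern. The sketch below is not the VMware scripts themselves, only a minimal illustration of that pattern with invented item names:

```python
# Minimal sketch of the dry-run pattern: without --run, only report what
# would be removed; with --run, actually act. Item names are invented.
import argparse

def cleanup(candidates, run=False):
    """Remove (or, in dry-run mode, just report) stale items; return items removed."""
    for item in candidates:
        verb = "Deleting" if run else "[dry-run] Would delete"
        print(f"{verb} {item}")
    return list(candidates) if run else []

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Cleanup sketch; dry-run by default.")
    parser.add_argument("--run", action="store_true",
                        help="actually perform the cleanup")
    args = parser.parse_args()
    cleanup(["stale-image-a", "stale-replicaset-b"], run=args.run)
```

Defaulting to a dry run lets the operator review exactly what would be deleted before committing, which matters on a degraded control plane.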
Log rotation and disk space improvements are available in vSphere 9, and in vSphere 8.0 U3g and later versions.