After upgrading to vCenter 8.0 U3 and Supervisor version v1.28.3, the cluster becomes unstable
Supervisor services are not available
kubectl login to the Supervisor cluster does not work
One or more Supervisor control planes are running out of disk space
root@************** [ ~ ]# df -h|grep /dev/root
/dev/root        32G   30G   45M 100% /
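To see which directories are actually consuming the space before cleaning up, standard Linux tooling can be used (a generic check, not part of this KB's procedure), for example:
du -xh --max-depth=2 / 2>/dev/null | sort -rh | head -20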
If all three Supervisor control plane VMs are up and running again, check the nodes and etcd health:
Example:
root@************** [ ~ ]# k get nodes
NAME             STATUS   ROLES                  AGE     VERSION
**************   Ready    control-plane,master   21m     v1.28.3+vmware.wcp.1
**************   Ready    control-plane,master   64s     v1.28.3+vmware.wcp.1
**************   Ready    control-plane,master   9m57s   v1.28.3+vmware.wcp.1
**************   Ready    agent                  10d     v1.28.2-sph-e515410
**************   Ready    agent                  10d     v1.28.2-sph-e515410
**************   Ready    agent                  10d     v1.28.2-sph-e515410
**************   Ready    agent                  10d     v1.28.2-sph-e515410
root@************** [ ~ ]# etcdctl --cluster=true endpoint health -w table
+------------------------------+--------+------------+-------+
|           ENDPOINT           | HEALTH |    TOOK    | ERROR |
+------------------------------+--------+------------+-------+
| https://**************:2379  | true   | 4.671596ms |       |
| https://**************:2379  | true   | 7.120376ms |       |
| https://**************:2379  | true   | 7.356998ms |       |
+------------------------------+--------+------------+-------+
root@************** [ ~ ]# etcdctl member list -w table
+------------------+---------+----------------+------------------------------+------------------------------+------------+
|        ID        | STATUS  |      NAME      |          PEER ADDRS          |         CLIENT ADDRS         | IS LEARNER |
+------------------+---------+----------------+------------------------------+------------------------------+------------+
| **************   | started | ************** | https://**************:2380  | https://**************:2379  | false      |
| **************   | started | ************** | https://**************:2380  | https://**************:2379  | false      |
| **************   | started | ************** | https://**************:2380  | https://**************:2379  | false      |
+------------------+---------+----------------+------------------------------+------------------------------+------------+
root@************** [ ~ ]# etcdctl --cluster=true endpoint status -w table
+------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|           ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://**************:2379  | **************   | 3.5.11  | 176 MB  | false     | false      |        83 |  584410086 |          584410086 |        |
| https://**************:2379  | **************   | 3.5.11  | 176 MB  | true      | false      |        83 |  584410086 |          584410086 |        |
| https://**************:2379  | **************   | 3.5.11  | 176 MB  | false     | false      |        83 |  584410086 |          584410086 |        |
+------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
The GUI error message may be similar to one of the following:
Initialized vSphere resources
Deployed Control Plane VMs
Configured Control Plane VMs
Configured Load Balancer fronting the kubernetes API Server
Configured Core Supervisor Services
Service: velero.vsphere.vmware.com. Status: Configuring
Service: tkg.vsphere.vmware.com. Reason: Reconciling. Message: Reconciling.
or
Initialized vSphere resources
Deployed Control Plane VMs
Configured Control Plane VMs
Configured Load Balancer fronting the kubernetes API Server
Configured Core Supervisor Services
Service: tkg.vsphere.vmware.com. Reason: ReconcileFailed. Message: kapp: Error: waiting on reconcile packageinstall/tkg-controller (packaging.carvel.dev/v1alpha1) namespace: ###-###-######-#####:
Finished unsuccessfully (Reconcile failed: (message: kapp: Error: Timed out waiting after 15m0s for resources: [deployment/tkgs-plugin-server (apps/v1) namespace: ###-###-######-#####])).
Service: velero.vsphere.vmware.com. Reason: Reconciling. Message: Reconciling.
Customized guest of Supervisor Control plane VM
Configuration error (since 11/7/2024, 4:57:26 AM)
System error occurred on Master node with identifier **************. Details: Log forwarding sync update failed: Command '['/usr/bin/kubectl', '--kubeconfig', '/etc/kubernetes/admin.conf', 'get', 'configmap', 'fluentbit-config-system', '--namespace', 'vmware-system-logging', '--ignore-not-found=true', '-o', 'json']' returned non-zero exit status 1..
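The kubectl command quoted in this error can be re-run manually on the affected control plane VM to confirm whether it still fails (the command below is copied from the error message, not an additional diagnostic step):
kubectl --kubeconfig /etc/kubernetes/admin.conf get configmap fluentbit-config-system --namespace vmware-system-logging --ignore-not-found=true -o json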
Supervisor Control plane VM disk space is filled due to an issue with old images not being properly cleaned up.
To SSH into the Supervisor control plane VMs, follow this KB: https://knowledge.broadcom.com/external/article?legacyId=90194
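The linked KB retrieves the Supervisor control plane IP and root password from the vCenter Server Appliance; a minimal sketch, assuming the standard appliance layout, is:
# Run on the vCenter Server Appliance shell to print the control plane IP and root password
/usr/lib/vmware-wcp/decryptK8Pwd.py
# Then SSH to the reported IP with the reported password
ssh root@<control-plane-IP>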
Run the following commands to clean up failed WCP backups, journalctl logs, and audit logs to free up some space:
journalctl --vacuum-time=2d
cd /var/log/vmware/audit
rm *log.gz
cd /var/lib/vmware/wcp/backup
rm *
Check disk space again
df -h|grep /dev/root
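If preferred, the same steps can be run as one sequence on each affected control plane VM (this simply combines the commands above using absolute paths; the rm commands are destructive, so verify the paths before running):
journalctl --vacuum-time=2d
rm -f /var/log/vmware/audit/*log.gz
rm -f /var/lib/vmware/wcp/backup/*
df -h|grep /dev/root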
If etcd is healthy, run the cleanup scripts "cleanup_stale_replicasets.py" and "cleanup_stale_images.py" attached to this KB to perform the cleanup on the Master Supervisor Control Plane node.
If etcd is not healthy, please open a case with Broadcom Support.
1) On each Supervisor Control Plane node, run this script to clean up the stale replica sets from older versions:
python cleanup_stale_replicasets.py --run
2) On each Supervisor Control Plane node, run this script to delete the images that are not part of an active replica set:
python cleanup_stale_images.py --run
Note: Without the --run option, the scripts run in dry-run mode and do not delete any replica sets or images.
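For example, a cautious workflow is to run each script without --run first, review the dry-run output, and then repeat with --run:
python cleanup_stale_replicasets.py         # dry run, makes no deletions
python cleanup_stale_replicasets.py --run   # performs the deletion
python cleanup_stale_images.py              # dry run, makes no deletions
python cleanup_stale_images.py --run        # performs the deletion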