vCenter /storage/seat partition fills up due to no space on Supervisor Control Plane VM

Article ID: 420418


Products

VMware vCenter Server
VMware vSphere Kubernetes Service
VMware Avi Load Balancer
VMware Tanzu Application Platform

Issue/Introduction

  • The vCenter /storage/seat partition fills up at an abnormally fast rate

  • Increasing the size of the vCenter SEAT partition does not help, as it fills up again

  • The vpxd service goes down

  • Avi Load Balancer logins are invoked excessively within a short span of time

  • The Supervisor Control Plane VM root partition is full

  • On the Avi Controller, searching the cloud connector logs with zgrep "Processing Key: VirtualMachine" /var/lib/avi/log/cc* shows a large number of VM updates per minute

  • On the Avi Controller, counting per-second VM updates in the cloud agent logs shows bursts of more than 200 updates per second, for example:

    ls /var/lib/avi/log/cc_agent_go_Default-Cloud* | xargs -I {} sh -c 'echo {}; zgrep "Processing Key: VirtualMachine" "{}" | awk "{print substr(\$0, index(\$0, \"YYYY-\"), 19)}" | sort | uniq -c | awk "\$1 > 200"'

    (replace YYYY- with the year prefix of the log timestamps, for example 2025-, so that index() locates the start of each timestamp)

  • In VC, /var/log/vmware/vpxd/vpxd-profiler-*.log shows a high overflow count for ERProviderMixin:
    --> /MoRegistryStats/Class='11DatastoreMo'/ERProviderMixin/Overflows/total #####

  • In VC, /var/log/vmware/vpxd/vpxd-profiler-*.log shows a high sample count for setEntityPermissions calls:
    --> /ActivationStats/Task/Actv='vim.AuthorizationManager.setEntityPermissions'/TotalTime/numSamples ######

  • In VC, /var/log/vmware/vpxd/vpxd.log shows:
    YYYY-MM-DDTHH:MM:SS.608Z info vpxd[2487065] [Originator@6876 sub=vpxLro opID=wcp-699e98e7-2cdc05d1-801a-47de-8b4f-af387dd42b83-65] [VpxLRO] -- BEGIN lro-489843407 -- AuthorizationManager -- vim.AuthorizationManager.setEntityPermissions -- 52105ecb-cbfd-0bae-0676-3991b0c6630f(52259e96-7206-4596-4510-4002fff8d71b)

  • In VC, wcpsvc.log located at /var/log/vmware/wcp/ shows:
    wcpsvc-YYYY-MM-DDTHH:MM:SS.570.log:YYYY-MM-DDTHH:MM:SS debug wcp [vclib/authz.go:54] [opID=699e9708] Successfully set permissions [{{} <nil> wcp-storage-user-############@vsphere.local false 1090 true}] on entity Datastore:datastore-######
    YYYY-MM-DDTHH:MM:SS.967Z debug wcp [nodechecker/node_check.go:70] [opID=699eb397-2cdc05d1-801a-47de-8b4f-af387dd42b83] wcp-sv-state-checker log output when running checks '[ConnectToLoadBalancer]' on VM VirtualMachine:vm-####. stdout: {"ConnectToLoadBalancer":{"id":"ConnectToLoadBalancer","status":"SetupFailure","time_completed":"YYYY-MM-DDTHH:MM:SS.563961415Z","conditions":[{"type":"SetupFailure","message":{"severity":"ERROR","details":{"id":"vcenter.wcp.node_state_check.setup_failure","default_message":"An internal error on the control plane VM (420137a0706d0fda0b17bc4e903996ac) prevented the check ConnectToLoadBalancer from completing successfully. Error: Unable to fetch valid load balancer configs. Err: Unauthorized.","args":["420137a0706d0fda0b17bc4e903996ac","ConnectToLoadBalancer","Unable to fetch valid load balancer configs. Err: Unauthorized"]}}}],"description":{"id":"wcp.healthcheck.connect_to_loadbalancer.description","default_message":"Checks to see if the Control Plane VM is able to connect to any configured load balancers. This check can only be run in a VDS environment, after the Kubernetes API Server is up.","args":null}}}, stderr: time="YYYY-MM-DDTHH:MM:SS" level=info msg="Running checks: [ConnectToLoadBalancer]"
    time="YYYY-MM-DDTHH:MM:SS" level=debug msg="Parsed node configuration: &{HostName:######### VCenterPNID:<fqdn>:443 ManagementNetwork:{DNSFromDHCP:false DNSServers:[###.##.##.#3] DNSSearchDomains:[<name>]} WorkloadNetwork:{IPAddress:##.##.##.## DNSServers:[###.##.##.##]} KubernetesConfig:{CertificateAuthority:0xc######### InitialAPIServer:https://##.##.##.##:6443} NSXManagerConfig:[] NetworkProvider:1 LoadBalancerProvider:HA_PROXY}"
    time="YYYY-MM-DDTHH:MM:SS" level=info msg="Check [ConnectToLoadBalancer] running"
    time="YYYY-MM-DDTHH:MM:SS" level=info msg="Attempting to connect to the Kubernetes Server, using configuration file path: '/etc/kubernetes/admin.conf'" check=ConnectToLoadBalancer
    time="YYYY-MM-DDTHH:MM:SS" level=info msg="Fetching all loadBalancerConfigs..." check=ConnectToLoadBalancer
    time="YYYY-MM-DDTHH:MM:SS" level=error msg="Failed to fetch haProxy list. Err: Unauthorized" check=ConnectToLoadBalancer
    time="YYYY-MM-DDTHH:MM:SS" level=error msg="Failed to fetch load balancer config info. Err: Unable to fetch valid load balancer configs. Err: Unauthorized" check=ConnectToLoadBalancer
    time="YYYY-MM-DDTHH:MM:SS" level=info msg="Check [ConnectToLoadBalancer] completed. Result: SetupFailure"
    YYYY-MM-DDTHH:MM:SS debug wcp [nodechecker/node_check.go:95] [opID=699eb397-2cdc05d1-801a-47de-8b4f-af387dd42b83] Check 'ConnectToLoadBalancer' was unsuccessful on node VirtualMachine:vm-####. Status: SetupFailure
    YYYY-MM-DDTHH:MM:SS error wcp [kubelifecycle/load_balancer.go:68] [opID=699eb397-2cdc05d1-801a-47de-8b4f-af387dd42b83] Unable to verify load balancer connection from nodes. node checks on control plane VM VirtualMachine:vm-#### failed for indeterminate reasons
    YYYY-MM-DDTHH:MM:SS info wcp [kubelifecycle/load_balancer.go:69] [opID=699eb397-2cdc05d1-801a-47de-8b4f-af387dd42b83] Reconcile load balancer exited
    YYYY-MM-DDTHH:MM:SS debug wcp [kubelifecycle/controller.go:506] [opID=699eb397-2cdc05d1-801a-47de-8b4f-af387dd42b83] Supervisor configuration retry.

  • In VC, wcpsvc.log located at /var/log/vmware/wcp/ shows "No space left on device":
    YYYY-MM-DDTHH:MM:SS error wcp [vclib/guestop.go:338] [opID=699eb397-2cdc05d1-801a-47de-8b4f-af387dd42b83] Kubenode guest command failed. RC: 1, Out: , Err: INFO:__main__:Loaded 1 key from /dev/shm/secret
    Traceback (most recent call last):
    ........
    encrypted.write(encrypt(plain_bytes, key))
    OSError: [Errno 28] No space left on device
    YYYY-MM-DDTHH:MM:SS error wcp [kubelifecycle/master_node.go:704] [opID=699eb397-2cdc05d1-801a-47de-8b4f-af387dd42b83] Failed to encrypt desired node configuration. Err Guest operation failed for the Master node VM with identifier vm-####., stdout: , stderr: INFO:__main__:Loaded 1 key from /dev/shm/secret
    Traceback (most recent call last):
    File "/usr/lib/vmware-wcp/hypercrypt.py", line 292, in <module>
    .........
    encrypted.write(encrypt(plain_bytes, key))
    OSError: [Errno 28] No space left on device
    YYYY-MM-DDTHH:MM:SS error wcp [kubelifecycle/controller.go:2062] [opID=699eb397-2cdc05d1-801a-47de-8b4f-af387dd42b83] Failed to update desired config of MasterNode VirtualMachine:vm-####. Err: Guest operation failed for the Master node VM with identifier vm-####.
    YYYY-MM-DDTHH:MM:SS error wcp [kubelifecycle/controller.go:2231] [opID=699eb397-2cdc05d1-801a-47de-8b4f-af387dd42b83] Error configuring API server on cluster 2cdc05d1-801a-47de-8b4f-af387dd42b83 Guest operation failed for the Master node VM with identifier vm-####.
    YYYY-MM-DDTHH:MM:SS warning wcp [kubelifecycle/controller.go:1014] [opID=699eb397-2cdc05d1-801a-47de-8b4f-af387dd42b83] Unable to configure agent in cluster domain-c8. Err Guest operation failed for the Master node VM with identifier vm-####.

  • vm-#### is the VM ID of a Supervisor Control Plane VM; checking that VM confirms its root partition is full (see also the verification sketch after this list):

    <user>@NodeID [ ~ ]# df -h /dev/root
    Filesystem      Size  Used  Avail  Use%  Mounted on
    /dev/root        32G   32G      0  100%  /
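
A minimal verification sketch for the vCenter-side symptoms, assuming SSH access to the vCenter appliance (the paths and search strings below are taken from the log excerpts above):

    # Confirm how full the SEAT partition is on the vCenter appliance
    df -h /storage/seat

    # Count setEntityPermissions operations in the current vpxd log;
    # a count that grows quickly between runs matches the pattern above
    grep -c "vim.AuthorizationManager.setEntityPermissions" /var/log/vmware/vpxd/vpxd.log

    # Count "No space left on device" errors reported by the WCP service
    grep -c "No space left on device" /var/log/vmware/wcp/wcpsvc.log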

Environment

  • vCenter 8.x
  • vSphere Kubernetes Service
  • NSX Application Platform (NAPP)
  • VMware Avi Load Balancer

Cause

The Supervisor Control Plane VM has no space left on its root partition. The resulting failed health checks and configuration retries drive excessive Avi Load Balancer logins and repeated setEntityPermissions tasks, and the events and tasks these generate rapidly fill the vCenter /storage/seat partition.

Resolution

To resolve this issue, follow the steps in the KB article: vSphere Supervisor Disk Space Clean Up Scripts

To work around the vCenter SEAT partition filling up, configure event retention in vCenter to 7 days using the steps below.

1. Open the vSphere Client and log in to vCenter Server.

2. Navigate to Administration → vCenter Server Settings and go to the Database Retention Policy section. Two options are available: Task retention and Event retention.

3. Set Event retention to the desired number of days, in this case 7 days.

4. Click OK or Save to apply the changes.
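
The same retention value can also be set from the command line. Below is a minimal sketch using the govc CLI, assuming the event.maxAge and event.maxAgeEnabled advanced settings apply to your vCenter build and that GOVC_URL, GOVC_USERNAME, and GOVC_PASSWORD are already exported:

    # Inspect the current retention settings (setting names assumed; confirm on your build)
    govc option.ls event.maxAge

    # Keep events for 7 days (the workaround above); task.maxAge can be set the same way
    govc option.set event.maxAgeEnabled true
    govc option.set event.maxAge 7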

Additional Information

NOTE: Run du -sh /*, then re-run the command on the directory consuming the most disk space, to drill down to the actual consumer, as shown in the sketch below.
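
For example, a typical drill-down looks like this (the directory names will vary; /var is only an illustration):

    df -h /                              # confirm the root filesystem is full
    du -sh /* 2>/dev/null | sort -h      # find the largest top-level directory
    du -sh /var/* 2>/dev/null | sort -h  # repeat on the largest directory, here /var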

Increasing the disk space for the vCenter Server Appliance in vSphere 6.5, 6.7, 7.0 and 8.0