- A high CPU usage alarm keeps moving from one node to another, even though all nodes are Ready and healthy.
- kubectl top shows kapp-controller consuming the most CPU.
- Application pods start getting evicted from the worker node when CPU usage climbs.
- A Tanzu Application Platform (TAP) application is used in this guest cluster.
- Ephemeral-storage is already set to 120 GB.
Refer to: Prerequisites and planning for installing Tanzu Application Platform
In the kapp-controller logs, PackageInstall/App CRs are reconciling roughly every 30 seconds, which is too aggressive:
{"level":"info","ts":1721136536.3375206,"logger":"kc.controller.app","msg":"Started deploy","request":"jasmin/netvalide-ui-tdr-rec"}
{"level":"info","ts":1721136537.6527908,"logger":"kc.controller.app","msg":"Completed deploy","request":"tiger/opale-api-tdr-rec"}
{"level":"info","ts":1721136537.652814,"logger":"kc.controller.app","msg":"Updating status","request":"tiger/opale-api-tdr-rec","desc":"flushing: app reconciled"}
{"level":"info","ts":1721136537.676626,"logger":"kc.controller.er","msg":"Requeue after given time","request":"tiger/opale-api-tdr-rec","after":32.740013697}
{"level":"info","ts":1721136537.6881676,"logger":"kc.controller.app","msg":"Started deploy","request":"raid/bi-osd-tdr-rec"}
vSphere with Tanzu
This is a syncPeriod issue: syncPeriod was changed from the default of 10 minutes down to 30 seconds, which led to the high CPU usage by kapp-controller on the nodes.
Carvel recommends a syncPeriod of roughly 10 minutes.
Change syncPeriod back to the 10-minute default, and then leverage the ability to configure a sync period on a per-App or per-PackageInstall basis, as shown in the sketch below.
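For example, a PackageInstall can carry its own sync period. A minimal sketch, assuming the standard Carvel PackageInstall spec; all names and the package reference below are placeholders:

apiVersion: packaging.carvel.dev/v1alpha1
kind: PackageInstall
metadata:
  name: example-pkgi             # placeholder name
  namespace: example-ns          # placeholder namespace
spec:
  serviceAccountName: example-sa # placeholder service account
  packageRef:
    refName: example.pkg.corp.com
    versionSelection:
      constraints: "1.0.0"
  # Steady-state reconcile roughly every 10 minutes instead of every ~30s
  syncPeriod: 10m

App CRs (kappctrl.k14s.io/v1alpha1) accept the same spec.syncPeriod field, so the interval can be tuned per application where a faster reconcile is genuinely needed.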
Check the CPU used by the pods and nodes respectively using the commands below:
kubectl top po -A --sort-by=cpu
kubectl top node <node name>
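To see which App or PackageInstall resources carry an aggressive sync period, a custom-columns query along these lines can help (fully qualified resource names are used to avoid ambiguity; resources without an explicit spec.syncPeriod show <none>):

kubectl get apps.kappctrl.k14s.io -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,SYNC_PERIOD:.spec.syncPeriod'
kubectl get packageinstalls.packaging.carvel.dev -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,SYNC_PERIOD:.spec.syncPeriod'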
Resync is largely for handling steady-state reconciliation of resources. When an App or PackageInstall changes, that event triggers a requeue of the resource regardless of when it was last synced. When the resource is in a steady state, all the resync period does is ensure the consistent configuration of its child resources: it makes sure the Deployment still carries the state it is supposed to, that any ConfigMaps or Secrets are consistent, and so on. In other words, resync only prevents drift, and that is not usually a time-sensitive problem unless someone is frequently applying configuration drift out from under kapp.
The only time the resync period would really be valuable for actuating change is if kapp-controller missed the CREATE/UPDATE/PATCH/DELETE event on a resource, for example because kapp-controller had crashed or was slow enough that the event was dropped before the controller got around to its next read.
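Because of that, raising an aggressive syncPeriod back toward the recommended ~10 minutes is generally safe at steady state. A sketch of patching a single resource in place (resource name and namespace are placeholders):

kubectl patch packageinstalls.packaging.carvel.dev example-pkgi -n example-ns --type merge -p '{"spec":{"syncPeriod":"10m"}}'
kubectl patch apps.kappctrl.k14s.io example-app -n example-ns --type merge -p '{"spec":{"syncPeriod":"10m"}}'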