Frequent CPU Exhaustion alarm on vCenter server

Products

VMware vCenter Server

Issue/Introduction

CPU and Memory utilization may show 100 % on the VCSA appliance.
A reboot of the vCenter may clear the issue temporarily but the CPU Exhaustion alarm triggers back on the vCenter.
vCenter postgres logs show issues with messages containing any of the following:

- vpx_hist_stat1_* tables
- delayed exclusive lock
- AccessShareLock :

Examples from /var/log/vmware/vpostgres/postgresql.log :

xxxx-xx-xx xx:xx:xx.xxx xxx 68exxxxxxxxxxx 0 VCDB vpxd [local] 762641 4LOG:  process 762641 still waiting for AccessShareLock on relation 18xxx of database 164xx after 1000.xxx ms
xxxx-xx-xx xx:xx:xx.xxx xxx 68exxxxxxxxxxx 1058xxxxx VCDB vpxd [local] 840xxx 7LOG:  process 840xxx acquired AccessExclusiveLock on relation 18xxx of database 164xx after 4398xxx.xxx ms
xxxx-xx-xx xx:xx:xx.xxx xxx 68exxxxxxxxxxx 1058xxxxxx VCDB vpxd [local] 840xxx 8STATEMENT:   truncate table vpx_hist_stat1_192
xxxx-xx-xx xx:xx:xx.xxx xxx 68exxxxxxxxxxx 0 VCDB vpxd [local] 964xxx STATEMENT:  BEGIN;declare "SQL_CUR0x7fe500xxxxxx" cursor with hold for  SELECT sc.stat_id, d.device_name FROM vpx_stat_counter sc, vpx_sample_time1 sm, vpx_device d, vpx_hist_stat1 st WHERE sc.entity_id = 1 and coalesce(sc.device_id,1) = coalesce(d.device_id,1)   and sc.counter_id = st.counter_id   and st.time_id = sm.tim
e_id   AND sm.sample_time > '1970-xx-xx 00:00:00'::timestamp   AND sm.sample_time <= 'xxxx-xx-xx xx:xx:xx.785'::timestamp GROUP BY sc.stat_id, d.device_name ORDER BY sc.stat_id, d.device_name ;fetch 1024 in "SQL_CUR0x7fe500xxxxx"

xxxx-xx-xx xx:xx:xx.xxx xxx 68xxxxxxxxxxxx 0 VCDB vpxd [local] 762xxx 4LOG:  process 76xxxx still waiting for AccessShareLock on relation 18xxx of database 16xxx after 10xx.xxx ms
xxxx-xx-xx xx:xx:xx.xxx xxx 68xxxxxxxxxxxx 1058xxxxx VCDB vpxd [local] 840xxx 7LOG:  process 84xxxx acquired AccessExclusiveLock on relation 18xxx of database 16xxx after 439xxxx.xxx ms
xxxx-xx-xx xx:xx:xx.xxx xxx 68xxxxxxxxxxxx 1058xxxxx VCDB vpxd [local] 840xxx 8STATEMENT:   truncate table vpx_hist_stat1_19x
xxxx-xx-xx xx:xx:xx.xxx xxx 68xxxxxxxxxxxx 0 VCDB vpxd [local] 964138 9STATEMENT:  BEGIN;declare "SQL_CUR0x7fe500xxxxxx" cursor with hold for  SELECT sc.stat_id, d.device_name FROM vpx_stat_counter sc, vpx_sample_time1 sm, vpx_device d, vpx_hist_stat1 st WHERE sc.entity_id = 1   and coalesce(sc.device_id,1) = coalesce(d.device_id,1)   and sc.counter_id = st.counter_id   and st.time_id = sm.tim
e_id   AND sm.sample_time > '19xx-xx-xx 00:00:00'::timestamp   AND sm.sample_time <= 'xxxx-xx-xx xx:xx:xx.xxx'::timestamp GROUP BY sc.stat_id, d.device_name ORDER BY sc.stat_id, d.device_name ;fetch 10xx in "SQL_CUR0x7fe500xxxxxx"

***Find the source's username and the IP information in vpxd/vpxd-profiler.log. Example:

--> /SessionStats/SessionPool/Session/Id='52b56f78-xxxx-xxxx-xxxx-xxxxxxxxxxxx'/Username='xxxxxxx\xxxxxxxxxxxxxx'/ClientIP='xx.xx.xx.xx'/PropertyCollector/TriggeredFiltersCount/total 0

Cause

A monitoring service, such as the OpsRamp tool, is calling queryAvailableMetric without specifying a BEGIN_TIME for the query, resulting in the query looking at every partition in the vCenter database.

Resolution

It is recommended to use queryAvailableMetric with a proper start time to reduce the load on the vCenter caused by the queries. Contact the owner of the monitoring agents (such as an OpsRamp tool vendor) to make necessary changes.

Workaround to immediately free up resources:
***Note that this will result in the loss of all historical statistical data, so only proceed if that is acceptable.

Ensure that a valid, recent backup of the vCenter Server (reference documentation) exists and take an offline snapshot of the VCSA. If vCenter is using Enhanced Linked Mode, all linked VCSAs must be powered off at the same time and offline snapshots must be taken together before powering them back on.
Stop VC services except vpostgres (or stop all services and restart vpostgres only).
TRUNCATE all the Stats tables (vpx_hist_stat1, vpx_hist_stat2, vpx_hist_stat3, and vpx_hist_stat4).
Run vacuum verbose analyze. Command:
/opt/vmware/vpostgres/current/bin/vacuumdb -U postgres -d VCDB --full --analyze
Restart vCenter or restart all services.

Additional Information

Refer to (KB 425546) vCenter Server shows CPU Exhaustion with high CPU usage reported for database FETCH queries for a similar issue that occurs when monitoring tools query for objects that are no longer present in the vCenter inventory.