Frequent CPU Exhaustion alarm on vCenter server
search cancel

Frequent CPU Exhaustion alarm on vCenter server

book

Article ID: 418224

calendar_today

Updated On:

Products

VMware vCenter Server

Issue/Introduction

CPU and Memory utilization shows 100 % on the VCSA appliance. 

A reboot of the vCenter clears the issue temporarily but the CPU Exhaustion alarm triggers back on the vCenter. 

From Postgres logs, we  see multiple issues related to : 

  • vpx_hist_stat1_* tables 
  • delayed exclusive lock 
  • AccessShareLock : 

postgresql.log : 
xxxx-xx-xx xx:xx:xx.xxx xxx 68exxxxxxxxxxx 0 VCDB vpxd [local] 762641 4LOG:  process 762641 still waiting for AccessShareLock on relation 18xxx of database 164xx after 1000.xxx ms
xxxx-xx-xx xx:xx:xx.xxx xxx 68exxxxxxxxxxx 1058xxxxx VCDB vpxd [local] 840xxx 7LOG:  process 840xxx acquired AccessExclusiveLock on relation 18xxx of database 164xx after 4398xxx.xxx ms
xxxx-xx-xx xx:xx:xx.xxx xxx 68exxxxxxxxxxx 1058xxxxxx VCDB vpxd [local] 840xxx 8STATEMENT:   truncate table vpx_hist_stat1_192
xxxx-xx-xx xx:xx:xx.xxx xxx 68exxxxxxxxxxx 0 VCDB vpxd [local] 964xxx STATEMENT:  BEGIN;declare "SQL_CUR0x7fe500xxxxxx" cursor with hold for  SELECT sc.stat_id, d.device_name FROM vpx_stat_counter sc, vpx_sample_time1 sm, vpx_device d, vpx_hist_stat1 st WHERE sc.entity_id = 1 and coalesce(sc.device_id,1) = coalesce(d.device_id,1)   and sc.counter_id = st.counter_id   and st.time_id = sm.tim
e_id   AND sm.sample_time > '1970-xx-xx 00:00:00'::timestamp   AND sm.sample_time <= 'xxxx-xx-xx xx:xx:xx.785'::timestamp GROUP BY sc.stat_id, d.device_name ORDER BY sc.stat_id, d.device_name ;fetch 1024 in "SQL_CUR0x7fe500xxxxx"

xxxx-xx-xx xx:xx:xx.xxx xxx 68xxxxxxxxxxxx 0 VCDB vpxd [local] 762xxx 4LOG:  process 76xxxx still waiting for AccessShareLock on relation 18xxx of database 16xxx after 10xx.xxx ms
xxxx-xx-xx xx:xx:xx.xxx xxx 68xxxxxxxxxxxx 1058xxxxx VCDB vpxd [local] 840xxx 7LOG:  process 84xxxx acquired AccessExclusiveLock on relation 18xxx of database 16xxx after 439xxxx.xxx ms
xxxx-xx-xx xx:xx:xx.xxx xxx 68xxxxxxxxxxxx 1058xxxxx VCDB vpxd [local] 840xxx 8STATEMENT:   truncate table vpx_hist_stat1_19x
xxxx-xx-xx xx:xx:xx.xxx xxx 68xxxxxxxxxxxx 0 VCDB vpxd [local] 964138 9STATEMENT:  BEGIN;declare "SQL_CUR0x7fe500xxxxxx" cursor with hold for  SELECT sc.stat_id, d.device_name FROM vpx_stat_counter sc, vpx_sample_time1 sm, vpx_device d, vpx_hist_stat1 st WHERE sc.entity_id = 1   and coalesce(sc.device_id,1) = coalesce(d.device_id,1)   and sc.counter_id = st.counter_id   and st.time_id = sm.tim
e_id   AND sm.sample_time > '19xx-xx-xx 00:00:00'::timestamp   AND sm.sample_time <= 'xxxx-xx-xx xx:xx:xx.xxx'::timestamp GROUP BY sc.stat_id, d.device_name ORDER BY sc.stat_id, d.device_name ;fetch 10xx in "SQL_CUR0x7fe500xxxxxx"

Can view the third party username and the IP in vpxd/vpxd-profiler.log :

--> /SessionStats/SessionPool/Session/Id='52b56f78-xxxx-xxxx-xxxx-xxxxxxxxxxxx'/Username='xxxxxxx\xxxxxxxxxxxxxx'/ClientIP='xx.xx.xx.xx'/PropertyCollector/TriggeredFiltersCount/total 0

Environment

VMware vCenter server 8.x

Cause

The issue occurs when the OpsRamp tool is calling queryAvailableMetric without specifying a BEGIN_TIME for the query, then the query will look at every partition in the DB. 

Resolution

It is best to use queryAvailableMetric with a proper start time to reduce the number of partitions queries that would be modified by the vendor.

Workaround : 

1. Stop VC services except (vpostgres)
2. TRUNCATE all the Stats tables (vpx_hist_stat1, vpx_hist_stat2, vpx_hist_stat3, and vpx_hist_stat4).
3. run vacuum verbose analyze. 
Command : /opt/vmware/vpostgres/current/bin/vacuumdb -U postgres -d VCDB --full –-analyze
4. Restart vCenter or restart all services.

Additional Information

Contact the OpsRamp tool vendor to get the issue resolved.