vmware-vpostgres server 100% CPU Usage in vCenter server

Article ID: 373242


Products

VMware vCenter Server 8.0, VMware vCenter Server 7.0

Issue/Introduction

You may observe vCenter Server performance degradation caused by continuous high CPU usage and memory consumption by the vmware-vpostgres service. The following symptoms are commonly reported:

  • vCenter Server runs out of memory and may crash.

  • The vmware-vpostgres service consistently consumes excessive memory and CPU (near 100%).

  • From /var/log/vmware/vpostgres/postgresql.log, we see that a large number of SQL statements are being executed and that individual queries take a very long time to complete, for example "duration: 11050849.640 ms" (see the sketch after this symptom list for a quick way to summarize these statements):

    YYYY-MM-DD hh:mm:ss.162 UTC 6835b87c.7842 0 VCDB vc [local] 30786 11 LOG: duration: 11050849.640 ms statement: BEGIN; declare "SQL_CUR0x7fd67c0ad150" cursor with hold for SELECT sc.stat_id, d.device_name FROM vpx_stat_counter sc, vpx_sample_time1 sm, vpx_device d, vpx_hist_stat1 st WHERE sc.entity_id = 1 AND sm.sample_time <= '2025-05-27 13:05:00.312'::timestamp GROUP BY sc.stat_id, d.device_name ORDER BY sc.stat_id, d.device_name ;fetch 1024 in "SQL_CUR0x7fd67c0ad150"
    YYYY-MM-DD hh:mm:ss.177 UTC 6835b87c.7842 0 VCDB vc [local] 30786 12 LOG: duration: 8450342.711 ms plan: Query Text: BEGIN; declare "SQL_CUR0x7fd67c0ad150" cursor with hold for SELECT sc.stat_id, d.device_name FROM vpx_stat_counter sc, vpx_sample_time1 sm, vpx_device d, vpx_hist_stat1 st WHERE sc.entity_id = 1
    and coalesce(sc.device_id, 1) = coalesce(d.device_id, 1) and sc.counter_id = st.counter_id and st.time_id = sm.time_id AND sm.sample_time > '1970-01-01 00:00:00'::timestamp AND sm.sample_time <= '2025-05-27 13:05:00.312'::timestamp GROUP BY sc.stat_id, d.device_name ORDER BY sc.stat_id, d.device_name ;fetch 1024 in "SQL_CUR0x7fd67c0ad150" Group (cost=62935302.98..3034711441.99 rows=3839808 width=164) (actual rows=342 loops=1)

  • When checking the memory consumed by vpostgres with the command grep "mem_used_bytes" postgresql.log, we can see that close to 90% of the available memory is in use, for example "mem_used_bytes = 2124564 kB" out of "mem_avail = 2413568 kB":

    YYYY-MM-DD hh:mm:ss. UTC 6834aa41.c577 0    50551 16685 LOG:  Memory check: flag = rss, mem_used_bytes = 2124564 kB, mem_avail = 2413568 kB
  • From /var/log/vmware/vpxd/vpxd-profiler.log, we see that a large number of requests came from session "52ce18fc-7357-XXXX-2e30-beXXXXXX8cf3" for property collection, and that those property collections took the form of the DB query "SELECT sc.stat_id, d.device_name FROM vpx_stat_counter sc, vpx_sample_time1 sm, vpx_device d, vpx_hist_stat1 st WHERE sc.entity_id = ?   and coalesce(sc.device_id,1) = coalesce(d.device_id,1)   and sc.counter_id = st.counter_id   and st.time_id = sm.time"

    --> ThreadState/ThreadId/55745/State/Task::lro-6105574::PerfMgr::vim.PerformanceManager.queryAvailableMetric::52ce18fc-7357-XXXX-2e30-beXXXXXX8cf3(52927915-XXXX-1b0b-XXXX-05c776434074)/State/Vdb/State/ SELECT sc.stat_id, d.device_name FR
    OM vpx_stat_counter sc, vpx_sample_time1 sm, vpx_device d, vpx_hist_stat1 st WHERE sc.entity_id = ?   and coalesce(sc.device_id,1) = coalesce(d.device_id,1)   and sc.counter_id = st.counter_id   and st.time_id = sm.time
    --> ThreadState/ThreadId/55746/State/Task::lro-3723516::PerfMgr::vim.PerformanceManager.queryAvailableMetric::52ce18fc-7357-XXXX-2e30-beXXXXXX8cf3(52927915-XXXX-1b0b-XXXX-05c776434074)/State/Vdb/State/ SELECT sc.stat_id, d.device_name FR
    OM vpx_stat_counter sc, vpx_sample_time1 sm, vpx_device d, vpx_hist_stat1 st WHERE sc.entity_id = ?   and coalesce(sc.device_id,1) = coalesce(d.device_id,1)   and sc.counter_id = st.counter_id   and st.time_id = sm.time

  • When searching /var/log/vmware/vpostgres/postgresql.log for lock waits, for example with grep "Process holding the lock" postgresql.log, you will see entries similar to the following:

    YYYY-MM-DD hh:mm:ss. UTC 66731273.2e19 0 VCDB vc [local] 11801 8 DETAIL:  Process holding the lock: 11683. Wait queue: 13029, 42605, 7358, 11801.
    YYYY-MM-DD hh:mm:ss. UTC 66731280.32f7 0 VCDB vc [local] 13047 4 DETAIL:  Process holding the lock: 11683. Wait queue: 13029, 42605, 7358, 11801, 13047.
    YYYY-MM-DD hh:mm:ss. UTC 66731271.2d47 0 VCDB vc [local] 11591 4 DETAIL:  Process holding the lock: 11683. Wait queue: 13029, 42605, 7358, 11801, 13047, 11591.
  • When checking the session, we found that the monitoring tool with IP XX.XX.XX.XX had issued a very large number of queries:
                                    
    grep -R "52ce18fc-7357-XXXX-2e30-beXXXXXX8cf3" vpxd-profiler* | grep "ClientIP" | less

    vpxd-profiler-1406.log:--> /SessionStats/SessionPool/Session/Id='52ce18fc-7357-XXXX-2e30-beXXXXXX8cf3'/Username='VSPHERE.LOCAL\<User_name>'/ClientIP='XX.XX.XX.XX'/PropertyCollector/TriggeredFiltersCount/total 0
    vpxd-profiler-1406.log:--> /SessionStats/SessionPool/Session/Id='52ce18fc-7357-XXXX-2e30-beXXXXXX8cf3'/Username='VSPHERE.LOCAL\<User_name>'/ClientIP='XX.XX.XX.XX'/PropertyCollector/NullCollectorCount/total 0
    vpxd-profiler-1406.log:--> /SessionStats/SessionPool/Session/Id='52ce18fc-7357-XXXX-2e30-beXXXXXX8cf3'/Username='VSPHERE.LOCAL\<User_name>'/ClientIP='XX.XX.XX.XX'/PropertyCollector/TotalObjectCount/total 0
    vpxd-profiler-1406.log:--> /SessionStats/SessionPool/Session/Id='52ce18fc-7357-XXXX-2e30-beXXXXXX8cf3'/Username='VSPHERE.LOCAL\<User_name>'/ClientIP='XX.XX.XX.XX'/PropertyCollector/QueuedOpsCount/total 0
    vpxd-profiler-1406.log:--> /SessionStats/SessionPool/Session/Id='52ce18fc-7357-XXXX-2e30-beXXXXXX8cf3'/Username='VSPHERE.LOCAL\<User_name>'/ClientIP='XX.XX.XX.XX'/PropertyCollector/TriggeredProcessGUReqs/total 0

  • You can see how many times queryAvailableMetric has been called with the command below, which in this case shows an excessive number of calls:

    cat vpxd-profiler-1406.log | grep 'PerformanceManager\.queryAvailableMetric' | grep -oE '::[a-f0-9-]{36}\(' | sed 's/::\(.*\)(/\1/' | sort | uniq -c | sort -nr | head -10
      22896 52ce18fc-7357-XXXX-2e30-beXXXXXX8cf3
  • To see where the calls are coming from, search for the session UUID; this shows the originator, which should be investigated in this case:

    awk '/52ce18fc-7357-XXXX-2e30-beXXXXXX8c/ && /PropertyCollector/ && /ClientIP/' vpxd-profiler-1406.log | head -1
    --> /SessionStats/SessionPool/Session/Id='57-XXXX-2e30-beXXXXXX8c'/Username='VSPHERE.LOCAL\username'/ClientIP='10.x.x.x'/PropertyCollector/ReadLocked/total 0

  • To see the number of connections the originator had established, run the command below:

    cat vpxd-profiler-1091.log| grep -oE "/Username='[^']+'/ClientIP='[^']+'/" | sort | uniq -c | sort -nr | head -10
    /Username='VSPHERE.LOCAL\usernam'/ClientIP='10.x.x.x'/

    We can see from the example above that the tool is querying the DB excessively even though it is not creating many connections. Excessive queries to the DB can still cause this issue.
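
As a quick way to summarize the long-running statements shown in the postgresql.log excerpts above, the logged durations can be ranked with standard shell tools. This is a minimal sketch that only assumes the "duration: <N> ms" log format shown in the excerpts; the 200-character truncation is an arbitrary choice to keep the output readable.

    # Rank the statements logged with a duration in postgresql.log, slowest first.
    grep "duration:" /var/log/vmware/vpostgres/postgresql.log \
      | awk '{ for (i = 1; i <= NF; i++) if ($i == "duration:") { print $(i+1) " ms  " substr($0, 1, 200); break } }' \
      | sort -rn \
      | head -20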

Environment

vCenter 7.x/8.x

Cause

This issue occurs when a monitoring or automation tool sends numerous performance-related SQL queries to vPostgres, querying large amounts of historical data.

Resolution

To resolve the issue, perform the following:

  • Identify and Stop the Query Source

    • From logs, determine the client IP or user initiating the excessive queries (a consolidated sketch is shown after this list).

  • Disable or Limit the Data Collection

    • If the source is an external monitoring tool, either stop it temporarily or reconfigure it to reduce query load.

    • Avoid full table scans or unfiltered historical queries.

  • Apply Constraints or Filtering

    • Where possible, configure the tool to use shorter time ranges or fetch only required metrics.

    • Introduce query-level filters to avoid wide joins on large metric tables.

  • If the initiated connections are coming from a third-party tool, refer to the tool's vendor for best practices and tuning parameters.

    • Refer to the vendor for guidance on tuning the tool to reduce the number of queries to the PerformanceManager.

    • Another option is to prevent the tool from collecting certain metrics from vCenter. Consult the vendor's documentation for reference.
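
To support the "Identify and Stop the Query Source" step referenced above, the per-session analysis from the Issue/Introduction section can be combined into a single pass over the profiler logs. This is a minimal sketch, not an exhaustive procedure: it assumes the profiler logs live under /var/log/vmware/vpxd and use the line formats shown earlier, the shell variable top_session is purely illustrative, and rotated (compressed) profiler logs would need zgrep instead of grep.

    cd /var/log/vmware/vpxd

    # Session ID with the most PerformanceManager.queryAvailableMetric calls.
    top_session=$(grep -h 'PerformanceManager\.queryAvailableMetric' vpxd-profiler*.log \
      | grep -oE '::[a-f0-9-]{36}\(' | tr -d ':()' \
      | sort | uniq -c | sort -nr | head -1 | awk '{print $2}')
    echo "Busiest session: ${top_session}"

    # Username and client IP associated with that session.
    grep -h "${top_session}" vpxd-profiler*.log \
      | grep -oE "Username='[^']+'/ClientIP='[^']+'" | sort -u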

Additional Information

  • The following command can be used to get a list of all the PIDs that are holding locks and how many log lines report each PID:

    less postgresql.log| grep "Process holding the lock" | awk -F 'Wait' '{print $1}' | awk -F ':' '{print $5}'| awk -F '.' '{print $1}'| uniq -c 

  • The following command can be used to save the list of PIDs that are causing the locks to a file for further investigation (a follow-up sketch for retrieving the statements run by those PIDs is shown after this list):

     less postgresql.log| grep "Process holding the lock" | awk -F 'Wait' '{print $1}' | awk -F ':' '{print $5}'| awk -F '.' '{print $1}'| uniq > pidscausinglocks.log
  • You can check the stat table counts with the following command.  

    VCDB# SELECT count(*) cnt, 'Stat Level 1' FROM vpx_hist_stat1
    VCDB# UNION ALL
    VCDB# SELECT count(*) cnt, 'Stat Level 2' FROM vpx_hist_stat2
    VCDB# UNION ALL
    VCDB# SELECT count(*) cnt, 'Stat Level 3' FROM vpx_hist_stat3
    VCDB# UNION ALL
    VCDB# SELECT count(*) cnt, 'Stat Level 4' FROM vpx_hist_stat4;
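
On the vCenter Server Appliance, the same query can also be run non-interactively from the shell. The sketch below assumes the default psql location (/opt/vmware/vpostgres/current/bin/psql) and that a local connection to VCDB as the postgres user is permitted, which may differ in your environment.

    # Report row counts for the four historical stat tables in VCDB.
    /opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres -c "
      SELECT count(*) cnt, 'Stat Level 1' FROM vpx_hist_stat1
      UNION ALL
      SELECT count(*) cnt, 'Stat Level 2' FROM vpx_hist_stat2
      UNION ALL
      SELECT count(*) cnt, 'Stat Level 3' FROM vpx_hist_stat3
      UNION ALL
      SELECT count(*) cnt, 'Stat Level 4' FROM vpx_hist_stat4;"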
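
Once pidscausinglocks.log has been generated (see the bullet above), the statements issued by those backend PIDs can be pulled from the same log. This is a rough sketch; it assumes the backend PID appears as a separate field on each log line, as in the examples in this article.

    # For each PID that was holding a lock, show a few of the statements it logged.
    while read -r pid; do
      echo "=== PID ${pid} ==="
      grep -w "${pid}" postgresql.log | grep "statement:" | head -5
    done < pidscausinglocks.log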