vCenter Server shows CPU Exhaustion with high CPU usage reported for database FETCH queries

Article ID: 425546

Products

VMware vCenter Server

Issue/Introduction

  • The vCenter Server Appliance (VCSA) may display a persistent "CPU Exhaustion" alarm at the top of the vSphere Client.

  • When checking the appliance using the top command, multiple Postgres processes (user=vpostgr) show excessive CPU consumption (often 90% or higher).

  • Other processes may show high CPU consumption as well, such as the vCenter Identity Management service (user=idmserv).

  • Performance within the vSphere Client may become sluggish or unresponsive due to management service congestion.

  • In the /var/log/vmware/vpxd/vpxd.log file, frequent "Invoke done" and "OnStreamClose" messages may be seen for a specific entity ID that does not exist in the current inventory.
    • Example log entry with ID 55555 representing an object that no longer exists in vCenter:
      YYYY-MM-DDTHH:MM:SSZ verbose vpxd[######] ... vim.view.View.destroy, ... internal, 8.0.3.0, ... id: 55555, state(in/out): 3/1 ...

    • Check the vpx_entity table to verify whether an ID from the logs has been deleted:
      /opt/vmware/vpostgres/current/bin/psql -U postgres -d VCDB -c "SELECT name, type_id FROM vpx_entity WHERE id = <ID_FROM_LOGS>;"

    • An empty result confirms that no entity with that ID exists in the database.
       
  • In the database activity view (pg_stat_activity), multiple active cursors may be seen performing SELECT operations on vpx_hist_stat tables with a sample_time defaulting to 1970-01-01.
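
  These cursors can be listed directly from pg_stat_activity. The following is a minimal sketch; the ILIKE filter is only a convenience for narrowing the output to the tables named above:

    # Sketch: show active backends whose current statement touches the vpx_hist_stat tables
    /opt/vmware/vpostgres/current/bin/psql -U postgres -d VCDB -c "SELECT pid, state, query_start, left(query, 120) AS query FROM pg_stat_activity WHERE state = 'active' AND query ILIKE '%vpx_hist_stat%';"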

Cause

This issue typically occurs when an external monitoring tool or service retains a reference to a deleted vCenter entity (a Managed Object such as a virtual machine) and repeatedly polls for its performance data. Several factors combine to make these queries expensive:

  • The VCSA database uses a partitioned schema for historical statistics, which can grow over time to include a very large number of partitions.

  • If a monitoring query is initiated without a specified BEGIN_TIME, vCenter defaults the search to the Unix epoch (1970).

  • Because the target entity has been deleted and cannot be found via indexed lookups, the database is forced to perform a Full Table Scan across all historical partitions (often hundreds) to verify no data exists.

This exhaustive scan consumes nearly all available CPU resources, impacting the database and secondary services such as the Identity Manager, which may also show processes at 90%+ CPU usage.
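
The number of partitions involved can be confirmed directly. A minimal sketch using the standard pg_tables catalog view and the same table-name pattern referenced later in this article:

  # Sketch: count the vpx_hist_stat partitions a non-indexed lookup would have to scan
  /opt/vmware/vpostgres/current/bin/psql -U postgres -d VCDB -c "SELECT count(*) AS hist_stat_partitions FROM pg_tables WHERE tablename LIKE 'vpx_hist_stat%';"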

Resolution

To resolve this issue, the external source of the queries must be identified and corrected, and the database congestion must be cleared. This involves three steps:

  1. Identify the Source IP: Review /var/log/vmware/envoy/envoy-access.log to identify the external IP address making frequent QueryPerf calls to the /sdk endpoint (a log-filter sketch follows this list).

  2. Update Monitoring Configuration: Engage the owner of the identified monitoring server to:

    • Refresh the tool's inventory discovery to purge deleted entity IDs (e.g., 55555) from its polling queue.

    • Ensure all QueryPerf API calls include a specific, recent BEGIN_TIME (e.g., last 20 minutes) to prevent global database scans.

  3. Clear Database Congestion: To immediately recover CPU cycles, the statistics partitions may be cleared and optimized. See steps below.
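
For step 1, a quick way to surface the most frequent callers is to filter the Envoy access log for /sdk requests and tally the client addresses. This is only a sketch; the field that holds the client address depends on the access-log format of the specific vCenter build, so the awk column below ($NF) is an assumption that may need adjusting:

    # Sketch: tally /sdk requests by client address (adjust the awk field to the column that holds the client IP)
    grep '/sdk' /var/log/vmware/envoy/envoy-access.log | awk '{print $NF}' | sort | uniq -c | sort -rn | head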

 

Steps to Clear Database Congestion and temporarily recover from the CPU exhaustion. Perform the following on the vCenter appliance:

***Note: This action will wipe all historical statistics from vCenter. If steps 1 and 2 are completed, the CPU exhaustion should subside over time without this procedure; however, these steps recover the VCSA CPU immediately if the loss of historical statistics is acceptable.

  1. Ensure there are valid, recent backups of the VCSA.

  2. Take a new offline snapshot of the VCSA VM. If vCenter is using Enhanced Linked Mode, all linked VCSAs must be powered off at the same time and offline snapshots must be taken together before powering them back on. 

  3. Stop vCenter Services: service-control --stop --all

  4. Start the vPostgres Service: service-control --start vpostgres

  5. Truncate Statistics Partitions:
    1. Log into the vCenter's database:
      /opt/vmware/vpostgres/current/bin/psql -U postgres -d VCDB

    2. Run the following loop to truncate all of the historical statistics partitions (the target of the scans):
      DO $$
      DECLARE
        r RECORD;
      BEGIN
        FOR r IN (SELECT tablename FROM pg_tables WHERE tablename LIKE 'vpx_hist_stat%') LOOP
          EXECUTE 'TRUNCATE TABLE ' || quote_ident(r.tablename) || ' CASCADE';
        END LOOP;
      END $$;

      ***After running this, the output "DO" confirms that the loop completed. (An optional verification query is sketched after these steps.)

    3. Exit the psql prompt with \q

  6. Rebuild Indexes: Run a full database vacuum to reclaim resources:
    /opt/vmware/vpostgres/current/bin/vacuumdb -U postgres -d VCDB --full --analyze

  7. Restart Services: service-control --start --all
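
If desired, the truncation performed in step 5 can be sanity-checked with a quick row count. This verification query is a sketch and is not part of the original procedure; it relies on the standard pg_stat_user_tables view, whose live-row counters are reset by TRUNCATE:

  # Optional sketch: confirm the historical-statistics tables report no remaining rows
  /opt/vmware/vpostgres/current/bin/psql -U postgres -d VCDB -c "SELECT coalesce(sum(n_live_tup), 0) AS remaining_rows FROM pg_stat_user_tables WHERE relname LIKE 'vpx_hist_stat%';"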

Additional Information

Refer to KB 418224, Frequent CPU Exhaustion alarm on vCenter server, for a similar issue that is specific to monitoring with the OpsRamp tool but does not necessarily involve queries for an entity that no longer exists in vCenter.