Determining if vCenter Server rollup jobs are processing performance data

Article ID: 309854

Updated On:

Products

VMware vCenter Server

Issue/Introduction

Symptoms:
  • Excessive growth of the vCenter Server database.
  • vCenter Server rollup jobs do exist.
  • Slow response or timeouts when retrieving performance data.
  • Only Real Time data is available when looking at performance data.
  • When accessing performance data for a period other than the last 24 hours, you see the message:

    Performance data is currently not available for this entity


Environment

VMware vCenter Server 4.x
VMware vCenter Server 5.x
VMware vCenter Server 6.x
VMware vCenter Server 7.x
VMware vCenter Server 8.x

Resolution

To diagnose issues with vCenter Server performance data, it is helpful to understand how vCenter Server processes performance data. There are several components that interact for the processing to occur as the data comes into vCenter Server.

Architecture

In vCenter Server 4.0, performance data is split across 8 tables and is processed by several rollup jobs that average the data over periods of time, making the information less granular as time passes. These tables are vpx_hist_stat[1-4], which store the metric data, and vpx_sample_time[1-4], which store the interval information for the data.
 
This split segregates the data for each interval into its own tables, which simplifies both storage and post-processing of incoming performance data. Post-processing is done by the statistics rollup jobs: Past Day, Past Week, and Past Month. Each job runs on a schedule so that data is processed in time to retain the configured amount of information.
vCenter Server 4.1 and vCenter Server 5.0 have a caching mechanism that prevents deadlocks and speeds up insertion into the database. Instead of inserting directly into the historical statistics tables, this mechanism writes the incoming performance information into vpx_temptable[0-2] on a timed basis and then bulk inserts it into the historical statistics tables.
The workflow is as follows. Once performance data is in the vpx_hist_stat1/vpx_sample_time1 tables, the scheduled Past Day rollup job begins post-processing it. This job runs frequently, averaging and summarizing the data and then inserting the result into the vpx_hist_stat2/vpx_sample_time2 tables. Each rollup job repeats this for its own interval, so the information becomes progressively less granular as it moves down to the vpx_hist_stat4/vpx_sample_time4 tables.
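The averaging step of a rollup can be illustrated with a minimal Python sketch. It is not vCenter Server's implementation: the function names, the 5-minute sample spacing, and the 30-minute bucket width are assumptions chosen for illustration, and the purge of the source rows is only noted in a comment.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Fixed origin used only to align timestamps to bucket boundaries.
ORIGIN = datetime(1970, 1, 1)


def bucket_start(ts, bucket):
    """Return the start of the fixed-width bucket that contains ts."""
    return ts - ((ts - ORIGIN) % bucket)


def roll_up(samples, bucket):
    """Average (timestamp, value) samples into coarser time buckets.

    Returns a list of (bucket_start, average) pairs in time order. A real
    rollup job would also purge the source rows after inserting the result.
    """
    grouped = defaultdict(list)
    for ts, value in samples:
        grouped[bucket_start(ts, bucket)].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(grouped.items())]


# Hypothetical raw samples taken every 5 minutes.
raw = [
    (datetime(2024, 1, 1, 10, 0), 10),
    (datetime(2024, 1, 1, 10, 5), 20),
    (datetime(2024, 1, 1, 10, 10), 30),
    (datetime(2024, 1, 1, 10, 15), 40),
    (datetime(2024, 1, 1, 10, 20), 50),
    (datetime(2024, 1, 1, 10, 25), 60),
    (datetime(2024, 1, 1, 10, 30), 100),
    (datetime(2024, 1, 1, 10, 35), 200),
]
rolled = roll_up(raw, timedelta(minutes=30))
```

Here the eight raw samples collapse into two rows, one per 30-minute bucket, with averages 35.0 and 150.0. This is the sense in which the data becomes less granular at each stage.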
 
The rollup jobs are scheduled to run by default on these intervals:
  • Past day stats rollup – Every 30 minutes
  • Past week stats rollup – Every 2 hours
  • Past month stats rollup – Every 24 hours
After the information is processed, it is purged from the originating performance data table. This keeps the database from growing out of control while still allowing vCenter Server to provide historical data. The granularity and duration of each interval are configured from Administration > vCenter Server Settings > Statistics.
By default, the different intervals are configured as follows:
This configuration pane allows for control and modification of:
  • Whether an Interval Duration is Enabled or Disabled.
  • How long the interval between samples is for the statistic level.
    • This is only available for the first statistic level (the 5-minute interval).
  • How long the information is kept for the statistic level.
    • By default, statistical data is kept for only 1 year, after which it is purged. This retention can be extended, and data for specific statistic levels can be kept longer if that level of granularity is needed.
  • The statistics level for the different rollup levels.
    • By default, the Statistics Level in vCenter Server is set to Level 1 for each rollup level. This controls the amount of data that is gathered for the level:
      • Level 1 – Includes the basic metrics Average Usage for CPU, Memory, Disk and Network, System Uptime, System Heartbeat, and vCenter DRS Metrics. Statistics for devices are not included at this level.
      • Level 2 – This level includes all metrics for CPU, Memory, Disk and Network counters (average, summation and latest rollup types - maximum and minimum rollup types are excluded), System Uptime, System Heartbeat and vSphere DRS metrics. Statistics for devices are not included at this level.
      • Level 3 – This level includes all metrics (including devices) for all counter groups (average, summation and latest rollup types - maximum and minimum rollup types are excluded).
      • Level 4 – This level includes all metrics supported by vCenter Server.


      Note: VMware does not recommend setting the Statistics Level higher than Level 2 unless you are debugging an issue. Higher levels collect substantially more data, and without adequate processing power on the SQL server, performance data may not be collected properly.

Diagnosis and Resolutions

To start diagnosing the performance data situation in vCenter Server, check the size of the database tables. Since vpx_hist_stat1/vpx_sample_time1 stores the raw incoming data for the statistic level, these tables frequently cause problems.
 
To check the size of the database tables:
  • If you are running Microsoft SQL (MSSQL), run the command:

    exec sp_spaceused vpx_hist_statx

    Where x is the number of the statistic table (1-4).

    The output is similar to:



    The Rows and Data columns show the amount of data that exists in the table. An acceptable amount depends on the size of the environment, but there may be a problem if the vpx_hist_stat1 table contains more than 10 million rows.
  • For MSSQL, Oracle, and DB2, run the command:

    select count(*) from vpx_hist_statx

    Where x is the number of the statistic table you are interested in (1-4).

    This query returns the number of rows in the database table specified.
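The row-count check can be scripted. The following is a minimal Python sketch, not a supported VMware tool: the helper names are hypothetical, and the demonstration uses an in-memory SQLite table as a stand-in for the real vCenter Server database. Against a real database, you would pass a DB-API connection from the appropriate driver instead. The 10 million row figure is the one quoted above for vpx_hist_stat1.

```python
import re
import sqlite3

# Row count above which vpx_hist_stat1 may indicate a problem (from the text).
ROW_WARNING_THRESHOLD = 10_000_000

# Only the four historical statistics tables are accepted, so the table name
# can be interpolated into the query safely.
_TABLE_NAME = re.compile(r"vpx_hist_stat[1-4]")


def count_rows(conn, table):
    """Run 'select count(*)' against one of the vpx_hist_stat tables."""
    if not _TABLE_NAME.fullmatch(table):
        raise ValueError(f"unexpected table name: {table!r}")
    cursor = conn.cursor()
    cursor.execute(f"select count(*) from {table}")
    return cursor.fetchone()[0]


def needs_attention(row_count, threshold=ROW_WARNING_THRESHOLD):
    """Return True when the row count exceeds the warning threshold."""
    return row_count > threshold


# Demonstration against a stand-in table; the stat_val column is hypothetical.
demo = sqlite3.connect(":memory:")
demo.execute("create table vpx_hist_stat1 (stat_val integer)")
demo.executemany("insert into vpx_hist_stat1 values (?)", [(1,), (2,), (3,)])
rows = count_rows(demo, "vpx_hist_stat1")
print(rows, needs_attention(rows))
```

The threshold is a rule of thumb that depends on the size of the environment, so treat a True result as a prompt to investigate rather than proof of a fault.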
If the rollup jobs are not running, start with the vpx_hist_stat1 table, as this is where the unprocessed performance data is stored.
 
If you think that performance data is not being processed, validate the last time that data was successfully processed and moved to the next database table with the command:
 
select max(sample_time) from vpx_sample_time2
 
This query returns the date and time at which performance data was last successfully rolled up and purged from the vpx_hist_stat1/vpx_sample_time1 tables. If the date returned is more than 24 hours in the past, there is likely a problem with the rollup jobs. The query can also be used to gauge how much data is being processed. After a previous problem, run it repeatedly to decide whether to wait for the rollup jobs to work through the backlog or to truncate the data. If more data arrives than is processed during the period being measured, the server may never be able to catch up.
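The two checks described above can be expressed as a short Python sketch. The function names are hypothetical, and the catch-up test is one possible reading of the guidance: run the max(sample_time) query twice, some time apart, and compare how far the rollup lags behind the clock on each occasion.

```python
from datetime import datetime, timedelta

# The 24-hour figure comes from the text above.
STALE_AFTER = timedelta(hours=24)


def rollup_is_stale(last_rollup, now, stale_after=STALE_AFTER):
    """True when the newest rolled-up sample is older than the threshold."""
    return now - last_rollup > stale_after


def backlog_is_shrinking(first_check, second_check):
    """Compare two observations of max(sample_time) from vpx_sample_time2.

    Each observation is a (checked_at, last_rollup) pair. The lag is the
    distance between the clock and the newest rolled-up sample. If the lag
    did not shrink between checks, data is arriving at least as fast as it
    is being processed.
    """
    first_lag = first_check[0] - first_check[1]
    second_lag = second_check[0] - second_check[1]
    return second_lag < first_lag


# Hypothetical observations taken two hours apart.
first = (datetime(2024, 1, 2, 12, 0), datetime(2024, 1, 1, 0, 0))   # lag: 36 hours
second = (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 1, 6, 0))  # lag: 32 hours
stale = rollup_is_stale(first[1], first[0])
catching_up = backlog_is_shrinking(first, second)
```

In this example the rollup is stale, but the lag fell from 36 to 32 hours, which suggests the jobs are catching up and waiting may be reasonable. A lag that stays flat or grows points toward the truncation option.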
 
Issues with the rollup jobs can occur if:
  • Statistic rollup jobs do not exist. In an upgrade or recovery scenario, the database may be properly restored without the rollup jobs being restored. Validate within MSSQL or Oracle whether the Past Day, Week, and Month rollup jobs exist, and recreate them if necessary. For more information, see Updating rollup jobs after the error: Performance data is currently not available for this entity.
  • The MSSQL Agent service is not started on the database server. By default, when SQL is installed, the MSSQL Agent service is set to Manual startup, so after a reboot the service does not start and the rollup jobs may not run. Validate this by checking the agent services and making sure that the SQL Agent service is started and set to start automatically.
  • Statistic collection levels are set too high for the given configuration. If statistic collection levels above Level 2 are used for anything other than debugging an issue, the database may grow. Reducing to a lower level stabilizes the system, but recovery may not be possible.
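The first check in this list can be sketched in Python. Given the job names already retrieved from the database server's job scheduler, the helper below reports which of the three rollup jobs appear to be missing. The function name is hypothetical. Matching on the labels from the schedule list earlier in this article, case-insensitively and as a substring so that a name carrying extra text still matches, is an assumption about how the jobs are named in a given environment.

```python
# Labels taken from the schedule list earlier in this article, lowercased.
EXPECTED_ROLLUP_JOBS = (
    "past day stats rollup",
    "past week stats rollup",
    "past month stats rollup",
)


def missing_rollup_jobs(job_names):
    """Return the expected rollup job labels not found in job_names.

    Matching is a case-insensitive substring test, so a job name that
    carries extra text (for example a database name) still counts.
    """
    lowered = [name.lower() for name in job_names]
    return [
        expected
        for expected in EXPECTED_ROLLUP_JOBS
        if not any(expected in name for name in lowered)
    ]


# Hypothetical job list in which the monthly job was not restored.
found = ["Past Day stats rollup VCDB", "Past Week stats rollup VCDB"]
missing = missing_rollup_jobs(found)
```

An empty result means all three labels were found. It does not confirm that the jobs are enabled or that the agent service is running, which still need to be checked separately.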

Truncating the unprocessed information from the vpx_hist_stat1 table is normally a last resort, but it may be necessary if the data cannot be processed in a reasonable period of time. To truncate the unprocessed information, run these commands:

truncate table vpx_hist_stat1
truncate table vpx_sample_time1


These commands delete the data that has not been processed. The remainder of the historical data prior to experiencing the issue is left intact.

 

 

Additional Information