vpxd service intermittently stops due to no space left in the database partition

Products

VMware vCenter Server

Issue/Introduction

The vpxd service stops intermittently.
The vCenter service logs (/var/log/vmware/vpxd/vpxd-*.log) contain errors similar to the following:

YYYY-MM-DDThh:mm:ss.xxxZ error vpxd[06541] [Originator@6876 sub=Default opID=388c9fe6] [VdbStatement] Execute result code: -1
YYYY-MM-DDThh:mm:ss.xxxZ error vpxd[06541] [Originator@6876 sub=Default opID=388c9fe6] [VdbStatement] SQL execution failed:  SELECT sc.stat_id, d.device_name FROM vpx_stat_counter sc, vpx_sample_time1 sm, vpx_device d, vpx_hist_stat1 st WHERE sc.entity_id = ?   and coalesce(sc.device_id,1) = coalesce(d.device_id,1)   and sc.counter_id = st.counter_id   and st.time_id = sm.time_id   AND sm.sample_time > ?   AND sm.sample_time <= ? GROUP BY sc.stat_id, d.device_name ORDER BY sc.stat_id, d.device_name
YYYY-MM-DDThh:mm:ss.xxxZ error vpxd[06541] [Originator@6876 sub=Default opID=388c9fe6] [VdbStatement] Execution elapsed time: 29355 ms
YYYY-MM-DDThh:mm:ss.xxxZ error vpxd[06541] [Originator@6876 sub=Default opID=388c9fe6] [VdbStatement] Statement diagnostic data from driver is 53100:0:1:ERROR: could not write to file "base/pgsql_tmp/pgsql_tmp35430.6": No space left on device;
--> Error while executing the query
YYYY-MM-DDThh:mm:ss.xxxZ error vpxd[06541] [Originator@6876 sub=Default opID=388c9fe6] [VdbStatement] Bind parameters:
YYYY-MM-DDThh:mm:ss.xxxZ error vpxd[06541] [Originator@6876 sub=Default opID=388c9fe6] [VdbStatement] [0]datatype: 1, size: 4, arraySize: 0
YYYY-MM-DDThh:mm:ss.xxxZ error vpxd[06541] [Originator@6876 sub=Default opID=388c9fe6] [VdbStatement] value = 165
YYYY-MM-DDThh:mm:ss.xxxZ error vpxd[06541] [Originator@6876 sub=Default opID=388c9fe6] [VdbStatement] [1]datatype: 10, size: 23, arraySize: 0
YYYY-MM-DDThh:mm:ss.xxxZ error vpxd[06541] [Originator@6876 sub=Default opID=388c9fe6] [VdbStatement] value = 1970-1-1 0:0:0.0
YYYY-MM-DDThh:mm:ss.xxxZ error vpxd[06541] [Originator@6876 sub=Default opID=388c9fe6] [VdbStatement] [2]datatype: 10, size: 23, arraySize: 0
YYYY-MM-DDThh:mm:ss.xxxZ error vpxd[06541] [Originator@6876 sub=Default opID=388c9fe6] [VdbStatement] value = YYYY-MM-DD hh:mm:ss.360000000
YYYY-MM-DDThh:mm:ss.xxxZ warning vpxd[08562] [Originator@6876 sub=StatsAggregator opID=685ac8bb] counterId = 276 and instance =  not found
YYYY-MM-DDThh:mm:ss.xxxZ warning vpxd[08562] [Originator@6876 sub=StatsAggregator opID=685ac8bb] counterId = 276 and instance =  not found
YYYY-MM-DDThh:mm:ss.xxxZ warning vpxd[08562] [Originator@6876 sub=StatsAggregator opID=685ac8bb] counterId = 276 and instance = 69846 not found
YYYY-MM-DDThh:mm:ss.xxxZ warning vpxd[08562] [Originator@6876 sub=StatsAggregator opID=685ac8bb] counterId = 276 and instance = 69847 not found
YYYY-MM-DDThh:mm:ss.xxxZ warning vpxd[08562] [Originator@6876 sub=StatsAggregator opID=685ac8bb] counterId = 276 and instance = 69850 not found
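
To check quickly whether a log file contains this failure signature, a small helper along the following lines may be useful. This is a minimal sketch: the function name count_no_space_errors is hypothetical, and the search pattern is simply the error string from the excerpt above.

```shell
# Hypothetical helper: count log lines that carry the PostgreSQL
# "No space left on device" error shown in the excerpt above.
count_no_space_errors() {
    # Concatenate all given files so grep -c prints one total
    # instead of one file:count pair per file.
    # "|| true" keeps the exit status at 0 when nothing matches.
    cat "$@" 2>/dev/null | grep -c 'No space left on device' || true
}
```

Example call: count_no_space_errors /var/log/vmware/vpxd/vpxd-*.log. Compressed rotated logs, if present, would need to be decompressed first.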

Environment

vCenter Server 7.0.x

vCenter Server 8.0.x

Cause

Long retention times and/or statistics levels above 2, kept in place for extended periods, cause a large volume of data to be collected. As a result, the statistics rollup creates a large amount of temporary data, which can exhaust the database partition.
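
To see how close a partition is to being full, disk usage can be checked from the appliance shell. The following is a minimal sketch, assuming the vCenter database is stored under /storage/db on the appliance (verify the mount point in your environment); the helper name mount_usage_percent is hypothetical.

```shell
# Hypothetical helper: print the usage percentage (digits only) of the
# filesystem that contains the given path.
mount_usage_percent() {
    # df -P forces the POSIX one-line-per-filesystem layout;
    # field 5 of the second line is the usage, e.g. "87%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

Example call: mount_usage_percent /storage/db. A value close to 100 would be consistent with the "No space left on device" errors in the log.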

Resolution

To fix this issue, set the statistics level to at most 2 for the first two intervals (day and week) and to 1 for the remaining two (month and year). This can be done either in vSphere Client or by directly updating the settings in the vCenter database.

Option A - changing the statistics levels using vSphere Client

To change the statistics levels using vSphere Client:

Log in with an account that has administrator permissions for the vCenter.
Select the vCenter object in the Hosts & Clusters tree.
Go to Configure > Settings > General.
Click the EDIT button in the upper right-hand corner.
Verify that the retention is set to the default values (1 day, 1 week, 1 month, and 1 year), and that the statistics level is at most 2 for the day and week intervals and 1 for the month and year intervals.

Option B - updating the statistics levels in the vCenter database

If you cannot use vSphere Client to change these settings, you can alternatively change them directly in the vCenter database.

Note: Before attempting any manual changes to the vCenter database, make sure you have a fresh backup or snapshot of the vCenter Server Appliance. If the appliance is part of an Enhanced Linked Mode (ELM) deployment, you need to create backups or offline snapshots for all ELM members.

To update the statistics levels and intervals in the vCenter database, open an SSH connection to the VCSA, log in using the root account, and run the commands below:

Stop the vCenter Server service:
```
# service-control --stop vmware-vpxd
```
Connect to the database:
```
# psql -d VCDB -U postgres
```

Run the following 4 queries to alter the stats levels:

update vpx_stat_interval_def set stats_level = 2 where interval_seq_num = 1;
update vpx_stat_interval_def set stats_level = 2 where interval_seq_num = 2;
update vpx_stat_interval_def set stats_level = 1 where interval_seq_num = 3;
update vpx_stat_interval_def set stats_level = 1 where interval_seq_num = 4;
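
After running the updates, the stored levels can be read back to confirm the change. The sketch below only prints a read-only query built from the table and column names used in the statements above; the function name stats_level_check_sql is hypothetical.

```shell
# Hypothetical helper: print a read-only query that lists the statistics
# level stored for each interval.
stats_level_check_sql() {
    printf '%s\n' "select interval_seq_num, stats_level from vpx_stat_interval_def order by interval_seq_num;"
}
```

Inside an open psql session, the printed select statement can be entered directly. From the appliance shell, the output could instead be piped to psql with the same connection options as above: stats_level_check_sql | psql -d VCDB -U postgres. The expected levels are 2, 2, 1, and 1 for intervals 1 through 4.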

Run the following 4 queries to reset the retention for each of the statistics intervals:

update vpx_stat_interval_def set interval_length = 86400 where interval_seq_num = 1;
update vpx_stat_interval_def set interval_length = 604800 where interval_seq_num = 2;
update vpx_stat_interval_def set interval_length = 2592000 where interval_seq_num = 3;
update vpx_stat_interval_def set interval_length = 31536000 where interval_seq_num = 4;
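
The interval_length values in these queries are expressed in seconds. As a quick sanity check, the short sketch below converts them to days; the function name seconds_to_days is hypothetical.

```shell
# Convert an interval length in seconds to whole days (86400 seconds per day).
seconds_to_days() {
    echo $(( $1 / 86400 ))
}

# The four interval_length values from the queries above.
for secs in 86400 604800 2592000 31536000; do
    echo "$secs seconds = $(seconds_to_days "$secs") days"
done
```

This prints 1, 7, 30, and 365 days, which correspond to the day, week, month, and year intervals.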

Next, clean up the temporary data in the database.

Note: Again, make sure that a fresh backup exists before making any changes to the vCenter database.

If you are still connected to the database, continue with the queries below. Otherwise, stop the vCenter Server service:
```
# service-control --stop vmware-vpxd
```
Next, connect to the vCenter database:
```
# psql -d VCDB -U postgres
```

Run the following queries:

alter system set temp_tablespaces = 'hs1';
select pg_reload_conf();
show temp_tablespaces;

The output of the last query should look like this example:

postgres=# show temp_tablespaces ;
 temp_tablespaces 
------------------
 hs1
(1 row)

Exit psql:
```
\q
```
Start the vCenter Server service:
```
service-control --start vmware-vpxd
```