default_statistics_target

Products

VMware Tanzu Greenplum

Issue/Introduction

This article pertains to Pivotal Greenplum Database (GPDB) all versions

This article explains how parameter default_statistics_target works with an example.

Resolution

Definition:

As per the documentation, the definition of default_statistics_target is described as: "Sets the default statistics target for table columns without a column-specific target set via ALTER TABLE SET STATISTICS. Larger values increase the time needed to do ANALYZE, but might improve the quality of the planner's estimates".

Default Value/Range:

By default, Greenplum set the value to 25 (4.3.x), 100 (5.x) it can accept values from 1 to 1000. The parameter can be set at the session level and doesn't require a restart of the database.

Usage:

From the description and definition mentioned above, it may be confusing to understand what it's used for.

In short and in basic terms, this parameter controls the way the stats are collected, with value 1 being the least estimated or accurate statistic and the value 1000 being the most accurate statistic - with the expense of time and resources such as CPU, memory, space, etc. Normally, the default value is sufficient to get an accurate plan but if you have a complex data distribution or a column is often referenced in the query then setting a higher value might help in getting a better statistic on the table and a better plan for the optimizer to execute.

Example

To put this into practice, let's take the example below:

The current value of the parameter:

gpadmin=# show default_statistics_target ;
 default_statistics_target
---------------------------
 25
(1 row)
Time: 0.385 ms

Create a table

gpadmin=# create table test_stats ( a int );
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'a' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
Time: 20.562 ms

Insert data into it

gpadmin=# insert into test_stats values ( generate_series ( 1,100000));
INSERT 0 100000
Time: 505.107 ms

The pg_stats (view derived from pg_statistics table) with 25 looks like

gpadmin=# select * from pg_stats where tablename='test_stats';
 schemaname | tablename  | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs |                                                                        histogram_bounds                                                                        | correlation
------------+------------+---------+-----------+-----------+------------+------------------+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------
 public     | test_stats | a       |         0 |         4 |         -1 |                  |                   | {4,3969,8084,12315,16248,20131,24190,28145,32057,35955,39838,43994,47808,52015,56015,59827,63920,67913,72007,75926,79919,83783,87843,91818,95881,99976,100000} |
(1 row)
Time: 27.106 ms

Keep an eye on the histogram_bound column changes the value of the parameter.

default_statistics_target to 100 and reanalyze the table, the pg_stats value now has more buckets of histogram, leading to more accurate predictions.

gpadmin=# set default_statistics_target to 100;
SET
Time: 2.854 ms
gpadmin=# analyze test_stats ;
ANALYZE
Time: 408.362 ms
gpadmin=# select * from pg_stats where tablename='test_stats';
 schemaname | tablename  | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs |                                                                                                                                                                                                                                                                                                  histogram_bounds                                                                                                                                                                                                                                                                                                   | correlation
------------+------------+---------+-----------+-----------+------------+------------------+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------
 public     | test_stats | a       |         0 |         4 |         -1 |                  |                   | {1,1001,2001,3001,4001,5001,6001,7001,8001,9001,10001,11001,12001,13001,14001,15001,16001,17001,18001,19001,20001,21001,22001,23001,24001,25001,26001,27001,28001,29001,30001,31001,32001,33001,34001,35001,36001,37001,38001,39001,40001,41001,42001,43001,44001,45001,46001,47001,48001,49001,50001,51001,52001,53001,54001,55001,56001,57001,58001,59001,60001,61001,62001,63001,64001,65001,66001,67001,68001,69001,70001,71001,72001,73001,74001,75001,76001,77001,78001,79001,80001,81001,82001,83001,84001,85001,86001,87001,88001,89001,90001,91001,92001,93001,94001,95001,96001,97001,98001,99001,100000} |
(1 row)
Time: 9.201 ms

Similarly, increasing the value of default_statistics_target yields better and accurate stats for the table and its' data distribution. If you wish to collect stats on a specific column with specific or constant values and don't want to alter the stats collection irrespective of which value default_statistics_target is defined, you can define the stats method using the following:

Alter table <table_name> alter column <column_name> set statistics < value from 1 to 1000 > ;

To revert and to analyze to the value defined by default_statistics_target, use the following:

Alter table <table_name> alter column <column_name> set statistics -1 ;

Pros and Cons

Pros:

Better statistics of the distributed data.
Better plans - now the planner has more accurate statistics of the table.
Faster and quicker response from the query because now it has a good plan

Cons:

Increasing the value of the parameter leads to more time needed for analysis to complete (from our quick test of three values of the parameter with 3 columns yield the below time).

Serial #	Value	Time
1	25	300ms
2	100	2 sec
3	1000	10 sec

To do the extra work of collecting statistics, it will consume additional memory and CPU to achieve this. pg_statistics will consume more space because now it has to store more information about the stats of the table in the histogram column.

Additional Information

+ Environment:

Pivotal Greenplum Database (GPDB) all versions