This article pertains to Pivotal Greenplum Database (GPDB) all versions
This article explains how parameter default_statistics_target
works with an example.
As per the documentation, the definition of default_statistics_target
is described as: "Sets the default statistics target for table columns without a column-specific target set via ALTER TABLE SET STATISTICS
. Larger values increase the time needed to do ANALYZE
, but might improve the quality of the planner's estimates".
By default, Greenplum set the value to 25 (4.3.x), 100 (5.x) it can accept values from 1 to 1000. The parameter can be set at the session level and doesn't require a restart of the database.
From the description and definition mentioned above, it may be confusing to understand what it's used for.
In short and in basic terms, this parameter controls the way the stats are collected, with value 1 being the least estimated or accurate statistic and the value 1000 being the most accurate statistic - with the expense of time and resources such as CPU, memory, space, etc. Normally, the default value is sufficient to get an accurate plan but if you have a complex data distribution or a column is often referenced in the query then setting a higher value might help in getting a better statistic on the table and a better plan for the optimizer to execute.
Example
To put this into practice, let's take the example below:
gpadmin=# show default_statistics_target ; default_statistics_target --------------------------- 25 (1 row) Time: 0.385 ms
Create a table
gpadmin=# create table test_stats ( a int ); NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'a' as the Greenplum Database data distribution key for this table. HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew. CREATE TABLE Time: 20.562 ms
gpadmin=# insert into test_stats values ( generate_series ( 1,100000)); INSERT 0 100000 Time: 505.107 ms
pg_stats
(view derived from pg_statistics table) with 25 looks like
gpadmin=# select * from pg_stats where tablename='test_stats'; schemaname | tablename | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation ------------+------------+---------+-----------+-----------+------------+------------------+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+------------- public | test_stats | a | 0 | 4 | -1 | | | {4,3969,8084,12315,16248,20131,24190,28145,32057,35955,39838,43994,47808,52015,56015,59827,63920,67913,72007,75926,79919,83783,87843,91818,95881,99976,100000} | (1 row) Time: 27.106 ms
Keep an eye on the histogram_bound
column changes the value of the parameter.
default_statistics_target
to 100 and reanalyze the table, the pg_stats
value now has more buckets of histogram, leading to more accurate predictions.
gpadmin=# set default_statistics_target to 100; SET Time: 2.854 ms gpadmin=# analyze test_stats ; ANALYZE Time: 408.362 ms gpadmin=# select * from pg_stats where tablename='test_stats'; schemaname | tablename | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation ------------+------------+---------+-----------+-----------+------------+------------------+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------- public | test_stats | a | 0 | 4 | -1 | | | {1,1001,2001,3001,4001,5001,6001,7001,8001,9001,10001,11001,12001,13001,14001,15001,16001,17001,18001,19001,20001,21001,22001,23001,24001,25001,26001,27001,28001,29001,30001,31001,32001,33001,34001,35001,36001,37001,38001,39001,40001,41001,42001,43001,44001,45001,46001,47001,48001,49001,50001,51001,52001,53001,54001,55001,56001,57001,58001,59001,60001,61001,62001,63001,64001,65001,66001,67001,68001,69001,70001,71001,72001,73001,74001,75001,76001,77001,78001,79001,80001,81001,82001,83001,84001,85001,86001,87001,88001,89001,90001,91001,92001,93001,94001,95001,96001,97001,98001,99001,100000} | (1 row) Time: 9.201 ms
Similarly, increasing the value of default_statistics_target
yields better and accurate stats for the table and its' data distribution. If you wish to collect stats on a specific column with specific or constant values and don't want to alter the stats collection irrespective of which value default_statistics_target
is defined, you can define the stats method using the following:
Alter table <table_name> alter column <column_name> set statistics < value from 1 to 1000 > ;
To revert and to analyze to the value defined by default_statistics_target
, use the following:
Alter table <table_name> alter column <column_name> set statistics -1 ;
Pros:
Cons:
Serial # | Value | Time |
1 | 25 | 300ms |
2 | 100 | 2 sec |
3 | 1000 | 10 sec |
To do the extra work of collecting statistics, it will consume additional memory and CPU to achieve this. pg_statistics will consume more space because now it has to store more information about the stats of the table in the histogram column.
Pivotal Greenplum Database (GPDB) all versions