The cost estimation system learns the characteristics of the stored data from statistical information. Based on this information, it estimates how much data a predicate will filter out. The proportion of rows that satisfy the condition relative to the total number of rows is called the selectivity:
selectivity = number of tuples left after filtering / number of tuples before filtering
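As a rough illustration (not the planner's actual code), the selectivity formula is just a ratio:

```python
def selectivity(rows_after_filter: int, rows_before_filter: int) -> float:
    """Fraction of tuples that survive the predicate."""
    return rows_after_filter / rows_before_filter

# A predicate that keeps 1 row out of 10,000,000 is highly selective (low value):
print(selectivity(1, 10_000_000))           # 1e-07
# A predicate that keeps every row is not selective at all:
print(selectivity(10_000_000, 10_000_000))  # 1.0
```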
In general, the higher the selectivity, the more likely the planner is to choose a sequential scan (SeqScan); the lower the selectivity, the more likely it is to choose an index scan (IndexScan).
```
postgres=# CREATE TABLE sampletable (x numeric);
CREATE TABLE
postgres=# INSERT INTO sampletable SELECT random() * 10000 FROM generate_series(1, 10000000);
INSERT 0 10000000
postgres=# CREATE INDEX idx_x ON sampletable(x);
CREATE INDEX
postgres=# ANALYZE;
ANALYZE

-- The selectivity is low, so an index scan is used to fetch the data
postgres=# explain SELECT * FROM sampletable WHERE x = 42353;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Gather Motion 1:1  (slice1; segments: 1)  (cost=0.00..6.00 rows=1 width=11)
   ->  Index Scan using idx_x on sampletable  (cost=0.00..6.00 rows=1 width=11)
         Index Cond: (x = 42353::numeric)
 Optimizer: Pivotal Optimizer (GPORCA)
(4 rows)

-- The selectivity is high, so a sequential scan is used to fetch the data
postgres=# explain SELECT * FROM sampletable WHERE x < 42353;
                                       QUERY PLAN
---------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..1093.97 rows=10000000 width=11)
   ->  Seq Scan on sampletable  (cost=0.00..684.03 rows=3333334 width=11)
         Filter: (x < 42353::numeric)
 Optimizer: Pivotal Optimizer (GPORCA)
(4 rows)
```
This is because an index scan produces random reads. The PostgreSQL database defines different costs for sequential reads and random reads:
```
SEQ_PAGE_COST     1.0   // cost per page for a sequential read
RANDOM_PAGE_COST  4.0   // cost per page for a random read
```
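The trade-off between the two constants can be sketched with a deliberately simplified model (it ignores CPU costs, caching, and data correlation, and assumes one random page read per matching row; it is not PostgreSQL's real costing code):

```python
SEQ_PAGE_COST = 1.0     # cost per page for a sequential read
RANDOM_PAGE_COST = 4.0  # cost per page for a random read

def seq_scan_cost(total_pages: int) -> float:
    # A sequential scan always reads every page of the table, in order.
    return total_pages * SEQ_PAGE_COST

def index_scan_cost(matching_rows: int) -> float:
    # Pessimistic sketch: every matching row triggers one random page read.
    return matching_rows * RANDOM_PAGE_COST

total_pages = 50_000
print(seq_scan_cost(total_pages))   # 50000.0
print(index_scan_cost(100))         # 400.0    -> index scan wins
print(index_scan_cost(100_000))     # 400000.0 -> sequential scan wins
```

Even though the sequential scan reads far more pages, its cheaper per-page cost means it wins once the predicate matches enough rows.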
If a query selects only a small percentage of the rows in a table, PostgreSQL will decide on an index scan.
If a query selects a large percentage of the rows, PostgreSQL will decide to read the table completely.
If a query selects too many rows for an index scan to be efficient, but too few to justify a sequential scan, the planner will choose a bitmap scan.
The idea behind a bitmap scan is that each heap block is read only once during the scan. A bitmap scan can also be very helpful when you want to combine more than one index on a single table.
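The block-deduplication idea can be sketched as follows (hypothetical helper names, not PostgreSQL internals): tuple pointers from one or more index scans are collapsed into a set of heap blocks, so each block is visited at most once, in physical order.

```python
def bitmap_scan_blocks(index_hits):
    """index_hits: iterable of (block_no, offset) tuple pointers collected
    from one or more index scans. Returns the distinct heap blocks to
    visit, sorted so the heap is read in physical order."""
    return sorted({block for block, _offset in index_hits})

# Several index entries point into blocks 7 and 12; each block is still read once:
hits = [(12, 3), (7, 1), (12, 9), (7, 4), (40, 2)]
print(bitmap_scan_blocks(hits))  # [7, 12, 40]
```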
```
postgres=# explain SELECT * FROM sampletable WHERE x in (252, 1024, 36598, 1, 741265, 2352685, 1024, 845612, 3567898, 365, 124187);
                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..387.97 rows=5 width=11)
   ->  Bitmap Heap Scan on sampletable  (cost=0.00..387.97 rows=2 width=11)
         Recheck Cond: (x = ANY ('{252,1024,36598,1,741265,2352685,1024,845612,3567898,365,124187}'::numeric[]))
         ->  Bitmap Index Scan on idx_x  (cost=0.00..0.00 rows=0 width=0)
               Index Cond: (x = ANY ('{252,1024,36598,1,741265,2352685,1024,845612,3567898,365,124187}'::numeric[]))
 Optimizer: Pivotal Optimizer (GPORCA)
(6 rows)
```
No. PostgreSQL chooses the query plan with the lowest cost according to its cost model, and that cost depends on the amount of data, the data distribution, and the statistics, so we can only estimate it roughly.
Yes. Data volume is not the only thing that contributes to the scan choice: data distribution, statistics, and the configuration of `SEQ_PAGE_COST` and `RANDOM_PAGE_COST` contribute to it too.
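One way to see the effect of the cost settings is to compute where the break-even selectivity sits in a toy model (one random page read per matching row, no CPU costs; illustration only, not PostgreSQL's real planner):

```python
def break_even_selectivity(total_pages: int, rows_per_page: int,
                           seq_page_cost: float = 1.0,
                           random_page_cost: float = 4.0) -> float:
    # Seq scan cost:   total_pages * seq_page_cost
    # Index scan cost: selectivity * total_rows * random_page_cost
    # Solve for the selectivity where the two costs are equal.
    total_rows = total_pages * rows_per_page
    return (total_pages * seq_page_cost) / (total_rows * random_page_cost)

# With the default costs, the index only wins below 0.25% selectivity here:
print(break_even_selectivity(50_000, 100))  # 0.0025
# Lowering random_page_cost (common on SSD storage) widens the range of
# queries for which an index scan looks attractive:
print(break_even_selectivity(50_000, 100, random_page_cost=1.1))
```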
For example, the following example also uses 10000000 rows of data, but if we change the data distribution, we get a completely different query plan:
```
postgres=# truncate table sampletable;
TRUNCATE TABLE
postgres=# INSERT INTO sampletable SELECT i FROM generate_series(1, 10000000) i;
INSERT 0 10000000
postgres=# ANALYZE;
ANALYZE
postgres=# explain SELECT * FROM sampletable WHERE x < 42353;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..11.12 rows=40925 width=6)
   ->  Index Scan using idx_x on sampletable  (cost=0.00..10.05 rows=13642 width=6)
         Index Cond: (x < 42353::numeric)
 Optimizer: Pivotal Optimizer (GPORCA)
(4 rows)
```
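With the values now running from 1 to 10000000, the predicate `x < 42353` keeps only a tiny fraction of the table, which is why the index scan becomes the cheaper plan. A quick sanity check of the selectivity:

```python
total_rows = 10_000_000
matching = 42_352  # integers 1..42352 satisfy x < 42353
print(matching / total_rows)  # 0.0042352 -> well under 1% of the table
```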