The cost estimation system learns the characteristics of the stored data from statistical information. Based on this information, it estimates how much data a predicate will filter out. The proportion of rows that satisfy the condition relative to the total number of rows is called the selectivity:
selectivity = number of tuples left after filtering / number of tuples before filtering
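As a rough illustration (not the planner's actual code), the selectivity formula is just a ratio:

```python
def selectivity(rows_after_filter: int, rows_before_filter: int) -> float:
    """Fraction of tuples that survive the predicate."""
    return rows_after_filter / rows_before_filter

# A predicate that keeps 1 row out of 10,000,000 is highly selective (low value):
print(selectivity(1, 10_000_000))           # 1e-07
# A predicate that keeps every row is not selective at all:
print(selectivity(10_000_000, 10_000_000))  # 1.0
```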
In general, the higher the selectivity, the more likely the planner is to choose a sequential scan (SeqScan); the lower the selectivity, the more likely it is to choose an index scan (IndexScan).
```
postgres=# CREATE TABLE sampletable (x numeric);
CREATE TABLE
postgres=# INSERT INTO sampletable SELECT random() * 10000 FROM generate_series(1, 10000000);
INSERT 0 10000000
postgres=# CREATE INDEX idx_x ON sampletable(x);
CREATE INDEX
postgres=# ANALYZE;
ANALYZE

-- The selectivity is low, so an index scan is used to fetch the data
postgres=# explain SELECT * FROM sampletable WHERE x = 42353;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Gather Motion 1:1  (slice1; segments: 1)  (cost=0.00..6.00 rows=1 width=11)
   ->  Index Scan using idx_x on sampletable  (cost=0.00..6.00 rows=1 width=11)
         Index Cond: (x = 42353::numeric)
 Optimizer: Pivotal Optimizer (GPORCA)
(4 rows)

-- The selectivity is high, so a sequential scan is used to fetch the data
postgres=# explain SELECT * FROM sampletable WHERE x < 42353;
                                       QUERY PLAN
---------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..1093.97 rows=10000000 width=11)
   ->  Seq Scan on sampletable  (cost=0.00..684.03 rows=3333334 width=11)
         Filter: (x < 42353::numeric)
 Optimizer: Pivotal Optimizer (GPORCA)
(4 rows)
```
This is because an index scan produces random reads. The PostgreSQL database defines different costs for sequential reads and random reads:
```
SEQ_PAGE_COST     1.0   // cost per page for a sequential read
RANDOM_PAGE_COST  4.0   // cost per page for a random read
```
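The trade-off between the two constants can be sketched with a deliberately simplified model (it ignores CPU costs, caching, and data correlation, and assumes one random page read per matching row; it is not PostgreSQL's real costing code):

```python
SEQ_PAGE_COST = 1.0     # cost per page for a sequential read
RANDOM_PAGE_COST = 4.0  # cost per page for a random read

def seq_scan_cost(total_pages: int) -> float:
    # A sequential scan always reads every page of the table, in order.
    return total_pages * SEQ_PAGE_COST

def index_scan_cost(matching_rows: int) -> float:
    # Pessimistic sketch: every matching row triggers one random page read.
    return matching_rows * RANDOM_PAGE_COST

total_pages = 50_000
print(seq_scan_cost(total_pages))   # 50000.0
print(index_scan_cost(100))         # 400.0    -> index scan wins
print(index_scan_cost(100_000))     # 400000.0 -> sequential scan wins
```

Even though the sequential scan reads far more pages, its cheaper per-page cost means it wins once the predicate matches enough rows.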
If a query selects only a small percentage of the rows in a table, PostgreSQL will decide on an index scan.
If a query selects a large percentage of the rows, PostgreSQL will decide to read the table completely.
If a query selects too many rows for an index scan to be efficient, but too few to justify a sequential scan, the planner will choose a bitmap scan.
The idea behind a bitmap scan is that each heap block is read only once during the scan. A bitmap scan can also be very helpful when you want to combine more than one index on a single table.
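The block-deduplication idea can be sketched as follows (hypothetical helper names, not PostgreSQL internals): tuple pointers from one or more index scans are collapsed into a set of heap blocks, so each block is visited at most once, in physical order.

```python
def bitmap_scan_blocks(index_hits):
    """index_hits: iterable of (block_no, offset) tuple pointers collected
    from one or more index scans. Returns the distinct heap blocks to
    visit, sorted so the heap is read in physical order."""
    return sorted({block for block, _offset in index_hits})

# Several index entries point into blocks 7 and 12; each block is still read once:
hits = [(12, 3), (7, 1), (12, 9), (7, 4), (40, 2)]
print(bitmap_scan_blocks(hits))  # [7, 12, 40]
```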
```
postgres=# explain SELECT * FROM sampletable WHERE x in (252, 1024, 36598, 1, 741265, 2352685, 1024, 845612, 3567898, 365, 124187);
                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..387.97 rows=5 width=11)
   ->  Bitmap Heap Scan on sampletable  (cost=0.00..387.97 rows=2 width=11)
         Recheck Cond: (x = ANY ('{252,1024,36598,1,741265,2352685,1024,845612,3567898,365,124187}'::numeric[]))
         ->  Bitmap Index Scan on idx_x  (cost=0.00..0.00 rows=0 width=0)
               Index Cond: (x = ANY ('{252,1024,36598,1,741265,2352685,1024,845612,3567898,365,124187}'::numeric[]))
 Optimizer: Pivotal Optimizer (GPORCA)
(6 rows)
```
No. PostgreSQL chooses the query plan with the lowest cost according to its cost model, and that cost depends on the amount of data, the data distribution, and the statistics, so we can only estimate it roughly.
Yes. Data volume is not the only thing that contributes to the scan choice: data distribution, statistics, and the configuration of `SEQ_PAGE_COST` and `RANDOM_PAGE_COST` contribute to it too.
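One way to see the effect of the cost settings is to compute where the break-even selectivity sits in a toy model (one random page read per matching row, no CPU costs; illustration only, not PostgreSQL's real planner):

```python
def break_even_selectivity(total_pages: int, rows_per_page: int,
                           seq_page_cost: float = 1.0,
                           random_page_cost: float = 4.0) -> float:
    # Seq scan cost:   total_pages * seq_page_cost
    # Index scan cost: selectivity * total_rows * random_page_cost
    # Solve for the selectivity where the two costs are equal.
    total_rows = total_pages * rows_per_page
    return (total_pages * seq_page_cost) / (total_rows * random_page_cost)

# With the default costs, the index only wins below 0.25% selectivity here:
print(break_even_selectivity(50_000, 100))  # 0.0025
# Lowering random_page_cost (common on SSD storage) widens the range of
# queries for which an index scan looks attractive:
print(break_even_selectivity(50_000, 100, random_page_cost=1.1))
```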
For example, the following example also uses 10000000 rows of data, but if we change the data distribution, we get a completely different query plan:
```
postgres=# truncate table sampletable;
TRUNCATE TABLE
postgres=# INSERT INTO sampletable SELECT i FROM generate_series(1, 10000000) i;
INSERT 0 10000000
postgres=# ANALYZE;
ANALYZE
postgres=# explain SELECT * FROM sampletable WHERE x < 42353;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..11.12 rows=40925 width=6)
   ->  Index Scan using idx_x on sampletable  (cost=0.00..10.05 rows=13642 width=6)
         Index Cond: (x < 42353::numeric)
 Optimizer: Pivotal Optimizer (GPORCA)
(4 rows)
```
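With the values now running from 1 to 10000000, the predicate `x < 42353` keeps only a tiny fraction of the table, which is why the index scan becomes the cheaper plan. A quick sanity check of the selectivity:

```python
total_rows = 10_000_000
matching = 42_352  # integers 1..42352 satisfy x < 42353
print(matching / total_rows)  # 0.0042352 -> well under 1% of the table
```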