Overview of gpsd in Pivotal Greenplum

Products

VMware Tanzu Greenplum

Issue/Introduction

This article discusses gpsd execution and internals.

Resolution

gpsd is a GPDB utility used to collect object metadata and statistics from a source database and import them into a target database. This essentially produces a replica of the source databases system catalog and statistics which are used in query planning.

Please note that the gpsd will collect pg_statistic entries for the tables involved in the query. These entries may have sampling values of user data. It is therefore recommended to sanitize the output gpsd file for these entries and remove them if user data is not supposed to be shared.

gpsd is included in the Greenplum Database (GPDB) distribution under the $GPHOME/bin directory.
Generally, the source database is a customers' system and the target database is an internal lab system.

The target database can be used to reproduce issues when the optimizer or planner:

Produces the wrong plan
Crashes while producing a plan
Encounters optimizer or planner related issues

Only metadata and statistics are collected. Data stored in the tables is ignored. This is to prevent too much space being consumed and customers generally do not want to share data that is confidential.

How To Execute

Execute the following command on the source system. This is where the statistics will be extracted from:

gpsd <dbname> > <gpsd_dump_file>
Example:
gpsd customer_prod1 > customer_prod1_20150808.sql

gzipp the SQL file and transfer it to the target database.

To load the data into the target database, execute the following command:

psql <dbname> -f <gpsd_dump_file>
Example:
creatdb repro_db
psql repro_db -f customer_prod1_20150808.sql

Now the target database contains the same schema and statistics. It will produce the same query plan or encounter the same crash as the source database when the specific query is executed.

gpsd also supports skipping the schema and only collecting statistics:

gpsd -s <dbname> > <gpsd_dump_file>
Example:
gpsd -s customer_prod1 > customer_prod1_20150808_updated.sql

This is useful the schema has already been collected from the source database and only the updated statistics are needed.

What happens internally?

The following steps are performed internally when gpsd is executed:

1. Global objects are dumped.

if orca:
    pg_dumpall -h  -p  -U  -l  -g --no-gp-syntax
else:
    pg_dumpall -h  -p  -U  -l  -g --gp-syntax

2. The schema is dumped.

pg_dump -h  -p  -U  -s -x --gp-syntax -O

3. Tuple counts from pg_class are dumped.

SELECT
    pgc.relname, pgn.nspname, pgc.relpages, pgc.reltuples
FROM
    pg_class pgc, pg_namespace pgn
WHERE
    pgc.relnamespace = pgn.oid
    AND pgn.nspname NOT IN ('pg_toast', 'pg_bitmapindex', 'pg_temp_1', 'pg_catalog', 'information_schema')

4. The statistics from pg_statistic are dumped.

SELECT
    pgc.relname, pgn.nspname, pga.attname, pgt.typname, pgs.*
FROM
    pg_class pgc, pg_statistic pgs, pg_namespace pgn, pg_attribute pga, pg_type pgt
WHERE
    pgc.relnamespace = pgn.oid and pgn.nspname NOT IN ('pg_toast', 'pg_bitmapindex', 'pg_temp_1', 'pg_catalog', 'information_schema')
    AND pgc.oid = pgs.starelid
    AND pga.attrelid = pgc.oid
    AND pga.attnum = pgs.staattnum
    AND pga.atttypid = pgt.oid
ORDER BY
    pgc.relname, pgs.staattnum

Note: All of the output from the steps above can be redirected into a SQL file.