This article discusses gpsd execution and internals.
gpsd is a GPDB utility used to collect object metadata and statistics from a source database and import them into a target database. This essentially produces a replica of the source databases system catalog and statistics which are used in query planning.
Please note that the gpsd will collect pg_statistic
entries for the tables involved in the query. These entries may have sampling values of user data. It is therefore recommended to sanitize the output gpsd file for these entries and remove them if user data is not supposed to be shared.
gpsd is included in the Greenplum Database (GPDB) distribution under the $GPHOME/bin
directory.
Generally, the source database is a customers' system and the target database is an internal lab system.
The target database can be used to reproduce issues when the optimizer or planner:
Only metadata and statistics are collected. Data stored in the tables is ignored. This is to prevent too much space being consumed and customers generally do not want to share data that is confidential.
Execute the following command on the source system. This is where the statistics will be extracted from:
gpsd <dbname> > <gpsd_dump_file> Example: gpsd customer_prod1 > customer_prod1_20150808.sql
gzipp
the SQL file and transfer it to the target database.
To load the data into the target database, execute the following command:
psql <dbname> -f <gpsd_dump_file> Example: creatdb repro_db psql repro_db -f customer_prod1_20150808.sql
Now the target database contains the same schema and statistics. It will produce the same query plan or encounter the same crash as the source database when the specific query is executed.
gpsd also supports skipping the schema and only collecting statistics:
gpsd -s <dbname> > <gpsd_dump_file> Example: gpsd -s customer_prod1 > customer_prod1_20150808_updated.sql
This is useful the schema has already been collected from the source database and only the updated statistics are needed.
The following steps are performed internally when gpsd is executed:
1. Global objects are dumped.
if orca: pg_dumpall -h -p -U -l -g --no-gp-syntax else: pg_dumpall -h -p -U -l -g --gp-syntax
2. The schema is dumped.
pg_dump -h -p -U -s -x --gp-syntax -O
3. Tuple counts from pg_class
are dumped.
SELECT pgc.relname, pgn.nspname, pgc.relpages, pgc.reltuples FROM pg_class pgc, pg_namespace pgn WHERE pgc.relnamespace = pgn.oid AND pgn.nspname NOT IN ('pg_toast', 'pg_bitmapindex', 'pg_temp_1', 'pg_catalog', 'information_schema')
4. The statistics from pg_statistic
are dumped.
SELECT pgc.relname, pgn.nspname, pga.attname, pgt.typname, pgs.* FROM pg_class pgc, pg_statistic pgs, pg_namespace pgn, pg_attribute pga, pg_type pgt WHERE pgc.relnamespace = pgn.oid and pgn.nspname NOT IN ('pg_toast', 'pg_bitmapindex', 'pg_temp_1', 'pg_catalog', 'information_schema') AND pgc.oid = pgs.starelid AND pga.attrelid = pgc.oid AND pga.attnum = pgs.staattnum AND pga.atttypid = pgt.oid ORDER BY pgc.relname, pgs.staattnum
Note: All of the output from the steps above can be redirected into a SQL file.