Data skew in randomly distributed Pivotal HD tables

Products

Services Suite

Issue/Introduction

This article explains data skew in randomly distributed Pivotal HD tables.

Note: Disable the query optimizer before inserting data into a randomly distributed table.

Inserting or redistributing data on a randomly distributed table with the Pivotal Query Optimizer (ORCA) enabled results in data skew.

[gpadmin@hdw1 ~]$ psql
psql (8.2.15)
Type "help" for help.

gpadmin=# show optimizer;
 optimizer
-----------
 on
(1 row)

gpadmin=# create table test_columns (id int,  testdata char(39)) with (appendonly=true, orientation=parquet) distributed randomly;
CREATE TABLE
gpadmin=# create external table ext_test_columns (like test_columns) LOCATION ('pxf://localhost:8020/tmp/hawq_testdata/test_data.csv?Profile=HdfsTextSimple') FORMAT 'CSV';
CREATE EXTERNAL TABLE
gpadmin=# explain analyze insert into test_columns SELECT * from ext_test_columns;
                                                                     QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Insert  (cost=0.00..21281.23 rows=166667 width=16)
   Rows out:  Avg 2500000.0 rows x 2 workers.  Max 3222116 rows (seg0) with 166 ms to first row, 13583 ms to end, start offset by 249 ms.
   Executor memory:  1K bytes avg, 1K bytes max (seg0).
   ->  Result  (cost=0.00..447.89 rows=166667 width=32)
         Rows out:  Avg 2500000.0 rows x 2 workers.  Max 3222116 rows (seg0) with 159 ms to first row, 8947 ms to end, start offset by 249 ms.
         ->  External Scan on ext_test_columns  (cost=0.00..437.60 rows=166667 width=16)
               Rows out:  Avg 2500000.0 rows x 2 workers.  Max 3222116 rows (seg0) with 159 ms to first row, 7440 ms to end, start offset by 249 ms.
 Slice statistics:
   (slice0)    Executor memory: 3251K bytes avg x 6 workers, 9449K bytes max (seg0).
 Statement statistics:
   Memory used: 128000K bytes
 Total runtime: 14419.207 ms
(12 rows)

gpadmin=# select gp_segment_id, count(*) from test_columns group by 1;
 gp_segment_id |  count
---------------+---------
             0 | 3222116
             2 | 1777884
(2 rows)

gpadmin=# alter table test_columns set  WITH (reorganize=true) distributed randomly;
ALTER TABLE
gpadmin=# select gp_segment_id, count(*) from test_columns group by 1;
 gp_segment_id |  count
---------------+---------
             0 | 3222116
             2 | 1777884
(2 rows)

The query optimizer does not perform a redistribute motion when adding or redistributing data in randomly distributed HAWQ tables.

Resolution

If the data has already been inserted into the table, disable the optimizer and alter the table as shown below:

gpadmin=# set optimizer=off;
SET
gpadmin=# alter table test_columns set  WITH (reorganize=true) distributed randomly;
ALTER TABLE
gpadmin=# select gp_segment_id, count(*) from test_columns group by 1;
 gp_segment_id | count
---------------+--------