1. Prepare parquet files on your HDFS filesystem.
$ hdfs dfs -ls/tmp/sample Found 7 items -rwxr--r-- 3 gpadmin hdfs 235841 2016-05-11 00:52 /tmp/sample/part-m-00000.gz.parquet -rwxr--r-- 3 gpadmin hdfs 235841 2016-05-11 00:52 /tmp/sample/part-m-00001.gz.parquet -rwxr--r-- 3 gpadmin hdfs 235841 2016-05-11 00:52 /tmp/sample/part-m-00002.gz.parquet -rwxr--r-- 3 gpadmin hdfs 235841 2016-05-11 00:52 /tmp/sample/part-m-00003.gz.parquet -rwxr--r-- 3 gpadmin hdfs 235841 2016-05-11 00:52 /tmp/sample/part-m-00004.gz.parquet -rwxr--r-- 3 gpadmin hdfs 235841 2016-05-11 00:52 /tmp/sample/part-m-00005.gz.parquet -rwxr--r-- 3 gpadmin hdfs 235841 2016-05-11 00:52 /tmp/sample/part-m-00006.gz.parquet
2. Using the Hive command line (CLI), create a Hive external table pointing to the parquet files. You first need to know table structure of your parquet file to create an external table.
hive> CREATE EXTERNAL TABLE my_hive_parquet ( MONTH0 string, CUSTOMER_ID string, SERVICE string, SERVICE_NUMBER string, TYPE_OF_SHAREPLUS__MOBILE string, SHAREPLUS_PARENT_NUMBER__MOBILE string, STUDENT_DISCOUNT_COMPONENT__MOBILE string, CIS_STATUS___SUB__MOBILE string, HANDSET__MOBILE string, HUBCLUB_MEMBER string, GENDER string, RACE string, AGE_BAND string, RESIDENT_STATUS string, ADDRESS_ID___HH string, POSTAL_CODE___HH string, SMART_CARD_NUMBER string ) STORED AS PARQUET LOCATION '/tmp/sample' ;
3. Create a Hawq external table pointing to the Hive table you just created using PXF.
gpadmin=# CREATE EXTERNAL TABLE my_hdb_parquet(
MONTH0 text,
CUSTOMER_ID text,
SERVICE text,
SERVICE_NUMBER text,
TYPE_OF_SHAREPLUS__MOBILE text,
SHAREPLUS_PARENT_NUMBER__MOBILE text,
STUDENT_DISCOUNT_COMPONENT__MOBILE text,
CIS_STATUS___SUB__MOBILE text,
HANDSET__MOBILE text,
HUBCLUB_MEMBER text,
GENDER text,
RACE text,
AGE_BAND text,
RESIDENT_STATUS text,
ADDRESS_ID___HH text,
POSTAL_CODE___HH text,
SMART_CARD_NUMBER text
)
LOCATION ('pxf://hdm1:51200/my_hive_parquet?PROFILE=Hive')
FORMAT 'custom' (formatter='pxfwritable_import');
4. Read the data through the external table from HDB.
gpadmin=# select count(*) from my_hdb_parquet; count -------- 700000 (1 row) Time: 6536.905 ms gpadmin=# select * from my_hdb_parquet limit 5;
Note: Please make sure pxf-service is running on the NameNodes and all HDB segment nodes before following this example.
Download sample_parquet.tgz (60 KB, there is no preview, click Download from the top when you open the file).