1. Prepare parquet files on your HDFS filesystem.
$ hdfs dfs -ls/tmp/sample Found 7 items -rwxr--r-- 3 gpadmin hdfs 235841 2016-05-11 00:52 /tmp/sample/part-m-00000.gz.parquet -rwxr--r-- 3 gpadmin hdfs 235841 2016-05-11 00:52 /tmp/sample/part-m-00001.gz.parquet -rwxr--r-- 3 gpadmin hdfs 235841 2016-05-11 00:52 /tmp/sample/part-m-00002.gz.parquet -rwxr--r-- 3 gpadmin hdfs 235841 2016-05-11 00:52 /tmp/sample/part-m-00003.gz.parquet -rwxr--r-- 3 gpadmin hdfs 235841 2016-05-11 00:52 /tmp/sample/part-m-00004.gz.parquet -rwxr--r-- 3 gpadmin hdfs 235841 2016-05-11 00:52 /tmp/sample/part-m-00005.gz.parquet -rwxr--r-- 3 gpadmin hdfs 235841 2016-05-11 00:52 /tmp/sample/part-m-00006.gz.parquet
2. Using the Hive command line (CLI), create a Hive external table pointing to the parquet files. You first need to know table structure of your parquet file to create an external table.
hive> CREATE EXTERNAL TABLE my_hive_parquet ( MONTH0 string, CUSTOMER_ID string, SERVICE string, SERVICE_NUMBER string, TYPE_OF_SHAREPLUS__MOBILE string, SHAREPLUS_PARENT_NUMBER__MOBILE string, STUDENT_DISCOUNT_COMPONENT__MOBILE string, CIS_STATUS___SUB__MOBILE string, HANDSET__MOBILE string, HUBCLUB_MEMBER string, GENDER string, RACE string, AGE_BAND string, RESIDENT_STATUS string, ADDRESS_ID___HH string, POSTAL_CODE___HH string, SMART_CARD_NUMBER string ) STORED AS PARQUET LOCATION '/tmp/sample' ;
3. Create a Hawq external table pointing to the Hive table you just created using PXF.
gpadmin=# CREATE EXTERNAL TABLE my_hdb_parquet( MONTH0 text, CUSTOMER_ID text, SERVICE text, SERVICE_NUMBER text, TYPE_OF_SHAREPLUS__MOBILE text, SHAREPLUS_PARENT_NUMBER__MOBILE text, STUDENT_DISCOUNT_COMPONENT__MOBILE text, CIS_STATUS___SUB__MOBILE text, HANDSET__MOBILE text, HUBCLUB_MEMBER text, GENDER text, RACE text, AGE_BAND text, RESIDENT_STATUS text, ADDRESS_ID___HH text, POSTAL_CODE___HH text, SMART_CARD_NUMBER text ) LOCATION ('pxf://hdm1:51200/my_hive_parquet?PROFILE=Hive') FORMAT 'custom' (formatter='pxfwritable_import');
4. Read the data through the external table from HDB.
gpadmin=# select count(*) from my_hdb_parquet; count -------- 700000 (1 row) Time: 6536.905 ms gpadmin=# select * from my_hdb_parquet limit 5;
Note: Please make sure pxf-service is running on the NameNodes and all HDB segment nodes before following this example.
Download sample_parquet.tgz (60 KB, there is no preview, click Download from the top when you open the file).