How to read parquet files on HDFS from HDB using PXF
search cancel

How to read parquet files on HDFS from HDB using PXF

book

Article ID: 294680

calendar_today

Updated On:

Products

Services Suite

Issue/Introduction

This article discusses an example of using PXF to read parquet files on a Hadoop cluster (HDFS).

Environment


Resolution

1. Prepare parquet files on your HDFS filesystem.

$ hdfs dfs -ls/tmp/sample
Found 7 items
-rwxr--r--   3 gpadmin hdfs     235841 2016-05-11 00:52 /tmp/sample/part-m-00000.gz.parquet
-rwxr--r--   3 gpadmin hdfs     235841 2016-05-11 00:52 /tmp/sample/part-m-00001.gz.parquet
-rwxr--r--   3 gpadmin hdfs     235841 2016-05-11 00:52 /tmp/sample/part-m-00002.gz.parquet
-rwxr--r--   3 gpadmin hdfs     235841 2016-05-11 00:52 /tmp/sample/part-m-00003.gz.parquet
-rwxr--r--   3 gpadmin hdfs     235841 2016-05-11 00:52 /tmp/sample/part-m-00004.gz.parquet
-rwxr--r--   3 gpadmin hdfs     235841 2016-05-11 00:52 /tmp/sample/part-m-00005.gz.parquet
-rwxr--r--   3 gpadmin hdfs     235841 2016-05-11 00:52 /tmp/sample/part-m-00006.gz.parquet


2. Using the Hive command line (CLI), create a Hive external table pointing to the parquet files. You first need to know table structure of your parquet file to create an external table.

hive> CREATE EXTERNAL TABLE my_hive_parquet
(
    MONTH0 string,
    CUSTOMER_ID string,
    SERVICE string,
    SERVICE_NUMBER string,
    TYPE_OF_SHAREPLUS__MOBILE string,
    SHAREPLUS_PARENT_NUMBER__MOBILE string,
    STUDENT_DISCOUNT_COMPONENT__MOBILE string,
    CIS_STATUS___SUB__MOBILE string,
    HANDSET__MOBILE string,
    HUBCLUB_MEMBER string,
    GENDER string,
    RACE string,
    AGE_BAND string,
    RESIDENT_STATUS string,
    ADDRESS_ID___HH string,
  POSTAL_CODE___HH string,
    SMART_CARD_NUMBER string
 )
 STORED AS PARQUET LOCATION '/tmp/sample' ; 


3. Create a Hawq external table pointing to the Hive table you just created using PXF.

gpadmin=# CREATE EXTERNAL TABLE my_hdb_parquet(
    MONTH0 text,
    CUSTOMER_ID text,
    SERVICE text,
    SERVICE_NUMBER text,
    TYPE_OF_SHAREPLUS__MOBILE text,
    SHAREPLUS_PARENT_NUMBER__MOBILE text,
    STUDENT_DISCOUNT_COMPONENT__MOBILE text,
    CIS_STATUS___SUB__MOBILE text,
    HANDSET__MOBILE text,
    HUBCLUB_MEMBER text,
    GENDER text,
    RACE text,
    AGE_BAND text,
    RESIDENT_STATUS text,
    ADDRESS_ID___HH text,
    POSTAL_CODE___HH text,
    SMART_CARD_NUMBER text
 )
 LOCATION ('pxf://hdm1:51200/my_hive_parquet?PROFILE=Hive')
 FORMAT 'custom' (formatter='pxfwritable_import');


4. Read the data through the external table from HDB.

gpadmin=# select count(*) from my_hdb_parquet;
  count
 --------
  700000
 (1 row)
  
 Time: 6536.905 ms
  
 gpadmin=# select * from my_hdb_parquet limit 5;


Note: Please make sure pxf-service is running on the NameNodes and all HDB segment nodes before following this example. 

Download sample_parquet.tgz (60 KB, there is no preview, click Download from the top when you open the file).