This article discusses how to troubleshoot issues with Greenplum external queries against remote Hadoop clusters using the Platform Extension Framework (PXF).
These troubleshooting steps should be helpful in scenarios where PXF external queries against HDFS or HBase fail or return connectivity errors.
First, every Greenplum segment host must be able to communicate with all Hadoop HDFS NameNodes and DataNodes.
Note: It may be helpful to spend 5 minutes reviewing the NameNodes and DataNodes section of the Hadoop HDFS Architecture Overview.
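One quick way to verify basic reachability is a TCP port check from each segment host. The sketch below is an assumption-laden example, not part of the PXF tooling: the hostnames are placeholders, and the ports assume Hadoop 2.x defaults (8020 for the NameNode RPC port, 50010 for DataNodes); substitute your cluster's actual hosts and ports.

# Placeholder host:port pairs -- replace with your NameNode and DataNode hosts.
# Ports assume Hadoop 2.x defaults: 8020 (NameNode RPC), 50010 (DataNode).
for hostport in namenode1:8020 datanode1:50010 datanode2:50010 datanode3:50010; do
  host=${hostport%%:*}; port=${hostport##*:}
  # nc -z: probe the port without sending data; -w 3: 3-second timeout
  nc -z -w 3 "$host" "$port" && echo "OK   $hostport" || echo "FAIL $hostport"
done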
Checklist:
Let's say you or your end-users have the following PXF external tables:
Table 1: PXF HDFS External Readable Table
CREATE EXTERNAL TABLE pxf_hdfs_tbl_1col ( id text )
LOCATION ( 'pxf://user/gpadmin/gpdb5_data/pxf_hdfs_1col.txt?PROFILE=HdfsTextSimple' ) ON ALL
FORMAT 'text' (delimiter E' ' null E'\\N' escape E'\\')
OPTIONS ( ) ENCODING 'UTF8';
Table 2: PXF HDFS External Writable Table

CREATE WRITABLE EXTERNAL TABLE pxf_hdfs_tbl_1col_wr ( id text )
LOCATION ( 'pxf://user/gpadmin/gpdb5_data/pxf_hdfs_tbl_1col_wr?PROFILE=HdfsTextSimple' )
FORMAT 'text' (delimiter E' ' null E'\\N' escape E'\\')
OPTIONS ( ) ENCODING 'UTF8' DISTRIBUTED RANDOMLY;
Table 3: PXF HBase External Readable Table

CREATE EXTERNAL TABLE pxf_hbase_tbl_orderinfo ( "product:name" character varying, "shipping_info:zipcode" integer )
LOCATION ( 'pxf://order_info?PROFILE=HBase' ) ON ALL
FORMAT 'custom' (formatter = 'pxfwritable_import')
OPTIONS ( ) ENCODING 'UTF8';
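Given these definitions, end-users exercise the tables with ordinary SQL from Greenplum. A minimal sketch for readable Table 1 (writes to Table 2 and reads from Table 3 are exercised in the steps below):

-- Read the remote HDFS file through PXF; should return the file's rows
SELECT * FROM pxf_hdfs_tbl_1col;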
From every segment host running PXF, you should be able to run the basic Hadoop and HBase client utilities against the remote cluster. To install these clients on a RedHat 7 or CentOS 7 host, run the following:

cd /etc/yum.repos.d
wget http://public-repo-1.hortonworks.com/HDP/centos7/2.x/updates/2.6.2.0/hdp.repo
yum install hadoop-client
yum install hbase
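Once the packages install, a quick sanity check is to confirm the client binaries are on the PATH and report their versions:

# Confirm the Hadoop and HBase clients are installed and callable
hadoop version
hbase version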
Then run the basic client utilities using the $PXF_CONF configuration.
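The --config flag used in the examples below points the client utilities at the same Hadoop configuration files PXF itself uses. The exact files vary by deployment, but a typical $PXF_CONF/servers/default directory holds the cluster's client configs, along the lines of:

$ ls $PXF_CONF/servers/default
core-site.xml  hbase-site.xml  hdfs-site.xml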
For example:
1. List the remote HDFS LOCATION file for Table 1:
$ hdfs --config $PXF_CONF/servers/default dfs -ls /user/gpadmin/gpdb5_data/pxf_hdfs_1col.txt
-rw-r--r--   3 gpadmin gpadmin   21 2019-04-27 21:16 /user/gpadmin/gpdb5_data/pxf_hdfs_1col.txt
$
2. Run the '-cat' command against the remote HDFS LOCATION file for Table 1:
$ hdfs --config $PXF_CONF/servers/default dfs -cat /user/gpadmin/gpdb5_data/pxf_hdfs_1col.txt
1
2
3
4
5
6
7
8
9
10
$
3. Create the directory referenced by the remote HDFS LOCATION for Writable Table 2:
$ hdfs --config $PXF_CONF/servers/default dfs -mkdir /user/gpadmin/gpdb5_data/pxf_hdfs_tbl_1col_wr
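To additionally confirm write permission on that directory (a hedged check; /tmp/pxf_write_test.txt is a hypothetical scratch file), put a small file and remove it again:

$ echo test > /tmp/pxf_write_test.txt
$ hdfs --config $PXF_CONF/servers/default dfs -put /tmp/pxf_write_test.txt /user/gpadmin/gpdb5_data/pxf_hdfs_tbl_1col_wr/
$ hdfs --config $PXF_CONF/servers/default dfs -rm /user/gpadmin/gpdb5_data/pxf_hdfs_tbl_1col_wr/pxf_write_test.txt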
4. As an alternative to step 3, you can also "list" or "cat" the contents of the remote HDFS LOCATION directory for Writable Table 2:
$ hdfs --config $PXF_CONF/servers/default dfs -ls /user/gpadmin/gpdb5_data/pxf_hdfs_tbl_1col_wr
Found 4 items
-rw-r--r--   3 gpadmin gpadmin   4 2019-05-01 22:33 /user/gpadmin/gpdb5_data/pxf_hdfs_tbl_1col_wr/1556750803-0000002994_0
-rw-r--r--   3 gpadmin gpadmin   6 2019-05-01 22:33 /user/gpadmin/gpdb5_data/pxf_hdfs_tbl_1col_wr/1556750803-0000002994_1
-rw-r--r--   3 gpadmin gpadmin   7 2019-05-01 22:33 /user/gpadmin/gpdb5_data/pxf_hdfs_tbl_1col_wr/1556750803-0000002994_2
-rw-r--r--   3 gpadmin gpadmin   4 2019-05-01 22:33 /user/gpadmin/gpdb5_data/pxf_hdfs_tbl_1col_wr/1556750803-0000002994_3
$
$ hdfs --config $PXF_CONF/servers/default dfs -cat /user/gpadmin/gpdb5_data/pxf_hdfs_tbl_1col_wr/1556750803-0000002994_0
4
8
$
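For context, files like the ones listed above appear because each Greenplum segment writes its own file under the LOCATION directory when rows are inserted into Writable Table 2. A minimal sketch of such an insert (the values are illustrative, not from this article):

-- Each segment writes its slice of the inserted rows to its own HDFS file
INSERT INTO pxf_hdfs_tbl_1col_wr
SELECT i::text FROM generate_series(1, 10) AS i;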
5. Check the HBase table referenced in the LOCATION URI for Table 3 by connecting through the HBase client shell from the segment host. If successful, you will be able to start the shell, list the table, and scan its contents:
$ hbase --config $PXF_CONF/servers/default shell
...
hbase(main):001:0>
hbase(main):002:0> list "order_info"
TABLE
order_info
1 row(s) in 0.0120 seconds

=> ["order_info"]
hbase(main):003:0>
hbase(main):006:0> scan "order_info"
ROW    COLUMN+CELL
 1     column=product:location, timestamp=1533511907001, value=out of stock
 1     column=product:name, timestamp=1533511906943, value=tennis racquet
 1     column=shipping_info:state, timestamp=1533511907058, value=CA
 1     column=shipping_info:zipcode, timestamp=1533511907127, value=12345
 2     column=product:location, timestamp=1533511907185, value=on floor
 2     column=product:name, timestamp=1533511907153, value=soccer ball
 2     column=shipping_info:state, timestamp=1533511907216, value=CO
 2     column=shipping_info:zipcode, timestamp=1533511907239, value=56789
 3     column=product:location, timestamp=1533511907296, value=warehouse
 3     column=product:name, timestamp=1533511907266, value=snorkel set
 3     column=shipping_info:state, timestamp=1533511907315, value=OH
 3     column=shipping_info:zipcode, timestamp=1533511907371, value=34567
3 row(s) in 0.0400 seconds
hbase(main):007:0>
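Once the shell-level scan succeeds from the segment host, the same data should be queryable through Table 3 from Greenplum. A minimal sketch of such a query, using the columns mapped in the external table definition above:

-- Reads the HBase column qualifiers mapped in pxf_hbase_tbl_orderinfo
SELECT "product:name", "shipping_info:zipcode"
FROM pxf_hbase_tbl_orderinfo;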
Note: For more information on the HBase shell, refer to the Apache HBase Shell.