Troubleshooting a PXF or GPHDFS external query in Greenplum


Article ID: 296286


Products

VMware Tanzu Greenplum

Issue/Introduction

This article discusses how to troubleshoot issues involving Greenplum external queries to your remote Hadoop clusters. It covers both the PXF Hadoop connectors and the Greenplum Hadoop Distributed File System (GPHDFS) protocol.


These troubleshooting steps should be helpful in scenarios where:

  • Similar external queries were previously working.

  • You have recently enabled PXF or GPHDFS for Greenplum using the documented procedures and still encounter issues.

First, every Greenplum Segment host must be able to communicate with ALL Hadoop HDFS nodes.

Note: It may be helpful to spend 5 minutes reviewing the NameNodes and DataNodes section of the Hadoop HDFS Architecture Overview.
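
If connectivity is in doubt, a quick port check from each segment host can rule out network or firewall problems before you look at PXF or GPHDFS themselves. The following is a minimal sketch only: hdp1.hdp.local is the NameNode used in the examples later in this article, the DataNode host names are hypothetical, and the ports shown are Hadoop 2.x defaults (your distribution may differ, e.g. 9866 for the DataNode on Hadoop 3.x).

# Substitute your own NameNode, DataNode hosts, and ports
NAMENODE=hdp1.hdp.local
DATANODES="hdp2.hdp.local hdp3.hdp.local"

nc -vz "$NAMENODE" 8020              # NameNode RPC port
for dn in $DATANODES; do
    nc -vz "$dn" 50010               # DataNode data-transfer port
done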

 

Environment

Product Version: 5.3

Resolution

Checklist:

PXF: Troubleshooting PXF Hadoop connectors

Let's say you or your end-users have the following PXF external tables:

  • Table 1 PXF: HDFS External Readable Table

CREATE EXTERNAL TABLE pxf_hdfs_tbl_1col (
    id text
) LOCATION (
    'pxf://user/gpadmin/gpdb5_data/pxf_hdfs_1col.txt?PROFILE=HdfsTextSimple'
) ON ALL
FORMAT 'text' (delimiter E'\t' null E'\\N' escape E'\\')
OPTIONS ( )
ENCODING 'UTF8';

 

  • Table 2 PXF: HDFS External Writable Table
CREATE WRITABLE EXTERNAL TABLE pxf_hdfs_tbl_1col_wr (
    id text
) LOCATION (
    'pxf://user/gpadmin/gpdb5_data/pxf_hdfs_tbl_1col_wr?PROFILE=HdfsTextSimple'
)
FORMAT 'text' (delimiter E' ' null E'\\N' escape E'\\')
OPTIONS (
 )
ENCODING 'UTF8' DISTRIBUTED RANDOMLY;

 

  • Table 3 PXF: HBase External Table
CREATE EXTERNAL TABLE pxf_hbase_tbl_orderinfo (
    "product:name" character varying,
    "shipping_info:zipcode" integer
) LOCATION (
    'pxf://order_info?PROFILE=HBase'
) ON ALL
FORMAT 'custom' (formatter = 'pxfwritable_import')
OPTIONS (
 )
ENCODING 'UTF8';
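
For reference, these external tables are exercised from the Greenplum master with ordinary SQL. A minimal sketch (the database name gpadmin is an assumption; use your own database):

psql -d gpadmin -c "SELECT count(*) FROM pxf_hdfs_tbl_1col;"
psql -d gpadmin -c "INSERT INTO pxf_hdfs_tbl_1col_wr SELECT id FROM pxf_hdfs_tbl_1col;"
psql -d gpadmin -c 'SELECT "product:name", "shipping_info:zipcode" FROM pxf_hbase_tbl_orderinfo;'

The steps below validate the Hadoop side of these queries from the segment hosts, outside of Greenplum.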

 

From every segment host running PXF, you should be able to:

  • Quickly install the Hadoop clients to validate that the host's Hadoop access is set up correctly, outside of GPDB or PXF.

  • For example, on a RHEL 7 or CentOS 7 host, you can install them as follows:
cd /etc/yum.repos.d 
wget http://public-repo-1.hortonworks.com/HDP/centos7/2.x/updates/2.6.2.0/hdp.repo
yum install hadoop-client
yum install hbase

Then run the basic client utilities using the $PXF_CONF configuration. 

For example:

1. List the remote HDFS LOCATION file for Table 1:

$ hdfs --config $PXF_CONF/servers/default dfs -ls /user/gpadmin/gpdb5_data/pxf_hdfs_1col.txt 
-rw-r--r--   3 gpadmin gpadmin         21 2019-04-27 21:16 /user/gpadmin/gpdb5_data/pxf_hdfs_1col.txt
$ 

 

2. Display (-cat) the contents of the remote HDFS LOCATION file for Table 1:

$ hdfs --config $PXF_CONF/servers/default dfs -cat /user/gpadmin/gpdb5_data/pxf_hdfs_1col.txt
1
2
3
4
5
6
7
8
9
10
$

 

3. Create the directory referenced by the remote HDFS LOCATION for Writable Table 2:

$ hdfs --config $PXF_CONF/servers/default dfs -mkdir /user/gpadmin/gpdb5_data/pxf_hdfs_tbl_1col_wr

 

4. As an alternative to the previous step, you can also list (-ls) or display (-cat) the contents of the remote HDFS LOCATION directory for Writable Table 2, once data has been written to it:

$ hdfs --config $PXF_CONF/servers/default dfs -ls /user/gpadmin/gpdb5_data/pxf_hdfs_tbl_1col_wr
Found 4 items
-rw-r--r--   3 gpadmin gpadmin          4 2019-05-01 22:33 /user/gpadmin/gpdb5_data/pxf_hdfs_tbl_1col_wr/1556750803-0000002994_0
-rw-r--r--   3 gpadmin gpadmin          6 2019-05-01 22:33 /user/gpadmin/gpdb5_data/pxf_hdfs_tbl_1col_wr/1556750803-0000002994_1
-rw-r--r--   3 gpadmin gpadmin          7 2019-05-01 22:33 /user/gpadmin/gpdb5_data/pxf_hdfs_tbl_1col_wr/1556750803-0000002994_2
-rw-r--r--   3 gpadmin gpadmin          4 2019-05-01 22:33 /user/gpadmin/gpdb5_data/pxf_hdfs_tbl_1col_wr/1556750803-0000002994_3
$

 

$ hdfs --config $PXF_CONF/servers/default dfs -cat /user/gpadmin/gpdb5_data/pxf_hdfs_tbl_1col_wr/1556750803-0000002994_0
4
8
$ 
Note: For more information on the hdfs CLI, refer to the HDFS Commands reference.


5. Check the HBase table referenced in the LOCATION URI for Table 3 by connecting through the HBase client shell from the segment host. If successful, you will:

  • Receive the following HBase CLI prompt:
hbase --config $PXF_CONF/servers/default shell
...
hbase(main):001:0>
  • Be able to list the HBase table in the LOCATION URI:
hbase(main):002:0> list "order_info"
TABLE
order_info
1 row(s) in 0.0120 seconds

=> ["order_info"]
hbase(main):003:0>
  • Be able to run an HBase table scan:
hbase(main):006:0> scan "order_info"
ROW                                                                     COLUMN+CELL
 1                                                                      column=product:location, timestamp=1533511907001, value=out of stock
 1                                                                      column=product:name, timestamp=1533511906943, value=tennis racquet
 1                                                                      column=shipping_info:state, timestamp=1533511907058, value=CA
 1                                                                      column=shipping_info:zipcode, timestamp=1533511907127, value=12345
 2                                                                      column=product:location, timestamp=1533511907185, value=on floor
 2                                                                      column=product:name, timestamp=1533511907153, value=soccer ball
 2                                                                      column=shipping_info:state, timestamp=1533511907216, value=CO
 2                                                                      column=shipping_info:zipcode, timestamp=1533511907239, value=56789
 3                                                                      column=product:location, timestamp=1533511907296, value=warehouse
 3                                                                      column=product:name, timestamp=1533511907266, value=snorkel set
 3                                                                      column=shipping_info:state, timestamp=1533511907315, value=OH
 3                                                                      column=shipping_info:zipcode, timestamp=1533511907371, value=34567
3 row(s) in 0.0400 seconds

hbase(main):007:0>

Note: For more information on the HBase shell, refer to the Apache HBase Shell.  
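
If the hbase shell cannot connect at all, confirm that the PXF server configuration directory used by the commands above ($PXF_CONF/servers/default) contains your cluster's client configuration files, including an hbase-site.xml that points at the correct ZooKeeper quorum. A minimal sketch:

ls -l $PXF_CONF/servers/default/
grep -A 1 'hbase.zookeeper.quorum' $PXF_CONF/servers/default/hbase-site.xml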

 


GPHDFS: Troubleshooting external tables using the GPHDFS protocol

Let's say you or your end-users have the following GPHDFS external tables:

  • Table 1 GPHDFS: External Readable Table

CREATE EXTERNAL TABLE hdfs_ext_r (
    id integer
) LOCATION (
    'gphdfs://hdp1.hdp.local:/user/gpadmin/gphdfs_test/gphdfs_test.txt'
) FORMAT 'text' (delimiter E',' null E'\\N' escape E'\\')
ENCODING 'UTF8';

 

  • Table 2 GPHDFS: External Writable Table
CREATE WRITABLE EXTERNAL TABLE gphdfs_ext_wr (
     id text
) LOCATION (
    'gphdfs://hdp1.hdp.local:/user/gpadmin/gphdfs_test/gphdfs_ext_wr_dir?compress=true'
) FORMAT 'custom' (formatter='gphdfs_export');


From each segment host for GPHDFS, you should be able to:

1. List the remote HDFS uniform resource identifier (URI) of GPHDFS Table 1:

$ hdfs dfs -ls hdfs://hdp1.hdp.local:/user/gpadmin/gphdfs_test/gphdfs_test.txt
-rw-r--r--   2 gpadmin gpadmin         21 2019-04-29 18:12 hdfs://hdp1.hdp.local:/user/gpadmin/gphdfs_test/gphdfs_test.txt
$

 

2. Display (-cat) the contents of the remote HDFS URI of GPHDFS Table 1:

$ hdfs dfs -cat hdfs://hdp1.hdp.local:/user/gpadmin/gphdfs_test/gphdfs_test.txt
1
2
3
4
5
6
7
8
9
10
$

 

3. List the remote HDFS Directory URI of Writable GPHDFS Table 2:

$ hdfs dfs -ls hdfs://hdp1.hdp.local:/user/gpadmin/gphdfs_test/gphdfs_ext_wr_dir
Found 5 items
-rw-r--r--   2 gpadmin gpadmin        246 2019-05-02 17:17 hdfs://hdp1.hdp.local:/user/gpadmin/gphdfs_test/gphdfs_ext_wr_dir/0_1556573987-0000000137
-rw-r--r--   2 gpadmin gpadmin        234 2019-05-02 17:28 hdfs://hdp1.hdp.local:/user/gpadmin/gphdfs_test/gphdfs_ext_wr_dir/0_1556573987-0000000138
-rw-r--r--   2 gpadmin gpadmin        251 2019-05-02 17:17 hdfs://hdp1.hdp.local:/user/gpadmin/gphdfs_test/gphdfs_ext_wr_dir/1_1556573987-0000000137
-rw-r--r--   2 gpadmin gpadmin        257 2019-05-02 17:28 hdfs://hdp1.hdp.local:/user/gpadmin/gphdfs_test/gphdfs_ext_wr_dir/1_1556573987-0000000138
-rw-r--r--   2 gpadmin gpadmin        228 2019-05-02 17:28 hdfs://hdp1.hdp.local:/user/gpadmin/gphdfs_test/gphdfs_ext_wr_dir/2_1556573987-0000000138
$

 

If the previous hdfs client tests look good, then HDFS itself is reachable and you likely have a GPHDFS host environment setup issue instead.

Note: For more information on configuring GPHDFS and the GphdfsTest.sh utility it provides, refer to the following KB article: Understanding-gphdfs-Configurations


As mentioned in the article above, improper or unset CLASSPATH and JAVA_HOME variables are the most common cause of GPHDFS issues. Make sure each segment host's GPHDFS environment is set up correctly for your Hadoop cluster by using the GphdfsTest.sh tool:

  • Download the GphdfsTest.sh tool from the referenced article to the segment host(s).

  • ssh gpadmin@<segment host>
  • source /usr/local/greenplum-db/greenplum_path.sh
  • Make sure you set the following variables correctly for your distro:
JAVA_HOME
HADOOP_HOME

Example for a Hortonworks HDP distro:
export JAVA_HOME=/usr/jdk64/jdk1.8.0_112
export HADOOP_HOME=/usr/hdp/current/hadoop-client
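
Before running GphdfsTest.sh, you can quickly confirm that these variables resolve on every segment host. A minimal sketch over ssh (the host names sdw1 and sdw2 are hypothetical, and it assumes JAVA_HOME and HADOOP_HOME are exported in gpadmin's shell initialization on each host):

for host in sdw1 sdw2; do
    echo "=== $host ==="
    ssh gpadmin@$host 'echo "JAVA_HOME=$JAVA_HOME  HADOOP_HOME=$HADOOP_HOME"; $JAVA_HOME/bin/java -version'
done

Then run GphdfsTest.sh against Table 1's gphdfs URI; successful output looks like the following: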

 

$ ./GphdfsTest.sh hdp2 $HADOOP_HOME TEXT gphdfs://hdp1.hdp.local:/user/gpadmin/gphdfs_test/gphdfs_test.txt
19/05/01 15:19:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/05/01 15:19:49 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
19/05/01 15:19:49 INFO input.FileInputFormat: Total input paths to process : 1
1
2
3
4
5
6
7
8
9
10
$

 

4. If you are still hitting errors such as NoClassDefFoundError or ClassNotFoundException, then you must update the $GPHOME/lib/hadoop/hadoop_env.sh file on the segment host(s).

 

Tip: Try using Hadoop's mapred utility (provided by your Hadoop distribution) to append the Hadoop classes to those in the $GPHOME/lib/hadoop/hadoop_env.sh file on your segment host(s). For example, add the following at the end of your $GPHOME/lib/hadoop/hadoop_env.sh file on each segment host:

# Append the classpath reported by Hadoop's 'mapred classpath' command, if the utility is present
if [ -x /usr/bin/mapred ]; then
   MAPRED_CLASSPATH=$(mapred classpath)
   CLASSPATH=${CLASSPATH}:${MAPRED_CLASSPATH}
fi
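
If a specific class is still reported missing after updating the classpath, you can search the Hadoop jars on the segment host to confirm which jar (if any) provides it. A minimal sketch (the class below is only an illustration; substitute the class named in your error):

CLASS='org/apache/hadoop/conf/Configuration.class'
find "$HADOOP_HOME" -name '*.jar' 2>/dev/null | while read -r jar; do
    unzip -l "$jar" 2>/dev/null | grep -q "$CLASS" && echo "$CLASS found in $jar"
done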



Attachments

GphdfsTest