This article discusses how to troubleshoot issues with Greenplum external queries against remote Hadoop clusters. It covers both the PXF Hadoop connectors and the Greenplum Hadoop Distributed File System (GPHDFS) protocol.
These troubleshooting steps should be helpful in scenarios where:
First, every Greenplum Segment host must be able to communicate with ALL Hadoop HDFS nodes.
Note: It may be helpful to spend 5 minutes reviewing the NameNodes and DataNodes section of the Hadoop HDFS Architecture Overview.
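As a quick connectivity sanity check, you can test the relevant HDFS ports from each segment host with a tool such as nc. This is only a sketch: the hostname below is the NameNode host used in the examples later in this article, and the ports shown are common defaults (8020 for the NameNode RPC port, 50010 or 9866 for the DataNode data-transfer port) that may differ in your Hadoop configuration.
$ nc -vz hdp1.hdp.local 8020
$ nc -vz <datanode host> 50010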
Checklist:
Let's say you or your end-users have the following PXF external tables:
Table 1 PXF: HDFS External Readable Table
CREATE EXTERNAL TABLE pxf_hdfs_tbl_1col ( id text )
LOCATION ( 'pxf://user/gpadmin/gpdb5_data/pxf_hdfs_1col.txt?PROFILE=HdfsTextSimple' ) ON ALL
FORMAT 'text' (delimiter E' ' null E'\\N' escape E'\\')
OPTIONS ( )
ENCODING 'UTF8';
Table 2 PXF: HDFS External Writable Table
CREATE WRITABLE EXTERNAL TABLE pxf_hdfs_tbl_1col_wr ( id text )
LOCATION ( 'pxf://user/gpadmin/gpdb5_data/pxf_hdfs_tbl_1col_wr?PROFILE=HdfsTextSimple' )
FORMAT 'text' (delimiter E' ' null E'\\N' escape E'\\')
OPTIONS ( )
ENCODING 'UTF8'
DISTRIBUTED RANDOMLY;
Table 3 PXF: HBase External Readable Table
CREATE EXTERNAL TABLE pxf_hbase_tbl_orderinfo ( "product:name" character varying, "shipping_info:zipcode" integer )
LOCATION ( 'pxf://order_info?PROFILE=HBase' ) ON ALL
FORMAT 'custom' (formatter = 'pxfwritable_import')
OPTIONS ( )
ENCODING 'UTF8';
From every segment host running PXF, you should be able to run the basic Hadoop and HBase client checks shown below.
First, install the Hadoop and HBase client utilities from your Hadoop distribution. On a RedHat 7 or CentOS 7 host, for example, you would run:
cd /etc/yum.repos.d
wget http://public-repo-1.hortonworks.com/HDP/centos7/2.x/updates/2.6.2.0/hdp.repo
yum install hadoop-client
yum install hbase
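You can then verify that the client packages installed correctly, for example:
$ hadoop version
$ hbase version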
Then run the basic client utilities using the $PXF_CONF configuration.
For example:
1. List the remote HDFS LOCATION file for Table 1:
$ hdfs --config $PXF_CONF/servers/default dfs -ls /user/gpadmin/gpdb5_data/pxf_hdfs_1col.txt
-rw-r--r--   3 gpadmin gpadmin         21 2019-04-27 21:16 /user/gpadmin/gpdb5_data/pxf_hdfs_1col.txt
$
2. Use the '-cat' command to display the contents of the remote HDFS LOCATION file for Table 1:
$ hdfs --config $PXF_CONF/servers/default dfs -cat /user/gpadmin/gpdb5_data/pxf_hdfs_1col.txt
1
2
3
4
5
6
7
8
9
10
$
3. Create the directory referenced by the remote HDFS LOCATION for Writable Table 2:
$ hdfs --config $PXF_CONF/servers/default dfs -mkdir /user/gpadmin/gpdb5_data/pxf_hdfs_tbl_1col_wr
4. As an alternative to the previous step, you can also "list" or "cat" the contents of the remote HDFS LOCATION directory for Writable Table 2:
$ hdfs --config $PXF_CONF/servers/default dfs -ls /user/gpadmin/gpdb5_data/pxf_hdfs_tbl_1col_wr
Found 4 items
-rw-r--r--   3 gpadmin gpadmin          4 2019-05-01 22:33 /user/gpadmin/gpdb5_data/pxf_hdfs_tbl_1col_wr/1556750803-0000002994_0
-rw-r--r--   3 gpadmin gpadmin          6 2019-05-01 22:33 /user/gpadmin/gpdb5_data/pxf_hdfs_tbl_1col_wr/1556750803-0000002994_1
-rw-r--r--   3 gpadmin gpadmin          7 2019-05-01 22:33 /user/gpadmin/gpdb5_data/pxf_hdfs_tbl_1col_wr/1556750803-0000002994_2
-rw-r--r--   3 gpadmin gpadmin          4 2019-05-01 22:33 /user/gpadmin/gpdb5_data/pxf_hdfs_tbl_1col_wr/1556750803-0000002994_3
$
$ hdfs --config $PXF_CONF/servers/default dfs -cat /user/gpadmin/gpdb5_data/pxf_hdfs_tbl_1col_wr/1556750803-0000002994_0
4
8
$
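If a writable-table query is the one failing, it can also be worth confirming that gpadmin can actually write to that directory. A minimal sketch of such a check (the local file name here is just a throwaway example):
$ echo "write test" > /tmp/pxf_write_check.txt
$ hdfs --config $PXF_CONF/servers/default dfs -put /tmp/pxf_write_check.txt /user/gpadmin/gpdb5_data/pxf_hdfs_tbl_1col_wr/
$ hdfs --config $PXF_CONF/servers/default dfs -rm /user/gpadmin/gpdb5_data/pxf_hdfs_tbl_1col_wr/pxf_write_check.txt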
5. Check the HBase table referenced in the LOCATION URI for Table 3 by connecting through the HBase client shell from the segment host. If successful, you will be able to connect, then list and scan the table:
$ hbase --config $PXF_CONF/servers/default shell
...
hbase(main):001:0>
hbase(main):002:0> list "order_info"
TABLE
order_info
1 row(s) in 0.0120 seconds
=> ["order_info"]
hbase(main):003:0>
hbase(main):006:0> scan "order_info"
ROW          COLUMN+CELL
 1           column=product:location, timestamp=1533511907001, value=out of stock
 1           column=product:name, timestamp=1533511906943, value=tennis racquet
 1           column=shipping_info:state, timestamp=1533511907058, value=CA
 1           column=shipping_info:zipcode, timestamp=1533511907127, value=12345
 2           column=product:location, timestamp=1533511907185, value=on floor
 2           column=product:name, timestamp=1533511907153, value=soccer ball
 2           column=shipping_info:state, timestamp=1533511907216, value=CO
 2           column=shipping_info:zipcode, timestamp=1533511907239, value=56789
 3           column=product:location, timestamp=1533511907296, value=warehouse
 3           column=product:name, timestamp=1533511907266, value=snorkel set
 3           column=shipping_info:state, timestamp=1533511907315, value=OH
 3           column=shipping_info:zipcode, timestamp=1533511907371, value=34567
3 row(s) in 0.0400 seconds
hbase(main):007:0>
Note: For more information on the HBase shell, refer to the Apache HBase Shell.
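Once these client-level checks pass on every segment host, re-run the failing query from Greenplum itself to confirm the end-to-end path. For example, from the master host (the database name here is just a placeholder):
$ psql -d <database> -c 'SELECT * FROM pxf_hbase_tbl_orderinfo;'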
Table 1 GPHDFS: External Readable Table
CREATE EXTERNAL TABLE hdfs_ext_r ( id integer )
LOCATION ( 'gphdfs://hdp1.hdp.local:/user/gpadmin/gphdfs_test/gphdfs_test.txt' )
FORMAT 'text' (delimiter E',' null E'\\N' escape E'\\')
ENCODING 'UTF8';
Table 2 GPHDFS: External Writable Table
CREATE WRITABLE EXTERNAL TABLE gphdfs_ext_wr ( id text )
LOCATION ( 'gphdfs://hdp1.hdp.local:/user/gpadmin/gphdfs_test/gphdfs_ext_wr_dir?compress=true' )
FORMAT 'custom' (formatter='gphdfs_export');
From each segment host for GPHDFS, you should be able to:
1. List the remote HDFS uniform resource identifier (URI) of GPHDFS Table 1:
$ hdfs dfs -ls hdfs://hdp1.hdp.local:/user/gpadmin/gphdfs_test/gphdfs_test.txt
-rw-r--r--   2 gpadmin gpadmin         21 2019-04-29 18:12 hdfs://hdp1.hdp.local:/user/gpadmin/gphdfs_test/gphdfs_test.txt
$
2. Use the '-cat' command to display the contents of the remote HDFS URI of GPHDFS Table 1:
$ hdfs dfs -cat hdfs://hdp1.hdp.local:/user/gpadmin/gphdfs_test/gphdfs_test.txt
1
2
3
4
5
6
7
8
9
10
$
3. List the remote HDFS Directory URI of Writable GPHDFS Table 2:
$ hdfs dfs -ls hdfs://hdp1.hdp.local:/user/gpadmin/gphdfs_test/gphdfs_ext_wr_dir
Found 5 items
-rw-r--r--   2 gpadmin gpadmin        246 2019-05-02 17:17 hdfs://hdp1.hdp.local:/user/gpadmin/gphdfs_test/gphdfs_ext_wr_dir/0_1556573987-0000000137
-rw-r--r--   2 gpadmin gpadmin        234 2019-05-02 17:28 hdfs://hdp1.hdp.local:/user/gpadmin/gphdfs_test/gphdfs_ext_wr_dir/0_1556573987-0000000138
-rw-r--r--   2 gpadmin gpadmin        251 2019-05-02 17:17 hdfs://hdp1.hdp.local:/user/gpadmin/gphdfs_test/gphdfs_ext_wr_dir/1_1556573987-0000000137
-rw-r--r--   2 gpadmin gpadmin        257 2019-05-02 17:28 hdfs://hdp1.hdp.local:/user/gpadmin/gphdfs_test/gphdfs_ext_wr_dir/1_1556573987-0000000138
-rw-r--r--   2 gpadmin gpadmin        228 2019-05-02 17:28 hdfs://hdp1.hdp.local:/user/gpadmin/gphdfs_test/gphdfs_ext_wr_dir/2_1556573987-0000000138
$
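If writes through Writable GPHDFS Table 2 are the ones failing, you can additionally confirm that gpadmin is allowed to create files in that directory. A quick sketch, using a throwaway file name:
$ hdfs dfs -touchz hdfs://hdp1.hdp.local:/user/gpadmin/gphdfs_test/gphdfs_ext_wr_dir/_gpadmin_write_check
$ hdfs dfs -rm hdfs://hdp1.hdp.local:/user/gpadmin/gphdfs_test/gphdfs_ext_wr_dir/_gpadmin_write_check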
If the previous HDFS client tests look good but GPHDFS external table queries still fail, then you likely have a GPHDFS host environment setup issue.
Note: For more information on configuring GPHDFS and the GphdfsTest.sh utility it provides, refer to the following KB article: Understanding-gphdfs-Configurations
As mentioned in the above article, improper or unset CLASSPATH and JAVA_HOME variables are the most common issues with GPHDFS. Make sure each segment host's GPHDFS environment is set up correctly for your Hadoop cluster by using the GphdfsTest.sh tool:
ssh gpadmin@<segment host>
source /usr/local/greenplum-db/greenplum_path.sh
Set the JAVA_HOME and HADOOP_HOME environment variables for your Hadoop distro. Example for a Hortonworks HDP distro:
export JAVA_HOME=/usr/jdk64/jdk1.8.0_112
export HADOOP_HOME=/usr/hdp/current/hadoop-client
$ ./GphdfsTest.sh hdp2 $HADOOP_HOME TEXT gphdfs://hdp1.hdp.local:/user/gpadmin/gphdfs_test/gphdfs_test.txt
19/05/01 15:19:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/05/01 15:19:49 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
19/05/01 15:19:49 INFO input.FileInputFormat: Total input paths to process : 1
1
2
3
4
5
6
7
8
9
10
$
4. If you are still hitting errors such as NoClassDefFoundError or ClassNotFoundException, then you must update the $GPHOME/lib/hadoop/hadoop_env.sh file on the segment host(s).
Tip: Use Hadoop's mapred utility (provided by your Hadoop distribution) to append the Hadoop classes to those in the $GPHOME/lib/hadoop/hadoop_env.sh file on your segment host(s). One way to do this is to add the following at the end of the $GPHOME/lib/hadoop/hadoop_env.sh file on each segment host:
if [ -x /usr/bin/mapred ]; then
    MAPRED_CLASSPATH=$(mapred classpath)
    CLASSPATH=${CLASSPATH}:${MAPRED_CLASSPATH}
fi
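To sanity-check what the mapred utility will contribute (output varies by distribution), you can print the classpath it reports on a segment host:
$ mapred classpath | tr ':' '\n' | head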