How to access HDFS data via GPDB external table with gphdfs protocol



Article ID: 294707



Products

Services Suite

Environment


Resolution

Prerequisites

  • Download the JDK package (1.7 is recommended) and install it on all servers of the GPDB cluster (a quick cross-cluster check is sketched after this list).
  • Download the Pivotal HD (PHD) installation package (the same version as the PHD cluster to be accessed) and put it on the GPDB master host.
  • The download link for the PHD installation package can be found in the release notes of each PHD release; choose the matching release and read its release notes.
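Before moving on, it is worth confirming that the JDK is actually present on every host. A minimal check, assuming passwordless SSH for gpadmin and a hostfile named hostfile_all listing every GPDB host (both names are assumptions for this sketch):

# hostfile_all is an assumed hostfile name; java must already be on the PATH
[gpadmin@admin ~]$ gpssh -f hostfile_all -e 'java -version'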

Install the required PHD packages on the GPDB cluster:


Option 1

1. Unpack the PHD installation tarball. The PHD-2.0.1 release is used below as an example:

[root@admin phd201]# tar xvfz PHD-2.0.1.0-148.tar.gz

... ...
Locate the following RPM packages in the unpacked directory:
utility/rpm/bigtop-jsvc-1.0.15_gphd_2_0_1_0-43.x86_64.rpm
utility/rpm/bigtop-utils-0.4.0_gphd_2_0_1_0-43.noarch.rpm
zookeeper/rpm/zookeeper-3.4.5_gphd_2_0_1_0-43.noarch.rpm
hadoop/rpm/hadoop-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm
hadoop/rpm/hadoop-yarn-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm
hadoop/rpm/hadoop-mapreduce-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm
hadoop/rpm/hadoop-hdfs-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm
2. Install the RPM packages (follow the order listed above to avoid dependency check errors). This needs to be done on all segment servers of the GPDB cluster; a sketch of pushing and installing the packages on every host at once follows the command below.
[root@admin phd201]# cd PHD-2.0.1.0-148

[root@admin PHD-2.0.1.0-148]# rpm -ivh \
utility/rpm/bigtop-jsvc-1.0.15_gphd_2_0_1_0-43.x86_64.rpm \
utility/rpm/bigtop-utils-0.4.0_gphd_2_0_1_0-43.noarch.rpm \
zookeeper/rpm/zookeeper-3.4.5_gphd_2_0_1_0-43.noarch.rpm \
hadoop/rpm/hadoop-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm \
hadoop/rpm/hadoop-yarn-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm \
hadoop/rpm/hadoop-mapreduce-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm \
hadoop/rpm/hadoop-hdfs-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm
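Rather than logging in to each segment server, the GPDB utilities gpscp and gpssh can push the tarball and run the install everywhere in one pass. A minimal sketch, assuming root SSH key exchange across the cluster and a hostfile named hostfile_segments listing all segment hosts (both are assumptions for this sketch):

# hostfile_segments is an assumed hostfile listing all segment hosts
[root@admin phd201]# gpscp -f hostfile_segments PHD-2.0.1.0-148.tar.gz =:/tmp/
[root@admin phd201]# gpssh -f hostfile_segments -e 'cd /tmp && tar xfz PHD-2.0.1.0-148.tar.gz'
[root@admin phd201]# gpssh -f hostfile_segments -e 'cd /tmp/PHD-2.0.1.0-148 && rpm -ivh utility/rpm/bigtop-jsvc-*.rpm utility/rpm/bigtop-utils-*.rpm zookeeper/rpm/zookeeper-*.rpm hadoop/rpm/hadoop-2*.rpm hadoop/rpm/hadoop-yarn-*.rpm hadoop/rpm/hadoop-mapreduce-*.rpm hadoop/rpm/hadoop-hdfs-*.rpm'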
3. Configure the Hadoop configuration files and ensure that HDFS works.
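"Configure" here means making the PHD cluster's HDFS client settings visible on every GPDB host. A minimal sketch, assuming the PHD packages keep their configuration under /etc/gphd/hadoop/conf, that hdm2 is the NameNode host, and that the same hostfile_segments hostfile is used (all of these are assumptions for this sketch):

# Pull core-site.xml and hdfs-site.xml from the NameNode host, then fan them out to the segment hosts
[root@admin ~]# scp hdm2:/etc/gphd/hadoop/conf/core-site.xml hdm2:/etc/gphd/hadoop/conf/hdfs-site.xml /etc/gphd/hadoop/conf/
[root@admin ~]# gpscp -f hostfile_segments /etc/gphd/hadoop/conf/core-site.xml /etc/gphd/hadoop/conf/hdfs-site.xml =:/etc/gphd/hadoop/conf/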


Option 2

1. If there is an admin node (where Pivotal Command Center is running) available on the target PHD cluster, the required RPM packages can be found under /usr/lib/gphd/rpms on that admin node.

2. Install the RPM packages on all segment hosts of the GPDB cluster in either of the following ways:

  a) Copy the RPM packages to each segment server and install them manually with the "rpm -ivh" command.

  b) Add a repo file (for example, gphd.repo) under /etc/yum.repos.d on every segment server with the content below.

  • Note: "admin.hadoop.local" is the hostname of the admin node at your site; modify it accordingly.
[gphd]
name=PHD Admin Node Repo
baseurl=http://admin.hadoop.local/gphd_yum_repo
enabled=1
gpgcheck=0
metadata_expire=0
Then run "yum install <rpm package name>" on each segment server to complete the installation.

3. Configure the Hadoop configuration files and ensure that HDFS works.

GPDB Configuration

1. Set the JAVA_HOME environment variable correctly for the gpadmin user on all segment servers, as illustrated below. It is best to set it in .bashrc or .bash_profile (a cross-cluster check is sketched after the example).

[gpadmin@admin ~]$echo $JAVA_HOME

/usr/java/default
[gpadmin@admin ~]$ls -l /usr/java/default
lrwxrwxrwx 1 root root 16 Jul 18 2013 /usr/java/default -> /usr/java/latest
[gpadmin@admin ~]$ls -l /usr/java/latest
lrwxrwxrwx 1 root root 21 Dec 15 2013 /usr/java/latest -> /usr/java/jdk1.7.0_25

[gpadmin@admin ~]$cat .bash_profile | grep JAVA_HOME
export JAVA_HOME=/usr/java/default
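Because gphdfs starts a JVM on every segment, JAVA_HOME must resolve on every segment host, not only on the master. A quick check, assuming the same hostfile_segments hostfile (an assumption for this sketch):

# hostfile_segments is an assumed hostfile name
[gpadmin@admin ~]$ gpssh -f hostfile_segments -e 'source ~/.bash_profile; echo $JAVA_HOME'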
2. Set the parameters for GPDB:
[gpadmin@admin ~]$ gpconfig -c gp_hadoop_home -v "'/usr/lib/gphd'"
[gpadmin@admin ~]$ gpconfig -c gp_hadoop_target_version -v "'gphd-2.0'"

Run "gpstop -u" to make the changes take effect.
 

Test

1. Check that HDFS is accessible from any of the segment servers:
[gpadmin@sdw1 ~]$hdfs dfs -ls hdfs://hdm2:8020/

Found 7 items
drwxr-xr-x - hdfs hadoop 0 2014-06-14 21:20 hdfs://hdm2:8020/apps
drwxr-xr-x - postgres gpadmin 0 2014-06-16 04:55 hdfs://hdm2:8020/hawq_data
drwxr-xr-x - hdfs hadoop 0 2014-06-14 21:21 hdfs://hdm2:8020/hive
drwxr-xr-x - mapred hadoop 0 2014-06-14 21:20 hdfs://hdm2:8020/mapred
drwxrwxrwx - hdfs hadoop 0 2014-07-10 22:29 hdfs://hdm2:8020/tmp
drwxrwxrwx - hdfs hadoop 0 2014-06-16 18:11 hdfs://hdm2:8020/user
drwxr-xr-x - hdfs hadoop 0 2014-06-14 21:21 hdfs://hdm2:8020/yarn
2. Create a temporary text file and put it into HDFS:
[gpadmin@admin ~]$cat test1.txt
15,west
25,east
[gpadmin@admin ~]$ hdfs dfs -put test1.txt hdfs://hdm2:8020/tmp/
[gpadmin@admin ~]$hdfs dfs -ls hdfs://hdm2:8020/tmp
Found 1 items
-rw-r--r-- 3 gpadmin hadoop 16 2014-07-06 22:38 hdfs://hdm2:8020/tmp/test1.txt
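Optionally, read the file back to confirm the content arrived intact:

[gpadmin@admin ~]$ hdfs dfs -cat hdfs://hdm2:8020/tmp/test1.txt
15,west
25,east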

 

3. Create a readable external table in GPDB pointing to the sample file (test1.txt) in HDFS.

NOTE: The location field in the example below is for a single-NameNode deployment. With two NameNodes (High Availability), the location field takes the form "gphdfs://<nameservice>/tmp/test1.txt": do not include a port, and replace the hostname with the nameservice name configured in core-site.xml for the "fs.defaultFS" parameter. An illustrative HA-style definition follows this note.
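For reference only, an HA-style definition would look like the following, where "phdcluster" is a stand-in for the nameservice name taken from fs.defaultFS (the actual name at your site will differ):

-- "phdcluster" is a placeholder nameservice name; use the value of fs.defaultFS from core-site.xml
initdb=# create external table test_hdfs_ha (age int, name text) location('gphdfs://phdcluster/tmp/test1.txt') format 'text' (delimiter ',');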

initdb=# create external table test_hdfs (age int, name text) location('gphdfs://hdm2:8020/tmp/test1.txt') format 'text' (delimiter ',');

CREATE EXTERNAL TABLE

initdb=# \d test_hdfs
   External table "public.test_hdfs"
 Column |  Type   | Modifiers
--------+---------+-----------
 age    | integer |
 name   | text    |
Type: readable
Encoding: UTF8
Format type: text
Format options: delimiter ',' null '\N' escape '\'
External location: gphdfs://hdm2:8020/tmp/test1.txt
4. Try to query data from the external table:
initdb=# select * from test_hdfs;
age | name
-----+------
15 | west
25 | east
(2 rows)
5. Create a writable external table in GPDB pointing to a file in HDFS:
initdb=# select * from myt1;

id | name
------+-------
1000 | Jason
(1 row)

initdb=# create writable external table test_hdfs2 (like myt1) location('gphdfs://hdm2:8020/tmp/test2.txt') format 'text' (delimiter ',');

NOTICE: Table doesn't have 'distributed by' clause, defaulting to distribution columns from LIKE table
CREATE EXTERNAL TABLE
6. Insert data into the writable external table:
initdb=# insert into test_hdfs2 select * from myt1;

INSERT 0 1
7. Check the existence and content of the file in HDFS:
[gpadmin@admin ~]$hdfs dfs -ls hdfs://hdm2:8020/tmp/test2.txt
Found 1 items
-rw-r--r-- 3 gpadmin hadoop 11 2014-07-13 23:37 hdfs://hdm2:8020/tmp/test2.txt/0_1402800555-0000000098
[gpadmin@admin ~]$hdfs dfs -cat hdfs://hdm2:8020/tmp/test2.txt/0_1402800555-0000000098
1000,Jason
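Once the test succeeds, the sample objects created above can be cleaned up:

initdb=# drop external table test_hdfs;
initdb=# drop external table test_hdfs2;
[gpadmin@admin ~]$ hdfs dfs -rm -r hdfs://hdm2:8020/tmp/test1.txt hdfs://hdm2:8020/tmp/test2.txt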