Install required PHD packages on GPDB cluster:
1. Unpack the PHD installation tarball. The PHD-2.0.1 release is used as the example below:
[root@admin phd201]# tar xvfz PHD-2.0.1.0-148.tar.gz
... ...
utility/rpm/bigtop-jsvc-1.0.15_gphd_2_0_1_0-43.x86_64.rpm
utility/rpm/bigtop-utils-0.4.0_gphd_2_0_1_0-43.noarch.rpm
zookeeper/rpm/zookeeper-3.4.5_gphd_2_0_1_0-43.noarch.rpm
hadoop/rpm/hadoop-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm
hadoop/rpm/hadoop-yarn-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm
hadoop/rpm/hadoop-mapreduce-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm
hadoop/rpm/hadoop-hdfs-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm
[root@admin phd201]# cd PHD-2.0.1.0-148
[root@admin PHD-2.0.1.0-148]# rpm -ivh \
utility/rpm/bigtop-jsvc-1.0.15_gphd_2_0_1_0-43.x86_64.rpm \
utility/rpm/bigtop-utils-0.4.0_gphd_2_0_1_0-43.noarch.rpm \
zookeeper/rpm/zookeeper-3.4.5_gphd_2_0_1_0-43.noarch.rpm \
hadoop/rpm/hadoop-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm \
hadoop/rpm/hadoop-yarn-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm \
hadoop/rpm/hadoop-mapreduce-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm \
hadoop/rpm/hadoop-hdfs-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm
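To confirm the packages were installed on a host, a quick check such as the following can be used (the grep pattern simply matches the "gphd" string that appears in all of the package names above):

[root@admin PHD-2.0.1.0-148]# rpm -qa | grep gphd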
Note: If there is an admin node (where Pivotal Command Center is running) available on the target PHD cluster, the required RPM packages can also be found under /usr/lib/gphd/rpms on that admin node.
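If such an admin node is present, listing that directory is a simple way to locate the packages (a hedged example; the exact file names depend on the PHD release installed):

[root@admin ~]# ls /usr/lib/gphd/rpms/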
2. Install the RPM packages on all segment hosts of the GPDB cluster in either of the following ways:
a) Copy the RPM packages to each segment server and install them manually with the "rpm -ivh" command (see the sketch after this list).
b) Add a repo file (for example, gphd.repo) under /etc/yum.repos.d on every segment server with the content below, then install the packages with yum (also shown in the sketch after this list).
[gphd]
name=PHD Admin Node Repo
baseurl=http://admin.hadoop.local/gphd_yum_repo
enabled=1
gpgcheck=0
metadata_expire=0
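A minimal sketch of both approaches, assuming sdw1 is one of the segment servers and the release was unpacked on the admin host as in step 1 (the /tmp staging path and the yum package names are assumptions inferred from the RPM file names above, not taken from the original steps):

# Option a) copy the unpacked release to the segment host and install the RPMs manually
[root@admin phd201]# scp -r PHD-2.0.1.0-148 sdw1:/tmp/
[root@admin phd201]# ssh sdw1
[root@sdw1 ~]# cd /tmp/PHD-2.0.1.0-148
# (from this directory, run the same "rpm -ivh" command shown in step 1)

# Option b) with the gphd.repo file in place on the segment host, install from the admin node repo
[root@sdw1 ~]# yum install -y bigtop-jsvc bigtop-utils zookeeper hadoop hadoop-yarn hadoop-mapreduce hadoop-hdfs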
Verify that JAVA_HOME is set for the gpadmin user and points to a valid JDK, for example:

[gpadmin@admin ~]$ echo $JAVA_HOME
/usr/java/default
[gpadmin@admin ~]$ ls -l /usr/java/default
lrwxrwxrwx 1 root root 16 Jul 18 2013 /usr/java/default -> /usr/java/latest
[gpadmin@admin ~]$ ls -l /usr/java/latest
lrwxrwxrwx 1 root root 21 Dec 15 2013 /usr/java/latest -> /usr/java/jdk1.7.0_25
[gpadmin@admin ~]$ cat .bash_profile | grep JAVA_HOME
export JAVA_HOME=/usr/java/default
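Because gphdfs external tables are read and written by the segment instances, JAVA_HOME must resolve correctly for gpadmin on every segment host as well, not only on the host shown above. A hedged check using the gpssh utility (hostfile_gpdb is a hypothetical file listing all GPDB hosts):

[gpadmin@admin ~]$ gpssh -f hostfile_gpdb 'echo $JAVA_HOME'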
Verify that HDFS is reachable from the GPDB segment hosts, for example:

[gpadmin@sdw1 ~]$ hdfs dfs -ls hdfs://hdm2:8020/
Found 7 items
drwxr-xr-x   - hdfs     hadoop      0 2014-06-14 21:20 hdfs://hdm2:8020/apps
drwxr-xr-x   - postgres gpadmin     0 2014-06-16 04:55 hdfs://hdm2:8020/hawq_data
drwxr-xr-x   - hdfs     hadoop      0 2014-06-14 21:21 hdfs://hdm2:8020/hive
drwxr-xr-x   - mapred   hadoop      0 2014-06-14 21:20 hdfs://hdm2:8020/mapred
drwxrwxrwx   - hdfs     hadoop      0 2014-07-10 22:29 hdfs://hdm2:8020/tmp
drwxrwxrwx   - hdfs     hadoop      0 2014-06-16 18:11 hdfs://hdm2:8020/user
drwxr-xr-x   - hdfs     hadoop      0 2014-06-14 21:21 hdfs://hdm2:8020/yarn
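For reference, a minimal sketch of how the sample file used in the next step could be staged in HDFS (the two rows match the query output shown further below):

[gpadmin@sdw1 ~]$ printf "15,west\n25,east\n" > /tmp/test1.txt
[gpadmin@sdw1 ~]$ hdfs dfs -put /tmp/test1.txt hdfs://hdm2:8020/tmp/test1.txt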
3. Create a readable external table in GPDB pointing to the sample file (test1.txt) in HDFS.
NOTE: The location field in the example below is for a single NameNode deployment. With two NameNodes (High Availability), the location field would instead be "gphdfs://<nameservice>/tmp/test1.txt": no port is included, and the hostname is replaced by the HDFS nameservice name configured for the "fs.defaultFS" parameter in core-site.xml.
initdb=# create external table test_hdfs (age int, name text)
         location('gphdfs://hdm2:8020/tmp/test1.txt')
         format 'text' (delimiter ',');
CREATE EXTERNAL TABLE
initdb=# \d test_hdfs
External table "public.test_hdfs"
 Column |  Type   | Modifiers
--------+---------+-----------
 age    | integer |
 name   | text    |
Type: readable
Encoding: UTF8
Format type: text
Format options: delimiter ',' null '\N' escape '\'
External location: gphdfs://hdm2:8020/tmp/test1.txt
initdb=# select * from test_hdfs;
 age | name
-----+------
  15 | west
  25 | east
(2 rows)
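For an HA deployment, per the NOTE above, only the authority part of the location changes. A hedged sketch, assuming core-site.xml sets fs.defaultFS to hdfs://phdcluster ("phdcluster" is a hypothetical nameservice name):

initdb=# create external table test_hdfs_ha (age int, name text)
         location('gphdfs://phdcluster/tmp/test1.txt')
         format 'text' (delimiter ',');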
To write data from GPDB into HDFS, create a writable external table based on an existing GPDB table (myt1 in this example):

initdb=# select * from myt1;
  id  | name
------+-------
 1000 | Jason
(1 row)

initdb=# create writable external table test_hdfs2 (like myt1)
         location('gphdfs://hdm2:8020/tmp/test2.txt')
         format 'text' (delimiter ',');
NOTICE: Table doesn't have 'distributed by' clause, defaulting to distribution columns from LIKE table
CREATE EXTERNAL TABLE
initdb=# insert into test_hdfs2 select * from myt1;
INSERT 0 1
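To confirm the row reached HDFS, the target location can be inspected from any host with HDFS access. gphdfs typically writes one data file per segment under the location path, so the exact file names will vary:

[gpadmin@sdw1 ~]$ hdfs dfs -ls hdfs://hdm2:8020/tmp/test2.txt
[gpadmin@sdw1 ~]$ hdfs dfs -cat hdfs://hdm2:8020/tmp/test2.txt/*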