How to Configure HiveServer2 and use a Beeline Client on a Pivotal HD cluster



Article ID: 294693


Products

Services Suite

Issue/Introduction

This article lists the steps required to configure HiveServer2. This article assumes that you have a Pivotal HD cluster running with a HiveServer.

Environment


Cause

This section provides a general overview of what HiveServer2 is and how it works.
 

HiveServer2 is a server interface that enables remote clients to execute queries against Hive and retrieve the results. HiveServer2 reads existing HiveServer1 data without alteration, provided you do not change your HiveServer1 release; you do not need to update or transform data in order to begin using HiveServer2. You can use HiveServer2 instead of HiveServer1 as soon as you have enabled support.
 

HiveServer2 provides support for the following features:

  • Multi-client concurrency.
  • User authentication to prevent untrusted access.
  • Enforced authorization of user permissions on data.
  • Better support for open API clients such as JDBC and ODBC.

Resolution

Perform the following procedure to configure HiveServer2:
 

1. Choose the node on which you will install HiveServer2.


You can install HiveServer2 on any node in your Pivotal HD cluster, but a master service node is recommended. If HiveServer1 is already installed, you may use the same node to run HiveServer2, since HiveServer1 is deprecated and you will eventually discontinue its use. You must, however, turn off HiveServer1 before you begin using HiveServer2.


2. Install the HiveServer2 package.


Since your existing PHD cluster was set up using the Pivotal HD Install & Configuration Manager utility (ICM), /etc/yum.repos.d/gphd.repo is already available on the cluster node, and you can use yum to install HiveServer2.

[gpadmin@hdw3 ~]$ sudo yum install hive-server2
 Loaded plugins: fastestmirror
 Loading mirror speeds from cached hostfile
 Dependencies Resolved
 ..
 ..
 Installing:
 hive-server2 noarch 0:0.11.0_gphd_2_1_1_0-83 gphd-admin-localrepo 5.1 k
 ..
 ..
 Installed:
  hive-server2.noarch 0:0.11.0_gphd_2_1_1_0-83
 Complete!
 Note: If yum is not working, install the hive-server2 package using the rpm command instead.
Example: rpm -ivh <path to PHD binary>/hive/rpm/hive-server2-x.x.x_gphd_x_x_x_x-x.noarch.rpm
3. Modify the hive-site.xml file, in the following location: /etc/gphd/hive/conf/hive-site.xml 


4. By default, both HiveServer1 and HiveServer2 try to bind to port 10000. Ensure that you have turned off HiveServer1 before starting HiveServer2.

NOTE: There is no need to change the port, but if necessary you can change the port number using the hive.server2.thrift.port parameter.

 <property>
  <name>hive.server2.thrift.port</name>
  <value>10001</value>
  <description>TCP port number to listen on, default 10000</description>
 </property>
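Before starting HiveServer2, you can confirm that nothing (such as a HiveServer1 that was not shut down) is still listening on the chosen port. The sketch below is illustrative and bash-specific (it relies on bash's /dev/tcp feature); 10001 matches the hive.server2.thrift.port value set above, and 127.0.0.1 assumes you run it on the HiveServer2 node itself.

```shell
# Illustrative bash sketch: test whether the HiveServer2 Thrift port is
# already bound. 10001 matches the hive.server2.thrift.port value above.
port=10001
if (exec 3<>"/dev/tcp/127.0.0.1/${port}") 2>/dev/null; then
  echo "port ${port} is already in use - stop HiveServer1 first"
else
  echo "port ${port} is free"
fi
```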
5. You must set the following parameters to avoid data corruption.

Hive includes a locking feature that uses Apache ZooKeeper. ZooKeeper provides highly reliable distributed coordination. Other than these configuration steps, ZooKeeper is invisible to Hive users.
 <property>
  <name>hive.support.concurrency</name>
  <description>Whether Hive supports concurrency or not. A Zookeeper instance must be up and running for the default Hive lock manager to support read-write locks.
 </description>
  <value>true</value>
 </property>
 <property>
  <name>hive.zookeeper.quorum</name>
  <description>Zookeeper quorum used by Hive's Table Lock Manager</description>
  <value>hdw1.hadoop.local,hdw2.hadoop.local,hdw3.hadoop.local</value>
 </property>

With the above parameters configured, Hive automatically acquires locks for certain queries. You can review all current locks with the SHOW LOCKS <TABLE_NAME>; command.

There are two types of locks in Hive, and they are enabled automatically when concurrency is enabled:

Shared lock: A shared lock is acquired when a table is read. Multiple concurrent shared locks are allowed.

Exclusive lock: An exclusive lock is required for all operations that modify the table in some way. It not only blocks other table-mutating operations, it also prevents queries by other processes.

When a table is partitioned, acquiring an exclusive lock on a partition causes a shared lock to be acquired on the table itself, preventing incompatible concurrent changes such as an attempt to drop the table while a partition is being modified.
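The shared/exclusive semantics described above can be seen in miniature with the standard flock(1) utility. This is only an analogy, not Hive's ZooKeeper-based lock manager: two readers hold the shared lock concurrently, then a writer takes the exclusive lock once they are done.

```shell
# Analogy only (flock(1), not Hive's ZooKeeper lock manager):
# shared locks coexist; an exclusive lock excludes everything else.
lockfile=$(mktemp)

# Two concurrent readers both hold the shared lock:
flock -s "$lockfile" -c 'echo "reader holds shared lock"' &
flock -s "$lockfile" -c 'echo "reader holds shared lock"' &
wait

# A writer then takes the exclusive lock; it waits until all shared
# locks are released and blocks new lock requests while held:
flock -x "$lockfile" -c 'echo "writer holds exclusive lock"'

rm -f "$lockfile"
```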

To prevent HiveServer2 from keeping too many idle connections open to the NameNode, set ipc.client.connection.maxidletime back to its default of 10 seconds (10000 ms). By default, PHD sets this parameter to 1 hour in core-site.xml, which can cause out-of-memory errors on HiveServer2.

<property>
  <name>ipc.client.connection.maxidletime</name>
  <value>10000</value>
</property>
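To confirm the value actually in effect, you can pull it out of core-site.xml with standard tools. The sketch below runs against a local sample copy it creates itself; on a PHD cluster you would point the grep at your actual core-site.xml (the path varies by install and is an assumption here).

```shell
# Sketch: extract a property value from a Hadoop-style core-site.xml.
# A sample file is created here for illustration; on a real cluster,
# point the grep at your actual core-site.xml instead.
cat > /tmp/core-site-sample.xml <<'EOF'
<configuration>
  <property>
    <name>ipc.client.connection.maxidletime</name>
    <value>10000</value>
  </property>
</configuration>
EOF

# Print the value following the property name (10000 ms = 10 seconds):
grep -A1 'ipc.client.connection.maxidletime' /tmp/core-site-sample.xml \
  | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
```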
1. Setup is complete; start HiveServer2:
[gpadmin@hdw3 conf]$ sudo service hive-server2 start
starting hive-server2, logging to /var/log/gphd/hive/hive-server2.log
2. Connect to HiveServer2 using the beeline utility.

Beeline is a JDBC client based on the SQLLine CLI, developed specifically to interact with HiveServer2. The JDBC driver communicates with HiveServer2 using HiveServer2's Thrift APIs. Get familiar with Beeline, as Hive development has shifted from HiveServer1 to HiveServer2.


NOTE: In this example, hdw3 is the server where hive-server2 is installed and 10001 is the port defined for hive-server2; change these values to suit your installation.

[gpadmin@hdm1 init.d]$ beeline
 Beeline version 0.11.0-gphd-2.1.1.0 by Apache Hive
 beeline> !connect jdbc:hive2://hdw3:10001 username password org.apache.hive.jdbc.HiveDriver
 Connecting to jdbc:hive2://hdw3:10001
 Connected to: Hive (version 0.11.0-gphd-2.1.1.0)
 Driver: Hive (version 0.11.0-gphd-2.1.1.0)
 Transaction isolation: TRANSACTION_REPEATABLE_READ
 0: jdbc:hive2://hdw3:10001> show tables;
 +------------+
 | tab_name   |
 +------------+
 | passwords  |
 +------------+
 1 row selected (1.487 seconds)
 0: jdbc:hive2://hdw3:10001>
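The connection string used in the session above follows a fixed shape and can also be passed to Beeline non-interactively with -u. The sketch below assembles it from the host and port used in this article; substitute your own values (the gpadmin user is the one used throughout this article).

```shell
# Sketch: build the Beeline/HiveServer2 JDBC URL. hdw3 and 10001 are
# the host and port used in this article; substitute your own.
host=hdw3
port=10001
url="jdbc:hive2://${host}:${port}"
echo "beeline -u ${url} -n gpadmin"
```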
3. Connect to HiveServer2 on a secured cluster
[gpadmin@hdm1 init.d]$ beeline
2: jdbc:hive2://hdm1:10001> !connect jdbc:hive2://hdm1:10001/default;principal=hive/[email protected]
Connecting to jdbc:hive2://hdm1:10001/default;principal=hive/[email protected]
Enter username for jdbc:hive2://hdm1:10001/default;principal=hive/[email protected]: gpadmin
Enter password for jdbc:hive2://hdm1:10001/default;principal=hive/[email protected]: *******
Connected to: Hive (version 0.12.0-gphd-3.0.1.0)
Driver: Hive (version 0.12.0-gphd-3.0.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
3: jdbc:hive2://hdm1:10001/default> show tables;
+------------+
| tab_name |
+------------+
| book |
| book1 |
| book10 |

NOTE: If you don't have privileges on a secured cluster, you will not be able to perform read/write operations and may receive an error similar to the following. In that case, grant the appropriate privileges.

3: jdbc:hive2://hdm1:10001/default> select * from book;
Error: Error while processing statement: Authorization failed:No privilege 'Select' found for inputs { database:default, table:book, columnName:word}. Use show grant to get more details. (state=,code=403)
0: jdbc:hive2://hdm1.gphd.local:10001/default> grant select on table book to user gpadmin ;
0: jdbc:hive2://hdm1.gphd.local:10001> select * from book;
 +-------+
 | word |
 +-------+
+-------+


Additional Information

Miscellaneous:

  • beeline -e "show tables" : List the tables
  • beeline -f file_containing_sql : Execute SQL contained in a file
  • beeline -help | -h : Print options for the beeline utility
  • !quit : Quit the beeline terminal
  • set env:TERM; env:TERM=xterm; : Set the value of a variable in the beeline client
  • beeline -u jdbc:hive2://localhost:10001 -n gpadmin -p changeme : Syntax to connect to HiveServer2 using beeline
  • beeline wiki: https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients