HDB cluster fails to start with error “Unable to connect to server”

Article ID: 295003

Products

Services Suite

Issue/Introduction

Symptoms:
An attempt to start an HDB cluster fails with the error message “Unable to connect to server”:
20170210:04:25:27:032190 gpstart:rmisdca2m2:gpadmin-[INFO]:-Starting gpstart with args:
20170210:04:25:27:032190 gpstart:rmisdca2m2:gpadmin-[INFO]:-Gathering information and validating the environment...
20170210:04:25:28:032190 gpstart:rmisdca2m2:gpadmin-[INFO]:-Greenplum Binary Version: 'postgres (HAWQ) 4.2.0 build 1'
20170210:04:25:28:032190 gpstart:rmisdca2m2:gpadmin-[INFO]:-Greenplum Catalog Version: '201412220'
20170210:04:25:28:032190 gpstart:rmisdca2m2:gpadmin-[INFO]:-Starting Master instance in admin mode
20170210:04:25:30:032190 gpstart:rmisdca2m2:gpadmin-[INFO]:-Obtaining Greenplum Master catalog information
20170210:04:25:30:032190 gpstart:rmisdca2m2:gpadmin-[INFO]:-Obtaining Segment details from master...
20170210:04:25:31:032190 gpstart:rmisdca2m2:gpadmin-[INFO]:-Setting new master era
20170210:04:25:31:032190 gpstart:rmisdca2m2:gpadmin-[INFO]:-Master Started...
......
20170210:04:25:48:032190 gpstart:rmisdca2m2:gpadmin-[INFO]:-Commencing parallel segment instance startup, please wait...
20170210:04:26:11:032190 gpstart:rmisdca2m2:gpadmin-[INFO]:-Process results...
20170210:04:26:11:032190 gpstart:rmisdca2m2:gpadmin-[WARNING]:-No segment started for content: 38.
......
20170210:04:26:11:032190 gpstart:rmisdca2m2:gpadmin-[INFO]:-DBID:40 FAILED host:'sdw7.gphd.local' datadir:'/data/hawq/segments/3/gpseg38' with reason:'Unable to connect to server'
20170210:04:26:11:032190 gpstart:rmisdca2m2:gpadmin-[INFO]:-----------------------------------------------------
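On a cluster with many segments, the FAILED entries can be hard to spot in the gpstart output. The following is a minimal sketch, not part of the product, that pulls the DBID, host, data directory and reason out of log lines in the format shown above. The function name `failed_segments` is hypothetical, and the log path in the usage comment is an assumption; adjust it to wherever gpstart writes its log on your master host.

```shell
# Print "dbid host datadir reason" for every segment that gpstart marked FAILED.
# Reads gpstart log text on stdin; lines that do not match are ignored.
failed_segments() {
  sed -n "s/.*-DBID:\([0-9][0-9]*\) FAILED host:'\([^']*\)' datadir:'\([^']*\)' with reason:'\([^']*\)'.*/\1 \2 \3 \4/p"
}

# Usage (the log file path is an assumption):
#   failed_segments < /path/to/gpstart.log
```

For the failure shown above, this would print the DBID 40, the host sdw7.gphd.local, the data directory of gpseg38 and the reason string on a single line.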

Environment


Cause

Port 40002, which is reserved for segment gpseg38, is already in use by another process, so the segment cannot start.


RCA

Running gpstart in verbose mode reveals more detail about the error:

20170210:04:40:22:090313 gpsegstart.py_sdw7:gpadmin:sdw7:gpadmin-[INFO]:-Reviewing /data/hawq/segments/3/gpseg38
20170210:04:40:22:090313 gpsegstart.py_sdw7:gpadmin:sdw7:gpadmin-[WARNING]:-Error getting data stdout:"" stderr:"failed to connect: Connection refused (errno: 111) Retrying no 1 failed to connect: Connection refused (errno: 111) Retrying no 2 failed to connect: Connection refused (errno: 111) Retrying no 3 failed to connect: Connection refused (errno: 111) Retrying no 4 failed to connect: Connection refused (errno: 111) Retrying no 5 failed to connect: Connection refused (errno: 111) Retrying no 6 failed to connect: Connection refused (errno: 111) Retrying no 7 failed to connect: Connection refused (errno: 111) Retrying no 8 failed to connect: Connection refused (errno: 111) Retrying no 9 failed to connect: Connection refused (errno: 111) Retrying no 10 failed to connect: Connection refused (errno: 111) Retrying no 11 failed to connect: Connection refused (errno: 111) Retrying no 12 failed to connect: Connection refused (errno: 111) Retrying no 13 failed to connect: Connection refused (errno: 111) Retrying no 14 failed to connect: Connection refused (errno: 111) Retrying no 15 failed to connect: Connection refused (errno: 111) Retrying no 16 failed to connect: Connection refused (errno: 111) Retrying no 17 failed to connect: Connection refused (errno: 111) Retrying no 18 failed to connect: Connection refused (errno: 111) Retrying no 19 failed to connect: Connection refused (errno: 111) "
20170210:04:40:22:090313 gpsegstart.py_sdw7:gpadmin:sdw7:gpadmin-[INFO]:-Marking failed /data/hawq/segments/3/gpseg38, Unable to connect to server, 9

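A further check on host sdw7 shows that port 40002 is in use by a Java process, the HBase RegionServer. In the netstat output, 40002 is the local port of an established outbound connection to port 2181 (the default ZooKeeper client port) on 10.178.143.137.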

[root@sdw7 ~]# netstat -anp|grep 40002
tcp 0 0 10.178.143.146:40002 10.178.143.137:2181 ESTABLISHED 86153/java
[root@sdw7 ~]# ps -ef|grep 86153
hbase 86153 86140 0 04:23 ? 00:00:15 /usr/jdk64/jdk1.7.0_67/bin/java -Dproc_regionserver -XX:OnOutOfMemoryError=kill -9 %p -Xmx1000m -Dhdp.version=3.0.1.0-1 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hbase/hs_err_pid%p.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/hbase/gc.log-201702100423 -Xmn512m -XX:CMSInitiatingOccupancyFraction=70 -Xms4096m -Xmx4096m -Dhbase.log.dir=/var/log/hbase -Dhbase.log.file=hbase-hbase-regionserver-rmisdcaw7.gphd.local.log -Dhbase.home.dir=/usr/phd/current/hbase-regionserver/bin/.. -Dhbase.id.str=hbase -Dhbase.root.logger=INFO,RFA -Djava.library.path=:/usr/phd/3.0.1.0-1/hadoop/lib/native/Linux-amd64-64:/usr/phd/3.0.1.0-1/hadoop/lib/native -Dhbase.security.logger=INFO,RFAS org.apache.hadoop.hbase.regionserver.HRegionServer start
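A plain `grep 40002` also matches lines where the number appears only as a remote port or inside a PID. The sketch below narrows the match to sockets whose local address uses the port. It assumes the Linux `netstat -anp` column layout seen in the output above (local address in column 4, state in column 6, PID/program in column 7); the function name `port_owner` is hypothetical.

```shell
# Print "STATE PID/program" for every TCP socket whose LOCAL address uses port $1.
# Reads `netstat -anp` output on stdin.
port_owner() {
  awk -v port="$1" '
    $1 ~ /^tcp/ {
      # Column 4 is the local address; the port is the last ":"-separated piece.
      n = split($4, parts, ":")
      if (parts[n] == port) print $6, $7
    }'
}

# Usage (run as root so that the PID/program column is populated):
#   netstat -anp | port_owner 40002
```

With the netstat line shown above as input, the function reports the ESTABLISHED socket owned by 86153/java.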

Resolution

Follow these steps to resolve this issue:
 

1. Stop the HBase RegionServer on host sdw7 from the Ambari Web UI.
2. Start the HDB cluster.
3. Start the HBase RegionServer on host sdw7 again.
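The netstat output in the RCA shows port 40002 as the local end of an outbound connection, which suggests the kernel handed it out as an ephemeral (client) port. As a follow-up check, the sketch below tests whether a segment port lies inside the host's ephemeral port range. The helper name `port_in_range` is hypothetical; `/proc/sys/net/ipv4/ip_local_port_range` is the standard Linux location of the current range.

```shell
# Succeed (exit status 0) when port $1 lies inside the inclusive range $2..$3.
port_in_range() {
  [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# Usage on a Linux segment host (reads the kernel's current ephemeral range):
#   read -r low high < /proc/sys/net/ipv4/ip_local_port_range
#   port_in_range 40002 "$low" "$high" && echo "40002 is inside the ephemeral range"
```

If the segment port is inside that range, the same collision can recur whenever a client socket happens to be assigned the port before the segment starts.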