HBase RegionServers fail to come up in crash recovery with immutable configuration error

Article ID: 294834


Products

Services Suite

Issue/Introduction

Symptoms:

HBase RegionServers fail to come up during crash recovery with an immutable configuration error:

2015-07-02 06:05:02,273 ERROR org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of region=-ROOT-,,0.70236052, starting to roll back the global memstore size.
java.io.IOException: Cannot get log reader
        at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:721)
        at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEdits(HRegion.java:3179)
        at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEditsIfAny(HRegion.java:3128)
        at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:631)
        at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:547)
        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:4399)
        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:4347)
        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:330)
        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:101)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.UnsupportedOperationException: Immutable Configuration
        at org.apache.hadoop.hbase.regionserver.CompoundConfiguration.setClass(CompoundConfiguration.java:445)
        at org.apache.hadoop.ipc.RPC.setProtocolEngine(RPC.java:193)
        at org.apache.hadoop.hdfs.NameNodeProxies.createNNProxyWithClientProtocol(NameNodeProxies.java:249)
        at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:168)
        at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:129)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:418)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:385)
        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:123)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2277)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:314)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:194)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1747)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1773)
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:715)
        ... 12 more

Environment


Cause

The immutable configuration error stems from a known HBase bug, HBASE-8372: HBase wraps its settings in a CompoundConfiguration class that overrides all of the set methods of the Hadoop Configuration class to throw UnsupportedOperationException.
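For illustration only, here is a minimal sketch of the pattern HBASE-8372 describes (a hypothetical class, not HBase's actual source): a Configuration subclass that overrides the mutators to throw, so any downstream code that tries to modify it fails at runtime.

import org.apache.hadoop.conf.Configuration;

// Minimal sketch of the HBASE-8372 pattern -- not HBase's actual source.
// CompoundConfiguration similarly overrides the Configuration mutators,
// which is why callers such as RPC.setProtocolEngine fail when they write.
public class ImmutableConfiguration extends Configuration {

    public ImmutableConfiguration(Configuration base) {
        super(base); // copy settings from an existing Configuration
    }

    @Override
    public void set(String name, String value) {
        throw new UnsupportedOperationException("Immutable Configuration");
    }

    @Override
    public void setClass(String name, Class<?> theClass, Class<?> xface) {
        // This is the mutator the stack trace above trips over.
        throw new UnsupportedOperationException("Immutable Configuration");
    }
}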

Resolution

Workaround

During crash recovery, the HBase RegionServer passes its CompoundConfiguration conf object down to the Hadoop HDFS client. org.apache.hadoop.ipc.RPC.setProtocolEngine then attempts to modify that conf data structure using setClass(), which is overridden, and the immutable configuration exception is the result:

 191   public static void setProtocolEngine(Configuration conf,
 192                                 Class<?> protocol, Class<?> engine) {
 193     conf.setClass(ENGINE_PROP+"."+protocol.getName(), engine, RpcEngine.class);
 194   }

The above failure occurs only when HDFS HA is not enabled. In the HA case, the CompoundConfiguration object is copied into a new, plain Configuration object. The result is a mutable configuration object that can safely be passed down to the HDFS client and on to org.apache.hadoop.ipc.RPC.setProtocolEngine:

 132       // HA case
 133       FailoverProxyProvider failoverProxyProvider = NameNodeProxies
 134           .createFailoverProxyProvider(conf, failoverProxyProviderClass, xface,
 135               nameNodeUri);
 136       Conf config = new Conf(conf);
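The effect of that copy is easy to demonstrate. The following standalone sketch (illustrative only; the key name is made up) shows that a Configuration built from another Configuration is an independent, mutable object, which is why writes such as setClass() succeed on the HA path:

import org.apache.hadoop.conf.Configuration;

public class CopyDemo {
    public static void main(String[] args) {
        // Stand-in for the CompoundConfiguration handed down by HBase.
        Configuration original = new Configuration(false);

        // What line 136 above does: copy into a plain Configuration.
        Configuration copy = new Configuration(original);

        // The copy is mutable, so writes succeed ("some.key" is made up).
        copy.set("some.key", "some-value");
        System.out.println(copy.get("some.key")); // prints: some-value
    }
}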

With that in mind, a proven workaround is to enable the HDFS HA configuration for the HBase RegionServer and HBase Master services only. This tricks HBase into taking the HA code path even though there is only a single NameNode in the environment, which allows the RegionServers to successfully get out of recovery mode. Upon successful recovery, the HA-related configuration settings can be removed by following the steps below:
1. Take a backup of the /etc/gphd configuration directory on all nodes.

2. Edit /etc/gphd/hadoop/conf/hdfs-site.xml, adding the following HA properties (substitute the ${...} placeholders with values for your environment):
<property>
  <name>dfs.nameservices</name>
  <value>${nameservices}</value>
</property>

<property>
  <name>dfs.ha.namenodes.${nameservices}</name>
  <value>${namenode1id},${namenode2id}</value>
</property>

<property>
  <name>dfs.namenode.rpc-address.${nameservices}.${namenode1id}</name>
  <value>${namenode}:8020</value>
</property>

<property>
  <name>dfs.namenode.rpc-address.${nameservices}.${namenode2id}</name>
  <value>${standbynamenode}:8020</value>
</property>

<property>
  <name>dfs.namenode.http-address.${nameservices}.${namenode1id}</name>
  <value>${namenode}:50070</value>
</property>

<property>
  <name>dfs.namenode.http-address.${nameservices}.${namenode2id}</name>
  <value>${standbynamenode}:50070</value>
</property>

<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://${journalnode}/${nameservices}</value>
</property>

<property>
  <name>dfs.client.failover.proxy.provider.${nameservices}</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>
  sshfence
  shell(/bin/true)
  </value>
</property>

<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hdfs/.ssh/id_rsa</value>
</property>

<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>${journalpath}</value>
</property>

<!-- Namenode Auto HA related properties -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
<!-- END Namenode Auto HA related properties -->
3. Edit the /etc/gphd/hadoop/conf/core-site.xml file:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://${nameservices}</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

<property>
  <name>ha.zookeeper.quorum</name>
  <value>${zookeeper-server}:${zookeeper.client.port}</value>
</property>
4. Edit the /etc/gphd/hadoop/conf/yarn-site.xml file:
<property>
  <name>mapreduce.job.hdfs-servers</name>
  <value>hdfs://${nameservices}</value>
</property>
5. Edit the /etc/gphd/hbase/conf/hbase-site.xml file: 
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://${nameservices}/apps/hbase/data</value>
  <description>The directory shared by region servers and into
  which HBase persists.  The URL should be 'fully-qualified'
  to include the filesystem scheme.  For example, to specify the
  HDFS directory '/hbase' where the HDFS instance's namenode is
  running at namenode.example.org on port 9000, set this value to:
  hdfs://namenode.example.org:9000/hbase.  By default HBase writes
  into /tmp.  Change this configuration else all data will be lost
  on machine restart.
  </description>
</property>
6. Distribute the configuration changes to all HBase Master and RegionServer nodes.

7. Restart the HBase services. (Optionally, first verify that a client can resolve the new nameservice URI; see the sketch after these steps.)

8. Restore the original configuration.

9. Restart the HBase services and confirm the issue is resolved.
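As a quick sanity check before restarting HBase in step 7, a small client program can confirm that the temporary HA configuration resolves the logical nameservice. This is an illustrative sketch, assuming the edited core-site.xml and hdfs-site.xml are on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckNameservice {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml/hdfs-site.xml from the classpath, so
        // fs.defaultFS should now be hdfs://${nameservices}.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Reaching the HBase root directory proves the client resolved the
        // NameNode through the configured failover proxy provider.
        System.out.println("hbase.rootdir exists: "
                + fs.exists(new Path("/apps/hbase/data")));
    }
}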


Fix

Upgrade to PHD 3.0, which includes HBase version 0.98.4, or permanently enable HDFS HA to prevent this issue from occurring in the future.