Hive job failed with error "org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException"

Article ID: 294959

Products

Services Suite

Issue/Introduction

Symptoms:

A Hive job that needs to process a 40 TB data set failed with the following error:

org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:544)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:157)
... 8 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/hive-svcckp/hive_2014-08-12_21-58-23_567_8729876893578629726-1/_task_tmp.-ext-10002/country_code=US/base_div_nbr=1/retail_channel_code=1/visit_date=2011-12-22/load_timestamp=20140810001530/_tmp.002881_0: File does not exist. Holder DFSClient_attempt_1407874976831_0233_m_000178_0_-1560055874_1 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2932)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2738)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2646)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:555)
.....
The customer also reported that the same type of job had worked well before.

Environment


Cause

Based on our observation, the problem may be triggered by the following issue:
  • The job needs to open more files than the maximum number of open files allowed by ulimit. For the customer who reported this issue, the problem was triggered by the fact that half of their Hadoop cluster nodes were down because of a power supply failure. Therefore, the remaining nodes had to handle the extra workload, which required more files to be open. The maximum number of open files on each node could not handle the extra workload.
To confirm the issue, you may run the following command on a few NodeManager hosts while the problem is happening:
lsof | wc -l
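
For example, a loop like the one below samples the open-file count on several NodeManager hosts and shows the current nofile limit for the user that owns the job. This is only a sketch: it assumes passwordless SSH access as root, that yarn is the job-owning user, and a hypothetical file named nm_hosts.txt listing the NodeManager hostnames.

for host in $(cat nm_hosts.txt); do
  echo "== $host =="
  # Rough count of open file descriptors on the node.
  ssh root@"$host" 'lsof | wc -l'
  # Current nofile limit for the user that owns the job (yarn here).
  ssh root@"$host" 'sudo -u yarn sh -c "ulimit -n"'
done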

Resolution

When you see this problem, consider increasing the ulimit maximum number of open files for the user who runs the job on all NodeManager hosts.

Note: on a secured cluster, the job owner is the user who launched the job. On a non-secure cluster, the yarn user owns the job.
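
If it is unclear which user owns the running containers, a quick check on a NodeManager host is to look at the owner of the task JVMs. This is only a hedged sketch: the YarnChild class name applies to MapReduce tasks and may differ for other frameworks.

# Show which Linux user owns the running MapReduce task JVMs on this node.
ps -eo user,args | grep '[Y]arnChild' | awk '{print $1}' | sort | uniq -c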

An example of setting the maximum number of open files to 3 million for the gpadmin user:
Add the following line to /etc/security/limits.conf
gpadmin - nofile 3000000

Then log in as gpadmin and run
ulimit -n 3000000

To make the change permanent, add the following line to /etc/security/limits.d/gpadmin.conf. Create this file if it does not exist.
gpadmin - nofile 3000000
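
Because the limit has to be raised on every NodeManager host, a small loop can push the file out. This is only a sketch and assumes root SSH access plus the same hypothetical nm_hosts.txt host list used above:

for host in $(cat nm_hosts.txt); do
  # Write the limits.d drop-in on each NodeManager host.
  ssh root@"$host" 'echo "gpadmin - nofile 3000000" > /etc/security/limits.d/gpadmin.conf'
done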

To confirm the change, run the following command as gpadmin
ulimit -a

To confirm the change for users with a nologin shell, run the following command (using the yarn user as an example)
sudo -u yarn sh -c "ulimit -a"
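
To check the new limit across the whole cluster in one pass, a loop such as the following can be used. Again, this is a sketch that assumes root SSH access and the hypothetical nm_hosts.txt host list:

for host in $(cat nm_hosts.txt); do
  echo "== $host =="
  # Print the nofile limit that the yarn user actually gets on this host.
  ssh root@"$host" 'sudo -u yarn sh -c "ulimit -n"'
done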