2014-04-22 13:00:59,898 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Removing replica BP-1306430579-172.28.9.250-1381221906808:-8712134517697604346 on failed volume /data2/dfs/current
2014-04-22 13:00:59,898 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Removed 12308 out of 123402(took 138 millisecs)
2014-04-22 13:00:59,898 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode.handleDiskError: Keep Running: false
2014-04-22 13:01:00,110 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode is shutting down: DataNode failed volumes:/data2/dfs/current;
2014-04-22 13:01:00,112 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:svc-platfora (auth:SIMPLE) cause:java.io.IOException: Block blk_2910942244825575033_338680521 is not valid.
2014-04-22 13:01:00,112 INFO org.apache.hadoop.ipc.Server: IPC Server handler 50 on 50020, call org.apache.hadoop.hdfs.protocol.ClientDatanodeProtocol.getBlockLocalPathInfo from 172.28.10.40:55874: error: java.io.IOException: Block blk_2910942244825575033_338680521 is not valid.
java.io.IOException: Block blk_2910942244825575033_338680521 is not valid.
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockFile(FsDatasetImpl.java:306)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockFile(FsDatasetImpl.java:287)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockLocalPathInfo(FsDatasetImpl.java:1737)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.getBlockLocalPathInfo(DataNode.java:1023)
        at org.apache.hadoop.hdfs.protocolPB.ClientDatanodeProtocolServerSideTranslatorPB.getBlockLocalPathInfo(ClientDatanodeProtocolServerSideTranslatorPB.java:112)
        at org.apache.hadoop.hdfs.protocol.proto.ClientDatanodeProtocolProtos$ClientDatanodeProtocolService$2.callBlockingMethod(ClientDatanodeProtocolProtos.java:5104)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
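If it is not obvious which volume failed, the datanode log itself usually names it, as in the excerpt above. A quick grep like the following can help (a minimal sketch; the log directory shown is typical for a packaged install and may differ on your cluster):

# Log directory is an assumption for a packaged install; adjust to your environment.
grep -i "failed volume" /var/log/hadoop-hdfs/hadoop-hdfs-datanode-*.log | tail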
By default, the ICM client will configure the hdfs-site.xml parameter "dfs.datanode.failed.volumes.tolerated" to 0, which forces the datanode daemon to shut down in the event of a failure accessing one of its defined data volumes. The data volumes are defined by the parameter "dfs.datanode.data.dir", which in this case is set to use the following data volumes:
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>0</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data1/dfs,/data2/dfs,/data3/dfs</value>
</property>
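To confirm the values a datanode host actually has in effect, you can read them from the local configuration with hdfs getconf (a minimal sketch; run this on the datanode host so its hdfs-site.xml is picked up):

# Print the configured value of each parameter from the local client configuration.
hdfs getconf -confKey dfs.datanode.failed.volumes.tolerated
hdfs getconf -confKey dfs.datanode.data.dir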
The /data2 data volume became inaccessible and the datanode shut down as a result. Typically, a data volume is associated with a single disk configured with RAID 0, so whatever data existed on that volume is lost. Blocks are replicated across the cluster according to "dfs.replication" (3 by default), so chances are there are 2 safe and sound copies somewhere else in the cluster that the application can read from.

Option 1: Replace any failed disks associated with the /data2 volume and recreate the data directory structure as defined by "dfs.datanode.data.dir":
mkdir /data2/dfs
chown hdfs:hadoop /data2/dfs
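Before restarting the service, it can help to confirm the recreated directory is owned by and writable as the hdfs user (a quick sanity check; the test file name is arbitrary):

# Verify ownership and permissions on the new data directory.
ls -ld /data2/dfs
# Attempt a write as the hdfs user, then clean up the test file.
sudo -u hdfs touch /data2/dfs/.write_test && sudo -u hdfs rm /data2/dfs/.write_test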
sudo service hadoop-hdfs-datanode start

Option 2: You can increase the "dfs.datanode.failed.volumes.tolerated" parameter to 1 and start the datanode service. This will prevent the datanode from shutting down when a single data volume fails.
NOTE: It is not recommended to increase this value if you have a datanode with 4 or fewer volumes, or if your hardware is not being monitored for disk drive failures. You may experience data loss if individual volume failures are spread across multiple datanodes and no alerts are in place to detect failed data volumes.
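If you do raise the tolerance, some form of volume monitoring becomes important. A crude standalone check like the following could be run from cron (purely illustrative; a real deployment would use the cluster's monitoring system, and the volume list must match your "dfs.datanode.data.dir"):

# Illustrative check: alert if any configured data volume is not writable by hdfs.
for d in /data1/dfs /data2/dfs /data3/dfs; do
  sudo -u hdfs test -w "$d" || echo "ALERT: data volume $d is not writable"
done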
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>1</value>
</property>
sudo service hadoop-hdfs-datanode start

Option 3: Remove the failed volume from "dfs.datanode.data.dir" and start the datanode service.

Change From:
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data1/dfs,/data2/dfs,/data3/dfs</value>
</property>
Change To:
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data1/dfs,/data3/dfs</value>
</property>
sudo service hadoop-hdfs-datanode start

Verify that dfs is healthy with "sudo -u hdfs hdfs dfsadmin -report".