The DataNode starts, but it cannot send heartbeats to the NameNode and is eventually marked dead.
The following output is from the DataNode log.
INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
INFO org.apache.hadoop.ipc.Server: IPC Server listener on 50020: starting
INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data1/dfs/dn/dfs/data/in_use.lock acquired by nodename [email protected]
INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data2/dfs/dn/dfs/data/in_use.lock acquired by nodename [email protected]
INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data3/dfs/dn/dfs/data/in_use.lock acquired by nodename [email protected]
INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data4/dfs/dn/dfs/data/in_use.lock acquired by nodename [email protected]
INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data5/dfs/dn/dfs/data/in_use.lock acquired by nodename [email protected]
INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data6/dfs/dn/dfs/data/in_use.lock acquired by nodename [email protected]
INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data7/dfs/dn/dfs/data/in_use.lock acquired by nodename [email protected]
ERROR org.apache.hadoop.hdfs.server.common.Storage: It appears that another namenode [email protected] has already locked the storage directory
INFO org.apache.hadoop.hdfs.server.common.Storage: Cannot lock storage /data8/dfs/dn/dfs/data. The directory is already locked
WARN org.apache.hadoop.hdfs.server.common.Storage: Ignoring storage directory /data8/dfs/dn/dfs/data due to an exception
java.io.IOException: Cannot lock storage /data8/dfs/dn/dfs/data. The directory is already locked
at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:636)
at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:459)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:152)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:219)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:848)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:819)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
at java.lang.Thread.run(Thread.java:744)
INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data9/dfs/dn/dfs/data/in_use.lock acquired by nodename [email protected]
INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data10/dfs/dn/dfs/data/in_use.lock acquired by nodename [email protected]
By default, dfs.datanode.failed.volumes.tolerated is set to 0, so even a single failed volume is fatal and the following error messages are produced:
FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-xxx-xxx.xxx.x.x-xxxxxxx (storage id DS-xxx-192.168.xxx.x-xxxxx-xxxxxxx) service to namenode.VIADEA.INFO/192.168.xxx.2:8020
org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 9, volumes configured: 10, volumes failed: 1, volume failures tolerated: 0
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.<init>(FsDatasetImpl.java:186)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetFactory.newInstance(FsDatasetFactory.java:34)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetFactory.newInstance(FsDatasetFactory.java:30)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:857)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:819)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
at java.lang.Thread.run(Thread.java:744)
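For reference, this tolerance is configured in hdfs-site.xml. The snippet below is only illustrative: raising the value would let the DataNode start with one failed volume, but it merely masks the symptom and does not fix the underlying mount problem described below.

<!-- hdfs-site.xml: illustrative value only -->
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>1</value>
</property>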
1. The first suspicion was that some orphan process was holding the lock on /data8/dfs/dn/dfs/data/. However, fuser /data8 and lsof | grep /data8 show nothing.
2. Restarting that problematic DataNode service does not work.
3. Before the DataNode service is started, the in_use.lock file does not exist. After the DataNode starts, in_use.lock is created by the DataNode process itself, which means no other process is trying to lock the data directory.
4. Per org/apache/hadoop/hdfs/server/common/Storage.java, the relevant locking code is:
public void lock() throws IOException {
  if (!useLock) {
    LOG.info("Locking is disabled");
    return;
  }
  this.lock = tryLock();
  if (lock == null) {
    String msg = "Cannot lock storage " + this.root
        + ". The directory is already locked.";
    LOG.info(msg);
    throw new IOException(msg);
  }
}

FileLock tryLock() throws IOException {
  File lockF = new File(root, STORAGE_FILE_LOCK);
  lockF.deleteOnExit();
  RandomAccessFile file = new RandomAccessFile(lockF, "rws");
  FileLock res = null;
  try {
    res = file.getChannel().tryLock();
  } catch (OverlappingFileLockException oe) {
    file.close();
    return null;
  } catch (IOException e) {
    LOG.error("Cannot create lock on " + lockF, e);
    file.close();
    throw e;
  }
  return res;
}
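The branch to note is the catch of OverlappingFileLockException in tryLock(): FileChannel.tryLock() throws it when the same JVM already holds a lock on the same underlying file, tryLock() turns that into a null return, and lock() then reports "The directory is already locked". The following standalone sketch (not Hadoop code; the class name and /tmp path are made up for illustration) reproduces that behavior by locking one file twice from a single process:

import java.io.File;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;

// Standalone demo, not Hadoop code: locking the same underlying file twice
// from one JVM throws OverlappingFileLockException, which Storage.tryLock()
// above converts into a null return.
public class DoubleLockDemo {
  public static void main(String[] args) throws Exception {
    File lockF = new File("/tmp/in_use.lock");        // stand-in for the real lock file

    RandomAccessFile first = new RandomAccessFile(lockF, "rws");
    FileLock held = first.getChannel().tryLock();     // first attempt succeeds (like /data6)
    System.out.println("first lock acquired: " + (held != null));

    RandomAccessFile second = new RandomAccessFile(lockF, "rws");
    try {
      second.getChannel().tryLock();                  // same file again (like /data8)
      System.out.println("second lock acquired");     // not reached within the same JVM
    } catch (OverlappingFileLockException oe) {
      System.out.println("second lock attempt: OverlappingFileLockException");
    } finally {
      second.close();
      if (held != null) {
        held.release();
      }
      first.close();
      lockF.delete();                                 // clean up the demo file
    }
  }
}

Running it prints that the first lock is acquired and the second attempt fails with OverlappingFileLockException, which is exactly the pattern seen in the DataNode log above.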
On that DataNode server, /data8 was mounted on the same device (/dev/sdf1) as /data6.
In other words, /data8 and /data6 are aliases for the same filesystem.
This explains the failure: the DataNode process tried to lock the same in_use.lock file twice. The first attempt (through /data6) succeeded, and the second attempt (through /data8) hit the OverlappingFileLockException branch in tryLock(), which lock() then reported as "The directory is already locked".
/dev/sdf1       1.8T  914G  826G  53% /data6
/dev/sdf1       1.8T  914G  826G  53% /data8
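To catch this kind of aliasing before the DataNode does, one option is to compare the device IDs of the configured data directories. The sketch below is a hypothetical helper, not part of Hadoop: the class name and directory list are made up for illustration, and it assumes a Unix-like platform where the unix:dev file attribute is available.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: flags data directories that
// resolve to the same underlying device (the /data6 vs. /data8 situation).
public class DuplicateMountCheck {
  public static void main(String[] args) throws Exception {
    String[] dataDirs = {                             // would normally come from the DataNode's data dir config
        "/data6/dfs/dn", "/data7/dfs/dn", "/data8/dfs/dn"
    };
    Map<Long, String> seenDevices = new HashMap<>();
    for (String dir : dataDirs) {
      Path p = Paths.get(dir);
      long dev = (Long) Files.getAttribute(p, "unix:dev");   // device ID of the filesystem holding p
      String earlier = seenDevices.putIfAbsent(dev, dir);
      if (earlier != null) {
        System.out.printf("%s and %s are on the same device (dev=%d)%n", earlier, dir, dev);
      }
    }
  }
}

Any two directories reported with the same device ID, like /data6 and /data8 here, point to the same underlying filesystem and will collide on in_use.lock.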
1. Remove /data8 from all *-site.xml configurations on the problematic DataNode server.
2. Restart that DataNode so it skips /data8.
3. Have the system administrator fix the mount point issue.
4. Recreate the needed directories on /data8.
5. Add /data8 back into all *-site.xml configurations.
6. Restart that DataNode.