The DataNode process starts, but it cannot send heartbeats to the NameNode, so the NameNode marks it as dead.
The following output is from the DataNode log.
INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
INFO org.apache.hadoop.ipc.Server: IPC Server listener on 50020: starting
INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data1/dfs/dn/dfs/data/in_use.lock acquired by nodename [email protected]
INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data2/dfs/dn/dfs/data/in_use.lock acquired by nodename [email protected]
INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data3/dfs/dn/dfs/data/in_use.lock acquired by nodename [email protected]
INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data4/dfs/dn/dfs/data/in_use.lock acquired by nodename [email protected]
INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data5/dfs/dn/dfs/data/in_use.lock acquired by nodename [email protected]
INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data6/dfs/dn/dfs/data/in_use.lock acquired by nodename [email protected]
INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data7/dfs/dn/dfs/data/in_use.lock acquired by nodename [email protected]
ERROR org.apache.hadoop.hdfs.server.common.Storage: It appears that another namenode [email protected] has already locked the storage directory
INFO org.apache.hadoop.hdfs.server.common.Storage: Cannot lock storage /data8/dfs/dn/dfs/data. The directory is already locked
WARN org.apache.hadoop.hdfs.server.common.Storage: Ignoring storage directory /data8/dfs/dn/dfs/data due to an exception
java.io.IOException: Cannot lock storage /data8/dfs/dn/dfs/data. The directory is already locked
        at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:636)
        at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:459)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:152)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:219)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:848)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:819)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
        at java.lang.Thread.run(Thread.java:744)
INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data9/dfs/dn/dfs/data/in_use.lock acquired by nodename [email protected]
INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /data10/dfs/dn/dfs/data/in_use.lock acquired by nodename [email protected]
By default, dfs.datanode.failed.volumes.tolerated is set to 0, so even a single failed volume aborts block pool initialization and produces the following error messages:
FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-xxx-xxx.xxx.x.x-xxxxxxx (storage id DS-xxx-192.168.xxx.x-xxxxx-xxxxxxx) service to namenode.VIADEA.INFO/192.168.xxx.2:8020
org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 9, volumes configured: 10, volumes failed: 1, volume failures tolerated: 0
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.<init>(FsDatasetImpl.java:186)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetFactory.newInstance(FsDatasetFactory.java:34)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetFactory.newInstance(FsDatasetFactory.java:30)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:857)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:819)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
        at java.lang.Thread.run(Thread.java:744)
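For reference, this threshold is controlled by dfs.datanode.failed.volumes.tolerated, normally set in hdfs-site.xml. A value of 1, as in the illustrative snippet below, would have allowed the DataNode to start with one bad volume, but it would only mask the underlying problem on this node rather than fix it:

<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <!-- number of volumes allowed to fail before the DataNode refuses to start; default is 0 -->
  <value>1</value>
</property>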
1. At first glance it looks as if some orphan process is holding the lock on /data8/dfs/dn/dfs/data. However, fuser /data8 and lsof | grep /data8 show nothing.
2. Restarting that problematic DataNode service does not work.
3. Before the DataNode service starts, the in_use.lock file does not exist; after startup, in_use.lock is created by the DataNode process itself. This means no other process is trying to lock the data directory.
4. Per /org/apache/hadoop/hdfs/server/common/Storage.java:
public void lock() throws IOException {
  if (!useLock) {
    LOG.info("Locking is disabled");
    return;
  }
  this.lock = tryLock();
  if (lock == null) {
    String msg = "Cannot lock storage " + this.root + ". The directory is already locked.";
    LOG.info(msg);
    throw new IOException(msg);
  }
}

FileLock tryLock() throws IOException {
  File lockF = new File(root, STORAGE_FILE_LOCK);
  lockF.deleteOnExit();
  RandomAccessFile file = new RandomAccessFile(lockF, "rws");
  FileLock res = null;
  try {
    res = file.getChannel().tryLock();
  } catch(OverlappingFileLockException oe) {
    file.close();
    return null;
  } catch(IOException e) {
    LOG.error("Cannot create lock on " + lockF, e);
    file.close();
    throw e;
  }
  return res;
}
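The key detail is that tryLock() swallows OverlappingFileLockException and returns null, which lock() then turns into the "Cannot lock storage" IOException. The JVM raises OverlappingFileLockException when the same process already holds a lock on the same underlying file, even if it reached that file through a different path. Below is a minimal standalone sketch (the file names are made up for the demo) that reproduces this by hard-linking two paths to one inode:

import java.io.RandomAccessFile;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DuplicateLockDemo {
    public static void main(String[] args) throws Exception {
        // Two paths, one inode: the hard link stands in for the duplicate mount,
        // where /data6/.../in_use.lock and /data8/.../in_use.lock were the same file.
        Path original = Paths.get("in_use.lock.demo");
        Path alias = Paths.get("in_use.lock.alias");
        Files.deleteIfExists(alias);
        Files.deleteIfExists(original);
        Files.createFile(original);
        Files.createLink(alias, original);

        RandomAccessFile first = new RandomAccessFile(original.toFile(), "rws");
        RandomAccessFile second = new RandomAccessFile(alias.toFile(), "rws");
        FileLock held = first.getChannel().tryLock();
        System.out.println("first path locked: " + (held != null));
        try {
            second.getChannel().tryLock();
            System.out.println("second path locked too (unexpected)");
        } catch (OverlappingFileLockException e) {
            // The JVM keys file locks by the underlying file (device + inode on Linux),
            // so the alias is recognized as already locked by this very process.
            // Storage.tryLock() catches this and returns null -> "Cannot lock storage".
            System.out.println("second path refused: directory is already locked");
        } finally {
            held.release();
            first.close();
            second.close();
            Files.deleteIfExists(alias);
            Files.deleteIfExists(original);
        }
    }
}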
On that DataNode server, /data8 was mounted on the same device (/dev/sdf1) as /data6.
In other words, /data8 and /data6 are aliases for each other.
This explains why the DataNode process was trying to lock the same storage directory twice: once through /data6 and again through /data8.
/dev/sdf1             1.8T  914G  826G  53% /data6
/dev/sdf1             1.8T  914G  826G  53% /data8
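Duplicate mounts like this can be caught up front by mapping every configured data directory to its backing device. The rough sketch below assumes the /dataN layout from this node and relies on java.nio's FileStore, which on Linux reports the mount's device name:

import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class DuplicateMountCheck {
    public static void main(String[] args) throws Exception {
        // Mirror of the dfs.datanode.data.dir entries on the affected node; adjust as needed.
        String[] dataDirs = {
            "/data1/dfs/dn", "/data2/dfs/dn", "/data3/dfs/dn", "/data4/dfs/dn",
            "/data5/dfs/dn", "/data6/dfs/dn", "/data7/dfs/dn", "/data8/dfs/dn",
            "/data9/dfs/dn", "/data10/dfs/dn"
        };
        Map<String, String> deviceToDir = new HashMap<>();
        for (String dir : dataDirs) {
            FileStore store = Files.getFileStore(Paths.get(dir));
            String device = store.name();                 // e.g. /dev/sdf1 on Linux
            String previous = deviceToDir.put(device, dir);
            if (previous != null) {
                // Two data directories on one device means one of them is an alias.
                System.out.printf("Duplicate mount: %s and %s are both on %s%n",
                        previous, dir, device);
            }
        }
    }
}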
1. Remove /data8 from all *-site.xml configurations on the problematic DataNode server.
2. Restart that DataNode so it skips /data8.
3. Have the SysAdmin fix the mount point so that /data8 is mounted on its own device (a small verification sketch follows this list).
4. Recreate the needed DataNode directories on /data8.
5. Add /data8 back into all *-site.xml configurations.
6. Restart that DataNode.
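Once the SysAdmin has remounted /data8 (step 3), checking that the two mount points no longer share a device and recreating the DataNode directory for step 4 can be scripted. The sketch below is illustrative only: it assumes the DataNode runs as an "hdfs" service user and that the program runs with enough privilege to change ownership; in practice a simple mkdir and chown works just as well.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.UserPrincipal;

public class VerifyData8Fix {
    public static void main(String[] args) throws Exception {
        String data6Device = Files.getFileStore(Paths.get("/data6")).name();
        String data8Device = Files.getFileStore(Paths.get("/data8")).name();
        if (data6Device.equals(data8Device)) {
            // Still the same device: do not add /data8 back to the configuration yet.
            throw new IllegalStateException("/data8 still aliases /data6 on " + data6Device);
        }
        // Recreate the DataNode directory tree on the fixed mount (step 4).
        Path dataDir = Paths.get("/data8/dfs/dn");
        Files.createDirectories(dataDir);
        // "hdfs" is an assumed service user; use whatever account runs the DataNode.
        UserPrincipal owner = dataDir.getFileSystem()
                .getUserPrincipalLookupService()
                .lookupPrincipalByName("hdfs");
        Files.setOwner(dataDir, owner);
        System.out.println("/data8 is now on " + data8Device + " and " + dataDir + " is ready");
    }
}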