Pivotal HD: replication does not succeed to a DataNode which has a block with an old generation timestamp

Article ID: 294617


Products

Services Suite

Issue/Introduction

Symptoms:

If the number of DataNodes in your cluster is less than or equal to the replication factor for a file on HDFS, a corrupt or stale replica of a block in that file is not automatically fixed and requires manual intervention: the NameNode has no spare DataNode on which to place a fresh replica, and it will not overwrite the bad copy in place.

Until this manual fix is applied, the block is marked as under-replicated and messages similar to the following appear in the NameNode logs (the entry is normally repeated every few seconds until the problem is resolved):

2014-04-30 10:57:24,858 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 3
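
To identify which file owns the under-replicated block, hdfs fsck can be run against the affected path and filtered for under-replicated entries. This is a minimal sketch, assuming the HDFS client is configured on the admin host and the hdfs superuser can be used via sudo; the exact output wording can vary between Hadoop versions, but it should look similar to the second line below:

[gpadmin@hdm1 ~]$ sudo -u hdfs hdfs fsck / -files -blocks -locations | grep "Under replicated"
/hawq_data/gpseg1/16385/16522/28311.1:  Under replicated BP-2083006907-192.165.10.1-1392999006690:blk_5582039430147844965_6880. Target Replicas is 3 but found 2 replica(s).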

In the example below, we have 3 DataNodes and replication factor 3 (the default).

Block blk_5582039430147844965 has a replica with generation timestamp 6872 on host hdw1.dca, which is older than the current/latest replica (generation timestamp 6880) on the other two hosts, hdw2.dca and hdw3.dca. This happened because the DataNode service was down on hdw1.dca while the block was being updated.

[gpadmin@hdm1 ~]$ gpssh -f ~/hostfile_seg "ls -l /data/*/dfs/data/current/BP-2083006907-192.165.10.1-1392999006690/current/finalized/subdir*/blk_5582039430147844965*"
[hdw3.dca] -rw-r--r-- 1 hdfs hadoop 21253104 Apr 30 15:24 /data/1/dfs/data/current/BP-2083006907-192.165.10.1-1392999006690/current/finalized/subdir63/blk_5582039430147844965
[hdw3.dca] -rw-r--r-- 1 hdfs hadoop   166047 Apr 30 15:24 /data/1/dfs/data/current/BP-2083006907-192.165.10.1-1392999006690/current/finalized/subdir63/blk_5582039430147844965_6880.meta
[hdw2.dca] -rw-r--r-- 1 hdfs hadoop 21253104 Apr 30 15:24 /data/1/dfs/data/current/BP-2083006907-192.165.10.1-1392999006690/current/finalized/subdir29/blk_5582039430147844965
[hdw2.dca] -rw-r--r-- 1 hdfs hadoop   166047 Apr 30 15:24 /data/1/dfs/data/current/BP-2083006907-192.165.10.1-1392999006690/current/finalized/subdir29/blk_5582039430147844965_6880.meta
[hdw1.dca] -rw-r--r-- 1 hdfs hadoop 21253072 Apr 30 13:59 /data/3/dfs/data/current/BP-2083006907-192.165.10.1-1392999006690/current/finalized/subdir41/blk_5582039430147844965
[hdw1.dca] -rw-r--r-- 1 hdfs hadoop   166047 Apr 30 13:59 /data/3/dfs/data/current/BP-2083006907-192.165.10.1-1392999006690/current/finalized/subdir41/blk_5582039430147844965_6872.meta
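
Note that the generation timestamp of each replica is embedded in the suffix of its .meta file name: 6880 on hdw2.dca and hdw3.dca versus 6872 on hdw1.dca. Listing only the .meta files is therefore a quick way to spot a stale copy across the cluster, reusing the same gpssh host file:

[gpadmin@hdm1 ~]$ gpssh -f ~/hostfile_seg "ls /data/*/dfs/data/current/BP-2083006907-192.165.10.1-1392999006690/current/finalized/subdir*/blk_5582039430147844965_*.meta"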

These events are found in the NameNode (NN) logs:

2014-04-30 10:41:20,941 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(newblock=BP-2083006907-192.165.10.1-1392999006690:blk_5582039430147844965_6877, file=/hawq_data/gpseg1/16385/16522/28311.1, newgenerationstamp=6880, newlength=21253104, newtargets=[192.165.10.3:50010, 192.165.10.4:50010]) successful
[...]
2014-04-30 10:57:21,111 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_5582039430147844965 added as corrupt on 192.165.10.2:50010 by hdw1.dca/192.165.10.2 because block is COMPLETE and reported genstamp 6872 does not match genstamp in block map 6880
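
These entries can be located on a live system by searching the active NameNode log for the block ID. The log path below assumes a default Pivotal HD layout and is an assumption; adjust it to your installation:

[gpadmin@hdm1 ~]$ grep blk_5582039430147844965 /var/log/gphd/hadoop-hdfs/hadoop-hdfs-namenode-*.log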

Environment


Cause

This is a known issue, HDFS-3493 (https://issues.apache.org/jira/browse/HDFS-3493), for which no fix is available yet.

Resolution

The corrupt block and its related .meta file need to be manually deleted or moved to another location, as in the sketch below. The block will eventually be replicated back to that DataNode from a good copy. To accelerate the recovery, the DataNode service can be restarted.
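
A minimal sketch of this manual step, run on the affected DataNode (hdw1.dca in this example): the block and .meta paths come from the listing in the Symptoms section, the backup directory is arbitrary, and the service name assumes standard Pivotal HD packaging (adjust if yours differs):

[gpadmin@hdw1 ~]$ sudo mkdir -p /tmp/stale_blocks
[gpadmin@hdw1 ~]$ sudo mv /data/3/dfs/data/current/BP-2083006907-192.165.10.1-1392999006690/current/finalized/subdir41/blk_5582039430147844965 \
    /data/3/dfs/data/current/BP-2083006907-192.165.10.1-1392999006690/current/finalized/subdir41/blk_5582039430147844965_6872.meta \
    /tmp/stale_blocks/
# Optional: restart the DataNode service to speed up re-replication
[gpadmin@hdw1 ~]$ sudo service hadoop-hdfs-datanode restart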

You will then see something like this in the NN logs:

2014-04-30 22:29:49,005 INFO BlockStateChange: BLOCK* ask 192.165.10.3:50010 to replicate blk_5582039430147844965_6880 to datanode(s) 192.165.10.2:50010
2014-04-30 22:29:49,788 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 192.165.10.2:50010 is added to blk_5582039430147844965_6880 size 21253104
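
Once the new replica has been added, the gpssh listing from the Symptoms section can be re-run to confirm that all three hosts now hold the block together with a _6880.meta file:

[gpadmin@hdm1 ~]$ gpssh -f ~/hostfile_seg "ls -l /data/*/dfs/data/current/BP-2083006907-192.165.10.1-1392999006690/current/finalized/subdir*/blk_5582039430147844965*"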