The NameNode data files store metadata about the Hadoop cluster. These files live in the directory pointed to by the "dfs.name.dir" parameter (configuration file "hdfs-site.xml"). The files that store the metadata information are "fsimage", "version", and "edits".
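To see where that parameter points, you can read it straight out of "hdfs-site.xml". The sketch below runs against a sample file written to /tmp; the path /data/nn/dfs/name is illustrative only, not from a real cluster. (On newer Hadoop releases, `hdfs getconf -confKey dfs.name.dir` reports the same value on a live cluster.)

```shell
# Illustrative hdfs-site.xml fragment; the value is a made-up example path.
cat > /tmp/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/data/nn/dfs/name</value>
  </property>
</configuration>
EOF

# Print the <value> line that follows the dfs.name.dir property name.
sed -n '/<name>dfs.name.dir<\/name>/{n;s/.*<value>\(.*\)<\/value>.*/\1/p}' /tmp/hdfs-site.xml
```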
The NameNode "fsimage" file contains the image of the NameNode file catalog. This catalog contains the metadata about the files stored in the Hadoop HDFS file system.
Except when the NameNode is starting up, the "fsimage" file is never updated. It remains a static file even as files, directories, permissions, etc. are changed in HDFS.
Now you should be thinking: what if the NameNode process suddenly quits!?
This is where the "edits" file comes in. Whenever there is a change to HDFS, the NameNode updates its in-memory objects with the new information and appends the change to the "edits" file, otherwise known as the HDFS journal log.
There are only two ways to merge the HDFS changes from the "edits" file into the static "fsimage" binary file. The first is during NameNode startup: the NameNode reads both the "fsimage" file and the "edits" file into memory, performs a merge operation that applies the journaled changes, and once the merge completes overwrites the existing "fsimage" file with the updated one. The other method is the periodic checkpoint mechanism, managed by the Secondary NameNode service, which executes once every hour by default.
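The one-hour checkpoint interval is configurable. As a sketch (the exact property name and file depend on the Hadoop release): older releases use "fs.checkpoint.period" in "core-site.xml", while newer releases use "dfs.namenode.checkpoint.period" in "hdfs-site.xml"; both default to 3600 seconds.

```xml
<!-- core-site.xml on older Hadoop releases; newer releases use
     dfs.namenode.checkpoint.period in hdfs-site.xml instead -->
<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value> <!-- seconds between Secondary NameNode checkpoints -->
</property>
```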
The "version" file records the current version of HDFS since the last NameNode format. If you ever see a DataNode fail to register with HDFS, check that the HDFS version on the DataNode matches the version on the NameNode.
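The VERSION file is a plain Java properties file, so the comparison is easy to script. The sketch below fabricates two sample VERSION files under /tmp (all IDs and values are illustrative, not from a real cluster) and checks the field a DataNode must agree on, the namespaceID:

```shell
# Illustrative NameNode-side VERSION file (values are made up).
cat > /tmp/nn_VERSION <<'EOF'
namespaceID=1631224064
cTime=0
storageType=NAME_NODE
layoutVersion=-40
EOF

# Illustrative DataNode-side VERSION file.
cat > /tmp/dn_VERSION <<'EOF'
namespaceID=1631224064
storageID=DS-1185589278
cTime=0
storageType=DATA_NODE
layoutVersion=-40
EOF

# A DataNode can only register if its namespaceID matches the NameNode's.
nn_id=$(grep '^namespaceID=' /tmp/nn_VERSION | cut -d= -f2)
dn_id=$(grep '^namespaceID=' /tmp/dn_VERSION | cut -d= -f2)
[ "$nn_id" = "$dn_id" ] && echo "namespaceID match" || echo "namespaceID mismatch"
```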
A typical DataNode will have individual drives that essentially make up a JBOD configuration. Linux creates a device for each disk, and a file system such as XFS or ext3 is created on it. When writing files to HDFS, the NameNode decides which blocks go to which DataNode. The blocks are stored as files within a directory structure on top of the existing file system layer (XFS, for example). With root access to the DataNode you can list the files and observe how they are created. Listing the directory specified by the "dfs.data.dir" parameter (configuration file "hdfs-site.xml") would look something like the following:
hadoop@hdw1:/data/dfs/data/current$ ls -l
total 24
-rw-rw-r-- 1 hadoop hadoop 64M  Nov 17 21:06 blk_-7586700455251598184
-rw-rw-r-- 1 hadoop hadoop 513K Nov 17 21:06 blk_-7586700455251598184_1015.meta
-rw-rw-r-- 1 hadoop hadoop 51M  Nov 17 21:06 blk_-7878794766348260704
-rw-rw-r-- 1 hadoop hadoop 408K Nov 17 21:06 blk_-7878794766348260704_1015.meta
In this example the HDFS block size is set to 64MB (see the sizes of the files above). Each data block is stored in a file named "blk_<64-bit block ID>". The corresponding metadata is stored in a file named "blk_<64-bit block ID>_<generation_stamp>.meta". The generation stamp tells whether the data block was created before or after the last NameNode "fsimage" checkpoint: if before, this number is less than the one stored in the "fsimage" binary file. The meta file contains a checksum of the data block. If the checksum does not match during a read operation, an error is returned to the HDFS client.
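That naming convention can be pulled apart with plain shell parameter expansion. The sketch below uses a meta file name taken from the example listing above and splits it into its block ID and generation stamp:

```shell
# Meta file name from the example DataNode listing above.
meta=blk_-7586700455251598184_1015.meta

base=${base:-${meta%.meta}}      # strip the .meta suffix
genstamp=${base##*_}             # text after the last underscore: generation stamp
blockid=${base#blk_}             # drop the blk_ prefix ...
blockid=${blockid%_"$genstamp"}  # ... and the trailing _<generation stamp>

echo "block ID: $blockid"            # -7586700455251598184
echo "generation stamp: $genstamp"   # 1015
```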
[gpadmin@hdm1 current]$ pwd
/data/nn/dfs/name/current
[gpadmin@hdm1 current]$ ls -l
-rw-r--r-- 1 hdfs hadoop 1048576 Feb 11 14:38 edits_inprogress_0000000000000002473
-rw-r--r-- 1 hdfs hadoop   24803 Feb 11 14:38 fsimage_0000000000000002472
-rw-r--r-- 1 hdfs hadoop      62 Feb 11 14:38 fsimage_0000000000000002472.md5
[gpadmin@hdm1 current]$ hdfs oiv -i fsimage_0000000000000002472 -o /tmp/fsimage.out
[gpadmin@hdm1 current]$ hdfs oev -i edits_inprogress_0000000000000002473 -o /tmp/edits.out