Removing files in Hadoop Distributed File System does not free up space because of snapshots

Products

Services Suite

Issue/Introduction

Symptoms:
Capacity-related issues may be reported after files have been removed from Hadoop Distributed File System (HDFS) and from the Trash, yet the value of "DFS Used" has not decreased.

This can happen if the deleted files are within a directory that has a snapshot. So that the snapshot can still be restored, the files are kept in HDFS but are hidden. As a result, they continue to take up space until the snapshot is removed.

Note: There is a difference between:

Total capacity used in the HDFS file system as per "hdfs dfs -du -h /".
DFS used as per "hdfs dfsadmin -report".
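
The two values above can be compared side by side. The following is a minimal sketch, not a supported tool: the helper names dfs_used_bytes and du_bytes are hypothetical, and it assumes the report contains a "DFS Used: <bytes> (...)" line as in the example output in this article and that the first column of "hdfs dfs -du -s" is the size in bytes. Note that "DFS Used" counts every stored replica, so part of any difference may come from replication rather than snapshots.

```shell
#!/usr/bin/env bash
# Sketch: compare the size reported by "hdfs dfs -du -s /" with the
# cluster-wide "DFS Used" value from "hdfs dfsadmin -report".
# Both helpers read command output from stdin, so they can be tried on
# saved output without a running cluster.

# Print the first "DFS Used:" byte count found in dfsadmin -report output.
# The pattern does not match the "DFS Used%:" line.
dfs_used_bytes() {
  awk '/^DFS Used:/ { print $3; exit }'
}

# Print the first column of the first line of "hdfs dfs -du -s <path>" output.
# Assumption: that column is the size in bytes (no -h flag was used).
du_bytes() {
  awk 'NR == 1 { print $1 }'
}

# Usage on a live cluster (requires the hdfs client and suitable permissions):
#   used=$(hdfs dfsadmin -report | dfs_used_bytes)
#   visible=$(hdfs dfs -du -s / | du_bytes)
#   echo "DFS Used minus visible file size: $(( used - visible )) bytes"
```

A large positive difference that replication alone does not explain is a hint to check for snapshots.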

Environment

Cause

The capacity difference and lack of space are caused by snapshots. If a file is removed and there is a snapshot on one of its parent directories, the file no longer belongs to the current version of the directory. As a result, it no longer shows up in file system commands such as "hdfs dfs -du" or "hdfs dfs -ls". However, the file still takes up space in HDFS because it must be kept in case the snapshot needs to be restored.
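
To find out which directories might be holding deleted files, the snapshottable directories can be listed with "hdfs lsSnapshottableDir" and each one checked for snapshots. The following is a rough sketch: the helper name snapshottable_paths is hypothetical, and it assumes the command prints one ls-style line per directory with the path as the last field and no spaces in the path. The command only lists directories the current user is allowed to see.

```shell
#!/usr/bin/env bash
# Sketch: extract directory paths from "hdfs lsSnapshottableDir" output.

# Print the last field (assumed to be the path) of every line that starts
# with a directory permission string such as "drwxr-xr-x".
snapshottable_paths() {
  awk '$1 ~ /^d/ { print $NF }'
}

# Usage on a live cluster (requires the hdfs client; not executed here):
#   hdfs lsSnapshottableDir | snapshottable_paths | while read -r dir; do
#     echo "== snapshots under $dir"
#     hdfs dfs -ls "$dir/.snapshot"
#   done
```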

Resolution

Permanent solution

If capacity usage is high, snapshots should be removed in order to free up space.
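
When a directory has several snapshots, it can help to review the removal commands before running them. The sketch below only prints "hdfs dfs -deleteSnapshot" commands and does not execute them. The helper name print_delete_commands is hypothetical, and it assumes the "hdfs dfs -ls <dir>/.snapshot" output format shown in the example in this article, with the snapshot path as the last field.

```shell
#!/usr/bin/env bash
# Sketch: turn "hdfs dfs -ls <dir>/.snapshot" output (read from stdin) into
# one "hdfs dfs -deleteSnapshot <dir> <name>" command per snapshot.
# The commands are printed for review only; nothing is deleted.

# $1: the snapshottable directory, for example /hawq_data
print_delete_commands() {
  local dir="$1"
  awk -v dir="$dir" '
    $1 ~ /^d/ {                       # skip the "Found N items" header
      n = split($NF, parts, "/")      # snapshot name is the last path component
      printf "hdfs dfs -deleteSnapshot %s %s\n", dir, parts[n]
    }'
}

# Usage on a live cluster (requires the hdfs client; not executed here):
#   hdfs dfs -ls /hawq_data/.snapshot | print_delete_commands /hawq_data
```

Printing instead of executing is a deliberate choice here: deleting a snapshot cannot be undone, so each command should be checked before it is run.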

Example and further details

This system starts off with 980 MB of DFS Used:

-bash-4.1$ hdfs dfsadmin -report
Configured Capacity: 145835704320 (135.82 GB)
Present Capacity: 113644720128 (105.84 GB)
DFS Remaining: 112616583168 (104.88 GB)
DFS Used: 1028136960 (980.51 MB)
DFS Used%: 0.90%
Under replicated blocks: 4244
Blocks with corrupt replicas: 0
Missing blocks: 0
Datanodes available: 1 (1 total, 0 dead)
Live datanodes:
Name: 127.0.0.1:50010 (localhost)
Hostname: localhost
Decommission Status : Normal
Configured Capacity: 145835704320 (135.82 GB)
DFS Used: 1028136960 (980.51 MB)
Non DFS Used: 32190984192 (29.98 GB)
DFS Remaining: 112616583168 (104.88 GB)
DFS Used%: 0.70%
DFS Remaining%: 77.22%
Last contact: Wed Feb 10 00:28:34 CST 2016

The space used is distributed in this manner in HDFS:

-bash-4.1$ hdfs dfs -du -h /
 4.5 K /apps
536.5 M /hawq_data
0 /hive
0 /mapred
286.4 M /retail_demo
0 /tmp
108.2 M /user
7.1 M /yarn
-bash-4.1$

Observe that there is a snapshot on the /hawq_data/ directory:

-bash-4.1$ hdfs dfs -ls /hawq_data/.snapshot/
Found 1 items
drwxr-xr-x - gpadmin hadoop 0 2016-02-10 00:07 /hawq_data/.snapshot/s20160210-000709.684
-bash-4.1$

The /hawq_data/gpseg0/ directory is deleted and then removed from the Trash:

-bash-4.1$ hdfs dfs -rm -R /hawq_data/gpseg0/
16/02/10 00:31:25 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 86400000 minutes, Emptier interval = 0 minutes.
Moved: 'hdfs://pivhdsne.localdomain:8020/hawq_data/gpseg0' to trash at: hdfs://pivhdsne.localdomain:8020/user/hdfs/.Trash/Current
-bash-4.1$ hdfs dfs -rm -R /user/hdfs/.Trash/Current
16/02/10 00:31:55 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 86400000 minutes, Emptier interval = 0 minutes.
Deleted /user/hdfs/.Trash/Current

The space used by /hawq_data/ now shows up as 0 MB:

hdfs://pivhdsne.localdomain:8020 135.8 G 980.5 M 104.9 G 1%
-bash-4.1$ hdfs dfs -du -h /
4.5 K /apps
0 /hawq_data
0 /hive
0 /mapred
286.4 M /retail_demo
0 /tmp
108.2 M /user
7.1 M /yarn

However, the space used in DFS is still 980 MB:

-bash-4.1$ hdfs dfsadmin -report
Configured Capacity: 145835704320 (135.82 GB)
Present Capacity: 113644474368 (105.84 GB)
DFS Remaining: 112616337408 (104.88 GB)
DFS Used: 1028136960 (980.51 MB)
DFS Used%: 0.90%
Under replicated blocks: 4244
Blocks with corrupt replicas: 0
Missing blocks: 0
Datanodes available: 1 (1 total, 0 dead)
Live datanodes:
Name: 127.0.0.1:50010 (localhost)
Hostname: localhost
Decommission Status : Normal
Configured Capacity: 145835704320 (135.82 GB)
DFS Used: 1028136960 (980.51 MB)
Non DFS Used: 32191229952 (29.98 GB)
DFS Remaining: 112616337408 (104.88 GB)
DFS Used%: 0.70%
DFS Remaining%: 77.22%
Last contact: Wed Feb 10 00:32:22 CST 2016

Only after the snapshot is removed does "DFS Used" go down and the space become available again:

-bash-4.1$ hdfs dfs -deleteSnapshot /hawq_data/ s20160210-000709.684
-bash-4.1$ hdfs dfs -ls /hawq_data/.snapshot/

-bash-4.1$ hdfs dfsadmin -report
Configured Capacity: 145835704320 (135.82 GB)
Present Capacity: 114780148015 (106.90 GB)
DFS Remaining: 114319011840 (106.47 GB)
DFS Used: 461136175 (439.77 MB)
DFS Used%: 0.40%
Under replicated blocks: 4229
Blocks with corrupt replicas: 0
Missing blocks: 0
Datanodes available: 1 (1 total, 0 dead)
Live datanodes:
Name: 127.0.0.1:50010 (localhost)
Hostname: localhost
Decommission Status : Normal
Configured Capacity: 145835704320 (135.82 GB)
DFS Used: 461136175 (439.77 MB)
Non DFS Used: 31055556305 (28.92 GB)
DFS Remaining: 114319011840 (106.47 GB)
DFS Used%: 0.32%
DFS Remaining%: 78.39%
Last contact: Wed Feb 10 00:38:04 CST 2016
-bash-4.1$
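
As a quick check of the numbers in this example, the space released by deleting the snapshot can be computed from the two "DFS Used" byte counts reported above. This is plain shell arithmetic on the values from the reports; the variable names are illustrative only.

```shell
#!/usr/bin/env bash
# "DFS Used" in bytes, copied from the dfsadmin reports in this example.
before=1028136960   # before the snapshot was deleted
after=461136175     # after the snapshot was deleted

freed=$(( before - after ))
echo "$freed bytes freed"                                      # 567000785 bytes freed
awk -v b="$freed" 'BEGIN { printf "%.2f MB\n", b / 1048576 }'  # 540.73 MB
```

The result is roughly in line with the 536.5 M that /hawq_data occupied in the first "hdfs dfs -du -h /" listing.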