DFSIO is part of the Hadoop distribution and can be found in "hadoop-mapreduce-client-jobclient-*-tests.jar" for MR2. There are two types of DFSIO tools, but this article discusses TestDFSIO only. TestDFSIO is a distributed I/O benchmark tool, as its help output below describes. You can pass several options to the tool.
Option | Description |
---|---|
-read | Must be run after executing TestDFSIO with the -write option. Reads the generated data from hdfs://namenode:8020/benchmarks/TestDFSIO/io_data |
-write | Generates data based on the file options and writes it to hdfs://namenode:8020/benchmarks/TestDFSIO/io_data |
-append | Can be run after executing TestDFSIO with the -write option. Appends to the existing data under hdfs://namenode:8020/benchmarks/TestDFSIO/io_data |
-clean | Removes all data generated by TestDFSIO under hdfs://namenode:8020/benchmarks/TestDFSIO/io_data |
-nrFiles | Number of files to generate within HDFS. This is also equivalent to the number of map tasks that will be executed. |
-fileSize | Generates a file of this size for each map task (-nrFiles). For example, -nrFiles 10 -fileSize 250GB generates 2500GB of data in the HDFS cluster |
-resFile | Path on the local file system (not HDFS) where TestDFSIO stores the results of the test |
-bufferSize | Buffer size in bytes used by each map task when reading and writing. Defaults to 1000000 bytes (about 1MB) |
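As a rough planning aid, the amount of data a run will generate can be estimated from -nrFiles and -fileSize before the job is launched. The sketch below is illustrative only: the helper names `parse_size` and `total_bytes` are hypothetical and are not part of TestDFSIO. It assumes the size suffixes are binary multiples of 1024, which agrees with the l:size values reported by the runs in this article, and it rejects sizes without a suffix rather than guessing a default unit.

```python
import re

# Binary multipliers; an assumption that matches the l:size values
# reported by the runs in this article (for example 64 x 16GB = 2**40 bytes).
UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_size(text):
    """Convert a size such as '250GB' to bytes (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)(B|KB|MB|GB|TB)", text.strip().upper())
    if match is None:
        raise ValueError(f"unsupported size: {text!r}")
    number, unit = match.groups()
    return int(number) * UNITS[unit]


def total_bytes(nr_files, file_size):
    """Logical data volume: one file of file_size per map task."""
    return nr_files * parse_size(file_size)


if __name__ == "__main__":
    for nr_files, size in [(64, "16GB"), (4, "250GB"), (10, "250GB")]:
        total = total_bytes(nr_files, size)
        print(f"-nrFiles {nr_files} -fileSize {size}: "
              f"{total} bytes ({total // 1024**3} GB)")
```

The figure is the logical size of the generated files only; it does not account for how much raw disk the cluster uses to store them.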
[gpadmin@hdm3 ~]$ hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.0.2-alpha-gphd-2.0.1.0-tests.jar TestDFSIO -help
13/08/21 11:27:41 INFO fs.TestDFSIO: TestDFSIO.0.0.6
Illegal argument: -help
Usage: TestDFSIO [genericOptions] -read | -write | -append | -clean [-nrFiles N] [-fileSize Size[B|KB|MB|GB|TB]] [-resFile resultFileName] [-bufferSize Bytes] [-rootDir]

Other available benchmark tools:

[gpadmin@hdm3 ~]$ hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar
An example program must be given as the first argument.
Valid program names are:
  DFSCIOTest: Distributed i/o benchmark of libhdfs.
  DistributedFSCheck: Distributed checkup of the file system consistency.
  JHLogAnalyzer: Job History Log analyzer.
  MRReliabilityTest: A program that tests the reliability of the MR framework by injecting faults/failures
  SliveTest: HDFS Stress Test and Live Data Verification.
  TestDFSIO: Distributed i/o benchmark.
  fail: a job that always fails
  filebench: Benchmark SequenceFile(Input|Output)Format (block,record compressed and uncompressed), Text(Input|Output)Format (compressed and uncompressed)
  loadgen: Generic map/reduce load generator
  mapredtest: A map/reduce test check.
  minicluster: Single process HDFS and MR cluster.
  mrbench: A map/reduce benchmark that can create many small jobs
  nnbench: A benchmark that stresses the namenode.
  sleep: A job that sleeps at each map and reduce task.
  testbigmapoutput: A map/reduce program that works on a very big non-splittable file and does identity map/reduce
  testfilesystem: A test for FileSystem read/write.
  testmapredsort: A map/reduce program that validates the map-reduce framework's sort.
  testsequencefile: A test for flat files of binary key value pairs.
  testsequencefileinputformat: A test for sequence file input format.
  testtextinputformat: A test for text input format.
  threadedmapbench: A map/reduce benchmark that compares the performance of maps with multiple spills over maps with 1 spill
Running the Write Job
hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.0.2-alpha-gphd-2.0.1.0-tests.jar TestDFSIO -write -nrFiles 64 -fileSize 16GB -resFile /tmp/TestDFSIOwrite.txt
13/08/21 10:56:45 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
13/08/21 10:56:45 INFO fs.TestDFSIO: Date & time: Wed Aug 21 10:56:45 PDT 2013
13/08/21 10:56:45 INFO fs.TestDFSIO: Number of files: 64
13/08/21 10:56:45 INFO fs.TestDFSIO: Total MBytes processed: 1048576.0
13/08/21 10:56:45 INFO fs.TestDFSIO: Throughput mb/sec: 23.046824301966463
13/08/21 10:56:45 INFO fs.TestDFSIO: Average IO rate mb/sec: 23.143465042114258
13/08/21 10:56:45 INFO fs.TestDFSIO: IO rate std deviation: 1.5490700854356283
13/08/21 10:56:45 INFO fs.TestDFSIO: Test exec time sec: 796.676
13/08/21 10:56:45 INFO fs.TestDFSIO:
[gpadmin@hdm3 ~]$ hdfs dfs -cat /benchmarks/TestDFSIO/io_write/part*
f:rate 1481181.8
f:sqrate 3.4433252E7
l:size 1099511627776
l:tasks 64
l:time 45497635

hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.0.2-alpha-gphd-2.0.1.0-tests.jar TestDFSIO -read -nrFiles 64 -fileSize 16GB -resFile /tmp/TestDFSIOwrite.txt
13/08/21 11:03:45 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
13/08/21 11:03:45 INFO fs.TestDFSIO: Date & time: Wed Aug 21 11:03:45 PDT 2013
13/08/21 11:03:45 INFO fs.TestDFSIO: Number of files: 64
13/08/21 11:03:45 INFO fs.TestDFSIO: Total MBytes processed: 1048576.0
13/08/21 11:03:45 INFO fs.TestDFSIO: Throughput mb/sec: 46.94650035960607
13/08/21 11:03:45 INFO fs.TestDFSIO: Average IO rate mb/sec: 47.33715057373047
13/08/21 11:03:45 INFO fs.TestDFSIO: IO rate std deviation: 4.734873712739776
13/08/21 11:03:45 INFO fs.TestDFSIO: Test exec time sec: 414.219
13/08/21 11:03:45 INFO fs.TestDFSIO:
[gpadmin@hdm3 ~]$ hdfs dfs -cat /benchmarks/TestDFSIO/io_write/part*
f:rate 1481181.8
f:sqrate 3.4433252E7
l:size 1099511627776
l:tasks 64
l:time 45497635

hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.0.2-alpha-gphd-2.0.1.0-tests.jar TestDFSIO -write -nrFiles 4 -fileSize 250GB -resFile /tmp/TestDFSIOwrite.txt
13/08/20 23:17:38 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
13/08/20 23:17:38 INFO fs.TestDFSIO: Date & time: Tue Aug 20 23:17:38 PDT 2013
13/08/20 23:17:38 INFO fs.TestDFSIO: Number of files: 4
13/08/20 23:17:38 INFO fs.TestDFSIO: Total MBytes processed: 1024000.0
13/08/20 23:17:38 INFO fs.TestDFSIO: Throughput mb/sec: 161.73416935999862
13/08/20 23:17:38 INFO fs.TestDFSIO: Average IO rate mb/sec: 161.75624084472656
13/08/20 23:17:38 INFO fs.TestDFSIO: IO rate std deviation: 1.8999879033336318
13/08/20 23:17:38 INFO fs.TestDFSIO: Test exec time sec: 1603.932
13/08/20 23:17:38 INFO fs.TestDFSIO:
[gpadmin@hdm3 ~]$ hdfs dfs -cat /benchmarks/TestDFSIO/io_write/part*
f:rate 647024.94
f:sqrate 1.04674768E8
l:size 1073741824000
l:tasks 4
l:time 6331377

hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.0.2-alpha-gphd-2.0.1.0-tests.jar TestDFSIO -read -nrFiles 4 -fileSize 250GB -resFile /tmp/TestDFSIOwrite.txt
13/08/21 09:40:12 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
13/08/21 09:40:12 INFO fs.TestDFSIO: Date & time: Wed Aug 21 09:40:12 PDT 2013
13/08/21 09:40:12 INFO fs.TestDFSIO: Number of files: 4
13/08/21 09:40:12 INFO fs.TestDFSIO: Total MBytes processed: 1024000.0
13/08/21 09:40:12 INFO fs.TestDFSIO: Throughput mb/sec: 122.51965010589454
13/08/21 09:40:12 INFO fs.TestDFSIO: Average IO rate mb/sec: 122.5361557006836
13/08/21 09:40:12 INFO fs.TestDFSIO: IO rate std deviation: 1.4152211082822392
13/08/21 09:40:12 INFO fs.TestDFSIO: Test exec time sec: 2141.713
13/08/21 09:40:12 INFO fs.TestDFSIO:
[gpadmin@hdm3 ~]$ hdfs dfs -cat /benchmarks/TestDFSIO/io_write/part*
f:rate 647024.94
f:sqrate 1.04674768E8
l:size 1073741824000
l:tasks 4
l:time 6331377
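To compare runs side by side, the numeric summary lines can be pulled into a dictionary. The following is a minimal sketch, assuming the summary uses the "label: value" layout shown in the log output above; the function name `parse_summary` is hypothetical, and the sample text is copied from the 64-file write run.

```python
import re

# Labels are taken from the summary lines shown in this article. The pattern
# works with or without the "13/08/21 ... INFO fs.TestDFSIO:" log prefix.
FIELD = re.compile(
    r"(Number of files|Total MBytes processed|Throughput mb/sec|"
    r"Average IO rate mb/sec|IO rate std deviation|Test exec time sec)"
    r":\s*([0-9.]+)"
)


def parse_summary(text):
    """Return {label: float} for every numeric summary line found."""
    return {label: float(value) for label, value in FIELD.findall(text)}


SAMPLE = """\
13/08/21 10:56:45 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
13/08/21 10:56:45 INFO fs.TestDFSIO: Date & time: Wed Aug 21 10:56:45 PDT 2013
13/08/21 10:56:45 INFO fs.TestDFSIO: Number of files: 64
13/08/21 10:56:45 INFO fs.TestDFSIO: Total MBytes processed: 1048576.0
13/08/21 10:56:45 INFO fs.TestDFSIO: Throughput mb/sec: 23.046824301966463
13/08/21 10:56:45 INFO fs.TestDFSIO: Average IO rate mb/sec: 23.143465042114258
13/08/21 10:56:45 INFO fs.TestDFSIO: IO rate std deviation: 1.5490700854356283
13/08/21 10:56:45 INFO fs.TestDFSIO: Test exec time sec: 796.676
"""

if __name__ == "__main__":
    for label, value in parse_summary(SAMPLE).items():
        print(f"{label}: {value}")
```

The same function could be pointed at the contents of the file given to -resFile, provided that file uses the same labels.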
Code from TestDFSIO.java
double med = rate / 1000 / tasks;
double stdDev = Math.sqrt(Math.abs(sqrate / 1000 / tasks - med*med));
String resultLines[] = {
  "----- TestDFSIO ----- : " + testType,
  " Date & time: " + new Date(System.currentTimeMillis()),
  " Number of files: " + tasks,
  "Total MBytes processed: " + toMB(size),
  " Throughput mb/sec: " + size * 1000.0 / (time * MEGA),
  "Average IO rate mb/sec: " + med,
  " IO rate std deviation: " + stdDev,
  " Test exec time sec: " + (float)execTime / 1000,
  "" };
First, collect the raw MapReduce results:

[gpadmin@hdm3 ~]$ hdfs dfs -cat /benchmarks/TestDFSIO/io_write/part*
f:rate 1481181.8
f:sqrate 3.4433252E7
l:size 1099511627776
l:tasks 64
l:time 45497635

The actual results from the test:

13/08/21 10:56:45 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
13/08/21 10:56:45 INFO fs.TestDFSIO: Date & time: Wed Aug 21 10:56:45 PDT 2013
13/08/21 10:56:45 INFO fs.TestDFSIO: Number of files: 64
13/08/21 10:56:45 INFO fs.TestDFSIO: Total MBytes processed: 1048576.0
13/08/21 10:56:45 INFO fs.TestDFSIO: Throughput mb/sec: 23.046824301966463
13/08/21 10:56:45 INFO fs.TestDFSIO: Average IO rate mb/sec: 23.143465042114258
13/08/21 10:56:45 INFO fs.TestDFSIO: IO rate std deviation: 1.5490700854356283
13/08/21 10:56:45 INFO fs.TestDFSIO: Test exec time sec: 796.676

Here is how the results are calculated:

Throughput = size * 1000 / (time * 1048576)
Throughput = 1099511627776 * 1000 / (45497635 * 1048576)
Throughput = 1099511627776000 / 47707728117760 = 23.04682430196646

AVG IO Rate = rate / 1000 / tasks
AVG IO Rate = 1481181.8 / 1000 / 64 = 23.143465625

Standard Deviation = square root of (absolute value(sqrate / 1000 / tasks - AvgIoRate * AvgIoRate))
Standard Deviation = square root of (absolute value(34433252 / 1000 / 64 - 23.143465625 * 23.143465625)) = 1.549051762996757

The hand-calculated average IO rate and standard deviation differ from the reported values only in the last few digits, which is consistent with rounding of the accumulated values inside the tool.
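The calculation above can be reproduced with a few lines of Python. This is a sketch that mirrors the formulas in the TestDFSIO.java excerpt, not the tool's own code; the function name `summarize` is hypothetical, and the inputs are the raw values printed by `hdfs dfs -cat` for the two write runs in this article.

```python
import math

MEGA = 1024 * 1024  # 1048576, the divisor used in the throughput formula


def summarize(rate, sqrate, size, tasks, time_ms):
    """Recompute the summary figures from the raw reducer totals."""
    throughput = size * 1000.0 / (time_ms * MEGA)
    avg_io_rate = rate / 1000 / tasks
    std_dev = math.sqrt(abs(sqrate / 1000 / tasks - avg_io_rate ** 2))
    return throughput, avg_io_rate, std_dev


if __name__ == "__main__":
    # (rate, sqrate, size, tasks, time) copied from the part* output.
    runs = {
        "64 files x 16GB": (1481181.8, 3.4433252e7, 1099511627776, 64, 45497635),
        "4 files x 250GB": (647024.94, 1.04674768e8, 1073741824000, 4, 6331377),
    }
    for name, raw in runs.items():
        throughput, avg_io_rate, std_dev = summarize(*raw)
        print(f"{name}: throughput={throughput:.3f} MB/s, "
              f"avg IO rate={avg_io_rate:.3f} MB/s, std dev={std_dev:.3f}")
```

Note the parentheses around `time_ms * MEGA`: without them the expression would multiply by 1048576 instead of dividing by it.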
The 64-file test has a write throughput of about 23MB/s, and the 4-file test has a write throughput of about 161MB/s. Even so, the 64-file test yielded better results, for a few reasons.

TestDFSIO derives its reported figures from the /benchmarks/TestDFSIO/io_write/part-00000 file, which sums the values (rate, sqrate, size, and so on) from each of the map tasks. The throughput, IO rate, and standard deviation results are therefore based on individual map tasks, not on the overall throughput of the cluster.

We know that nrFiles is equal to the number of map tasks, and this specific cluster configuration allows up to 79 map tasks to execute simultaneously. In the 64-file test, 16 map tasks run simultaneously on each NodeManager node, versus 1 map task on each node in the 4-file test. The 4-file test therefore yields a throughput of about 161MB/s on each NodeManager node, while the 64-file test yields about 16 * 23MB/s = 368MB/s per NodeManager node. The 64-file test thus comes much closer to the I/O performance this MapReduce/HDFS cluster can deliver, while the 4-file test leaves much of that capacity unused. It also shows how MapReduce IO performance can vary depending on the data size, the number of map/reduce tasks, and the available cluster resources.
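The per-node comparison can be written out as a short calculation. This is a sketch under stated assumptions: the tasks-per-node counts (16 and 1) are taken from the text above, every concurrent task is assumed to sustain the reported per-task throughput at the same time, and the function names are hypothetical. The wall-clock figure (total MB divided by test execution time) is included only as a rough cross-check, since the execution time also covers job startup and scheduling.

```python
def per_node_throughput(per_task_mb_s, tasks_per_node):
    """Estimated MB/s per node if all tasks run concurrently at this rate."""
    return per_task_mb_s * tasks_per_node


def wall_clock_throughput(total_mb, exec_time_s):
    """Cluster-wide MB/s measured against the whole job's run time."""
    return total_mb / exec_time_s


if __name__ == "__main__":
    # Values copied from the two write runs reported in this article.
    runs = [
        # (label, per-task MB/s, tasks per node, total MB, exec time in s)
        ("64 files x 16GB", 23.046824301966463, 16, 1048576.0, 796.676),
        ("4 files x 250GB", 161.73416935999862, 1, 1024000.0, 1603.932),
    ]
    for label, rate, tasks, total_mb, secs in runs:
        node = per_node_throughput(rate, tasks)
        cluster = wall_clock_throughput(total_mb, secs)
        print(f"{label}: ~{node:.0f} MB/s per node, "
              f"~{cluster:.0f} MB/s cluster-wide by wall clock")
```

Both views point the same way for these two runs: the 64-file test moves roughly the same amount of data in about half the time.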