Benchmark Hadoop

When we install Hadoop, we get a few JARs for testing the installation and for benchmarking. In the Cloudera distribution they can be found at:

/opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
/opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/jars/hadoop-mapreduce-examples-3.0.0-cdh6.1.1.jar
/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
/opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-examples-3.0.0-cdh6.1.1.jar
/opt/cloudera/parcels/CDH/jars/hadoop-examples.jar


All of the above JARs are the same; they are used for installation tests such as wordcount and for benchmarking HDFS, MapReduce, etc.

Alternatively, you can search for the file as shown below if you are using a different CDH version or a package-based installation:

[root@host1 ~]# find / -name "*hadoop*example*.jar"
/opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/jars/hadoop-examples.jar
/opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/jars/hadoop-mapreduce-examples-3.0.0-cdh6.1.1.jar
/opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
/opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop-mapreduce/hadoop-mapreduce-examples-3.0.0-cdh6.1.1.jar
/opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hue/apps/oozie/examples/lib/hadoop-examples.jar

This JAR provides various example programs:

[root@host1 hasnain]# yarn jar /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.

Example 1: pi test
Usage: pi <number of mappers> <samples per mapper>
[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-examples-3.0.0-cdh6.1.1.jar pi 50 10

Parcel - sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100

Package - sudo -u hdfs hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100
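
The first argument is the number of map tasks and the second is the number of samples each mapper draws, so the estimate gets more precise as the sample count grows. The small values above are only a smoke test; a slightly heavier run (the values here are chosen just for illustration) could look like:

[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-examples-3.0.0-cdh6.1.1.jar pi 16 1000000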

Example 2: Wordcount test:

I am going to run wordcount on the text of the Bible (bible.txt, already uploaded to HDFS), as shown below:


[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount -Dmapred.reduce.tasks=8 /user/data/bible.txt /user/output/wordcount

[hdfs@host1 ~]$ hadoop fs -tail /user/output/wordcount/part-r-00006
pons    17
weapons,        2
weary?  2
weeks   8

-- To generate ~30 GB of random text data (this data can also be used for wordcount):

[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar randomtextwriter -Ddfs.replication=1 /user/data/random

[hdfs@host1 ~]$ hadoop fs -du -s -h /user/data/random
30.5 G  30.5 G  /user/data/random
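
By default randomtextwriter writes a fixed amount of data per node (10 GB in this build, which is why roughly 30 GB came out here, presumably from three worker nodes). If you want a different total volume, the writer's own properties can usually be overridden on the command line; a sketch, assuming the Hadoop 3 key name mapreduce.randomtextwriter.totalbytes (verify against your version) and a hypothetical output path:

[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar randomtextwriter -Dmapreduce.randomtextwriter.totalbytes=10737418240 /user/data/random_10g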

[hdfs@host1 ~]$ hadoop fs -df -h
Filesystem                                Size    Used  Available  Use%
hdfs://host1.internal.cloudapp.net:8020  84.7 G  32.9 G     47.0 G   39%

TERASORT BENCHMARK:
This benchmark tests both MapReduce and HDFS by sorting a configurable amount of data as quickly as possible, in order to measure how well the cluster can distribute and process data. It consists of 3 components:

TeraGen - generates random data
TeraSort - does the sorting using MapReduce
TeraValidate - used to validate the output

[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen
teragen <num rows> <output dir>

# The original TeraSort benchmark sorts 10 billion 100-byte records, making the total data size 1 TB. However, we can specify the number of records, which makes the total data size configurable.
# Each row is 100 bytes in size, so to generate 1 GB of data the number of rows is 10,000,000.
Suppose you want to generate 10 GB of data; then the number of 100-byte rows will be 100,000,000.
Explanation: 10 GB = 10,000,000,000 bytes / 100 bytes per row = 100,000,000 rows.

SIZE = 1 GB    (7 zeros)   ROWS = 10000000
SIZE = 10 GB   (8 zeros)   ROWS = 100000000
SIZE = 100 GB  (9 zeros)   ROWS = 1000000000
SIZE = 500 GB  (9 zeros)   ROWS = 5000000000
SIZE = 1 TB    (10 zeros)  ROWS = 10000000000
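
A minimal shell sketch to compute the row count for an arbitrary target size (the output path /user/hasnain/teragen_50g is just an example name):

# 100 bytes per row => 10,000,000 rows per (decimal) GB
TARGET_GB=50
ROWS=$((TARGET_GB * 10000000))
echo $ROWS          # 500000000
yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen $ROWS /user/hasnain/teragen_50g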

Step 1: TeraGen

[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH/jars/hadoop-examples.jar teragen 10000000 /user/hasnain/teragen2

[hdfs@host1 ~]$ hadoop fs -du -h /user/hasnain/teragen2
0        0      /user/hasnain/teragen2/_SUCCESS
476.8 M  1.4 G  /user/hasnain/teragen2/part-m-00000
476.8 M  1.4 G  /user/hasnain/teragen2/part-m-00001

# To generate 325 MB of data with a block size of 64 MB:

[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen -D dfs.blocksize=67108864 3407872 /user/hasnain/teragen

[hdfs@host1 ~]$ hadoop fs -du -h /user/hasnain/teragen
0        0        /user/hasnain/teragen/_SUCCESS
162.5 M  487.5 M  /user/hasnain/teragen/part-m-00000
162.5 M  487.5 M  /user/hasnain/teragen/part-m-00001
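
# Quick sanity check of the numbers above:
#   3,407,872 rows x 100 bytes = 340,787,200 bytes = 325 MiB in total
#   teragen ran with the default 2 map tasks, so each part file is 162.5 MiB
#   with dfs.blocksize=67108864 (64 MiB) each file spans 3 blocks (2 full + 1 partial)
#   the second column (487.5 M) is the space consumed including the replication factor of 3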

---  It's a good idea to specify the number of map tasks for the TeraGen computation to speed up the data generation. This can be done with the -Dmapred.map.tasks parameter, for example:
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen -Ddfs.block.size=536870912 -Dmapred.map.tasks=4 10000000 tera-in
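
On Hadoop 2/3 (including CDH 6) mapred.map.tasks and dfs.block.size are deprecated aliases; the same run expressed with the current property names (same jar and paths as above) would be:

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen -Ddfs.blocksize=536870912 -Dmapreduce.job.maps=4 10000000 tera-in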

Step 2: TeraSort

[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar terasort /user/hasnain/teragen /user/hasnain/terasort

---- It's also a good idea to specify the number of reduce tasks for the TeraSort computation to speed up the reducer part of the job. This can be done with the
-Dmapred.reduce.tasks parameter as follows:
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort -Dmapred.reduce.tasks=32 tera-in tera-out
-- For about 50 GB of input, 4 reducers are fine; note that the default number of TeraSort reduce tasks is 1.
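
Likewise, mapred.reduce.tasks is a deprecated alias for mapreduce.job.reduces, so the same run with the current property name would be:

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort -Dmapreduce.job.reduces=32 tera-in tera-out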

Step 3: TeraValidate
If anything is wrong with the sorted output, the output of this reducer reports the problem.

[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teravalidate /user/hasnain/terasort /user/hasnain/teravalidate

[hdfs@host1 ~]$ hadoop fs -ls /user/hasnain/teravalidate
Found 2 items
-rw-r--r--   3 hdfs hasnain          0 2020-04-28 05:45 /user/hasnain/teravalidate/_SUCCESS
-rw-r--r--   3 hdfs hasnain         24 2020-04-28 05:45 /user/hasnain/teravalidate/part-r-00000
[hdfs@host1 ~]$ hadoop fs -cat /user/hasnain/teravalidate/part-r-00000
checksum        19fed0702bc126

-- Also, do not forget to clean up the TeraSort data between runs (and after testing is finished), e.g.:

$ hdfs dfs -rm -r -skipTrash /user/hasnain/teragen /user/hasnain/terasort /user/hasnain/teravalidate

-- For TeraSort analysis, look at the job history, which provides all the job stats:

hadoop job -history all <job history file (.jhist)>

[hdfs@host1 ~]$ hadoop fs -ls /user/history
Found 2 items
drwxrwx--x   - mapred hadoop          0 2020-04-28 05:29 /user/history/done
drwxrwxrwt   - mapred hadoop          0 2020-04-28 05:29 /user/history/done_intermediate

[hdfs@host1 ~]$ hadoop fs -ls /user/history/done
Found 1 items
drwxrwx--x   - mapred hadoop          0 2020-04-28 05:29 /user/history/done/2020

[hdfs@host1 ~]$ hadoop fs -ls -R /user/history/done/2020
drwxrwx--x   - mapred hadoop          0 2020-04-28 05:29 /user/history/done/2020/04
drwxrwx--x   - mapred hadoop          0 2020-04-28 05:29 /user/history/done/2020/04/28
drwxrwx--x   - mapred hadoop          0 2020-04-28 05:34 /user/history/done/2020/04/28/000000
-rwxrwx---   3 hdfs   hadoop      20169 2020-04-28 05:29 /user/history/done/2020/04/28/000000/job_1588050916962_0001-1588051764325-hdfs-TeraGen-1588051780683-2-0-SUCCEEDED-root.users.hdfs-1588051771239.jhist
-rwxrwx---   3 hdfs   hadoop     215334 2020-04-28 05:29 /user/history/done/2020/04/28/000000/job_1588050916962_0001_conf.xml
-rwxrwx---   1 hdfs   hadoop      88555 2020-04-28 05:34 /user/history/done/2020/04/28/000000/job_1588050916962_0002-1588052034254-hdfs-TeraSort-1588052056154-6-12-SUCCEEDED-root.users.hdfs-1588052037959.jhist
-rwxrwx---   1 hdfs   hadoop     216449 2020-04-28 05:34 /user/history/done/2020/04/28/000000/job_1588050916962_0002_conf.xml

-- Important stats such as the CPU time spent in map and reduce tasks, and many others, can be obtained as follows:

[hdfs@host1 ~]$ hadoop job -history /user/history/done/2020/04/28/000000/job_1588050916962_0002-1588052034254-hdfs-TeraSort-1588052056154-6-12-SUCCEEDED-root.users.hdfs-1588052037959.jhist
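
The hadoop job command is deprecated in recent releases; the mapred CLI reads the same .jhist file and prints the same stats, e.g.:

[hdfs@host1 ~]$ mapred job -history all /user/history/done/2020/04/28/000000/job_1588050916962_0002-1588052034254-hdfs-TeraSort-1588052056154-6-12-SUCCEEDED-root.users.hdfs-1588052037959.jhist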

BENCHMARK IO

The TestDFSIO benchmark is a read and write test for HDFS. It is helpful for tasks such as stress testing HDFS.
Run the write test before the read test, since the read test reads the files created by the write test.

[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar
An example program must be given as the first argument.
Valid program names are:
  DFSCIOTest: Distributed i/o benchmark of libhdfs.
  MRReliabilityTest: A program that tests the reliability of the MR framework by injecting faults/failures
  TestDFSIO: Distributed i/o benchmark.
  fail: a job that always fails
  gsleep: A sleep job whose mappers create 1MB buffer for every record.
  loadgen: Generic map/reduce load generator
  mapredtest: A map/reduce test check.
  mrbench: A map/reduce benchmark that can create many small jobs
  nnbench: A benchmark that stresses the namenode w/ MR.
  nnbenchWithoutMR: A benchmark that stresses the namenode w/o MR.
  sleep: A job that sleeps at each map and reduce task.
  testbigmapoutput: A map/reduce program that works on a very big non-splittable file and does identity map/reduce
  testfilesystem: A test for FileSystem read/write.
  testmapredsort: A map/reduce program that validates the map-reduce framework's sort.
  testsequencefile: A test for flat files of binary key value pairs.
  testsequencefileinputformat: A test for sequence file input format.
  testtextinputformat: A test for text input format.
  threadedmapbench: A map/reduce benchmark that compares the performance of maps with multiple spills over maps with 1 spill
  timelineperformance: A job that launches mappers to test timeline service performance.

# Write test -- run a write test that generates 3 output files of 100 MB each

Usage: TestDFSIO [genericOptions] -read [-random | -backward | -skip [-skipSize Size]] | -write | -append | -truncate | -clean [-compression codecClassName] [-nrFiles N] [-size Size[B|KB|MB|GB|TB]] [-resFile resultFileName] [-bufferSize Bytes] [-storagePolicy storagePolicyName] [-erasureCodePolicy erasureCodePolicyName]

[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 3 -size 100MB
20/04/28 05:50:52 INFO fs.TestDFSIO: TestDFSIO.1.8
20/04/28 05:50:52 INFO fs.TestDFSIO: nrFiles = 3
20/04/28 05:50:52 INFO fs.TestDFSIO: nrBytes (MB) = 100.0
20/04/28 05:50:52 INFO fs.TestDFSIO: bufferSize = 1000000
20/04/28 05:50:52 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
20/04/28 05:50:53 INFO fs.TestDFSIO: creating control file: 104857600 bytes, 3 files
20/04/28 05:50:53 INFO fs.TestDFSIO: created control files for: 3 files
20/04/28 05:50:53 INFO client.RMProxy: Connecting to ResourceManager at adf2.internal.cloudapp.net/10.0.0.5:8032
20/04/28 05:50:53 INFO client.RMProxy: Connecting to ResourceManager at adf2.internal.cloudapp.net/10.0.0.5:8032
---
---
20/04/28 05:51:10 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
20/04/28 05:51:10 INFO fs.TestDFSIO:             Date & time: Tue Apr 28 05:51:10 UTC 2020
20/04/28 05:51:10 INFO fs.TestDFSIO:         Number of files: 3
20/04/28 05:51:10 INFO fs.TestDFSIO:  Total MBytes processed: 300
20/04/28 05:51:10 INFO fs.TestDFSIO:       Throughput mb/sec: 133.04
20/04/28 05:51:10 INFO fs.TestDFSIO:  Average IO rate mb/sec: 133.06
20/04/28 05:51:10 INFO fs.TestDFSIO:   IO rate std deviation: 1.6
20/04/28 05:51:10 INFO fs.TestDFSIO:      Test exec time sec: 17.37
20/04/28 05:51:10 INFO fs.TestDFSIO:
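
The 3 x 100 MB run above is only a smoke test. For a heavier stress test you would typically raise the file count and size and capture the summary with -resFile (the file count, size, and result-file path below are only an illustration; size them to your cluster):

[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 10 -size 1GB -resFile /tmp/TestDFSIO_write.log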

[hdfs@host1 ~]$ hadoop fs -ls /benchmarks/TestDFSIO
Found 3 items
drwxr-xr-x   - hdfs supergroup          0 2020-04-28 05:50 /benchmarks/TestDFSIO/io_control
drwxr-xr-x   - hdfs supergroup          0 2020-04-28 05:51 /benchmarks/TestDFSIO/io_data
drwxr-xr-x   - hdfs supergroup          0 2020-04-28 05:51 /benchmarks/TestDFSIO/io_write

[hdfs@host1 ~]$ hadoop fs -ls /benchmarks/TestDFSIO/io_data
Found 3 items
-rw-r--r--   3 hdfs supergroup  104857600 2020-04-28 05:51 /benchmarks/TestDFSIO/io_data/test_io_0
-rw-r--r--   3 hdfs supergroup  104857600 2020-04-28 05:51 /benchmarks/TestDFSIO/io_data/test_io_1
-rw-r--r--   3 hdfs supergroup  104857600 2020-04-28 05:51 /benchmarks/TestDFSIO/io_data/test_io_2

[hdfs@host1 ~]$ hadoop fs -ls /benchmarks/TestDFSIO/io_write
Found 2 items
-rw-r--r--   3 hdfs supergroup          0 2020-04-28 05:51 /benchmarks/TestDFSIO/io_write/_SUCCESS
-rw-r--r--   3 hdfs supergroup         77 2020-04-28 05:51 /benchmarks/TestDFSIO/io_write/part-00000

[hdfs@host1 ~]$ hadoop fs -cat /benchmarks/TestDFSIO/io_write/part-00000
f:rate  399170.84
f:sqrate        5.3120168E7
l:size  314572800
l:tasks 3
l:time  2255
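
These counters are the raw values behind the summary printed earlier, so they make a quick sanity check:

  l:size = 314572800 bytes  = 300 MB written in total by the 3 map tasks
  l:time = 2255 ms          = combined task time
  Throughput mb/sec      = size / time = 300 MB / 2.255 s = 133.04
  Average IO rate mb/sec = f:rate / 1000 / l:tasks = 399170.84 / 3000 = 133.06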

# Read test -- run the corresponding read test, which reads the 3 input files of 100 MB each created by the write test
-- the same parameters apply to the read test

[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -read -nrFiles 3 -size 100MB
20/04/28 05:57:56 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
20/04/28 05:57:56 INFO fs.TestDFSIO:             Date & time: Tue Apr 28 05:57:56 UTC 2020
20/04/28 05:57:56 INFO fs.TestDFSIO:         Number of files: 3
20/04/28 05:57:56 INFO fs.TestDFSIO:  Total MBytes processed: 300
20/04/28 05:57:56 INFO fs.TestDFSIO:       Throughput mb/sec: 1119.4
20/04/28 05:57:56 INFO fs.TestDFSIO:  Average IO rate mb/sec: 1122.85
20/04/28 05:57:56 INFO fs.TestDFSIO:   IO rate std deviation: 62.68
20/04/28 05:57:56 INFO fs.TestDFSIO:      Test exec time sec: 16.27
20/04/28 05:57:56 INFO fs.TestDFSIO:

-- You will also find a local file, created once the job has run successfully, containing the summary stats:

[hdfs@host1 ~]$  ls -ltr
total 4
drwxrwxr-x 3 hdfs hdfs  18 Apr 28 01:29 cache
-rw-rw-r-- 1 hdfs hdfs 265 Apr 28 05:51 TestDFSIO_results.log

[hdfs@host1 ~]$ cat TestDFSIO_results.log
----- TestDFSIO ----- : write
            Date & time: Tue Apr 28 05:51:10 UTC 2020
        Number of files: 3
 Total MBytes processed: 300
      Throughput mb/sec: 133.04
 Average IO rate mb/sec: 133.06
  IO rate std deviation: 1.6
     Test exec time sec: 17.37

----- TestDFSIO ----- : read
            Date & time: Tue Apr 28 05:57:56 UTC 2020
        Number of files: 3
 Total MBytes processed: 300
      Throughput mb/sec: 1119.4
 Average IO rate mb/sec: 1122.85
  IO rate std deviation: 62.68
     Test exec time sec: 16.27

-- Clean up all benchmark files:

[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -clean
20/04/28 09:24:33 INFO fs.TestDFSIO: TestDFSIO.1.8
20/04/28 09:24:33 INFO fs.TestDFSIO: nrFiles = 1
20/04/28 09:24:33 INFO fs.TestDFSIO: nrBytes (MB) = 1.0
20/04/28 09:24:33 INFO fs.TestDFSIO: bufferSize = 1000000
20/04/28 09:24:33 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
20/04/28 09:24:33 INFO fs.TestDFSIO: Cleaning up test files

For other performance benchmarks (e.g., nnbench or mrbench from the same jobclient tests JAR), refer to the program list shown above.
