When we install Hadoop, we get a few JARs for testing the installation and for benchmarking. In the Cloudera (CDH) distribution, the examples JAR can be found at the following locations:
/opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
/opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/jars/hadoop-mapreduce-examples-3.0.0-cdh6.1.1.jar
/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
/opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-examples-3.0.0-cdh6.1.1.jar
/opt/cloudera/parcels/CDH/jars/hadoop-examples.jar
All of the above JARs are the same file; it is used for installation tests such as wordcount and for benchmarking HDFS, MapReduce, etc.
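On a parcel-based install, /opt/cloudera/parcels/CDH is normally a symlink to the active versioned parcel, and the lib/ entries are symlinks into jars/, so these paths should all resolve to the same file. A quick check you can run yourself (not part of the original walkthrough; the output will vary with your parcel version):
[root@host1 ~]# readlink -f /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
# expected to print something like /opt/cloudera/parcels/CDH-<version>/jars/hadoop-mapreduce-examples-<version>.jar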
Alternatively, you can search for this file as shown below if you are using a different CDH version or have a package-based installation:
[root@host1 ~]# find / -name "*hadoop*example*.jar"
/opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/jars/hadoop-examples.jar
/opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/jars/hadoop-mapreduce-examples-3.0.0-cdh6.1.1.jar
/opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
/opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop-mapreduce/hadoop-mapreduce-examples-3.0.0-cdh6.1.1.jar
/opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hue/apps/oozie/examples/lib/hadoop-examples.jar
This JAR provides various example programs; running it without arguments lists them:
[root@host1 hasnain]# yarn jar /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount: An example job that count the pageview counts from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that counts the average length of the words in the input files.
wordmedian: A map/reduce program that counts the median length of the words in the input files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
Example 1: pi test
pi <number of mappers> <samples per mapper>
[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-examples-3.0.0-cdh6.1.1.jar pi 50 10
Parcel - sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100
Package - sudo -u hdfs hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100
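The first argument is the number of map tasks and the second is the number of samples each mapper computes, so the first run above uses 50 x 10 = 500 samples in total and the parcel/package runs use 10 x 100 = 1,000; more samples generally give a more accurate estimate of Pi.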
Example 2: wordcount test
I am going to run wordcount on the Bible text; you can see the run below:
[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount -Dmapred.reduce.tasks=8 /user/data/bible.txt /user/output/wordcount
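Since the job runs with 8 reduce tasks, the output directory contains eight files, part-r-00000 through part-r-00007 (my inference from the command above), which is why the tail below reads part-r-00006.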
[hdfs@host1 ~]$ hadoop fs -tail /user/output/wordcount/part-r-00006
pons 17
weapons, 2
weary? 2
weeks 8
-- To generate about 30 GB of random text data; you can also use this data for wordcount:
[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar randomtextwriter -Ddfs.replication=1 /user/data/random
[hdfs@host1 ~]$ hadoop fs -du -s -h /user/data/random
30.5 G 30.5 G /user/data/random
[hdfs@host1 ~]$ hadoop fs -df -h
Filesystem                               Size    Used    Available  Use%
hdfs://host1.internal.cloudapp.net:8020  84.7 G  32.9 G  47.0 G     39%
TERAGEN:
This benchmark tests both MapReduce and HDFS by sorting some amount of data as quickly as possible, in order to measure the cluster's ability to distribute and process data. The benchmark consists of 3 components:
TeraGen - generates random data
TeraSort - sorts the data using MapReduce
TeraValidate - validates the sorted output
[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen
teragen <num rows> <output dir>
# The original TeraSort benchmark sorts 10 billion 100-byte records, making the total data size 1 TB. However, we can specify the number of records, making it possible to configure the total size of data.
# Each row is 100 bytes, so to generate 1 GB of data the number of rows is 10000000 (10⁷).
Suppose you want to generate 10 GB of data; then the number of 100-byte rows will be 10⁸.
Explanation: 10 GB = 10⁸ * 100 bytes = 10¹⁰ bytes. (A quick shell check follows the table below.)
SIZE=1G    ROWS=10000000      (10⁷)
SIZE=10G   ROWS=100000000     (10⁸)
SIZE=100G  ROWS=1000000000    (10⁹)
SIZE=500G  ROWS=5000000000    (5 × 10⁹)
SIZE=1T    ROWS=10000000000   (10¹⁰)
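As a sanity check on the table above, the row count for any target size can be derived with simple shell arithmetic (my own sketch; SIZE_GB is decimal GB, i.e. 10⁹ bytes, to match the table):
# Each TeraGen row is exactly 100 bytes, so rows = target bytes / 100
SIZE_GB=10
ROWS=$(( SIZE_GB * 1000000000 / 100 ))
echo $ROWS     # 100000000, the 10G row count from the table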
Step1: Teragen
[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH/jars/hadoop-examples.jar teragen 10000000 /user/hasnain/teragen2
[hdfs@host1 ~]$ hadoop fs -du -h /user/hasnain/teragen2
0 0 /user/hasnain/teragen2/_SUCCESS
476.8 M 1.4 G /user/hasnain/teragen2/part-m-00000
476.8 M 1.4 G /user/hasnain/teragen2/part-m-00001
# To generate 325 MB of data with a block size of 64 MB
[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen -D dfs.blocksize=67108864 3407872 /user/hasnain/teragen
[hdfs@host1 ~]$ hadoop fs -du -h /user/hasnain/teragen
0 0 /user/hasnain/teragen/_SUCCESS
162.5 M 487.5 M /user/hasnain/teragen/part-m-00000
162.5 M 487.5 M /user/hasnain/teragen/part-m-00001
--- It's a good idea to specify the number of map tasks for the teragen computation to speed up data generation. This can be done with the -Dmapred.map.tasks parameter, like:
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen -Ddfs.block.size=536870912 -Dmapred.map.tasks=4 10000000 tera-in
Step2: Terasort
[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar terasort /user/hasnain/teragen /user/hasnain/terasort
---- It's a good idea to specify the number of reduce tasks for the TeraSort computation to speed up the reducer part of the computation. This can be done by specifying the -Dmapred.reduce.tasks parameter as follows:
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort -Dmapred.reduce.tasks=32 tera-in tera-out
-- For 50 GB, 4 reducers is fine; the default number of TeraSort reduce tasks is 1.
STEP3: VALIDATE
TeraValidate checks that the TeraSort output is correctly sorted. If anything is wrong with the sorted output, the output of this reducer reports the problem; a run that produces only a checksum row (as shown below) indicates the sort was valid.
[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teravalidate /user/hasnain/terasort /user/hasnain/teravalidate
[hdfs@host1 ~]$ hadoop fs -ls /user/hasnain/teravalidate
Found 2 items
-rw-r--r--   3 hdfs hasnain          0 2020-04-28 05:45 /user/hasnain/teravalidate/_SUCCESS
-rw-r--r--   3 hdfs hasnain         24 2020-04-28 05:45 /user/hasnain/teravalidate/part-r-00000
[hdfs@host1 ~]$ hadoop fs -cat /user/hasnain/teravalidate/part-r-00000
checksum        19fed0702bc126
-- Also, do not forget to clean up
the terasort data between runs (and after testing is finished)
$ hdfs dfs -rm -r -skipTrash Tera*
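Note that in this walkthrough the data actually lives under /user/hasnain/, so the equivalent cleanup (my adaptation of the generic command above) would be:
$ hdfs dfs -rm -r -skipTrash /user/hasnain/teragen /user/hasnain/teragen2 /user/hasnain/terasort /user/hasnain/teravalidate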
-- Find the job history for TeraSort analysis; this will provide all the stats:
hadoop job -history all <job history file (.jhist)>
[hdfs@host1 ~]$ hadoop fs -ls /user/history
Found 2 items
drwxrwx--x   - mapred hadoop          0 2020-04-28 05:29 /user/history/done
drwxrwxrwt   - mapred hadoop          0 2020-04-28 05:29 /user/history/done_intermediate
[hdfs@host1 ~]$ hadoop fs -ls /user/history/done
Found 1 items
drwxrwx--x   - mapred hadoop          0 2020-04-28 05:29 /user/history/done/2020
[hdfs@host1 ~]$ hadoop fs -ls -R /user/history/done/2020
drwxrwx--x   - mapred hadoop          0 2020-04-28 05:29 /user/history/done/2020/04
drwxrwx--x   - mapred hadoop          0 2020-04-28 05:29 /user/history/done/2020/04/28
drwxrwx--x   - mapred hadoop          0 2020-04-28 05:34 /user/history/done/2020/04/28/000000
-rwxrwx---   3 hdfs   hadoop      20169 2020-04-28 05:29 /user/history/done/2020/04/28/000000/job_1588050916962_0001-1588051764325-hdfs-TeraGen-1588051780683-2-0-SUCCEEDED-root.users.hdfs-1588051771239.jhist
-rwxrwx---   3 hdfs   hadoop     215334 2020-04-28 05:29 /user/history/done/2020/04/28/000000/job_1588050916962_0001_conf.xml
-rwxrwx---   1 hdfs   hadoop      88555 2020-04-28 05:34 /user/history/done/2020/04/28/000000/job_1588050916962_0002-1588052034254-hdfs-TeraSort-1588052056154-6-12-SUCCEEDED-root.users.hdfs-1588052037959.jhist
-rwxrwx---   1 hdfs   hadoop     216449 2020-04-28 05:34 /user/history/done/2020/04/28/000000/job_1588050916962_0002_conf.xml
-- Important stats such as CPU time for map and reduce tasks, and many others, can be obtained with the command below:
[hdfs@host1 ~]$ hadoop job -history /user/history/done/2020/04/28/000000/job_1588050916962_0002-1588052034254-hdfs-TeraSort-1588052056154-6-12-SUCCEEDED-root.users.hdfs-1588052037959.jhist
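The full history dump is long; one way to pull out just the CPU counters (my own suggestion, assuming the counter names contain "CPU", as they do in recent Hadoop releases) is to filter the output:
[hdfs@host1 ~]$ hadoop job -history <.jhist file> | grep -i cpu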
BENCHMARK IO
The TestDFSIO benchmark is a read and write test for HDFS. It is helpful for tasks such as stress testing HDFS.
Run the write test before the read test, since the read test reads the files that the write test creates under /benchmarks/TestDFSIO.
[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar
An example program must be given as the first argument.
Valid program names are:
DFSCIOTest: Distributed i/o benchmark of libhdfs.
MRReliabilityTest: A program that tests the reliability of the MR framework by injecting faults/failures
TestDFSIO: Distributed i/o benchmark.
fail: a job that always fails
gsleep: A sleep job whose mappers create 1MB buffer for every record.
loadgen: Generic map/reduce load generator
mapredtest: A map/reduce test check.
mrbench: A map/reduce benchmark that can create many small jobs
nnbench: A benchmark that stresses the namenode w/ MR.
nnbenchWithoutMR: A benchmark that stresses the namenode w/o MR.
sleep: A job that sleeps at each map and reduce task.
testbigmapoutput: A map/reduce program that works on a very big non-splittable file and does identity map/reduce
testfilesystem: A test for FileSystem read/write.
testmapredsort: A map/reduce program that validates the map-reduce framework's sort.
testsequencefile: A test for flat files of binary key value pairs.
testsequencefileinputformat: A test for sequence file input format.
testtextinputformat: A test for text input format.
threadedmapbench: A map/reduce benchmark that compares the performance of maps with multiple spills over maps with 1 spill
timelineperformance: A job that launches mappers to test timeline service performance.
# Write test - run a write test that generates 3 output files of 100 MB each
Usage: TestDFSIO [genericOptions] -read [-random | -backward | -skip [-skipSize Size]] | -write | -append | -truncate | -clean [-compression codecClassName] [-nrFiles N] [-size Size[B|KB|MB|GB|TB]] [-resFile resultFileName] [-bufferSize Bytes] [-storagePolicy storagePolicyName] [-erasureCodePolicy erasureCodePolicyName]
[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 3 -size 100MB
20/04/28 05:50:52 INFO fs.TestDFSIO: TestDFSIO.1.8
20/04/28 05:50:52 INFO fs.TestDFSIO: nrFiles = 3
20/04/28 05:50:52 INFO fs.TestDFSIO: nrBytes (MB) = 100.0
20/04/28 05:50:52 INFO fs.TestDFSIO: bufferSize = 1000000
20/04/28 05:50:52 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
20/04/28 05:50:53 INFO fs.TestDFSIO: creating control file: 104857600 bytes, 3 files
20/04/28 05:50:53 INFO fs.TestDFSIO: created control files for: 3 files
20/04/28 05:50:53 INFO client.RMProxy: Connecting to ResourceManager at adf2.internal.cloudapp.net/10.0.0.5:8032
20/04/28 05:50:53 INFO client.RMProxy: Connecting to ResourceManager at adf2.internal.cloudapp.net/10.0.0.5:8032
---
---
20/04/28 05:51:10 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
20/04/28 05:51:10 INFO fs.TestDFSIO: Date & time: Tue Apr 28 05:51:10 UTC 2020
20/04/28 05:51:10 INFO fs.TestDFSIO: Number of files: 3
20/04/28 05:51:10 INFO fs.TestDFSIO: Total MBytes processed: 300
20/04/28 05:51:10 INFO fs.TestDFSIO: Throughput mb/sec: 133.04
20/04/28 05:51:10 INFO fs.TestDFSIO: Average IO rate mb/sec: 133.06
20/04/28 05:51:10 INFO fs.TestDFSIO: IO rate std deviation: 1.6
20/04/28 05:51:10 INFO fs.TestDFSIO: Test exec time sec: 17.37
20/04/28 05:51:10 INFO fs.TestDFSIO:
[hdfs@host1 ~]$ hadoop fs -ls /benchmarks/TestDFSIO
Found 3 items
drwxr-xr-x   - hdfs supergroup          0 2020-04-28 05:50 /benchmarks/TestDFSIO/io_control
drwxr-xr-x   - hdfs supergroup          0 2020-04-28 05:51 /benchmarks/TestDFSIO/io_data
drwxr-xr-x   - hdfs supergroup          0 2020-04-28 05:51 /benchmarks/TestDFSIO/io_write
[hdfs@host1 ~]$ hadoop fs -ls /benchmarks/TestDFSIO/io_data
Found 3 items
-rw-r--r--   3 hdfs supergroup  104857600 2020-04-28 05:51 /benchmarks/TestDFSIO/io_data/test_io_0
-rw-r--r--   3 hdfs supergroup  104857600 2020-04-28 05:51 /benchmarks/TestDFSIO/io_data/test_io_1
-rw-r--r--   3 hdfs supergroup  104857600 2020-04-28 05:51 /benchmarks/TestDFSIO/io_data/test_io_2
[hdfs@host1 ~]$ hadoop fs -ls /benchmarks/TestDFSIO/io_write
Found 2 items
-rw-r--r--   3 hdfs supergroup          0 2020-04-28 05:51 /benchmarks/TestDFSIO/io_write/_SUCCESS
-rw-r--r--   3 hdfs supergroup         77 2020-04-28 05:51 /benchmarks/TestDFSIO/io_write/part-00000
[hdfs@host1 ~]$ hadoop fs -cat /benchmarks/TestDFSIO/io_write/part-00000
f:rate 399170.84
f:sqrate 5.3120168E7
l:size 314572800
l:tasks 3
l:time 2255
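These raw counters line up with the summary printed earlier (my own arithmetic, assuming l:size is the total bytes written and l:time is the summed task time in milliseconds): 314572800 bytes = 300 MB, and 300 MB / 2.255 s ≈ 133.04 MB/s, which matches the reported "Throughput mb/sec" value.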
# Read test - run the corresponding read test using 3 input files of 100 MB each
-- The read test takes the same parameters as the write test
[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -read -nrFiles 3 -size 100MB
20/04/28 05:57:56 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
20/04/28 05:57:56 INFO fs.TestDFSIO: Date & time: Tue Apr 28 05:57:56 UTC 2020
20/04/28 05:57:56 INFO fs.TestDFSIO: Number of files: 3
20/04/28 05:57:56 INFO fs.TestDFSIO: Total MBytes processed: 300
20/04/28 05:57:56 INFO fs.TestDFSIO: Throughput mb/sec: 1119.4
20/04/28 05:57:56 INFO fs.TestDFSIO: Average IO rate mb/sec: 1122.85
20/04/28 05:57:56 INFO fs.TestDFSIO: IO rate std deviation: 62.68
20/04/28 05:57:56 INFO fs.TestDFSIO: Test exec time sec: 16.27
20/04/28 05:57:56 INFO fs.TestDFSIO:
-- You can also find a local file, created once the job has run successfully, containing the summary stats:
[hdfs@host1 ~]$ ls -ltr
total 4
drwxrwxr-x 3 hdfs hdfs  18 Apr 28 01:29 cache
-rw-rw-r-- 1 hdfs hdfs 265 Apr 28 05:51 TestDFSIO_results.log
[hdfs@host1 ~]$ cat TestDFSIO_results.log
----- TestDFSIO ----- : write
           Date & time: Tue Apr 28 05:51:10 UTC 2020
       Number of files: 3
Total MBytes processed: 300
     Throughput mb/sec: 133.04
Average IO rate mb/sec: 133.06
 IO rate std deviation: 1.6
    Test exec time sec: 17.37

----- TestDFSIO ----- : read
           Date & time: Tue Apr 28 05:57:56 UTC 2020
       Number of files: 3
Total MBytes processed: 300
     Throughput mb/sec: 1119.4
Average IO rate mb/sec: 1122.85
 IO rate std deviation: 62.68
    Test exec time sec: 16.27
-- Clean up all benchmark files
[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -clean
20/04/28 09:24:33 INFO fs.TestDFSIO: TestDFSIO.1.8
20/04/28 09:24:33 INFO fs.TestDFSIO: nrFiles = 1
20/04/28 09:24:33 INFO fs.TestDFSIO: nrBytes (MB) = 1.0
20/04/28 09:24:33 INFO fs.TestDFSIO: bufferSize = 1000000
20/04/28 09:24:33 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
20/04/28 09:24:33 INFO fs.TestDFSIO: Cleaning up test files
For other performance benchmarks, refer to: