When we install Hadoop, we get a few JARs for testing the installation and for benchmarking. In the Cloudera (CDH) distribution, the examples JAR can be found at the following locations:
/opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
/opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/jars/hadoop-mapreduce-examples-3.0.0-cdh6.1.1.jar
/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
/opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-examples-3.0.0-cdh6.1.1.jar
/opt/cloudera/parcels/CDH/jars/hadoop-examples.jar
All of the above JARs are the same file; it is used for installation tests such as wordcount and for benchmarking HDFS, MapReduce, etc.
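On a parcel-based install, /opt/cloudera/parcels/CDH is normally a symlink to the active versioned parcel, and the lib/ entries are symlinks into jars/, so these paths should all resolve to the same file. A quick check you can run yourself (not part of the original walkthrough; the output will vary with your parcel version):
[root@host1 ~]# readlink -f /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
# expected to print something like /opt/cloudera/parcels/CDH-<version>/jars/hadoop-mapreduce-examples-<version>.jar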
Alternatively, you can search for this file as shown below if you are using a different CDH version or have a package-based installation:
[root@host1 ~]# find / -name "*hadoop*example*.jar"
/opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/jars/hadoop-examples.jar
/opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/jars/hadoop-mapreduce-examples-3.0.0-cdh6.1.1.jar
/opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
/opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop-mapreduce/hadoop-mapreduce-examples-3.0.0-cdh6.1.1.jar
/opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hue/apps/oozie/examples/lib/hadoop-examples.jar
This JAR provides various example programs; running it without arguments lists them:
[root@host1 hasnain]# yarn jar /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount: An example job that count the pageview counts from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that counts the average length of the words in the input files.
wordmedian: A map/reduce program that counts the median length of the words in the input files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
Example 1: pi test
pi <number of mappers> <samples per mapper>
[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-examples-3.0.0-cdh6.1.1.jar pi 50 10
Parcel - sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100
Package - sudo -u hdfs hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100
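The first argument is the number of map tasks and the second is the number of samples each mapper computes, so the first run above uses 50 x 10 = 500 samples in total and the parcel/package runs use 10 x 100 = 1,000; more samples generally give a more accurate estimate of Pi.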
Example 2: wordcount test
I am going to run wordcount on the Bible text; you can see the run below:
[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount -Dmapred.reduce.tasks=8 /user/data/bible.txt /user/output/wordcount
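Since the job runs with 8 reduce tasks, the output directory contains eight files, part-r-00000 through part-r-00007 (my inference from the command above), which is why the tail below reads part-r-00006.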
[hdfs@host1 ~]$ hadoop fs -tail /user/output/wordcount/part-r-00006
pons 17
weapons, 2
weary? 2
weeks 8
-- To generate about 30 GB of random text data; you can also use this data for wordcount:
[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar randomtextwriter -Ddfs.replication=1 /user/data/random
[hdfs@host1 ~]$ hadoop fs -du -s -h /user/data/random
30.5 G 30.5 G /user/data/random
[hdfs@host1 ~]$ hadoop fs -df -h
Filesystem                               Size    Used    Available  Use%
hdfs://host1.internal.cloudapp.net:8020  84.7 G  32.9 G  47.0 G     39%
TERAGEN:
This benchmark tests both MapReduce and HDFS by sorting some amount of data as quickly as possible, in order to measure the cluster's ability to distribute and process data. The benchmark consists of 3 components:
TeraGen - generates random data
TeraSort - sorts the data using MapReduce
TeraValidate - validates the sorted output
[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen
teragen <num rows> <output dir>
# The original TeraSort benchmark sorts 10 billion 100-byte records, making the total data size 1 TB. However, we can specify the number of records, making it possible to configure the total size of data.
# Each row is 100 bytes, so to generate 1 GB of data the number of rows is 10000000 (10⁷).
Suppose you want to generate 10 GB of data; then the number of 100-byte rows will be 10⁸.
Explanation: 10 GB = 10⁸ * 100 bytes = 10¹⁰ bytes. (A quick shell check follows the table below.)
SIZE=1G    ROWS=10000000      (10⁷)
SIZE=10G   ROWS=100000000     (10⁸)
SIZE=100G  ROWS=1000000000    (10⁹)
SIZE=500G  ROWS=5000000000    (5 × 10⁹)
SIZE=1T    ROWS=10000000000   (10¹⁰)
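As a sanity check on the table above, the row count for any target size can be derived with simple shell arithmetic (my own sketch; SIZE_GB is decimal GB, i.e. 10⁹ bytes, to match the table):
# Each TeraGen row is exactly 100 bytes, so rows = target bytes / 100
SIZE_GB=10
ROWS=$(( SIZE_GB * 1000000000 / 100 ))
echo $ROWS     # 100000000, the 10G row count from the table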
Step1: Teragen
[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH/jars/hadoop-examples.jar teragen 10000000 /user/hasnain/teragen2
[hdfs@host1 ~]$ hadoop fs -du -h /user/hasnain/teragen2
0 0 /user/hasnain/teragen2/_SUCCESS
476.8 M 1.4 G /user/hasnain/teragen2/part-m-00000
476.8 M 1.4 G /user/hasnain/teragen2/part-m-00001
# To generate 325 MB of data with a block size of 64 MB
[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen -D dfs.blocksize=67108864 3407872 /user/hasnain/teragen
[hdfs@host1 ~]$ hadoop fs -du -h /user/hasnain/teragen
0 0 /user/hasnain/teragen/_SUCCESS
162.5 M 487.5 M /user/hasnain/teragen/part-m-00000
162.5 M 487.5 M /user/hasnain/teragen/part-m-00001
--- It's a good idea to specify the number of map tasks for the teragen computation to speed up data generation. This can be done with the -Dmapred.map.tasks parameter, like:
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen -Ddfs.block.size=536870912 -Dmapred.map.tasks=4 10000000 tera-in
Step2: Terasort
[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar terasort /user/hasnain/teragen /user/hasnain/terasort
---- It's a good idea to specify the number of reduce tasks for the TeraSort computation to speed up the reducer part of the computation. This can be done by specifying the -Dmapred.reduce.tasks parameter as follows:
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort -Dmapred.reduce.tasks=32 tera-in tera-out
-- For 50 GB, 4 reducers is fine; the default number of TeraSort reduce tasks is 1.
STEP3: VALIDATE
TeraValidate checks that the TeraSort output is correctly sorted. If anything is wrong with the sorted output, the output of this reducer reports the problem; a run that produces only a checksum row (as shown below) indicates the sort was valid.
[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teravalidate /user/hasnain/terasort /user/hasnain/teravalidate
[hdfs@host1 ~]$ hadoop fs -ls /user/hasnain/teravalidate
Found 2 items
-rw-r--r--   3 hdfs hasnain          0 2020-04-28 05:45 /user/hasnain/teravalidate/_SUCCESS
-rw-r--r--   3 hdfs hasnain         24 2020-04-28 05:45 /user/hasnain/teravalidate/part-r-00000
[hdfs@host1 ~]$ hadoop fs -cat /user/hasnain/teravalidate/part-r-00000
checksum        19fed0702bc126
-- Also, do not forget to clean up
the terasort data between runs (and after testing is finished)
$ hdfs dfs -rm -r -skipTrash Tera*
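Note that in this walkthrough the data actually lives under /user/hasnain/, so the equivalent cleanup (my adaptation of the generic command above) would be:
$ hdfs dfs -rm -r -skipTrash /user/hasnain/teragen /user/hasnain/teragen2 /user/hasnain/terasort /user/hasnain/teravalidate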
-- Find the job history for TeraSort analysis; this will provide all the stats:
hadoop job -history all <job history file (.jhist)>
[hdfs@host1 ~]$ hadoop fs -ls /user/history
Found 2 items
drwxrwx--x   - mapred hadoop          0 2020-04-28 05:29 /user/history/done
drwxrwxrwt   - mapred hadoop          0 2020-04-28 05:29 /user/history/done_intermediate
[hdfs@host1 ~]$ hadoop fs -ls /user/history/done
Found 1 items
drwxrwx--x   - mapred hadoop          0 2020-04-28 05:29 /user/history/done/2020
[hdfs@host1 ~]$ hadoop fs -ls -R /user/history/done/2020
drwxrwx--x   - mapred hadoop          0 2020-04-28 05:29 /user/history/done/2020/04
drwxrwx--x   - mapred hadoop          0 2020-04-28 05:29 /user/history/done/2020/04/28
drwxrwx--x   - mapred hadoop          0 2020-04-28 05:34 /user/history/done/2020/04/28/000000
-rwxrwx---   3 hdfs   hadoop      20169 2020-04-28 05:29 /user/history/done/2020/04/28/000000/job_1588050916962_0001-1588051764325-hdfs-TeraGen-1588051780683-2-0-SUCCEEDED-root.users.hdfs-1588051771239.jhist
-rwxrwx---   3 hdfs   hadoop     215334 2020-04-28 05:29 /user/history/done/2020/04/28/000000/job_1588050916962_0001_conf.xml
-rwxrwx---   1 hdfs   hadoop      88555 2020-04-28 05:34 /user/history/done/2020/04/28/000000/job_1588050916962_0002-1588052034254-hdfs-TeraSort-1588052056154-6-12-SUCCEEDED-root.users.hdfs-1588052037959.jhist
-rwxrwx---   1 hdfs   hadoop     216449 2020-04-28 05:34 /user/history/done/2020/04/28/000000/job_1588050916962_0002_conf.xml
-- Important stats such as CPU time for map and reduce tasks, and many others, can be obtained with the command below:
[hdfs@host1 ~]$ hadoop job -history /user/history/done/2020/04/28/000000/job_1588050916962_0002-1588052034254-hdfs-TeraSort-1588052056154-6-12-SUCCEEDED-root.users.hdfs-1588052037959.jhist
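The full history dump is long; one way to pull out just the CPU counters (my own suggestion, assuming the counter names contain "CPU", as they do in recent Hadoop releases) is to filter the output:
[hdfs@host1 ~]$ hadoop job -history <.jhist file> | grep -i cpu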
BENCHMARK IO
The TestDFSIO benchmark is a read and write test for HDFS. It is helpful for tasks such as stress testing HDFS.
Run the write test before the read test, since the read test reads the files that the write test creates under /benchmarks/TestDFSIO.
[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar
An example program must be given as the first argument.
Valid program names are:
DFSCIOTest: Distributed i/o benchmark of libhdfs.
MRReliabilityTest: A program that tests the reliability of the MR framework by injecting faults/failures
TestDFSIO: Distributed i/o benchmark.
fail: a job that always fails
gsleep: A sleep job whose mappers create 1MB buffer for every record.
loadgen: Generic map/reduce load generator
mapredtest: A map/reduce test check.
mrbench: A map/reduce benchmark that can create many small jobs
nnbench: A benchmark that stresses the namenode w/ MR.
nnbenchWithoutMR: A benchmark that stresses the namenode w/o MR.
sleep: A job that sleeps at each map and reduce task.
testbigmapoutput: A map/reduce program that works on a very big non-splittable file and does identity map/reduce
testfilesystem: A test for FileSystem read/write.
testmapredsort: A map/reduce program that validates the map-reduce framework's sort.
testsequencefile: A test for flat files of binary key value pairs.
testsequencefileinputformat: A test for sequence file input format.
testtextinputformat: A test for text input format.
threadedmapbench: A map/reduce benchmark that compares the performance of maps with multiple spills over maps with 1 spill
timelineperformance: A job that launches mappers to test timeline service performance.
# Write test - run a write test that generates 3 output files of 100 MB each
Usage: TestDFSIO [genericOptions] -read [-random | -backward | -skip [-skipSize Size]] | -write | -append | -truncate | -clean [-compression codecClassName] [-nrFiles N] [-size Size[B|KB|MB|GB|TB]] [-resFile resultFileName] [-bufferSize Bytes] [-storagePolicy storagePolicyName] [-erasureCodePolicy erasureCodePolicyName]
[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 3 -size 100MB
20/04/28 05:50:52 INFO fs.TestDFSIO: TestDFSIO.1.8
20/04/28 05:50:52 INFO fs.TestDFSIO: nrFiles = 3
20/04/28 05:50:52 INFO fs.TestDFSIO: nrBytes (MB) = 100.0
20/04/28 05:50:52 INFO fs.TestDFSIO: bufferSize = 1000000
20/04/28 05:50:52 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
20/04/28 05:50:53 INFO fs.TestDFSIO: creating control file: 104857600 bytes, 3 files
20/04/28 05:50:53 INFO fs.TestDFSIO: created control files for: 3 files
20/04/28 05:50:53 INFO client.RMProxy: Connecting to ResourceManager at adf2.internal.cloudapp.net/10.0.0.5:8032
20/04/28 05:50:53 INFO client.RMProxy: Connecting to ResourceManager at adf2.internal.cloudapp.net/10.0.0.5:8032
---
---
20/04/28 05:51:10 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
20/04/28 05:51:10 INFO fs.TestDFSIO: Date & time: Tue Apr 28 05:51:10 UTC 2020
20/04/28 05:51:10 INFO fs.TestDFSIO: Number of files: 3
20/04/28 05:51:10 INFO fs.TestDFSIO: Total MBytes processed: 300
20/04/28 05:51:10 INFO fs.TestDFSIO: Throughput mb/sec: 133.04
20/04/28 05:51:10 INFO fs.TestDFSIO: Average IO rate mb/sec: 133.06
20/04/28 05:51:10 INFO fs.TestDFSIO: IO rate std deviation: 1.6
20/04/28 05:51:10 INFO fs.TestDFSIO: Test exec time sec: 17.37
20/04/28 05:51:10 INFO fs.TestDFSIO:
[hdfs@host1 ~]$ hadoop fs -ls /benchmarks/TestDFSIO
Found 3 items
drwxr-xr-x   - hdfs supergroup          0 2020-04-28 05:50 /benchmarks/TestDFSIO/io_control
drwxr-xr-x   - hdfs supergroup          0 2020-04-28 05:51 /benchmarks/TestDFSIO/io_data
drwxr-xr-x   - hdfs supergroup          0 2020-04-28 05:51 /benchmarks/TestDFSIO/io_write
[hdfs@host1 ~]$ hadoop fs -ls /benchmarks/TestDFSIO/io_data
Found 3 items
-rw-r--r--   3 hdfs supergroup  104857600 2020-04-28 05:51 /benchmarks/TestDFSIO/io_data/test_io_0
-rw-r--r--   3 hdfs supergroup  104857600 2020-04-28 05:51 /benchmarks/TestDFSIO/io_data/test_io_1
-rw-r--r--   3 hdfs supergroup  104857600 2020-04-28 05:51 /benchmarks/TestDFSIO/io_data/test_io_2
[hdfs@host1 ~]$ hadoop fs -ls /benchmarks/TestDFSIO/io_write
Found 2 items
-rw-r--r--   3 hdfs supergroup          0 2020-04-28 05:51 /benchmarks/TestDFSIO/io_write/_SUCCESS
-rw-r--r--   3 hdfs supergroup         77 2020-04-28 05:51 /benchmarks/TestDFSIO/io_write/part-00000
[hdfs@host1 ~]$ hadoop fs -cat /benchmarks/TestDFSIO/io_write/part-00000
f:rate 399170.84
f:sqrate 5.3120168E7
l:size 314572800
l:tasks 3
l:time 2255
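These raw counters line up with the summary printed earlier (my own arithmetic, assuming l:size is the total bytes written and l:time is the summed task time in milliseconds): 314572800 bytes = 300 MB, and 300 MB / 2.255 s ≈ 133.04 MB/s, which matches the reported "Throughput mb/sec" value.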
# Read test - run the corresponding read test using 3 input files of 100 MB each
-- The read test takes the same parameters as the write test
[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -read -nrFiles 3 -size 100MB
20/04/28 05:57:56 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
20/04/28 05:57:56 INFO fs.TestDFSIO: Date & time: Tue Apr 28 05:57:56 UTC 2020
20/04/28 05:57:56 INFO fs.TestDFSIO: Number of files: 3
20/04/28 05:57:56 INFO fs.TestDFSIO: Total MBytes processed: 300
20/04/28 05:57:56 INFO fs.TestDFSIO: Throughput mb/sec: 1119.4
20/04/28 05:57:56 INFO fs.TestDFSIO: Average IO rate mb/sec: 1122.85
20/04/28 05:57:56 INFO fs.TestDFSIO: IO rate std deviation: 62.68
20/04/28 05:57:56 INFO fs.TestDFSIO: Test exec time sec: 16.27
20/04/28 05:57:56 INFO fs.TestDFSIO:
-- You can also find a local file, created once the job has run successfully, containing the summary stats:
[hdfs@host1 ~]$ ls -ltr
total 4
drwxrwxr-x 3 hdfs hdfs  18 Apr 28 01:29 cache
-rw-rw-r-- 1 hdfs hdfs 265 Apr 28 05:51 TestDFSIO_results.log
[hdfs@host1 ~]$ cat TestDFSIO_results.log
----- TestDFSIO ----- : write
           Date & time: Tue Apr 28 05:51:10 UTC 2020
       Number of files: 3
Total MBytes processed: 300
     Throughput mb/sec: 133.04
Average IO rate mb/sec: 133.06
 IO rate std deviation: 1.6
    Test exec time sec: 17.37

----- TestDFSIO ----- : read
           Date & time: Tue Apr 28 05:57:56 UTC 2020
       Number of files: 3
Total MBytes processed: 300
     Throughput mb/sec: 1119.4
Average IO rate mb/sec: 1122.85
 IO rate std deviation: 62.68
    Test exec time sec: 16.27
-- Clean up all benchmark files
[hdfs@host1 ~]$ yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -clean
20/04/28 09:24:33 INFO fs.TestDFSIO: TestDFSIO.1.8
20/04/28 09:24:33 INFO fs.TestDFSIO: nrFiles = 1
20/04/28 09:24:33 INFO fs.TestDFSIO: nrBytes (MB) = 1.0
20/04/28 09:24:33 INFO fs.TestDFSIO: bufferSize = 1000000
20/04/28 09:24:33 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
20/04/28 09:24:33 INFO fs.TestDFSIO: Cleaning up test files
For other performance benchmarks, refer to: