### "Gridmix" Benchmark ###
Contents:
0 Overview
1 Getting Started
1.0 Build
1.1 Configure
1.2 Generate test data
2 Running
2.0 General
2.1 Non-Hod cluster
2.2 Hod
2.2.0 Static cluster
2.2.1 Hod cluster
* 0 Overview
The scripts in this package model a cluster workload. The workload is
simulated by generating random data and submitting map/reduce jobs that
mimic observed data-access patterns in user jobs. The full benchmark
generates approximately 2.5TB of (often compressed) input data operated on
by the following simulated jobs:
1) Three stage map/reduce job
Input: 500GB compressed (2TB uncompressed) SequenceFile
(k,v) = (5 words, 100 words)
hadoop-env: FIXCOMPSEQ
Compute1: keep 10% map, 40% reduce
Compute2: keep 100% map, 77% reduce
Input from Compute1
Compute3: keep 116% map, 91% reduce
Input from Compute2
Motivation: Many user workloads are implemented as pipelined map/reduce
jobs, including Pig workloads
2) Large sort of variable key/value size
Input: 500GB compressed (2TB uncompressed) SequenceFile
(k,v) = (5-10 words, 100-10000 words)
hadoop-env: VARCOMPSEQ
Compute: keep 100% map, 100% reduce
Motivation: Processing large, compressed datsets is common.
3) Reference select
Input: 500GB compressed (2TB uncompressed) SequenceFile
(k,v) = (5-10 words, 100-10000 words)
hadoop-env: VARCOMPSEQ
Compute: keep 0.2% map, 5% reduce
1 Reducer
Motivation: Sampling from a large, reference dataset is common.
4) API text sort (java, streaming)
Input: 500GB uncompressed Text
(k,v) = (1-10 words, 0-200 words)
hadoop-env: VARINFLTEXT
Compute: keep 100% map, 100% reduce
Motivation: This benchmark should exercise each of the APIs to
map/reduce
5) Jobs with combiner (word count jobs)
A benchmark load is a mix of different numbers of small, medium, and large jobs of the above types.
The exact mix is specified in an xml file (gridmix_config.xml). We have a Java program to
construct those jobs based on the xml file and put them under the control of a JobControl object.
The JobControl object then submitts the jobs to the cluster and monitors their progress until all jobs complete.
Notes(1-3): Since input data are compressed, this means that each mapper
outputs a lot more bytes than it reads in, typically causing map output
spills.
* 1 Getting Started
1.0 Build
In the src/benchmarks/gridmix dir, type "ant".
gridmix.jar will be created in the build subdir.
copy gridmix.jar to gridmix dir.
1.1 Configure environment variables
One must modify gridmix-env-2 to set the following variables:
HADOOP_HOME The hadoop install location
HADOOP_VERSION The exact hadoop version to be used. e.g. hadoop-0.18.2-dev
HADOOP_CONF_DIR The dir containing the hadoop-site.xml for teh cluster to be used.
USE_REAL_DATA A large data-set will be created and used by the benchmark if it is set to true.
1.2 Configure the job mixture
A default gridmix_conf.xml file is provided.
One may make appropriate changes as necessary on the number of jobs of various types
and sizes. One can also change the number of reducers of each jobs, and specify whether
to compress the output data of a map/reduce job.
Note that one can specify multiple numbers of in the
numOfJobs field and numOfReduces field, like:
javaSort.smallJobs.numOfJobs
8,2
javaSort.smallJobs.numOfReduces
15,70
The above spec means that we will have 8 small java sort jobs with 15 reducers and 2 small java sort
jobs with 17 reducers.
1.3 Generate test data
Test data is generated using the generateGridmix2Data.sh script.
./generateGridmix2Data.sh
One may modify the structure and size of the data generated here.
It is sufficient to run the script without modification, though it may
require up to 4TB of free space in the default filesystem. Changing the size
of the input data (COMPRESSED_DATA_BYTES, UNCOMPRESSED_DATA_BYTES,
INDIRECT_DATA_BYTES) is safe. A 4x compression ratio for generated, block
compressed data is typical.
* 2 Running
You need to set HADOOP_CONF_DIR to the right directory where hadoop-site.xml exists.
Then you just need to type
./rungridmix_2
It will create start.out to record the start time, and at the end, it will create end.out to record the
endi time.