Synthetic Control Data

Introduction

This example demonstrates clustering of control charts that exhibit time-series behaviour. Control charts are tools used to determine whether a manufacturing or business process is in a state of statistical control. The charts in this data set were generated / simulated at equal time intervals and are available from the UCI machine learning repository, where the data is also described.

Problem description

A set of control-chart time series needs to be clustered into close-knit groups. The data set we use is synthetic: it resembles real-world measurements without being drawn from any actual process. It contains six different classes (Normal, Cyclic, Increasing trend, Decreasing trend, Upward shift, Downward shift). Given these trends in the input data, Mahout's clustering algorithms should group the series into their corresponding class buckets. By the end of this example you will know how to run a clustering job with Mahout.

Pre-Prep

Make sure you have the following covered before working through the example.

  1. Input data set. Download it here.
    1. Sample input data:
      The input consists of 600 rows and 60 columns. Rows 1 - 100 contain Normal data, rows 101 - 200 contain Cyclic data, and so on. More info here. A sample of the data is shown below.
      _time _time+x _time+2x .. _time+60x
      28.7812 34.4632 31.3381 .. 31.2834
      24.8923 25.741 27.5532 .. 32.8217

      ..
      ..

      35.5351 41.7067 39.1705 48.3964 .. 38.6103
      24.2104 41.7679 45.2228 43.7762 .. 48.8175

      ..
      ..

  2. Setup Hadoop
    1. Assuming that you have installed Hadoop, start the daemons using $HADOOP_HOME/bin/start-all.sh. If you have issues starting Hadoop, please refer to the Hadoop quick start guide.
    2. Copy the input to HDFS using $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata (the HDFS input directory name must be testdata).
  3. Mahout Example job
    Mahout's mahout-examples-$MAHOUT_VERSION.job file does the actual clustering work, so it needs to be built first. This can be done as follows:
    1. cd $MAHOUT_HOME
    2. mvn install. You will see BUILD SUCCESSFUL once all the corresponding tasks complete. The job will be generated in $MAHOUT_HOME/examples/target/ and its name will contain the $MAHOUT_VERSION number. For example, when using the Mahout 0.3 release, the job will be mahout-examples-0.3.job.
      This completes the prerequisites for the clustering process using Mahout; the sketch below recaps the setup commands.
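
The following is a minimal, hedged recap of the pre-prep steps as a single shell session. It assumes the downloaded file is named synthetic_control.data in the current directory and that $HADOOP_HOME and $MAHOUT_HOME are set; adjust names and paths to your environment.

  # Sanity-check the downloaded data set: 600 rows of 60 values each
  wc -l synthetic_control.data                       # expect 600
  awk '{print NF}' synthetic_control.data | sort -u  # expect 60

  # Start the Hadoop daemons and copy the data into the HDFS path "testdata"
  $HADOOP_HOME/bin/start-all.sh
  $HADOOP_HOME/bin/hadoop fs -put synthetic_control.data testdata

  # Build the example job from the Mahout source tree
  cd $MAHOUT_HOME
  mvn install
  ls examples/target/mahout-examples-*.job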

Perform Clustering

With all the pre-work done, clustering the control data is straightforward.

  1. Depending on which clustering technique you want to use, invoke the corresponding job as shown below (a worked k-means example follows this list):
    1. For canopy :
      $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.canopy.Job
    2. For kmeans :
      $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
    3. For fuzzykmeans :
      $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job
    4. For dirichlet :
      $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job
    5. For meanshift :
      $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.meanshift.Job
  2. Get the data out of HDFS 1 2 and have a look at it 3 by following the steps below.
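
As a concrete illustration, here is a hedged end-to-end run of the k-means variant; the other Job classes are invoked the same way, only the package name changes. The job reads its input from the HDFS path testdata and writes its results to the HDFS directory output.

  # Run the synthetic-control k-means example job
  $HADOOP_HOME/bin/hadoop jar \
      $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job \
      org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

  # Confirm that results were written to the "output" directory in HDFS
  $HADOOP_HOME/bin/hadoop fs -ls output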

Read / Analyze Output

  1. Use $HADOOP_HOME/bin/hadoop fs -lsr output to view all outputs.
  2. Use $HADOOP_HOME/bin/hadoop fs -get output $MAHOUT_HOME/examples/output to copy them all to your local machine. The output data points are in vector format.
  3. Computed clusters are contained in output/clusters-i
  4. The final clustered points are placed in output/clusteredPoints
  5. You can run the ClusterDumper on them to read the results; a hedged example is sketched below.
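
The following sketch shows one way to pull the results out of HDFS and inspect the computed clusters with the clusterdump utility. The flag names vary between Mahout versions, and clusters-10 is only an example value; substitute the highest-numbered clusters-i directory your run produced.

  # Inspect and retrieve the output from HDFS
  $HADOOP_HOME/bin/hadoop fs -lsr output
  $HADOOP_HOME/bin/hadoop fs -get output $MAHOUT_HOME/examples/output

  # Dump the final clusters and their member points to a text file
  # (clusters-10 is an example; use the last clusters-i directory of your run)
  $MAHOUT_HOME/bin/mahout clusterdump \
      --seqFileDir output/clusters-10 \
      --pointsDir output/clusteredPoints \
      --output clusteranalyze.txt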

Footnotes
1 See HDFS Shell
2 The output directory is cleared when a new run starts, so retrieve the results before starting another run
3 The Dirichlet job also prints results to the console