Class Discovery
See http://www.cs.bham.ac.uk/~wbl/biblio/gecco1999/GP-417.pdf
CDGA uses a Genetic Algorithm to discover a classification rule for a given dataset.
A dataset can be seen as a table:
| | attribute 1 | attribute 2 | ... | attribute N |
|---|---|---|---|---|
| row 1 | value1 | value2 | ... | valueN |
| row 2 | value1 | value2 | ... | valueN |
| ... | ... | ... | ... | ... |
| row M | value1 | value2 | ... | valueN |
An attribute can be numerical, for example a "temperature" attribute, or categorical, for example a "color" attribute. For classification purposes, one of the categorical attributes is designated as a label, which means that its value defines the class of the rows.
A classification rule can be represented as follows:
| | attribute 1 | attribute 2 | ... | attribute N |
|---|---|---|---|---|
| weight | w1 | w2 | ... | wN |
| operator | op1 | op2 | ... | opN |
| value | value1 | value2 | ... | valueN |
For a given target class and a weight threshold, the classification rule can be read as follows:
    for each row of the dataset
      if (rule.w1 < threshold || (rule.w1 >= threshold && row.value1 rule.op1 rule.value1))
         && (rule.w2 < threshold || (rule.w2 >= threshold && row.value2 rule.op2 rule.value2))
         && ...
         && (rule.wN < threshold || (rule.wN >= threshold && row.valueN rule.opN rule.valueN))
      then row is part of the target class
Important: The label attribute is not evaluated by the rule.
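The reading above can be sketched in Python. This is an illustrative sketch, not Mahout's implementation: the `Condition` tuple, the `rule_matches` function, and the operator table (`<`, `>=`, `==`, `!=`) are assumptions made for the example, and the row is assumed to already exclude the label attribute.

```python
import operator
from typing import NamedTuple, Sequence

# Hypothetical operator table; the operators actually available
# depend on the attribute type.
OPERATORS = {
    "<": operator.lt,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
}


class Condition(NamedTuple):
    """One column of the rule table: a weight, an operator and a value."""
    weight: float
    op: str
    value: object


def rule_matches(row: Sequence, rule: Sequence[Condition], threshold: float) -> bool:
    """Return True if the row satisfies every condition that is not skipped.

    `row` holds the non-label attribute values, in the same order as `rule`.
    """
    for row_value, cond in zip(row, rule):
        if cond.weight < threshold:
            continue  # weight too small: the condition is skipped
        if not OPERATORS[cond.op](row_value, cond.value):
            return False
    return True


# A two-attribute rule: temperature >= 30 (weight 0.8), color == "red" (weight 0.2).
rule = [Condition(0.8, ">=", 30), Condition(0.2, "==", "red")]

print(rule_matches([35, "blue"], rule, threshold=0.5))  # True: second condition skipped
print(rule_matches([35, "blue"], rule, threshold=0.1))  # False: both conditions evaluated
```

Lowering the threshold brings more conditions into play, so the same row can stop matching the rule.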
The threshold parameter allows some conditions of the rule to be skipped if their weight is too small. The operators available depend on the attribute types:
The "threshold" and "target" are user-defined parameters. Because the label is always a categorical attribute, the target is the zero-based index of the class label value among all the possible values of the label. For example, if the label attribute can have the following values (blue, brown, green), then a target of 1 means the "brown" class.
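As a small illustration of the zero-based target index, the sketch below looks up the class designated by a target value. The `label_values` list, its ordering, and the `target_class` helper are assumptions based on the example values (blue, brown, green); they are not part of CDGA.

```python
# Possible values of the label attribute, in the order used for indexing
# (assumed here to be the order in which the text lists them).
label_values = ["blue", "brown", "green"]


def target_class(target: int) -> str:
    """Return the label value designated by a zero-based target index."""
    return label_values[target]


# Print every target index next to the class it selects.
for target in range(len(label_values)):
    print(target, target_class(target))
```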
For example, we have the following dataset (the label attribute is "Eyes Color"):
| | Age | Eyes Color | Hair Color |
|---|---|---|---|
| row 1 | 16 | brown | dark |
| row 2 | 25 | green | light |
| row 3 | 12 | blue | light |
and a classification rule:
| | Age | Hair Color |
|---|---|---|
| weight | 0 | 1 |
| operator | < | != |
| value | 20 | light |
and the following parameters: threshold = 1 and target = 0 (brown).
This rule can be read as follows:
    for each row of the dataset
      if (0 < 1 || (0 >= 1 && row.value1 < 20))
         && (1 < 1 || (1 >= 1 && row.value2 != light))
      then row is part of the "brown Eyes Color" class

Please note how the rule skips the label attribute (Eyes Color), so that value2 refers to Hair Color, and how the first condition is ignored because its weight is below the threshold.
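The worked example can be checked with a short Python sketch. The data layout, the `LABEL_INDEX` constant, and the `predict` helper are assumptions made for illustration; they do not mirror Mahout's classes.

```python
import operator

# Only the operators used in the example rule.
OPERATORS = {"<": operator.lt, "!=": operator.ne}

# Columns: Age, Eyes Color (label), Hair Color
dataset = [
    (16, "brown", "dark"),
    (25, "green", "light"),
    (12, "blue", "light"),
]
LABEL_INDEX = 1

# One (weight, operator, value) triple per non-label attribute: Age, Hair Color.
rule = [(0, "<", 20), (1, "!=", "light")]
threshold = 1
target_label = "brown"  # target = 0 in the example


def predict(row):
    """Apply the rule to a row, skipping the label attribute."""
    attributes = [v for i, v in enumerate(row) if i != LABEL_INDEX]
    return all(
        weight < threshold or OPERATORS[op](attr, value)
        for attr, (weight, op, value) in zip(attributes, rule)
    )


for row in dataset:
    predicted = predict(row)
    actual = row[LABEL_INDEX] == target_label
    print(row, "predicted:", predicted, "actual:", actual)
```

Only row 1 satisfies the rule, and it is also the only row whose label is brown, so the rule classifies this small dataset correctly.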
NOTE: Substitute the appropriate version of the Mahout JOB jar in the commands below.
<HADOOP_HOME>/bin/hadoop dfs -put <MAHOUT_HOME>/examples/src/test/resources/wdbc wdbc
<HADOOP_HOME>/bin/hadoop dfs -put <MAHOUT_HOME>/examples/src/test/resources/wdbc.infos wdbc.infos
<HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/build/apache-mahout-examples-0.1-dev.job org.apache.mahout.ga.watchmaker.cd.CDGA <MAHOUT_HOME>/examples/src/test/resources/wdbc 1 0.9 1 0.033 0.1 0 100 10
CDGA needs 9 parameters:
For more information about the 4th parameter, please see Multi-point Crossover.
For a detailed explanation of the 5th, 6th and 7th parameters, please see Real Valued Mutation.
TODO: Fill in where to find the output and what it means.
To run properly, CDGA needs some information about the dataset. Each dataset should be accompanied by an .infos file that contains this information. For each attribute, a corresponding line in the .infos file describes it; the line can be one of the following:
This file can be generated automatically using a special tool available with CDGA.
$ <HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/build/apache-mahout-examples-0.1-dev.job org.apache.mahout.ga.watchmaker.cd.tool.CDInfosTool dataset_path