Apache Mahout > Mahout Wiki > Quickstart > Twenty Newsgroups

Twenty Newsgroups Classification Example

Introduction

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering. We will use Mahout Bayes Classifier to create a model that would classify a new document into one of the 20 newsgroup.

Prerequisites

  • Mahout has been downloaded (instructions here)
  • Maven is available
  • Your environment has the following variables:
    HADOOP_HOME Environment variables refers to where Hadoop lives
    MAHOUT_HOME Environment variables refers to where Mahout lives

Instructions for running the example

  1. Start the hadoop daemons by executing the following commands
    $ cd $HADOOP_HOME/bin
    $ ./start-all.sh
    
  2. In the trunk directory of mahout, compile everything and create the mahout job:
    $ cd $MAHOUT_HOME
    $ mvn install
    
  3. Run the 20 newsgroup example by executing the script as below
    $ ./examples/bin/build-20news-bayes.sh
    

    After MAHOUT-857 is committed (available when 0.6 is released), the command will be:

    $ ./examples/bin/classify-20newsgroups.sh
    

    This later version allows you to also try out running Stochastic Gradient Descent (SGD) on the same data.

The script performs the following

    1. Downloads 20news-bydate.tar.gz from the 20newsgroups dataset
    2. Extracts dataset
    3. Generates input dataset for training classifier
    4. Generates input dataset for testing classifier
    5. Trains the classifier
    6. Tests the classifier

Output might look like:

=======================================================
Confusion Matrix
-------------------------------------------------------
a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t   u   <--Classified as
381 0   0   0   0   9   1   0   0   0   1   0   0   2   0   1   0   0   3   0   0    |  398  a     = rec.motorcycles
1   284 0   0   0   0   1   0   6   3   11  0   66  3   0   1   6   0   4   9   0    |  395  b     = comp.windows.x
2   0   339 2   0   3   5   1   0   0   0   0   1   1   12  1   7   0   2   0   0    |  376  c     = talk.politics.mideast
4   0   1   327 0   2   2   0   0   2   1   1   0   5   1   4   12  0   2   0   0    |  364  d     = talk.politics.guns
7   0   4   32  27  7   7   2   0   12  0   0   6   0   100 9   7   31  0   0   0    |  251  e     = talk.religion.misc
10  0   0   0   0   359 2   2   0   1   3   0   1   6   0   1   0   0   11  0   0    |  396  f     = rec.autos
0   0   0   0   0   1   383 9   1   0   0   0   0   0   0   0   0   0   3   0   0    |  397  g     = rec.sport.baseball
1   0   0   0   0   0   9   382 0   0   0   0   1   1   1   0   2   0   2   0   0    |  399  h     = rec.sport.hockey
2   0   0   0   0   4   3   0   330 4   4   0   5   12  0   0   2   0   12  7   0    |  385  i     = comp.sys.mac.hardware
0   3   0   0   0   0   1   0   0   368 0   0   10  4   1   3   2   0   2   0   0    |  394  j     = sci.space
0   0   0   0   0   3   1   0   27  2   291 0   11  25  0   0   1   0   13  18  0    |  392  k     = comp.sys.ibm.pc.hardware
8   0   1   109 0   6   11  4   1   18  0   98  1   3   11  10  27  1   1   0   0    |  310  l     = talk.politics.misc
0   11  0   0   0   3   6   0   10  6   11  0   299 13  0   2   13  0   7   8   0    |  389  m     = comp.graphics
6   0   1   0   0   4   2   0   5   2   12  0   8   321 0   4   14  0   8   6   0    |  393  n     = sci.electronics
2   0   0   0   0   0   4   1   0   3   1   0   3   1   372 6   0   2   1   2   0    |  398  o     = soc.religion.christian
4   0   0   1   0   2   3   3   0   4   2   0   7   12  6   342 1   0   9   0   0    |  396  p     = sci.med
0   1   0   1   0   1   4   0   3   0   1   0   8   4   0   2   369 0   1   1   0    |  396  q     = sci.crypt
10  0   4   10  1   5   6   2   2   6   2   0   2   1   86  15  14  152 0   1   0    |  319  r     = alt.atheism
4   0   0   0   0   9   1   1   8   1   12  0   3   6   0   2   0   0   341 2   0    |  390  s     = misc.forsale
8   5   0   0   0   1   6   0   8   5   50  0   40  2   1   0   9   0   3   256 0    |  394  t     = comp.os.ms-windows.misc
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0    |  0    u     = unknown

Complementary Naive Bayes

To Train a CBayes Classifier using bi-grams

$> $MAHOUT_HOME/bin/mahout trainclassifier \
  -i 20news-input \
  -o newsmodel \
  -type cbayes \
  -ng 2 \
  -source hdfs

To Test a CBayes Classifier using bi-grams

$> $MAHOUT_HOME/bin/mahout testclassifier \
  -m newsmodel \
  -d 20news-input \
  -type cbayes \
  -ng 2 \
  -source hdfs \
  -method mapreduce