Title: Creating Vectors from Text # Creating vectors from text # Introduction For clustering and classifying documents it is usually necessary to convert the raw text into vectors that can then be consumed by the clustering [Algorithms](algorithms.html). These approaches are described below. # From Lucene *NOTE: Your Lucene index must be created with the same version of Lucene used in Mahout. As of Mahout 0.9 this is Lucene 4.6.1. If these versions dont match you will likely get "Exception in thread "main" org.apache.lucene.index.CorruptIndexException: Unknown format version: -11" as an error.* Mahout has utilities that allow one to easily produce Mahout Vector representations from a Lucene (and Solr, since they are they same) index. For this, we assume you know how to build a Lucene/Solr index. For those who don't, it is probably easiest to get up and running using [Solr](http://lucene.apache.org/solr) as it can ingest things like PDFs, XML, Office, etc. and create a Lucene index. For those wanting to use just Lucene, see the [Lucene website](http://lucene.apache.org/core) or check out _Lucene In Action_ by Erik Hatcher, Otis Gospodnetic and Mike McCandless. To get started, make sure you get a fresh copy of Mahout from [GitHub](http://mahout.apache.org/developers/buildingmahout.html) and are comfortable building it. It defines interfaces and implementations for efficiently iterating over a data source (it only supports Lucene currently, but should be extensible to databases, Solr, etc.) and produces a Mahout Vector file and term dictionary which can then be used for clustering. The main code for driving this is the driver program located in the org.apache.mahout.utils.vectors package. The driver program offers several input options, which can be displayed by specifying the --help option. Examples of running the driver are included below: #### Generating an output file from a Lucene Index $MAHOUT_HOME/bin/mahout lucene.vector --dir (-d) dir The Lucene directory --idField idField The field in the index containing the index. If null, then the Lucene internal doc id is used which is prone to error if the underlying index changes --output (-o) output The output file --delimiter (-l) delimiter The delimiter for outputting the dictionary --help (-h) Print out help --field (-f) field The field in the index --max (-m) max The maximum number of vectors to output. If not specified, then it will loop over all docs --dictOut (-t) dictOut The output of the dictionary --seqDictOut (-st) seqDictOut The output of the dictionary as sequence file --norm (-n) norm The norm to use, expressed as either a double or "INF" if you want to use the Infinite norm. Must be greater or equal to 0. The default is not to normalize --maxDFPercent (-x) maxDFPercent The max percentage of docs for the DF. Can be used to remove really high frequency terms. Expressed as an integer between 0 and 100. Default is 99. --weight (-w) weight The kind of weight to use. Currently TF or TFIDF --minDF (-md) minDF The minimum document frequency. Default is 1 --maxPercentErrorDocs (-err) mErr The max percentage of docs that can have a null term vector. These are noise document and can occur if the analyzer used strips out all terms in the target field. This percentage is expressed as a value between 0 and 1. The default is 0. #### Example: Create 50 Vectors from an Index $MAHOUT_HOME/bin/mahout lucene.vector --dir $WORK_DIR/wikipedia/solr/data/index --field body --dictOut $WORK_DIR/solr/wikipedia/dict.txt --output $WORK_DIR/solr/wikipedia/out.txt --max 50 This uses the index specified by --dir and the body field in it and writes out the info to the output dir and the dictionary to dict.txt. It only outputs 50 vectors. If you don't specify --max, then all the documents in the index are output. #### Example: Creating 50 Normalized Vectors from a Lucene Index using the [L_2 Norm](http://en.wikipedia.org/wiki/Lp_space) $MAHOUT_HOME/bin/mahout lucene.vector --dir $WORK_DIR/wikipedia/solr/data/index --field body --dictOut $WORK_DIR/solr/wikipedia/dict.txt --output $WORK_DIR/solr/wikipedia/out.txt --max 50 --norm 2 ## From A Directory of Text documents Mahout has utilities to generate Vectors from a directory of text documents. Before creating the vectors, you need to convert the documents to SequenceFile format. SequenceFile is a hadoop class which allows us to write arbitary (key, value) pairs into it. The DocumentVectorizer requires the key to be a Text with a unique document id, and value to be the Text content in UTF-8 format. You may find [Tika](http://tika.apache.org/) helpful in converting binary documents to text. #### Converting directory of documents to SequenceFile format Mahout has a nifty utility which reads a directory path including its sub-directories and creates the SequenceFile in a chunked manner for us. $MAHOUT_HOME/bin/mahout seqdirectory --input (-i) input Path to job input directory. --output (-o) output The directory pathname for output. --overwrite (-ow) If present, overwrite the output directory before running job --method (-xm) method The execution method to use: sequential or mapreduce. Default is mapreduce --chunkSize (-chunk) chunkSize The chunkSize in MegaBytes. Defaults to 64 --fileFilterClass (-filter) fFilterClass The name of the class to use for file parsing. Default: org.apache.mahout.text.PrefixAdditionFilter --keyPrefix (-prefix) keyPrefix The prefix to be prepended to the key --charset (-c) charset The name of the character encoding of the input files. Default to UTF-8 {accepts: cp1252|ascii...} --method (-xm) method The execution method to use: sequential or mapreduce. Default is mapreduce --overwrite (-ow) If present, overwrite the output directory before running job --help (-h) Print out help --tempDir tempDir Intermediate output directory --startPhase startPhase First phase to run --endPhase endPhase Last phase to run The output of seqDirectory will be a Sequence file < Text, Text > of all documents (/sub-directory-path/documentFileName, documentText). #### Creating Vectors from SequenceFile From the sequence file generated from the above step run the following to generate vectors. $MAHOUT_HOME/bin/mahout seq2sparse --minSupport (-s) minSupport (Optional) Minimum Support. Default Value: 2 --analyzerName (-a) analyzerName The class name of the analyzer --chunkSize (-chunk) chunkSize The chunkSize in MegaBytes. Default Value: 100MB --output (-o) output The directory pathname for output. --input (-i) input Path to job input directory. --minDF (-md) minDF The minimum document frequency. Default is 1 --maxDFSigma (-xs) maxDFSigma What portion of the tf (tf-idf) vectors to be used, expressed in times the standard deviation (sigma) of the document frequencies of these vectors. Can be used to remove really high frequency terms. Expressed as a double value. Good value to be specified is 3.0. In case the value is less than 0 no vectors will be filtered out. Default is -1.0. Overrides maxDFPercent --maxDFPercent (-x) maxDFPercent The max percentage of docs for the DF. Can be used to remove really high frequency terms. Expressed as an integer between 0 and 100. Default is 99. If maxDFSigma is also set, it will override this value. --weight (-wt) weight The kind of weight to use. Currently TF or TFIDF. Default: TFIDF --norm (-n) norm The norm to use, expressed as either a float or "INF" if you want to use the Infinite norm. Must be greater or equal to 0. The default is not to normalize --minLLR (-ml) minLLR (Optional)The minimum Log Likelihood Ratio(Float) Default is 1.0 --numReducers (-nr) numReducers (Optional) Number of reduce tasks. Default Value: 1 --maxNGramSize (-ng) ngramSize (Optional) The maximum size of ngrams to create (2 = bigrams, 3 = trigrams, etc) Default Value:1 --overwrite (-ow) If set, overwrite the output directory --help (-h) Print out help --sequentialAccessVector (-seq) (Optional) Whether output vectors should be SequentialAccessVectors. Default is false; true required for running some algorithms (LDA,Lanczos) --namedVector (-nv) (Optional) Whether output vectors should be NamedVectors. If set true else false --logNormalize (-lnorm) (Optional) Whether output vectors should be logNormalize. If set true else false This will create SequenceFiles of tokenized documents < Text, StringTuple > (docID, tokenizedDoc) and vectorized documents < Text, VectorWritable > (docID, TF-IDF Vector). As well, seq2sparse will create SequenceFiles for: a dictionary (wordIndex, word), a word frequency count (wordIndex, count) and a document frequency count (wordIndex, DFCount) in the output directory. The --minSupport option is the min frequency for the word to be considered as a feature; --minDF is the min number of documents the word needs to be in; --maxDFPercent is the max value of the expression (document frequency of a word/total number of document) to be considered as good feature to be in the document. These options are helpful in removing high frequency features like stop words. The vectorized documents can then be used as input to many of Mahout's classification and clustering algorithms. #### Example: Creating Normalized [TF-IDF](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) Vectors from a directory of text documents using [trigrams](http://en.wikipedia.org/wiki/N-gram) and the [L_2 Norm](http://en.wikipedia.org/wiki/Lp_space) Create sequence files from the directory of text documents: $MAHOUT_HOME/bin/mahout seqdirectory -i $WORK_DIR/reuters -o $WORK_DIR/reuters-seqdir -c UTF-8 -chunk 64 -xm sequential Vectorize the documents using trigrams, L_2 length normalization and a maximum document frequency cutoff of 85%. $MAHOUT_HOME/bin/mahout seq2sparse -i $WORK_DIR/reuters-out-seqdir/ -o $WORK_DIR/reuters-out-seqdir-sparse-kmeans --namedVec -wt tfidf -ng 3 -n 2 --maxDFPercent 85 The sequence file in the $WORK_DIR/reuters-out-seqdir-sparse-kmeans/tfidf-vectors directory can now be used as input to the Mahout [k-Means](http://mahout.apache.org/users/clustering/k-means-clustering.html) clustering algorithm. ## Background * [Discussion on centroid calculations with sparse vectors](http://markmail.org/thread/l5zi3yk446goll3o) ## Converting existing vectors to Mahout's format If you are in the happy position to already own a document (as in: texts, images or whatever item you wish to treat) processing pipeline, the question arises of how to convert the vectors into the Mahout vector format. Probably the easiest way to go would be to implement your own Iterable (called VectorIterable in the example below) and then reuse the existing VectorWriter classes: VectorWriter vectorWriter = SequenceFile.createWriter(filesystem, configuration, outfile, LongWritable.class, SparseVector.class); long numDocs = vectorWriter.write(new VectorIterable(), Long.MAX_VALUE);