Title: Collocations
# Collocations in Mahout
A collocation is defined as a sequence of words or terms which co-occur
more often than would be expected by chance. Statistically relevant
combinations of terms identify additional lexical units which can be
treated as features in a vector-based representation of a text. A detailed
discussion of collocations can be found on [Wikipedia](http://en.wikipedia.org/wiki/Collocation).
See there for a more detailed discussion of collocations in the [Reuters example](http://comments.gmane.org/gmane.comp.apache.mahout.user/5685).
## Theory behind implementation: Log-Likelihood based Collocation Identification
Mahout provides an implementation of a collocation identification algorithm
which scores collocations using log-likelihood ratio. The log-likelihood
score indicates the relative usefulness of a collocation with regards other
term combinations in the text. Collocations with the highest scores in a
particular corpus will generally be more useful as features.
Calculating the LLR is very straightforward and is described concisely in
[Ted Dunning's blog post](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html)
. Ted describes the series of counts reqired to calculate the LLR for two
events A and B in order to determine if they co-occur more often than pure
chance. These counts include the number of times the events co-occur (k11),
the number of times the events occur without each other (k12 and k21), and
the number of times anything occurs. These counts are summarized in the
following table:
Event A
Everything but Event A
Event B
A and B together (k11)
B but not A (k12)
Everything but Event B
A but not B (k21)
Neither B nor A (k22)
For the purposes of collocation identification, it is useful to begin by
thinking in word pairs, bigrams. In this case the leading or head term from
the pair corresponds to A from the table above, B corresponds to the
trailing or tail term, while neither B nor A is the total number of word
pairs in the corpus less those containing B, A or both B and A.
Given the word pair of 'oscillation overthruster', the Log-Likelihood ratio
is computed by looking at the number of occurences of that word pair in the
corpus, the number of word pairs that begin with 'oscillation' but end with
something other than 'overthruster', the number of word pairs that end with
'overthruster' begin with something other than 'oscillation' and the number
of word pairs in the corpus that contain neither 'oscillation' and
overthruster.
This can be extended from bigrams to trigrams, 4-grams and beyond. In these
cases, the current algorithm uses the first token of the ngram as the head
of the ngram and the remaining n-1 tokens from the ngram, the n-1gram as it
were, as the tail. Given the trigram 'hong kong cavaliers', 'hong' is
treated as the head while 'kong cavaliers' is treated as the tail. Future
versions of this algorithm will allow for variations in which tokens of the
ngram are treated as the head and tail.
Beyond ngrams, it is often useful to inspect cases where individual words
occur around other interesting features of the text such as sentence
boundaries.
## Generating NGrams
The tools that the collocation identification algorithm are embeeded within
either consume tokenized text as input or provide the ability to specify an
implementation of the Lucene Analyzer class perform tokenization in order
to form ngrams. The tokens are passed through a Lucene ShingleFilter to
produce NGrams of the desired length.
Given the text "Alice was beginning to get very tired" as an example,
Lucene's StandardAnalyzer produces the tokens 'alice', 'beginning', 'get',
'very' and 'tired', while the ShingleFilter with a max NGram size set to 3
produces the shingles 'alice beginning', 'alice beginning get', 'beginning
get', 'beginning get very', 'get very', 'get very tired' and 'very tired'.
Note that both bigrams and trigrams are produced here. A future enhancement
to the existing algorithm would involve limiting the output to a particular
gram size as opposed to solely specifiying a max ngram size.
## Running the Collocation Identification Algorithm.
There are a couple ways to run the llr-based collocation algorithm in
mahout
### When creating vectors from a sequence file
The llr collocation identifier is integrated into the process that is used
to create vectors from sequence files of text keys and values. Collocations
are generated when the --maxNGramSize (-ng) option is not specified and
defaults to 2 or is set to a number of 2 or greater. The --minLLR option
can be used to control the cutoff that prevents collocations below the
specified LLR score from being emitted, and the --minSupport argument can
be used to filter out collocations that appear below a certain number of
times.
bin/mahout seq2sparse
Usage:
[--minSupport --analyzerName --chunkSize
--output