Parallel Viterbi

Viterbi algorithm is known as inference algorithm (synonyms: segmentation, decoding etc) for Hidden Markov Model [1] which finds the most likely sequence of hidden states by given sequence of observed states.

Apache Mahout has both sequential and parallel (that's what you're reading about) implementations of the algorithm.

Detailed presentation about parallel viterbi implementation could be found in there (in russian)

Parallelization strategy

is quite straightforward and based on data parallelizm. There are some studies on Viterbi (and Belief Propogation which is inference algorithm for loop-less Markov Random Fields and is quite similar to Viterbi) parallelization, but at the moment of writing this article none of them seem to be applyable for MapReduce paradigm.

For example, forward pass of Viterbi could be represented in terms of matrix computations (as being dynamic programming algorithm) an thus essentially paralleled, but overhead for MapReduce would be greater than profit for parallel matrix multiplication.

Input sequences of observed variables are supposed to be divided into the chunks of some length, enough to store O(N*K) data in main memory. A set of all chunks number N is called a "serie number N". The algorithm process the data from serie number N-1 to serie number N (or vice versa), performing forward and backward Viterbi passes independently for each chunk (and consequently for each sequence) in reducers. Only data that is nescessary for computation of next serie is being outputed by direct output of reducers, all other data is collected in background. For example, when performing forward Viterbi pass only probabilities of last hidden state are nescessary for the next step, backpointers tables could be written in parallel to local store since they would be needed only for backward pass.

If all the sequences are of the same length approximately and the number of sequences to decode is much more that number of reducers, O(N*M/K) time is required to decode them in parallel (N is number of each sequence, M is number of all sequences, K is number of reducers).

Data format

Each sequence of observed states must be stored in sequence files, where key is the name of the sequence and value is ObservedSequenceWritable where number of chunk, data length and data itself are stored. At the moment it is hardcoded requirement, but it seems to be easy to implement any input file format that will output this information.

The easiest way to get adjust plain text files with space-delimeted numbers of observed states to this format is to use "bin/mahout hmmchunks".

After parallel Viterbi is ended, decoded sequences will be stored in sequence files, one for each chunk (key is number of chunk, value is HiddenSequenceWritable). They could be unchunked to plain text space-delimeted numbers of hidden states by "bin/mahout hmmchunks -unchunk".

Usage

Run "bin/mahout pviterbi" and see what it wants from you. That is:

serialized HmmModel (i.e. by LossyHmmModelSerializer class)
input data (observed sequences) in the format described above
paths for temporary storage (i.e. backpointers) and for decoded sequences

References

Wikipedia article