Running Latent Dirichlet Allocation (algorithm) from the Command Line
Since Mahout v0.6, LDA has been implemented as Collapsed Variational Bayes (cvb).
Mahout’s LDA can be launched from the same command line invocation whether
you are running on a single machine in stand-alone mode or on a larger
Hadoop cluster. The difference is determined by the $HADOOP_HOME and
$HADOOP_CONF_DIR environment variables. If both are set and point to an
operating Hadoop cluster on the target machine, then the invocation will run
the LDA algorithm on that cluster. If either environment variable is
missing, then the stand-alone Hadoop configuration will be invoked instead.
./bin/mahout cvb <OPTIONS>
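The selection rule described above can be checked before launching a job. The following is a minimal sketch that assumes only the rule as stated in this section. mahout_mode is a hypothetical helper, not part of Mahout, and it only tests whether the two variables are non-empty, not whether the cluster they point to is actually operating.

```shell
# Hypothetical helper (not part of Mahout): reports which mode the rule
# described above would select, based only on whether both variables are set.
mahout_mode() {
  if [ -n "${HADOOP_HOME:-}" ] && [ -n "${HADOOP_CONF_DIR:-}" ]; then
    echo "cluster"
  else
    echo "stand-alone"
  fi
}

mahout_mode
```

Running the helper in the same shell that will launch ./bin/mahout shows which branch of the rule applies to the current environment.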
- In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
will be generated in $MAHOUT_HOME/core/target/ and its name will contain
the Mahout version number. For example, when using the Mahout 0.3 release, the
job will be mahout-core-0.3.job
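To confirm that the build produced the job file, a small check along these lines may help. It is a sketch, assuming the file name follows the pattern of the 0.3 example above (mahout-core-<version>.job). job_path and check_job are hypothetical helper names, and the current directory is used when MAHOUT_HOME is not set.

```shell
# Hypothetical helper: builds the expected job path for a given version,
# assuming the naming pattern of the 0.3 example (mahout-core-<version>.job).
# Falls back to the current directory when MAHOUT_HOME is unset.
job_path() {
  echo "${MAHOUT_HOME:-.}/core/target/mahout-core-$1.job"
}

# Hypothetical helper: reports whether the expected job file exists.
check_job() {
  if [ -f "$(job_path "$1")" ]; then
    echo "found: $(job_path "$1")"
  else
    echo "missing: $(job_path "$1") (run 'mvn install' in \$MAHOUT_HOME first)"
  fi
}

check_job 0.3
```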
Testing it on a single machine without a cluster
Running it on the cluster
Command line options from Mahout cvb version 0.8
mahout cvb -h
--input (-i) input                                    Path to job input directory.
--output (-o) output                                  The directory pathname for output.
--maxIter (-x) maxIter                                The maximum number of iterations.
--convergenceDelta (-cd) convergenceDelta             The convergence delta value
--overwrite (-ow)                                     If present, overwrite the output directory before running job
--num_topics (-k) num_topics                          Number of topics to learn
--num_terms (-nt) num_terms                           Vocabulary size
--doc_topic_smoothing (-a) doc_topic_smoothing        Smoothing for document/topic distribution
--term_topic_smoothing (-e) term_topic_smoothing      Smoothing for topic/term distribution
--dictionary (-dict) dictionary                       Path to term-dictionary file(s) (glob expression supported)
--doc_topic_output (-dt) doc_topic_output             Output path for the training doc/topic distribution
--topic_model_temp_dir (-mt) topic_model_temp_dir     Path to intermediate model path (useful for restarting)
--iteration_block_size (-block) iteration_block_size  Number of iterations per perplexity check
--random_seed (-seed) random_seed                     Random seed
--test_set_fraction (-tf) test_set_fraction           Fraction of data to hold out for testing
--num_train_threads (-ntt) num_train_threads          number of threads per mapper to train with
--num_update_threads (-nut) num_update_threads        number of threads per mapper to update the model with
--max_doc_topic_iters (-mipd) max_doc_topic_iters     max number of iterations per doc for p(topic|doc) learning
--num_reduce_tasks num_reduce_tasks                   number of reducers to use during model estimation
--backfill_perplexity                                 enable backfilling of missing perplexity values
--help (-h)                                           Print out help
--tempDir tempDir                                     Intermediate output directory
--startPhase startPhase                               First phase to run
--endPhase endPhase                                   Last phase to run
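As an illustration of how the options above combine, the sketch below assembles one possible invocation and prints it rather than running it. All paths and numeric values are hypothetical placeholders, print_cvb_cmd is a hypothetical helper name, and the selection of options is only an example, not a statement of which ones are required.

```shell
# Hypothetical paths and values, chosen only for illustration.
INPUT="lda/input"
OUTPUT="lda/topics"
DICT="lda/dictionary"
DOC_TOPICS="lda/doc-topics"

# Prints the command instead of running it, so it can be reviewed first.
print_cvb_cmd() {
  echo ./bin/mahout cvb \
    --input "$INPUT" \
    --output "$OUTPUT" \
    --dictionary "$DICT" \
    --doc_topic_output "$DOC_TOPICS" \
    --num_topics 20 \
    --maxIter 10 \
    --overwrite
}

print_cvb_cmd
```

Once the printed command looks right, it can be launched from $MAHOUT_HOME by removing the echo inside the helper.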