Running Latent Dirichlet Allocation (algorithm) from the Command Line

Since Mahout v0.6 lda has been implemented as Collapsed Variable Bayes (cvb).

Mahout’s LDA can be launched from the same command line invocation whether you are running on a single machine in stand-alone mode or on a larger Hadoop cluster. The difference is determined by the $HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to an operating Hadoop cluster on the target machine then the invocation will run the LDA algorithm on that cluster. If either of the environment variables are missing then the stand-alone Hadoop configuration will be invoked instead.

./bin/mahout cvb <OPTIONS>
  • In $MAHOUT_HOME/, build the jar containing the job (mvn install) The job will be generated in $MAHOUT_HOME/core/target/ and it’s name will contain the Mahout version number. For example, when using Mahout 0.3 release, the job will be mahout-core-0.3.job

Testing it on one single machine w/o cluster

  • Put the data: cp testdata
  • Run the Job:

    ./bin/mahout cvb -i testdata

Running it on the cluster

  • (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
  • Put the data: $HADOOP_HOME/bin/hadoop fs -put testdata
  • Run the Job:

    export HADOOP_HOME= export HADOOP_CONF_DIR=$HADOOP_HOME/conf ./bin/mahout cvb -i testdata

  • Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output to view all outputs.

Command line options from Mahout cvb version 0.8

mahout cvb -h 
  --input (-i) input					  Path to job input directory.	      
  --output (-o) output					  The directory pathname for output.  
  --maxIter (-x) maxIter				  The maximum number of iterations.		
  --convergenceDelta (-cd) convergenceDelta		  The convergence delta value		    
  --overwrite (-ow)					  If present, overwrite the output directory before running job    
  --num_topics (-k) num_topics				  Number of topics to learn		 
  --num_terms (-nt) num_terms				  Vocabulary size   
  --doc_topic_smoothing (-a) doc_topic_smoothing	  Smoothing for document/topic distribution	     
  --term_topic_smoothing (-e) term_topic_smoothing	  Smoothing for topic/term distribution 	 
  --dictionary (-dict) dictionary			  Path to term-dictionary file(s) (glob expression supported) 
  --doc_topic_output (-dt) doc_topic_output		  Output path for the training doc/topic distribution	     
  --topic_model_temp_dir (-mt) topic_model_temp_dir	  Path to intermediate model path (useful for restarting)       
  --iteration_block_size (-block) iteration_block_size	  Number of iterations per perplexity check  
  --random_seed (-seed) random_seed			  Random seed	    
  --test_set_fraction (-tf) test_set_fraction		  Fraction of data to hold out for testing  
  --num_train_threads (-ntt) num_train_threads		  number of threads per mapper to train with  
  --num_update_threads (-nut) num_update_threads	  number of threads per mapper to update the model with	       
  --max_doc_topic_iters (-mipd) max_doc_topic_iters	  max number of iterations per doc for p(topic|doc) learning		  
  --num_reduce_tasks num_reduce_tasks			  number of reducers to use during model estimation 	   
  --backfill_perplexity 				  enable backfilling of missing perplexity values		
  --help (-h)						  Print out help    
  --tempDir tempDir					  Intermediate output directory	     
  --startPhase startPhase				  First phase to run    
  --endPhase endPhase					  Last phase to run