Standard workflow for unsupervised training of a spellchecker:

1) Generate a file of word unigram counts using UnigramPriorGenerator.
2) Sort the list into a new sorted file with sort -k3 -nr.
3) Prune the sorted list using heuristics. For example:
   a) Plot the data in gnuplot (log-scale y axis for visibility).
   b) Choose a cutoff point where the curve flattens.
   c) Inspect the bottom of the list for the ratio of typos to rare words.
   d) If no 'real' words are left, go back to b) and choose a new cutoff point accordingly.
   e) Shorten the sorted list to the desired length to create dictionary.txt, retaining the count information.
   f) Replace all " : " separators with "\t" in the dictionary file for the trie data structure.
   (Steps e and f are illustrated in the dictionary-pruning sketch below.)
4) Create frequency-filtered term neighborhoods (edit-distance-based errors) using GenerateTermNeighborhood (see the neighborhood sketch below).
   Inputs: dictionary file
   Output: term neighborhood file
5) Create context triples using GenerateContextTriples (see the context-triple sketch below).
   Inputs: neighborhood file (from step 4), raw text directory, output directory name
   Output: one context triple file per term in the dictionary
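
A minimal sketch of steps 3e-3f, assuming the sorted unigram file from step 2 has whitespace-delimited "word : count" lines (count in field 3, matching the -k3 sort key); the cutoff value and file names are placeholders chosen for illustration.

    # Steps 3e-3f: truncate the sorted unigram list at a chosen count cutoff
    # and write dictionary.txt with tab-separated fields for the trie loader.
    # Assumed input line format: "word : count" (whitespace-delimited).

    MIN_COUNT = 50  # hypothetical cutoff picked from the gnuplot inspection (step 3b)

    with open("sorted_unigrams.txt", encoding="utf-8") as src, \
            open("dictionary.txt", "w", encoding="utf-8") as dst:
        for line in src:
            parts = line.split()             # e.g. ["word", ":", "1234"]
            if len(parts) != 3:
                continue                     # skip malformed lines
            word, _, count = parts
            if int(count) < MIN_COUNT:
                break                        # list is sorted descending, so stop here
            dst.write(f"{word}\t{count}\n")  # " : " separator replaced with a tab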
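
An illustrative sketch of the idea behind step 4: for each dictionary word, collect the edit-distance-1 variants that are themselves dictionary words. This is not the GenerateTermNeighborhood implementation; its frequency-filtering rules and output format may differ, and the file names here are assumptions.

    # Step 4 (conceptual): frequency-filtered term neighborhoods, i.e. for each
    # dictionary word the edit-distance-1 variants that are also in the dictionary.
    import string

    ALPHABET = string.ascii_lowercase

    def edits1(word):
        """All strings one edit away (delete, transpose, replace, insert)."""
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [l + r[1:] for l, r in splits if r]
        transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
        replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
        inserts = [l + c + r for l, r in splits for c in ALPHABET]
        return set(deletes + transposes + replaces + inserts) - {word}

    def load_dictionary(path="dictionary.txt"):
        """Read the tab-separated dictionary produced in step 3."""
        counts = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                word, count = line.rstrip("\n").split("\t")
                counts[word] = int(count)
        return counts

    def write_neighborhoods(counts, out_path="term_neighborhood.txt"):
        """Write each word followed by its in-dictionary edit-distance-1 neighbors."""
        with open(out_path, "w", encoding="utf-8") as out:
            for word in counts:
                neighbors = sorted(edits1(word) & counts.keys())
                if neighbors:
                    out.write(word + "\t" + " ".join(neighbors) + "\n")

    write_neighborhoods(load_dictionary())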
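
An illustrative sketch of step 5, under the assumption that a context triple is the (left word, term, right word) window around each occurrence of a neighborhood term in the raw text, written out as one file per term. The actual tokenization and output layout of GenerateContextTriples may differ; the paths, file extension, and tokenizer regex are placeholders.

    # Step 5 (conceptual): collect (left, term, right) context triples from the
    # raw text directory for every term in the neighborhood file, one output
    # file per term.
    import os
    import re
    from collections import defaultdict

    TOKEN_RE = re.compile(r"[a-z']+")  # assumed tokenizer; adjust as needed

    def load_terms(neighborhood_path="term_neighborhood.txt"):
        """Terms of interest: first column of the neighborhood file from step 4."""
        with open(neighborhood_path, encoding="utf-8") as f:
            return {line.split("\t", 1)[0] for line in f if line.strip()}

    def collect_triples(raw_text_dir, terms):
        """Scan every file in raw_text_dir and record the surrounding word pair."""
        triples = defaultdict(list)
        for name in os.listdir(raw_text_dir):
            path = os.path.join(raw_text_dir, name)
            if not os.path.isfile(path):
                continue
            with open(path, encoding="utf-8") as f:
                tokens = TOKEN_RE.findall(f.read().lower())
            for i in range(1, len(tokens) - 1):
                if tokens[i] in terms:
                    triples[tokens[i]].append((tokens[i - 1], tokens[i], tokens[i + 1]))
        return triples

    def write_triples(triples, out_dir="context_triples"):
        """One tab-separated triple file per term, as in the step 5 output."""
        os.makedirs(out_dir, exist_ok=True)
        for term, rows in triples.items():
            with open(os.path.join(out_dir, term + ".triples"), "w", encoding="utf-8") as out:
                for left, mid, right in rows:
                    out.write(f"{left}\t{mid}\t{right}\n")

    write_triples(collect_triples("raw_text", load_terms()))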