Standard workflow for unsupervised training of a spellchecker:

1) Generate a file of word unigram counts using UnigramPriorGenerator.
2) Sort the list into a new sorted file with sort -k3 -nr.
3) Prune the sorted list using heuristics. For example:
   a) Plot the data in gnuplot (log-scale y axis for visibility).
   b) Choose a cutoff point where the curve flattens.
   c) Inspect the bottom of the list for the ratio of typos to rare words.
   d) If no 'real' words are left, go back to b) and choose a new cutoff point accordingly.
   e) Shorten the sorted list to the desired length to create dictionary.txt, retaining the count information.
   f) Replace all " : " separators with "\t" in the dictionary file for the trie data structure.
   (Steps e and f are illustrated in the dictionary-pruning sketch below.)
4) Create frequency-filtered term neighborhoods (edit-distance-based errors) using GenerateTermNeighborhood (see the neighborhood sketch below).
   Inputs: dictionary file
   Output: term neighborhood file
5) Create context triples using GenerateContextTriples (see the context-triple sketch below).
   Inputs: neighborhood file (from step 4), raw text directory, output directory name
   Output: one context triple file per term in the dictionary
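
A minimal sketch of steps 3e-3f, assuming the sorted unigram file from step 2 has whitespace-delimited "word : count" lines (count in field 3, matching the -k3 sort key); the cutoff value and file names are placeholders chosen for illustration.

    # Steps 3e-3f: truncate the sorted unigram list at a chosen count cutoff
    # and write dictionary.txt with tab-separated fields for the trie loader.
    # Assumed input line format: "word : count" (whitespace-delimited).

    MIN_COUNT = 50  # hypothetical cutoff picked from the gnuplot inspection (step 3b)

    with open("sorted_unigrams.txt", encoding="utf-8") as src, \
            open("dictionary.txt", "w", encoding="utf-8") as dst:
        for line in src:
            parts = line.split()             # e.g. ["word", ":", "1234"]
            if len(parts) != 3:
                continue                     # skip malformed lines
            word, _, count = parts
            if int(count) < MIN_COUNT:
                break                        # list is sorted descending, so stop here
            dst.write(f"{word}\t{count}\n")  # " : " separator replaced with a tab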
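
An illustrative sketch of the idea behind step 4: for each dictionary word, collect the edit-distance-1 variants that are themselves dictionary words. This is not the GenerateTermNeighborhood implementation; its frequency-filtering rules and output format may differ, and the file names here are assumptions.

    # Step 4 (conceptual): frequency-filtered term neighborhoods, i.e. for each
    # dictionary word the edit-distance-1 variants that are also in the dictionary.
    import string

    ALPHABET = string.ascii_lowercase

    def edits1(word):
        """All strings one edit away (delete, transpose, replace, insert)."""
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [l + r[1:] for l, r in splits if r]
        transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
        replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
        inserts = [l + c + r for l, r in splits for c in ALPHABET]
        return set(deletes + transposes + replaces + inserts) - {word}

    def load_dictionary(path="dictionary.txt"):
        """Read the tab-separated dictionary produced in step 3."""
        counts = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                word, count = line.rstrip("\n").split("\t")
                counts[word] = int(count)
        return counts

    def write_neighborhoods(counts, out_path="term_neighborhood.txt"):
        """Write each word followed by its in-dictionary edit-distance-1 neighbors."""
        with open(out_path, "w", encoding="utf-8") as out:
            for word in counts:
                neighbors = sorted(edits1(word) & counts.keys())
                if neighbors:
                    out.write(word + "\t" + " ".join(neighbors) + "\n")

    write_neighborhoods(load_dictionary())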
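
An illustrative sketch of step 5, under the assumption that a context triple is the (left word, term, right word) window around each occurrence of a neighborhood term in the raw text, written out as one file per term. The actual tokenization and output layout of GenerateContextTriples may differ; the paths, file extension, and tokenizer regex are placeholders.

    # Step 5 (conceptual): collect (left, term, right) context triples from the
    # raw text directory for every term in the neighborhood file, one output
    # file per term.
    import os
    import re
    from collections import defaultdict

    TOKEN_RE = re.compile(r"[a-z']+")  # assumed tokenizer; adjust as needed

    def load_terms(neighborhood_path="term_neighborhood.txt"):
        """Terms of interest: first column of the neighborhood file from step 4."""
        with open(neighborhood_path, encoding="utf-8") as f:
            return {line.split("\t", 1)[0] for line in f if line.strip()}

    def collect_triples(raw_text_dir, terms):
        """Scan every file in raw_text_dir and record the surrounding word pair."""
        triples = defaultdict(list)
        for name in os.listdir(raw_text_dir):
            path = os.path.join(raw_text_dir, name)
            if not os.path.isfile(path):
                continue
            with open(path, encoding="utf-8") as f:
                tokens = TOKEN_RE.findall(f.read().lower())
            for i in range(1, len(tokens) - 1):
                if tokens[i] in terms:
                    triples[tokens[i]].append((tokens[i - 1], tokens[i], tokens[i + 1]))
        return triples

    def write_triples(triples, out_dir="context_triples"):
        """One tab-separated triple file per term, as in the step 5 output."""
        os.makedirs(out_dir, exist_ok=True)
        for term, rows in triples.items():
            with open(os.path.join(out_dir, term + ".triples"), "w", encoding="utf-8") as out:
                for left, mid, right in rows:
                    out.write(f"{left}\t{mid}\t{right}\n")

    write_triples(collect_triples("raw_text", load_terms()))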