Class TopCommonTokenCounter

java.lang.Object
org.apache.tika.eval.app.tools.TopCommonTokenCounter

public class TopCommonTokenCounter extends Object
Utility class that reads in a UTF-8 input file with one document per row and outputs the 20000 tokens with the highest document frequencies.

The CommonTokensAnalyzer intentionally drops tokens shorter than 4 characters, but includes bigrams for CJK text.

It also has an include list for __email__ and __url__ and a skip list for common HTML markup terms.
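To make "document frequency" concrete, here is a minimal sketch of the counting idea in plain Java. It is not this class's implementation: the class name DocFreqSketch, its methods, and the simple regex tokenizer are all hypothetical, and the sketch does not reproduce the analyzer's CJK bigrams, include list, or skip list. It only assumes one document per row and a minimum token length, as described above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of Apache Tika.
public class DocFreqSketch {

    // For each token, count the number of rows (documents) that contain it.
    // A token repeated within one row is counted once for that row.
    public static Map<String, Integer> documentFrequencies(List<String> rows, int minLength) {
        Map<String, Integer> df = new HashMap<>();
        for (String row : rows) {
            Set<String> seen = new HashSet<>();
            // Assumed tokenizer: split on anything that is not a letter, digit, or underscore.
            for (String token : row.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}_]+")) {
                if (token.length() >= minLength) {
                    seen.add(token);
                }
            }
            for (String token : seen) {
                df.merge(token, 1, Integer::sum);
            }
        }
        return df;
    }

    // Return up to n tokens, highest document frequency first; ties are broken alphabetically.
    public static List<String> topTokens(Map<String, Integer> df, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(df.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        // Two in-memory "documents"; a real run would read rows from a UTF-8 file.
        List<String> rows = List.of("alpha beta alpha", "alpha gamma the");
        Map<String, Integer> df = documentFrequencies(rows, 4);
        System.out.println(topTokens(df, 2)); // prints [alpha, beta]
    }
}
```

In this sketch "alpha" has a document frequency of 2 even though it occurs three times overall, and "the" is dropped because it is shorter than 4 characters.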

  • Constructor Details

    • TopCommonTokenCounter

      public TopCommonTokenCounter()
  • Method Details