Generate "more like this" similarity queries.
/// Based on this mail:
///
/// Lucene does let you access the document frequency of terms, with IndexReader.DocFreq().
/// Term frequencies can be computed by re-tokenizing the text, which, for a single document,
/// is usually fast enough. But looking up the DocFreq() of every term in the document is
/// probably too slow.
///
/// You can use some heuristics to prune the set of terms, to avoid calling DocFreq() too much,
/// or at all. Since you're trying to maximize a tf*idf score, you're probably most interested
/// in terms with a high tf. Choosing a tf threshold even as low as two or three will radically
/// reduce the number of terms under consideration. Another heuristic is that terms with a
/// high idf (i.e., a low df) tend to be longer. So you could threshold the terms by the
/// number of characters, not selecting anything less than, e.g., six or seven characters.
/// With these sorts of heuristics you can usually find a small set of, e.g., ten or fewer terms
/// that do a pretty good job of characterizing a document.
///
/// It all depends on what you're trying to do. If you're trying to eke out that last percent
/// of precision and recall regardless of computational difficulty so that you can win a TREC
/// competition, then the techniques I mention above are useless. But if you're trying to
/// provide a "more like this" button on a search results page that does a decent job and has
/// good performance, such techniques might be useful.
///
/// An efficient, effective "more-like-this" query generator would be a great contribution, if
/// anyone's interested. I'd imagine that it would take a Reader or a String (the document's
/// text) and an Analyzer, and return a set of representative terms using heuristics like those
/// above. The frequency and length thresholds could be parameters, etc.
///
/// Doug
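///
/// As a rough illustration of those pruning heuristics (this sketch is not part of the
/// mail and is independent of this class's own implementation), assume the document has
/// already been tokenized into a term-to-frequency map; SelectRepresentativeTerms and
/// docFreq are hypothetical names, and docFreq stands in for IndexReader.DocFreq().
/// It needs System, System.Collections.Generic and System.Linq.
///
/// // Pick terms worth querying on, applying the tf and word-length thresholds
/// // before paying for any document-frequency lookups.
/// static IEnumerable<string> SelectRepresentativeTerms(
///     IDictionary<string, int> termFreqs,   // term -> frequency within the document
///     Func<string, int> docFreq,            // e.g. term => ir.DocFreq(new Term("body", term))
///     int numDocs,                          // total number of docs in the index
///     int minTermFreq = 2,                  // tf threshold ("even as low as two or three")
///     int minWordLen = 6,                   // length threshold ("six or seven characters")
///     int maxQueryTerms = 10)               // "ten or fewer terms"
/// {
///     return termFreqs
///         // cheap heuristics first: frequent enough and long enough
///         .Where(kv => kv.Value >= minTermFreq && kv.Key.Length >= minWordLen)
///         // then score the survivors with a simple tf*idf weight
///         .Select(kv => new
///         {
///             Term = kv.Key,
///             Score = kv.Value * Math.Log((double)numDocs / (docFreq(kv.Key) + 1))
///         })
///         .OrderByDescending(t => t.Score)
///         .Take(maxQueryTerms)
///         .Select(t => t.Term);
/// }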
///
/// Initial Usage
///
/// This class has lots of options to try to make it efficient and flexible.
/// See the source below for real code, or, if you want pseudocode, the simplest
/// possible usage is as follows. The fragment creating and calling the MoreLikeThis
/// instance is the part specific to this class.
///
///
///
/// IndexReader ir = ...
/// IndexSearcher searcher = ...
///
/// MoreLikeThis mlt = new MoreLikeThis(ir);
/// Reader target = ... // orig source of doc you want to find similarities to
/// Query query = mlt.Like( target);
///
/// Hits hits = searcher.Search(query);
/// // now the usual iteration through 'hits' - the only thing to watch for is to make sure
/// // you ignore the doc if it matches your 'target' document, as it should be similar to itself
///
///
///
/// Thus you:
///
/// - do your normal Lucene setup for searching,
/// - create a MoreLikeThis,
/// - get the text of the doc you want to find similarities to,
/// - then call one of the Like() calls to generate a similarity query,
/// - call the searcher to find the similar docs.
///
///
/// More Advanced Usage
///
/// You may want to use SetFieldNames(...) so you can examine
/// multiple fields (e.g. body and title) for similarity.
///
///
/// Depending on the size of your index and the size and makeup of your documents, you
/// may want to call the other set methods to control how the similarity queries are
/// generated (a short example follows the list below):
///
/// - SetMinTermFreq(...)
/// - SetMinDocFreq(...)
/// - SetMaxDocFreq(...)
/// - SetMaxDocFreqPct(...)
/// - SetMinWordLen(...)
/// - SetMaxWordLen(...)
/// - SetMaxQueryTerms(...)
/// - SetMaxNumTokensParsed(...)
/// - SetStopWords(...)
///
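///
/// For example, assuming the set-style methods listed above (the values here are
/// illustrative only), a query over both title and body with tightened thresholds
/// could be built like this:
///
/// MoreLikeThis mlt = new MoreLikeThis(ir);
/// mlt.SetFieldNames(new[] { "title", "body" }); // compare on more than one field
/// mlt.SetMinTermFreq(2);                        // ignore terms occurring only once in the doc
/// mlt.SetMinDocFreq(5);                         // ignore terms found in too few docs
/// mlt.SetMinWordLen(5);                         // skip short, low-idf words
/// mlt.SetMaxQueryTerms(25);                     // cap the size of the generated query
///
/// Query query = mlt.Like(target);               // 'target' as in the Initial Usage example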
///
///
///
/// Changes: Mark Harwood 29/02/04
/// Some bugfixing, some refactoring, some optimisation.
/// - bugfix: retrieveTerms(int docNum) was not working for indexes without a term vector - added missing code
/// - bugfix: no significant terms were being created for fields with a term vector, because
/// only one occurrence per term/field pair was counted (i.e. frequency info from the TermVector was not included)
/// - refactor: moved common code into isNoiseWord()
/// - optimise: when no term vector support is available, maxNumTermsParsed is used to limit the amount of tokenization
///
///
public sealed class MoreLikeThis
{
///