Apache Mahout: Scalable machine learning and data mining

{excerpt}Is a weight measure often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.{excerpt} In other words if a term/word appears lots in a document but also appears lots in the corpus/collection as a whole it will get a lower score. An example of this would be “the”, “and”, “it” but depending on your source material it maybe other words that are very common to the source matter.

Twitter

Apache Software Foundation

Related Projects