org.apache.solr.update.processor
Class TextProfileSignature
java.lang.Object
org.apache.solr.update.processor.Signature
org.apache.solr.update.processor.MD5Signature
org.apache.solr.update.processor.TextProfileSignature
public class TextProfileSignature extends MD5Signature
This implementation is copied from Apache Nutch.
An implementation of a page signature. It calculates an MD5 hash
of a plain text "profile" of a page.
The algorithm to calculate a page "profile" takes the plain text version of
a page and performs the following steps:
- remove all characters except letters and digits, and bring all characters
to lower case,
- split the text into tokens (all consecutive non-whitespace characters),
- discard tokens equal to or shorter than MIN_TOKEN_LEN (default 2 characters),
- sort the list of tokens by decreasing frequency,
- round down the counts of tokens to the nearest multiple of QUANT
(QUANT = QUANT_RATE * maxFreq, where QUANT_RATE is 0.01f by default, and
maxFreq is the maximum token frequency). If maxFreq is higher than 1, then
QUANT is always at least 2 (which means that tokens with frequency 1 are
always discarded).
- discard tokens whose frequency after quantization falls below QUANT,
- create a list of tokens and their quantized frequency, separated by spaces,
in the order of decreasing frequency.
This list is then submitted to an MD5 hash calculation.
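The steps above can be illustrated with a minimal, self-contained sketch. It is not the Solr or Nutch source. The class name TextProfileSketch and its method names are hypothetical, and several details are assumptions: whitespace is kept as the token separator while other non-alphanumeric characters are dropped, QUANT is rounded to an integer and clamped to a minimum of 2 when maxFreq is higher than 1, ties are broken alphabetically, and each profile entry goes on its own line.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Solr.
class TextProfileSketch {
    // Defaults named in the class description.
    static final int MIN_TOKEN_LEN = 2;
    static final float QUANT_RATE = 0.01f;

    /** Builds the text "profile": tokens with quantized counts, most frequent first. */
    static String profile(String text) {
        // Keep letters and digits (lowercased); keep whitespace as a separator (assumption).
        StringBuilder cleaned = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                cleaned.append(Character.toLowerCase(c));
            } else if (Character.isWhitespace(c)) {
                cleaned.append(' ');
            }
        }

        // Count tokens longer than MIN_TOKEN_LEN and track the maximum frequency.
        Map<String, Integer> counts = new HashMap<>();
        int maxFreq = 0;
        for (String token : cleaned.toString().split(" +")) {
            if (token.length() > MIN_TOKEN_LEN) {
                maxFreq = Math.max(maxFreq, counts.merge(token, 1, Integer::sum));
            }
        }

        // QUANT = QUANT_RATE * maxFreq, rounded; the clamping below is an assumption.
        int quant = Math.round(QUANT_RATE * maxFreq);
        if (quant < 2) {
            quant = (maxFreq > 1) ? 2 : 1;
        }

        // Round counts down to a multiple of QUANT and drop those below QUANT.
        List<Map.Entry<String, Integer>> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int quantized = (e.getValue() / quant) * quant;
            if (quantized >= quant) {
                kept.add(new AbstractMap.SimpleEntry<>(e.getKey(), quantized));
            }
        }

        // Decreasing frequency; alphabetical tie-break keeps the output deterministic.
        kept.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Integer> e : kept) {
            if (out.length() > 0) {
                out.append('\n');
            }
            out.append(e.getKey()).append(' ').append(e.getValue());
        }
        return out.toString();
    }

    /** MD5 hash of the profile; always 16 bytes. */
    static byte[] signature(String text) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return md5.digest(profile(text).getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(profile("The quick fox and the lazy fox saw the other fox. The end!"));
    }
}
```

In this sketch, rare tokens fall out of the profile, so two texts that differ only in a few infrequent words, in punctuation, or in letter case produce the same signature. That is the property that makes a profile-based signature useful for near-duplicate detection.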
Fields inherited from class org.apache.solr.update.processor.MD5Signature:
log
Methods inherited from class java.lang.Object:
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
TextProfileSignature
public TextProfileSignature()
init
public void init(SolrParams params)
- Overrides: init in class Signature
getSignature
public byte[] getSignature()
- Overrides: getSignature in class MD5Signature
add
public void add(String content)
- Overrides: add in class MD5Signature
Copyright © 2000-2013 Apache Software Foundation. All Rights Reserved.