java.lang.Object
  org.apache.nutch.util.EncodingDetector
public class EncodingDetector
A simple class for detecting character encodings.
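Broadly this encompasses two functions, which are distinctly separate:
- Auto-detecting a set of "clues" from the input (autoDetectClues).
- Taking a set of clues and making a best guess at the actual encoding (guessEncoding).
A caller will often have extra information about what the encoding might be (e.g. from the HTTP header or HTML meta tags; such clues are often wrong but still potentially useful). The types of clues may differ from caller to caller. Thus a typical calling sequence is:
- Call autoDetectClues to generate a set of auto-detected clues.
- Add the caller-specific extra clues with addClue.
- Call guessEncoding to pick the most probable encoding.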
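The clue workflow described above can be illustrated without the Nutch classes. The following is a minimal sketch, not the Nutch implementation: `ClueSketch`, its nested `Clue` type, the constructor argument, the value `-1` for `NO_THRESHOLD`, and the selection policy (highest-scoring clue at or above the threshold, then the first unscored clue, then the default) are all assumptions made for illustration. Only the method names mirror the summary on this page.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the clue workflow; NOT org.apache.nutch.util.EncodingDetector. */
final class ClueSketch {
    /** Assumed marker for "no confidence supplied"; the real constant's value is not shown here. */
    static final int NO_THRESHOLD = -1;

    /** One encoding hint: charset name, where it came from, and a confidence score. */
    private static final class Clue {
        final String value;
        final String source;
        final int confidence;

        Clue(String value, String source, int confidence) {
            this.value = value;
            this.source = source;
            this.confidence = confidence;
        }
    }

    private final List<Clue> clues = new ArrayList<>();
    private final int minConfidence;

    ClueSketch(int minConfidence) {
        this.minConfidence = minConfidence;
    }

    /** Adds a clue without a confidence score, e.g. one taken from an HTTP header. */
    void addClue(String value, String source) {
        addClue(value, source, NO_THRESHOLD);
    }

    void addClue(String value, String source, int confidence) {
        if (value == null || value.isEmpty()) {
            return; // ignore empty hints
        }
        clues.add(new Clue(value, source, confidence));
    }

    void clearClues() {
        clues.clear();
    }

    /** Assumed policy: best scored clue at or above the threshold, else first unscored clue, else default. */
    String guessEncoding(String defaultValue) {
        Clue best = null;
        Clue unscored = null;
        for (Clue c : clues) {
            if (c.confidence == NO_THRESHOLD) {
                if (unscored == null) {
                    unscored = c;
                }
            } else if (c.confidence >= minConfidence
                    && (best == null || c.confidence > best.confidence)) {
                best = c;
            }
        }
        if (best != null) {
            return best.value;
        }
        return unscored != null ? unscored.value : defaultValue;
    }

    public static void main(String[] args) {
        ClueSketch detector = new ClueSketch(50);
        detector.addClue("UTF-8", "detect", 80);   // stands in for an auto-detected clue
        detector.addClue("ISO-8859-1", "header");  // caller-supplied extra clue
        System.out.println(detector.guessEncoding("windows-1252"));
        detector.clearClues();                     // reset before the next page
    }
}
```

With the real class, the auto-detected clues would come from `autoDetectClues(page, filter)` rather than being added by hand, and `clearClues()` would presumably be called before reusing the detector for another page.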
Field Summary | |
---|---|
static org.apache.avro.util.Utf8 | CONTENT_TYPE_UTF8 |
static org.slf4j.Logger | LOG |
static String | MIN_CONFIDENCE_KEY |
static int | NO_THRESHOLD |
Constructor Summary | |
---|---|
EncodingDetector(org.apache.hadoop.conf.Configuration conf) | |
Method Summary | |
---|---|
void | addClue(String value, String source) |
void | addClue(String value, String source, int confidence) |
void | autoDetectClues(WebPage page, boolean filter) |
void | clearClues() - Clears all clues. |
String | guessEncoding(WebPage page, String defaultValue) - Guess the encoding with the previously specified list of clues. |
static String | parseCharacterEncoding(org.apache.avro.util.Utf8 contentTypeUtf8) - Parse the character encoding from the specified content type header. |
static String | resolveEncodingAlias(String encoding) |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final org.apache.avro.util.Utf8 CONTENT_TYPE_UTF8
public static final org.slf4j.Logger LOG
public static final int NO_THRESHOLD
public static final String MIN_CONFIDENCE_KEY
Constructor Detail |
---|
public EncodingDetector(org.apache.hadoop.conf.Configuration conf)
Method Detail |
---|
public void autoDetectClues(WebPage page, boolean filter)
public void addClue(String value, String source, int confidence)
public void addClue(String value, String source)
public String guessEncoding(WebPage page, String defaultValue)
page - URL's row
defaultValue - Default encoding to return if no encoding can be detected with enough confidence. Note that this will not be normalized with resolveEncodingAlias(java.lang.String).
public void clearClues()
public static String resolveEncodingAlias(String encoding)
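This page gives no description for resolveEncodingAlias, but the note on guessEncoding implies it normalizes encoding names. One way such normalization can be done with the JDK alone is sketched below. `AliasSketch` and `canonicalName` are hypothetical names, and returning null for unknown names is an assumption; the actual Nutch method may apply additional rules.

```java
import java.nio.charset.Charset;

/** Hypothetical sketch of alias normalization using only java.nio.charset. */
final class AliasSketch {
    /**
     * Returns the JDK's canonical name for the given charset alias,
     * or null if the name is null, malformed, or unsupported (assumed behavior).
     */
    static String canonicalName(String encoding) {
        try {
            // Charset.forName accepts aliases; name() yields the canonical name.
            return Charset.forName(encoding).name();
        } catch (IllegalArgumentException e) {
            // Covers null, IllegalCharsetNameException and UnsupportedCharsetException.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(canonicalName("latin1"));
        System.out.println(canonicalName("utf8"));
    }
}
```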
public static String parseCharacterEncoding(org.apache.avro.util.Utf8 contentTypeUtf8)
If the content type is null, or there is no explicit character encoding, null is returned.
contentTypeUtf8 - a content type header
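To make the header parsing concrete, here is a minimal sketch of extracting the charset parameter from a content type value. It works on a plain `String` rather than `org.apache.avro.util.Utf8`, and `ContentTypeCharset`, `parseCharset`, and the regular expression are illustrative assumptions, not the code Nutch uses.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: pull the charset parameter out of a Content-Type value. */
final class ContentTypeCharset {
    // Matches charset=VALUE, optionally quoted, stopping at a quote, semicolon or whitespace.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in the header, or null if the header is null or names none. */
    static String parseCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(parseCharset("text/html; charset=UTF-8"));
        System.out.println(parseCharset("text/html"));
    }
}
```

The returned name is not normalized; a caller would typically pass it through an alias resolver before using it as a clue.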