org.apache.nutch.analysis.lang
Class HTMLLanguageParser
java.lang.Object
org.apache.nutch.analysis.lang.HTMLLanguageParser
- All Implemented Interfaces:
- Configurable, HtmlParseFilter, Pluggable
public class HTMLLanguageParser
- extends Object
- implements HtmlParseFilter
Adds metadata identifying the language of the document, if one is found.
Statistical language analysis could also be run here, but since this filter only handles HTML, it would miss all other formats.
Field Summary
static org.apache.commons.logging.Log LOG
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
LOG
public static final org.apache.commons.logging.Log LOG
HTMLLanguageParser
public HTMLLanguageParser()
filter
public ParseResult filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc)
- Scans the HTML document for possible indications of the content language:
- 1. html lang attribute (http://www.w3.org/TR/REC-html40/struct/dirlang.html#h-8.1)
- 2. meta dc.language (http://dublincore.org/documents/2000/07/16/usageguide/qualified-html.shtml#language)
- 3. meta http-equiv (content-language) (http://www.w3.org/TR/REC-html40/struct/global.html#h-7.4.4.2)
Only the first occurrence of a language is stored.
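The lookup described above can be illustrated with a minimal, self-contained sketch. It is not the Nutch implementation: the class `LanguageHintSketch` and its methods are hypothetical names, it uses only the JDK's DOM parser (so the input must be well-formed markup), and it assumes the numbered list gives the order of precedence.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of org.apache.nutch.analysis.lang.
public class LanguageHintSketch {

    /** Returns the first language hint found in the document, or null if none. */
    static String detect(Document doc) {
        // 1. lang attribute on the root html element
        Element root = doc.getDocumentElement();
        if (root != null && "html".equalsIgnoreCase(root.getTagName())) {
            String lang = root.getAttribute("lang").trim();
            if (!lang.isEmpty()) {
                return lang;
            }
        }
        NodeList metas = doc.getElementsByTagName("meta");
        // 2. <meta name="dc.language" content="...">
        String dublinCore = firstMetaContent(metas, "name", "dc.language");
        if (dublinCore != null) {
            return dublinCore;
        }
        // 3. <meta http-equiv="content-language" content="...">
        return firstMetaContent(metas, "http-equiv", "content-language");
    }

    /** Returns the content of the first meta element whose attr matches expected. */
    static String firstMetaContent(NodeList metas, String attr, String expected) {
        for (int i = 0; i < metas.getLength(); i++) {
            Element meta = (Element) metas.item(i);
            // getAttribute returns "" when the attribute is absent
            if (expected.equalsIgnoreCase(meta.getAttribute(attr))) {
                String content = meta.getAttribute("content").trim();
                if (!content.isEmpty()) {
                    return content;
                }
            }
        }
        return null;
    }

    /** Parses well-formed markup and applies detect(); assumes XHTML-style input. */
    static String detectFromMarkup(String markup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            return detect(doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("Markup could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html lang=\"en\"><head>"
                + "<meta name=\"dc.language\" content=\"fr\"/>"
                + "</head><body/></html>";
        // The html lang attribute is checked before the meta elements.
        System.out.println(detectFromMarkup(page));
    }
}
```

In this sketch a later hint never overrides an earlier one, which is one way to read "only the first occurrence of a language is stored"; the real filter operates on the DocumentFragment passed to filter and records its result in the parse metadata.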
- Specified by:
filter
in interface HtmlParseFilter
setConf
public void setConf(Configuration conf)
- Specified by:
setConf
in interface Configurable
getConf
public Configuration getConf()
- Specified by:
getConf
in interface Configurable
Copyright © 2006 The Apache Software Foundation