org.apache.nutch.analysis.lang
Class HTMLLanguageParser
java.lang.Object
org.apache.nutch.analysis.lang.HTMLLanguageParser
- All Implemented Interfaces:
- Configurable, HtmlParseFilter, Pluggable
public class HTMLLanguageParser
- extends Object
- implements HtmlParseFilter
Field Summary
static org.slf4j.Logger LOG

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
LOG
public static final org.slf4j.Logger LOG
HTMLLanguageParser
public HTMLLanguageParser()
filter
public ParseResult filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc)
- Scans the HTML document for possible indications of the content language:
- 1. html lang attribute (http://www.w3.org/TR/REC-html40/struct/dirlang.html#h-8.1)
- 2. meta dc.language (http://dublincore.org/documents/2000/07/16/usageguide/qualified-html.shtml#language)
- 3. meta http-equiv (content-language) (http://www.w3.org/TR/REC-html40/struct/global.html#h-7.4.4.2)
- Specified by: filter in interface HtmlParseFilter
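
The three indications listed for filter can be illustrated with a small stand-alone sketch. This is not the Nutch implementation: it uses only the JDK's org.w3c.dom API, it assumes the numbered list gives the order of precedence, and the class name LanguageHintSketch and its methods parse, detect and find are hypothetical names chosen for illustration. It also assumes lowercase attribute names, since the JDK's XML DOM treats them as case-sensitive.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical illustration only; not part of org.apache.nutch.analysis.lang.
public class LanguageHintSketch {

    /** Parses a well-formed XHTML string into a DOM tree (demo helper). */
    public static Node parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    }

    /** Returns the first language hint found, in the assumed order, or null. */
    public static String detect(Node root) {
        // 1. <html lang="...">
        String lang = find(root, "html", null, null, "lang");
        // 2. <meta name="dc.language" content="...">
        if (lang == null) lang = find(root, "meta", "name", "dc.language", "content");
        // 3. <meta http-equiv="content-language" content="...">
        if (lang == null) lang = find(root, "meta", "http-equiv", "content-language", "content");
        return lang;
    }

    /**
     * Depth-first search for an element named tag whose keyAttr equals keyValue
     * (ignoring case; skipped when keyAttr is null) and whose valueAttr is
     * non-empty. Returns that attribute value, or null if nothing matches.
     */
    private static String find(Node node, String tag, String keyAttr,
                               String keyValue, String valueAttr) {
        if (node instanceof Element) {
            Element e = (Element) node;
            boolean tagMatches = tag.equalsIgnoreCase(e.getTagName());
            // getAttribute returns "" when the attribute is absent.
            boolean keyMatches = keyAttr == null
                    || keyValue.equalsIgnoreCase(e.getAttribute(keyAttr).trim());
            String value = e.getAttribute(valueAttr).trim();
            if (tagMatches && keyMatches && !value.isEmpty()) {
                return value;
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            String found = find(child, tag, keyAttr, keyValue, valueAttr);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        Node page = parse("<html lang=\"en\"><head>"
                + "<meta name=\"dc.language\" content=\"fr\"/>"
                + "</head><body/></html>");
        System.out.println(detect(page)); // the html lang attribute is checked first
    }
}
```

In this sketch the html lang attribute takes priority over both meta tags, and a null result means the document carries no explicit hint, so a caller would need some other way to determine the language.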
setConf
public void setConf(Configuration conf)
- Specified by: setConf in interface Configurable
getConf
public Configuration getConf()
- Specified by: getConf in interface Configurable
Copyright © 2011 The Apache Software Foundation