org.apache.nutch.parse.pdf
Class PdfParser
java.lang.Object
org.apache.nutch.parse.pdf.PdfParser
- All Implemented Interfaces:
- Configurable, Parser, Pluggable
public class PdfParser
- extends Object
- implements Parser
Parser for the MIME type application/pdf.
It is based on org.pdfbox.*. How well it does the job remains to be seen.
- Author:
- John Xing
Note on 20040614 by Xing:
Some code is kept here for convenience (see inline comments).
It may be moved to more appropriate places once the new codebase
stabilizes, especially after the indexing code is written.
Field Summary |
static org.apache.commons.logging.Log | LOG |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
LOG
public static final org.apache.commons.logging.Log LOG
PdfParser
public PdfParser()
getParse
public ParseResult getParse(Content content)
- Description copied from interface: Parser
This method parses the given content and returns a map of
<key, parse> pairs. Parse instances will be persisted
under the given key.
Note: Meta-redirects should be followed only when they come from
the original URL. For example:
Assume the fetcher is in parsing mode and is currently processing
foo.bar.com/redirect.html. If this URL contains a meta redirect
to another URL, the fetcher should follow the redirect only if the map
contains an entry of the form <"foo.bar.com/redirect.html",
Parse with a ParseStatus indicating the redirect>.
- Specified by:
- getParse in interface Parser
- Parameters:
- content - Content to be parsed
- Returns:
- a map containing <key, parse> pairs
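The meta-redirect rule in the note above can be illustrated with a minimal, self-contained sketch. It does not use the Nutch classes: RedirectRuleSketch, Status, and redirectToFollow are hypothetical stand-ins introduced only for illustration, and a plain java.util.Map plays the role of the <key, parse> map. The only point shown is that a redirect is followed only when it is recorded under the key of the URL currently being processed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT the Nutch
// Parse, ParseStatus, or ParseResult classes.
final class RedirectRuleSketch {

    /** Simplified status: either a plain success or a redirect to a target URL. */
    static final class Status {
        final boolean redirect;
        final String target; // null unless redirect is true

        private Status(boolean redirect, String target) {
            this.redirect = redirect;
            this.target = target;
        }

        static Status success() {
            return new Status(false, null);
        }

        static Status redirectTo(String target) {
            return new Status(true, target);
        }
    }

    /**
     * Returns the redirect target to follow, or null when no redirect should
     * be followed. A redirect counts only if it is stored under the key of
     * the original URL; redirects recorded under other keys are ignored.
     */
    static String redirectToFollow(String originalUrl, Map<String, Status> parseMap) {
        Status status = parseMap.get(originalUrl);
        if (status != null && status.redirect) {
            return status.target;
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, Status> parseMap = new LinkedHashMap<>();
        parseMap.put("foo.bar.com/redirect.html",
                Status.redirectTo("foo.bar.com/target.html"));
        parseMap.put("foo.bar.com/other.html",
                Status.redirectTo("foo.bar.com/elsewhere.html"));

        // The redirect is stored under the original URL, so it is followed.
        System.out.println(redirectToFollow("foo.bar.com/redirect.html", parseMap));
        // No entry exists for this URL, so nothing is followed.
        System.out.println(redirectToFollow("foo.bar.com/index.html", parseMap));
    }
}
```

In this sketch the redirect recorded under foo.bar.com/other.html is never followed while foo.bar.com/index.html is being processed, which mirrors the restriction that only redirects coming from the original URL count.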
setConf
public void setConf(Configuration conf)
- Specified by:
- setConf in interface Configurable
getConf
public Configuration getConf()
- Specified by:
- getConf in interface Configurable
Copyright © 2006 The Apache Software Foundation