org.apache.nutch.parse.msword
Class MSWordParser

java.lang.Object
  extended by org.apache.nutch.parse.ms.MSBaseParser
      extended by org.apache.nutch.parse.msword.MSWordParser
All Implemented Interfaces:
Configurable, Parser, Pluggable

public class MSWordParser
extends MSBaseParser

Parser for the MIME type application/msword. It is based on Apache POI (org.apache.poi.*). How well it performs has not yet been evaluated.

Author:
John Xing, Andy Hedges, Jérôme Charron

Field Summary
static String MIME_TYPE
          Associated MIME type for Word files (application/msword).
 
Fields inherited from class org.apache.nutch.parse.ms.MSBaseParser
LOG
 
Fields inherited from interface org.apache.nutch.parse.Parser
X_POINT_ID
 
Constructor Summary
MSWordParser()
           
 
Method Summary
 ParseResult getParse(Content content)
           This method parses the given content and returns a map of <key, parse> pairs.
static void main(String[] args)
          Main for testing.
 
Methods inherited from class org.apache.nutch.parse.ms.MSBaseParser
getConf, getParse, main, setConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

MIME_TYPE

public static final String MIME_TYPE
Associated MIME type for Word files (application/msword).

See Also:
Constant Field Values
Constructor Detail

MSWordParser

public MSWordParser()
Method Detail

getParse

public ParseResult getParse(Content content)
Description copied from interface: Parser

This method parses the given content and returns a map of <key, parse> pairs. Parse instances will be persisted under the given key.

Note: Meta-redirects should be followed only when they come from the original URL. That is:
Assume the fetcher is in parsing mode and is currently processing foo.bar.com/redirect.html. If this URL contains a meta-redirect to another URL, the fetcher should follow the redirect only if the map contains an entry of the form <"foo.bar.com/redirect.html", Parse with a ParseStatus indicating the redirect>.

Parameters:
content - Content to be parsed
Returns:
a map containing <key, parse> pairs
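
The meta-redirect note above can be illustrated with a minimal, self-contained sketch. It does not use the Nutch API: ParseOutcome, RedirectRuleSketch and redirectToFollow are hypothetical names introduced only for illustration, and a plain java.util.Map stands in for the returned <key, parse> pairs. The sketch shows only the rule itself, under those assumptions: a redirect is followed when the entry keyed by the URL being fetched carries a redirect status, and ignored otherwise.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a parse outcome; NOT the Nutch Parse/ParseStatus API.
final class ParseOutcome {
    final boolean redirect;      // true when the status indicates a meta-redirect
    final String redirectTarget; // target URL, or null when there is no redirect

    ParseOutcome(boolean redirect, String redirectTarget) {
        this.redirect = redirect;
        this.redirectTarget = redirectTarget;
    }
}

final class RedirectRuleSketch {
    /**
     * Returns the redirect target only when the entry keyed by the URL being
     * fetched carries a redirect status; returns null in every other case.
     */
    static String redirectToFollow(String fetchedUrl, Map<String, ParseOutcome> parses) {
        ParseOutcome own = parses.get(fetchedUrl);
        if (own != null && own.redirect) {
            return own.redirectTarget;
        }
        // Redirects recorded under any other key are deliberately ignored.
        return null;
    }

    public static void main(String[] args) {
        Map<String, ParseOutcome> parses = new HashMap<>();
        parses.put("foo.bar.com/redirect.html",
                new ParseOutcome(true, "foo.bar.com/target.html"));
        parses.put("foo.bar.com/embedded.html",
                new ParseOutcome(true, "foo.bar.com/other.html"));

        // The key matches the URL being fetched, so its redirect is followed.
        System.out.println(redirectToFollow("foo.bar.com/redirect.html", parses));
        // No entry exists for this URL, so nothing is followed.
        System.out.println(redirectToFollow("foo.bar.com/unrelated.html", parses));
    }
}
```

Keying the lookup on the fetched URL, rather than scanning every entry for a redirect status, is what keeps a redirect found in a secondary parse from being followed.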

main

public static void main(String[] args)
Main for testing. Pass a Word document as the argument.



Copyright © 2006 The Apache Software Foundation