org.apache.nutch.parse.ms
Class MSExtractor

java.lang.Object
  extended by org.apache.nutch.parse.ms.MSExtractor

public abstract class MSExtractor
extends Object

Defines a Microsoft document content extractor.

Author:
Jérôme Charron

Field Summary
protected static org.apache.commons.logging.Log LOG
           
 
Constructor Summary
protected MSExtractor()
          Constructs a new Microsoft document extractor.
 
Method Summary
protected  void extract(InputStream input)
          Extracts properties and text from a Microsoft document input stream.
protected abstract  String extractText(InputStream input)
          Extracts the text content from a Microsoft document input stream.
protected  Properties getProperties()
          Gets the properties of the Microsoft document.
protected  String getText()
          Gets the content text of the Microsoft document.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOG

protected static final org.apache.commons.logging.Log LOG

Constructor Detail

MSExtractor

protected MSExtractor()
Constructs a new Microsoft document extractor.

Method Detail

extract

protected void extract(InputStream input)
                throws Exception
Extracts properties and text from a Microsoft document input stream.

Throws:
Exception

extractText

protected abstract String extractText(InputStream input)
                               throws Exception
Extracts the text content from a Microsoft document input stream.

Throws:
Exception
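
Since extractText is the only abstract method, a concrete extractor presumably needs to supply just that one method. The following is a minimal, self-contained sketch of that pattern, not the actual Nutch implementation. SketchExtractor is a hypothetical stand-in that mirrors the documented signatures so the example runs without Nutch on the classpath. The body of its extract method (delegating to extractText and caching the result), the use of java.util.Properties, and the PlainTextExtractor subclass are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class ExtractorSketch {

    // Hypothetical stand-in mirroring the documented MSExtractor signatures.
    // The body of extract() is an assumption; the real class may do more,
    // such as populating the document properties.
    abstract static class SketchExtractor {
        private String text;
        private final Properties properties = new Properties();

        protected void extract(InputStream input) throws Exception {
            // Assumed flow: delegate to the subclass and cache the result.
            this.text = extractText(input);
        }

        protected abstract String extractText(InputStream input) throws Exception;

        protected String getText() {
            return text;
        }

        protected Properties getProperties() {
            return properties;
        }
    }

    // Hypothetical subclass: reads the stream as UTF-8 text instead of
    // parsing a real Microsoft document format.
    static class PlainTextExtractor extends SketchExtractor {
        @Override
        protected String extractText(InputStream input) throws Exception {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int read;
            while ((read = input.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    // Runs the extractor over an in-memory stream and returns the cached text.
    static String run(String content) {
        try {
            PlainTextExtractor extractor = new PlainTextExtractor();
            extractor.extract(
                new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8)));
            return extractor.getText();
        } catch (Exception e) {
            throw new IllegalStateException("extraction failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run("Hello from a stand-in document"));
    }
}
```

In this sketch the caller invokes extract once and then reads the result through getText, which matches the order in which the methods are documented here. A real subclass would replace the byte-copy loop with format-specific parsing.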

getText

protected String getText()
Gets the content text of the Microsoft document.

Returns:
the content text of the document

getProperties

protected Properties getProperties()
Gets the properties of the Microsoft document.

Returns:
the properties of the document


Copyright © 2006 The Apache Software Foundation