Package org.apache.tika.parser.microsoft
Class OfficeParser
- java.lang.Object
-
- org.apache.tika.parser.AbstractParser
-
- org.apache.tika.parser.microsoft.AbstractOfficeParser
-
- org.apache.tika.parser.microsoft.OfficeParser
-
- All Implemented Interfaces:
Serializable, Parser
public class OfficeParser extends AbstractOfficeParser
Defines a Microsoft document content extractor.
- See Also:
- Serialized Form
-
-
Nested Class Summary
Nested Classes
- static class OfficeParser.POIFSDocumentType
-
Constructor Summary
Constructors
- OfficeParser()
-
Method Summary
- static void extractMacros(org.apache.poi.poifs.filesystem.POIFSFileSystem fs, ContentHandler xhtml, EmbeddedDocumentExtractor embeddedDocumentExtractor)
  Helper to extract macros from an NPOIFS/vbaProject.bin.
- Set<MediaType> getSupportedTypes(ParseContext context)
  Returns the set of media types supported by this parser when used with the given parse context.
- void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
  Extracts properties and text from an MS Document input stream.
- protected void parse(org.apache.poi.poifs.filesystem.DirectoryNode root, ParseContext context, Metadata metadata, XHTMLContentHandler xhtml)
-
Methods inherited from class org.apache.tika.parser.microsoft.AbstractOfficeParser
configure, getByteArrayMaxOverride, getDateFormatOverride, isConcatenatePhoneticRuns, isExtractAllAlternativesFromMSG, isExtractMacros, isIncludeDeletedContent, isIncludeHeadersAndFooters, isIncludeMoveFromContent, isIncludeShapeBasedContent, isUseSAXDocxExtractor, isUseSAXPptxExtractor, setByteArrayMaxOverride, setConcatenatePhoneticRuns, setDateFormatOverride, setExtractAllAlternativesFromMSG, setExtractMacros, setIncludeDeletedContent, setIncludeHeadersAndFooters, setIncludeMoveFromContent, setIncludeShapeBasedContent, setUseSAXDocxExtractor, setUseSAXPptxExtractor
-
Methods inherited from class org.apache.tika.parser.AbstractParser
parse
-
Method Detail
-
extractMacros
public static void extractMacros(org.apache.poi.poifs.filesystem.POIFSFileSystem fs, ContentHandler xhtml, EmbeddedDocumentExtractor embeddedDocumentExtractor) throws IOException, SAXException
Helper to extract macros from an NPOIFS/vbaProject.bin.
As of POI-3.15-final, there are still some bugs in VBAMacroReader. For now, NullPointerException and other runtime exceptions are swallowed.
- Parameters:
fs - NPOIFS to extract from
xhtml - SAX writer
embeddedDocumentExtractor - extractor for embedded documents
- Throws:
IOException - if an I/O error occurs while extracting the embedded document
SAXException - if writing to xhtml fails
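The exception behavior described for extractMacros (runtime exceptions swallowed, checked exceptions propagated) can be illustrated with a minimal, JDK-only sketch. This is not Tika's actual code: MacroExtractionSketch, MacroSource, and extractLeniently are hypothetical names, and the sketch assumes an empty result is an acceptable outcome when the underlying reader hits a bug.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the "swallow runtime exceptions" pattern.
class MacroExtractionSketch {

    // Hypothetical stand-in for a macro reader that may fail in two ways:
    // a declared IOException, or an unexpected RuntimeException (for example an NPE).
    interface MacroSource {
        List<String> readMacros() throws IOException;
    }

    // Returns the macros if reading succeeds. A RuntimeException is swallowed and
    // an empty list is returned, so one buggy macro stream does not abort the
    // whole parse. An IOException is not caught and propagates to the caller.
    static List<String> extractLeniently(MacroSource source) throws IOException {
        try {
            return source.readMacros();
        } catch (RuntimeException e) {
            // Assumption: dropping the macros is preferable to failing the document.
            return new ArrayList<>();
        }
    }

    public static void main(String[] args) throws IOException {
        // A source that fails with an NPE yields an empty list instead of an error.
        List<String> fromBroken = extractLeniently(() -> {
            throw new NullPointerException("simulated reader bug");
        });
        // A healthy source passes its macros through unchanged.
        List<String> fromHealthy = extractLeniently(() -> List.of("Sub Hello()"));
        System.out.println(fromBroken.size() + " " + fromHealthy.size());
    }
}
```

For callers, the practical consequence is that a missing macro in the output does not prove the document had none: extraction may have failed silently.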
-
getSupportedTypes
public Set<MediaType> getSupportedTypes(ParseContext context)
Description copied from interface: Parser
Returns the set of media types supported by this parser when used with the given parse context.
- Parameters:
context - parse context
- Returns:
immutable set of media types
-
parse
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
Extracts properties and text from an MS Document input stream.
- Parameters:
stream - the document stream (input)
handler - handler for the XHTML SAX events (output)
metadata - document metadata (input and output)
context - parse context
- Throws:
IOException - if the document stream could not be read
SAXException - if the SAX events could not be processed
TikaException - if the document could not be parsed
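Since the handler parameter receives the extracted content as SAX events, it may help to see what a handler looks like. The sketch below uses only the JDK's org.xml.sax classes. TextCollector, text(), and paragraphs() are hypothetical names, and the events are fired by hand in main to stand in for what a parser would emit. It assumes paragraphs arrive as p elements, which is typical of XHTML output but is not stated on this page.

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical ContentHandler: collects character data and counts <p> elements.
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int paragraphs = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("p".equals(localName)) {
            paragraphs++;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks; append each one as it comes.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equals(localName)) {
            text.append('\n'); // separate paragraphs with a newline
        }
    }

    String text() {
        return text.toString();
    }

    int paragraphs() {
        return paragraphs;
    }

    public static void main(String[] args) throws SAXException {
        TextCollector handler = new TextCollector();
        String xhtmlNs = "http://www.w3.org/1999/xhtml";

        // Simulated events; in real use the parser would drive the handler.
        handler.startDocument();
        handler.startElement(xhtmlNs, "p", "p", new AttributesImpl());
        char[] chars = "Quarterly report".toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endElement(xhtmlNs, "p", "p");
        handler.endDocument();

        System.out.print(handler.text());
    }
}
```

With Tika on the classpath, an instance of such a handler would be passed as the second argument of parse, alongside a Metadata and a ParseContext; the metadata object is then read back after the call, since it is both input and output.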
-
parse
protected void parse(org.apache.poi.poifs.filesystem.DirectoryNode root, ParseContext context, Metadata metadata, XHTMLContentHandler xhtml) throws IOException, SAXException, TikaException
- Throws:
IOException
SAXException
TikaException
-
-