Package org.apache.tika.parser.microsoft
Class OfficeParserConfig
- java.lang.Object
-
- org.apache.tika.parser.microsoft.OfficeParserConfig
-
- All Implemented Interfaces:
Serializable
public class OfficeParserConfig extends Object implements Serializable
- See Also:
- Serialized Form
-
-
Constructor Summary
Constructor Description
OfficeParserConfig()
-
Method Summary
Modifier and Type Method Description
boolean
getConcatenatePhoneticRuns()
boolean
getExtractAllAlternativesFromMSG()
boolean
getExtractMacros()
boolean
getIncludeDeletedContent()
boolean
getIncludeHeadersAndFooters()
boolean
getIncludeMissingRows()
boolean
getIncludeMoveFromContent()
boolean
getIncludeShapeBasedContent()
boolean
getIncludeSlideMasterContent()
boolean
getIncludeSlideNotes()
boolean
getUseSAXDocxExtractor()
boolean
getUseSAXPptxExtractor()
void
setConcatenatePhoneticRuns(boolean concatenatePhoneticRuns)
Microsoft Excel files can sometimes contain phonetic (furigana) strings.
void
setExtractAllAlternativesFromMSG(boolean extractAllAlternativesFromMSG)
Some .msg files can contain body content in html, rtf and/or text.
void
setExtractMacros(boolean extractMacros)
Sets whether or not MSOffice parsers should extract macros.
void
setIncludeDeletedContent(boolean includeDeletedContent)
Sets whether or not the parser should include deleted content.
void
setIncludeHeadersAndFooters(boolean includeHeadersAndFooters)
Whether or not to include headers and footers.
void
setIncludeMissingRows(boolean includeMissingRows)
For table-like formats, and tables within other formats, should missing rows in sparse tables be output where detected? The default is to only output rows defined within the file, which avoids lots of blank lines, but means layout isn't preserved.
void
setIncludeMoveFromContent(boolean includeMoveFromContent)
With track changes on, when a section is moved, the content is stored in both the "moveFrom" section and in the "moveTo" section.
void
setIncludeShapeBasedContent(boolean includeShapeBasedContent)
In Excel and Word, there can be text stored within drawing shapes.
void
setIncludeSlideMasterContent(boolean includeSlideMasterContent)
Whether or not to include contents from any of the three types of masters -- slide, notes, handout -- in a .ppt or ppt[xm] file.
void
setIncludeSlideNotes(boolean includeSlideNotes)
Whether or not to process slide notes content.
void
setUseSAXDocxExtractor(boolean useSAXDocxExtractor)
Use the experimental SAX-based streaming DOCX parser? If set to false, the classic parser will be used; if true, the new experimental parser will be used.
void
setUseSAXPptxExtractor(boolean useSAXPptxExtractor)
Use the experimental SAX-based streaming PPTX parser? If set to false, the classic parser will be used; if true, the new experimental parser will be used.
-
-
-
Method Detail
-
setExtractMacros
public void setExtractMacros(boolean extractMacros)
Sets whether or not MSOffice parsers should extract macros. As of Tika 1.15, the default is false.
- Parameters:
extractMacros
-
-
getExtractMacros
public boolean getExtractMacros()
- Returns:
- whether or not to extract macros
-
setIncludeDeletedContent
public void setIncludeDeletedContent(boolean includeDeletedContent)
Sets whether or not the parser should include deleted content. This has only been implemented in the streaming docx parser (SXWPFWordExtractorDecorator) so far.
- Parameters:
includeDeletedContent
-
-
getIncludeDeletedContent
public boolean getIncludeDeletedContent()
-
setIncludeMoveFromContent
public void setIncludeMoveFromContent(boolean includeMoveFromContent)
With track changes on, when a section is moved, the content is stored in both the "moveFrom" section and in the "moveTo" section. If you'd like to include the section both in its original location (moveFrom) and in its new location (moveTo), set this to true. Default: false. This has only been implemented in the streaming docx parser (SXWPFWordExtractorDecorator) so far.
- Parameters:
includeMoveFromContent
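The effect of this flag can be pictured with a small toy model. The sketch below is not Tika's implementation: the `MoveFromSketch` class, its `Segment` type, and the `extract` method are hypothetical names introduced only for illustration. It assumes a document is a flat list of text segments, each tagged as normal, moveFrom, or moveTo, and that moveTo text is always emitted.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only; these names are hypothetical and are not part of Tika.
final class MoveFromSketch {

    enum Kind { NORMAL, MOVE_FROM, MOVE_TO }

    static final class Segment {
        final Kind kind;
        final String text;

        Segment(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    // Joins segment text in document order, skipping moveFrom segments
    // unless includeMoveFromContent is true.
    static String extract(List<Segment> segments, boolean includeMoveFromContent) {
        List<String> parts = new ArrayList<>();
        for (Segment segment : segments) {
            if (segment.kind == Kind.MOVE_FROM && !includeMoveFromContent) {
                continue; // the moved text still appears once, at its moveTo location
            }
            parts.add(segment.text);
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        List<Segment> doc = new ArrayList<>();
        doc.add(new Segment(Kind.MOVE_FROM, "moved"));
        doc.add(new Segment(Kind.NORMAL, "intro"));
        doc.add(new Segment(Kind.MOVE_TO, "moved"));
        System.out.println(extract(doc, false));
        System.out.println(extract(doc, true));
    }
}
```

With the flag off, the moved text appears only at its new location; with it on, the same text appears twice, which is why the default is false.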
-
-
getIncludeMoveFromContent
public boolean getIncludeMoveFromContent()
-
setIncludeShapeBasedContent
public void setIncludeShapeBasedContent(boolean includeShapeBasedContent)
In Excel and Word, there can be text stored within drawing shapes. (In PowerPoint, everything is in a shape.) To skip searching these shapes for text, set this to false. Default: true
- Parameters:
includeShapeBasedContent
-
-
getIncludeShapeBasedContent
public boolean getIncludeShapeBasedContent()
-
setIncludeHeadersAndFooters
public void setIncludeHeadersAndFooters(boolean includeHeadersAndFooters)
Whether or not to include headers and footers. This only operates on headers and footers in Word and Excel, not master slide content in PowerPoint. Default: true
- Parameters:
includeHeadersAndFooters
-
-
getIncludeHeadersAndFooters
public boolean getIncludeHeadersAndFooters()
-
getUseSAXDocxExtractor
public boolean getUseSAXDocxExtractor()
-
setUseSAXDocxExtractor
public void setUseSAXDocxExtractor(boolean useSAXDocxExtractor)
Use the experimental SAX-based streaming DOCX parser? If set to false, the classic parser will be used; if true, the new experimental parser will be used. Default: false (classic DOM parser)
- Parameters:
useSAXDocxExtractor
-
-
setUseSAXPptxExtractor
public void setUseSAXPptxExtractor(boolean useSAXPptxExtractor)
Use the experimental SAX-based streaming PPTX parser? If set to false, the classic parser will be used; if true, the new experimental parser will be used. Default: false (classic DOM parser)
- Parameters:
useSAXPptxExtractor
-
-
getUseSAXPptxExtractor
public boolean getUseSAXPptxExtractor()
-
getConcatenatePhoneticRuns
public boolean getConcatenatePhoneticRuns()
-
setConcatenatePhoneticRuns
public void setConcatenatePhoneticRuns(boolean concatenatePhoneticRuns)
Microsoft Excel files can sometimes contain phonetic (furigana) strings. See PHONETIC. This sets whether or not the parser will concatenate the phonetic runs to the original text. This is currently only supported by the xls and xlsx parsers (not the xlsb parser), and the default is true.
- Parameters:
concatenatePhoneticRuns
-
-
setExtractAllAlternativesFromMSG
public void setExtractAllAlternativesFromMSG(boolean extractAllAlternativesFromMSG)
Some .msg files can contain body content in html, rtf and/or text. The default behavior is to pick the first non-null value and include only that. If you'd like to extract all non-null body content, which is likely duplicative, set this value to true.
- Parameters:
extractAllAlternativesFromMSG
- whether or not to extract all alternative parts
- Since:
- 1.17
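The "first non-null value" rule can be sketched as a simple selection over candidate bodies. This is a toy model rather than Tika's actual code: `MsgBodySelectionSketch` and `selectBodies` are hypothetical names, and the order in which the html, rtf and text bodies are checked is an assumption left to the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only; these names are hypothetical and are not part of Tika.
final class MsgBodySelectionSketch {

    // candidates holds the body alternatives in the order they are checked;
    // a null entry means that alternative is absent from the message.
    static List<String> selectBodies(List<String> candidates, boolean extractAllAlternatives) {
        List<String> selected = new ArrayList<>();
        for (String body : candidates) {
            if (body == null) {
                continue; // absent alternative
            }
            selected.add(body);
            if (!extractAllAlternatives) {
                break; // default behavior: keep only the first non-null body
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> bodies = Arrays.asList(null, "rtf body", "text body");
        System.out.println(selectBodies(bodies, false));
        System.out.println(selectBodies(bodies, true));
    }
}
```

Because the alternatives usually carry the same message in different markup, turning the flag on tends to produce near-duplicate text in the output.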
-
getExtractAllAlternativesFromMSG
public boolean getExtractAllAlternativesFromMSG()
-
setIncludeMissingRows
public void setIncludeMissingRows(boolean includeMissingRows)
For table-like formats, and tables within other formats, should missing rows in sparse tables be output where detected? The default is to only output rows defined within the file, which avoids lots of blank lines, but means layout isn't preserved.
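The trade-off between blank lines and preserved layout can be illustrated with a toy model of a sparse table. This is a sketch under stated assumptions, not how Tika's parsers are written: `SparseRowsSketch` and `renderRows` are hypothetical names, rows are assumed to be 0-indexed, and a missing row is rendered as an empty string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only; these names are hypothetical and are not part of Tika.
final class SparseRowsSketch {

    // definedRows maps a 0-based row index to that row's text; gaps in the
    // indices are the "missing" rows of a sparse table.
    static List<String> renderRows(TreeMap<Integer, String> definedRows, boolean includeMissingRows) {
        List<String> output = new ArrayList<>();
        int nextExpected = 0;
        for (Map.Entry<Integer, String> row : definedRows.entrySet()) {
            if (includeMissingRows) {
                // Emit one blank line per skipped index so row positions are preserved.
                while (nextExpected < row.getKey()) {
                    output.add("");
                    nextExpected++;
                }
            }
            output.add(row.getValue());
            nextExpected = row.getKey() + 1;
        }
        return output;
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> rows = new TreeMap<>();
        rows.put(0, "header");
        rows.put(3, "total");
        System.out.println(renderRows(rows, false));
        System.out.println(renderRows(rows, true));
    }
}
```

With the flag off, the two defined rows come out back to back; with it on, two empty rows are emitted between them so that "total" stays on the fourth line.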
-
getIncludeMissingRows
public boolean getIncludeMissingRows()
-
getIncludeSlideNotes
public boolean getIncludeSlideNotes()
-
setIncludeSlideNotes
public void setIncludeSlideNotes(boolean includeSlideNotes)
Whether or not to process slide notes content. If set to false, the parser will skip the text content and all embedded objects from the slide notes in ppt and ppt[xm]. The default is true.
- Parameters:
includeSlideNotes
- whether or not to process slide notes
- Since:
- 1.19.1
-
getIncludeSlideMasterContent
public boolean getIncludeSlideMasterContent()
- Returns:
- whether or not to process content in slide masters
- Since:
- 1.19.1
-
setIncludeSlideMasterContent
public void setIncludeSlideMasterContent(boolean includeSlideMasterContent)
Whether or not to include contents from any of the three types of masters -- slide, notes, handout -- in a .ppt or ppt[xm] file. If set to false, the parser will not extract text or embedded objects from any of the masters.
- Parameters:
includeSlideMasterContent
- Since:
- 1.19.1
-
-