PDFParserConfig (Apache Tika 2.0.0 API)

java.lang.Object
- org.apache.tika.parser.pdf.PDFParserConfig

All Implemented Interfaces:

Serializable
```
public class PDFParserConfig
extends Object
implements Serializable
```
Config for PDFParser.
This allows parameters to be set programmatically in two ways:
1. Calling setters on the config held by a PDFParser, e.g. parser.getPDFParserConfig().setEnableAutoSpace(true) (as before)
2. Passing a config to PDFParser through a ParseContext: context.set(PDFParserConfig.class, config);
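The ParseContext route can be sketched as follows. This is a minimal example rather than a recommended setup: it assumes the Tika PDF parser module is on the classpath, the class name `PdfConfigExample` and the command-line argument are made up for illustration, and the two settings changed are arbitrary picks from the setters listed on this page.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical example class; only the Tika types are real.
public class PdfConfigExample {
    public static void main(String[] args) throws Exception {
        // Assumption: the first argument is the path to a PDF file.
        Path pdf = Paths.get(args[0]);

        // Start from the defaults and override only what is needed.
        PDFParserConfig config = new PDFParserConfig();
        config.setSortByPosition(true);
        config.setExtractAnnotationText(false);

        // Hand the config to the parser through the ParseContext.
        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, config);

        // -1 disables the handler's default write limit.
        BodyContentHandler handler = new BodyContentHandler(-1);
        try (InputStream in = Files.newInputStream(pdf)) {
            new PDFParser().parse(in, handler, new Metadata(), context);
        }
        System.out.println(handler.toString());
    }
}
```

Passing the config per parse call keeps the settings local to that call, which may be preferable when one parser instance is shared.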
See Also:

Serialized Form

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`PDFParserConfig.OCR_RENDERING_STRATEGY`
`static class`	`PDFParserConfig.OCR_STRATEGY`

Constructor Summary

Constructors
Constructor and Description
`PDFParserConfig()`

Method Summary

Modifier and Type	Method and Description
`PDFParserConfig`	`cloneAndUpdate(PDFParserConfig updates)`
`void`	`configure(org.apache.tika.parser.pdf.PDF2XHTML pdf2XHTML)` Configures the given pdf2XHTML.
`boolean`	`equals(Object o)`
`AccessChecker`	`getAccessChecker()`
`Float`	`getAverageCharTolerance()`
`Float`	`getDropThreshold()`
`long`	`getMaxMainMemoryBytes()` The maximum amount of memory to use when loading a pdf into a PDDocument.
`int`	`getOcrDPI()` Dots per inch used to render the page image for OCR
`String`	`getOcrImageFormatName()` String representation of the image format used to render the page image for OCR (examples: png, tiff, jpeg)
`float`	`getOcrImageQuality()` Image quality used to render the page image for OCR.
`org.apache.pdfbox.rendering.ImageType`	`getOcrImageType()` Image type used to render the page image for OCR.
`PDFParserConfig.OCR_RENDERING_STRATEGY`	`getOcrRenderingStrategy()`
`PDFParserConfig.OCR_STRATEGY`	`getOcrStrategy()`
`Float`	`getSpacingTolerance()`
`int`	`hashCode()`
`boolean`	`isCatchIntermediateIOExceptions()` See `setCatchIntermediateIOExceptions(boolean)`
`boolean`	`isDetectAngles()`
`boolean`	`isEnableAutoSpace()`
`boolean`	`isExtractAcroFormContent()`
`boolean`	`isExtractActions()`
`boolean`	`isExtractAnnotationText()`
`boolean`	`isExtractBookmarksText()`
`boolean`	`isExtractFontNames()`
`boolean`	`isExtractInlineImages()`
`boolean`	`isExtractMarkedContent()`
`boolean`	`isExtractUniqueInlineImagesOnly()`
`boolean`	`isIfXFAExtractOnlyXFA()`
`boolean`	`isSetKCMS()`
`boolean`	`isSortByPosition()`
`boolean`	`isSuppressDuplicateOverlappingText()`
`void`	`setAccessChecker(AccessChecker accessChecker)`
`void`	`setAverageCharTolerance(Float averageCharTolerance)` See `PDFTextStripper.setAverageCharTolerance(float)`
`void`	`setCatchIntermediateIOExceptions(boolean catchIntermediateIOExceptions)` The PDFBox parser will throw an IOException if there is a problem with a stream.
`void`	`setDetectAngles(boolean detectAngles)`
`void`	`setDropThreshold(Float dropThreshold)` See `PDFTextStripper.setDropThreshold(float)`
`void`	`setEnableAutoSpace(boolean enableAutoSpace)` If true (the default), the parser should estimate where spaces should be inserted between words.
`void`	`setExtractAcroFormContent(boolean extractAcroFormContent)` If true (the default), extract content from AcroForms at the end of the document.
`void`	`setExtractActions(boolean v)` Whether or not to extract PDActions from the file.
`void`	`setExtractAnnotationText(boolean extractAnnotationText)` If true (the default), text in annotations will be extracted.
`void`	`setExtractBookmarksText(boolean extractBookmarksText)` If true, extract bookmarks (document outline) text.
`void`	`setExtractFontNames(boolean extractFontNames)` Extract font names into a metadata field
`void`	`setExtractInlineImages(boolean extractInlineImages)` If `true`, extract the literal inline embedded OBXImages.
`void`	`setExtractMarkedContent(boolean extractMarkedContent)` If the PDF contains marked content, try to extract text and its marked structure.
`void`	`setExtractUniqueInlineImagesOnly(boolean extractUniqueInlineImagesOnly)` Multiple pages within a PDF file might refer to the same underlying image.
`void`	`setIfXFAExtractOnlyXFA(boolean ifXFAExtractOnlyXFA)` If false (the default), extract content from the full PDF as well as the XFA form.
`void`	`setMaxMainMemoryBytes(long maxMainMemoryBytes)`
`void`	`setOcrDPI(int ocrDPI)` Dots per inch used to render the page image for OCR.
`void`	`setOcrImageFormatName(String ocrImageFormatName)`
`void`	`setOcrImageQuality(float ocrImageQuality)` Image quality used to render the page image for OCR.
`void`	`setOcrImageType(org.apache.pdfbox.rendering.ImageType ocrImageType)` Image type used to render the page image for OCR.
`void`	`setOcrImageType(String ocrImageTypeString)` Image type used to render the page image for OCR.
`void`	`setOcrRenderingStrategy(PDFParserConfig.OCR_RENDERING_STRATEGY ocrRenderingStrategy)` Whether to render the whole page for OCR, including the electronic text (ALL), or only the images and vector graphics (NO_TEXT).
`void`	`setOcrRenderingStrategy(String ocrRenderingStrategyString)`
`void`	`setOcrStrategy(PDFParserConfig.OCR_STRATEGY ocrStrategy)` Which strategy to use for OCR
`void`	`setOcrStrategy(String ocrStrategyString)` Which strategy to use for OCR
`void`	`setSetKCMS(boolean setKCMS)` Whether to call `System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider")`.
`void`	`setSortByPosition(boolean sortByPosition)` If true, sort text tokens by their x/y position before extracting text.
`void`	`setSpacingTolerance(Float spacingTolerance)` See `PDFTextStripper.setSpacingTolerance(float)`
`void`	`setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingText)` If true, the parser should try to remove duplicated text over the same region.
`String`	`toString()`

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

- Constructor Detail
  - PDFParserConfig
```
public PDFParserConfig()
```
- Method Detail
  - isExtractMarkedContent
```
public boolean isExtractMarkedContent()
```
  - setExtractMarkedContent
```
public void setExtractMarkedContent(boolean extractMarkedContent)
```
    If the PDF contains marked content, try to extract text and its marked structure. If the PDF does not contain marked content, fall back to the regular PDF2XHTML for text extraction. As of 1.24, this is an "alpha" version.
    
    Parameters:
    
    extractMarkedContent -
    
    Since:
    
    1.24
  - configure
```
public void configure(org.apache.tika.parser.pdf.PDF2XHTML pdf2XHTML)
```
    Configures the given pdf2XHTML.
    
    Parameters:
    
    pdf2XHTML -
  - isExtractAcroFormContent
```
public boolean isExtractAcroFormContent()
```
    See Also:
    
    setExtractAcroFormContent(boolean)
  - setExtractAcroFormContent
```
public void setExtractAcroFormContent(boolean extractAcroFormContent)
```
    If true (the default), extract content from AcroForms at the end of the document. If an XFA is found, try to process that, otherwise, process the AcroForm.
    
    Parameters:
    
    extractAcroFormContent -
  - isIfXFAExtractOnlyXFA
```
public boolean isIfXFAExtractOnlyXFA()
```
    Returns:
    
    how to handle XFA data if it exists
    
    See Also:
    
    setIfXFAExtractOnlyXFA(boolean)
  - setIfXFAExtractOnlyXFA
```
public void setIfXFAExtractOnlyXFA(boolean ifXFAExtractOnlyXFA)
```
    If false (the default), extract content from the full PDF as well as the XFA form. This will likely lead to some duplicative content.
    
    Parameters:
    
    ifXFAExtractOnlyXFA -
  - isExtractBookmarksText
```
public boolean isExtractBookmarksText()
```
    See Also:
    
    setExtractBookmarksText(boolean)
  - setExtractBookmarksText
```
public void setExtractBookmarksText(boolean extractBookmarksText)
```
    If true, extract bookmarks (document outline) text.
    The default is true.
    
    Parameters:
    
    extractBookmarksText -
  - isExtractFontNames
```
public boolean isExtractFontNames()
```
  - setExtractFontNames
```
public void setExtractFontNames(boolean extractFontNames)
```
    Extract font names into a metadata field
    
    Parameters:
    
    extractFontNames -
  - isExtractInlineImages
```
public boolean isExtractInlineImages()
```
    See Also:
    
    setExtractInlineImages(boolean)
  - setExtractInlineImages
```
public void setExtractInlineImages(boolean extractInlineImages)
```
    If true, extract the literal inline embedded OBXImages.
    Beware: some PDF documents of modest size (~4MB) can contain thousands of embedded images totaling > 2.5 GB. Also, at least as of PDFBox 1.8.5, there can be surprisingly large memory consumption and/or out of memory errors.
    Along the same lines, note that this does not extract "logical" images. Some PDF writers break up a single logical image into hundreds of little images. With this option set to true, you might get those hundreds of little images.
    NOTE ALSO: this extracts the raw images without clipping, rotation, masks, color inversion, etc. The images that this extracts may look nothing like what a human would expect given the appearance of the PDF.
    Set to true only with the greatest caution. The default is false.
    
    Parameters:
    
    extractInlineImages -
    
    See Also:
    
    setExtractUniqueInlineImagesOnly(boolean)
  - isExtractUniqueInlineImagesOnly
```
public boolean isExtractUniqueInlineImagesOnly()
```
    See Also:
    
    setExtractUniqueInlineImagesOnly(boolean)
  - setExtractUniqueInlineImagesOnly
```
public void setExtractUniqueInlineImagesOnly(boolean extractUniqueInlineImagesOnly)
```
    Multiple pages within a PDF file might refer to the same underlying image. If extractUniqueInlineImagesOnly is set to false, the parser will call the EmbeddedExtractor each time the image appears on a page. This might be desired for some use cases. However, to avoid duplication of extracted images, set this to true. The default is true.
    Note that uniqueness is determined only by the underlying PDF COSObject id, not by file hash or similar equality metric. If the PDF actually contains multiple copies of the same image -- all with different object ids -- then all images will be extracted.
    For this parameter to have any effect, extractInlineImages must be set to true.
    Because of TIKA-1742 -- to avoid infinite recursion -- no matter the setting of this parameter, the extractor will only pull out one copy of each image per page. This parameter tries to capture uniqueness across the entire document.
    
    Parameters:
    
    extractUniqueInlineImagesOnly -
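    The uniqueness rule described above can be illustrated with a small stand-alone sketch. It is not Tika's implementation: `ImageRef`, `countExtracted` and `demoPages` are hypothetical names, and the object id is modelled as a plain `long`. It only shows that de-duplication keys on the object id, never on the image bytes, and that one copy per page is the limit regardless of the flag.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class UniqueImageSketch {

    /** Hypothetical image reference; contentHash is carried but deliberately never compared. */
    static final class ImageRef {
        final long objectId;
        final String contentHash;

        ImageRef(long objectId, String contentHash) {
            this.objectId = objectId;
            this.contentHash = contentHash;
        }
    }

    /** Counts how many images would be handed to an extractor under the described rules. */
    static int countExtracted(List<List<ImageRef>> pages, boolean uniqueOnly) {
        Set<Long> seenInDocument = new HashSet<>();
        int extracted = 0;
        for (List<ImageRef> page : pages) {
            Set<Long> seenOnPage = new HashSet<>();
            for (ImageRef ref : page) {
                // Regardless of the flag, at most one copy of an image per page.
                if (!seenOnPage.add(ref.objectId)) {
                    continue;
                }
                // With the flag on, also skip ids already seen on earlier pages.
                if (uniqueOnly && !seenInDocument.add(ref.objectId)) {
                    continue;
                }
                extracted++;
            }
        }
        return extracted;
    }

    /** Two pages: id 7 appears three times; id 9 has the same bytes as id 7 but a different id. */
    static List<List<ImageRef>> demoPages() {
        List<ImageRef> page1 = Arrays.asList(new ImageRef(7, "hashA"), new ImageRef(7, "hashA"));
        List<ImageRef> page2 = Arrays.asList(new ImageRef(7, "hashA"), new ImageRef(9, "hashA"));
        return Arrays.asList(page1, page2);
    }

    public static void main(String[] args) {
        System.out.println(countExtracted(demoPages(), true));  // prints 2
        System.out.println(countExtracted(demoPages(), false)); // prints 3
    }
}
```

    In the demo, id 9 is still extracted with the flag on even though its bytes match id 7, which mirrors the note that identical copies with different object ids are all extracted.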
  - isEnableAutoSpace
```
public boolean isEnableAutoSpace()
```
    See Also:
    
    setEnableAutoSpace(boolean)
  - setEnableAutoSpace
```
public void setEnableAutoSpace(boolean enableAutoSpace)
```
    If true (the default), the parser should estimate where spaces should be inserted between words. For many PDFs this is necessary as they do not include explicit whitespace characters.
  - isSuppressDuplicateOverlappingText
```
public boolean isSuppressDuplicateOverlappingText()
```
    See Also:
    
    setSuppressDuplicateOverlappingText(boolean)
  - setSuppressDuplicateOverlappingText
```
public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingText)
```
    If true, the parser should try to remove duplicated text over the same region. This is needed for some PDFs that achieve bolding by re-writing the same text in the same area. Note that this can slow down extraction substantially (PDFBOX-956) and sometimes remove characters that were not in fact duplicated (PDFBOX-1155). By default this is disabled.
  - isExtractAnnotationText
```
public boolean isExtractAnnotationText()
```
    See Also:
    
    setExtractAnnotationText(boolean)
  - setExtractAnnotationText
```
public void setExtractAnnotationText(boolean extractAnnotationText)
```
    If true (the default), text in annotations will be extracted.
  - isSortByPosition
```
public boolean isSortByPosition()
```
    See Also:
    
    setSortByPosition(boolean)
  - setSortByPosition
```
public void setSortByPosition(boolean sortByPosition)
```
    If true, sort text tokens by their x/y position before extracting text. This may be necessary for some PDFs (if the text tokens are not rendered "in order"), while for other PDFs it can produce the wrong result (for example if there are 2 columns, the text will be interleaved). Default is false.
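    The two-column problem can be reproduced with a toy model. The sketch below is not how PDFBox or Tika order text: `Token`, `demoPage` and the join methods are hypothetical, and y is assumed to grow downward. It only shows why a plain sort by y, then x, interleaves columns that were written one after the other.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class SortByPositionSketch {

    /** Hypothetical positioned text token; not a Tika or PDFBox type. */
    static final class Token {
        final String text;
        final float x;
        final float y;

        Token(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins tokens in the order they were supplied. */
    static String joinInStreamOrder(List<Token> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Token t : tokens) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(t.text);
        }
        return sb.toString();
    }

    /** Sorts a copy top-to-bottom, then left-to-right, and joins the result. */
    static String joinSortedByPosition(List<Token> tokens) {
        List<Token> sorted = new ArrayList<>(tokens);
        sorted.sort(Comparator.comparingDouble((Token t) -> t.y)
                .thenComparingDouble(t -> t.x));
        return joinInStreamOrder(sorted);
    }

    /** Two columns, written column by column. */
    static List<Token> demoPage() {
        return Arrays.asList(
                new Token("left-1", 10f, 10f),
                new Token("left-2", 10f, 20f),
                new Token("right-1", 200f, 10f),
                new Token("right-2", 200f, 20f));
    }

    public static void main(String[] args) {
        System.out.println(joinInStreamOrder(demoPage()));    // left-1 left-2 right-1 right-2
        System.out.println(joinSortedByPosition(demoPage())); // left-1 right-1 left-2 right-2
    }
}
```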
  - getAverageCharTolerance
```
public Float getAverageCharTolerance()
```
    See Also:
    
    setAverageCharTolerance(Float)
  - setAverageCharTolerance
```
public void setAverageCharTolerance(Float averageCharTolerance)
```
    See PDFTextStripper.setAverageCharTolerance(float)
  - getSpacingTolerance
```
public Float getSpacingTolerance()
```
    See Also:
    
    setSpacingTolerance(Float)
  - setSpacingTolerance
```
public void setSpacingTolerance(Float spacingTolerance)
```
    See PDFTextStripper.setSpacingTolerance(float)
  - getDropThreshold
```
public Float getDropThreshold()
```
    See Also:
    
    setDropThreshold(Float)
  - setDropThreshold
```
public void setDropThreshold(Float dropThreshold)
```
    See PDFTextStripper.setDropThreshold(float)
  - getAccessChecker
```
public AccessChecker getAccessChecker()
```
  - setAccessChecker
```
public void setAccessChecker(AccessChecker accessChecker)
```
  - isCatchIntermediateIOExceptions
```
public boolean isCatchIntermediateIOExceptions()
```
    See setCatchIntermediateIOExceptions(boolean)
    
    Returns:
    
    whether or not to catch IOExceptions
  - setCatchIntermediateIOExceptions
```
public void setCatchIntermediateIOExceptions(boolean catchIntermediateIOExceptions)
```
    The PDFBox parser will throw an IOException if there is a problem with a stream. If this is set to true, Tika's PDFParser will catch these exceptions and try to parse the rest of the document. After the parse is completed, Tika's PDFParser will throw the first caught exception.
    
    Parameters:
    
    catchIntermediateIOExceptions -
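    The catch, continue, then rethrow-the-first behaviour described above follows a general pattern that can be sketched without Tika. The names `PageSource`, `readAll` and `demoPages` are hypothetical, and the real parser works on PDF streams rather than a list of pages; the sketch only shows how partial output survives while the first failure is still reported.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CatchIntermediateSketch {

    /** Hypothetical unit of work that may fail on a bad stream. */
    interface PageSource {
        String read() throws IOException;
    }

    /**
     * Appends each page's text to out. If catchIntermediate is true, failures are
     * remembered and processing continues; the first failure is thrown at the end.
     * If false, the first failure stops processing immediately.
     */
    static void readAll(List<PageSource> pages, boolean catchIntermediate, List<String> out)
            throws IOException {
        IOException first = null;
        for (PageSource page : pages) {
            try {
                out.add(page.read());
            } catch (IOException e) {
                if (!catchIntermediate) {
                    throw e;
                }
                if (first == null) {
                    first = e;
                }
            }
        }
        if (first != null) {
            throw first;
        }
    }

    /** Three pages; the second one fails. */
    static List<PageSource> demoPages() {
        return Arrays.<PageSource>asList(
                () -> "page 1",
                () -> { throw new IOException("bad stream on page 2"); },
                () -> "page 3");
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        try {
            readAll(demoPages(), true, out);
        } catch (IOException e) {
            System.out.println("reported after parsing: " + e.getMessage());
        }
        System.out.println(out); // [page 1, page 3]
    }
}
```

    Because the exception is still thrown at the end, a caller that wants the partial text has to keep a reference to the output collected so far, as `out` does here.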
  - getOcrStrategy
```
public PDFParserConfig.OCR_STRATEGY getOcrStrategy()
```
    Returns:
    
    strategy to use for OCR
  - setOcrStrategy
```
public void setOcrStrategy(PDFParserConfig.OCR_STRATEGY ocrStrategy)
```
    Which strategy to use for OCR
    
    Parameters:
    
    ocrStrategy -
  - setOcrStrategy
```
public void setOcrStrategy(String ocrStrategyString)
```
    Which strategy to use for OCR
    
    Parameters:
    
    ocrStrategyString -
  - getOcrRenderingStrategy
```
public PDFParserConfig.OCR_RENDERING_STRATEGY getOcrRenderingStrategy()
```
  - setOcrRenderingStrategy
```
public void setOcrRenderingStrategy(String ocrRenderingStrategyString)
```
  - setOcrRenderingStrategy
```
public void setOcrRenderingStrategy(PDFParserConfig.OCR_RENDERING_STRATEGY ocrRenderingStrategy)
```
    Controls what is rendered when a page is prepared for OCR: the whole page, including the electronic text (ALL), or only the images and vector graphics (NO_TEXT).
    
    Parameters:
    
    ocrRenderingStrategy -
  - getOcrImageFormatName
```
public String getOcrImageFormatName()
```
    String representation of the image format used to render the page image for OCR (examples: png, tiff, jpeg).
  - setOcrImageFormatName
```
public void setOcrImageFormatName(String ocrImageFormatName)
```
    Parameters:
    
    ocrImageFormatName - name of image format used to render page image
    
    See Also:
    
    getOcrImageFormatName()
  - getOcrImageType
```
public org.apache.pdfbox.rendering.ImageType getOcrImageType()
```
    Image type used to render the page image for OCR.
    
    Returns:
    
    image type
    
    See Also:
    
    setOcrImageType(ImageType)
  - setOcrImageType
```
public void setOcrImageType(org.apache.pdfbox.rendering.ImageType ocrImageType)
```
    Image type used to render the page image for OCR.
    
    Parameters:
    
    ocrImageType -
  - setOcrImageType
```
public void setOcrImageType(String ocrImageTypeString)
```
    Image type used to render the page image for OCR.
    
    See Also:
    
    setOcrImageType(ImageType)
  - getOcrDPI
```
public int getOcrDPI()
```
    Dots per inch used to render the page image for OCR
    
    Returns:
    
    dots per inch
  - setOcrDPI
```
public void setOcrDPI(int ocrDPI)
```
    Dots per inch used to render the page image for OCR. This does not apply to all image formats.
    
    Parameters:
    
    ocrDPI -
  - getOcrImageQuality
```
public float getOcrImageQuality()
```
    Image quality used to render the page image for OCR. This does not apply to all image formats.
  - setOcrImageQuality
```
public void setOcrImageQuality(float ocrImageQuality)
```
    Image quality used to render the page image for OCR. This does not apply to all image formats.
  - isExtractActions
```
public boolean isExtractActions()
```
    Returns:
    
    whether or not to extract PDActions
    
    See Also:
    
    setExtractActions(boolean)
  - setExtractActions
```
public void setExtractActions(boolean v)
```
    Whether or not to extract PDActions from the file. Most Action types are handled inline; javascript macros are processed as embedded documents.
    
    Parameters:
    
    v -
  - getMaxMainMemoryBytes
```
public long getMaxMainMemoryBytes()
```
    The maximum amount of memory to use when loading a pdf into a PDDocument. Additional buffering is done using a temp file.
  - setMaxMainMemoryBytes
```
public void setMaxMainMemoryBytes(long maxMainMemoryBytes)
```
  - isSetKCMS
```
public boolean isSetKCMS()
```
  - setSetKCMS
```
public void setSetKCMS(boolean setKCMS)
```
    Whether to call System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider"). KCMS is the unmaintained, legacy provider and is far faster than the newer replacement. However, there are stability and security risks with using the unmaintained legacy provider.
    
    Note, of course, that this is not thread safe. If the value is false in your first thread, and the second thread changes this to true, the system property in the first thread will now be true.
    
    Default is false.
    
    Parameters:
    
    setKCMS - whether or not to set KCMS
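    The thread-safety caveat follows from system properties being shared by the whole JVM rather than held per thread. The sketch below demonstrates this with a made-up property name, `sketch.example.flag`, instead of the real `sun.java2d.cmm` key, so that running it does not change any rendering behaviour; the class and method names are hypothetical.

```java
final class SystemPropertySketch {

    // Hypothetical property name used only for this demonstration.
    static final String KEY = "sketch.example.flag";

    /**
     * Sets the property to "false" in the calling thread, lets a second thread
     * set it to "true", then reads it back in the calling thread.
     */
    static String readAfterOtherThreadWrites() {
        System.setProperty(KEY, "false");
        Thread other = new Thread(() -> System.setProperty(KEY, "true"));
        other.start();
        try {
            // join() guarantees the other thread's write is visible here.
            other.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return System.getProperty(KEY);
    }

    public static void main(String[] args) {
        // The first thread now sees the second thread's value.
        System.out.println(readAfterOtherThreadWrites()); // prints true
    }
}
```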
  - isDetectAngles
```
public boolean isDetectAngles()
```
  - setDetectAngles
```
public void setDetectAngles(boolean detectAngles)
```
  - cloneAndUpdate
```
public PDFParserConfig cloneAndUpdate(PDFParserConfig updates)
                               throws TikaException
```
    Throws:
    
    TikaException
  - equals
```
public boolean equals(Object o)
```
    Overrides:
    
    equals in class Object
  - hashCode
```
public int hashCode()
```
    Overrides:
    
    hashCode in class Object
  - toString
```
public String toString()
```
    Overrides:
    
    toString in class Object
