Package org.apache.tika.parser.ocr
Class TesseractOCRConfig
- java.lang.Object
-
- org.apache.tika.parser.ocr.TesseractOCRConfig
-
- All Implemented Interfaces:
Serializable
public class TesseractOCRConfig extends Object implements Serializable
Configuration for TesseractOCRParser. This class lets you enable TesseractOCRParser and set its parameters:
TesseractOCRConfig config = new TesseractOCRConfig();
config.setTesseractPath(tesseractFolder);
parseContext.set(TesseractOCRConfig.class, config);
Parameters can also be set either by editing the existing TesseractOCRConfig.properties file in tika-parser/src/main/resources/org/apache/tika/parser/ocr, or by overriding it with your own copy placed in the package org/apache/tika/parser/ocr on the classpath.
- See Also:
- Serialized Form
-
-
Nested Class Summary
Nested Classes
Modifier and Type Class Description
static class
TesseractOCRConfig.OUTPUT_TYPE
-
Constructor Summary
Constructors
Constructor Description
TesseractOCRConfig()
Default constructor.
TesseractOCRConfig(InputStream is)
Loads properties from InputStream and then tries to close InputStream.
-
Method Summary
Modifier and Type Method Description
void
addOtherTesseractConfig(String key, String value)
Add a key-value pair to pass to Tesseract using its -c command line option.
boolean
getApplyRotation()
String
getColorspace()
int
getDensity()
int
getDepth()
String
getFilter()
String
getImageMagickPath()
String
getLanguage()
long
getMaxFileSizeToOcr()
long
getMinFileSizeToOcr()
Map<String,String>
getOtherTesseractConfig()
TesseractOCRConfig.OUTPUT_TYPE
getOutputType()
String
getPageSegMode()
String
getPageSeparator()
boolean
getPreserveInterwordSpacing()
int
getResize()
String
getTessdataPath()
String
getTesseractPath()
int
getTimeout()
int
isEnableImageProcessing()
void
setApplyRotation(boolean applyRotation)
Sets whether or not a rotation value should be calculated and passed to ImageMagick.
void
setColorspace(String colorspace)
void
setDensity(int density)
void
setDepth(int depth)
void
setEnableImageProcessing(int enableImageProcessing)
Set the value to 1 if image processing is to be enabled.
void
setFilter(String filter)
void
setImageMagickPath(String imageMagickPath)
Set the path to the ImageMagick executable directory, needed if it is not on system path.
void
setLanguage(String language)
Set tesseract language dictionary to be used.
void
setMaxFileSizeToOcr(long maxFileSizeToOcr)
Set maximum file size to submit file to ocr.
void
setMinFileSizeToOcr(long minFileSizeToOcr)
Set minimum file size to submit file to ocr.
void
setOutputType(String outputType)
void
setOutputType(TesseractOCRConfig.OUTPUT_TYPE outputType)
Set output type from ocr process.
void
setPageSegMode(String pageSegMode)
Set tesseract page segmentation mode.
void
setPageSeparator(String pageSeparator)
The page separator to use in plain text output.
void
setPreserveInterwordSpacing(boolean preserveInterwordSpacing)
Whether or not to maintain interword spacing.
void
setResize(int resize)
void
setTessdataPath(String tessdataPath)
Set the path to the 'tessdata' folder, which contains language files and config files.
void
setTesseractPath(String tesseractPath)
Set the path to the Tesseract executable's directory, needed if it is not on system path.
void
setTimeout(int timeout)
Set maximum time (seconds) to wait for the ocring process to terminate.
void
setTrustedPageSeparator(String pageSeparator)
Same as setPageSeparator(String) but does not perform any checks on the string.
-
-
-
Constructor Detail
-
TesseractOCRConfig
public TesseractOCRConfig()
Default constructor.
-
TesseractOCRConfig
public TesseractOCRConfig(InputStream is)
Loads properties from InputStream and then tries to close InputStream. If there is an IOException, this silently swallows the exception and goes back to the default.
- Parameters:
is
-
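The load-then-fall-back behaviour described for this constructor can be sketched with java.util.Properties alone. This is only an illustration of the pattern, not Tika's implementation: the class PropertiesFallback, the method loadOrDefaults, and the key names used in main are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

/** Hypothetical helper illustrating load-with-fallback; not Tika's code. */
final class PropertiesFallback {
    private PropertiesFallback() {}

    /**
     * Loads properties from the stream on top of the given defaults.
     * If reading fails, the IOException is swallowed and only the defaults remain.
     * The stream is closed in either case.
     */
    static Properties loadOrDefaults(InputStream is, Properties defaults) {
        Properties result = new Properties(defaults);
        if (is == null) {
            return result;
        }
        Properties loaded = new Properties();
        try {
            loaded.load(is);
            // Copy only after a complete, successful read.
            result.putAll(loaded);
        } catch (IOException e) {
            // Deliberately ignored: fall back to the defaults.
        } finally {
            try {
                is.close();
            } catch (IOException ignored) {
                // Closing is best effort.
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();
        defaults.setProperty("language", "eng");
        defaults.setProperty("timeout", "120");

        // Key names here are illustrative assumptions.
        byte[] overrides = "language=deu\n".getBytes(StandardCharsets.ISO_8859_1);
        Properties props = loadOrDefaults(new ByteArrayInputStream(overrides), defaults);

        System.out.println(props.getProperty("language"));
        System.out.println(props.getProperty("timeout"));
    }
}
```

Loading into a temporary Properties object first means a read that fails halfway leaves no partial overrides behind.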
-
-
Method Detail
-
getTesseractPath
public String getTesseractPath()
- See Also:
setTesseractPath(String tesseractPath)
-
setTesseractPath
public void setTesseractPath(String tesseractPath)
Set the path to the Tesseract executable's directory, needed if it is not on system path. Note that if you set this value, it is highly recommended that you also set the path to the 'tessdata' folder using setTessdataPath(java.lang.String).
-
getTessdataPath
public String getTessdataPath()
- See Also:
setTessdataPath(String tessdataPath)
-
setTessdataPath
public void setTessdataPath(String tessdataPath)
Set the path to the 'tessdata' folder, which contains language files and config files. In some cases (such as on Windows), this folder is found in the Tesseract installation, but in other cases (such as when Tesseract is built from source), it may be located elsewhere.
-
getLanguage
public String getLanguage()
- See Also:
setLanguage(String language)
-
setLanguage
public void setLanguage(String language)
Set the Tesseract language dictionary to be used. Default is "eng". Multiple languages may be specified, separated by plus characters, e.g. "chi_tra+chi_sim".
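As a small illustration of the plus-separated form, the sketch below joins language codes into a string of the shape setLanguage(String) expects. The class LanguageSpec and the method joinLanguages are hypothetical and not part of Tika; falling back to "eng" for an empty list is an assumption that mirrors the documented default.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper; not part of Tika. */
final class LanguageSpec {
    private LanguageSpec() {}

    /** Joins Tesseract language codes with '+'. */
    static String joinLanguages(List<String> codes) {
        if (codes.isEmpty()) {
            // Assumption: mirror the documented default when nothing is given.
            return "eng";
        }
        return String.join("+", codes);
    }

    public static void main(String[] args) {
        // The result could then be passed to config.setLanguage(...).
        System.out.println(joinLanguages(Arrays.asList("chi_tra", "chi_sim")));
    }
}
```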
-
getPageSegMode
public String getPageSegMode()
- See Also:
setPageSegMode(String pageSegMode)
-
setPageSegMode
public void setPageSegMode(String pageSegMode)
Set the Tesseract page segmentation mode. Default is 1 (automatic page segmentation with OSD, i.e. orientation and script detection).
-
getPageSeparator
public String getPageSeparator()
- See Also:
setPageSeparator(String pageSeparator)
-
setPageSeparator
public void setPageSeparator(String pageSeparator)
The page separator to use in plain text output. This corresponds to Tesseract's page_separator config option. The default here is the empty string (i.e. no page separators). Note that this is also the default in Tesseract 3.x, but in Tesseract 4.0 the default is to use the form feed control character. We are overriding Tesseract 4.0's default here.
- Parameters:
pageSeparator
-
-
setTrustedPageSeparator
public void setTrustedPageSeparator(String pageSeparator)
Same as setPageSeparator(String) but does not perform any checks on the string.
- Parameters:
pageSeparator
-
-
setPreserveInterwordSpacing
public void setPreserveInterwordSpacing(boolean preserveInterwordSpacing)
Whether or not to maintain interword spacing. Default is false.
- Parameters:
preserveInterwordSpacing
-
-
getPreserveInterwordSpacing
public boolean getPreserveInterwordSpacing()
- Returns:
- whether or not to maintain interword spacing.
-
getMinFileSizeToOcr
public long getMinFileSizeToOcr()
-
setMinFileSizeToOcr
public void setMinFileSizeToOcr(long minFileSizeToOcr)
Set the minimum file size required for a file to be submitted for OCR. Default is 0.
-
getMaxFileSizeToOcr
public long getMaxFileSizeToOcr()
-
setMaxFileSizeToOcr
public void setMaxFileSizeToOcr(long maxFileSizeToOcr)
Set the maximum file size allowed for a file to be submitted for OCR. Default is Integer.MAX_VALUE.
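How the minimum and maximum sizes might gate OCR can be sketched as a simple range check. This is an illustration only: the class OcrSizeGate is hypothetical, and treating both bounds as inclusive and measured in bytes is an assumption that this page does not state.

```java
/** Hypothetical sketch of a size gate; not Tika's implementation. */
final class OcrSizeGate {
    private final long minFileSizeToOcr;
    private final long maxFileSizeToOcr;

    OcrSizeGate(long minFileSizeToOcr, long maxFileSizeToOcr) {
        if (minFileSizeToOcr < 0 || maxFileSizeToOcr < minFileSizeToOcr) {
            throw new IllegalArgumentException("expected 0 <= min <= max");
        }
        this.minFileSizeToOcr = minFileSizeToOcr;
        this.maxFileSizeToOcr = maxFileSizeToOcr;
    }

    /** Returns true if the size lies within [min, max]; inclusive bounds are assumed. */
    boolean shouldOcr(long fileSize) {
        return fileSize >= minFileSizeToOcr && fileSize <= maxFileSizeToOcr;
    }

    public static void main(String[] args) {
        // Documented defaults: 0 and Integer.MAX_VALUE.
        OcrSizeGate gate = new OcrSizeGate(0L, Integer.MAX_VALUE);
        System.out.println(gate.shouldOcr(1024L));
        System.out.println(gate.shouldOcr(Integer.MAX_VALUE + 1L));
    }
}
```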
-
setTimeout
public void setTimeout(int timeout)
Set the maximum time (in seconds) to wait for the OCR process to terminate. Default value is 120s.
-
getTimeout
public int getTimeout()
- Returns:
- timeout value for Tesseract
- See Also:
setTimeout(int timeout)
-
setOutputType
public void setOutputType(TesseractOCRConfig.OUTPUT_TYPE outputType)
Set the output type of the OCR process. Can be "txt" or "hocr". Default value is TesseractOCRConfig.OUTPUT_TYPE.TXT.
-
setOutputType
public void setOutputType(String outputType)
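The String overload carries no description here. A plausible reading is that it maps the documented values "txt" and "hocr" onto the enum. The sketch below shows such a mapping under stated assumptions: OutputTypes and its nested OutputType enum are hypothetical stand-ins (only the constant TXT is confirmed by this page), and the case-insensitive matching and the exception for unknown values are choices made for this example.

```java
import java.util.Locale;

/** Hypothetical stand-in for string-to-enum mapping; not Tika's code. */
final class OutputTypes {
    private OutputTypes() {}

    /** Stand-in for TesseractOCRConfig.OUTPUT_TYPE; HOCR is an assumed constant name. */
    enum OutputType { TXT, HOCR }

    /** Maps "txt" or "hocr" (any letter case) to the matching constant. */
    static OutputType parse(String outputType) {
        if (outputType == null) {
            throw new IllegalArgumentException("outputType must not be null");
        }
        switch (outputType.trim().toLowerCase(Locale.ROOT)) {
            case "txt":
                return OutputType.TXT;
            case "hocr":
                return OutputType.HOCR;
            default:
                // Assumption: reject anything else rather than guess.
                throw new IllegalArgumentException("unsupported output type: " + outputType);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("hocr"));
    }
}
```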
-
getOutputType
public TesseractOCRConfig.OUTPUT_TYPE getOutputType()
- See Also:
setOutputType(OUTPUT_TYPE outputType)
-
isEnableImageProcessing
public int isEnableImageProcessing()
- Returns:
- whether image processing is enabled (1) or not (0)
- See Also:
setEnableImageProcessing(int)
-
setEnableImageProcessing
public void setEnableImageProcessing(int enableImageProcessing)
Set the value to 1 if image processing is to be enabled. Default value is 0 (disabled).
-
getDensity
public int getDensity()
- Returns:
- the density
-
setDensity
public void setDensity(int density)
- Parameters:
density
- the density to set. Valid range of values is 150-1200. Default value is 300.
-
getDepth
public int getDepth()
- Returns:
- the depth
-
setDepth
public void setDepth(int depth)
- Parameters:
depth
- the depth to set. Valid values are 2, 4, 8, 16, 32, 64, 256, 4096. Default value is 4.
-
getColorspace
public String getColorspace()
- Returns:
- the colorspace
-
setColorspace
public void setColorspace(String colorspace)
- Parameters:
colorspace
- the colorspace to set. Default value is gray.
-
getFilter
public String getFilter()
- Returns:
- the filter
-
setFilter
public void setFilter(String filter)
- Parameters:
filter
- the filter to set. Valid values are point, hermite, cubic, box, gaussian, catrom, triangle, quadratic and mitchell. Default value is triangle.
-
getResize
public int getResize()
- Returns:
- the resize
-
setResize
public void setResize(int resize)
- Parameters:
resize
- the resize to set. Valid range of values is 100-900. Default value is 900.
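The valid values documented for density, depth, filter, and resize can be collected into simple checks, for example to validate user input before calling the setters. The class ImageParamChecks and its methods are hypothetical; treating the ranges as inclusive and the filter names as exact lower-case matches are assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical pre-validation of image processing parameters; not part of Tika. */
final class ImageParamChecks {
    private ImageParamChecks() {}

    private static final Set<Integer> DEPTHS =
            new HashSet<>(Arrays.asList(2, 4, 8, 16, 32, 64, 256, 4096));

    private static final Set<String> FILTERS = new HashSet<>(Arrays.asList(
            "point", "hermite", "cubic", "box", "gaussian",
            "catrom", "triangle", "quadratic", "mitchell"));

    /** Density: 150-1200, bounds assumed inclusive. */
    static boolean isValidDensity(int density) {
        return density >= 150 && density <= 1200;
    }

    /** Depth: one of the listed values. */
    static boolean isValidDepth(int depth) {
        return DEPTHS.contains(depth);
    }

    /** Filter: one of the listed names, matched exactly. */
    static boolean isValidFilter(String filter) {
        return filter != null && FILTERS.contains(filter);
    }

    /** Resize: 100-900, bounds assumed inclusive. */
    static boolean isValidResize(int resize) {
        return resize >= 100 && resize <= 900;
    }

    public static void main(String[] args) {
        // The documented defaults should all pass.
        System.out.println(isValidDensity(300) && isValidDepth(4)
                && isValidFilter("triangle") && isValidResize(900));
    }
}
```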
-
getImageMagickPath
public String getImageMagickPath()
- Returns:
- path to ImageMagick executable directory.
- See Also:
setImageMagickPath(String imageMagickPath)
-
setImageMagickPath
public void setImageMagickPath(String imageMagickPath)
Set the path to the ImageMagick executable directory, needed if it is not on system path.
- Parameters:
imageMagickPath
- path to ImageMagick executable directory.
-
getApplyRotation
public boolean getApplyRotation()
- Returns:
- Whether or not a rotation value should be calculated and passed to ImageMagick before performing OCR. (Requires that Python is installed).
-
setApplyRotation
public void setApplyRotation(boolean applyRotation)
Sets whether or not a rotation value should be calculated and passed to ImageMagick.
- Parameters:
applyRotation
- true to calculate and apply rotation, false to skip. Default is false; true requires Python to be installed.
-
getOtherTesseractConfig
public Map<String,String> getOtherTesseractConfig()
- See Also:
addOtherTesseractConfig(String, String)
-
addOtherTesseractConfig
public void addOtherTesseractConfig(String key, String value)
Add a key-value pair to pass to Tesseract using its -c command line option. To see the possible options, run tesseract --print-parameters. You may also add these parameters in TesseractOCRConfig.properties; any key-value pair in the properties file where the key contains an underscore is passed directly to Tesseract.
- Parameters:
key
-
value
-
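The two rules described for addOtherTesseractConfig (keys containing an underscore are forwarded, and each pair is passed with -c) can be sketched as follows. This is not Tika's implementation: TesseractArgs, underscoreEntries, and toConfigArgs are hypothetical names, and sorting by key is only done to make the output deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

/** Hypothetical sketch of building Tesseract -c arguments; not Tika's code. */
final class TesseractArgs {
    private TesseractArgs() {}

    /** Selects the entries whose key contains an underscore, sorted by key. */
    static Map<String, String> underscoreEntries(Properties props) {
        Map<String, String> selected = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.indexOf('_') >= 0) {
                selected.put(key, props.getProperty(key));
            }
        }
        return selected;
    }

    /** Expands each entry into the argument pair "-c", "key=value". */
    static List<String> toConfigArgs(Map<String, String> otherConfig) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> entry : otherConfig.entrySet()) {
            args.add("-c");
            args.add(entry.getKey() + "=" + entry.getValue());
        }
        return args;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("language", "eng");                // no underscore: not forwarded
        props.setProperty("preserve_interword_spaces", "1"); // underscore: forwarded
        System.out.println(toConfigArgs(underscoreEntries(props)));
    }
}
```

With the real class, the equivalent call would be config.addOtherTesseractConfig("preserve_interword_spaces", "1").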
-
-