org.apache.tika.parser.ocr.TesseractOCRConfig

All Implemented Interfaces: Serializable

public class TesseractOCRConfig extends Object implements Serializable

Configuration for TesseractOCRParser. This class is not thread safe and must be synchronized externally.

This class remembers every field that has been set through a set* method, and on cloneAndUpdate(TesseractOCRConfig) it applies all of the fields that have been set on the "update" config. For example, suppose you change the language from "eng" to "fra" for one parse and then, for another parse, set depth to 5 on the same update object. If you expect the language to revert to "eng", you'll be wrong: the update object still carries "fra". Create a new update config for each parse unless you're only changing the same field(s) with every parse.
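
The pitfall described above can be sketched with a small stand-in class. This is only an illustration under stated assumptions: `UpdateTrackingSketch` and its `userSet` bookkeeping are hypothetical and are not the Tika implementation; only the "remember what was set, apply it on every cloneAndUpdate" behavior is taken from the description.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for illustration only; this is NOT the Tika class.
class UpdateTrackingSketch {
    private String language = "eng";
    private int depth = 4;
    // Names of the fields that were explicitly set through a setter.
    private final Set<String> userSet = new HashSet<>();

    void setLanguage(String language) {
        this.language = language;
        userSet.add("language");
    }

    void setDepth(int depth) {
        this.depth = depth;
        userSet.add("depth");
    }

    String getLanguage() {
        return language;
    }

    int getDepth() {
        return depth;
    }

    // Copy this config, then overwrite every field that was ever set on "updates".
    UpdateTrackingSketch cloneAndUpdate(UpdateTrackingSketch updates) {
        UpdateTrackingSketch copy = new UpdateTrackingSketch();
        copy.language = this.language;
        copy.depth = this.depth;
        copy.userSet.addAll(this.userSet);
        if (updates.userSet.contains("language")) {
            copy.setLanguage(updates.language);
        }
        if (updates.userSet.contains("depth")) {
            copy.setDepth(updates.depth);
        }
        return copy;
    }

    public static void main(String[] args) {
        UpdateTrackingSketch base = new UpdateTrackingSketch();
        UpdateTrackingSketch update = new UpdateTrackingSketch();

        update.setLanguage("fra");
        UpdateTrackingSketch first = base.cloneAndUpdate(update);

        // Reusing the same update object: "language" is still marked as set.
        update.setDepth(5);
        UpdateTrackingSketch second = base.cloneAndUpdate(update);

        System.out.println(first.getLanguage() + " " + first.getDepth());   // fra 4
        System.out.println(second.getLanguage() + " " + second.getDepth()); // fra 5
    }
}
```

In the sketch, the second parse still gets "fra" because the reused update object never forgets that its language was set. Creating a fresh update object for the second parse would leave the language at the base value.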

See Also:

Serialized Form

Nested Class Summary

Nested Classes

| Modifier and Type | Class | Description |
| --- | --- | --- |
| static enum | TesseractOCRConfig.OUTPUT_TYPE | |

Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| TesseractOCRConfig() | |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| void | addOtherTesseractConfig(String key, String value) | Add a key-value pair to pass to Tesseract using its -c command line option. |
| TesseractOCRConfig | cloneAndUpdate(TesseractOCRConfig updates) | |
| String | getColorspace() | |
| int | getDensity() | |
| int | getDepth() | |
| String | getFilter() | |
| static void | getLangs(String language, Set<String> validLangs, Set<String> invalidLangs) | This takes a language string, parses it and then bins individual langs into valid or invalid based on regexes against the language codes |
| String | getLanguage() | |
| long | getMaxFileSizeToOcr() | |
| long | getMinFileSizeToOcr() | |
| Map<String,String> | getOtherTesseractConfig() | |
| TesseractOCRConfig.OUTPUT_TYPE | getOutputType() | |
| String | getPageSegMode() | |
| String | getPageSeparator() | |
| int | getResize() | |
| int | getTimeoutSeconds() | |
| boolean | isApplyRotation() | |
| boolean | isEnableImagePreprocessing() | |
| boolean | isPreserveInterwordSpacing() | |
| boolean | isSkipOcr() | |
| void | setApplyRotation(boolean applyRotation) | Sets whether or not a rotation value should be calculated and passed to ImageMagick. |
| void | setColorspace(String colorspace) | |
| void | setDensity(int density) | |
| void | setDepth(int depth) | |
| void | setEnableImagePreprocessing(boolean enableImagePreprocessing) | Set the value to true if processing is to be enabled. |
| void | setFilter(String filter) | |
| void | setLanguage(String languageString) | Set tesseract language dictionary to be used. |
| void | setMaxFileSizeToOcr(long maxFileSizeToOcr) | Set maximum file size to submit file to ocr. |
| void | setMinFileSizeToOcr(long minFileSizeToOcr) | Set minimum file size to submit file to ocr. |
| void | setOutputType(String outputType) | |
| void | setOutputType(TesseractOCRConfig.OUTPUT_TYPE outputType) | Set output type from ocr process. |
| void | setPageSegMode(String pageSegMode) | Set tesseract page segmentation mode. |
| void | setPageSeparator(String pageSeparator) | The page separator to use in plain text output. |
| void | setPreserveInterwordSpacing(boolean preserveInterwordSpacing) | Whether or not to maintain interword spacing. |
| void | setResize(int resize) | |
| void | setSkipOcr(boolean skipOcr) | If you want to turn off OCR at run time for a specific file, set this to true |
| void | setTimeoutSeconds(int timeoutSeconds) | Set maximum time (seconds) to wait for the ocring process to terminate. |
| void | setTrustedPageSeparator(String pageSeparator) | Same as setPageSeparator(String) but does not perform any checks on the string. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- TesseractOCRConfig
  
  public TesseractOCRConfig()
Method Details
- getLangs
  
  public static void getLangs(String language, Set<String> validLangs, Set<String> invalidLangs)
  
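  Takes a language string, parses it, and bins the individual languages into validLangs or invalidLangs based on regexes applied to the language codes.
  
  Parameters:
  
  language - the language string to parse
  
  validLangs - set that receives the languages that match
  
  invalidLangs - set that receives the languages that do not match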
- getLanguage
  
  public String getLanguage()
  See Also:
  
  setLanguage(String language)
- setLanguage
  
  public void setLanguage(String languageString)
  Set the Tesseract language dictionary to be used. Default is "eng". Each language is either:
  
  Nominally an ISO-639-2 code, although compound codes separated by underscores are allowed, e.g. chi_tra_vert, aze_cyrl
  
  A file path in the script directory. The name starts with an upper-case letter, and some names contain underscores and further upper-case letters, e.g. script/Arabic, script/HanS_vert, script/Japanese_vert, script/Canadian_Aboriginal
  
  Multiple languages may be specified, separated by plus characters, e.g. "chi_tra+chi_sim+script/Arabic"
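
The language-string format described for setLanguage and getLangs can be sketched as follows. This is a rough illustration, not Tika's code: the class name `LanguageStringSketch`, the method `binLanguages`, and both regular expressions are assumptions chosen to match the examples given above; the real patterns may differ.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Illustrative sketch only; the patterns below are assumptions, not Tika's own.
class LanguageStringSketch {
    // Assumed shape of a language code: three lower-case letters plus optional "_suffix" parts.
    private static final Pattern LANG_CODE = Pattern.compile("[a-z]{3}(_[a-z]+)*");
    // Assumed shape of a script path: "script/" followed by a capitalized name.
    private static final Pattern SCRIPT_PATH = Pattern.compile("script/[A-Z][A-Za-z_]*");

    // Split on '+' and sort each part into the valid or the invalid set.
    static void binLanguages(String language, Set<String> valid, Set<String> invalid) {
        for (String part : language.split("\\+")) {
            String lang = part.trim();
            if (lang.isEmpty()) {
                continue;
            }
            if (LANG_CODE.matcher(lang).matches() || SCRIPT_PATH.matcher(lang).matches()) {
                valid.add(lang);
            } else {
                invalid.add(lang);
            }
        }
    }

    public static void main(String[] args) {
        Set<String> valid = new LinkedHashSet<>();
        Set<String> invalid = new LinkedHashSet<>();
        binLanguages("chi_tra+chi_sim+script/Arabic+bad lang!", valid, invalid);
        System.out.println("valid:   " + valid);
        System.out.println("invalid: " + invalid);
    }
}
```

Passing the two sets in as output parameters mirrors the static void signature of getLangs; the caller decides what to do with any invalid entries.
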
- getPageSegMode
  
  public String getPageSegMode()
  See Also:
  
  setPageSegMode(String pageSegMode)
- setPageSegMode
  
  public void setPageSegMode(String pageSegMode)
  
  Set the Tesseract page segmentation mode. Default is 1: automatic page segmentation with OSD (Orientation and Script Detection).
- getPageSeparator
  
  public String getPageSeparator()
  See Also:
  
  setPageSeparator(String pageSeparator)
- setPageSeparator
  
  public void setPageSeparator(String pageSeparator)
  
  The page separator to use in plain text output. This corresponds to Tesseract's page_separator config option. The default here is the empty string (i.e. no page separators). Note that this is also the default in Tesseract 3.x, but in Tesseract 4.0 the default is to use the form feed control character. We are overriding Tesseract 4.0's default here.
  
  Parameters:
  
  pageSeparator -
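
When a non-empty separator such as the form feed character is configured, the plain text output can be cut back into pages. The following is a minimal sketch of that idea; `PageSeparatorSketch` and `splitPages` are hypothetical names, and the assumption that the separator follows each page is an illustration rather than a documented guarantee.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper, not part of Tika.
class PageSeparatorSketch {
    // Split OCR text into pages on the given separator.
    static List<String> splitPages(String text, String separator) {
        List<String> pages = new ArrayList<>();
        if (separator.isEmpty()) {
            // With the empty-string default, page boundaries cannot be recovered.
            pages.add(text);
            return pages;
        }
        int start = 0;
        int idx;
        while ((idx = text.indexOf(separator, start)) >= 0) {
            pages.add(text.substring(start, idx));
            start = idx + separator.length();
        }
        // Keep any text that follows the last separator.
        if (start < text.length()) {
            pages.add(text.substring(start));
        }
        return pages;
    }

    public static void main(String[] args) {
        String ocrText = "page one\n\fpage two\n\f";
        System.out.println(splitPages(ocrText, "\f").size()); // 2
        System.out.println(splitPages(ocrText, "").size());   // 1
    }
}
```
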
- setTrustedPageSeparator
  
  public void setTrustedPageSeparator(String pageSeparator)
  
  Same as setPageSeparator(String) but does not perform any checks on the string.
  
  Parameters:
  
  pageSeparator -
- isPreserveInterwordSpacing
  
  public boolean isPreserveInterwordSpacing()
  
  Returns:
  
  whether or not to maintain interword spacing.
- setPreserveInterwordSpacing
  
  public void setPreserveInterwordSpacing(boolean preserveInterwordSpacing)
  
  Whether or not to maintain interword spacing. Default is false.
  
  Parameters:
  
  preserveInterwordSpacing -
- getMinFileSizeToOcr
  
  public long getMinFileSizeToOcr()
  See Also:
  
  setMinFileSizeToOcr(long minFileSizeToOcr)
- setMinFileSizeToOcr
  
  public void setMinFileSizeToOcr(long minFileSizeToOcr)
  
  Set the minimum size a file must have to be submitted to OCR. Default is 0.
- getMaxFileSizeToOcr
  
  public long getMaxFileSizeToOcr()
  See Also:
  
  setMaxFileSizeToOcr(long maxFileSizeToOcr)
- setMaxFileSizeToOcr
  
  public void setMaxFileSizeToOcr(long maxFileSizeToOcr)
  
  Set the maximum size a file may have to be submitted to OCR. Default is Integer.MAX_VALUE.
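
Taken together, the minimum and maximum settings act as a size gate in front of OCR. Here is a minimal sketch of such a gate using the documented defaults; `FileSizeGateSketch` and `shouldOcr` are hypothetical names, and treating both bounds as inclusive is an assumption.

```java
// Illustrative helper, not part of Tika.
class FileSizeGateSketch {
    // Assumption: both bounds are inclusive.
    static boolean shouldOcr(long size, long minFileSizeToOcr, long maxFileSizeToOcr) {
        return size >= minFileSizeToOcr && size <= maxFileSizeToOcr;
    }

    public static void main(String[] args) {
        long min = 0L;                 // documented default
        long max = Integer.MAX_VALUE;  // documented default

        System.out.println(shouldOcr(1024L, min, max));           // true
        System.out.println(shouldOcr(3_000_000_000L, min, max));  // false: above the default maximum
        System.out.println(shouldOcr(10L, 100L, max));            // false: below a raised minimum
    }
}
```
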
- getTimeoutSeconds
  
  public int getTimeoutSeconds()
  Returns:
  
  timeout value for Tesseract
  
  See Also:
  
  setTimeoutSeconds(int timeout)
- setTimeoutSeconds
  
  public void setTimeoutSeconds(int timeoutSeconds)
  
  Set the maximum time (in seconds) to wait for the OCR process to terminate. Default value is 120s.
- getOutputType
  
  public TesseractOCRConfig.OUTPUT_TYPE getOutputType()
  See Also:
  
  setOutputType(OUTPUT_TYPE outputType)
- setOutputType
  
  public void setOutputType(TesseractOCRConfig.OUTPUT_TYPE outputType)
  
  Set the output type of the OCR process. Default value is TesseractOCRConfig.OUTPUT_TYPE.TXT ("txt"); the output can also be "hocr".
- setOutputType
  
  public void setOutputType(String outputType)
- isEnableImagePreprocessing
  
  public boolean isEnableImagePreprocessing()
  Returns:
  
  whether image preprocessing is enabled
  
  See Also:
  
  setEnableImagePreprocessing(boolean)
- setEnableImagePreprocessing
  
  public void setEnableImagePreprocessing(boolean enableImagePreprocessing)
  
  Set the value to true to enable image preprocessing. Default value is false.
- getDensity
  
  public int getDensity()
  
  Returns:
  
  the density
- setDensity
  
  public void setDensity(int density)
  
  Parameters:
  
  density - the density to set. Valid range of values is 150-1200. Default value is 300.
- getDepth
  
  public int getDepth()
  
  Returns:
  
  the depth
- setDepth
  
  public void setDepth(int depth)
  
  Parameters:
  
  depth - the depth to set. Valid values are 2, 4, 8, 16, 32, 64, 256, 4096. Default value is 4.
- getColorspace
  
  public String getColorspace()
  
  Returns:
  
  the colorspace
- setColorspace
  
  public void setColorspace(String colorspace)
  
  Parameters:
  
  colorspace - the colorspace to set. Default value is gray.
- getFilter
  
  public String getFilter()
  
  Returns:
  
  the filter
- setFilter
  
  public void setFilter(String filter)
  
  Parameters:
  
  filter - the filter to set. Valid values are point, hermite, cubic, box, gaussian, catrom, triangle, quadratic and mitchell. Default value is triangle.
- isSkipOcr
  
  public boolean isSkipOcr()
- setSkipOcr
  
  public void setSkipOcr(boolean skipOcr)
  
  If you want to turn off OCR at run time for a specific file, set this to true.
  
  Parameters:
  
  skipOcr -
- getResize
  
  public int getResize()
  
  Returns:
  
  the resize
- setResize
  
  public void setResize(int resize)
  
  Parameters:
  
  resize - the resize to set. Valid range of values is 100-900. Default value is 900.
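
The valid values documented for setDensity, setDepth, setFilter, and setResize can be collected into simple checks. This is a sketch under assumptions: `PreprocessingRangeSketch` and its methods are hypothetical, the ranges are treated as inclusive, filter names are matched exactly in lower case, and how the real setters react to an out-of-range value is not shown here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative helper, not part of Tika; it only encodes the documented ranges.
class PreprocessingRangeSketch {
    private static final Set<Integer> DEPTHS =
            new HashSet<>(Arrays.asList(2, 4, 8, 16, 32, 64, 256, 4096));
    private static final Set<String> FILTERS = new HashSet<>(Arrays.asList(
            "point", "hermite", "cubic", "box", "gaussian",
            "catrom", "triangle", "quadratic", "mitchell"));

    // Assumption: the 150-1200 range is inclusive.
    static boolean isValidDensity(int density) {
        return density >= 150 && density <= 1200;
    }

    static boolean isValidDepth(int depth) {
        return DEPTHS.contains(depth);
    }

    // Assumption: filter names are compared exactly, in lower case.
    static boolean isValidFilter(String filter) {
        return FILTERS.contains(filter);
    }

    // Assumption: the 100-900 range is inclusive.
    static boolean isValidResize(int resize) {
        return resize >= 100 && resize <= 900;
    }

    public static void main(String[] args) {
        // The documented defaults all pass their own checks.
        System.out.println(isValidDensity(300));        // true
        System.out.println(isValidDepth(4));            // true
        System.out.println(isValidFilter("triangle"));  // true
        System.out.println(isValidResize(900));         // true
        System.out.println(isValidDepth(3));            // false
    }
}
```
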
- isApplyRotation
  
  public boolean isApplyRotation()
  
  Returns:
  
  Whether or not a rotation value should be calculated and passed to ImageMagick before performing OCR.
- setApplyRotation
  
  public void setApplyRotation(boolean applyRotation)
  
  Sets whether or not a rotation value should be calculated and passed to ImageMagick.
  
  Parameters:
  
  applyRotation - true to calculate and apply rotation, false to skip. Default is false.
- getOtherTesseractConfig
  
  public Map<String,String> getOtherTesseractConfig()
  See Also:
  
  addOtherTesseractConfig(String, String)
- addOtherTesseractConfig
  
  public void addOtherTesseractConfig(String key, String value)
  
  Add a key-value pair to pass to Tesseract using its -c command line option. To see the possible options, run tesseract --print-parameters.
  You may also add these parameters in TesseractOCRConfig.properties; any key-value pair in the properties file where the key contains an underscore is passed directly to Tesseract.
  
  Parameters:
  
  key -
  
  value -
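
The two rules above (properties whose key contains an underscore are passed through, and each pair is handed to Tesseract with -c) can be sketched like this. The class `OtherConfigSketch` and its methods are hypothetical, the property names in main are sample input, and the way Tika actually assembles its command line may differ.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Illustrative helper, not part of Tika.
class OtherConfigSketch {
    // Keep only the properties whose key contains an underscore, sorted by key.
    static Map<String, String> passThroughKeys(Properties props) {
        Map<String, String> other = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.contains("_")) {
                other.put(key, props.getProperty(key));
            }
        }
        return other;
    }

    // Render each pair as "-c key=value" for the Tesseract command line.
    static List<String> toCommandLineArgs(Map<String, String> other) {
        List<String> cli = new ArrayList<>();
        for (Map.Entry<String, String> entry : other.entrySet()) {
            cli.add("-c");
            cli.add(entry.getKey() + "=" + entry.getValue());
        }
        return cli;
    }

    public static void main(String[] args) throws IOException {
        // Sample properties content; "language" has no underscore and is not passed through.
        Properties props = new Properties();
        props.load(new StringReader(
                "language=eng\n"
                + "tessedit_char_whitelist=0123456789\n"
                + "preserve_interword_spaces=1\n"));

        System.out.println(toCommandLineArgs(passThroughKeys(props)));
        // [-c, preserve_interword_spaces=1, -c, tessedit_char_whitelist=0123456789]
    }
}
```

A sorted map is used only to make the output order predictable in the sketch.
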
- cloneAndUpdate
  
  public TesseractOCRConfig cloneAndUpdate(TesseractOCRConfig updates) throws TikaException
  
  Throws:
  
  TikaException
