All Classes and Interfaces (Apache Tika 3.0.0 API)


AttributeMetadataHandler

SAX event handler that maps the contents of an XML attribute into
 a metadata field.

AudioFrame

An Audio Frame in an MP3 file.

AudioParser
 
AutoDetectParser
 
AutoDetectParserConfig

This config object can be used to tune how conservative we want to be
 when parsing data that is extremely compressible and resembles a ZIP
 bomb.

AutoDetectParserFactory

Simple class for AutoDetectParser

AutoDetectParserFactory

Factory for an AutoDetectParser

AutoDetectReader

An input stream reader that automatically detects the character encoding
 to be used for converting bytes to characters.

AutoDetectTransformer
 
AZBlobEmitter

Emit files to Azure blob storage.

AZBlobFetcher

Fetches files from Azure blob storage.

AZBlobFetcherConfig
 
AZBlobPipesIterator
 
BasicContentHandlerFactory

Basic factory for creating common types of ContentHandlers

BasicContentHandlerFactory.HANDLER_TYPE

Common handler types for content.

BasicEmbeddedBytesSelector
 
BasicEmbeddedDocumentBytesHandler

For now, this is an in-memory EmbeddedDocumentBytesHandler that stores
 all the bytes in memory.

BasicObject

Base object for FSSHTTPB.

BasicTikaFSConsumer

Basic FileResourceConsumer that reads files from an input
 directory and writes content to the output directory.

BasicTikaFSConsumersBuilder
 
BasicTokenCountStatsCalculator
 
BatchNoRestartError

FileResourceConsumers should throw this if something
 catastrophic has happened and the BatchProcess should shut down
 and not be restarted.

BatchProcess

This is the main processor class for a single process.

BatchProcess.BATCH_CONSTANTS
 
BatchProcessBuilder

Builds a BatchProcessor from a combination of runtime arguments and the
 config file.

BatchProcessDriverCLI
 
BatchTopCommonTokenCounter

Utility class that runs TopCommonTokenCounter against a directory
 of table files (named {lang}_table.gz or Leipzig-like afr_...

BinaryItem
 
Bit

This class is used to read/set bit values in a byte array

BitConverter
 
BitReader

A class used to extract values across byte boundaries at arbitrary bit positions.
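
Reading a value that straddles byte boundaries can be sketched in plain Java. The class below is a hypothetical illustration (the name BitReaderSketch and its methods are not part of Tika), and it assumes bits are numbered from the least significant bit of each byte, which may differ from what BitReader actually does.

```java
/** Hypothetical sketch, not Tika's BitReader. Assumes LSB-first bit numbering. */
final class BitReaderSketch {
    private final byte[] data;
    private int bitPos; // absolute bit offset into data

    BitReaderSketch(byte[] data, int startBit) {
        this.data = data;
        this.bitPos = startBit;
    }

    /** Reads count bits (1..32) and returns them as an unsigned value. */
    long readBits(int count) {
        if (count < 1 || count > 32) {
            throw new IllegalArgumentException("count must be in 1..32");
        }
        if (bitPos + count > data.length * 8) {
            throw new IndexOutOfBoundsException("not enough bits left");
        }
        long value = 0;
        for (int i = 0; i < count; i++) {
            int byteIndex = bitPos >>> 3; // which byte holds the bit
            int bitInByte = bitPos & 7;   // position of the bit inside that byte
            long bit = (data[byteIndex] >>> bitInByte) & 1;
            value |= bit << i;            // first bit read becomes the lowest bit
            bitPos++;
        }
        return value;
    }
}
```

Starting at bit 4 of the two bytes 0xF0 0x0F, an 8-bit read takes the upper four bits of the first byte and the lower four bits of the second.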

BitWriter
 
BodyContentHandler

Content handler decorator that only passes everything inside
 the XHTML <body/> tag to the underlying handler.

BoilerpipeContentHandler

Uses the boilerpipe
 library to automatically extract the main content from a web page.

BOMDetector
 
BouncyCastleDigester

Digester that relies on BouncyCastle for MessageDigest implementations.

BoundedInputStream

Very slight modification of Commons' BoundedInputStream
 so that we can figure out if this hit the bound or not.

BPGParser

Parser for the Better Portable Graphics (BPG) File Format.

BPListDetector

Detector for BPList with utility functions for PList.

ByteDeleter
 
ByteFlipper
 
ByteInjector
 
BytesRefCalculator<T>

Interface for calculators that require a string

BytesRefCalculator.BytesRefCalcInstance<T>
 
ByteUtil
 
CachedTranslator

CachedTranslator.

CallablePipesIterator

This is a simple wrapper around PipesIterator
 that allows it to be called in its own thread.

CantFuzzException
 
CaptionObject

A model for caption objects from graphics and texts; it typically includes
 a human-readable sentence, the language of the sentence, and a confidence score.

CaptureGroupMetadataFilter

This filter runs a regex against the first value in the "sourceField".

Cell

Cell of content.

CellDecorator

Cell decorator.

CellID
 
CellIDArray
 
CellManifestCurrentRevision
 
CellManifestDataElementData

Cell manifest data element

CharsetDetector

CharsetDetector provides a facility for detecting the
 charset or encoding of character data in an unknown format.

CharsetMatch

This class represents a charset that has been identified by a CharsetDetector
 as a possible encoding for a set of input data.

CharsetUtils
 
ChildMatcher

Intermediate evaluation state of a .../*... XPath expression.

ChmAccessor<T>

Defines an accessor interface

ChmAssert

Contains chm extractor assertions

ChmBlockInfo

A container for chm block information, such as:
 i. initial block, used to reset the main tree
 ii. start block, used to know where to start
 iii. end block, used to know where to stop
 iv. start offset, used to know where to start reading
 v. end offset, used to know where to stop reading

ChmCommons
 
ChmCommons.EntryType

Represents entry types: uncompressed, compressed

ChmCommons.IntelState

Represents intel file states during decompression

ChmCommons.LzxState

Represents lzx states: started decoding, not started decoding

ChmConstants
 
ChmDirectoryListingSet

Holds chm listing entries

ChmExtractor

Extracts text from chm file.

ChmItsfHeader

The Header
 0000: char[4] 'ITSF'
 0004: DWORD 3 (Version number)
 0008: DWORD Total header length, including header section table and following data.
 000C: DWORD 1 (unknown)
 0010: DWORD a timestamp
 0014: DWORD Windows Language ID
 0018: GUID {7C01FD10-7BAA-11D0-9E0C-00A0-C922-E6EC}
 0028: GUID {7C01FD11-7BAA-11D0-9E0C-00A0-C922-E6EC}
 Note: a GUID is $10 bytes, arranged as 1 DWORD, 2 WORDs, and 8 BYTEs.
 0000: QWORD Offset of section from beginning of file
 0008: QWORD Length of section
 Following the header section table is 8 bytes of additional header data.

ChmItspHeader

Directory header. The directory starts with a header; its format is as follows:
 0000: char[4] 'ITSP'
 0004: DWORD Version number 1
 0008: DWORD Length of the directory header
 000C: DWORD $0a (unknown)
 0010: DWORD $1000 Directory chunk size
 0014: DWORD "Density" of quickref section, usually 2
 0018: DWORD Depth of the index tree: 1 if there is no index, 2 if there is one level of PMGI chunks
 001C: DWORD Chunk number of root index chunk, -1 if there is none (though at least one file has 0 despite there being no index chunk, probably a bug)
 0020: DWORD Chunk number of first PMGL (listing) chunk
 0024: DWORD Chunk number of last PMGL (listing) chunk
 0028: DWORD -1 (unknown)
 002C: DWORD Number of directory chunks (total)
 0030: DWORD Windows language ID
 0034: GUID {5D02926A-212E-11D0-9DF9-00A0C922E6EC}
 0044: DWORD $54 (This is the length again)
 0048: DWORD -1 (unknown)
 004C: DWORD -1 (unknown)
 0050: DWORD -1 (unknown)

ChmLzxBlock

Decompresses a chm block.

ChmLzxcControlData

::DataSpace/Storage//ControlData This file contains $20 bytes of
 information on the compression.

ChmLzxcResetTable

LZXC reset table, used for ensuring decompression.

ChmLzxState
 
ChmParser
 
ChmParsingException
 
ChmPmgiHeader

Description. Note: does not always exist. An index chunk has the following format:
 0000: char[4] 'PMGI'
 0004: DWORD Length of quickref/free area at end of directory chunk
 0008: Directory index entries (to quickref/free area)
 The quickref area in a PMGI is the same as in a PMGL.
 The format of a directory index entry is as follows:
 BYTE: length of name
 BYTEs: name (UTF-8 encoded)
 ENCINT: directory listing chunk which starts with name
 Encoded Integers aka ENCINT: An ENCINT is a variable-length integer.
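
The summary stops at "variable-length integer". In commonly cited descriptions of the CHM format, each ENCINT byte carries 7 bits of payload, the most significant group comes first, and a set high bit means another byte follows. The sketch below assumes that encoding; EncIntSketch is a hypothetical name and this is not Tika's decoding code.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch of ENCINT decoding; the encoding rules are an assumption. */
final class EncIntSketch {
    /** Reads one ENCINT starting at the buffer's current position. */
    static long readEncInt(ByteBuffer buf) {
        long value = 0;
        // 9 bytes * 7 bits = 63 bits, which still fits in a non-negative long.
        for (int i = 0; i < 9; i++) {
            int b = buf.get() & 0xFF;          // throws BufferUnderflowException if truncated
            value = (value << 7) | (b & 0x7F); // append the 7 payload bits
            if ((b & 0x80) == 0) {             // high bit clear: this was the last byte
                return value;
            }
        }
        throw new IllegalArgumentException("ENCINT longer than 9 bytes");
    }
}
```

Under that assumption the two bytes 0xEA 0x15 decode to ((0xEA & 0x7F) << 7) | 0x15 = 0x3515.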

ChmPmglHeader

Description There are two types of directory chunks -- index chunks, and
 listing chunks.

ChmSection
 
ChmWrapper
 
ChunkingFactory

This class is used to create an instance of AbstractChunking.

ChunkingMethod
 
CJKBigramAwareLengthFilterFactory

Creates a very narrowly focused TokenFilter that limits tokens based on length
 _unless_ they've been identified as <DOUBLE> or <SINGLE>
 by the CJKBigramFilter.

ClassLoaderUtil
 
ClassParser

Parser for Java .class files.

CleanPhoneText

Class to help de-obfuscate phone numbers in text.

ClearByAttachmentTypeMetadataFilter

This class clears the entire metadata object if the
 attachment type matches one of the types.

ClearByMimeMetadataFilter

This class clears the entire metadata object if the
 mime matches the mime filter.

Client2CertificateCredentialsConfig
 
ClientCertificateCredentialsConfig
 
ClientSecretCredentialsConfig
 
ClimateForcast

Met keys from NCAR CCSM files in the Climate Forecast Convention.

ColInfo
 
Cols
 
CommandLineParserBuilder

Reads configurable options from a config file and returns org.apache.commons.cli.Options
 object to be used in commandline parser.

CommonsDigester

Implementation of DigestingParser.Digester
 that relies on commons.codec.digest.DigestUtils to calculate digest hashes.

CommonsDigester.DigestAlgorithm
 
CommonsDigesterFactory

Simple factory for CommonsDigester with
 default markLimit = 1000000 and md5 digester.

CommonTokenCountManager
 
CommonTokenOverlapCounter
 
CommonTokenResult
 
CommonTokens
 
CommonTokensBhattacharyya
 
CommonTokensCosine
 
CommonTokensHellinger
 
CommonTokensKLDivergence
 
CommonTokensKLDNormed
 
Compact64bitInt

A 9-byte encoding of values in the range 0x0002000000000000 through 0xFFFFFFFFFFFFFFFF

CompactID

This class is used to represent the CompactID structure.

CompareUtils
 
CompositeDetector

Content type detector that combines multiple different detection mechanisms.

CompositeDigester
 
CompositeEncodingDetector
 
CompositeExternalParser

A Composite Parser that wraps up all the available External Parsers,
 and provides an easy way to access them.

CompositeMatcher

Composite XPath evaluation state.

CompositeMetadataFilter
 
CompositeParseContextConfig
 
CompositeParser

Composite parser that delegates parsing tasks to a component parser
 based on the declared content type of the incoming document.

CompositePipesReporter
 
CompositeRenderer
 
CompositeTagHandler

Takes an array of ID3Tags in preference order, and when asked for
 a given tag, will return it from the first ID3Tags that has it.

CompositeTextStatsCalculator
 
CompressorConstants
 
CompressorParser

Parser for various compression formats.

CompressorParserOptions

Interface for setting options for the CompressorParser by passing
 via the ParseContext.

ConcurrentUtils

Utility Class for Concurrency in Tika

ConfigBase
 
ConfigurableThreadPoolExecutor

Allows Thread Pool to be Configurable.

ConsumersManager

Simple interface around a collection of consumers that allows
 for initializing and shutting down shared resources (e.g. db connection, index, writer, etc.)

ContainerExtractor

Tika container extractor interface.

ContentHandlerDecorator

Decorator base class for the ContentHandler interface.

ContentHandlerDecoratorFactory
 
ContentHandlerExample

Examples of using different Content Handlers to
 get different parts of the file's contents

ContentHandlerFactory

Interface to allow easier injection of code for getting a new ContentHandler

ContentLengthCalculator
 
ContentTagParser
 
ContentTags
 
ContrastStatistics
 
CoreNLPNERecogniser

This class offers an implementation of NERecogniser based on
 CRF classifiers from Stanford CoreNLP.

CorruptedFileException

This exception should be thrown when the parse absolutely, positively has to stop.

CreativeCommons

A collection of Creative Commons properties names.

CryptoParser

Decrypts the incoming document stream and delegates further parsing to
 another parser instance.

CSVMessageBodyWriter
 
CSVParams
 
CSVPipesIterator

Iterates through a UTF-8 CSV file.

CSVResult
 
CTAKESAnnotationProperty

This enumeration includes the properties that an IdentifiedAnnotation object can provide.

CTAKESConfig

Configuration for CTAKESContentHandler.

CTAKESContentHandler

Class used to extract biomedical information while parsing.

CTAKESParser

CTAKESParser decorates a Parser and leverages
 CTAKESContentHandler to extract biomedical information from
 clinical text using Apache cTAKES.

CTAKESSerializer

Enumeration for types of cTAKES (UIMA) CAS serializer supported by cTAKES.

CTAKESUtils

This class provides methods to extract biomedical information from plain text
 using CTAKESContentHandler that relies on Apache cTAKES.

CustomMimeInfo
 
Database
 
DataElement
 
DataElementData

Base class of data element

DataElementHash

Specifies a data element hash stream object

DataElementPackage
 
DataElementParseErrorException
 
DataElementType

The enumeration of the data element type

DataElementUtils
 
DataHashObject
 
DataNodeObjectData

Data Node Object data

DataSizeObject

Data Size Object

DataURIScheme
 
DataURISchemeParseException
 
DataURISchemeUtil

Not thread safe.

DateNormalizingMetadataFilter

Some dates in some file formats do not have a timezone.

DateUtils

Date related utility methods and constants

DBBuffer
 
DBConsumersManager
 
DBFParser

This is a Tika wrapper around the DBFReader.

DBWriter

This is still in its early stages.

DcXMLParser

Dublin Core metadata parser

DefaultContentHandlerFactoryBuilder

Builds BasicContentHandler with type defined by attribute "basicHandlerType"
 with possible values: xml, html, text, body, ignore.

DefaultDetector

A composite detector based on all the Detector implementations
 available through the service provider mechanism.

DefaultEmbeddedStreamTranslator

Loads EmbeddedStreamTranslators via service loading.

DefaultEncodingDetector

A composite encoding detector based on all the EncodingDetector implementations
 available through the service provider mechanism.

DefaultHtmlMapper

The default HTML mapping rules in Tika.

DefaultInputStreamFactory

Passthrough -- returns InputStream as is

DefaultMetadataFilter
 
DefaultParser

A composite parser based on all the Parser implementations
 available through the
 service provider mechanism.

DefaultProbDetector

A version of DefaultDetector for probabilistic mime
 detectors, which use statistical techniques to blend the
 results of differing underlying detectors when attempting
 to detect the type of a given file.

DefaultTranslator

A translator which picks the first available Translator
 implementations available through the
 service provider mechanism.

DefaultZipContainerDetector
 
DelegatingParser

Base class for parser implementations that want to delegate parts of the
 task of parsing an input document to another parser.

DeprecatedStreamingZipContainerDetector
 
DeprecatedZipContainerDetector

A detector that works on Zip documents and tries to figure out
 basic types -- epub, jar, ear, war, kmz and StarOffice

DescribeMetadata

Print the supported Tika Metadata models and their fields.

Detector

Content type detector.

DetectorResource
 
DGN8Parser

This is a VERY LIMITED parser.

DIFContentHandler
 
DIFContentHandler
 
DIFParser
 
DigestingAutoDetectParserFactory
 
DigestingParser
 
DigestingParser.Digester

Interface for digester.

DigestingParser.DigesterFactory

This is used in AutoDetectParserConfig to (optionally)
 wrap the parser in a digesting parser.

DigestingParser.Encoder

Encodes byte array from a MessageDigest to String

DirectoryListingEntry

The format of a directory listing entry is as follows:
 BYTE: length of name
 BYTEs: name (UTF-8 encoded)
 ENCINT: content section
 ENCINT: offset
 ENCINT: length
 The offset is from the beginning of the content section the file is
 in, after the section has been decompressed (if appropriate).

DirListParser

Parses the output of /bin/ls and counts the number of files and the number of
 executables using Tika.

DisplayMetInstance

Grabs a PDF file from a URL and prints its Metadata

DL4JInceptionV3Net

DL4JInceptionV3Net is an implementation of ObjectRecogniser.

DL4JVGG16Net
 
DocumentSelector

Interface for different document selection strategies for purposes like
 embedded document extraction by a ContainerExtractor instance.

DocumentSelectorConfig
 
DublinCore

A collection of Dublin Core metadata names.

DumpTikaConfigExample

This class shows how to dump a TikaConfig object to a configuration file.

DurationFormatUtils

Functionality and naming conventions (roughly) copied from org.apache.commons.lang3
 so that we didn't have to add another dependency.

DWGParser

DWG (CAD Drawing) parser.

DWGParserConfig
 
DWGReadFormatRemover

DWGReadFormatRemover removes the formatting from the text from libredwg files so only
 the raw text remains.

DWGReadParser

DWGReadParser (CAD Drawing) parser.

EightBytesOfData

This class is used to represent a property that contains 8 bytes of data in the PropertySet.rgData stream field.

ElementMappingContentHandler

Content handler decorator that maps element QNames using
 a Map.

ElementMappingContentHandler.TargetElement
 
ElementMatcher

Final evaluation state of an XPath expression that targets an element.

ElementMetadataHandler

SAX event handler that maps the contents of an XML element into
 a metadata field.

EmailVisitor
 
EmbeddedBytesSelector
 
EmbeddedBytesSelector.AcceptAll
 
EmbeddedContentHandler

Content handler decorator that prevents the EmbeddedContentHandler.startDocument()
 and EmbeddedContentHandler.endDocument() events from reaching the decorated handler.

EmbeddedDocumentBytesConfig
 
EmbeddedDocumentBytesConfig.SUFFIX_STRATEGY
 
EmbeddedDocumentBytesHandler
 
EmbeddedDocumentByteStoreExtractorFactory

Factories that create EmbeddedDocumentExtractors requiring an
 EmbeddedDocumentBytesHandler in the
 ParseContext should extend this.

EmbeddedDocumentExtractor
 
EmbeddedDocumentExtractorFactory
 
EmbeddedDocumentUtil

Utility class to handle common issues with embedded documents.

EmbeddedPartMetadata

This class records metadata about embedded parts that exists in the xml
 of the main document.

EmbeddedResourceHandler

Tika container extractor callback interface.

EmbeddedStreamTranslator

Interface for different filtering of embedded streams.

Embedder

Tika embedder interface

EMFParser

Extracts files embedded in EMF and offers a
 very rough capability to extract text if there
 is text stored in the EMF.

EmitData
 
EmitKey
 
Emitter
 
EmitterManager

Utility class that will apply the appropriate fetcher
 to the fetcherString based on the prefix.

EmittingEmbeddedDocumentBytesHandler
 
EmptyDetector

Dummy detector that returns application/octet-stream for all documents.

EmptyEmitter
 
EmptyFetcher
 
EmptyParser

Dummy parser that always produces an empty XHTML document without even
 attempting to parse the given document stream.

EmptyTranslator

Dummy translator that always declines to give any text.

EncodingDetector

Character encoding detector.

EncryptedDocumentException
 
EncryptedPrescriptionDetector
 
EncryptedPrescriptionParser
 
EndDocumentShieldingContentHandler

A wrapper around a ContentHandler which will ignore normal
 SAX calls to EndDocumentShieldingContentHandler.endDocument(), and only fire them later.

EndianUtils

General Endian Related Utilities.

EndianUtils.BufferUnderrunException
 
EnviHeaderParser
 
Epub

EPub properties collection.

EpubContentParser

Parser for EPUB OPS *.html files.

EpubParser

Epub parser

Error
 
ErrorParser

Dummy parser that always throws a TikaException without even
 attempting to parse the given document stream.

EvalConsumerBuilder
 
EvalConsumersBuilder
 
EvalExceptionUtils
 
EvilCOSWriter
 
ExcelExtractor

Excel parser implementation which uses POI's Event API
 to handle the contents of a Workbook.

ExceptionUtils
 
ExcludeFieldMetadataFilter
 
ExecutableParser

Parser for executable files.

ExGuid
 
ExGUIDArray
 
ExpandedTitleContentHandler

Content handler decorator which wraps a TransformerHandler in order to
 allow the TITLE tag to render as <title></title>
 rather than <title/> which is accomplished
 by calling the ContentHandler.characters(char[], int, int) method
 with a length of 1 but a zero length char array.

ExtendedGUID
 
ExternalEmbedder

Embedder that uses an external program (like sed or exiftool) to embed text
 content and metadata into a given document.

ExternalParser

Parser that uses an external program (like catdoc or pdf2txt) to extract
 text content and metadata from a given document.

ExternalParser

This is a next generation external parser that uses some of the more
 recent additions to Tika.

ExternalParser.LineConsumer

Consumer contract

ExternalParsersConfigReader

Builds up ExternalParser instances based on XML file(s)
 which define what to run, for what, and how to process
 any output metadata.

ExternalParsersConfigReaderMetKeys

Met Keys used by the ExternalParsersConfigReader.

ExternalParsersFactory

Creates instances of ExternalParser based on XML
 configuration files.

ExternalProcess
 
ExternalTranslator

Abstract class used to interact with command line/external Translators.

ExtractComparer
 
ExtractComparerBuilder
 
ExtractEmbeddedFiles
 
ExtractProfiler
 
ExtractProfilerBuilder
 
ExtractReader
 
ExtractReader.ALTER_METADATA_LIST
 
ExtractReaderException

Exception when trying to read extract

ExtractReaderException.TYPE
 
FailedToStartClientException

This should be catastrophic

FallbackParser

Tries multiple parsers in turn, until one succeeds.
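
The "try each in turn until one succeeds" pattern can be sketched generically in plain Java. FallbackSketch and firstSuccess are hypothetical names; a real parser would also have to deal with partially written handler output and metadata, which this sketch ignores.

```java
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch of a fallback chain; not Tika's FallbackParser. */
final class FallbackSketch {
    /** Returns the result of the first attempt that does not throw. */
    static <T> T firstSuccess(List<Supplier<T>> attempts) {
        RuntimeException last = null;
        for (Supplier<T> attempt : attempts) {
            try {
                return attempt.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and move on to the next attempt
            }
        }
        // Every attempt failed (or none was supplied): surface the last failure.
        throw last != null ? last : new IllegalArgumentException("no attempts supplied");
    }
}
```

Keeping only the last exception is a simplification chosen for brevity; collecting all failures would make diagnosis easier.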

FeedParser

Feed parser.

FetchEmitTuple
 
FetchEmitTuple.ON_PARSE_EXCEPTION
 
Fetcher

Interface for an object that will fetch an InputStream given
 a fetch string.

FetcherConfigContainer
 
FetcherManager

Utility class to hold multiple fetchers.

FetcherStreamFactory

This class looks for "fetcherName" in the http header.

FetcherStringException

If something goes wrong in parsing the fetcher string

FetchKey

Pair of fetcherName (which fetcher to call) and the key
 to send to that fetcher to retrieve a specific file.

FictionBookParser
 
Field

Field annotation is a contract for binding Param value from
 Tika Configuration to an object.

FieldNameMappingFilter
 
FileCommandDetector

This runs the Linux 'file' command against a file.

FileListPipesIterator

Reads a list of file names/relative paths from a UTF-8 file.

FilenameUtils
 
FileProcessResult
 
FileProfiler

This class profiles actual files as opposed to extracts e.g.

FileProfilerBuilder
 
FileResource

This is a basic interface to handle a logical "file".

FileResourceConsumer

This is a base class for file consumers.

FileResourceCrawler
 
FileSystem

A collection of metadata elements for file system level metadata

FileSystemEmitter

Emitter to write to a file system.

FileSystemFetcher
 
FileSystemFetcherConfig
 
FileSystemPipesIterator
 
FileSystemStatusReporter

This is intended to write summary statistics to disk
 periodically.

FileTooLongException
 
FlatOpenDocumentParser
 
FLVParser

Parser for metadata contained in Flash Videos (.flv).

Font
 
ForkParser
 
ForkProxy
 
ForkResource
 
FormattingUtils
 
FormattingUtils.Tag
 
FourBytesOfData

This class is used to represent a property that contains 4 bytes of data in the PropertySet.rgData stream field.

FrictionlessPackageDetector
 
FSBatchProcessCLI
 
FSConsumersManager
 
FSCrawlerBuilder

Builds either an FSDirectoryCrawler or an FSListCrawler.

FSDirectoryCrawler
 
FSDirectoryCrawler.CRAWL_ORDER
 
FSDocumentSelector

Selector that chooses files based on their file name
 and their size, as determined by TikaCoreProperties.RESOURCE_NAME_KEY and Metadata.CONTENT_LENGTH.

FSFileResource

FileSystem(FS)Resource wraps a file name.

FSListCrawler

Class that "crawls" a list of files.

FSOutputStreamFactory
 
FSOutputStreamFactory.COMPRESSION
 
FSProperties
 
FSUtil

Utility class to handle some common issues when
 reading from and writing to a file system (FS).

FSUtil.HANDLE_EXISTING
 
FuzzingCLI
 
FuzzingCLIConfig
 
FuzzOne

Forked process that runs against a single input file

GCSEmitter
 
GCSFetcher

Fetches files from Google Cloud Storage.

GCSFetcherConfig
 
GCSPipesIterator
 
GDALParser

Wraps execution of the Geospatial Data Abstraction
 Library (GDAL) gdalinfo tool used to extract geospatial
 information out of hundreds of geo file formats.

GeneralTransformer
 
GenericConverter

Tries to convert as many of the properties in the Metadata map as possible to XMP namespaces.

GeoGazetteerClient
 
Geographic

Geographic schema.

GeographicInformationParser
 
GeoParser
 
GeoParserConfig
 
GeoPkgParser

Customization of sqlite parser to skip certain common blob columns.

GeoPointMetadataFilter

If Metadata contains a TikaCoreProperties.LATITUDE and
 a TikaCoreProperties.LONGITUDE, this filter concatenates those with a
 comma in the order LATITUDE,LONGITUDE.

GeoTag
 
GlobalIdTableEntry3FNDX
 
GlobalIdTableEntryFNDX
 
GoogleTranslator

An implementation of a REST client to the Google Translate v2
 API.

GrabPhoneNumbersExample

Class to demonstrate how to use the PhoneExtractingContentHandler
 to get a list of all of the phone numbers from every file in a directory.

GribParser
 
GrobidNERecogniser
 
GrobidRESTParser
 
GUID
 
GuidUtil
 
GZipSpecializationDetector

This is designed to detect commonly gzipped file types such as warc.gz.

H2Util
 
HandlerConfig
 
HandlerConfig.PARSE_MODE

HandlerConfig.PARSE_MODE.RMETA "recursive metadata" is the same as the -J option
 in tika-app and the /rmeta endpoint in tika-server.

HDFParser

Since the NetCDFParser depends on the NetCDF-Java API,
 we are able to use it to parse HDF files as well.

HeaderCell
 
HeifParser
 
HexCoDec

A set of Hex encoding and decoding utility methods.

HSLFExtractor
 
HTML
 
HtmlEncodingDetector

Character encoding detector for determining the character encoding of a
 HTML document based on the potential charset parameter found in a
 Content-Type http-equiv meta tag somewhere near the beginning.
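
A rough sketch of this kind of detection, using only the JDK, is given here. HtmlCharsetSketch, the 8192-byte prefix limit, and the regular expression are all assumptions made for illustration and are not taken from Tika; the pattern also accepts the shorter meta charset form, which the summary above does not mention.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch; not Tika's HtmlEncodingDetector. */
final class HtmlCharsetSketch {
    // Assumed limit on how far into the document to look.
    private static final int MAX_PREFIX = 8192;

    // Matches charset=... inside a <meta ...> tag, case-insensitively.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\s+[^>]*?charset\\s*=\\s*[\"']?([a-z0-9_.:-]+)",
            Pattern.CASE_INSENSITIVE);

    static Optional<Charset> detect(byte[] html) {
        int len = Math.min(html.length, MAX_PREFIX);
        // ISO-8859-1 maps every byte to one char, so this decoding never fails.
        String prefix = new String(html, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(prefix);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(m.group(1)));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return Optional.empty();
        }
    }
}
```

Returning an empty Optional when nothing usable is found leaves the caller free to fall back to another detector.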

HTMLHelper

Helps produce user facing HTML output.

HtmlMapper

HTML mapper used to make incoming HTML documents easier to handle by
 Tika clients.

HttpClientFactory

This holds quite a bit of state and is not thread safe.

HttpClientUtil
 
HttpFetcher

Based on Apache httpclient

HttpFetcherConfig
 
HttpHeaders

A collection of HTTP header names.

HttpHeaders
 
HttpParser
 
HwpStreamReader
 
HwpTextExtractorV5
 
HwpV5Parser
 
ICNSParser

A basic parser class for Apple ICNS icon files

IContentHandlerFactoryBuilder
 
ICrawlerBuilder
 
Icu4jEncodingDetector
 
ID3Tags

Interface that defines the common interface for ID3 tag parsers,
 such as ID3v1 and ID3v2.3.

ID3Tags.ID3Comment

Represents a comment in ID3 (especially ID3 v2), which is
 made up of several parts

ID3v1Handler

This is used to parse ID3 Version 1 Tag information from an MP3 file,
 if available.

ID3v22Handler

This is used to parse ID3 Version 2.2 Tag information from an MP3 file,
 if available.

ID3v23Handler

This is used to parse ID3 Version 2.3 Tag information from an MP3 file,
 if available.

ID3v24Handler

This is used to parse ID3 Version 2.4 Tag information from an MP3 file,
 if available.

ID3v2Frame

A frame of ID3v2 data, which is then passed to a handler to
 be turned into useful data.

ID3v2Frame.RawTag
 
ID3v2Frame.TextEncoding
 
IDBWriter
 
IdentityHtmlMapper

Alternative HTML mapping rules that pass the input HTML as-is without any
 modifications.

IDMLParser

Adobe InDesign IDML Parser.

IFileProcessorFutureResult

stub interface to allow for different result types from different processors

IFSSHTTPBSerializable

FSSHTTPB Serialize interface.

ImageDeskew
 
ImageDeskew.HoughLine
 
ImageGraphicsEngine

Copied nearly verbatim from PDFBox

ImageGraphicsEngineFactory
 
ImageMetadataExtractor

Uses the Metadata Extractor library
 to read EXIF and IPTC image metadata and map to Tika fields.

ImageParser
 
ImageUtil
 
ImportContextImpl

ImportContextImpl...

IncludeFieldMetadataFilter
 
IncrementalUpdateRecord
 
Initializable

Components that must do special processing across multiple fields
 at initialization time should implement this interface.

InitializableProblemHandler

This is to be used to handle potential recoverable problems that
 might arise during initialization.

InputStreamDigester
 
InputStreamFactory

A factory which returns a fresh InputStream for the same
 resource each time.

InputStreamFactory

Interface to allow for custom/consistent creation of InputStream

IntermediateNodeObject
 
IntermediateNodeObject.RootNodeObjectBuilder

The class is used to build a root node object.

InterruptableParsingExample

This example demonstrates how to interrupt document parsing if
 some condition is met.

Interrupter

Class that waits for input on System.in.

InterrupterBuilder

Builds an Interrupter

InterrupterFutureResult
 
IOUtils
 
IPADetector
 
IParserFactoryBuilder
 
IProperty

The interface of the property in OneNote file.

IPTC

IPTC photo metadata schema.

IptcAnpaParser

Parser for IPTC ANPA News Wire Feeds

ISArchiveParser
 
ISATabUtils
 
IsIncrementalUpdate
 
ITikaToXMPConverter

Interface for the specific Metadata to XMP converters

IWork13PackageParser
 
IWork13PackageParser.IWork13DocumentType
 
IWork18PackageParser

For now, this parser isn't even registered.

IWork18PackageParser.IWork18DocumentType
 
IWorkDetector
 
IWorkPackageParser

A parser for the IWork container files.

IWorkPackageParser.IWORKDocumentType
 
JackcessParser

Parser that handles Microsoft Access files via
 Jackcess

JarDetector
 
JCID

This class is used to represent a JCID

JCIDObject

This class is used to represent the JCID object.

JDBCEmitter

This is only an initial, basic implementation of an emitter for JDBC.

JDBCEmitter.AttachmentStrategy
 
JDBCEmitter.MultivaluedFieldStrategy
 
JDBCPipesIterator

Iterates through the results of a SQL call via JDBC.

JDBCPipesReporter

This is an initial draft of a JDBCPipesReporter.

JDBCTableReader

General base class to iterate through rows of a JDBC table

JDBCUtil
 
JDBCUtil.CREATE_TABLE
 
JempboxExtractor
 
JoshuaNetworkTranslator

This translator is designed to work with a TCP-IP available
 Joshua translation server, specifically the
 REST-based Joshua server.

JournalParser
 
JpegParser
 
JsonEmitData
 
JsonFetchEmitTuple
 
JsonFetchEmitTupleList
 
JSONMessageBodyWriter
 
JsonMetadata
 
JsonMetadataList
 
JSONObjWriter
 
JsonPipesIterator

Iterates through a UTF-8 text file with one FetchEmitTuple
 json object per line.

JsonResponse
 
JsonResponse
 
JsonStreamingSerializer
 
JSoupParser

HTML parser.

JwtCreds
 
JwtGenerator
 
JwtPrivateKeyCreds
 
JwtSecretCreds
 
JXLParser

Tries to scrape XMP out of JXL

KafkaEmitter

Emits the now-parsed documents into a specified Apache Kafka topic.

KafkaPipesIterator
 
KMZDetector
 
LangModel
 
Language
 
LanguageAwareTokenCountStats<T>

Interface for calculators that require language probabilities and token stats

LanguageConfidence
 
LanguageDetectingParser
 
LanguageDetector
 
LanguageDetectorExample
 
LanguageDetectorTest
 
LanguageHandler

SAX content handler that updates a language detector based on all the
 received character content.

LanguageIdentifier

Identifier of the language that best matches a given content profile.

LanguageIDWrapper
 
LanguageNames

Support for language tags (as defined by https://tools.ietf.org/html/bcp47)

LanguageProfile

Language profile based on ngram counts.
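
Counting character ngrams is simple to sketch. NgramProfileSketch is a hypothetical name, and both the lower-casing step and the choice of n are assumptions made for illustration rather than details of LanguageProfile.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of an ngram count table; not Tika's LanguageProfile. */
final class NgramProfileSketch {
    /** Counts every run of n consecutive characters in the text. */
    static Map<String, Integer> count(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be positive");
        }
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase(Locale.ROOT); // assumed normalization step
        for (int i = 0; i + n <= s.length(); i++) {
            counts.merge(s.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }
}
```

For the word "banana" and n = 3, the windows are "ban", "ana", "nan", "ana", so "ana" is counted twice.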

LanguageProfilerBuilder

This class runs an ngram analysis over submitted text; the results might be used
 for automatic language identification.

LanguageResource
 
LanguageResult
 
LanguageWriter

Writer that builds a language profile based on all the written content.

Latin1StringsParser

Parser to extract printable Latin1 strings from arbitrary files with pure java
 without running any external process.

LeafNodeObject
 
LeafNodeObject.IntermediateNodeObjectBuilder

The class is used to build an intermediate node object.

LeipzigHelper
 
LeipzigSampler
 
LibPstParser

This is an optional PST parser that relies on the user installing
 the GPL-3 libpst/readpst commandline tool and configuring
 Tika to call this library via tika-config.xml

LibPstParserConfig
 
Lingo24LangDetector

An implementation of a Language Detector using the
 Premium MT API v1.

Lingo24Translator

An implementation of a REST client for the
 Premium MT API v1.

Link
 
LinkContentHandler

Content handler that collects links from an XHTML document.

LinkedCell

Linked cell.

ListDescriptor

Contains the information for a single list in the list or list override tables.

ListManager

Computes the number text which goes at the beginning of each list paragraph

LittleEndianBitConverter

Implements a converter which converts to/from little-endian byte arrays
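
Little-endian conversion of a 32-bit integer can be sketched with java.nio.ByteBuffer from the JDK. LittleEndianSketch and its method names are hypothetical and do not mirror the methods of LittleEndianBitConverter.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical sketch; not Tika's LittleEndianBitConverter. */
final class LittleEndianSketch {
    /** Encodes an int as 4 bytes, least significant byte first. */
    static byte[] toBytes(int value) {
        return ByteBuffer.allocate(Integer.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putInt(value)
                .array();
    }

    /** Decodes 4 little-endian bytes starting at offset. */
    static int toInt32(byte[] bytes, int offset) {
        return ByteBuffer.wrap(bytes, offset, Integer.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt();
    }
}
```

With this sketch, 0x01020304 is stored as the bytes 04 03 02 01.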

LoadErrorHandler

Interface for error handling strategies in service class loading.

Location
 
LoggingPipesReporter

Simple PipesReporter that logs everything at the debug level.

LookaheadInputStream

Stream wrapper that makes it easy to read up to n bytes ahead from
 a stream that supports the mark feature.

LuceneIndexer
 
LuceneIndexerExtended
 
LyricsHandler

This is used to parse Lyrics3 tag information
 from an MP3 file, if available.

MachineMetadata

Metadata for describing machines, such as their
 architecture, type and endian-ness

MachineMetadata.Endian
 
MagicDetector

Content type detection based on magic bytes, i.e. type-specific patterns
 near the beginning of the document input stream.

MailDateParser

Dates in emails are a mess.

MailUtil
 
MarianTranslator

Translator that uses the Marian NMT decoder for translation.

MarianTranslator.MarianServerClient

Internal Client for marian-server Web Socket Server.

Matcher

XPath element matcher.

MatchingContentHandler

Content handler decorator that only passes the elements, attributes,
 and text nodes that match the given XPath expression.

MatParser
 
MboxParser

Mbox (mailbox) parser.

MediaType

Internet media type.

MediaTypeExample
 
MediaTypeRegistry

Registry of known Internet media types.

Message

A collection of Message related property names.

Metadata

A multi-valued metadata container.

MetadataAwareLuceneIndexer

Builds on the LuceneIndexer from Chapter 5 and adds indexing of Metadata.

MetadataExtractor

OOXML metadata extractor.

MetadataFields

Knows about all declared Metadata fields.

MetadataFilter

Filters the metadata in place after the parse

MetadataHandler
Deprecated.
Use the AttributeMetadataHandler and
 ElementMetadataHandler classes instead

MetadataList

wrapper class to make isWriteable in MetadataListMBW simpler

MetadataListMessageBodyWriter
 
MetadataResource
 
MetadataWriteFilter
 
MetadataWriteFilterFactory
 
MicrosoftGraphFetcher

Fetches files from Microsoft Graph API.

MicrosoftGraphFetcherConfig
 
MicrosoftTranslator

Wrapper class to access the Windows translation service.

MidiParser
 
MIFContentHandler

Content handler for MIF Content and Metadata.

MIFExtractor

Helper Class to Parse and Extract Adobe MIF Files.

MIFParser
 
MimeBuffer
 
MimeType

Internet media type.

MimeTypeException

A class to encapsulate MimeType related exceptions.

MimeTypes

This class is a MimeType repository.

MimeTypesFactory

Creates instances of MimeTypes.

MimeTypesReader

A reader for XML files compliant with the freedesktop MIME-info DTD.

MimeTypesReaderMetKeys

Met Keys used by the MimeTypesReader.

MiscOLEDetector

A detector that works on a POIFS OLE2 document
 to figure out exactly what the file is.

MITIENERecogniser

This class offers an implementation of NERecogniser based on
 trained models using state-of-the-art information extraction tools.

MosesTranslator

Translator that uses the Moses decoder for translation.

MP3Frame

A frame in an MP3 file, such as ID3v2 Tags or some
 audio.

Mp3Parser

The Mp3Parser is used to parse ID3 Version 1 Tag information
 from an MP3 file, if available.

Mp3Parser.ID3TagsAndAudio
 
MP4Parser

Parser for the MP4 media container format, as well as the older
 QuickTime format that MP4 is based on.

MSEmbeddedStreamTranslator
 
MSOfficeBinaryConverter

Tika to XMP mapping for the binary MS formats Word (.doc), Excel (.xls) and PowerPoint (.ppt).

MSOfficeXMLConverter

Tika to XMP mapping for the Office Open XML formats Word (.docx), Excel (.xlsx) and PowerPoint
 (.pptx).

MSOneStorePackage
 
MSOneStoreParser
 
MSOwnerFileParser

Parser for temporary MSOffice files.

MuPDFRenderer
 
MyFirstTika

Demonstrates how to call the different components within Tika: its
 Detector framework (aka MIME identification and repository), its
 Parser interface, its org.apache.tika.language.LanguageIdentifier and other goodies.

NamedAttributeMatcher

Final evaluation state of a ...


NamedElementMatcher

Intermediate evaluation state of a ...


NamedEntityParser

This implementation of Parser extracts
 entity names from text content and adds it to the metadata.

NameDetector

Content type detection based on the resource name.

NameEntityExtractor
 
Namespace

Utility class to hold namespace information.

NERecogniser

Defines a contract for named entity recogniser.

NetCDFParser

A Parser for NetCDF
 files using the UCAR, MIT-licensed NetCDF for Java
 API.

NetworkParser
 
NLTKNERecogniser

This class offers an implementation of NERecogniser based on
 ne_chunk() module of NLTK.

NNExampleModelDetector
 
NNTrainedModel
 
NNTrainedModelBuilder
 
NoData

This class is used to represent a property that contains no data.

NodeMatcher

Final evaluation state of a ...


NodeObject
 
NonDetectingEncodingDetector

Always returns the charset passed in via the initializer

NoOpFilter

This filter performs no operations on the metadata
 and leaves it untouched.

NoTextPDFRenderer

This class extends the PDFRenderer to exclude rendering of electronic text.

NSNormalizerContentHandler

Content handler decorator that:
 Maps old OpenOffice 1.0 Namespaces to the OpenDocument ones
 Returns a fake DTD when parser requests OpenOffice DTD
 

NumberCell

Number cell.

ObjectFromDOMAndQueueBuilder<T>

Same as ObjectFromDOMBuilder,
 but this is for objects that require access to the shared queue.

ObjectFromDOMBuilder<T>

Interface for things that build objects from a DOM Node and a map of runtime attributes

ObjectGroupData

The ObjectGroupData class.

ObjectGroupDataElementData
 
ObjectGroupDataElementData.Builder

The internal class for building a list of DataElement from a node object.

ObjectGroupDeclarations

Object Group Declarations

ObjectGroupMetadata

Specifies an object group metadata

ObjectGroupMetadataDeclarations

Object Metadata Declaration

ObjectGroupObjectBLOBDataDeclaration

object data BLOB declaration

ObjectGroupObjectData
 
ObjectGroupObjectDataBLOBReference

object data BLOB reference

ObjectGroupObjectDeclare
 
ObjectRecogniser

This is a contract for object recognisers used by ObjectRecognitionParser

ObjectRecognitionParser

This parser recognises objects from Images.

ObjectSpaceObjectPropSet

This class is used to represent an ObjectSpaceObjectPropSet.

ObjectSpaceObjectPropSet
 
ObjectSpaceObjectStreamHeader
 
ObjectSpaceObjectStreamOfContextIDs

This class is used to represent an ObjectSpaceObjectStreamOfContextIDs.

ObjectSpaceObjectStreamOfOIDs

This class is used to represent an ObjectSpaceObjectStreamOfOIDs.

ObjectSpaceObjectStreamOfOSIDs

This class is used to represent an ObjectSpaceObjectStreamOfOSIDs.

OCRPageCounter

This counts the number of pages on which OCR would have been
 run or was run, depending on the settings.

OfferLargerThanQueueSize
 
Office

Office Document properties collection.

OfficeOpenXMLCore

Core properties as defined in the Office Open XML specification part Two that are not
 in the DublinCore namespace.

OfficeOpenXMLExtended

Extended properties as defined in the Office Open XML specification part Four.

OfficeParser

Defines a Microsoft document content extractor.

OfficeParser.POIFSDocumentType
 
OfficeParserConfig
 
OfflineContentHandler

Content handler decorator that always returns an empty stream from the
 OfflineContentHandler.resolveEntity(String, String) method to prevent potential
 network or other external resources from being accessed by an XML parser.

OldExcelParser

A POI-powered Tika Parser for very old versions of Excel, from
 pre-OLE2 days, such as Excel 4.

OneByteOfData

This class is used to represent a property that contains 1 byte of data in the PropertySet.rgData stream field.

OneNoteParser

OneNote Tika parser capable of parsing Microsoft OneNote files.

OneNotePropertyEnum
 
OneNoteTreeWalkerOptions

Options when walking the OneNote tree.

OOXMLExtractor

Interface implemented by all Tika OOXML extractors.

OOXMLExtractorFactory

Figures out the correct OOXMLExtractor for the supplied document and
 returns it.

OOXMLParser

Office Open XML (OOXML) parser.

OOXMLTikaBodyPartHandler
 
OOXMLWordAndPowerPointTextHandler

This class is intended to handle anything that might contain IBodyElements:
 main document, headers, footers, notes, slides, etc.

OOXMLWordAndPowerPointTextHandler.EditType
 
OOXMLWordAndPowerPointTextHandler.XWPFBodyContentsHandler
 
OPCPackageDetector
 
OPCPackageWrapper

This is a wrapper around OPCPackage that calls revert() instead of close().

OpenDocumentContentParser

Parser for ODF content.xml files.

OpenDocumentConverter

Tika to XMP mapping for the Open Document formats: Text (.odt), Spreadsheet (.ods), Graphics
 (.odg) and Presentation (.odp).

OpenDocumentDetector
 
OpenDocumentMetaParser

Parser for OpenDocument meta.xml files.

OpenDocumentParser

OpenOffice parser

OpenNLPDetector


 This is based on OpenNLP's language detector.

OpenNLPMetadataFilter
 
OpenNLPNameFinder

An implementation of NERecogniser that finds names in text using Open NLP Model.

OpenNLPNERecogniser

This implementation of NERecogniser chains an array of
 OpenNLPNameFinders for which NER models are
 available in classpath.

OpenSearchClient
 
OpenSearchClient
 
OpenSearchEmitter
 
OpenSearchEmitter.AttachmentStrategy
 
OpenSearchEmitter.UpdateStrategy
 
OpenSearchPipesReporter

As of the 2.5.0 release, this is an ALPHA version.

OPFParser

Use this to parse the .opf files

OptimaizeLangDetector

Implementation of the LanguageDetector API that uses
 https://github.com/optimaize/language-detector

OptimaizeMetadataFilter
 
OutlookExtractor

Outlook Message Parser.

OutlookExtractor.RECIPIENT_TYPE
 
OutlookPSTParser

Parser for MS Outlook PST email storage files

OutputStreamFactory
 
OverrideDetector
Deprecated.
after 2.5.0 this functionality was moved to the CompositeDetector

PackageConstants
 
PackageParser

Parser for various packaging formats.

PageBasedRenderResults
 
PagedText

XMP Paged-text schema.

PageRangeRequest

The range of pages to render.

ParagraphProperties
 
ParallelFileProcessingResult
 
Param<T>

This is a serializable model class for parameters from configuration file.

ParamField

This class stores metadata for Field annotations, which is used to map them
 to Param at runtime

ParentContentHandler

Simple pointer class to allow parsers to pass on the parent content handler through
 to the embedded document's parse

ParseContext

Parse context.

ParseContextConfig

Implementations must be thread-safe!

ParseContextDeserializer
 
ParseContextSerializer
 
Parser

Tika parser interface.

ParserContainerExtractor

An implementation of ContainerExtractor powered by the regular
 Parser API.

ParserDecorator

Decorator base class for the Parser interface.

ParseRecord

Use this class to store exceptions, warnings and other information
 during the parse.

ParserFactory
 
ParserFactory
 
ParserFactoryBuilder
 
ParserFactoryFactory

Lightweight, easily serializable class that contains enough information
 to build a ParserFactory

ParserPostProcessor

Parser decorator that post-processes the results from a decorated parser.

ParserUtils

Helper util methods for Parsers themselves.

ParsingEmbeddedDocumentExtractor

Helper class for parsers of package archives or other compound document
 formats that support embedded or attached component documents.

ParsingEmbeddedDocumentExtractorFactory
 
ParsingExample
 
ParsingReader

Reader for the text content from a given binary stream.

PasswordProvider

Interface for providing a password to a Parser for handling Encrypted
 and Password Protected Documents.

PasswordProviderConfig
 
PDDocumentRenderer

stub interface for the PDFParser to use to figure out if it needs
 to pass on the PDDocument or create a temp file to be used
 by a file-based renderer down the road.

PDF

PDF properties collection.

PDFBoxRenderer
 
PDFMarkedContent2XHTML

This was added in Tika 1.24 as an alpha version of a text extractor
 that builds the text from the marked text tree and includes/normalizes
 some of the structural tags.

PDFParser

PDF parser.

PDFParserConfig

Config for PDFParser.

PDFParserConfig.IMAGE_STRATEGY
 
PDFParserConfig.OCR_RENDERING_STRATEGY
 
PDFParserConfig.OCR_STRATEGY
 
PDFParserConfig.OCRStrategyAuto

Encapsulate the numbers used to control OCR Strategy when set to auto

PDFParserConfig.TikaImageType
 
PDFRenderingState
 
PDFServerConfig

PDF parser configuration, for the request

PDFTransformer
 
PDFTransformerConfig
 
PDMetadataExtractor
 
Pharmacy
 
PhoneExtractingContentHandler

Class used to extract phone numbers while parsing.

Photoshop

XMP Photoshop metadata schema.

PickBestTextEncodingParser
Deprecated.
Currently not suitable for real use, more a demo / prototype!

PipesClient

The PipesClient is designed to be single-threaded.

PipesConfig
 
PipesConfigBase
 
PipesException

Fatal exception that means that something went seriously wrong.

PipesIterator

Abstract class that handles the testing for timeouts/thread safety
 issues.

PipesParser
 
PipesReporter

This is called asynchronously by the AsyncProcessor.

PipesReporterBase

Base class that includes filtering by PipesResult.STATUS

PipesResource
 
PipesResult
 
PipesResult.STATUS
 
PipesServer

This server is forked from the PipesClient.

PipesServer.STATUS
 
Pkcs7Parser

Basic parser for PKCS7 data.

PListParser

Parser for Apple's plist and bplist.

POIFSContainerDetector

A detector that works on a POIFS OLE2 document
 to figure out exactly what the file is.

POIXMLTextExtractorDecorator
 
PooledTimeSeriesParser

Uses the Pooled Time Series algorithm + command line tool, to
 generate a numeric representation of the video suitable for
 similarity searches.

PrescriptionParser
 
PrettyMetadataKeyComparator
 
ProbabilisticMimeDetectionSelector

Selector for combining different mime detection results
 based on probability

ProbabilisticMimeDetectionSelector.Builder

Builder class for setting probability parameters

ProcessUtils
 
ProduceTypeResourceComparator

Resource comparator based on produce type.

ProfilingWriter

Writer that builds a language profile based on all the written content.

Property

XMP property definition.

Property.PropertyType
 
Property.ValueType
 
PropertyID

This class is used to represent a PropertyID.

PropertySet

This class is used to represent a PropertySet.

PropertySetObject

This class is used to represent the property set.

PropertyType
 
PropertyTypeException

XMP property definition violation exception.

PropsUtil

Utility class to handle properties.

PrtArrayOfPropertyValues

The class is used to represent the prtArrayOfPropertyValues.

PrtFourBytesOfLengthFollowedByData

This class is used to represent the prtFourBytesOfLengthFollowedByData.

PRTParser

A basic text extracting parser for the CADKey PRT (CAD Drawing)
 format.

PSDParser

Parser for the Adobe Photoshop PSD File Format.

PST
 
PSTMailItemParser
 
QuattroPro

QuattroPro properties collection.

QuattroProParser

Parser for Corel QuattroPro documents (part of Corel WordPerfect
 Office Suite).

RangeFetcher

This class extracts a range of bytes from a given fetch key.

RarParser

Parser for Rar files.

RDCAnalysisChunking

This class is used to process RDC analysis chunking

RecentFiles

Builds on top of the LuceneIndexer and the Metadata discussions in Chapter 6
 to output an RSS (or RDF) feed of files crawled by the LuceneIndexer within
 the last N minutes.

RecognisedObject

A model for recognised objects from graphics and texts; it typically includes a
 human-readable label for the object, the language of the label, an id and a confidence score.

RecursiveMetadataResource
 
RecursiveParserWrapper

This is a helper class that wraps a parser in a recursive handler.

RecursiveParserWrapperFSConsumer

This runs a RecursiveParserWrapper against an input file
 and outputs the json metadata to an output file.

RecursiveParserWrapperHandler

This is the default implementation of AbstractRecursiveParserWrapperHandler.

RegexCaptureParser
 
RegexNERecogniser

This class offers an implementation of NERecogniser based on
 Regular Expressions.

RegexUtils

Inspired by the Nutch class OutlinkExtractor.

Renderer

Interface for a renderer.

Rendering
 
RenderingParser
 
RenderingState

This should be used to track state for each file (embedded or otherwise).

RenderingTracker

Use this in the ParseContext to keep track of unique ids for rendered
 images in embedded docs.

RenderRequest

Empty interface for requests to a renderer.

RenderResult
 
RenderResult.STATUS
 
RenderResults
 
ReplacementCharset

An implementation of the standard "replacement" charset defined by the W3C.

Report

This class represents a single report.

ReporterBuilder

Interface for reporter builders

RequestTypes

The enumeration of request type.

RereadableInputStream

Wraps an input stream, reading it only once, but making it available
 for rereading an arbitrary number of times.

ResultsReporter
 
RevisionManifest
 
RevisionManifestDataElementData
 
RevisionManifestObjectGroupReferences

Specifies the revision manifest object group references, each followed by object group extended GUIDs

RevisionManifestRootDeclare

Specifies a revision manifest root declare, each followed by root and object extended GUIDs

RevisionStoreObject

The class is used to represent the revision store object.

RevisionStoreObjectGroup
 
RFC822Parser

Uses apache-mime4j to parse emails.

RichTextContentHandler

Content handler for rich text; it extracts the alt attribute of XHTML <img/>
 tags and the name attribute of XHTML <a/> tags into the output.

RollbackSoftware

Demonstrates Tika and its ability to sense symlinks.

RTFConverter

Tika to XMP mapping for the RTF format.

RTFMetadata
 
RTFParser

RTF parser

RTGTranslator

This translator is designed to work with a TCP-IP available
 RTG translation server, specifically the REST-based RTG server.

RUnpackExtractor

Recursive Unpacker and text and metadata extractor.

RUnpackExtractorFactory
 
RunProperties

WARNING: This class is mutable.

RuntimeSAXException

Use this to throw a SAXException in subclassed methods that don't throw SAXExceptions

S3Emitter

Emits to an existing S3 bucket

S3Fetcher

Fetches files from S3.

S3FetcherConfig
 
S3PipesIterator
 
SafeContentHandler

Content handler decorator that makes sure that the character events
 (SafeContentHandler.characters(char[], int, int) or
 SafeContentHandler.ignorableWhitespace(char[], int, int)) passed to the decorated
 content handler contain only valid XML characters.

SafeContentHandler.Output

Internal interface that allows both character and
 ignorable whitespace content to be filtered the same way.

SAS7BDATParser

Processes the SAS7BDAT data columnar database file used by SAS and
 other similar languages.

SecureContentHandler

Content handler decorator that attempts to prevent denial of service
 attacks against Tika parsers.

SentimentAnalysisParser

This parser classifies documents based on the sentiment of the document.

SequenceNumberGenerator
 
SerialNumber
 
ServerStatus
 
ServerStatus.STATUS
 
ServerStatus.TASK
 
ServerStatusResource
 
ServerStatusWatcher
 
ServiceLoader

Internal utility class that Tika uses to look up service providers.

ServiceLoaderUtils

Service Loading and Ordering related utils

SiegfriedDetector

Simple wrapper around Siegfried (https://github.com/richardlehane/siegfried).
 The default behavior is to run detection, report the results in the
 metadata and then return null so that other detectors will be used.

SignatureObject

Signature Object

SimpleChunking
 
SimpleLogReporterBuilder
 
SimpleTextExtractor
 
SimpleThreadPoolExecutor

Simple Thread Pool Executor

SimpleTypeDetector
 
SlowCompositeReaderWrapper

COPIED VERBATIM FROM LUCENE
 This class forces a composite reader (eg a MultiReader or DirectoryReader) to emulate a
 LeafReader.

SolrEmitter
 
SolrEmitter.AttachmentStrategy
 
SolrEmitter.UpdateStrategy
 
SolrPipesIterator

Iterates through results from a Solr query.

SourceCodeParser

Generic Source code parser for Java, Groovy, C++.

SpanSwapper

randomly swaps spans from the input

SpreadsheetMLParser

Parses wordml 2003 format Excel files.

SpringExample
 
SQLite3DBParser

This is the implementation of the db parser for SQLite.

SQLite3Parser

This is the main class for parsing SQLite3 files.

SQLite3TableReader

Concrete class for SQLite table parsing.

StandardHtmlEncodingDetector

An encoding detector that tries to respect the spirit of the HTML spec
 part 12.2.3 "The input byte stream", or at least the part that is compatible with
 the implementation of Tika.

StandardOrganizations

This class provides a collection of the most important technical standard organizations.

StandardReference

Class that represents a standard reference.

StandardReference.StandardReferenceBuilder
 
StandardsExtractingContentHandler

StandardsExtractingContentHandler is a Content Handler used to extract
 standard references while parsing.

StandardsExtractionExample

Class to demonstrate how to use the StandardsExtractingContentHandler
 to get a list of the standard references from every file in a directory.

StandardsText

StandardsText relies on regular expressions to extract standard references
 from text.

StandardWriteFilter

This is to be used to limit the amount of metadata that a
 parser can add based on the StandardWriteFilter.maxTotalEstimatedSize,
 StandardWriteFilter.maxFieldSize, StandardWriteFilter.maxValuesPerField, and
 StandardWriteFilter.maxKeySize.

StandardWriteFilterFactory

Factory class for StandardWriteFilter.

StarOfficeDetector
 
StartXRefOffset
 
StartXRefScanner

This is a first draft of a scanner to extract incremental updates
 out of PDFs.

StatefulParser

The RecursiveParserWrapper wraps the parser sent
 into the parsecontext and then uses that parser
 to store state (among many other things).

StatusReporter

Basic class to use for reporting status from both the crawler and the consumers.

StatusReporterBuilder
 
StatusReporterFutureResult

Empty class for what a StatusReporter returns when it finishes.

StoppingEarlyException

Sentinel exception to stop parsing xml once target is found
 while SAX parsing.

StorageIndexCellMapping

Specifies the storage index cell mappings (with cell identifier, cell mapping extended GUID,
 and cell mapping serial number)

StorageIndexDataElementData
 
StorageIndexManifestMapping
 
StorageIndexRevisionMapping

Specifies the storage index revision mappings (with revision and revision mapping
 extended GUIDs, and revision mapping serial number)

StorageManifestDataElementData
 
StorageManifestRootDeclare

Specifies one or more storage manifest root declare.

StorageManifestSchemaGUID

Specifies a storage manifest schema GUID

StrawManTikaAppDriver

Simple single-threaded class that calls tika-app against every file in a directory.

StreamEmitter
 
StreamGobbler
 
StreamingDetectContext
 
StreamingZipContainerDetector

Currently only used in tests.

StreamObject
 
StreamObjectHeaderEnd
 
StreamObjectHeaderEnd16bit

A 16-bit header for a compound object would indicate the end of a stream object

StreamObjectHeaderEnd8bit

An 8-bit header for a compound object would indicate the end of a stream object

StreamObjectHeaderStart

This class specifies the base class for 16-bit or 32-bit stream object header start

StreamObjectHeaderStart16bit

A 16-bit header for a compound object would indicate the start of a stream object

StreamObjectHeaderStart32bit

A 32-bit header for a compound object would indicate the start of a stream object

StreamObjectParseErrorException
 
StreamObjectTypeHeaderEnd
 
StreamObjectTypeHeaderStart

The enumeration of the stream object type header start

StreamOutRPWFSConsumer

This uses the JsonStreamingSerializer to write out a
 single metadata object at a time.

StringsConfig

Configuration for the "strings" (or strings-alternative) command.

StringsEncoding

Character encoding of the strings that are to be found using the "strings" command.

StringsParser

Parser that uses the "strings" (or strings-alternative) command to find the
 printable strings in an object, or other binary, file
 (application/octet-stream).

StringStatsCalculator<T>

Interface for calculators that require a string

StringUtils
 
SubtreeMatcher

Evaluation state of a ...//... XPath expression.

SummaryExtractor

Extractor for Common OLE2 (HPSF) metadata

SupplementingParser

Runs the input stream through all available parsers,
 merging the metadata from them based on the
 AbstractMultipleParser.MetadataPolicy chosen.

SXSLFPowerPointExtractorDecorator

SAX/Streaming pptx extractor

SXWPFWordExtractorDecorator

This is an experimental, alternative extractor for docx files.

SystemUtils

Copied from commons-lang to avoid requiring the dependency

TableInfo
 
TaggedContentHandler

A content handler decorator that tags potential exceptions so that the
 handler that caused the exception can easily be identified.

TaggedSAXException

A SAXException wrapper that tags the wrapped exception with
 a given object reference.

TailStream


 A specialized input stream implementation which records the last portion read
 from an underlying stream.

TarWriter
 
TaskStatus
 
TeeContentHandler

Content handler proxy that forwards the received SAX events to zero or
 more underlying content handlers.

TEIDOMParser
 
TemporaryResources

Utility class for tracking and ultimately closing or otherwise disposing
 a collection of temporary resources.

TensorflowImageRecParser

This is an implementation of ObjectRecogniser powered by a
 Tensorflow convolutional neural network (CNN).


TensorflowRESTCaptioner

Tensorflow image captioner.

TensorflowRESTRecogniser

Tensor Flow image recogniser which has high performance.

TensorflowRESTVideoRecogniser

Tensor Flow video recogniser which has high performance.

TesseractOCRConfig

Configuration for TesseractOCRParser.

TesseractOCRConfig.OUTPUT_TYPE
 
TesseractOCRParser

TesseractOCRParser powered by tesseract-ocr engine.

TesseractServerConfig

Tesseract configuration, for the request

TextAndAttributeContentHandler
 
TextAndAttributeXMLParser
 
TextAndCSVConfig
 
TextAndCSVParser

Unless the TikaCoreProperties.CONTENT_TYPE_USER_OVERRIDE is set,
 this parser tries to assess whether the file is a text file, csv or tsv.

TextCell

Text cell.

TextContentHandler

Content handler decorator that only passes the
 TextContentHandler.characters(char[], int, int) and
 TextContentHandler.ignorableWhitespace(char[], int, int)
 (plus TextContentHandler.startDocument() and TextContentHandler.endDocument()) events to
 the decorated content handler.

TextDetector

Content type detection of plain text documents.

TextLangDetector

Language Detection using MIT Lincoln Lab’s Text.jl library
 https://github.com/trevorlewis/TextREST.jl

TextMatcher

Final evaluation state of a ...


TextMessageBodyWriter

Returns simple text string for a particular metadata value.

TextOnlyPDFRenderer

This class extends the PDFRenderer to render only the textual
 elements

TextProfileSignature

Copied nearly directly from Apache Nutch:
 https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/TextProfileSignature.java

TextSha256Signature

Calculates the base32 encoded SHA-256 checksum on the analyzed text

TextStatistics

Utility class for computing a histogram of the bytes seen in a stream.

TextStatsCalculator

Base text stats interface

TextStatsFromTikaEval

These examples create a new CompositeTextStatsCalculator
 for each call.

TIAParsingExample
 
TIFF

XMP Exif TIFF schema.

TiffParser
 
Tika

Facade class for accessing Tika functionality.

TikaActivator

Bundle activator that adjusts the class loading mechanism of the
 ServiceLoader class to work correctly in an OSGi environment.

TikaAsyncCLI
 
TikaCLI

Simple command line interface for Apache Tika.

TikaClient
 
TikaClientCLI
 
TikaClientConfigException
 
TikaClientException
 
TikaConfig

Parse xml config file.

TikaConfigException

Tika Config Exception is an exception that occurs when there is an error
 in the Tika config file and/or one or more of the parsers failed to initialize
 from that erroneous config.

TikaConfigSerializer
 
TikaConfigSerializer.Mode
 
TikaCoreProperties

Contains a core set of basic Tika metadata properties, which all parsers
 will attempt to supply (where the file format permits).

TikaCoreProperties.EmbeddedResourceType

A file might contain different types of embedded documents.

TikaDetectors

Provides details of all the Detectors registered with
 Apache Tika, similar to --list-detectors with the Tika CLI.

TikaEmitterException
 
TikaEmitterResult
 
TikaEvalCLI
 
TikaEvalMetadataFilter
 
TikaEvalResource
 
TikaExcelDataFormatter

Overrides Excel's General format to include more
 significant digits than the MS Spec allows.

TikaExcelGeneralFormat

A Format that allows up to 15 significant digits for integers.

TikaException

Tika exception

TikaFileTypeDetector
 
TikaGUI

Simple Swing GUI for Apache Tika.

TikaInputStream

Input stream with extended capabilities.

TikaJsonDeserializer

See the notes in TikaJsonSerializer.

TikaJsonSerializer

This is a basic serializer that requires that an object:
 a) have a no-arg constructor
 b) have both setters and getters for the same parameters with the same names, e.g. setXYZ and getXYZ
 c) setters and getters have to follow the pattern setX where X is a capital letter
 d) have maps as parameters where the keys are strings (and the values are strings for now)
 e) at deserialization time, objects that have setters for enums also have to have a setter for a string value of that enum

TikaLanguageDetector

This is Tika's original legacy, homegrown language detector.

TikaLoggingFilter
 
TikaMemoryLimitException
 
TikaMimeKeys

A collection of Tika metadata keys used in Mime Type resolution

TikaMimeTypes

Provides details of all the mimetypes known to Apache Tika,
 similar to --list-supported-types with the Tika CLI.

TikaMp4BoxHandler
 
TikaPagedText

Metadata properties for paged text, metadata appropriate
 for an individual page (useful for embedded document handlers
 called on individual pages).

TikaParsers

Provides details of all the Parsers registered with
 Apache Tika, similar to --list-parsers and
 --list-parser-details within the Tika CLI.

TikaResource
 
TikaSerializationException
 
TikaServerCli
 
TikaServerClientConfig
 
TikaServerConfig
 
TikaServerParseException

Simple wrapper exception to be thrown for consistent handling
 of exceptions that can happen during a parse.

TikaServerParseExceptionMapper
 
TikaServerProcess
 
TikaServerResource

Stub interface to allow for loading of resources via SPI

TikaServerStatus
 
TikaServerWatchDog
 
TikaServerWriter<T>

Stub interface to allow for SPI loading from other modules
 without opening up service loading to any generic MessageBodyWriter

TikaTaskTimeout
 
TikaTimeoutException

Runtime/unchecked version of TimeoutException

TikaToXMP
 
TikaUserDataBox
 
TikaVersion
 
TikaWelcome

Provides a basic welcome to the Apache Tika Server.

TikaWelcome.Endpoint
 
TimeoutConfig
 
TlsConfig
 
TMXContentHandler

Content Handler for Translation Memory eXchange (TMX) files.

TMXParser

Parser for Translation Memory eXchange (TMX) files.

TNEFParser

A POI-powered Tika Parser for TNEF (Transport Neutral
 Encoding Format) messages, aka winmail.dat

ToHTMLContentHandler

SAX event handler that serializes the HTML document to a character stream.

TokenContraster

Computes some corpus contrast statistics.

TokenCounter
Deprecated.
use CompositeTextStatsCalculator
 with TokenEntropy,
 TokenLengths
 and TopNTokens.

TokenCountPriorityQueue
 
TokenCountPriorityQueue
 
TokenCounts
 
TokenCountStatsCalculator<T>

Interface for calculators that require token stats

TokenEntropy
 
TokenIntPair
 
TokenLengths
 
TokenStatistics
 
TopCommonTokenCounter

Utility class that reads in a UTF-8 input file with one document per row
 and outputs the 20000 tokens with the highest document frequencies.

TopNTokens
 
TotalCounter

Interface for pipesiterators that allow counting of total
 documents.

TotalCountResult
 
TotalCountResult.STATUS
 
ToTextContentHandler

SAX event handler that writes all character content out to a character
 stream.

ToXMLContentHandler

SAX event handler that serializes the XML document to a character stream.

TrainedModel
 
TrainedModelDetector
 
TrainTestSplit
 
TranscribeTranslateExample

This example demonstrates primitive logic for
 chaining Tika API calls.

Transformer
 
TranslateResource
 
Translator

Interface for Translator services.

TranslatorExample
 
TrecDocumentGenerator

Generates document summaries for corpus analysis in the Open Relevance
 project.

TrueTypeParser

Parser for TrueType font files (TTF).

Truncator
 
TSDParser

Tika parser for Time Stamped Data Envelope (application/timestamped-data)

TwoBytesOfData

This class is used to represent a property that contains 2 bytes of data in the PropertySet.rgData stream field.

TXTParser

Plain text parser.

TypeDetector

Content type detection based on a content type hint.

UByte

The unsigned byte type

UInteger

The unsigned int type

ULong

The unsigned long type

UMath
 
UnicodeBlockCounter
 
UniversalEncodingDetector
 
UnpackerResource
 
UnrarParser

Parser for RAR files.

Unsigned

A utility class for static access to unsigned number functionality.

UnsupportedFormatException

Parsers should throw this exception when they encounter
 a file format that they do not support.

UNumber

A base type for unsigned numbers.

URLEmailNormalizingFilterFactory

Factory for a filter that normalizes URLs and emails to __url__ and __email__
 respectively.

UrlFetcher

Simple fetcher for URLs.

UShort

The unsigned short type

UuidUtils
 
VectorGraphicsOnlyPDFRenderer

This class extends the PDFRenderer to render only the vector
 graphics elements

WACZParser
 
WARC
 
WARCParser

Uses jwarc to parse WARC and ARC files

WatchDogResult
 
WebPParser
 
WMFParser

This parser offers a very rough capability to extract text if there
 is text stored in the WMF files.

Word2006MLParser
 
WordExtractor
 
WordExtractor.TagAndStyle
 
WordMLParser

Parses Word files in the WordML 2003 format.

WordPerfect

WordPerfect properties collection.

WordPerfectParser

Parser for Corel WordPerfect documents.

WriteLimiter
 
WriteLimitReachedException
 
WriteOutContentHandler

SAX event handler that writes content up to an optional write
 limit out to a character stream or other decorated handler.

XHTMLContentHandler

Content handler decorator that simplifies the task of producing XHTML
 events for Tika content parsers.

XLIFF12ContentHandler

Content Handler for XLIFF 1.2 documents.

XLIFF12Parser

Parser for XLIFF 1.2 files.

XLSXHREFFormatter
 
XLZParser

Parser for XLZ Archives.

XMLDOMUtil
 
XMLErrorLogUpdater

This is a very task specific class that reads a log file and updates
 the "comparisons" table.

XMLLogMsgHandler
 
XMLLogReader
 
XMLParser

XML parser.

XMLProfiler

XMLReaderUtils

Utility functions for reading XML.

XmlRootExtractor

Utility class that uses a SAXParser to determine
 the namespace URI and local name of the root element of an XML file.

XMP
 
XMPContentHandler

Content handler decorator that simplifies the task of producing XMP output.

XMPDM

XMP Dynamic Media schema.

XMPDM.ChannelTypePropertyConverter
Deprecated.
Experimental method, will change shortly

XMPIdq
 
XMPMessageBodyWriter
 
XMPMetadata

Converts the Tika Metadata map to the XMP data model while also providing the
 Metadata API, to ease the transition for clients.

XMPMetadataExtractor

XMP Metadata Extractor based on Apache XmpBox.

XMPMetadataResource
 
XMPMM
 
XMPPacketScanner

This class is a parser for XMP packets.

XMPRights

XMP Rights management schema.

XMPSchemaIllustrator
 
XMPSchemaPDFUA
 
XMPSchemaPDFVT
 
XMPSchemaPDFX

This is somewhat of a hack to handle the older pdfx: schema.
 See also the more modern XMPSchemaPDFXId.

XMPSchemaPDFXId
 
XPathParser

Parser for a very simple XPath subset.

XPSExtractorDecorator
 
XPSTextExtractor

Currently, mostly a pass-through class to hold pkg and properties
 and keep the general framework similar to our other POI-integrated
 extractors.

XSLFEventBasedPowerPointExtractor
 
XSLFPowerPointExtractorDecorator
 
XSSFBExcelExtractorDecorator
 
XSSFExcelExtractorDecorator
 
XSSFExcelExtractorDecorator.HeaderFooterFromString
 
XSSFExcelExtractorDecorator.SheetTextAsHTML

Turns formatted sheet events into HTML

XSSFExcelExtractorDecorator.XSSFSheetInterestingPartsCapturer

Captures information on interesting tags, whilst
 delegating the main work to the formatting handler

XUserDefinedCharset
 
XUserDefinedCharset.NotImplementedException
 
XWPFEventBasedWordExtractor

Experimental class that is based on POI's XSSFEventBasedExcelExtractor

XWPFListManager
 
XWPFNumberingShim

Stub class of POI's XWPFNumbering because onDocumentRead() is protected

XWPFStylesShim

For Tika, all we need (so far) is a mapping between styleId and a style's name.

XWPFWordExtractorDecorator
 
YandexTranslator

An implementation of a REST client for the YANDEX Translate API.

ZeroByteFileException

Exception thrown by the AutoDetectParser when a file contains zero bytes.

ZeroByteFileException.IgnoreZeroByteFileException
 
ZeroSizeFileDetector

Detector to identify zero-length files as application/x-zerovalue

ZipContainerDetector

Classes that implement this must be able to detect on a ZipFile and in streaming mode.

ZipFilesChunking

This class is used to process zip file chunking

ZipHeader
 
ZipListFiles

Example code listing from Chapter 1.

ZipSalvager
 
ZipWriter