All Classes and Interfaces (Apache Tika 2.9.0 API)

Utility class that runs TopCommonTokenCounter against a directory of table files (named {lang}_table.gz or Leipzig-like afr_...-sentences.txt) and writes a common tokens file for each input table file to the output directory.

BinaryItem

Bit

Reads and sets bit values in a byte array.

BitConverter

BitReader

Extracts values that span byte boundaries at arbitrary bit positions.
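
A minimal sketch of reading a value that straddles a byte boundary, using only the JDK. It is not Tika's BitReader: the class name BitFieldSketch, the method readBits, and the LSB-first bit numbering within each byte are all assumptions made for illustration.

```java
/** Illustrative only; the name and the bit order are assumptions, not Tika's API. */
final class BitFieldSketch {

    /**
     * Reads {@code length} bits (1..32) starting at absolute bit offset {@code bitOffset}.
     * Assumption: bit 0 is the least significant bit of data[0].
     */
    static int readBits(byte[] data, int bitOffset, int length) {
        if (length < 1 || length > 32) {
            throw new IllegalArgumentException("length must be 1..32");
        }
        if (bitOffset < 0 || (long) bitOffset + length > (long) data.length * 8) {
            throw new IndexOutOfBoundsException("bit range exceeds the array");
        }
        int value = 0;
        for (int i = 0; i < length; i++) {
            int pos = bitOffset + i;
            // Pick the byte (pos / 8), then the bit inside it (pos % 8).
            int bit = (data[pos >> 3] >> (pos & 7)) & 1;
            value |= bit << i;
        }
        return value;
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xB4, (byte) 0x0A};
        // Bits 4..11 cover the high nibble of byte 0 and the low nibble of byte 1.
        System.out.println(Integer.toHexString(readBits(data, 4, 8))); // prints "ab"
    }
}
```

Reading one bit per iteration keeps the logic easy to check; a production reader would more likely shift and mask whole bytes.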

BitWriter

BodyContentHandler

Content handler decorator that only passes everything inside the XHTML <body/> tag to the underlying handler.

BoilerpipeContentHandler

Uses the boilerpipe library to automatically extract the main content from a web page.

BouncyCastleDigester

Digester that relies on BouncyCastle for MessageDigest implementations.

BoundedInputStream

Very slight modification of Commons' BoundedInputStream so that we can figure out if this hit the bound or not.
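
The idea of a bounded stream that can report whether its limit was reached can be sketched with the JDK alone. The class below is hypothetical (BoundTrackingStream and hasHitBound are illustrative names, not necessarily Tika's), and it assumes that "hit the bound" means the full limit has been consumed.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Illustrative only: caps reads at max bytes and remembers whether the cap was reached. */
final class BoundTrackingStream extends FilterInputStream {
    private final long max;
    private long pos;

    BoundTrackingStream(long max, InputStream in) {
        super(in);
        if (max < 0) {
            throw new IllegalArgumentException("max must be >= 0");
        }
        this.max = max;
    }

    @Override
    public int read() throws IOException {
        if (pos >= max) {
            return -1; // pretend end-of-stream once the bound is reached
        }
        int b = in.read();
        if (b != -1) {
            pos++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (pos >= max) {
            return -1;
        }
        int n = in.read(buf, off, (int) Math.min(len, max - pos));
        if (n > 0) {
            pos += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = in.skip(Math.min(n, max - pos));
        pos += skipped;
        return skipped;
    }

    @Override
    public boolean markSupported() {
        return false; // mark/reset would desynchronize pos in this simple sketch
    }

    /** True once max bytes were consumed, even if the source ended exactly there. */
    boolean hasHitBound() {
        return pos >= max;
    }

    public static void main(String[] args) throws IOException {
        BoundTrackingStream s = new BoundTrackingStream(4, new ByteArrayInputStream(new byte[10]));
        int count = 0;
        while (s.read() != -1) {
            count++;
        }
        System.out.println(count + " bytes read, hit bound: " + s.hasHitBound());
    }
}
```

A caller can use the flag to tell "the input was short" apart from "the input was cut off at the limit".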

BPGParser

Parser for the Better Portable Graphics (BPG) File Format.

BPListDetector

Detector for BPList with utility functions for PList.

ByteDeleter

ByteFlipper

ByteInjector

BytesRefCalculator<T>

Interface for calculators that require a string

BytesRefCalculator.BytesRefCalcInstance<T>

ByteUtil

CachedTranslator

CachedTranslator.

CallablePipesIterator

This is a simple wrapper around PipesIterator that allows it to be called in its own thread.

CantFuzzException

CaptionObject

A model for caption objects from graphics and text; it typically includes a human-readable sentence, the language of the sentence, and a confidence score.

Cell of content.

Cell decorator.

CellManifestCurrentRevision

CellManifestDataElementData

Cell manifest data element

CharsetDetector

CharsetDetector provides a facility for detecting the charset or encoding of character data in an unknown format.

CharsetMatch

This class represents a charset that has been identified by a CharsetDetector as a possible encoding for a set of input data.

CharsetUtils

ChildMatcher

Intermediate evaluation state of a .../*... XPath expression.

ChmAccessor<T>

Defines an accessor interface

ChmAssert

Contains chm extractor assertions

ChmBlockInfo

A container that contains chm block information such as: i.

ChmCommons

ChmCommons.EntryType

Represents entry types: uncompressed, compressed

ChmCommons.IntelState

Represents intel file states during decompression

ChmCommons.LzxState

Represents lzx states: started decoding, not started decoding

ChmConstants

ChmDirectoryListingSet

Holds chm listing entries

ChmExtractor

Extracts text from chm file.

ChmItsfHeader

The Header
0000: char[4] 'ITSF'
0004: DWORD 3 (Version number)
0008: DWORD Total header length, including header section table and following data.
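
The first three ITSF header fields can be read with a ByteBuffer. This is a sketch, not Tika's ChmItsfHeader: the class name ItsfHeaderSketch is hypothetical, and it assumes the DWORDs are little-endian, as is usual for Windows file formats.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Illustrative only: reads the signature, version and total length fields. */
final class ItsfHeaderSketch {
    final int version;
    final long headerLength;

    private ItsfHeaderSketch(int version, long headerLength) {
        this.version = version;
        this.headerLength = headerLength;
    }

    static ItsfHeaderSketch parse(byte[] data) {
        if (data.length < 12) {
            throw new IllegalArgumentException("need at least 12 bytes");
        }
        // 0000: char[4] 'ITSF'
        String signature = new String(data, 0, 4, StandardCharsets.US_ASCII);
        if (!"ITSF".equals(signature)) {
            throw new IllegalArgumentException("not an ITSF header");
        }
        // Assumption: DWORD fields are stored little-endian.
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        int version = buf.getInt(4);                      // 0004: version number
        long headerLength = buf.getInt(8) & 0xFFFFFFFFL;  // 0008: total header length (unsigned)
        return new ItsfHeaderSketch(version, headerLength);
    }

    public static void main(String[] args) {
        byte[] sample = {'I', 'T', 'S', 'F', 3, 0, 0, 0, 0x60, 0, 0, 0};
        ItsfHeaderSketch h = parse(sample);
        System.out.println("version=" + h.version + " headerLength=" + h.headerLength);
    }
}
```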

ChmItspHeader

Directory header. The directory starts with a header; its format is as follows:
0000: char[4] 'ITSP'
0004: DWORD Version number 1
0008: DWORD Length of the directory header
000C: DWORD $0a (unknown)
0010: DWORD $1000 Directory chunk size
0014: DWORD "Density" of quickref section, usually 2
0018: DWORD Depth of the index tree: 1 if there is no index, 2 if there is one level of PMGI chunks
001C: DWORD Chunk number of root index chunk, -1 if there is none (though at least one file has 0 despite there being no index chunk, probably a bug)
0020: DWORD Chunk number of first PMGL (listing) chunk
0024: DWORD Chunk number of last PMGL (listing) chunk
0028: DWORD -1 (unknown)
002C: DWORD Number of directory chunks (total)
0030: DWORD Windows language ID
0034: GUID {5D02926A-212E-11D0-9DF9-00A0C922E6EC}
0044: DWORD $54 (This is the length again)
0048: DWORD -1 (unknown)
004C: DWORD -1 (unknown)
0050: DWORD -1 (unknown)

ChmLzxBlock

Decompresses a chm block.

ChmLzxcControlData

::DataSpace/Storage//ControlData This file contains $20 bytes of information on the compression.

ChmLzxcResetTable

LZXC reset table, used to support decompression.

Note: does not always exist. An index chunk has the following format:
0000: char[4] 'PMGI'
0004: DWORD Length of quickref/free area at end of directory chunk
0008: Directory index entries (to quickref/free area)
The quickref area in a PMGI is the same as in a PMGL. The format of a directory index entry is as follows:
BYTE: length of name
BYTEs: name (UTF-8 encoded)
ENCINT: directory listing chunk which starts with name
Encoded Integers aka ENCINT: An ENCINT is a variable-length integer.

ChmPmglHeader

There are two types of directory chunks: index chunks and listing chunks.

ChmSection

ChmWrapper

ChunkingFactory

Creates instances of AbstractChunking.

ChunkingMethod

CJKBigramAwareLengthFilterFactory

Creates a very narrowly focused TokenFilter that limits tokens based on length _unless_ they've been identified as <DOUBLE> or <SINGLE> by the CJKBigramFilter.

ClassLoaderUtil

ClassParser

Parser for Java .class files.

CleanPhoneText

Class to help de-obfuscate phone numbers in text.

ClearByMimeMetadataFilter

This class clears the entire metadata object if the mime matches the mime filter.

ClimateForcast

Met keys from NCAR CCSM files in the Climate Forecast Convention.

ColInfo

Cols

CommandLineParserBuilder

Reads configurable options from a config file and returns an org.apache.commons.cli.Options object for use in a command-line parser.

CommonsDigester

Implementation of DigestingParser.Digester that relies on commons.codec.digest.DigestUtils to calculate digest hashes.

CommonsDigester.DigestAlgorithm

CommonsDigesterFactory

Simple factory for CommonsDigester with default markLimit = 1000000 and md5 digester.

CommonTokenCountManager

CommonTokenOverlapCounter

CommonTokenResult

CommonTokens

CommonTokensBhattacharyya

CommonTokensCosine

CommonTokensHellinger

CommonTokensKLDivergence

CommonTokensKLDNormed

Compact64bitInt

A 9-byte encoding of values in the range 0x0002000000000000 through 0xFFFFFFFFFFFFFFFF

CompactID

This class is used to represent the CompactID structure.

CompareUtils

CompositeDetector

Content type detector that combines multiple different detection mechanisms.

CompositeDigester

CompositeEncodingDetector

CompositeExternalParser

A Composite Parser that wraps up all the available External Parsers, and provides an easy way to access them.

CompositeMatcher

Composite XPath evaluation state.

CompositeMetadataFilter

CompositeParseContextConfig

CompositeParser

Composite parser that delegates parsing tasks to a component parser based on the declared content type of the incoming document.

CompositePipesReporter

CompositeRenderer

CompositeTagHandler

Takes an array of ID3Tags in preference order, and when asked for a given tag, will return it from the first ID3Tags that has it.

CompositeTextStatsCalculator

CompressorConstants

CompressorParser

Parser for various compression formats.

CompressorParserOptions

Interface for setting options for the CompressorParser by passing via the ParseContext.

ConcurrentUtils

Utility Class for Concurrency in Tika

ConfigBase

ConfigurableThreadPoolExecutor

Allows the thread pool to be configurable.

ConsumersManager

Simple interface around a collection of consumers that allows for initializing and shutting down shared resources (e.g.

ContainerExtractor

Tika container extractor interface.

ContentHandlerDecorator

Decorator base class for the ContentHandler interface.

ContentHandlerDecoratorFactory

ContentHandlerExample

Examples of using different Content Handlers to get different parts of the file's contents

ContentHandlerFactory

Interface to allow easier injection of code for getting a new ContentHandler

ContentLengthCalculator

This class offers an implementation of NERecogniser based on CRF classifiers from Stanford CoreNLP.

CorruptedFileException

This exception should be thrown when the parse absolutely, positively has to stop.

CreativeCommons

A collection of Creative Commons properties names.

CryptoParser

Decrypts the incoming document stream and delegates further parsing to another parser instance.

CSVMessageBodyWriter

CSVParams

CSVPipesIterator

Iterates through a UTF-8 CSV file.

CSVResult

CTAKESAnnotationProperty

This enumeration includes the properties that an IdentifiedAnnotation object can provide.

CTAKESConfig

Configuration for CTAKESContentHandler.

CTAKESContentHandler

Class used to extract biomedical information while parsing.

CTAKESParser

CTAKESParser decorates a Parser and leverages on CTAKESContentHandler to extract biomedical information from clinical text using Apache cTAKES.

CTAKESSerializer

Enumeration for types of cTAKES (UIMA) CAS serializer supported by cTAKES.

CTAKESUtils

This class provides methods to extract biomedical information from plain text using CTAKESContentHandler that relies on Apache cTAKES.

Base class of data element

DataElementHash

Specifies a data element hash stream object

DataElementPackage

DataElementParseErrorException

DataElementType

The enumeration of the data element type

DataElementUtils

DataHashObject

DataNodeObjectData

Data Node Object data

DataSizeObject

Data Size Object

DataURIScheme

DataURISchemeParseException

DataURISchemeUtil

Not thread safe.

DateNormalizingMetadataFilter

Some dates in some file formats do not have a timezone.

DateUtils

Date related utility methods and constants

DBBuffer

DBConsumersManager

DBFParser

This is a Tika wrapper around the DBFReader.

DBWriter

This is still in its early stages.

DcXMLParser

Dublin Core metadata parser

DefaultContentHandlerFactoryBuilder

Builds BasicContentHandler with type defined by attribute "basicHandlerType" with possible values: xml, html, text, body, ignore.

DefaultDetector

A composite detector based on all the Detector implementations available through the service provider mechanism.

DefaultEmbeddedStreamTranslator

Loads EmbeddedStreamTranslators via service loading.

DefaultEncodingDetector

A composite encoding detector based on all the EncodingDetector implementations available through the service provider mechanism.

DefaultHtmlMapper

The default HTML mapping rules in Tika.

DefaultInputStreamFactory

Passthrough -- returns InputStream as is

DefaultMetadataFilter

DefaultParser

A composite parser based on all the Parser implementations available through the service provider mechanism.

DefaultProbDetector

A version of DefaultDetector for probabilistic mime detectors, which use statistical techniques to blend the results of differing underlying detectors when attempting to detect the type of a given file.

DefaultTranslator

A translator that picks the first Translator implementation available through the service provider mechanism.

DefaultZipContainerDetector

DelegatingParser

Base class for parser implementations that want to delegate parts of the task of parsing an input document to another parser.

DeprecatedStreamingZipContainerDetector

DeprecatedZipContainerDetector

A detector that works on Zip documents and tries to figure out basic types -- epub, jar, ear, war, kmz and StarOffice

DescribeMetadata

Print the supported Tika Metadata models and their fields.

Detector

Content type detector.

DetectorResource

DGN8Parser

This is a VERY LIMITED parser.

DIFContentHandler

DIFParser

DigestingAutoDetectParserFactory

DigestingParser

DigestingParser.Digester

Interface for digester.

DigestingParser.DigesterFactory

This is used in AutoDetectParserConfig to (optionally) wrap the parser in a digesting parser.

DigestingParser.Encoder

Encodes byte array from a MessageDigest to String

DirectoryListingEntry

The format of a directory listing entry is as follows:
BYTE: length of name
BYTEs: name (UTF-8 encoded)
ENCINT: content section
ENCINT: offset
ENCINT: length
The offset is from the beginning of the content section the file is in, after the section has been decompressed (if appropriate).
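
The entry layout above can be decoded in a few lines. This sketch is not Tika's DirectoryListingEntry; the class name ChmListingEntrySketch is hypothetical. It also assumes the usual CHM ENCINT encoding, which the summary above does not spell out: seven payload bits per byte, most significant group first, with the high bit set on every byte except the last.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Illustrative only: decodes one directory listing entry from a buffer. */
final class ChmListingEntrySketch {
    final String name;
    final long section;
    final long offset;
    final long length;

    private ChmListingEntrySketch(String name, long section, long offset, long length) {
        this.name = name;
        this.section = section;
        this.offset = offset;
        this.length = length;
    }

    /** Assumed ENCINT rule: 7 bits per byte, big-endian groups, high bit means "more follows". */
    static long readEncInt(ByteBuffer buf) {
        long value = 0;
        int b;
        do {
            b = buf.get() & 0xFF;
            value = (value << 7) | (b & 0x7F);
        } while ((b & 0x80) != 0);
        return value;
    }

    static ChmListingEntrySketch parse(ByteBuffer buf) {
        int nameLength = buf.get() & 0xFF;      // BYTE: length of name
        byte[] nameBytes = new byte[nameLength];
        buf.get(nameBytes);                     // BYTEs: name (UTF-8 encoded)
        String name = new String(nameBytes, StandardCharsets.UTF_8);
        long section = readEncInt(buf);         // ENCINT: content section
        long offset = readEncInt(buf);          // ENCINT: offset
        long length = readEncInt(buf);          // ENCINT: length
        return new ChmListingEntrySketch(name, section, offset, length);
    }

    public static void main(String[] args) {
        byte[] raw = {6, '/', 'a', '.', 'h', 't', 'm', 0x01, (byte) 0xEA, 0x15, 0x7F};
        ChmListingEntrySketch e = parse(ByteBuffer.wrap(raw));
        System.out.println(e.name + " section=" + e.section + " offset=" + e.offset + " length=" + e.length);
    }
}
```

In the sample, the two bytes EA 15 decode to ((0xEA & 0x7F) << 7) | 0x15 = 0x3515. A truncated entry surfaces as a BufferUnderflowException from ByteBuffer.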

DirListParser

Parses the output of /bin/ls and counts the number of files and the number of executables using Tika.

DisplayMetInstance

Grabs a PDF file from a URL and prints its Metadata

DL4JInceptionV3Net

DL4JInceptionV3Net is an implementation of ObjectRecogniser.

DL4JVGG16Net

DocumentSelector

Interface for different document selection strategies for purposes like embedded document extraction by a ContainerExtractor instance.

DocumentSelectorConfig

DublinCore

A collection of Dublin Core metadata names.

DumpTikaConfigExample

This class shows how to dump a TikaConfig object to a configuration file.

DurationFormatUtils

Functionality and naming conventions (roughly) copied from org.apache.commons.lang3 so that we didn't have to add another dependency.

DWGParser

DWG (CAD Drawing) parser.

DWGParserConfig

DWGReadFormatRemover

DWGReadFormatRemover removes the formatting from the text from libredwg files so only the raw text remains.

DWGReadParser

DWGReadParser (CAD Drawing) parser.

EightBytesOfData

Represents a property that contains 8 bytes of data in the PropertySet.rgData stream field.

ElementMappingContentHandler

Content handler decorator that maps element QNames using a Map.

ElementMappingContentHandler.TargetElement

ElementMatcher

Final evaluation state of an XPath expression that targets an element.

ElementMetadataHandler

SAX event handler that maps the contents of an XML element into a metadata field.

EmbeddedContentHandler

Content handler decorator that prevents the EmbeddedContentHandler.startDocument() and EmbeddedContentHandler.endDocument() events from reaching the decorated handler.

EmbeddedDocumentExtractor

EmbeddedDocumentExtractorFactory

EmbeddedDocumentUtil

Utility class to handle common issues with embedded documents.

EmbeddedPartMetadata

This class records metadata about embedded parts that exists in the xml of the main document.

EmbeddedResourceHandler

Tika container extractor callback interface.

EmbeddedStreamTranslator

Interface for different filtering of embedded streams.

Embedder

Tika embedder interface

EMFParser

Extracts files embedded in EMF and offers a very rough capability to extract text if there is text stored in the EMF.

Utility class that will apply the appropriate fetcher to the fetcherString based on the prefix.

EmptyDetector

Dummy detector that returns application/octet-stream for all documents.

EmptyEmitter

EmptyFetcher

EmptyParser

Dummy parser that always produces an empty XHTML document without even attempting to parse the given document stream.

EmptyTranslator

Dummy translator that always declines to give any text.

EncodingDetector

Character encoding detector.

EncryptedDocumentException

EncryptedPrescriptionDetector

EncryptedPrescriptionParser

EndDocumentShieldingContentHandler

A wrapper around a ContentHandler which will ignore normal SAX calls to EndDocumentShieldingContentHandler.endDocument(), and only fire them later.

EndianUtils

General Endian Related Utilities.

EndianUtils.BufferUnderrunException

EnviHeaderParser

Epub

EPub properties collection.

EpubContentParser

Parser for EPUB OPS *.html files.

EpubParser

Epub parser

Error

ErrorParser

Dummy parser that always throws a TikaException without even attempting to parse the given document stream.

Excel parser implementation which uses POI's Event API to handle the contents of a Workbook.

ExceptionUtils

ExcludeFieldMetadataFilter

ExecutableParser

Parser for executable files.

ExGuid

ExGUIDArray

ExpandedTitleContentHandler

Content handler decorator which wraps a TransformerHandler in order to allow the TITLE tag to render as <title></title> rather than <title/> which is accomplished by calling the ContentHandler.characters(char[], int, int) method with a length of 1 but a zero length char array.

ExtendedGUID

ExternalEmbedder

Embedder that uses an external program (like sed or exiftool) to embed text content and metadata into a given document.

ExternalParser

Parser that uses an external program (like catdoc or pdf2txt) to extract text content and metadata from a given document.

ExternalParser

This is a next generation external parser that uses some of the more recent additions to Tika.

ExternalParser.LineConsumer

Consumer contract

ExternalParsersConfigReader

Builds up ExternalParser instances based on XML file(s) which define what to run, for what, and how to process any output metadata.

ExternalParsersConfigReaderMetKeys

Met Keys used by the ExternalParsersConfigReader.

ExternalParsersFactory

Creates instances of ExternalParser based on XML configuration files.

ExternalProcess

ExternalTranslator

Abstract class used to interact with command line/external Translators.

ExtractComparer

ExtractComparerBuilder

ExtractEmbeddedFiles

ExtractProfiler

ExtractProfilerBuilder

ExtractReader

ExtractReader.ALTER_METADATA_LIST

ExtractReaderException

Exception when trying to read extract

ExtractReaderException.TYPE

FailedToStartClientException

This should be catastrophic

FallbackParser

Tries multiple parsers in turn, until one succeeds.
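
The "try each in turn" strategy can be sketched without any Tika types. Everything here is hypothetical (FallbackSketch, TextExtractor, extractWithFallback); a real fallback parser also has to deal with concerns this sketch ignores, such as re-reading the input stream and carrying metadata between attempts.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: returns the result of the first extractor that does not throw. */
final class FallbackSketch {

    /** Hypothetical stand-in for a parser. */
    interface TextExtractor {
        String extract(byte[] input) throws Exception;
    }

    static String extractWithFallback(List<TextExtractor> extractors, byte[] input) {
        Exception failure = null;
        for (TextExtractor extractor : extractors) {
            try {
                return extractor.extract(input);
            } catch (Exception e) {
                // Keep the first failure as the main cause and attach later ones to it.
                if (failure == null) {
                    failure = e;
                } else {
                    failure.addSuppressed(e);
                }
            }
        }
        throw new IllegalStateException("no extractor succeeded", failure);
    }

    public static void main(String[] args) {
        TextExtractor strict = bytes -> {
            throw new IOException("unsupported format");
        };
        TextExtractor lenient = bytes -> "read " + bytes.length + " bytes";
        System.out.println(extractWithFallback(Arrays.asList(strict, lenient), new byte[3]));
    }
}
```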

FeedParser

Feed parser.

FetchEmitTuple

FetchEmitTuple.ON_PARSE_EXCEPTION

Fetcher

Interface for an object that will fetch an InputStream given a fetch string.

FetcherManager

Utility class to hold multiple fetchers.

FetcherStreamFactory

This class looks for "fetcherName" in the http header.

FetcherStringException

If something goes wrong in parsing the fetcher string

FetchKey

Pair of fetcherName (which fetcher to call) and the key to send to that fetcher to retrieve a specific file.

FictionBookParser

Field

Field annotation is a contract for binding Param value from Tika Configuration to an object.

FieldNameMappingFilter

FileCommandDetector

This runs the linux 'file' command against a file.

FileListPipesIterator

Reads a list of file names/relative paths from a UTF-8 file.

FilenameUtils

FileProcessResult

FileProfiler

This class profiles actual files as opposed to extracts e.g.

FileProfilerBuilder

FileResource

This is a basic interface to handle a logical "file".

FileResourceConsumer

This is a base class for file consumers.

FileResourceCrawler

FileSystem

A collection of metadata elements for file system level metadata

FileSystemEmitter

Emitter to write to a file system.

FileSystemFetcher

FileSystemPipesIterator

FileSystemStatusReporter

This is intended to write summary statistics to disk periodically.

FileTooLongException

FlatOpenDocumentParser

FLVParser

Parser for metadata contained in Flash Videos (.flv).

Represents a property that contains 4 bytes of data in the PropertySet.rgData stream field.

FrictionlessPackageDetector

FSBatchProcessCLI

FSConsumersManager

FSCrawlerBuilder

Builds either an FSDirectoryCrawler or an FSListCrawler.

FSDirectoryCrawler

FSDirectoryCrawler.CRAWL_ORDER

FSDocumentSelector

Selector that chooses files based on their file name and their size, as determined by TikaCoreProperties.RESOURCE_NAME_KEY and Metadata.CONTENT_LENGTH.

FSFileResource

FileSystem(FS)Resource wraps a file name.

FSListCrawler

Class that "crawls" a list of files.

FSOutputStreamFactory

FSOutputStreamFactory.COMPRESSION

FSProperties

FSUtil

Utility class to handle some common issues when reading from and writing to a file system (FS).

FSUtil.HANDLE_EXISTING

FuzzingCLI

FuzzingCLIConfig

FuzzOne

Forked process that runs against a single input file

GCSEmitter

GCSFetcher

Fetches files from google cloud storage.

GCSPipesIterator

GDALParser

Wraps execution of the Geospatial Data Abstraction Library (GDAL) gdalinfo tool used to extract geospatial information out of hundreds of geo file formats.

GeneralTransformer

GenericConverter

Tries to convert as many of the properties in the Metadata map as possible to XMP namespaces.

GeoGazetteerClient

Geographic

Geographic schema.

GeographicInformationParser

GeoParser

GeoParserConfig

GeoPointMetadataFilter

If Metadata contains a TikaCoreProperties.LATITUDE and a TikaCoreProperties.LONGITUDE, this filter concatenates those with a comma in the order LATITUDE,LONGITUDE.

GeoTag

GlobalIdTableEntry3FNDX

GlobalIdTableEntryFNDX

GoogleTranslator

An implementation of a REST client to the Google Translate v2 API.

GrabPhoneNumbersExample

Class to demonstrate how to use the PhoneExtractingContentHandler to get a list of all of the phone numbers from every file in a directory.

GZipSpecializationDetector

This is designed to detect commonly gzipped file types such as warc.gz.

H2Util

HandlerConfig

HandlerConfig.PARSE_MODE

HandlerConfig.PARSE_MODE.RMETA "recursive metadata" is the same as the -J option in tika-app and the /rmeta endpoint in tika-server.

HDFParser

Since the NetCDFParser depends on the NetCDF-Java API, we are able to use it to parse HDF files as well.

HeaderCell

HeifParser

HexCoDec

A set of Hex encoding and decoding utility methods.

HSLFExtractor

HTML

HtmlEncodingDetector

Character encoding detector for determining the character encoding of an HTML document based on the potential charset parameter found in a Content-Type http-equiv meta tag somewhere near the beginning.

HTMLHelper

Helps produce user facing HTML output.

HtmlMapper

HTML mapper used to make incoming HTML documents easier to handle by Tika clients.

HtmlParser

HTML parser.

HttpClientFactory

This holds quite a bit of state and is not thread safe.

HttpClientUtil

HttpFetcher

Based on Apache httpclient

HttpHeaders

A collection of HTTP header names.

A basic parser class for Apple ICNS icon files

IContentHandlerFactoryBuilder

ICrawlerBuilder

Icu4jEncodingDetector

ID3Tags

Interface that defines the common interface for ID3 tag parsers, such as ID3v1 and ID3v2.3.

ID3Tags.ID3Comment

Represents a comment in ID3 (especially ID3 v2), which is made up of several parts

ID3v1Handler

This is used to parse ID3 Version 1 Tag information from an MP3 file, if available.

ID3v22Handler

This is used to parse ID3 Version 2.2 Tag information from an MP3 file, if available.

ID3v23Handler

This is used to parse ID3 Version 2.3 Tag information from an MP3 file, if available.

ID3v24Handler

This is used to parse ID3 Version 2.4 Tag information from an MP3 file, if available.

ID3v2Frame

A frame of ID3v2 data, which is then passed to a handler to be turned into useful data.

ID3v2Frame.RawTag

ID3v2Frame.TextEncoding

IDBWriter

IdentityHtmlMapper

Alternative HTML mapping rules that pass the input HTML as-is without any modifications.

IDMLParser

Adobe InDesign IDML Parser.

IFileProcessorFutureResult

stub interface to allow for different result types from different processors

IFSSHTTPBSerializable

FSSHTTPB Serialize interface.

ImageDeskew

ImageDeskew.HoughLine

ImageGraphicsEngine

Copied nearly verbatim from PDFBox

ImageGraphicsEngineFactory

ImageMetadataExtractor

Uses the Metadata Extractor library to read EXIF and IPTC image metadata and map to Tika fields.

ImageParser

ImageUtil

ImportContextImpl

ImportContextImpl...

IncludeFieldMetadataFilter

IncrementalUpdateRecord

Initializable

Components that must do special processing across multiple fields at initialization time should implement this interface.

InitializableProblemHandler

This is to be used to handle potential recoverable problems that might arise during initialization.

InputStreamDigester

InputStreamFactory

A factory which returns a fresh InputStream for the same resource each time.

InputStreamFactory

Interface to allow for custom/consistent creation of InputStream

IntermediateNodeObject

IntermediateNodeObject.RootNodeObjectBuilder

The class is used to build a root node object.

InterruptableParsingExample

This example demonstrates how to interrupt document parsing if some condition is met.

Interrupter

Class that waits for input on System.in.

InterrupterBuilder

Builds an Interrupter

InterrupterFutureResult

IOUtils

IPADetector

IParserFactoryBuilder

IProperty

The interface of the property in OneNote file.

IPTC

IPTC photo metadata schema.

IptcAnpaParser

Parser for IPTC ANPA News Wire Feeds

Interface for the specific Metadata to XMP converters

IWork13PackageParser

IWork13PackageParser.IWork13DocumentType

IWork18PackageParser

For now, this parser isn't even registered.

IWork18PackageParser.IWork18DocumentType

IWorkDetector

IWorkPackageParser

A parser for the IWork container files.

IWorkPackageParser.IWORKDocumentType

JackcessParser

Parser that handles Microsoft Access files via Jackcess

JarDetector

JCID

This class is used to represent a JCID

JCIDObject

This class is used to represent the JCID object.

JDBCEmitter

This is only an initial, basic implementation of an emitter for JDBC.

JDBCEmitter.AttachmentStrategy

JDBCEmitter.MultivaluedFieldStrategy

JDBCPipesIterator

Iterates through the results of a SQL call via JDBC.

JDBCPipesReporter

This is an initial draft of a JDBCPipesReporter.

JDBCTableReader

General base class to iterate through rows of a JDBC table

JDBCUtil

JDBCUtil.CREATE_TABLE

JempboxExtractor

JoshuaNetworkTranslator

This translator is designed to work with a TCP-IP available Joshua translation server, specifically the REST-based Joshua server.

JsonFetchEmitTupleList

JSONMessageBodyWriter

JsonMetadata

JsonMetadataDeserializer

JsonMetadataList

JsonMetadataSerializer

JSONObjWriter

JsonResponse

JsonStreamingSerializer

JXLParser

Tries to scrape XMP out of JXL

KafkaEmitter

Emits the now-parsed documents into a specified Apache Kafka topic.

LanguageAwareTokenCountStats<T>

Interface for calculators that require language probabilities and token stats

LanguageConfidence

LanguageDetectingParser

LanguageDetector

LanguageDetectorExample

LanguageDetectorTest

LanguageHandler

SAX content handler that updates a language detector based on all the received character content.

LanguageIdentifier

Identifier of the language that best matches a given content profile.

LanguageIDWrapper

LanguageNames

Support for language tags (as defined by https://tools.ietf.org/html/bcp47)

LanguageProfile

Language profile based on ngram counts.

LanguageProfilerBuilder

This class runs a ngram analysis over submitted text, results might be used for automatic language identification.

LanguageResource

LanguageResult

LanguageWriter

Writer that builds a language profile based on all the written content.

Latin1StringsParser

Parser to extract printable Latin1 strings from arbitrary files with pure java without running any external process.

LeafNodeObject

LeafNodeObject.IntermediateNodeObjectBuilder

The class is used to build an intermediate node object.

LeipzigHelper

LeipzigSampler

Lingo24LangDetector

An implementation of a Language Detector using the Premium MT API v1.

Lingo24Translator

An implementation of a REST client for the Premium MT API v1.

Link

LinkContentHandler

Content handler that collects links from an XHTML document.

LinkedCell

Linked cell.

ListDescriptor

Contains the information for a single list in the list or list override tables.

ListManager

Computes the number text which goes at the beginning of each list paragraph

LittleEndianBitConverter

Implement a converter which converts to/from little-endian byte arrays
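
Little-endian conversion is available in the JDK through ByteBuffer, which gives a compact way to illustrate what such a converter does. The class and method names below (LittleEndianSketch, toBytes, toInt32) are hypothetical and not taken from Tika.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

/** Illustrative only: int to and from little-endian byte arrays. */
final class LittleEndianSketch {

    /** Encodes value as 4 bytes, least significant byte first. */
    static byte[] toBytes(int value) {
        return ByteBuffer.allocate(Integer.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putInt(value)
                .array();
    }

    /** Decodes 4 bytes starting at offset, least significant byte first. */
    static int toInt32(byte[] bytes, int offset) {
        return ByteBuffer.wrap(bytes, offset, Integer.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt();
    }

    public static void main(String[] args) {
        byte[] encoded = toBytes(0x12345678);
        System.out.println(Arrays.toString(encoded)); // prints "[120, 86, 52, 18]"
        System.out.println(Integer.toHexString(toInt32(encoded, 0))); // prints "12345678"
    }
}
```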

LoadErrorHandler

Interface for error handling strategies in service class loading.

Location

LoggingPipesReporter

Simple PipesReporter that logs everything at the debug level.

LookaheadInputStream

Stream wrapper that makes it easy to read up to n bytes ahead from a stream that supports the mark feature.
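
The underlying mark/reset technique can be shown as a small helper. It is a sketch under stated assumptions: LookaheadSketch.peek is a hypothetical static method, whereas the class summarized above is a stream wrapper, so the shape of the real API differs.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative only: reads up to n bytes ahead, then rewinds the stream. */
final class LookaheadSketch {

    static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(n); // remember the current position for up to n bytes
        try {
            byte[] buf = new byte[n];
            int total = 0;
            while (total < n) {
                int read = in.read(buf, total, n - total);
                if (read == -1) {
                    break; // stream is shorter than n bytes
                }
                total += read;
            }
            return Arrays.copyOf(buf, total);
        } finally {
            in.reset(); // rewind so the caller sees the stream untouched
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("hello world".getBytes(StandardCharsets.US_ASCII));
        System.out.println(new String(peek(in, 5), StandardCharsets.US_ASCII)); // prints "hello"
        System.out.println((char) in.read()); // prints "h"
    }
}
```

A stream without mark support can be wrapped in java.io.BufferedInputStream before calling the helper.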

LuceneIndexer

LuceneIndexerExtended

LyricsHandler

This is used to parse Lyrics3 tag information from an MP3 file, if available.

MachineMetadata

Metadata for describing machines, such as their architecture, type and endian-ness

MachineMetadata.Endian

MagicDetector

Content type detection based on magic bytes, i.e.
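
Magic-byte detection compares the first bytes of a document with known signatures. The sketch below is hypothetical (MagicSketch is not a Tika class) and checks only three well-known prefixes at offset 0; a real detector typically supports offsets, masks, and a much larger signature table.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: maps a few well-known file signatures to media types. */
final class MagicSketch {
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();

    static {
        MAGIC.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        MAGIC.put("image/png", new byte[] {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A});
        MAGIC.put("application/zip", new byte[] {'P', 'K', 0x03, 0x04});
    }

    /** Returns the first matching type, or application/octet-stream if nothing matches. */
    static String detect(byte[] prefix) {
        for (Map.Entry<String, byte[]> entry : MAGIC.entrySet()) {
            if (startsWith(prefix, entry.getValue())) {
                return entry.getKey();
            }
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(detect("%PDF-1.7".getBytes(StandardCharsets.US_ASCII))); // prints "application/pdf"
        System.out.println(detect(new byte[] {0, 1, 2})); // prints "application/octet-stream"
    }
}
```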

MailDateParser

Dates in emails are a mess.

MailUtil

MappedBufferCleaner

Copied/pasted from the Apache Lucene/Solr project.

MarianTranslator

Translator that uses the Marian NMT decoder for translation.

MarianTranslator.MarianServerClient

Internal Client for marian-server Web Socket Server.

Matcher

XPath element matcher.

MatchingContentHandler

Content handler decorator that only passes the elements, attributes, and text nodes that match the given XPath expression.

MatParser

MboxParser

Mbox (mailbox) parser.

MediaType

Internet media type.

MediaTypeExample

MediaTypeRegistry

Registry of known Internet media types.

Message

A collection of Message related property names.

Metadata

A multi-valued metadata container.

MetadataAwareLuceneIndexer

Builds on the LuceneIndexer from Chapter 5 and adds indexing of Metadata.

MetadataExtractor

OOXML metadata extractor.

MetadataFields

Knows about all declared Metadata fields.

MetadataFilter

Filters the metadata in place after the parse

MetadataHandler

Deprecated.

Use the AttributeMetadataHandler and ElementMetadataHandler classes instead

MetadataList

wrapper class to make isWriteable in MetadataListMBW simpler

MetadataListMessageBodyWriter

MetadataResource

MetadataWriteFilter

MetadataWriteFilterFactory

MicrosoftTranslator

Wrapper class to access the Windows translation service.

MidiParser

MIFContentHandler

Content handler for MIF Content and Metadata.

MIFExtractor

Helper Class to Parse and Extract Adobe MIF Files.

Internet media type.

A class to encapsulate MimeType related exceptions.

MimeTypes

This class is a MimeType repository.

MimeTypesFactory

Creates instances of MimeTypes.

MimeTypesReader

A reader for XML files compliant with the freedesktop MIME-info DTD.

MimeTypesReaderMetKeys

Met Keys used by the MimeTypesReader.

MiscOLEDetector

A detector that works on a POIFS OLE2 document to figure out exactly what the file is.

MITIENERecogniser

This class offers an implementation of NERecogniser based on trained models using state-of-the-art information extraction tools.

MosesTranslator

Translator that uses the Moses decoder for translation.

MP3Frame

A frame in an MP3 file, such as ID3v2 Tags or some audio.

Mp3Parser

The Mp3Parser is used to parse ID3 Version 1 Tag information from an MP3 file, if available.

Mp3Parser.ID3TagsAndAudio

MP4Parser

Parser for the MP4 media container format, as well as the older QuickTime format that MP4 is based on.

MSEmbeddedStreamTranslator

MSOfficeBinaryConverter

Tika to XMP mapping for the binary MS formats Word (.doc), Excel (.xls) and PowerPoint (.ppt).

MSOfficeXMLConverter

Tika to XMP mapping for the Office Open XML formats Word (.docx), Excel (.xlsx) and PowerPoint (.pptx).

MSOneStorePackage

MSOneStoreParser

MSOwnerFileParser

Parser for temporary MSOffice files.

MuPDFRenderer

MyFirstTika

Demonstrates how to call the different components within Tika: its Detector framework (aka MIME identification and repository), its Parser interface, its org.apache.tika.language.LanguageIdentifier and other goodies.

NamedAttributeMatcher

Final evaluation state of a .../@name XPath expression.

NamedElementMatcher

Intermediate evaluation state of a .../name... XPath expression.

NamedEntityParser

This implementation of Parser extracts entity names from text content and adds them to the metadata.

NameDetector

Content type detection based on the resource name.

NameEntityExtractor

Namespace

Utility class to hold namespace information.

NERecogniser

Defines a contract for named entity recogniser.

NetCDFParser

A Parser for NetCDF files using the UCAR, MIT-licensed NetCDF for Java API.

NetworkParser

NLTKNERecogniser

This class offers an implementation of NERecogniser based on the ne_chunk() module of NLTK.

NNExampleModelDetector

NNTrainedModel

NNTrainedModelBuilder

NoData

Represents a property that contains no data.

NodeMatcher

Final evaluation state of a .../node() XPath expression.

NodeObject

NonDetectingEncodingDetector

Always returns the charset passed in via the initializer

NoOpFilter

This filter performs no operations on the metadata and leaves it untouched.

NoTextPDFRenderer

This class extends the PDFRenderer to exclude rendering of electronic text.

NSNormalizerContentHandler

Content handler decorator that maps old OpenOffice 1.0 namespaces to the OpenDocument ones and returns a fake DTD when the parser requests the OpenOffice DTD.

NumberCell

Number cell.

ObjectFromDOMAndQueueBuilder<T>

Same as ObjectFromDOMBuilder, but this is for objects that require access to the shared queue.

ObjectFromDOMBuilder<T>

Interface for things that build objects from a DOM Node and a map of runtime attributes

ObjectGroupData

The ObjectGroupData class.

ObjectGroupDataElementData

ObjectGroupDataElementData.Builder

The internal class for building a list of DataElement from a node object.

ObjectGroupDeclarations

Object Group Declarations

ObjectGroupMetadata

Specifies an object group metadata

ObjectGroupMetadataDeclarations

Object Metadata Declaration

ObjectGroupObjectBLOBDataDeclaration

object data BLOB declaration

ObjectGroupObjectData

ObjectGroupObjectDataBLOBReference

object data BLOB reference

ObjectGroupObjectDeclare

ObjectRecogniser

This is a contract for object recognisers used by ObjectRecognitionParser

ObjectRecognitionParser

This parser recognises objects from Images.

ObjectSpaceObjectPropSet

This class is used to represent an ObjectSpaceObjectPropSet.

ObjectSpaceObjectPropSet

ObjectSpaceObjectStreamHeader

ObjectSpaceObjectStreamOfContextIDs

This class is used to represent an ObjectSpaceObjectStreamOfContextIDs.

ObjectSpaceObjectStreamOfOIDs

This class is used to represent an ObjectSpaceObjectStreamOfOIDs.

ObjectSpaceObjectStreamOfOSIDs

This class is used to represent an ObjectSpaceObjectStreamOfOSIDs.

OfferLargerThanQueueSize

Office

Office Document properties collection.

OfficeOpenXMLCore

Core properties as defined in the Office Open XML specification part Two that are not in the DublinCore namespace.

OfficeOpenXMLExtended

Extended properties as defined in the Office Open XML specification part Four.

OfficeParser

Defines a Microsoft document content extractor.

OfficeParser.POIFSDocumentType

OfficeParserConfig

OfflineContentHandler

Content handler decorator that always returns an empty stream from the OfflineContentHandler.resolveEntity(String, String) method to prevent potential network or other external resources from being accessed by an XML parser.

OldExcelParser

A POI-powered Tika Parser for very old versions of Excel, from pre-OLE2 days, such as Excel 4.

OneByteOfData

This class is used to represent a property that contains 1 byte of data in the PropertySet.rgData stream field.

OneNoteParser

OneNote tika parser capable of parsing Microsoft OneNote files.

OneNotePropertyEnum

OneNoteTreeWalkerOptions

Options when walking the OneNote tree.

OOXMLExtractor

Interface implemented by all Tika OOXML extractors.

OOXMLExtractorFactory

Figures out the correct OOXMLExtractor for the supplied document and returns it.

OOXMLParser

Office Open XML (OOXML) parser.

OOXMLTikaBodyPartHandler

OOXMLWordAndPowerPointTextHandler

This class is intended to handle anything that might contain IBodyElements: main document, headers, footers, notes, slides, etc.

OOXMLWordAndPowerPointTextHandler.EditType

OOXMLWordAndPowerPointTextHandler.XWPFBodyContentsHandler

OPCPackageDetector

OPCPackageWrapper

This is a wrapper around OPCPackage that calls revert() instead of close().

OpenDocumentContentParser

Parser for ODF content.xml files.

OpenDocumentConverter

Tika to XMP mapping for the Open Document formats: Text (.odt), Spreadsheet (.ods), Graphics (.odg) and Presentation (.odp).

OpenDocumentDetector

OpenDocumentMetaParser

Parser for OpenDocument meta.xml files.

OpenDocumentParser

OpenOffice parser

OpenNLPDetector

This is based on OpenNLP's language detector.

OpenNLPMetadataFilter

OpenNLPNameFinder

An implementation of NERecogniser that finds names in text using Open NLP Model.

OpenNLPNERecogniser

This implementation of NERecogniser chains an array of OpenNLPNameFinders for which NER models are available in classpath.

OpenSearchClient

OpenSearchEmitter

OpenSearchEmitter.AttachmentStrategy

OpenSearchEmitter.UpdateStrategy

OpenSearchPipesReporter

As of the 2.5.0 release, this is an ALPHA version.

OPFParser

Use this to parse the .opf files

OptimaizeLangDetector

Implementation of the LanguageDetector API that uses https://github.com/optimaize/language-detector

OptimaizeMetadataFilter

OutlookExtractor

Outlook Message Parser.

OutlookExtractor.RECIPIENT_TYPE

OutlookPSTParser

Parser for MS Outlook PST email storage files

OutputStreamFactory

OverrideDetector

Deprecated.

after 2.5.0 this functionality was moved to the CompositeDetector

PackageConstants

PackageParser

Parser for various packaging formats.

PageBasedRenderResults

PagedText

XMP Paged-text schema.

PageRangeRequest

The range of pages to render.

ParagraphProperties

ParallelFileProcessingResult

Param<T>

This is a serializable model class for parameters from configuration file.

ParamField

This class stores metadata for Field annotations, which is used to map them to Param at runtime.

ParseContext

Parse context.

ParseContextConfig

Implementations must be thread-safe!

Parser

Tika parser interface.

ParserContainerExtractor

An implementation of ContainerExtractor powered by the regular Parser API.

ParserDecorator

Decorator base class for the Parser interface.

ParseRecord

Use this class to store exceptions, warnings and other information during the parse.

ParserFactory

ParserFactoryBuilder

ParserFactoryFactory

Lightweight, easily serializable class that contains enough information to build a ParserFactory

ParserPostProcessor

Parser decorator that post-processes the results from a decorated parser.

ParserUtils

Helper util methods for Parsers themselves.

ParsingEmbeddedDocumentExtractor

Helper class for parsers of package archives or other compound document formats that support embedded or attached component documents.

ParsingEmbeddedDocumentExtractorFactory

ParsingExample

ParsingReader

Reader for the text content from a given binary stream.

PasswordProvider

Interface for providing a password to a Parser for handling Encrypted and Password Protected Documents.

PasswordProviderConfig

PDDocumentRenderer

Stub interface for the PDFParser to use to figure out if it needs to pass on the PDDocument or create a temp file to be used by a file-based renderer down the road.

PDF

PDF properties collection.

PDFBoxRenderer

PDFMarkedContent2XHTML

This was added in Tika 1.24 as an alpha version of a text extractor that builds the text from the marked text tree and includes/normalizes some of the structural tags.

PDFParser

PDF parser.

PDFParserConfig

Config for PDFParser.

PDFParserConfig.IMAGE_STRATEGY

PDFParserConfig.OCR_RENDERING_STRATEGY

PDFParserConfig.OCR_STRATEGY

PDFParserConfig.OCRStrategyAuto

Encapsulate the numbers used to control OCR Strategy when set to auto

PDFRenderingState

PDFServerConfig

PDF parser configuration, for the request

PhoneExtractingContentHandler

Class used to extract phone numbers while parsing.

Photoshop

XMP Photoshop metadata schema.

PickBestTextEncodingParser

Deprecated.

Currently not suitable for real use, more a demo / prototype!

PipesClient

The PipesClient is designed to be single-threaded.

PipesConfig

PipesConfigBase

PipesException

Fatal exception that means that something went seriously wrong.

PipesIterator

Abstract class that handles the testing for timeouts/thread safety issues.

PipesParser

PipesReporter

This is called asynchronously by the AsyncProcessor.

PipesReporterBase

Base class that includes filtering by PipesResult.STATUS

PipesServer

This server is forked from the PipesClient.

PipesServer.STATUS

Pkcs7Parser

Basic parser for PKCS7 data.

PListParser

Parser for Apple's plist and bplist.

POIFSContainerDetector

A detector that works on a POIFS OLE2 document to figure out exactly what the file is.

POIXMLTextExtractorDecorator

PooledTimeSeriesParser

Uses the Pooled Time Series algorithm + command line tool, to generate a numeric representation of the video suitable for similarity searches.

PrescriptionParser

PrettyMetadataKeyComparator

ProbabilisticMimeDetectionSelector

Selector for combining different mime detection results based on probability

ProbabilisticMimeDetectionSelector.Builder

Builder class for setting the probability parameters.

ProcessUtils

ProduceTypeResourceComparator

Resource comparator based on produce type.

ProfilingWriter

Writer that builds a language profile based on all the written content.

Property

XMP property definition.

Property.PropertyType

Property.ValueType

PropertyID

This class is used to represent a PropertyID.

PropertySet

This class is used to represent a PropertySet.

PropertySetObject

This class is used to represent the property set.

PropertyType

PropertyTypeException

XMP property definition violation exception.

PropsUtil

Utility class to handle properties.

PrtArrayOfPropertyValues

The class is used to represent the prtArrayOfPropertyValues.

PrtFourBytesOfLengthFollowedByData

This class is used to represent the prtFourBytesOfLengthFollowedByData.

PRTParser

A basic text extracting parser for the CADKey PRT (CAD Drawing) format.

PSDParser

Parser for the Adobe Photoshop PSD File Format.

QuattroPro

QuattroPro properties collection.

QuattroProParser

Parser for Corel QuattroPro documents (part of Corel WordPerfect Office Suite).

RangeFetcher

This class extracts a range of bytes from a given fetch key.

RarParser

Parser for Rar files.

RDCAnalysisChunking

This class is used to process RDC analysis chunking

RecentFiles

Builds on top of the LuceneIndexer and the Metadata discussions in Chapter 6 to output an RSS (or RDF) feed of files crawled by the LuceneIndexer within the last N minutes.

RecognisedObject

A model for recognised objects from graphics and texts typically includes human readable label for the object, language of the label, id and confidence score.

RecursiveMetadataResource

RecursiveParserWrapper

This is a helper class that wraps a parser in a recursive handler.

RecursiveParserWrapperFSConsumer

This runs a RecursiveParserWrapper against an input file and outputs the json metadata to an output file.

RecursiveParserWrapperHandler

This is the default implementation of AbstractRecursiveParserWrapperHandler.

RegexCaptureParser

RegexNERecogniser

This class offers an implementation of NERecogniser based on Regular Expressions.

RegexUtils

Inspired by the Nutch class OutlinkExtractor.

Renderer

Interface for a renderer.

Rendering

RenderingParser

RenderingState

This should be used to track state for each file (embedded or otherwise).

RenderingTracker

Use this in the ParseContext to keep track of unique ids for rendered images in embedded docs.

RenderRequest

Empty interface for requests to a renderer.

ReplacementCharset

An implementation of the standard "replacement" charset defined by the W3C.

Report

This class represents a single report.

ReporterBuilder

Interface for reporter builders

RequestTypes

The enumeration of request type.

RereadableInputStream

Wraps an input stream, reading it only once, but making it available for rereading an arbitrary number of times.
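The read-once, reread-many idea can be illustrated with a minimal sketch. The class below, RereadSketch, is a hypothetical name and is not Tika's RereadableInputStream; it assumes the whole stream fits in memory and simply buffers it, whereas a production implementation would likely need to spill large inputs to disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration only; not the Tika class. */
final class RereadSketch {
    private final byte[] buffered;

    /** Consumes the source stream exactly once and keeps its bytes in memory. */
    RereadSketch(InputStream source) {
        buffered = drain(source);
    }

    /** Returns a fresh stream over the buffered bytes; may be called any number of times. */
    InputStream reread() {
        return new ByteArrayInputStream(buffered);
    }

    /** Reads a stream to its end and returns everything that was read. */
    static byte[] drain(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        RereadSketch stream = new RereadSketch(new ByteArrayInputStream(data));
        for (int pass = 1; pass <= 2; pass++) {
            String text = new String(drain(stream.reread()), StandardCharsets.UTF_8);
            System.out.println("pass " + pass + ": " + text);
        }
    }
}
```

Each call to reread() hands out an independent stream, so consumers do not interfere with each other's read position.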

ResultsReporter

RevisionManifest

RevisionManifestDataElementData

RevisionManifestObjectGroupReferences

Specifies revision manifest object group references, each followed by object group extended GUIDs.

RevisionManifestRootDeclare

Specifies a revision manifest root declare, each followed by root and object extended GUIDs

RevisionStoreObject

The class is used to represent the revision store object.

RevisionStoreObjectGroup

RFC822Parser

Uses apache-mime4j to parse emails.

RichTextContentHandler

Content handler for rich text; it extracts the alt attribute of XHTML <img/> tags and the name attribute of XHTML <a/> tags into the output.

RollbackSoftware

Demonstrates Tika and its ability to sense symlinks.

RTFConverter

Tika to XMP mapping for the RTF format.

RTFMetadata

RTFParser

RTF parser

RTGTranslator

This translator is designed to work with a TCP-IP available RTG translation server, specifically the REST-based RTG server.

RunProperties

WARNING: This class is mutable.

RuntimeSAXException

Use this to throw a SAXException in subclassed methods that don't throw SAXExceptions

S3Emitter

Emits to an existing S3 bucket.

S3Fetcher

Fetches files from S3.

S3PipesIterator

SafeContentHandler

Content handler decorator that makes sure that the character events (SafeContentHandler.characters(char[], int, int) or SafeContentHandler.ignorableWhitespace(char[], int, int)) passed to the decorated content handler contain only valid XML characters.
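What "only valid XML characters" means can be sketched with the Char production of XML 1.0. The helper below, XmlCharFilterSketch, is a hypothetical name and not the Tika class; replacing invalid code points with U+FFFD is an assumption made for illustration, and the real handler may substitute something else.

```java
/** Hypothetical illustration only; not the Tika class. */
final class XmlCharFilterSketch {

    /** Mirrors the XML 1.0 Char production: tab, LF, CR and three code point ranges. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces every invalid code point with U+FFFD (an assumed replacement). */
    static String sanitize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            out.appendCodePoint(isValidXmlChar(cp) ? cp : 0xFFFD);
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "a" + (char) 0 + "b";
        // The NUL character is not a valid XML character and is replaced.
        System.out.println(sanitize(dirty).equals("a\uFFFDb"));
    }
}
```

Iterating by code point rather than by char keeps valid surrogate pairs intact while still catching unpaired surrogates.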

SafeContentHandler.Output

Internal interface that allows both character and ignorable whitespace content to be filtered the same way.

SAS7BDATParser

Processes the SAS7BDAT data columnar database file used by SAS and other similar languages.

SecureContentHandler

Content handler decorator that attempts to prevent denial of service attacks against Tika parsers.

SentimentAnalysisParser

This parser classifies documents based on the sentiment of the document.

SequenceNumberGenerator

ServiceLoader

Internal utility class that Tika uses to look up service providers.

ServiceLoaderUtils

Service Loading and Ordering related utils

SiegfriedDetector

Simple wrapper around Siegfried (https://github.com/richardlehane/siegfried). The default behavior is to run detection, report the results in the metadata and then return null so that other detectors will be used.

SignatureObject

Signature Object

SimpleChunking

SimpleLogReporterBuilder

SimpleTextExtractor

SimpleThreadPoolExecutor

Simple Thread Pool Executor

SimpleTypeDetector

SlowCompositeReaderWrapper

COPIED VERBATIM FROM LUCENE. This class forces a composite reader (e.g. a MultiReader or DirectoryReader) to emulate a LeafReader.

SolrEmitter

SolrEmitter.AttachmentStrategy

SolrEmitter.UpdateStrategy

SolrPipesIterator

Iterates through results from a Solr query.

SourceCodeParser

Generic Source code parser for Java, Groovy, C++.

SpanSwapper

randomly swaps spans from the input

SpreadsheetMLParser

Parses SpreadsheetML 2003 format Excel files.

SpringExample

SQLite3Parser

This is the main class for parsing SQLite3 files.

StandardHtmlEncodingDetector

An encoding detector that tries to respect the spirit of the HTML spec part 12.2.3 "The input byte stream", or at least the part that is compatible with the implementation of tika.

StandardOrganizations

This class provides a collection of the most important technical standard organizations.

StandardReference

Class that represents a standard reference.

StandardReference.StandardReferenceBuilder

StandardsExtractingContentHandler

StandardsExtractingContentHandler is a Content Handler used to extract standard references while parsing.

StandardsExtractionExample

Class to demonstrate how to use the StandardsExtractingContentHandler to get a list of the standard references from every file in a directory.

StandardsText

StandardsText relies on regular expressions to extract standard references from text.

StandardWriteFilter

This is to be used to limit the amount of metadata that a parser can add based on the StandardWriteFilter.maxTotalEstimatedSize, StandardWriteFilter.maxFieldSize, StandardWriteFilter.maxValuesPerField, and StandardWriteFilter.maxKeySize.

StandardWriteFilterFactory

Factory class for StandardWriteFilter.

StarOfficeDetector

StartXRefOffset

StartXRefScanner

This is a first draft of a scanner to extract incremental updates out of PDFs.

StatefulParser

The RecursiveParserWrapper wraps the parser sent into the parsecontext and then uses that parser to store state (among many other things).

StatusReporter

Basic class to use for reporting status from both the crawler and the consumers.

StatusReporterBuilder

StatusReporterFutureResult

Empty class for what a StatusReporter returns when it finishes.

StoppingEarlyException

Sentinel exception to stop parsing xml once target is found while SAX parsing.

StorageIndexCellMapping

Specifies the storage index cell mappings (with cell identifier, cell mapping extended GUID, and cell mapping serial number)

StorageIndexDataElementData

StorageIndexManifestMapping

StorageIndexRevisionMapping

Specifies the storage index revision mappings (with revision and revision mapping extended GUIDs, and revision mapping serial number)

StorageManifestDataElementData

StorageManifestRootDeclare

Specifies one or more storage manifest root declare.

StorageManifestSchemaGUID

Specifies a storage manifest schema GUID

StrawManTikaAppDriver

Simple single-threaded class that calls tika-app against every file in a directory.

StreamEmitter

StreamGobbler

StreamingDetectContext

StreamingZipContainerDetector

Currently only used in tests.

StreamObject

StreamObjectHeaderEnd

StreamObjectHeaderEnd16bit

A 16-bit header for a compound object would indicate the end of a stream object

StreamObjectHeaderEnd8bit

An 8-bit header for a compound object would indicate the end of a stream object

StreamObjectHeaderStart

This class specifies the base class for 16-bit or 32-bit stream object header start

StreamObjectHeaderStart16bit

A 16-bit header for a compound object would indicate the start of a stream object

StreamObjectHeaderStart32bit

A 32-bit header for a compound object would indicate the start of a stream object

StreamObjectParseErrorException

StreamObjectTypeHeaderEnd

StreamObjectTypeHeaderStart

The enumeration of the stream object type header start

StreamOutRPWFSConsumer

This uses the JsonStreamingSerializer to write out a single metadata object at a time.

StringsConfig

Configuration for the "strings" (or strings-alternative) command.

StringsEncoding

Character encoding of the strings that are to be found using the "strings" command.

StringsParser

Parser that uses the "strings" (or strings-alternative) command to find the printable strings in an object file or other binary file (application/octet-stream).

StringStatsCalculator<T>

Interface for calculators that require a string

StringUtils

SubtreeMatcher

Evaluation state of a ...//... XPath expression.

SummaryExtractor

Extractor for Common OLE2 (HPSF) metadata

SupplementingParser

Runs the input stream through all available parsers, merging the metadata from them based on the AbstractMultipleParser.MetadataPolicy chosen.

SXSLFPowerPointExtractorDecorator

SAX/Streaming pptx extractor

SXWPFWordExtractorDecorator

This is an experimental, alternative extractor for docx files.

SystemUtils

Copied from commons-lang to avoid requiring the dependency

TableInfo

TaggedContentHandler

A content handler decorator that tags potential exceptions so that the handler that caused the exception can easily be identified.

TaggedSAXException

A SAXException wrapper that tags the wrapped exception with a given object reference.

TailStream

A specialized input stream implementation which records the last portion read from an underlying stream.
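Recording the last portion of a stream is commonly done with a fixed-size ring buffer. The sketch below, TailSketch, is a hypothetical name and not Tika's TailStream; it assumes a positive tail size and does not account for bytes passed over with skip().

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration only; not the Tika class. */
final class TailSketch extends FilterInputStream {
    private final byte[] ring; // circular buffer holding the most recent bytes
    private long total;        // number of bytes read so far

    TailSketch(InputStream in, int tailSize) {
        super(in);
        if (tailSize <= 0) {
            throw new IllegalArgumentException("tailSize must be positive");
        }
        ring = new byte[tailSize];
    }

    private void record(byte b) {
        ring[(int) (total % ring.length)] = b;
        total++;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            record((byte) b);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        for (int i = 0; i < n; i++) {
            record(buf[off + i]);
        }
        return n;
    }

    /** Returns the most recently read bytes, oldest first. */
    byte[] tail() {
        int size = (int) Math.min(total, ring.length);
        byte[] out = new byte[size];
        long start = total - size;
        for (int i = 0; i < size; i++) {
            out[i] = ring[(int) ((start + i) % ring.length)];
        }
        return out;
    }

    /** Convenience helper: reads all of data and returns its recorded tail. */
    static byte[] tailOf(byte[] data, int tailSize) {
        try (TailSketch in = new TailSketch(new ByteArrayInputStream(data), tailSize)) {
            byte[] chunk = new byte[4];
            while (in.read(chunk) != -1) {
                // keep reading; bytes are recorded as a side effect
            }
            return in.tail();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] tail = tailOf("abcdefgh".getBytes(StandardCharsets.UTF_8), 3);
        System.out.println(new String(tail, StandardCharsets.UTF_8));
    }
}
```

A tail buffer like this is useful when trailing metadata has to be inspected after the stream has already been consumed by another reader.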

TarWriter

TaskStatus

TeeContentHandler

Content handler proxy that forwards the received SAX events to zero or more underlying content handlers.
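The forwarding idea can be sketched with the standard SAX interfaces. TeeSketch and its nested TextCollector are hypothetical names, not the Tika class; for brevity the sketch forwards only three events, where a full proxy would forward every ContentHandler callback.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical illustration only; not the Tika class. */
final class TeeSketch extends DefaultHandler {
    private final ContentHandler[] targets;

    /** Accepts zero or more handlers that will each receive every forwarded event. */
    TeeSketch(ContentHandler... targets) {
        this.targets = targets;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        for (ContentHandler h : targets) {
            h.startElement(uri, local, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        for (ContentHandler h : targets) {
            h.endElement(uri, local, qName);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        for (ContentHandler h : targets) {
            h.characters(ch, start, length);
        }
    }

    /** Minimal handler that accumulates character content. */
    static final class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Sends one characters event through a tee with two collectors. */
    static String demo(String input) {
        TextCollector first = new TextCollector();
        TextCollector second = new TextCollector();
        TeeSketch tee = new TeeSketch(first, second);
        try {
            char[] chars = input.toCharArray();
            tee.characters(chars, 0, chars.length);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return first.text + "|" + second.text;
    }

    public static void main(String[] args) {
        System.out.println(demo("hi"));
    }
}
```

With zero targets the tee silently discards events, which matches the "zero or more" wording of the summary.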

TEIDOMParser

TemporaryResources

Utility class for tracking and ultimately closing or otherwise disposing a collection of temporary resources.

TensorflowImageRecParser

This is an implementation of ObjectRecogniser powered by Tensorflow convolutional neural network (CNN).

TensorflowRESTCaptioner

Tensorflow image captioner.

TensorflowRESTRecogniser

Tensor Flow image recogniser which has high performance.

TensorflowRESTVideoRecogniser

Tensor Flow video recogniser which has high performance.

TesseractOCRConfig

Configuration for TesseractOCRParser.

TesseractOCRConfig.OUTPUT_TYPE

TesseractOCRParser

TesseractOCRParser powered by tesseract-ocr engine.

TesseractServerConfig

Tesseract configuration, for the request

TextAndAttributeContentHandler

TextAndAttributeXMLParser

TextAndCSVParser

Unless the TikaCoreProperties.CONTENT_TYPE_USER_OVERRIDE is set, this parser tries to assess whether the file is a text file, csv or tsv.

TextCell

Text cell.

TextContentHandler

Content handler decorator that only passes the TextContentHandler.characters(char[], int, int) and TextContentHandler.ignorableWhitespace(char[], int, int) events (plus TextContentHandler.startDocument() and TextContentHandler.endDocument()) to the decorated content handler.

TextDetector

Content type detection of plain text documents.

TextLangDetector

Language Detection using MIT Lincoln Lab’s Text.jl library https://github.com/trevorlewis/TextREST.jl

TextMatcher

Final evaluation state of a .../text() XPath expression.

TextMessageBodyWriter

Returns simple text string for a particular metadata value.

TextOnlyPDFRenderer

This class extends the PDFRenderer to render only the textual elements

TextProfileSignature

Copied nearly directly from Apache Nutch: https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/TextProfileSignature.java

TextSha256Signature

Calculates the base32 encoded SHA-256 checksum on the analyzed text

TextStatistics

Utility class for computing a histogram of the bytes seen in a stream.

TextStatsCalculator

Base text stats interface

TextStatsFromTikaEval

These examples create a new CompositeTextStatsCalculator for each call.

TIAParsingExample

TIFF

XMP Exif TIFF schema.

TiffParser

Tika

Facade class for accessing Tika functionality.

TikaActivator

Bundle activator that adjusts the class loading mechanism of the ServiceLoader class to work correctly in an OSGi environment.

TikaAsyncCLI

TikaCLI

Simple command line interface for Apache Tika.

TikaClient

TikaClientCLI

TikaClientConfigException

TikaClientException

TikaConfig

Parse xml config file.

TikaConfigException

Tika Config Exception is an exception that occurs when there is an error in the Tika config file and/or one or more of the parsers failed to initialize from that erroneous config.

TikaConfigSerializer

TikaConfigSerializer.Mode

TikaCoreProperties

Contains a core set of basic Tika metadata properties, which all parsers will attempt to supply (where the file format permits).

TikaCoreProperties.EmbeddedResourceType

A file might contain different types of embedded documents.

TikaDetectors

Provides details of all the Detectors registered with Apache Tika, similar to --list-detectors with the Tika CLI.

TikaEmitterException

TikaEmitterResult

TikaEvalCLI

TikaEvalMetadataFilter

TikaEvalResource

TikaExcelDataFormatter

Overrides Excel's General format to include more significant digits than the MS Spec allows.

TikaExcelGeneralFormat

A Format that allows up to 15 significant digits for integers.

TikaException

Tika exception

TikaFileTypeDetector

TikaGUI

Simple Swing GUI for Apache Tika.

TikaInputStream

Input stream with extended capabilities.

TikaLanguageDetector

This is Tika's original legacy, homegrown language detector.

TikaLoggingFilter

TikaMemoryLimitException

TikaMimeKeys

A collection of Tika metadata keys used in Mime Type resolution

TikaMimeTypes

Provides details of all the mimetypes known to Apache Tika, similar to --list-supported-types with the Tika CLI.

TikaMp4BoxHandler

TikaPagedText

Metadata properties for paged text, metadata appropriate for an individual page (useful for embedded document handlers called on individual pages).

TikaParsers

Provides details of all the Parsers registered with Apache Tika, similar to --list-parsers and --list-parser-details within the Tika CLI.

TikaResource

TikaServerCli

TikaServerClientConfig

TikaServerConfig

TikaServerParseException

Simple wrapper exception to be thrown for consistent handling of exceptions that can happen during a parse.

TikaServerParseExceptionMapper

TikaServerProcess

TikaServerResource

Stub interface to allow for loading of resources via SPI

TikaServerStatus

TikaServerWatchDog

TikaServerWriter<T>

Stub interface to allow for SPI loading from other modules without opening up service loading to any generic MessageBodyWriter

TikaTaskTimeout

TikaTimeoutException

Runtime/unchecked version of TimeoutException

Provides a basic welcome to the Apache Tika Server.

TMXContentHandler

Content Handler for Translation Memory eXchange (TMX) files.

TMXParser

Parser for Translation Memory eXchange (TMX) files.

TNEFParser

A POI-powered Tika Parser for TNEF (Transport Neutral Encoding Format) messages, aka winmail.dat

ToHTMLContentHandler

SAX event handler that serializes the HTML document to a character stream.

TokenContraster

Computes some corpus contrast statistics.

TokenCounter

Deprecated.

use CompositeTextStatsCalculator with TokenEntropy, TokenLengths and TopNTokens.

TokenCountPriorityQueue

TokenCounts

TokenCountStatsCalculator<T>

Interface for calculators that require token stats

TopCommonTokenCounter

Utility class that reads in a UTF-8 input file with one document per row and outputs the 20000 tokens with the highest document frequencies.
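Document frequency counts how many documents contain a token, not how often the token occurs overall. The sketch below, DocFrequencySketch, is a hypothetical name and not the Tika utility; it assumes lowercase whitespace tokenization and breaks ties alphabetically, both of which are illustrative choices only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration only; not the Tika class. */
final class DocFrequencySketch {

    /** Returns up to limit tokens ordered by descending document frequency. */
    static List<String> topTokens(List<String> rows, int limit) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (String row : rows) {
            // Each row is one document; a token counts at most once per document.
            Set<String> seen = new HashSet<>();
            for (String token : row.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!token.isEmpty() && seen.add(token)) {
                    docFreq.merge(token, 1, Integer::sum);
                }
            }
        }
        List<String> tokens = new ArrayList<>(docFreq.keySet());
        tokens.sort((a, b) -> {
            int byCount = Integer.compare(docFreq.get(b), docFreq.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(tokens.subList(0, Math.min(limit, tokens.size())));
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("the cat the cat", "the dog", "a dog");
        // "the" and "dog" each appear in two documents; "cat" and "a" in one.
        System.out.println(topTokens(rows, 2));
    }
}
```

In the sample rows, "the" occurs three times in total but in only two documents, so it ties with "dog" rather than outranking it.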

TopNTokens

TotalCounter

Interface for pipesiterators that allow counting of total documents.

TotalCountResult

TotalCountResult.STATUS

ToTextContentHandler

SAX event handler that writes all character content out to a character stream.

ToXMLContentHandler

SAX event handler that serializes the XML document to a character stream.

TrainedModel

TrainedModelDetector

TrainTestSplit

TranscribeTranslateExample

This example demonstrates primitive logic for chaining Tika API calls.

Transformer

TranslateResource

Translator

Interface for Translator services.

TranslatorExample

TrecDocumentGenerator

Generates document summaries for corpus analysis in the Open Relevance project.

TrueTypeParser

Parser for TrueType font files (TTF).

Truncator

TSDParser

Tika parser for Time Stamped Data Envelope (application/timestamped-data)

TwoBytesOfData

This class is used to represent a property that contains 2 bytes of data in the PropertySet.rgData stream field.

TXTParser

Plain text parser.

TypeDetector

Content type detection based on a content type hint.

UByte

The unsigned byte type

UInteger

The unsigned int type

ULong

The unsigned long type

UMath

UnicodeBlockCounter

UniversalEncodingDetector

UnpackerResource

UnrarParser

Parser for Rar files.

Unsigned

A utility class for static access to unsigned number functionality.

UnsupportedFormatException

Parsers should throw this exception when they encounter a file format that they do not support.

UNumber

A base type for unsigned numbers.

URLEmailNormalizingFilterFactory

Factory for filter that normalizes urls and emails to __url__ and __email__ respectively.

UrlFetcher

Simple fetcher for URLs.

UShort

The unsigned short type

UuidUtils

VectorGraphicsOnlyPDFRenderer

This class extends the PDFRenderer to render only the vector graphics elements

WMFParser

This parser offers a very rough capability to extract text if there is text stored in the WMF files.

Word2006MLParser

WordExtractor

WordExtractor.TagAndStyle

WordMLParser

Parses WordML 2003 format Word files.

WordPerfect

WordPerfect properties collection.

WordPerfectParser

Parser for Corel WordPerfect documents.

WriteLimiter

WriteLimitReachedException

WriteOutContentHandler

SAX event handler that writes content up to an optional write limit out to a character stream or other decorated handler.

XHTMLContentHandler

Content handler decorator that simplifies the task of producing XHTML events for Tika content parsers.

XLIFF12ContentHandler

Content Handler for XLIFF 1.2 documents.

XLIFF12Parser

Parser for XLIFF 1.2 files.

XLSXHREFFormatter

XLZParser

Parser for XLZ Archives.

XMLDOMUtil

XMLErrorLogUpdater

This is a very task specific class that reads a log file and updates the "comparisons" table.

XML parser.

Utility functions for reading XML.

XmlRootExtractor

Utility class that uses a SAXParser to determine the namespace URI and local name of the root element of an XML file.
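Finding the root element with SAX only requires the first startElement event, after which parsing can be abandoned. The sketch below, RootElementSketch, is a hypothetical name and not the Tika class; it assumes the JDK's default SAX parser, enables secure processing as a basic precaution, and returns null for input that fails before any element is seen.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.namespace.QName;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical illustration only; not the Tika class. */
final class RootElementSketch {

    /** Returns the namespace URI and local name of the root element, or null. */
    static QName rootOf(byte[] xml) {
        final QName[] root = new QName[1];
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            factory.newSAXParser().parse(new ByteArrayInputStream(xml), new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes atts)
                        throws SAXException {
                    root[0] = new QName(uri, local);
                    // Abort deliberately: nothing after the root start tag is needed.
                    throw new SAXException("root element found");
                }
            });
        } catch (Exception e) {
            // Either the deliberate abort above or a genuine parse failure;
            // root[0] distinguishes the two cases.
        }
        return root[0];
    }

    public static void main(String[] args) {
        String xml = "<a:root xmlns:a=\"urn:x\"><child/></a:root>";
        QName root = rootOf(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(root.getNamespaceURI() + " " + root.getLocalPart());
    }
}
```

Stopping at the first element keeps the cost independent of document size, which matters when the result is only used for type detection.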

XMP

XMPContentHandler

Content handler decorator that simplifies the task of producing XMP output.

XMPDM

XMP Dynamic Media schema.

XMPDM.ChannelTypePropertyConverter

Deprecated.

Experimental method, will change shortly

XMPIdq

XMPMessageBodyWriter

XMPMetadata

Provides a conversion of the Metadata map from Tika to the XMP data model by also providing the Metadata API for clients to ease transition.

XMPMetadataExtractor

XMP Metadata Extractor based on Apache XmpBox.

XMPMetadataResource

XMPMM

XMPPacketScanner

This class is a parser for XMP packets.

XMPRights

XMP Rights management schema.

This is somewhat of a hack to handle the older pdfx: See also the more modern XMPSchemaPDFXId

XMPSchemaPDFXId

XPathParser

Parser for a very simple XPath subset.

XPSExtractorDecorator

XPSTextExtractor

Currently, mostly a pass-through class to hold pkg and properties and keep the general framework similar to our other POI-integrated extractors.

XSLFEventBasedPowerPointExtractor

XSLFPowerPointExtractorDecorator

XSSFBExcelExtractorDecorator

XSSFExcelExtractorDecorator

XSSFExcelExtractorDecorator.HeaderFooterFromString

XSSFExcelExtractorDecorator.SheetTextAsHTML

Turns formatted sheet events into HTML

XSSFExcelExtractorDecorator.XSSFSheetInterestingPartsCapturer

Captures information on interesting tags, whilst delegating the main work to the formatting handler

XUserDefinedCharset

XUserDefinedCharset.NotImplementedException

XWPFEventBasedWordExtractor

Experimental class that is based on POI's XSSFEventBasedExcelExtractor

XWPFListManager

XWPFNumberingShim

Stub class of POI's XWPFNumbering because onDocumentRead() is protected

XWPFStylesShim

For Tika, all we need (so far) is a mapping between styleId and a style's name.

XWPFWordExtractorDecorator

YandexTranslator

An implementation of a REST client for the YANDEX Translate API.

ZeroByteFileException

Exception thrown by the AutoDetectParser when a file contains zero bytes.

ZeroByteFileException.IgnoreZeroByteFileException

ZeroSizeFileDetector

Detector to identify zero length files as application/x-zerovalue

ZipContainerDetector

Classes that implement this must be able to detect on a ZipFile and in streaming mode.

ZipFilesChunking

This class is used to process zip file chunking

ZipHeader

ZipListFiles

Example code listing from Chapter 1.

ZipSalvager

ZipWriter