All Classes
Class |
Description |
AbstractConsumersBuilder |
|
AbstractConverter |
Base class for Tika Metadata to XMP converter which provides some needed common functionality.
|
AbstractEncodingDetectorParser |
|
AbstractFSConsumer |
|
AbstractListManager |
|
AbstractOfficeParser |
|
AbstractOOXMLExtractor |
Base class for all Tika OOXML extractors.
|
AbstractParser |
Abstract base class for new parsers.
|
AbstractProfiler |
|
AbstractProfiler.EXCEPTION_TYPE |
|
AbstractProfiler.PARSE_ERROR_TYPE |
If information was gathered from the log file about
a parse error
|
AbstractRecursiveParserWrapperHandler |
|
AbstractTranslator |
|
AbstractXML2003Parser |
|
AccessChecker |
Checks whether or not a document allows extraction generally
or extraction for accessibility only.
|
AccessPermissionException |
Exception to be thrown when a document does not allow content extraction.
|
AccessPermissions |
Until we can find a common standard, we'll use these options.
|
Activator |
|
AdobeFontMetricParser |
Parser for AFM Font Files
|
AdvancedTypeDetector |
|
AgeRecogniser |
Parser for extracting features from text.
|
AgeRecogniserConfig |
Stores URL for AgePredictor
|
AlphaIdeographFilterFactory |
Factory for filter that only allows tokens with characters that "isAlphabetic" or "isIdeographic" through.
|
AnalyzerManager |
|
AnnotationUtils |
This class contains utilities for dealing with Tika annotations.
|
AppleSingleFileParser |
Parser that strips the header off of AppleSingle and AppleDouble
files.
|
AppParserFactoryBuilder |
|
AttributeDependantMetadataHandler |
This adds a Metadata entry for a given node.
|
AttributeMatcher |
Final evaluation state of a .../@* XPath expression.
|
AttributeMetadataHandler |
SAX event handler that maps the contents of an XML attribute into
a metadata field.
|
AudioFrame |
An Audio Frame in an MP3 file.
|
AudioParser |
|
AutoDetectParser |
|
AutoDetectParserFactory |
Simple class for AutoDetectParser
|
AutoDetectParserFactory |
Factory for an AutoDetectParser
|
AutoDetectReader |
An input stream reader that automatically detects the character encoding
to be used for converting bytes to characters.
|
BasicContentHandlerFactory |
Basic factory for creating common types of ContentHandlers
|
BasicContentHandlerFactory.HANDLER_TYPE |
Common handler types for content.
|
BasicTikaFSConsumer |
Basic FileResourceConsumer that reads files from an input
directory and writes content to the output directory.
|
BasicTikaFSConsumersBuilder |
|
BasicTokenCountStatsCalculator |
|
BatchNoRestartError |
FileResourceConsumers should throw this if something
catastrophic has happened and the BatchProcess should shut down
and not be restarted.
|
BatchProcess |
This is the main processor class for a single process.
|
BatchProcess.BATCH_CONSTANTS |
|
BatchProcessBuilder |
Builds a BatchProcessor from a combination of runtime arguments and the
config file.
|
BatchProcessDriverCLI |
|
BatchTopCommonTokenCounter |
Utility class that runs TopCommonTokenCounter against a directory
of table files (named {lang}_table.gz or Leipzig-like afr_...-sentences.txt)
and outputs common tokens files for each input table file in the output directory.
|
BodyContentHandler |
Content handler decorator that only passes everything inside
the XHTML <body/> tag to the underlying handler.
|
BoilerpipeContentHandler |
Uses the boilerpipe
library to automatically extract the main content from a web page.
|
BouncyCastleDigester |
Digester that relies on BouncyCastle for MessageDigest implementations.
|
BoundedInputStream |
Very slight modification of Commons' BoundedInputStream
so that we can figure out if this hit the bound or not.
|
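A minimal sketch of the bounded-stream idea described for BoundedInputStream, using only the JDK. The class name `BoundSketchInputStream`, the `drain` helper, and the exact semantics of `hasHitBound` are assumptions for illustration, not Tika's or Commons' implementation. The sketch does not bound `skip()`.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical sketch, not Tika's class: stops after max bytes and remembers that it did. */
final class BoundSketchInputStream extends FilterInputStream {
    private final long max;
    private long pos = 0;

    BoundSketchInputStream(long max, InputStream in) {
        super(in);
        this.max = max;
    }

    @Override
    public int read() throws IOException {
        if (pos >= max) {
            return -1; // report end of stream once the bound is reached
        }
        int b = in.read();
        if (b != -1) {
            pos++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (pos >= max) {
            return -1;
        }
        int n = in.read(buf, off, (int) Math.min(len, max - pos));
        if (n > 0) {
            pos += n;
        }
        return n;
    }

    /** True once max bytes were delivered, including when the source was exactly max bytes long. */
    boolean hasHitBound() {
        return pos >= max;
    }

    /** Demo helper: drains data through the bound and returns {bytesRead, hitBound ? 1 : 0}. */
    static long[] drain(byte[] data, long max) {
        try (BoundSketchInputStream s =
                 new BoundSketchInputStream(max, new ByteArrayInputStream(data))) {
            long n = 0;
            while (s.read() != -1) {
                n++;
            }
            return new long[] {n, s.hasHitBound() ? 1 : 0};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller can check `hasHitBound()` after reading to decide whether the content was probably truncated.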
BPGParser |
Parser for the Better Portable Graphics (BPG) File Format.
|
CachedTranslator |
CachedTranslator.
|
CaptionObject |
A model for caption objects from graphics and texts; it typically includes a
human-readable sentence, the language of the sentence, and a confidence score.
|
Cell |
Cell of content.
|
CellDecorator |
Cell decorator.
|
CharsetDetector |
CharsetDetector provides a facility for detecting the
charset or encoding of character data in an unknown format.
|
CharsetMatch |
This class represents a charset that has been identified by a CharsetDetector
as a possible encoding for a set of input data.
|
CharsetUtils |
|
ChildMatcher |
Intermediate evaluation state of a .../*... XPath expression.
|
ChmAccessor<T> |
Defines an accessor interface
|
ChmAssert |
Contains chm extractor assertions
|
ChmBlockInfo |
A container that contains chm block information such as: i.
|
ChmCommons |
|
ChmCommons.EntryType |
Represents entry types: uncompressed, compressed
|
ChmCommons.IntelState |
Represents intel file states during decompression
|
ChmCommons.LzxState |
Represents lzx states: started decoding, not started decoding
|
ChmConstants |
|
ChmDirectoryListingSet |
Holds chm listing entries
|
ChmExtractor |
Extracts text from chm file.
|
ChmItsfHeader |
The header:
0000: char[4] 'ITSF'
0004: DWORD 3 (Version number)
0008: DWORD Total header length, including header section table and following data.
|
ChmItspHeader |
Directory header. The directory starts with a header; its format is as follows:
0000: char[4] 'ITSP'
0004: DWORD Version number 1
0008: DWORD Length of the directory header
000C: DWORD $0a (unknown)
0010: DWORD $1000 Directory chunk size
0014: DWORD "Density" of quickref section, usually 2
0018: DWORD Depth of the index tree: 1 if there is no index, 2 if there is one level of PMGI chunks
001C: DWORD Chunk number of root index chunk, -1 if there is none (though at least one file has 0 despite there being no index chunk, probably a bug)
0020: DWORD Chunk number of first PMGL (listing) chunk
0024: DWORD Chunk number of last PMGL (listing) chunk
0028: DWORD -1 (unknown)
002C: DWORD Number of directory chunks (total)
0030: DWORD Windows language ID
0034: GUID {5D02926A-212E-11D0-9DF9-00A0C922E6EC}
0044: DWORD $54 (This is the length again)
0048: DWORD -1 (unknown)
004C: DWORD -1 (unknown)
0050: DWORD -1 (unknown)
|
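The ITSP offsets listed for ChmItspHeader can be read with a `ByteBuffer`. The following is a sketch under stated assumptions: DWORDs are taken to be little-endian, and the class name `ItspHeaderSketch` and its field names are hypothetical, not Tika's API. Only a few of the listed fields are decoded.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical reader for a few ITSP header fields; not Tika's ChmItspHeader. */
final class ItspHeaderSketch {
    static final int HEADER_LENGTH = 0x54; // last DWORD starts at 0x50

    final int version;        // offset 0x04
    final int headerLength;   // offset 0x08
    final int chunkSize;      // offset 0x10
    final int indexDepth;     // offset 0x18
    final int rootIndexChunk; // offset 0x1C, -1 if there is none

    private ItspHeaderSketch(int version, int headerLength, int chunkSize,
                             int indexDepth, int rootIndexChunk) {
        this.version = version;
        this.headerLength = headerLength;
        this.chunkSize = chunkSize;
        this.indexDepth = indexDepth;
        this.rootIndexChunk = rootIndexChunk;
    }

    static ItspHeaderSketch parse(byte[] data) {
        if (data.length < HEADER_LENGTH) {
            throw new IllegalArgumentException("ITSP header needs 0x54 bytes");
        }
        String signature = new String(data, 0, 4, StandardCharsets.US_ASCII);
        if (!"ITSP".equals(signature)) {
            throw new IllegalArgumentException("not an ITSP header: " + signature);
        }
        // Assumption: multi-byte integers are stored little-endian.
        ByteBuffer b = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        return new ItspHeaderSketch(
            b.getInt(0x04), b.getInt(0x08), b.getInt(0x10), b.getInt(0x18), b.getInt(0x1C));
    }
}
```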
ChmLzxBlock |
Decompresses a chm block.
|
ChmLzxcControlData |
::DataSpace/Storage//ControlData This file contains $20 bytes of
information on the compression.
|
ChmLzxcResetTable |
LZXC reset table For ensuring a decompression.
|
ChmLzxState |
|
ChmParser |
|
ChmParsingException |
|
ChmPmgiHeader |
Note: an index chunk does not always exist. An index chunk has the following format:
0000: char[4] 'PMGI'
0004: DWORD Length of quickref/free area at end of directory chunk
0008: Directory index entries (to quickref/free area)
The quickref area in a PMGI is the same as in a PMGL.
The format of a directory index entry is as follows:
BYTE: length of name
BYTEs: name (UTF-8 encoded)
ENCINT: directory listing chunk which starts with name
Encoded Integers aka ENCINT: An ENCINT is a variable-length integer.
|
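The ENCINT mentioned for ChmPmgiHeader can be decoded in a few lines. This sketch assumes the commonly documented CHM encoding, which the summary above does not spell out: seven payload bits per byte, most significant group first, with the high bit set on every byte except the last. `EncintSketch` is a hypothetical name.

```java
/** Hypothetical ENCINT decoder; the encoding rules are an assumption stated in the lead-in. */
final class EncintSketch {
    /** Decodes one ENCINT starting at offset and returns {value, bytesConsumed}. */
    static long[] decode(byte[] data, int offset) {
        long value = 0;
        int i = offset;
        while (true) {
            if (i >= data.length) {
                throw new IllegalArgumentException("truncated ENCINT");
            }
            if (i - offset >= 9) {
                throw new IllegalArgumentException("ENCINT too long for a 63-bit value");
            }
            int b = data[i++] & 0xFF;
            value = (value << 7) | (b & 0x7F); // append the next 7 payload bits
            if ((b & 0x80) == 0) {
                break;                         // high bit clear: this was the last byte
            }
        }
        return new long[] {value, i - offset};
    }
}
```

Returning the consumed byte count lets a caller walk consecutive ENCINT fields in one entry.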
ChmPmglHeader |
There are two types of directory chunks: index chunks and
listing chunks.
|
ChmSection |
|
ChmWrapper |
|
CJKBigramAwareLengthFilterFactory |
Creates a very narrowly focused TokenFilter that limits tokens based on length
_unless_ they've been identified as <DOUBLE> or <SINGLE>
by the CJKBigramFilter.
|
ClassLoaderUtil |
|
ClassParser |
Parser for Java .class files.
|
CleanPhoneText |
Class to help de-obfuscate phone numbers in text.
|
ClimateForcast |
|
ClosedInputStream |
Closed input stream.
|
CloseShieldInputStream |
Proxy stream that prevents the underlying input stream from being closed.
|
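The close-shield idea described for CloseShieldInputStream can be sketched with a `FilterInputStream` that swallows `close()`. `CloseShieldSketch` and its demo helper are hypothetical names; swapping in an empty stream after close is a choice made for this sketch, not a statement about Tika's code.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical sketch: closing the shield leaves the wrapped stream open. */
final class CloseShieldSketch extends FilterInputStream {
    CloseShieldSketch(InputStream in) {
        super(in);
    }

    @Override
    public void close() {
        // Detach from the real stream instead of closing it; later reads on the shield see EOF.
        in = new ByteArrayInputStream(new byte[0]);
    }

    /** Demo helper: true if the underlying stream is still open and readable after the shield closes. */
    static boolean underlyingSurvivesClose() {
        boolean[] closed = {false};
        ByteArrayInputStream raw = new ByteArrayInputStream(new byte[] {1, 2, 3}) {
            @Override
            public void close() throws IOException {
                closed[0] = true;
                super.close();
            }
        };
        try (InputStream shield = new CloseShieldSketch(raw)) {
            shield.read(); // consumes the first byte through the shield
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return !closed[0] && raw.read() == 2;
    }
}
```

This pattern is useful when a stream is handed to code that closes its input but the caller still needs the stream afterwards.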
ColInfo |
|
Cols |
|
CommandLineParserBuilder |
Reads configurable options from a config file and returns org.apache.commons.cli.Options
object to be used in commandline parser.
|
CommonsDigester |
|
CommonsDigester.DigestAlgorithm |
|
CommonTokenCountManager |
|
CommonTokenOverlapCounter |
|
CommonTokenResult |
|
CommonTokens |
|
CommonTokensBhattacharyya |
|
CommonTokensCosine |
|
CommonTokensHellinger |
|
CommonTokensKLDivergence |
|
CommonTokensKLDNormed |
|
CompositeDetector |
Content type detector that combines multiple different detection mechanisms.
|
CompositeDigester |
|
CompositeEncodingDetector |
|
CompositeExternalParser |
A Composite Parser that wraps up all the available External Parsers,
and provides an easy way to access them.
|
CompositeMatcher |
Composite XPath evaluation state.
|
CompositeParser |
Composite parser that delegates parsing tasks to a component parser
based on the declared content type of the incoming document.
|
CompositeTagHandler |
Takes an array of ID3Tags in preference order, and when asked for
a given tag, will return it from the first ID3Tags that has it.
|
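The preference-order lookup described for CompositeTagHandler amounts to returning the first available value. The sketch below models each tag source as a plain `Map` for brevity; `FirstMatchTags` is a hypothetical name, and treating empty strings as missing is an assumption.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a first-match lookup across tag sources in preference order. */
final class FirstMatchTags {
    private final List<Map<String, String>> sources; // most preferred first

    FirstMatchTags(List<Map<String, String>> sources) {
        this.sources = sources;
    }

    /** Returns the value from the first source that has the key, or null if none does. */
    String get(String key) {
        for (Map<String, String> source : sources) {
            String value = source.get(key);
            if (value != null && !value.isEmpty()) {
                return value;
            }
        }
        return null;
    }
}
```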
CompositeTextStatsCalculator |
|
CompressorParser |
Parser for various compression formats.
|
CompressorParserOptions |
|
ConcurrentUtils |
Utility Class for Concurrency in Tika
|
ConfigurableThreadPoolExecutor |
Allows Thread Pool to be Configurable.
|
ConsumersManager |
Simple interface around a collection of consumers that allows
for initializing and shutting shared resources (e.g.
|
ContainerExtractor |
Tika container extractor interface.
|
ContentHandlerDecorator |
|
ContentHandlerExample |
Examples of using different Content Handlers to
get different parts of the file's contents
|
ContentHandlerFactory |
Interface to allow easier injection of code for getting a new ContentHandler
|
ContentLengthCalculator |
|
ContentTagParser |
|
ContentTags |
|
ContrastStatistics |
|
CoreNLPNERecogniser |
This class offers an implementation of NERecogniser based on
CRF classifiers from Stanford CoreNLP.
|
CorruptedFileException |
This exception should be thrown when the parse absolutely, positively has to stop.
|
CountingInputStream |
A decorating input stream that counts the number of bytes that have passed
through the stream so far.
|
CreativeCommons |
A collection of Creative Commons properties names.
|
CryptoParser |
Decrypts the incoming document stream and delegates further parsing to
another parser instance.
|
CSVMessageBodyWriter |
|
CSVParams |
|
CSVResult |
|
CTAKESAnnotationProperty |
This enumeration includes the properties that an IdentifiedAnnotation object can provide.
|
CTAKESConfig |
|
CTAKESContentHandler |
Class used to extract biomedical information while parsing.
|
CTAKESParser |
CTAKESParser decorates a Parser and leverages on
CTAKESContentHandler to extract biomedical information from
clinical text using Apache cTAKES.
|
CTAKESSerializer |
Enumeration for types of cTAKES (UIMA) CAS serializer supported by cTAKES.
|
CTAKESUtils |
This class provides methods to extract biomedical information from plain text
using CTAKESContentHandler that relies on Apache cTAKES.
|
CustomMimeInfo |
|
Database |
|
DataURIScheme |
|
DataURISchemeParseException |
|
DataURISchemeUtil |
Not thread safe.
|
DateUtils |
Date related utility methods and constants
|
DBBuffer |
|
DBConsumersManager |
|
DBFParser |
This is a Tika wrapper around the DBFReader.
|
DBWriter |
This is still in its early stages.
|
DcXMLParser |
Dublin Core metadata parser
|
DefaultContentHandlerFactoryBuilder |
Builds BasicContentHandler with type defined by attribute "basicHandlerType"
with possible values: xml, html, text, body, ignore.
|
DefaultDetector |
|
DefaultEncodingDetector |
|
DefaultHtmlMapper |
The default HTML mapping rules in Tika.
|
DefaultInputStreamFactory |
Passthrough -- returns InputStream as is
|
DefaultParser |
|
DefaultProbDetector |
A version of DefaultDetector for probabilistic mime
detectors, which use statistical techniques to blend the
results of differing underlying detectors when attempting
to detect the type of a given file.
|
DefaultTranslator |
|
DelegatingParser |
Base class for parser implementations that want to delegate parts of the
task of parsing an input document to another parser.
|
DescribeMetadata |
Print the supported Tika Metadata models and their fields.
|
Detector |
Content type detector.
|
DetectorResource |
|
DIFContentHandler |
|
DIFContentHandler |
|
DIFParser |
|
DigestingAutoDetectParserFactory |
|
DigestingParser |
|
DigestingParser.Digester |
Interface for digester.
|
DigestingParser.Encoder |
Encodes byte array from a MessageDigest to String
|
DirectFileReadDataSource |
A DataSource implementation that relies on direct reads from a RandomAccessFile.
|
DirectoryListingEntry |
The format of a directory listing entry is as follows:
BYTE: length of name
BYTEs: name (UTF-8 encoded)
ENCINT: content section
ENCINT: offset
ENCINT: length
The offset is from the beginning of the content section the file is
in, after the section has been decompressed (if appropriate).
|
DirListParser |
Parses the output of /bin/ls and counts the number of files and the number of
executables using Tika.
|
DisplayMetInstance |
Grabs a PDF file from a URL and prints its Metadata
|
DL4JInceptionV3Net |
|
DL4JVGG16Net |
|
DocumentSelector |
Interface for different document selection strategies for purposes like
embedded document extraction by a ContainerExtractor instance.
|
DublinCore |
A collection of Dublin Core metadata names.
|
DumpTikaConfigExample |
This class shows how to dump a TikaConfig object to a configuration file.
|
DurationFormatUtils |
Functionality and naming conventions (roughly) copied from org.apache.commons.lang3
so that we didn't have to add another dependency.
|
DWGParser |
DWG (CAD Drawing) parser.
|
ElementMappingContentHandler |
Content handler decorator that maps element QNames using
a Map.
|
ElementMappingContentHandler.TargetElement |
|
ElementMatcher |
Final evaluation state of an XPath expression that targets an element.
|
ElementMetadataHandler |
SAX event handler that maps the contents of an XML element into
a metadata field.
|
EmbeddedContentHandler |
|
EmbeddedDocumentExtractor |
|
EmbeddedDocumentUtil |
Utility class to handle common issues with embedded documents.
|
EmbeddedResourceHandler |
Tika container extractor callback interface.
|
Embedder |
Tika embedder interface
|
EMFParser |
Extracts files embedded in EMF and offers a
very rough capability to extract text if there
is text stored in the EMF.
|
EmptyDetector |
Dummy detector that returns application/octet-stream for all documents.
|
EmptyParser |
Dummy parser that always produces an empty XHTML document without even
attempting to parse the given document stream.
|
EmptyTranslator |
Dummy translator that always declines to give any text.
|
EncodingDetector |
Character encoding detector.
|
EncryptedDocumentException |
|
EncryptedPrescriptionDetector |
|
EncryptedPrescriptionParser |
|
EndDocumentShieldingContentHandler |
|
EndianUtils |
General Endian Related Utilities.
|
EndianUtils.BufferUnderrunException |
|
EnviHeaderParser |
|
EpubContentParser |
Parser for EPUB OPS *.html files.
|
EpubParser |
Epub parser
|
ErrorParser |
Dummy parser that always throws a TikaException without even
attempting to parse the given document stream.
|
EvalConsumerBuilder |
|
EvalConsumersBuilder |
|
EvalExceptionUtils |
|
ExcelExtractor |
Excel parser implementation which uses POI's Event API
to handle the contents of a Workbook.
|
ExceptionUtils |
|
ExecutableParser |
Parser for executable files.
|
ExpandedTitleContentHandler |
|
ExternalEmbedder |
Embedder that uses an external program (like sed or exiftool) to embed text
content and metadata into a given document.
|
ExternalParser |
Parser that uses an external program (like catdoc or pdf2txt) to extract
text content and metadata from a given document.
|
ExternalParser.LineConsumer |
Consumer contract
|
ExternalParsersConfigReader |
Builds up ExternalParser instances based on XML file(s)
which define what to run, for what, and how to process
any output metadata.
|
ExternalParsersConfigReaderMetKeys |
|
ExternalParsersFactory |
Creates instances of ExternalParser based on XML
configuration files.
|
ExternalTranslator |
Abstract class used to interact with command line/external Translators.
|
ExtractComparer |
|
ExtractComparerBuilder |
|
ExtractEmbeddedFiles |
|
ExtractProfiler |
|
ExtractProfilerBuilder |
|
ExtractReader |
|
ExtractReader.ALTER_METADATA_LIST |
|
ExtractReaderException |
Exception when trying to read extract
|
ExtractReaderException.TYPE |
|
FeedParser |
Feed parser.
|
FictionBookParser |
|
Field |
Field annotation is a contract for binding Param value from
Tika Configuration to an object.
|
FileConfig |
Configuration for the "file" (or file-alternative) command.
|
FilenameUtils |
|
FileResource |
This is a basic interface to handle a logical "file".
|
FileResourceConsumer |
This is a base class for file consumers.
|
FileResourceCrawler |
|
FLVParser |
Parser for metadata contained in Flash Videos (.flv).
|
Font |
|
ForkParser |
|
ForkProxy |
|
ForkResource |
|
FormattingUtils |
|
FormattingUtils.Tag |
|
FSBatchProcessCLI |
|
FSConsumersManager |
|
FSCrawlerBuilder |
Builds either an FSDirectoryCrawler or an FSListCrawler.
|
FSDirectoryCrawler |
|
FSDirectoryCrawler.CRAWL_ORDER |
|
FSDocumentSelector |
Selector that chooses files based on their file name
and their size, as determined by Metadata.RESOURCE_NAME_KEY and Metadata.CONTENT_LENGTH.
|
FSFileResource |
FileSystem(FS)Resource wraps a file name.
|
FSListCrawler |
Class that "crawls" a list of files.
|
FSOutputStreamFactory |
|
FSOutputStreamFactory.COMPRESSION |
|
FSProperties |
|
FSUtil |
Utility class to handle some common issues when
reading from and writing to a file system (FS).
|
FSUtil.HANDLE_EXISTING |
|
GDALParser |
|
GenericConverter |
Tries to convert as many of the properties in the Metadata map as possible to XMP namespaces.
|
GeoGazetteerClient |
|
Geographic |
Geographic schema.
|
GeographicInformationParser |
|
GeoParser |
|
GeoParserConfig |
|
GeoTag |
|
GoogleTranslator |
|
GrabPhoneNumbersExample |
|
GribParser |
|
GrobidNERecogniser |
|
GrobidRESTParser |
|
H2Util |
|
HDFParser |
|
HexCoDec |
A set of Hex encoding and decoding utility methods.
|
HSLFExtractor |
|
HTML |
|
HtmlEncodingDetector |
Character encoding detector for determining the character encoding of a
HTML document based on the potential charset parameter found in a
Content-Type http-equiv meta tag somewhere near the beginning.
|
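The approach described for HtmlEncodingDetector can be illustrated with a regular expression over the first bytes of a document. This is a simplified sketch, not Tika's detector: `MetaCharsetSketch` is a hypothetical name, and the pattern assumes `http-equiv` appears before the `charset` parameter inside the same meta tag.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: finds a charset parameter in a Content-Type http-equiv meta tag. */
final class MetaCharsetSketch {
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*http-equiv\\s*=\\s*[\"']?content-type[\"']?[^>]*"
            + "charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
        Pattern.CASE_INSENSITIVE);

    /** Returns the declared charset, or null if none is declared or it is not supported. */
    static Charset detect(byte[] head) {
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives any real encoding
        // that is ASCII-compatible.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            return null; // illegal or unsupported charset name
        }
    }
}
```

Passing only the first few kilobytes of the document matches the "somewhere near the beginning" wording above; the exact window size is left to the caller.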
HTMLHelper |
Helps produce user facing HTML output.
|
HtmlMapper |
HTML mapper used to make incoming HTML documents easier to handle by
Tika clients.
|
HtmlParser |
HTML parser.
|
HttpHeaders |
A collection of HTTP header names.
|
HwpStreamReader |
|
HwpTextExtractorV5 |
|
HwpV5Parser |
|
ICNSParser |
A basic parser class for Apple ICNS icon files
|
ICNSType |
Holds details on Apple ICNS icons
|
IContentHandlerFactoryBuilder |
|
ICrawlerBuilder |
|
Icu4jEncodingDetector |
|
ID3Tags |
Interface that defines the common interface for ID3 tag parsers,
such as ID3v1 and ID3v2.3.
|
ID3Tags.ID3Comment |
Represents a comment in ID3 (especially ID3 v2), which is
made up of several parts.
|
ID3v1Handler |
This is used to parse ID3 Version 1 Tag information from an MP3 file,
if available.
|
ID3v22Handler |
This is used to parse ID3 Version 2.2 Tag information from an MP3 file,
if available.
|
ID3v23Handler |
This is used to parse ID3 Version 2.3 Tag information from an MP3 file,
if available.
|
ID3v24Handler |
This is used to parse ID3 Version 2.4 Tag information from an MP3 file,
if available.
|
ID3v2Frame |
A frame of ID3v2 data, which is then passed to a handler to
be turned into useful data.
|
ID3v2Frame.RawTag |
|
ID3v2Frame.TextEncoding |
|
IDBWriter |
|
IdentityHtmlMapper |
Alternative HTML mapping rules that pass the input HTML as-is without any
modifications.
|
IFileProcessorFutureResult |
Stub interface to allow for different result types from different processors.
|
ImageMetadataExtractor |
Uses the Metadata Extractor library
to read EXIF and IPTC image metadata and map to Tika fields.
|
ImageParser |
|
ImportContextImpl |
ImportContextImpl ...
|
Initializable |
Components that must do special processing across multiple fields
at initialization time should implement this interface.
|
InitializableProblemHandler |
This is to be used to handle potential recoverable problems that
might arise during initialization.
|
InputStreamDigester |
|
InputStreamFactory |
Interface to allow for custom/consistent creation of InputStream
|
InterruptableParsingExample |
This example demonstrates how to interrupt document parsing if
some condition is met.
|
Interrupter |
Class that waits for input on System.in.
|
InterrupterBuilder |
Builds an Interrupter
|
InterrupterFutureResult |
|
IOExceptionWithCause |
Subclasses IOException with the Throwable constructors missing before Java 6.
|
IOUtils |
General IO stream manipulation utilities.
|
IParserFactoryBuilder |
|
IPTC |
IPTC photo metadata schema.
|
IptcAnpaParser |
Parser for IPTC ANPA News Wire Feeds
|
ISArchiveParser |
|
ISATabUtils |
|
ITikaToXMPConverter |
Interface for the specific Metadata to XMP converters
|
IWork13PackageParser |
|
IWork13PackageParser.IWork13DocumentType |
|
IWorkPackageParser |
A parser for the IWork container files.
|
IWorkPackageParser.IWORKDocumentType |
|
JackcessParser |
Parser that handles Microsoft Access files via
Jackcess
|
JDBCUtil |
|
JDBCUtil.CREATE_TABLE |
|
JempboxExtractor |
|
JoshuaNetworkTranslator |
This translator is designed to work with a TCP-IP available
Joshua translation server, specifically the
REST-based Joshua server.
|
JournalParser |
|
JpegParser |
|
JSONMessageBodyWriter |
|
JsonMetadata |
|
JsonMetadataBase |
|
JsonMetadataDeserializer |
Deserializer for Metadata.
If overriding this, remember that this is called from a static context.
|
JsonMetadataList |
|
JsonMetadataSerializer |
Serializer for Metadata.
If overriding this, remember that this is called from a static context.
|
JsonStreamingSerializer |
|
LangModel |
|
Language |
|
Language |
|
LanguageAwareTokenCountStats<T> |
Interface for calculators that require language probabilities and token stats
|
LanguageConfidence |
|
LanguageDetectingParser |
|
LanguageDetector |
|
LanguageDetectorExample |
|
LanguageHandler |
SAX content handler that updates a language detector based on all the
received character content.
|
LanguageIdentifier |
Deprecated.
|
LanguageIDWrapper |
|
LanguageNames |
Support for language tags (as defined by https://tools.ietf.org/html/bcp47).
See https://en.wikipedia.org/wiki/List_of_ISO_639-3_codes for a list of
three character language codes.
|
LanguageProfile |
Deprecated.
|
LanguageProfilerBuilder |
Deprecated.
|
LanguageResource |
|
LanguageResult |
|
LanguageWriter |
Writer that builds a language profile based on all the written content.
|
Latin1StringsParser |
Parser to extract printable Latin1 strings from arbitrary files with pure java
without running any external process.
|
LeipzigHelper |
|
LeipzigSampler |
|
Lingo24LangDetector |
|
Lingo24Translator |
|
Link |
|
LinkContentHandler |
Content handler that collects links from an XHTML document.
|
LinkedCell |
Linked cell.
|
ListDescriptor |
Contains the information for a single list in the list or list override tables.
|
ListManager |
Computes the number text which goes at the beginning of each list paragraph
|
LoadErrorHandler |
Interface for error handling strategies in service class loading.
|
Location |
|
LookaheadInputStream |
Stream wrapper that makes it easy to read up to n bytes ahead from
a stream that supports the mark feature.
|
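Reading ahead on a stream that supports the mark feature, as described for LookaheadInputStream, can be sketched with `mark` and `reset`. The static helper `LookaheadSketch.peek` is a hypothetical simplification; Tika's class is a stream wrapper rather than a helper method.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

/** Hypothetical sketch: peeks at up to n bytes without consuming them. */
final class LookaheadSketch {
    static byte[] peek(InputStream in, int n) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(n); // remember the current position for at least n bytes
        try {
            byte[] buf = new byte[n];
            int got = in.readNBytes(buf, 0, n); // Java 9+; shorter only at end of stream
            in.reset();                         // rewind so the caller sees the same bytes again
            return Arrays.copyOf(buf, got);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Wrapping a source in `BufferedInputStream` is one way to obtain a stream that supports mark.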
LuceneIndexer |
|
LuceneIndexerExtended |
|
LyricsHandler |
This is used to parse Lyrics3 tag information
from an MP3 file, if available.
|
MachineMetadata |
Metadata for describing machines, such as their
architecture, type and endian-ness
|
MachineMetadata.Endian |
|
MagicDetector |
Content type detection based on magic bytes, i.e.
|
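Magic-byte detection, as named for MagicDetector, compares the start of a document against known byte patterns. The sketch below is a toy table with three well-known signatures; `MagicSketch` is a hypothetical name, and real detectors support offsets, masks, and many more types. The `application/octet-stream` fallback mirrors what the EmptyDetector entry describes.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of prefix-based content type detection. */
final class MagicSketch {
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();

    static {
        MAGIC.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        MAGIC.put("image/png", new byte[] {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A});
        MAGIC.put("image/gif", "GIF8".getBytes(StandardCharsets.US_ASCII));
    }

    /** Returns the first type whose signature matches the prefix, else application/octet-stream. */
    static String detect(byte[] prefix) {
        for (Map.Entry<String, byte[]> entry : MAGIC.entrySet()) {
            if (startsWith(prefix, entry.getValue())) {
                return entry.getKey();
            }
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```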
MailUtil |
|
MappedBufferCleaner |
Copied/pasted from the Apache Lucene/Solr project.
|
Matcher |
XPath element matcher.
|
MatchingContentHandler |
Content handler decorator that only passes the elements, attributes,
and text nodes that match the given XPath expression.
|
MatParser |
|
MboxParser |
Mbox (mailbox) parser.
|
MediaType |
Internet media type.
|
MediaTypeExample |
|
MediaTypeRegistry |
Registry of known Internet media types.
|
Message |
A collection of Message related property names.
|
Metadata |
A multi-valued metadata container.
|
MetadataAwareLuceneIndexer |
Builds on the LuceneIndexer from Chapter 5 and adds indexing of Metadata.
|
MetadataExtractor |
OOXML metadata extractor.
|
MetadataFields |
Knows about all declared Metadata fields.
|
MetadataHandler |
Deprecated.
|
MetadataList |
Wrapper class to make isWriteable in MetadataListMBW simpler.
|
MetadataListMessageBodyWriter |
|
MetadataResource |
|
MicrosoftTranslator |
Wrapper class to access the Windows translation service.
|
MidiParser |
|
MimeBuffer |
|
MimeType |
Internet media type.
|
MimeTypeException |
A class to encapsulate MimeType related exceptions.
|
MimeTypes |
This class is a MimeType repository.
|
MimeTypesFactory |
Creates instances of MimeTypes.
|
MimeTypesReader |
A reader for XML files compliant with the freedesktop MIME-info DTD.
|
MimeTypesReaderMetKeys |
|
MITIENERecogniser |
This class offers an implementation of NERecogniser based on
trained models using state-of-the-art information extraction tools.
|
MosesTranslator |
Translator that uses the Moses decoder for translation.
|
MP3Frame |
A frame in an MP3 file, such as ID3v2 Tags or some
audio.
|
Mp3Parser |
The Mp3Parser is used to parse ID3 Version 1 Tag information
from an MP3 file, if available.
|
Mp3Parser.ID3TagsAndAudio |
|
MP4Parser |
Parser for the MP4 media container format, as well as the older
QuickTime format that MP4 is based on.
|
MSOffice |
A collection of Microsoft Office and Open Document property names.
|
MSOfficeBinaryConverter |
Tika to XMP mapping for the binary MS formats Word (.doc), Excel (.xls) and PowerPoint (.ppt).
|
MSOfficeXMLConverter |
Tika to XMP mapping for the Office Open XML formats Word (.docx), Excel (.xlsx) and PowerPoint
(.pptx).
|
MSOwnerFileParser |
Parser for temporary MSOffice files.
|
MyFirstTika |
Demonstrates how to call the different components within Tika: its
Detector framework (aka MIME identification and repository), its
Parser interface, its LanguageIdentifier and other goodies.
|
NamedAttributeMatcher |
Final evaluation state of a .../@name XPath expression.
|
NamedElementMatcher |
Intermediate evaluation state of a .../name... XPath
expression.
|
NamedEntityParser |
This implementation of Parser extracts
entity names from text content and adds it to the metadata.
|
NameDetector |
Content type detection based on the resource name.
|
NameEntityExtractor |
|
Namespace |
Utility class to hold namespace information.
|
NERecogniser |
Defines a contract for named entity recogniser.
|
NetCDFParser |
|
NetworkParser |
|
NLTKNERecogniser |
This class offers an implementation of NERecogniser based on
ne_chunk() module of NLTK.
|
NNExampleModelDetector |
|
NNTrainedModel |
|
NNTrainedModelBuilder |
|
NodeMatcher |
Final evaluation state of a .../node() XPath expression.
|
NonDetectingEncodingDetector |
Always returns the charset passed in via the initializer
|
NSNormalizerContentHandler |
Content handler decorator that
maps old OpenOffice 1.0 namespaces to the OpenDocument ones and
returns a fake DTD when the parser requests the OpenOffice DTD.
|
NullInputStream |
A functional, lightweight InputStream that emulates
a stream of a specified size.
|
NullOutputStream |
This OutputStream writes all data to the famous /dev/null.
|
NumberCell |
Number cell.
|
ObjectFromDOMAndQueueBuilder<T> |
|
ObjectFromDOMBuilder<T> |
Interface for things that build objects from a DOM Node and a map of runtime attributes
|
ObjectRecogniser |
|
ObjectRecognitionParser |
This parser recognises objects from Images.
|
Office |
Office Document properties collection.
|
OfficeOpenXMLCore |
Core properties as defined in the Office Open XML specification part Two that are not
in the DublinCore namespace.
|
OfficeOpenXMLExtended |
Extended properties as defined in the Office Open XML specification part Four.
|
OfficeParser |
Defines a Microsoft document content extractor.
|
OfficeParser.POIFSDocumentType |
|
OfficeParserConfig |
|
OfflineContentHandler |
|
OldExcelParser |
A POI-powered Tika Parser for very old versions of Excel, from
pre-OLE2 days, such as Excel 4.
|
OOXMLExtractor |
Interface implemented by all Tika OOXML extractors.
|
OOXMLExtractorFactory |
Figures out the correct OOXMLExtractor for the supplied document and
returns it.
|
OOXMLParser |
Office Open XML (OOXML) parser.
|
OOXMLTikaBodyPartHandler |
|
OOXMLWordAndPowerPointTextHandler |
This class is intended to handle anything that might contain IBodyElements:
main document, headers, footers, notes, slides, etc.
|
OOXMLWordAndPowerPointTextHandler.EditType |
|
OOXMLWordAndPowerPointTextHandler.XWPFBodyContentsHandler |
|
OpenDocumentContentParser |
Parser for ODF content.xml files.
|
OpenDocumentConverter |
Tika to XMP mapping for the Open Document formats: Text (.odt), Spreadsheet (.ods), Graphics
(.odg) and Presentation (.odp).
|
OpenDocumentMetaParser |
Parser for OpenDocument meta.xml files.
|
OpenDocumentParser |
OpenOffice parser
|
OpenNLPNameFinder |
An implementation of NERecogniser that finds names in text using Open NLP Model.
|
OpenNLPNERecogniser |
|
OpenOfficeParser |
Deprecated.
|
OptimaizeLangDetector |
Implementation of the LanguageDetector API that uses
https://github.com/optimaize/language-detector
|
OutlookExtractor |
Outlook Message Parser.
|
OutlookExtractor.RECIPIENT_TYPE |
|
OutlookPSTParser |
Parser for MS Outlook PST email storage files
|
OutputStreamFactory |
|
OverrideDetector |
|
PackageParser |
Parser for various packaging formats.
|
PagedText |
XMP Paged-text schema.
|
ParagraphProperties |
|
ParallelFileProcessingResult |
|
Param<T> |
This is a serializable model class for parameters from configuration file.
|
ParamField |
This class stores metadata for Field annotations, which is used to map them
to Param at runtime.
|
ParseContext |
Parse context.
|
Parser |
Tika parser interface.
|
ParserContainerExtractor |
|
ParserDecorator |
Decorator base class for the Parser interface.
|
ParserFactory |
|
ParserFactory |
|
ParserFactoryBuilder |
|
ParserFactoryFactory |
Lightweight, easily serializable class that contains enough information
to build a ParserFactory
|
ParserPostProcessor |
Parser decorator that post-processes the results from a decorated parser.
|
ParserUtils |
Helper util methods for Parsers themselves.
|
ParsingEmbeddedDocumentExtractor |
Helper class for parsers of package archives or other compound document
formats that support embedded or attached component documents.
|
ParsingExample |
|
ParsingReader |
Reader for the text content from a given binary stream.
|
PasswordProvider |
Interface for providing a password to a Parser for handling Encrypted
and Password Protected Documents.
|
PDF |
PDF properties collection.
|
PDFParser |
PDF parser.
|
PDFParserConfig |
Config for PDFParser.
|
PDFParserConfig.OCR_STRATEGY |
|
Pharmacy |
|
PhoneExtractingContentHandler |
Class used to extract phone numbers while parsing.
|
Photoshop |
XMP Photoshop metadata schema.
|
Pkcs7Parser |
Basic parser for PKCS7 data.
|
POIFSContainerDetector |
A detector that works on a POIFS OLE2 document
to figure out exactly what the file is.
|
POIXMLTextExtractorDecorator |
|
PooledTimeSeriesParser |
Uses the Pooled Time Series algorithm + command line tool, to
generate a numeric representation of the video suitable for
similarity searches.
|
PrescriptionParser |
|
PrettyMetadataKeyComparator |
|
ProbabilisticMimeDetectionSelector |
Selector for combining different mime detection results
based on probability
|
ProbabilisticMimeDetectionSelector.Builder |
Builder class for setting probability parameters.
|
ProcessUtils |
|
ProfilingHandler |
Deprecated.
|
ProfilingWriter |
Deprecated.
|
Property |
XMP property definition.
|
Property.PropertyType |
|
Property.ValueType |
|
PropertyTypeException |
XMP property definition violation exception.
|
PropsUtil |
Utility class to handle properties.
|
ProxyInputStream |
A Proxy stream which acts as expected, that is it passes the method
calls on to the proxied stream and doesn't change which methods are
being called.
|
PRTParser |
A basic text extracting parser for the CADKey PRT (CAD Drawing)
format.
|
PSDParser |
Parser for the Adobe Photoshop PSD File Format.
|
QuattroPro |
QuattroPro properties collection.
|
QuattroProParser |
Parser for Corel QuattroPro documents (part of Corel WordPerfect
Office Suite).
|
RarParser |
Parser for Rar files.
|
RecentFiles |
Builds on top of the LuceneIndexer and the Metadata discussions in Chapter 6
to output an RSS (or RDF) feed of files crawled by the LuceneIndexer within
the last N minutes.
|
RecognisedObject |
A model for objects recognised in graphics and text. It typically includes a
human-readable label for the object, the language of the label, an id, and a confidence score.
|
RecursiveMetadataResource |
|
RecursiveParserWrapper |
This is a helper class that wraps a parser in a recursive handler.
|
RecursiveParserWrapperFSConsumer |
This runs a RecursiveParserWrapper against an input file
and outputs the json metadata to an output file.
|
RecursiveParserWrapperHandler |
|
RegexNERecogniser |
This class offers an implementation of NERecogniser based on
Regular Expressions.
|
RegexUtils |
Inspired by the Nutch class OutlinkExtractor.
|
ReplacementCharset |
An implementation of the standard "replacement" charset defined by the W3C.
|
Report |
This class represents a single report.
|
ReporterBuilder |
Interface for reporter builders
|
RereadableInputStream |
Wraps an input stream, reading it only once, but making it available
for rereading an arbitrary number of times.
|
ResultsReporter |
|
RFC822Parser |
Uses apache-mime4j to parse emails.
|
RichTextContentHandler |
Content handler for Rich Text that extracts the alt attribute of XHTML <img/>
tags and the name attribute of XHTML <a/> tags
into the output.
|
RollbackSoftware |
Demonstrates Tika and its ability to sense symlinks.
|
RTFConverter |
Tika to XMP mapping for the RTF format.
|
RTFMetadata |
|
RTFParser |
RTF parser
|
RunProperties |
WARNING: This class is mutable.
|
SafeContentHandler |
|
SafeContentHandler.Output |
Internal interface that allows both character and
ignorable whitespace content to be filtered the same way.
|
SAS7BDATParser |
Processes the SAS7BDAT data columnar database file used by SAS and
other similar languages.
|
SecureContentHandler |
Content handler decorator that attempts to prevent denial of service
attacks against Tika parsers.
|
SentimentAnalysisParser |
This parser classifies documents based on the sentiment of the document.
|
ServerStatus |
|
ServerStatus.STATUS |
|
ServerStatus.TASK |
|
ServerStatusWatcher |
|
ServerTimeouts |
|
ServiceLoader |
Internal utility class that Tika uses to look up service providers.
|
ServiceLoaderUtils |
Service Loading and Ordering related utils
|
SimpleLogReporterBuilder |
|
SimpleTextExtractor |
|
SimpleThreadPoolExecutor |
Simple Thread Pool Executor
|
SimpleTypeDetector |
|
SlowCompositeReaderWrapper |
COPIED VERBATIM FROM LUCENE
This class forces a composite reader (e.g. a MultiReader or DirectoryReader) to emulate a
LeafReader.
|
SourceCodeParser |
Generic Source code parser for Java, Groovy, C++.
|
SpreadsheetMLParser |
Parses SpreadsheetML 2003 format Excel files.
|
SpringExample |
|
SQLite3Parser |
This is the main class for parsing SQLite3 files.
|
StandardHtmlEncodingDetector |
An encoding detector that tries to respect the spirit of the HTML spec
part 12.2.3 "The input byte stream", or at least the part that is compatible with
the implementation of Tika.
|
StandardOrganizations |
This class provides a collection of the most important technical standard organizations.
|
StandardReference |
Class that represents a standard reference.
|
StandardReference.StandardReferenceBuilder |
|
StandardsExtractingContentHandler |
StandardsExtractingContentHandler is a Content Handler used to extract
standard references while parsing.
|
StandardsExtractionExample |
|
StandardsText |
StandardsText relies on regular expressions to extract standard references
from text.
|
StatusReporter |
Basic class to use for reporting status from both the crawler and the consumers.
|
StatusReporterBuilder |
|
StatusReporterFutureResult |
Empty class for what a StatusReporter returns when it finishes.
|
StrawManTikaAppDriver |
Simple single-threaded class that calls tika-app against every file in a directory.
|
StreamingZipContainerDetector |
|
StreamOutRPWFSConsumer |
|
StringsConfig |
Configuration for the "strings" (or strings-alternative) command.
|
StringsEncoding |
Character encoding of the strings that are to be found using the "strings" command.
|
StringsParser |
Parser that uses the "strings" (or strings-alternative) command to find the
printable strings in an object, or other binary, file
(application/octet-stream).
|
StringStatsCalculator<T> |
Interface for calculators that require a string
|
SubtreeMatcher |
Evaluation state of a ...//... XPath expression.
|
SummaryExtractor |
Extractor for Common OLE2 (HPSF) metadata
|
SXSLFPowerPointExtractorDecorator |
SAX/Streaming pptx extractor
|
SXWPFWordExtractorDecorator |
This is an experimental, alternative extractor for docx files.
|
SystemUtils |
Copied from commons-lang to avoid requiring the dependency
|
TableInfo |
|
TaggedContentHandler |
A content handler decorator that tags potential exceptions so that the
handler that caused the exception can easily be identified.
|
TaggedInputStream |
An input stream decorator that tags potential exceptions so that the
stream that caused the exception can easily be identified.
|
TaggedIOException |
An IOException wrapper that tags the wrapped exception with
a given object reference.
|
TaggedSAXException |
A SAXException wrapper that tags the wrapped exception with
a given object reference.
|
TailStream |
A specialized input stream implementation which records the last portion read
from an underlying stream.
|
TarWriter |
|
TaskStatus |
|
TeeContentHandler |
Content handler proxy that forwards the received SAX events to zero or
more underlying content handlers.
|
TEIDOMParser |
|
TemporaryResources |
Utility class for tracking and ultimately closing or otherwise disposing
a collection of temporary resources.
|
TensorflowImageRecParser |
|
TensorflowRESTCaptioner |
Tensorflow image captioner.
|
TensorflowRESTRecogniser |
TensorFlow image recogniser which has high performance.
|
TensorflowRESTVideoRecogniser |
TensorFlow video recogniser which has high performance.
|
TesseractOCRConfig |
Configuration for TesseractOCRParser.
|
TesseractOCRConfig.OUTPUT_TYPE |
|
TesseractOCRParser |
TesseractOCRParser powered by tesseract-ocr engine.
|
TextAndCSVParser |
|
TextCell |
Text cell.
|
TextContentHandler |
|
TextDetector |
Content type detection of plain text documents.
|
TextLangDetector |
Language Detection using MIT Lincoln Lab’s Text.jl library
https://github.com/trevorlewis/TextREST.jl
Please run the TextREST.jl server before using this.
|
TextMatcher |
Final evaluation state of a .../text() XPath expression.
|
TextMessageBodyWriter |
Returns simple text string for a particular metadata value.
|
TextStatistics |
Utility class for computing a histogram of the bytes seen in a stream.
|
TextStatsCalculator |
Base text stats interface
|
TextStatsFromTikaEval |
|
TIAParsingExample |
|
TIFF |
XMP Exif TIFF schema.
|
TiffParser |
|
Tika |
Facade class for accessing Tika functionality.
|
TikaActivator |
Bundle activator that adjusts the class loading mechanism of the
ServiceLoader class to work correctly in an OSGi environment.
|
TikaCLI |
Simple command line interface for Apache Tika.
|
TikaConfig |
Parses the XML config file.
|
TikaConfigException |
Exception thrown when there is an error in the Tika config file
and/or one or more of the parsers failed to initialize
from that erroneous config.
|
TikaConfigSerializer |
|
TikaConfigSerializer.Mode |
|
TikaCoreProperties |
Contains a core set of basic Tika metadata properties, which all parsers
will attempt to supply (where the file format permits).
|
TikaCoreProperties.EmbeddedResourceType |
A file might contain different types of embedded documents.
|
TikaDetectors |
Provides details of all the Detectors registered with
Apache Tika, similar to --list-detectors with the Tika CLI.
|
TikaEvalCLI |
|
TikaExcelDataFormatter |
Overrides Excel's General format to include more
significant digits than the MS Spec allows.
|
TikaExcelGeneralFormat |
A Format that allows up to 15 significant digits for integers.
|
TikaException |
Tika exception
|
TikaFileTypeDetector |
|
TikaGUI |
Simple Swing GUI for Apache Tika.
|
TikaInputStream |
Input stream with extended capabilities.
|
TikaLoggingFilter |
|
TikaMemoryLimitException |
|
TikaMetadataKeys |
Contains keys to properties in Metadata instances.
|
TikaMimeKeys |
A collection of Tika metadata keys used in Mime Type resolution
|
TikaMimeTypes |
Provides details of all the mimetypes known to Apache Tika,
similar to --list-supported-types with the Tika CLI.
|
TikaParsers |
Provides details of all the Parsers registered with
Apache Tika, similar to --list-parsers and
--list-parser-details within the Tika CLI.
|
TikaResource |
|
TikaServerCli |
|
TikaServerParseException |
Simple wrapper exception to be thrown for consistent handling
of exceptions that can happen during a parse.
|
TikaServerParseExceptionMapper |
|
TikaServerWatchDog |
|
TikaToXMP |
|
TikaVersion |
|
TikaWelcome |
Provides a basic welcome to the Apache Tika Server.
|
TNEFParser |
A POI-powered Tika Parser for TNEF (Transport Neutral
Encoding Format) messages, aka winmail.dat
|
ToHTMLContentHandler |
SAX event handler that serializes the HTML document to a character stream.
|
TokenContraster |
Computes some corpus contrast statistics.
|
TokenCounter |
Deprecated.
|
TokenCountPriorityQueue |
|
TokenCountPriorityQueue |
|
TokenCounts |
|
TokenCountStatsCalculator<T> |
Interface for calculators that require token stats
|
TokenEntropy |
|
TokenIntPair |
|
TokenLengths |
|
TokenStatistics |
|
TopCommonTokenCounter |
Utility class that reads in a UTF-8 input file with one document per row
and outputs the 20000 tokens with the highest document frequencies.
|
TopNTokens |
|
ToTextContentHandler |
SAX event handler that writes all character content out to a character
stream.
|
ToXMLContentHandler |
SAX event handler that serializes the XML document to a character stream.
|
TrainedModel |
|
TrainedModelDetector |
|
TrainTestSplit |
|
TranslateResource |
|
Translator |
Interface for Translator services.
|
TranslatorExample |
|
TrecDocumentGenerator |
Generates document summaries for corpus analysis in the Open Relevance
project.
|
TrueTypeParser |
Parser for TrueType font files (TTF).
|
TSDParser |
Tika parser for Time Stamped Data Envelope (application/timestamped-data)
|
TXTParser |
Plain text parser.
|
TypeDetector |
Content type detection based on a content type hint.
|
UnicodeBlockCounter |
|
UniversalEncodingDetector |
|
UnpackerResource |
|
UnsupportedFormatException |
Parsers should throw this exception when they encounter
a file format that they do not support.
|
URLEmailNormalizingFilterFactory |
Factory for filter that normalizes urls and emails to __url__ and __email__
respectively.
|
URLEnabledInputStreamFactory |
This class looks for "fileUrl" in the http header.
|
WebPParser |
|
WMFParser |
This parser offers a very rough capability to extract text if there
is text stored in the WMF files.
|
Word2006MLParser |
|
WordExtractor |
|
WordExtractor.TagAndStyle |
|
WordMLParser |
Parses WordML 2003 format Word files.
|
WordPerfect |
WordPerfect properties collection.
|
WordPerfectParser |
Parser for Corel WordPerfect documents.
|
WriteOutContentHandler |
SAX event handler that writes content up to an optional write
limit out to a character stream or other decorated handler.
|
XHTMLContentHandler |
Content handler decorator that simplifies the task of producing XHTML
events for Tika content parsers.
|
XLIFF12ContentHandler |
Content Handler for XLIFF 1.2 documents.
|
XLIFF12Parser |
Parser for XLIFF 1.2 files.
|
XLSXHREFFormatter |
|
XLZParser |
Parser for XLZ Archives.
|
XMLDOMUtil |
|
XMLErrorLogUpdater |
This is a very task specific class that reads a log file and updates
the "comparisons" table.
|
XMLLogMsgHandler |
|
XMLLogReader |
|
XMLParser |
XML parser.
|
XMLReaderUtils |
Utility functions for reading XML.
|
XmlRootExtractor |
Utility class that uses a SAXParser to determine
the namespace URI and local name of the root element of an XML file.
|
XMP |
|
XMPContentHandler |
Content handler decorator that simplifies the task of producing XMP output.
|
XMPDM |
XMP Dynamic Media schema.
|
XMPDM.ChannelTypePropertyConverter |
Deprecated.
|
XMPIdq |
|
XMPMessageBodyWriter |
|
XMPMetadata |
Provides a conversion of the Metadata map from Tika to the XMP data model by also providing the
Metadata API for clients to ease transition.
|
XMPMM |
|
XMPPacketScanner |
This class is a parser for XMP packets.
|
XMPRights |
XMP Rights management schema.
|
XPathParser |
Parser for a very simple XPath subset.
|
XPSExtractorDecorator |
|
XPSTextExtractor |
Currently, mostly a pass-through class to hold pkg and properties
and keep the general framework similar to our other POI-integrated
extractors.
|
XSLFEventBasedPowerPointExtractor |
|
XSLFPowerPointExtractorDecorator |
|
XSSFBExcelExtractorDecorator |
|
XSSFExcelExtractorDecorator |
|
XSSFExcelExtractorDecorator.HeaderFooterFromString |
|
XSSFExcelExtractorDecorator.SheetTextAsHTML |
Turns formatted sheet events into HTML
|
XSSFExcelExtractorDecorator.XSSFSheetInterestingPartsCapturer |
Captures information on interesting tags, whilst
delegating the main work to the formatting handler
|
XUserDefinedCharset |
|
XWPFEventBasedWordExtractor |
Experimental class that is based on POI's XSSFEventBasedExcelExtractor
|
XWPFListManager |
|
XWPFNumberingShim |
Stub class of POI's XWPFNumbering because onDocumentRead() is protected
|
XWPFStylesShim |
For Tika, all we need (so far) is a mapping between styleId and a style's name.
|
XWPFWordExtractorDecorator |
|
YandexTranslator |
|
ZeroByteFileException |
Exception thrown by the AutoDetectParser when a file contains zero bytes.
|
ZeroSizeFileDetector |
Detector to identify zero length files as application/x-zerovalue
|
ZipContainerDetector |
A detector that works on Zip documents and other archive and compression
formats to figure out exactly what the file is.
|
ZipListFiles |
Example code listing from Chapter 1.
|
ZipSalvager |
|
ZipWriter |
|