public class RecursiveParserWrapper extends ParserDecorator
After parsing a document, call getMetadata() to retrieve a list of Metadata objects, one for each embedded resource. The first item in the list will contain the Metadata for the outer container file.
Content can also be extracted and stored in the TIKA_CONTENT field of a Metadata object. Select the type of content to be stored at initialization.
If a WriteLimitReachedException is encountered, the wrapper stops processing the current resource and does not process any of that resource's child resources. It will, however, parse as much of the current resource as it can. If the exception is reached in the parent (container) document, no child resources will be parsed.
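The stop-and-skip behaviour described above can be modelled without Tika on the classpath. What follows is a minimal sketch, not the wrapper's implementation: `Resource`, `walk`, and the map keys are hypothetical names, and the write limit is assumed to be a per-resource character count.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RecursiveWalkSketch {

    /** Hypothetical stand-in for a document that has text and embedded children. */
    static final class Resource {
        final String name;
        final String text;
        final List<Resource> children = new ArrayList<>();

        Resource(String name, String text) {
            this.name = name;
            this.text = text;
        }

        Resource add(Resource child) {
            children.add(child);
            return this;
        }
    }

    /**
     * Walks the tree depth-first and returns one metadata map per visited
     * resource. The container's map is always at index 0.
     * A negative writeLimit means "no limit" (an assumption of this sketch).
     */
    static List<Map<String, String>> walk(Resource container, int writeLimit) {
        List<Map<String, String>> out = new ArrayList<>();
        visit(container, "", writeLimit, out);
        return out;
    }

    private static void visit(Resource r, String parentPath, int writeLimit,
                              List<Map<String, String>> out) {
        String path = parentPath + "/" + r.name;
        boolean limitReached = writeLimit >= 0 && r.text.length() > writeLimit;

        Map<String, String> md = new LinkedHashMap<>();
        md.put("path", path);
        // Keep as much content as the limit allows.
        md.put("content", limitReached ? r.text.substring(0, writeLimit) : r.text);
        out.add(md);

        if (limitReached) {
            // Stop here: the children of this resource are never visited.
            md.put("writeLimitReached", "true");
            return;
        }
        for (Resource child : r.children) {
            visit(child, path, writeLimit, out);
        }
    }

    public static void main(String[] args) {
        Resource big = new Resource("big.doc", "0123456789")
                .add(new Resource("inner.png", "x"));
        Resource outer = new Resource("outer.zip", "abc")
                .add(big)
                .add(new Resource("small.txt", "hi"));

        for (Map<String, String> md : walk(outer, 5)) {
            System.out.println(md);
        }
    }
}
```

In this model a limit hit in `big.doc` hides `inner.png` but leaves the sibling `small.txt` untouched, while a limit hit in the container itself yields a single-element list.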
The implementation is based on Jukka's RecursiveMetadataParser and Nick's additions. See: RecursiveMetadataParser.
Note that this wrapper holds all data in memory and is not appropriate for files with content too large to be held in memory.
Note, too, that this wrapper is not thread safe because it stores state. The client must initialize a new wrapper for each thread, and the client is responsible for calling reset() after each parse.
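One way to honour the one-wrapper-per-thread and reset-after-each-parse rules is a `ThreadLocal` plus a `finally` block. The sketch below illustrates only that pattern: `StatefulCollector` is a hypothetical stand-in for any stateful, non-thread-safe object, not the Tika class, and `parseOnce` is an assumed helper name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PerThreadWrapperSketch {

    /** Hypothetical stateful object: results accumulate until reset() is called. */
    static final class StatefulCollector {
        private final List<String> results = new ArrayList<>();

        void parse(String doc) {
            for (String part : doc.split(",")) {
                results.add(part);
            }
        }

        List<String> getResults() {
            return new ArrayList<>(results); // defensive copy
        }

        void reset() {
            results.clear();
        }
    }

    // Each thread lazily gets its own instance, so no state is shared.
    private static final ThreadLocal<StatefulCollector> COLLECTOR =
            ThreadLocal.withInitial(StatefulCollector::new);

    /** Parses one document and always resets the per-thread instance afterwards. */
    static List<String> parseOnce(String doc) {
        StatefulCollector collector = COLLECTOR.get();
        try {
            collector.parse(doc);
            return collector.getResults();
        } finally {
            collector.reset(); // runs even if parse() throws
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int i = 0; i < 8; i++) {
                final String doc = "a" + i + ",b" + i;
                futures.add(pool.submit(() -> parseOnce(doc)));
            }
            for (Future<List<String>> future : futures) {
                System.out.println(future.get());
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

Copying the results before the reset matters here: returning the live list would hand the caller a collection that is emptied as soon as `finally` runs.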
The unit tests for this class are in the tika-parsers module.
Modifier and Type | Class and Description |
---|---|
static class | RecursiveParserWrapper.WriteLimitReached |
Modifier and Type | Field and Description |
---|---|
static Property | EMBEDDED_EXCEPTION Deprecated. |
static Property | EMBEDDED_RESOURCE_LIMIT_REACHED |
static Property | EMBEDDED_RESOURCE_PATH Deprecated. |
static Property | PARSE_TIME_MILLIS Deprecated. |
static Property | TIKA_CONTENT Deprecated. |
static Property | WRITE_LIMIT_REACHED Deprecated. |
Constructor and Description |
---|
RecursiveParserWrapper(Parser wrappedParser) Initialize the wrapper with catchEmbeddedExceptions set to true as default. |
RecursiveParserWrapper(Parser wrappedParser, boolean catchEmbeddedExceptions) |
RecursiveParserWrapper(Parser wrappedParser, ContentHandlerFactory contentHandlerFactory) Deprecated. |
RecursiveParserWrapper(Parser wrappedParser, ContentHandlerFactory contentHandlerFactory, boolean catchEmbeddedExceptions) Deprecated. |
Modifier and Type | Method and Description |
---|---|
List<Metadata> | getMetadata() Deprecated. use a RecursiveParserWrapperHandler instead |
Set<MediaType> | getSupportedTypes(ParseContext context) Delegates the method call to the decorated parser. |
void | parse(InputStream stream, ContentHandler recursiveParserWrapperHandler, Metadata metadata, ParseContext context) Acts like a regular parser except it ignores the ContentHandler and it automatically sets/overwrites the embedded Parser in the ParseContext object. |
void | reset() Deprecated. use a RecursiveParserWrapperHandler instead |
void | setMaxEmbeddedResources(int max) Deprecated. set this on a RecursiveParserWrapperHandler |
getDecorationName, getWrappedParser, withFallbacks, withoutTypes, withTypes
parse
@Deprecated public static final Property TIKA_CONTENT
Deprecated. use AbstractRecursiveParserWrapperHandler.TIKA_CONTENT
@Deprecated public static final Property PARSE_TIME_MILLIS
Deprecated. use AbstractRecursiveParserWrapperHandler.PARSE_TIME_MILLIS
@Deprecated public static final Property WRITE_LIMIT_REACHED
Deprecated. use AbstractRecursiveParserWrapperHandler.WRITE_LIMIT_REACHED
@Deprecated public static final Property EMBEDDED_RESOURCE_LIMIT_REACHED
@Deprecated public static final Property EMBEDDED_EXCEPTION
Deprecated. use AbstractRecursiveParserWrapperHandler.EMBEDDED_EXCEPTION
@Deprecated public static final Property EMBEDDED_RESOURCE_PATH
Deprecated. use AbstractRecursiveParserWrapperHandler.EMBEDDED_RESOURCE_PATH
public RecursiveParserWrapper(Parser wrappedParser)
Initialize the wrapper with catchEmbeddedExceptions set to true as default.
Parameters:
wrappedParser - parser to use for the container documents and the embedded documents
public RecursiveParserWrapper(Parser wrappedParser, boolean catchEmbeddedExceptions)
Parameters:
wrappedParser - parser to wrap
catchEmbeddedExceptions - whether or not to catch and record embedded exceptions. If set to false, embedded exceptions will be thrown and the rest of the file will not be parsed. The following will not be ignored: CorruptedFileException, RuntimeException.
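The effect of the catchEmbeddedExceptions flag can be illustrated with a small standalone model. This is a sketch under stated assumptions rather than Tika's code: `EmbeddedParseException`, `EmbeddedParser`, `parseAll`, and the `embeddedException` key are hypothetical, and it reads "will not be ignored" as meaning that runtime exceptions always propagate regardless of the flag.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CatchEmbeddedSketch {

    /** Hypothetical checked exception standing in for a recoverable parse failure. */
    static class EmbeddedParseException extends Exception {
        EmbeddedParseException(String message) {
            super(message);
        }
    }

    /** Hypothetical parser for a single embedded resource. */
    interface EmbeddedParser {
        String parse(String name) throws EmbeddedParseException;
    }

    /**
     * Parses each embedded resource in turn. With catchEmbeddedExceptions set,
     * a failure is recorded as a stack trace in that resource's metadata and the
     * loop continues; otherwise the exception aborts the remaining resources.
     * RuntimeExceptions are never caught here, so they always propagate.
     */
    static List<Map<String, String>> parseAll(List<String> names,
                                              EmbeddedParser parser,
                                              boolean catchEmbeddedExceptions)
            throws EmbeddedParseException {
        List<Map<String, String>> out = new ArrayList<>();
        for (String name : names) {
            Map<String, String> md = new LinkedHashMap<>();
            md.put("name", name);
            try {
                md.put("content", parser.parse(name));
            } catch (EmbeddedParseException e) {
                if (!catchEmbeddedExceptions) {
                    throw e; // the rest of the file is not parsed
                }
                StringWriter trace = new StringWriter();
                e.printStackTrace(new PrintWriter(trace, true));
                md.put("embeddedException", trace.toString());
            }
            out.add(md);
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        EmbeddedParser parser = name -> {
            if (name.equals("bad")) {
                throw new EmbeddedParseException("cannot parse " + name);
            }
            return name.toUpperCase();
        };
        List<String> names = new ArrayList<>();
        names.add("a");
        names.add("bad");
        names.add("c");
        for (Map<String, String> md : parseAll(names, parser, true)) {
            System.out.println(md.keySet());
        }
    }
}
```

Recording the trace per resource keeps one bad attachment from hiding the results for its siblings, which is the trade-off the flag controls.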
@Deprecated public RecursiveParserWrapper(Parser wrappedParser, ContentHandlerFactory contentHandlerFactory)
Deprecated. use RecursiveParserWrapper(Parser)
Initialize the wrapper with catchEmbeddedExceptions set to true as default.
Parameters:
wrappedParser - parser to use for the container documents and the embedded documents
contentHandlerFactory - factory to use to generate a new content handler for the container document and each embedded document
@Deprecated public RecursiveParserWrapper(Parser wrappedParser, ContentHandlerFactory contentHandlerFactory, boolean catchEmbeddedExceptions)
Deprecated. use RecursiveParserWrapper(Parser, boolean)
Parameters:
wrappedParser - parser to use for the container documents and the embedded documents
contentHandlerFactory - factory to use to generate a new content handler for the container document and each embedded document
catchEmbeddedExceptions - whether or not to catch the embedded exceptions. If set to true, the stack traces will be stored in the metadata object with key: EMBEDDED_EXCEPTION.
public Set<MediaType> getSupportedTypes(ParseContext context)
Description copied from class: ParserDecorator
Delegates the method call to the decorated parser. Subclasses should override this method (and use super.getSupportedTypes() to invoke the decorated parser) to implement extra decoration.
Specified by: getSupportedTypes in interface Parser
Overrides: getSupportedTypes in class ParserDecorator
Parameters:
context - parse context
public void parse(InputStream stream, ContentHandler recursiveParserWrapperHandler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
To retrieve the results of the parse, use getMetadata().
Make sure to call reset() after each parse.
Specified by: parse in interface Parser
Overrides: parse in class ParserDecorator
Parameters:
stream - the document stream (input)
recursiveParserWrapperHandler - handler for the XHTML SAX events (output)
metadata - document metadata (input and output)
context - parse context
Throws:
IOException - if the document stream could not be read
SAXException - if the SAX events could not be processed
TikaException - if the document could not be parsed
@Deprecated public List<Metadata> getMetadata()
Deprecated. use a RecursiveParserWrapperHandler instead
Throws:
IllegalStateException - if you've used a RecursiveParserWrapperHandler in your last call to parse(InputStream, ContentHandler, Metadata, ParseContext)
@Deprecated public void setMaxEmbeddedResources(int max)
Deprecated. set this on a RecursiveParserWrapperHandler
If the maximum is reached during parsing, the EMBEDDED_RESOURCE_LIMIT_REACHED property will be added to the container document's Metadata.
If this value is < 0 (the default), the wrapper will store all Metadata.
Parameters:
max - maximum number of embedded resources to store
@Deprecated public void reset()
Deprecated. use a RecursiveParserWrapperHandler instead
Throws:
IllegalStateException - if you used a RecursiveParserWrapperHandler in your call to parse(InputStream, ContentHandler, Metadata, ParseContext)
Copyright © 2007–2022 The Apache Software Foundation. All rights reserved.