Class AbstractRecursiveParserWrapperHandler

java.lang.Object
org.xml.sax.helpers.DefaultHandler
org.apache.tika.sax.AbstractRecursiveParserWrapperHandler
All Implemented Interfaces:
Serializable, ContentHandler, DTDHandler, EntityResolver, ErrorHandler
Direct Known Subclasses:
RecursiveParserWrapperHandler

public abstract class AbstractRecursiveParserWrapperHandler extends DefaultHandler implements Serializable
This is a special handler to be used only with the RecursiveParserWrapper. It allows for finer-grained processing of embedded documents than in the legacy handlers. Subclasses can choose how to process individual embedded documents.
  • Field Details

    • EMBEDDED_RESOURCE_LIMIT_REACHED

      public static final Property EMBEDDED_RESOURCE_LIMIT_REACHED
  • Constructor Details

    • AbstractRecursiveParserWrapperHandler

      public AbstractRecursiveParserWrapperHandler(ContentHandlerFactory contentHandlerFactory)
    • AbstractRecursiveParserWrapperHandler

      public AbstractRecursiveParserWrapperHandler(ContentHandlerFactory contentHandlerFactory, int maxEmbeddedResources)
  • Method Details

    • getNewContentHandler

      public ContentHandler getNewContentHandler()
    • getNewContentHandler

      public ContentHandler getNewContentHandler(OutputStream os, Charset charset)
    • startEmbeddedDocument

      public void startEmbeddedDocument(ContentHandler contentHandler, Metadata metadata) throws SAXException
      This is called before parsing each embedded document. Override this for custom behavior. If you override it, call super.startEmbeddedDocument(...) from your subclass, because this method tracks the number of embedded documents.
      Parameters:
      contentHandler - local handler to be used on this embedded document
      metadata - embedded document's metadata
      Throws:
      SAXException
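      The override contract described above can be illustrated with a small model that uses only JDK classes. This is a sketch under assumptions, not Tika's source: HandlerSketch, LoggingHandlerSketch, the embeddedCount field, the "seen" key, and the Map that stands in for Metadata are all hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class StartEmbeddedDemo {

    // Hypothetical stand-in for the abstract handler; NOT Tika's implementation.
    abstract static class HandlerSketch extends DefaultHandler {
        private int embeddedCount = 0; // assumed bookkeeping field

        // Same shape as the documented method, with a Map standing in for Metadata.
        public void startEmbeddedDocument(ContentHandler handler, Map<String, String> metadata)
                throws SAXException {
            embeddedCount++;
        }

        public int getEmbeddedCount() {
            return embeddedCount;
        }
    }

    // Subclass that adds custom behavior while keeping the base bookkeeping intact.
    static class LoggingHandlerSketch extends HandlerSketch {
        @Override
        public void startEmbeddedDocument(ContentHandler handler, Map<String, String> metadata)
                throws SAXException {
            super.startEmbeddedDocument(handler, metadata); // keeps the count accurate
            metadata.put("seen", "true"); // custom behavior; "seen" is a made-up key
        }
    }

    // Simulates a parse that encounters the given number of embedded documents.
    static int countAfter(int documents) {
        LoggingHandlerSketch handler = new LoggingHandlerSketch();
        try {
            for (int i = 0; i < documents; i++) {
                handler.startEmbeddedDocument(new DefaultHandler(), new HashMap<>());
            }
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.getEmbeddedCount();
    }

    public static void main(String[] args) {
        System.out.println(countAfter(3)); // prints 3
    }
}
```

      If the super call were dropped from the override, the counter in this model would stay at zero, which is the failure the note above warns about.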
    • endEmbeddedDocument

      public void endEmbeddedDocument(ContentHandler contentHandler, Metadata metadata) throws SAXException
      This is called after parsing each embedded document. Override this for custom behavior. This is currently a no-op.
      Parameters:
      contentHandler - content handler that was used on this embedded document
      metadata - metadata for this embedded document
      Throws:
      SAXException
    • endDocument

      public void endDocument(ContentHandler contentHandler, Metadata metadata) throws SAXException
      This is called after the full parse has completed. Override this for custom behavior. If you override it, call super.endDocument(...) from your subclass, because this method records in the metadata whether the embedded resource maximum was hit.
      Parameters:
      contentHandler - content handler that was used on the main document
      metadata - metadata that was gathered for the main document
      Throws:
      SAXException
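      As a rough illustration of why the super call matters here, the following JDK-only model records a limit flag at the end of the parse. It is a sketch under stated assumptions, not Tika's code: the class names, the LIMIT_KEY string (standing in for the EMBEDDED_RESOURCE_LIMIT_REACHED property), the "summary" key, the Map used in place of Metadata, and the rule that the limit counts as hit once the count reaches a non-negative maximum are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class EndDocumentDemo {

    // Hypothetical key; the real class exposes the EMBEDDED_RESOURCE_LIMIT_REACHED Property.
    static final String LIMIT_KEY = "limit-reached-sketch";

    // Hypothetical stand-in for the abstract handler; NOT Tika's implementation.
    abstract static class HandlerSketch extends DefaultHandler {
        private final int maxEmbeddedResources; // assumed to be non-negative here
        private int embeddedResources = 0;

        HandlerSketch(int maxEmbeddedResources) {
            this.maxEmbeddedResources = maxEmbeddedResources;
        }

        public void startEmbeddedDocument(ContentHandler handler, Map<String, String> metadata)
                throws SAXException {
            embeddedResources++;
        }

        // Assumed rule: the limit is hit once the count reaches the maximum.
        public boolean hasHitMaximumEmbeddedResources() {
            return embeddedResources >= maxEmbeddedResources;
        }

        // Records the limit flag in the main document's metadata.
        public void endDocument(ContentHandler handler, Map<String, String> metadata)
                throws SAXException {
            metadata.put(LIMIT_KEY, String.valueOf(hasHitMaximumEmbeddedResources()));
        }
    }

    // Subclass that adds its own end-of-parse step after the base class runs.
    static class SummaryHandlerSketch extends HandlerSketch {
        SummaryHandlerSketch(int maxEmbeddedResources) {
            super(maxEmbeddedResources);
        }

        @Override
        public void endDocument(ContentHandler handler, Map<String, String> metadata)
                throws SAXException {
            super.endDocument(handler, metadata); // writes the limit flag
            metadata.put("summary", "done"); // custom behavior; made-up key
        }
    }

    // Simulates a full parse and returns the main document's metadata.
    static Map<String, String> run(int max, int embeddedDocs) {
        SummaryHandlerSketch handler = new SummaryHandlerSketch(max);
        Map<String, String> mainMetadata = new HashMap<>();
        try {
            for (int i = 0; i < embeddedDocs; i++) {
                handler.startEmbeddedDocument(new DefaultHandler(), new HashMap<>());
            }
            handler.endDocument(new DefaultHandler(), mainMetadata);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return mainMetadata;
    }

    public static void main(String[] args) {
        System.out.println(run(2, 3).get(LIMIT_KEY)); // prints true
        System.out.println(run(5, 3).get(LIMIT_KEY)); // prints false
    }
}
```

      In this model, an override that skipped super.endDocument(...) would leave the flag out of the metadata entirely, so callers could not tell whether embedded documents were dropped.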
    • hasHitMaximumEmbeddedResources

      public boolean hasHitMaximumEmbeddedResources()
      Returns:
      whether this handler reached the maximum number of embedded resources during the parse
    • getContentHandlerFactory

      public ContentHandlerFactory getContentHandlerFactory()