Class FileResourceConsumer

java.lang.Object
org.apache.tika.batch.FileResourceConsumer
All Implemented Interfaces:
Callable<IFileProcessorFutureResult>
Direct Known Subclasses:
AbstractFSConsumer, AbstractProfiler

public abstract class FileResourceConsumer extends Object implements Callable<IFileProcessorFutureResult>
Base class for file consumers. It abstracts out the multithreading and record-keeping components.

  • Field Details

    • LOG

      protected static final org.slf4j.Logger LOG
    • TIMED_OUT

      public static String TIMED_OUT
    • OOM

      public static String OOM
    • IO_IS

      public static String IO_IS
    • IO_OS

      public static String IO_OS
    • PARSE_ERR

      public static String PARSE_ERR
    • PARSE_EX

      public static String PARSE_EX
    • ELAPSED_MILLIS

      public static String ELAPSED_MILLIS
  • Constructor Details

  • Method Details

    • call

      Specified by:
      call in interface Callable<IFileProcessorFutureResult>
    • processFileResource

      public abstract boolean processFileResource(FileResource fileResource)
      Main piece of code that needs to be implemented. Clients are responsible for closing streams and handling the exceptions that they'd like to handle.

      Unchecked throwables can propagate past this method. When one is thrown, this class logs the error and then rethrows it, so clients/subclasses should catch and handle everything they can.

      The design goal is that the whole process should close up and shut down soon after an unchecked exception or error is thrown.

      Make sure to call incrementHandledExceptions() appropriately in your implementation of this method.

      Parameters:
      fileResource - resource to process
      Returns:
      whether or not a file was successfully processed
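
      The contract above (close your own streams, handle what you can, count handled exceptions, let unchecked throwables escape) can be illustrated with a minimal sketch. It does not use the Tika classes: SketchConsumer and ByteCountingConsumer are hypothetical stand-ins, and the InputStream parameter is an assumption chosen to keep the example self-contained in place of FileResource.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-in for the base class; NOT the Tika implementation.
abstract class SketchConsumer {
    private int numHandledExceptions = 0;

    // Assumption: takes an InputStream directly instead of a FileResource.
    abstract boolean processFileResource(InputStream resource);

    protected void incrementHandledExceptions() {
        numHandledExceptions++;
    }

    int getNumHandledExceptions() {
        return numHandledExceptions;
    }
}

// Hypothetical subclass: counts bytes and handles IOException itself.
class ByteCountingConsumer extends SketchConsumer {
    long bytesSeen = 0;

    @Override
    boolean processFileResource(InputStream resource) {
        // try-with-resources closes the stream; the client owns that duty.
        try (InputStream is = resource) {
            while (is.read() != -1) {
                bytesSeen++;
            }
            return true;
        } catch (IOException e) {
            // A handled exception: record it and report failure for this file.
            incrementHandledExceptions();
            return false;
        }
        // Unchecked throwables are deliberately not caught here.
    }
}
```

      In this sketch a stream that fails on read yields a false return value and a handled-exception count of 1, while a RuntimeException would pass straight through to the caller.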
    • incrementHandledExceptions

      protected void incrementHandledExceptions()
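      Increments the count of handled exceptions. Call this from your processFileResource implementation whenever it catches and handles an exception.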
    • isStillActive

      public boolean isStillActive()
      Returns whether the consumer could still process a file or is still processing one (state is ACTIVELY_CONSUMING or ASKED_TO_SHUTDOWN).
      Returns:
      whether this consumer is still active
    • pleaseShutdown

      public void pleaseShutdown()
      This politely asks the consumer to shut down. Before processing another file, the consumer will check to see if it has been asked to terminate.

      This offers another method for politely requesting that a FileResourceConsumer stop processing besides passing it PoisonFileResource.
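
      The two stop mechanisms described above, a shutdown flag checked before each file and a poison item in the queue, can be sketched as follows. This is an illustration under stated assumptions, not the Tika implementation: PoliteConsumerSketch, the String queue items, the POISON sentinel, and the 50 ms poll interval are all hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

// Hypothetical consumer; stands in for a FileResourceConsumer subclass.
class PoliteConsumerSketch implements Callable<Integer> {
    // Hypothetical sentinel playing the role of PoisonFileResource.
    static final String POISON = "__POISON__";

    private final BlockingQueue<String> queue;
    // volatile so a request made on another thread is seen promptly.
    private volatile boolean askedToShutdown = false;

    PoliteConsumerSketch(BlockingQueue<String> queue) {
        this.queue = queue;
    }

    public void pleaseShutdown() {
        askedToShutdown = true;
    }

    @Override
    public Integer call() {
        int consumed = 0;
        try {
            // The flag is checked before each new item is taken.
            while (!askedToShutdown) {
                String item = queue.poll(50, TimeUnit.MILLISECONDS);
                if (item == null) {
                    continue; // nothing available yet; re-check the flag
                }
                if (POISON.equals(item)) {
                    break; // the poison item also ends consumption
                }
                consumed++; // "process" the item
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
        }
        return consumed;
    }
}
```

      A flag has the advantage that it takes effect even when the queue is empty or the poison item is stuck behind other work; the poison item has the advantage that it needs no reference to the consumer.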

    • getCurrentFile

      public org.apache.tika.batch.FileStarted getCurrentFile()
      Returns the name and start time of a file that is currently being processed. If no file is currently being processed, this will return null.
      Returns:
      FileStarted or null
    • getNumResourcesConsumed

      public int getNumResourcesConsumed()
    • getNumHandledExceptions

      public int getNumHandledExceptions()
    • checkForTimedOutMillis

      public org.apache.tika.batch.FileStarted checkForTimedOutMillis(long staleThresholdMillis)
      Checks to see if the currentFile being processed (if there is one) should be timed out (still being worked on after staleThresholdMillis).

      If the consumer should be timed out, this will return the currentFile and set the state to TIMED_OUT.

      If the consumer was already timed out earlier, is not processing a file, or has been working on a file for less than staleThresholdMillis, then this will return null.

      Parameters:
      staleThresholdMillis - threshold to determine whether the consumer has gone stale.
      Returns:
      null or the file started that triggered the stale condition
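
      The three null cases and the one non-null case described above can be sketched with a small stand-in. StaleCheckSketch and its nested Started type are hypothetical, and the extra nowMillis parameter is an assumption added so the example is deterministic; the documented method takes only the threshold. How the real method treats an elapsed time exactly equal to the threshold is not stated, so this sketch arbitrarily returns null in that case.

```java
// Hypothetical stand-in; NOT the Tika implementation.
class StaleCheckSketch {
    // Hypothetical stand-in for FileStarted: a resource id plus start time.
    static final class Started {
        final String resourceId;
        final long startMillis;

        Started(String resourceId, long startMillis) {
            this.resourceId = resourceId;
            this.startMillis = startMillis;
        }
    }

    private Started current;   // null when no file is being processed
    private boolean timedOut;  // set once; later checks return null

    synchronized void start(String resourceId, long nowMillis) {
        current = new Started(resourceId, nowMillis);
    }

    synchronized void finish() {
        current = null;
    }

    // Assumption: the clock is passed in rather than read internally.
    synchronized Started checkForTimedOutMillis(long staleThresholdMillis, long nowMillis) {
        if (timedOut || current == null) {
            return null; // already timed out, or idle
        }
        if (nowMillis - current.startMillis <= staleThresholdMillis) {
            return null; // still within the allowed time
        }
        timedOut = true; // record the state change
        return current;  // the file that triggered the stale condition
    }
}
```

      A monitoring thread would typically poll such a method periodically and react to the first non-null result.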
    • getXMLifiedLogMsg

      protected String getXMLifiedLogMsg(String type, String resourceId, String... attrs)
    • getXMLifiedLogMsg

      protected String getXMLifiedLogMsg(String type, String resourceId, Throwable t, String... attrs)
      Use this for structured output that captures resourceId and other attributes.
      Parameters:
      type - entity name for exception
      resourceId - resourceId string
      t - throwable; can be null
      attrs - (array of key0, value0, key1, value1, etc.)
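
      The attrs convention (a flat array of alternating keys and values) can be illustrated with a short sketch. The output format of getXMLifiedLogMsg is not described here, so the single self-closing element below is purely an assumption; XmlLogSketch, toXml, and escape are hypothetical names, and the Throwable argument is omitted.

```java
// Hypothetical helper; the element layout is an assumption, not Tika's format.
class XmlLogSketch {
    // Escapes the characters that are unsafe inside an XML attribute value.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // attrs is read pairwise: key0, value0, key1, value1, ...
    static String toXml(String type, String resourceId, String... attrs) {
        if (attrs.length % 2 != 0) {
            throw new IllegalArgumentException("attrs must hold key/value pairs");
        }
        StringBuilder sb = new StringBuilder();
        sb.append('<').append(type);
        sb.append(" resourceId=\"").append(escape(resourceId)).append('"');
        for (int i = 0; i < attrs.length; i += 2) {
            sb.append(' ').append(attrs[i]);
            sb.append("=\"").append(escape(attrs[i + 1])).append('"');
        }
        sb.append("/>");
        return sb.toString();
    }
}
```

      For example, XmlLogSketch.toXml("my_event", "a&b.pdf", "elapsedMS", "12") yields <my_event resourceId="a&amp;b.pdf" elapsedMS="12"/> in this sketch.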
    • close

      protected void close(Closeable closeable)
    • flushAndClose

      protected void flushAndClose(Closeable closeable)
    • parse

      protected void parse(String resourceId, Parser parser, InputStream is, ContentHandler handler, Metadata m, ParseContext parseContext) throws Throwable
      Utility method that handles logging consistently across all subclasses. Use, override, or avoid as desired.
      Parameters:
      resourceId - resourceId
      parser - parser to use
      is - inputStream (will be closed by this method!)
      handler - handler for the content
      m - metadata
      parseContext - parse context
      Throws:
      Throwable - logs and then rethrows whatever was thrown (if anything)