Class FileResourceCrawler

java.lang.Object
org.apache.tika.batch.FileResourceCrawler
All Implemented Interfaces:
Callable<IFileProcessorFutureResult>
Direct Known Subclasses:
FSDirectoryCrawler, FSListCrawler

public abstract class FileResourceCrawler extends Object implements Callable<IFileProcessorFutureResult>
  • Field Details

  • Constructor Details

    • FileResourceCrawler

      public FileResourceCrawler(ArrayBlockingQueue<FileResource> queue, int numConsumers)
      Parameters:
      queue - shared queue
      numConsumers - number of consumers; the crawler needs this to know how many poison resources to add to the queue when it is done
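
The role of numConsumers can be illustrated with a generic poison-pill sketch. This is not Tika code: the class PoisonPillSketch, the POISON sentinel, and the run method are hypothetical stand-ins that use only java.util.concurrent. The sketch assumes each consumer stops after taking exactly one poison item, which is why the producer adds one poison per consumer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in; none of these names come from Tika.
final class PoisonPillSketch {
    // Sentinel that tells a consumer to stop; a distinct instance compared by identity.
    static final String POISON = new String("POISON");

    // Runs one producer and numConsumers consumers; returns how many real items were consumed.
    static int run(int numItems, int numConsumers) {
        ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<>(4);
        ExecutorService pool = Executors.newFixedThreadPool(numConsumers);
        List<Future<Integer>> results = new ArrayList<>();
        try {
            for (int i = 0; i < numConsumers; i++) {
                results.add(pool.submit(() -> {
                    int consumed = 0;
                    while (true) {
                        String item = queue.take();
                        if (item == POISON) {
                            return consumed; // one poison stops exactly one consumer
                        }
                        consumed++;
                    }
                }));
            }
            for (int i = 0; i < numItems; i++) {
                queue.put("file-" + i);
            }
            // One poison per consumer, otherwise some consumers would wait forever.
            for (int i = 0; i < numConsumers; i++) {
                queue.put(POISON);
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(10, 3)); // prints "10"
    }
}
```

If fewer poisons than consumers were added, the remaining consumers would block in take() and the run would never finish.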
  • Method Details

    • start

      public abstract void start() throws InterruptedException
      Implement this to control the addition of FileResources. Call tryToAdd(org.apache.tika.batch.FileResource) to add FileResources to the queue.
      Throws:
      InterruptedException
    • call

      public org.apache.tika.batch.FileResourceCrawlerFutureResult call()
      Specified by:
      call in interface Callable<IFileProcessorFutureResult>
    • tryToAdd

      protected int tryToAdd(FileResource fileResource) throws InterruptedException
      Parameters:
      fileResource - resource to add
      Returns:
      status of the attempt to add the resource to the queue: SKIPPED, ADDED, or STOP_NOW
      Throws:
      InterruptedException
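
A minimal sketch of how a start() implementation might react to the status returned by tryToAdd. To stay runnable without Tika on the classpath, it uses hypothetical stand-ins: CrawlerSketch and ListCrawlerSketch are not Tika classes, the constant values are made up, String replaces FileResource, and the skip and stop rules inside the stand-in tryToAdd are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;

// Stand-in for illustration only: NOT the Tika class. Constant values are made up.
abstract class CrawlerSketch {
    static final int SKIPPED = 0;
    static final int ADDED = 1;
    static final int STOP_NOW = 2;

    private final ArrayBlockingQueue<String> queue;
    private final int maxToAdd; // negative means "no limit" (assumed)
    private int added = 0;

    CrawlerSketch(ArrayBlockingQueue<String> queue, int maxToAdd) {
        this.queue = queue;
        this.maxToAdd = maxToAdd;
    }

    // Subclasses decide which resources to offer, and in what order.
    abstract void start() throws InterruptedException;

    // Assumed behavior: skip unwanted names, stop at the limit or when the queue stays full.
    protected int tryToAdd(String resource) throws InterruptedException {
        if (maxToAdd >= 0 && added >= maxToAdd) {
            return STOP_NOW;
        }
        if (resource.endsWith(".tmp")) {
            return SKIPPED;
        }
        if (!queue.offer(resource, 100, TimeUnit.MILLISECONDS)) {
            return STOP_NOW;
        }
        added++;
        return ADDED;
    }
}

final class ListCrawlerSketch extends CrawlerSketch {
    private final List<String> paths;

    ListCrawlerSketch(ArrayBlockingQueue<String> queue, int maxToAdd, List<String> paths) {
        super(queue, maxToAdd);
        this.paths = paths;
    }

    @Override
    void start() throws InterruptedException {
        for (String path : paths) {
            if (tryToAdd(path) == STOP_NOW) {
                break; // stop offering resources as soon as the crawler says so
            }
        }
    }

    // Crawls four hypothetical paths with a limit of two additions.
    static List<String> demo() {
        ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<>(10);
        List<String> paths = Arrays.asList("a.txt", "b.tmp", "c.txt", "d.txt");
        try {
            new ListCrawlerSketch(queue, 2, paths).start();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(queue);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "[a.txt, c.txt]"
    }
}
```

The point of the sketch is the loop in start(): a SKIPPED result lets the crawl continue, while STOP_NOW ends it immediately.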
    • isActive

      public boolean isActive()
      If the crawler stops for any reason, it is no longer active.
      Returns:
      whether the crawler is active
    • setMaxConsecWaitInMillis

      public void setMaxConsecWaitInMillis(long maxConsecWaitInMillis)
    • setDocumentSelector

      public void setDocumentSelector(DocumentSelector documentSelector)
    • getConsidered

      public int getConsidered()
    • select

      protected boolean select(Metadata m)
    • setMaxFilesToAdd

      public void setMaxFilesToAdd(int maxFilesToAdd)
      Maximum number of files to add. If maxFilesToAdd < 0 (default), then this crawler will add all documents.
      Parameters:
      maxFilesToAdd - maximum number of files to add to the queue
    • setMaxFilesToConsider

      public void setMaxFilesToConsider(int maxFilesToConsider)
      Maximum number of files to consider. A file counts as considered whether or not the DocumentSelector selects it.

      If maxFilesToConsider < 0 (default), then this crawler will consider all documents.

      Parameters:
      maxFilesToConsider - maximum number of files to consider adding to the queue
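
The difference between the two limits can be sketched with plain counters. LimitsSketch and its crawl method are hypothetical, and the order in which the limits are checked is an assumption rather than a statement about the Tika implementation; a Predicate stands in for the DocumentSelector.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration; not part of Tika.
final class LimitsSketch {
    // Returns {considered, added}. A negative limit means "no limit".
    static int[] crawl(List<String> files, Predicate<String> selector,
                       int maxFilesToConsider, int maxFilesToAdd) {
        int considered = 0;
        int added = 0;
        for (String file : files) {
            if (maxFilesToConsider >= 0 && considered >= maxFilesToConsider) {
                break;
            }
            if (maxFilesToAdd >= 0 && added >= maxFilesToAdd) {
                break;
            }
            considered++; // counted whether or not the selector accepts the file
            if (selector.test(file)) {
                added++;
            }
        }
        return new int[] {considered, added};
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("a.pdf", "b.txt", "c.pdf", "d.txt", "e.pdf");
        Predicate<String> pdfOnly = name -> name.endsWith(".pdf");
        System.out.println(Arrays.toString(crawl(files, pdfOnly, 2, -1)));  // prints "[2, 1]"
        System.out.println(Arrays.toString(crawl(files, pdfOnly, -1, 2)));  // prints "[3, 2]"
        System.out.println(Arrays.toString(crawl(files, pdfOnly, -1, -1))); // prints "[5, 3]"
    }
}
```

With a consider limit of 2, the crawl stops after two files even though only one was selected; with an add limit of 2, it keeps considering files until two have been selected.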
    • isQueueEmpty

      public boolean isQueueEmpty()
      Use sparingly. This synchronizes on the queue!
      Returns:
      true if the queue contains no non-poison file resources
    • wasTimedOut

      public boolean wasTimedOut()
      Returns whether the crawler timed out while trying to add a resource to the queue.

      If the crawler timed out while trying to add poison, this is not set to true.

      Returns:
      whether the crawler timed out
    • getAdded

      public int getAdded()
      Returns:
      number of files that this crawler added to the queue
    • shutDownNoPoison

      public void shutDownNoPoison()
      Call this to shut down the FileResourceCrawler without adding poison. Do this only if you have already used another mechanism to ask the consumers to shut down. This avoids a potential deadlock in which the crawler blocks while trying to add to a full queue.
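
The full-queue situation can be reproduced with a plain ArrayBlockingQueue. FullQueueSketch and tryAddPoison are hypothetical names, and the sketch only illustrates the general hazard under the assumption that no consumer is draining the queue; it does not show how the crawler itself adds poison.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; not part of Tika.
final class FullQueueSketch {
    // Returns true if the poison could be added within the timeout.
    static boolean tryAddPoison(ArrayBlockingQueue<String> queue, long timeoutMillis) {
        try {
            return queue.offer("POISON", timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<>(1);
        queue.add("file-0"); // the queue is now full and nothing is draining it
        // A blocking queue.put("POISON") here would wait indefinitely.
        System.out.println(tryAddPoison(queue, 50)); // prints "false"
        queue.poll(); // a consumer finally takes the item
        System.out.println(tryAddPoison(queue, 50)); // prints "true"
    }
}
```

When the consumers have already been stopped by other means, nothing will free space in the queue, so skipping the poison avoids waiting on a slot that never opens.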