Class PDFMarkedContent2XHTML

java.lang.Object
org.apache.pdfbox.contentstream.PDFStreamEngine
org.apache.pdfbox.text.PDFTextStripper
org.apache.tika.parser.pdf.PDFMarkedContent2XHTML

public class PDFMarkedContent2XHTML extends org.apache.pdfbox.text.PDFTextStripper

Added in Tika 1.24 as an alpha version of a text extractor. It builds the text from the PDF's marked content tree and includes, in normalized form, some of the structural tags.

Since:
1.24
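To illustrate what "building text from the marked content tree and normalizing structural tags" can mean, here is a minimal, self-contained sketch. It does not use Tika or PDFBox. The `Node` class, the `TAGS` mapping, and the `render` method are hypothetical names invented for this illustration, and the mapping from PDF structure types (such as `P`, `H1`, `L`, `LI`) to XHTML element names is an assumption, not necessarily the one this class applies.

```java
import java.util.List;
import java.util.Map;

public class StructureTreeSketch {

    /** Hypothetical tree node: a structure type plus optional text and child nodes. */
    static final class Node {
        final String type;
        final String text;
        final List<Node> children;

        Node(String type, String text, Node... children) {
            this.type = type;
            this.text = text;
            this.children = List.of(children);
        }
    }

    // Assumed normalization table: structure type -> XHTML element name.
    static final Map<String, String> TAGS =
            Map.of("P", "p", "H1", "h1", "L", "ul", "LI", "li");

    /** Escapes the characters that would break the markup. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;");
    }

    /** Depth-first walk: known types become elements, unknown types keep only their content. */
    static void render(Node node, StringBuilder out) {
        String tag = TAGS.get(node.type);
        if (tag != null) {
            out.append('<').append(tag).append('>');
        }
        if (node.text != null) {
            out.append(escape(node.text));
        }
        for (Node child : node.children) {
            render(child, out);
        }
        if (tag != null) {
            out.append("</").append(tag).append('>');
        }
    }

    public static void main(String[] args) {
        Node root = new Node("Document", null,
                new Node("H1", "Title"),
                new Node("P", "Hello"));
        StringBuilder out = new StringBuilder();
        render(root, out);
        System.out.println(out); // <h1>Title</h1><p>Hello</p>
    }
}
```

The walk follows the logical order of the tree rather than the position of glyphs on the page, which is the main difference from a purely position-based extractor.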
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final String
     
    static final String
     

    Fields inherited from class org.apache.pdfbox.text.PDFTextStripper

    charactersByArticle, document, LINE_SEPARATOR, output
  • Method Summary

    Modifier and Type
    Method
    Description
    protected float
    computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0)
     
    protected void
    endDocument(org.apache.pdfbox.pdmodel.PDDocument pdf)
     
    protected void
    endPage(org.apache.pdfbox.pdmodel.PDPage page)
     
    int
    getCurrentPageNo()
    We need to override this because we are overriding PDFTextStripper.processPages(PDPageTree).
    int
    getStartPage()
     
    static void
    process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config)
    Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
    void
    processPage(org.apache.pdfbox.pdmodel.PDPage page)
     
    protected void
    processPages(org.apache.pdfbox.pdmodel.PDPageTree pages)
    See TIKA-2845 for why we need to override this.
    void
    setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
     
    void
    setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
     
    void
    setStartPage(int startPage)
     
    protected void
    showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, String arg3, org.apache.pdfbox.util.Vector arg4)
     
    protected void
    showGlyph(org.apache.pdfbox.util.Matrix textRenderingMatrix, org.apache.pdfbox.pdmodel.font.PDFont font, int code, org.apache.pdfbox.util.Vector displacement)
     
    protected void
    startDocument(org.apache.pdfbox.pdmodel.PDDocument pdf)
     
    protected void
    startPage(org.apache.pdfbox.pdmodel.PDPage page)
     
    protected void
    writeCharacters(org.apache.pdfbox.text.TextPosition text)
     
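    protected void
    writeLineSeparator()
     
    protected void
    writeParagraphEnd()
     
    protected void
    writeParagraphStart()
     
    protected void
    writeString(String text)
     
    protected void
    writeWordSeparator()
     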

    Methods inherited from class org.apache.pdfbox.text.PDFTextStripper

    endArticle, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCharactersByArticle, getDropThreshold, getEndBookmark, getEndPage, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getSuppressDuplicateOverlappingText, getText, getWordSeparator, matchPattern, processTextPosition, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndPage, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startArticle, writePage, writePageEnd, writePageStart, writeParagraphSeparator, writeString, writeText

    Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine

    addOperator, applyTextAdjustment, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, registerOperatorProcessor, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showFontGlyph, showForm, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

  • Method Details

    • process

      public static void process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config) throws SAXException, TikaException
      Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
      Parameters:
      pdDocument - PDF document
      handler - SAX content handler
      context - parse context
      metadata - PDF metadata
      config - PDF parser configuration
      Throws:
      SAXException - if the content handler fails to process SAX events
      TikaException - if there was an exception outside of per page processing
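The sketch below shows what a "stream of XHTML SAX events sent to the given content handler" looks like from the handler's side. It uses only the JDK's `org.xml.sax` classes. The `emitParagraph` method is a hypothetical stand-in for the producer and fires a few events by hand; it is not how `process` is implemented, and the exact elements that `process` emits are not specified here.

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEventsSketch {

    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Consumer side: records element names and accumulates character data. */
    static class CollectingHandler extends DefaultHandler {
        final List<String> elements = new ArrayList<>();
        final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            elements.add(localName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Hypothetical producer: fires a minimal XHTML event stream with one paragraph. */
    static void emitParagraph(ContentHandler handler, String content) throws SAXException {
        Attributes none = new AttributesImpl();
        handler.startDocument();
        handler.startElement(XHTML_NS, "html", "html", none);
        handler.startElement(XHTML_NS, "body", "body", none);
        handler.startElement(XHTML_NS, "p", "p", none);
        char[] chars = content.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endElement(XHTML_NS, "p", "p");
        handler.endElement(XHTML_NS, "body", "body");
        handler.endElement(XHTML_NS, "html", "html");
        handler.endDocument();
    }

    public static void main(String[] args) throws SAXException {
        CollectingHandler handler = new CollectingHandler();
        emitParagraph(handler, "Hello from a tagged PDF");
        System.out.println(handler.elements); // [html, body, p]
        System.out.println(handler.text);     // Hello from a tagged PDF
    }
}
```

Any `ContentHandler` can be passed in the same way, so a caller chooses between collecting plain text, serializing the markup, or forwarding the events to another component.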
    • processPages

      protected void processPages(org.apache.pdfbox.pdmodel.PDPageTree pages) throws IOException
      See TIKA-2845 for why we need to override this.
      Throws:
      IOException
    • processPage

      public void processPage(org.apache.pdfbox.pdmodel.PDPage page) throws IOException
      Overrides:
      processPage in class org.apache.pdfbox.text.PDFTextStripper
      Throws:
      IOException
    • endPage

      protected void endPage(org.apache.pdfbox.pdmodel.PDPage page) throws IOException
      Throws:
      IOException
    • writeParagraphStart

      protected void writeParagraphStart() throws IOException
      Overrides:
      writeParagraphStart in class org.apache.pdfbox.text.PDFTextStripper
      Throws:
      IOException
    • writeParagraphEnd

      protected void writeParagraphEnd() throws IOException
      Overrides:
      writeParagraphEnd in class org.apache.pdfbox.text.PDFTextStripper
      Throws:
      IOException
    • writeString

      protected void writeString(String text) throws IOException
      Overrides:
      writeString in class org.apache.pdfbox.text.PDFTextStripper
      Throws:
      IOException
    • writeCharacters

      protected void writeCharacters(org.apache.pdfbox.text.TextPosition text) throws IOException
      Overrides:
      writeCharacters in class org.apache.pdfbox.text.PDFTextStripper
      Throws:
      IOException
    • writeWordSeparator

      protected void writeWordSeparator() throws IOException
      Overrides:
      writeWordSeparator in class org.apache.pdfbox.text.PDFTextStripper
      Throws:
      IOException
    • writeLineSeparator

      protected void writeLineSeparator() throws IOException
      Overrides:
      writeLineSeparator in class org.apache.pdfbox.text.PDFTextStripper
      Throws:
      IOException
    • startPage

      protected void startPage(org.apache.pdfbox.pdmodel.PDPage page) throws IOException
      Overrides:
      startPage in class org.apache.pdfbox.text.PDFTextStripper
      Throws:
      IOException
    • startDocument

      protected void startDocument(org.apache.pdfbox.pdmodel.PDDocument pdf) throws IOException
      Overrides:
      startDocument in class org.apache.pdfbox.text.PDFTextStripper
      Throws:
      IOException
    • endDocument

      protected void endDocument(org.apache.pdfbox.pdmodel.PDDocument pdf) throws IOException
      Overrides:
      endDocument in class org.apache.pdfbox.text.PDFTextStripper
      Throws:
      IOException
    • getCurrentPageNo

      public int getCurrentPageNo()
      We need to override this because we are overriding PDFTextStripper.processPages(PDPageTree).
      Overrides:
      getCurrentPageNo in class org.apache.pdfbox.text.PDFTextStripper
      Returns:
      the current page number
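The dependency between overriding `processPages` and overriding `getCurrentPageNo` can be sketched in isolation. The assumption here is that the base class keeps its page counter private and advances it only inside its own page loop, so a subclass that replaces the loop must keep its own counter and expose it through the getter. `BaseStripper` and `TreeStripper` are hypothetical classes written for this illustration; they are not the PDFBox or Tika classes.

```java
import java.util.List;

public class PageCounterSketch {

    /** Hypothetical base class: the counter is private and only this loop advances it. */
    static class BaseStripper {
        private int currentPageNo = 0;

        protected void processPages(List<String> pages) {
            for (String page : pages) {
                currentPageNo++;
                processPage(page);
            }
        }

        protected void processPage(String page) {
            // no-op in the base class
        }

        public int getCurrentPageNo() {
            return currentPageNo;
        }
    }

    /** Hypothetical subclass: replaces the page loop, so it tracks the page number itself. */
    static class TreeStripper extends BaseStripper {
        private int myPageNo = 0;
        final StringBuilder log = new StringBuilder();

        @Override
        protected void processPages(List<String> pages) {
            for (String page : pages) {
                myPageNo++; // the base counter is unreachable from here
                processPage(page);
            }
        }

        @Override
        protected void processPage(String page) {
            log.append(getCurrentPageNo()).append(':').append(page).append(' ');
        }

        @Override
        public int getCurrentPageNo() {
            // Without this override the getter would keep returning 0.
            return myPageNo;
        }
    }

    public static void main(String[] args) {
        TreeStripper stripper = new TreeStripper();
        stripper.processPages(List.of("cover", "body"));
        System.out.println(stripper.log.toString().trim()); // 1:cover 2:body
        System.out.println(stripper.getCurrentPageNo());    // 2
    }
}
```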
    • setStartBookmark

      public void setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
      Overrides:
      setStartBookmark in class org.apache.pdfbox.text.PDFTextStripper
    • setEndBookmark

      public void setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
      Overrides:
      setEndBookmark in class org.apache.pdfbox.text.PDFTextStripper
    • getStartPage

      public int getStartPage()
      Overrides:
      getStartPage in class org.apache.pdfbox.text.PDFTextStripper
    • setStartPage

      public void setStartPage(int startPage)
      Overrides:
      setStartPage in class org.apache.pdfbox.text.PDFTextStripper
    • showGlyph

      protected void showGlyph(org.apache.pdfbox.util.Matrix textRenderingMatrix, org.apache.pdfbox.pdmodel.font.PDFont font, int code, org.apache.pdfbox.util.Vector displacement) throws IOException
      Overrides:
      showGlyph in class org.apache.pdfbox.contentstream.PDFStreamEngine
      Throws:
      IOException
    • showGlyph

      protected void showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, String arg3, org.apache.pdfbox.util.Vector arg4) throws IOException
      Overrides:
      showGlyph in class org.apache.pdfbox.contentstream.PDFStreamEngine
      Throws:
      IOException
    • computeFontHeight

      protected float computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0) throws IOException
      Throws:
      IOException