Migration to PDFBox 2.0.0

Environment

PDFBox 2.0.0 requires at least Java 6.

Packages

There are some significant changes to the package structure of PDFBox:

  • Jempbox is no longer supported and was removed in favour of Xmpbox
  • all examples were moved to the new package "pdfbox-examples"
  • all commandline tools were moved to the new package "pdfbox-tools"
  • all debugger-related code was moved to the new package "pdfbox-debugger"
  • the new package "debugger-app" provides a standalone prebuilt binary for the debugger

Dependency updates

All libraries on which PDFBox depends have been updated to their latest stable versions:

  • Bouncy Castle 1.53
  • Apache Commons Logging 1.2

For test support the libraries have been updated to:

  • JUnit 4.12
  • JAI Image Core 1.2
  • Levigo JBIG ImageIO Plugin 1.6.3

Breaking Changes to the Library

Deprecated API calls

Most API calls that were deprecated in PDFBox 1.8.x have been removed in PDFBox 2.0.0.

API Changes

The API changes are reflected in the Javadoc for PDFBox 2.0.0. The most notable changes are:

  • getCOSDictionary() is no longer used. Instead, getCOSObject() now returns the matching COSBase subtype.
  • PDXObjectForm was renamed to PDFormXObject to be more in line with the PDF specification.
  • PDXObjectImage was renamed to PDImageXObject to be more in line with the PDF specification.
  • PDPage.getContents().createInputStream() was simplified to PDPage.getContents().
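
As a rough illustration of the first point, the sketch below reads the low-level dictionary behind a page through getCOSObject(). The class name CosObjectSketch and the helper pageTypeName are made up for this example, and it assumes PDFBox 2.0.x is on the classpath.

```java
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDPage;

public class CosObjectSketch {
    // Hypothetical helper: returns the /Type entry of a page's dictionary.
    public static String pageTypeName(PDPage page) {
        // getCOSObject() is typed per class; for PDPage it returns a
        // COSDictionary, so no cast from COSBase is needed.
        COSDictionary dictionary = page.getCOSObject();
        return dictionary.getNameAsString(COSName.TYPE);
    }

    public static void main(String[] args) {
        System.out.println(pageTypeName(new PDPage()));
    }
}
```

Code that previously called getCOSDictionary() on such an object can usually switch to getCOSObject() without further changes.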

General Behaviour

PDFBox 2.0.0 now parses PDF files by following the cross-reference (xref) information in the PDF. This is similar to what PDDocument.loadNonSeq did in PDFBox 1.8.x. Users who still use PDDocument.load with PDFBox 1.8.x might see different results after switching to PDFBox 2.0.0.

Font Handling

Font handling now has full Unicode support and supports font subsetting.

TrueType fonts should now be loaded using

PDType0Font.load

to take advantage of this.
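
A minimal sketch of how this might look in practice, assuming PDFBox 2.0.x on the classpath and a TrueType font file supplied by the caller. The class FontLoadingSketch, the helper writeText, and the text position are illustrative choices, not part of PDFBox.

```java
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType0Font;

public class FontLoadingSketch {
    // Hypothetical helper: writes one line of text with a TrueType font from disk.
    public static void writeText(File ttfFile, String text, File output) throws IOException {
        PDDocument document = new PDDocument();
        try {
            PDPage page = new PDPage();
            document.addPage(page);

            // Load the TrueType file as a Type 0 font, as recommended above.
            PDFont font = PDType0Font.load(document, ttfFile);

            PDPageContentStream contents = new PDPageContentStream(document, page);
            contents.beginText();
            contents.setFont(font, 12);
            contents.newLineAtOffset(50, 700); // arbitrary position for the sketch
            contents.showText(text);
            contents.endText();
            contents.close();

            document.save(output);
        } finally {
            document.close();
        }
    }
}
```

The helper could be called as FontLoadingSketch.writeText(new File("MyFont.ttf"), "Grüße", new File("out.pdf")), where both file names are placeholders.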

PDF Resources Handling

The individual calls to add resources, such as PDResources.addFont(PDFont font) and PDResources.addXObject(PDXObject xobject, String prefix), have been replaced with PDResources.add(resource type), where resource type represents one of the resource classes such as PDFont, PDAbstractPattern and so on. The add method now supports all the available types of resources.
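
The following sketch shows one way the unified add method might be used to register a font. It assumes PDFBox 2.0.x, where the standard font constant PDType1Font.HELVETICA is available; ResourcesSketch and register are hypothetical names used only here.

```java
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType1Font;

public class ResourcesSketch {
    // Hypothetical helper: registers a font and returns the generated resource name.
    public static COSName register(PDResources resources, PDFont font) {
        // add(...) is overloaded per resource class; the PDFont overload
        // returns the key under which the font was stored.
        return resources.add(font);
    }

    public static void main(String[] args) {
        PDResources resources = new PDResources();
        COSName name = register(resources, PDType1Font.HELVETICA);
        System.out.println("Font registered as " + name.getName());
    }
}
```

The returned COSName is the key that content streams use to refer to the resource, so there is no need to choose a name by hand.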

Working with Images

The individual classes PDJpeg, PDPixelMap and PDCcitt used to import images have been replaced with PDImageXObject.createFromFile, which works for JPEG, TIFF (G4 compression only), PNG, BMP and GIF.

In addition there are some specialized classes:

  • JPEGFactory.createFromStream, which preserves the JPEG data and embeds it in the PDF file without modification (this is best if you have a JPEG file).
  • CCITTFactory.createFromFile (for bitonal TIFF images with G4 compression).
  • LosslessFactory.createFromImage (this is best if you start with a BufferedImage).
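
As a sketch of the BufferedImage case, the code below embeds an in-memory image with LosslessFactory and draws it on a new page. It assumes PDFBox 2.0.x; the class ImageSketch, the helper addImagePage, and the drawing position are illustrative only.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.graphics.image.LosslessFactory;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

public class ImageSketch {
    // Hypothetical helper: embeds an in-memory image and draws it on a new page.
    public static PDImageXObject addImagePage(PDDocument document, BufferedImage source)
            throws IOException {
        PDImageXObject image = LosslessFactory.createFromImage(document, source);

        PDPage page = new PDPage();
        document.addPage(page);

        PDPageContentStream contents = new PDPageContentStream(document, page);
        // Draw at the image's own size with its lower-left corner at (20, 20).
        contents.drawImage(image, 20, 20);
        contents.close();
        return image;
    }
}
```

For an image stored on disk, the LosslessFactory call could be swapped for PDImageXObject.createFromFile(imagePath, document).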

Iterating Pages

With PDFBox 2.0.0 the preferred way to iterate through the pages of a document is

for (PDPage page : document.getPages())
{
    // ... (do something)
}

PDF Rendering

With PDFBox 2.0.0 PDPage.convertToImage has been removed. Instead, the new PDFRenderer class should be used.

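import java.awt.image.BufferedImage;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.apache.pdfbox.tools.imageio.ImageIOUtil;

PDDocument document = PDDocument.load(new File(pdfFilename));
PDFRenderer pdfRenderer = new PDFRenderer(document);
int pageCounter = 0;
for (PDPage page : document.getPages())
{
    // render the page at the current zero-based index at 300 DPI
    BufferedImage bim = pdfRenderer.renderImageWithDPI(pageCounter, 300, ImageType.RGB);

    // suffix in filename will be used as the file format
    ImageIOUtil.writeImage(bim, pdfFilename + "-" + (pageCounter++) + ".png", 300);
}
document.close();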

PDF Printing

With PDFBox 2.0.0 PDFPrinter has been removed.

Users of PDFPrinter.silentPrint() should now use this code:

PrinterJob job = PrinterJob.getPrinterJob();
job.setPageable(new PDFPageable(document));
job.print();

Users of PDFPrinter.print() should now use this code:

PrinterJob job = PrinterJob.getPrinterJob();
job.setPageable(new PDFPageable(document));
if (job.printDialog()) {
    job.print();
}

Advanced use case examples can be found in the examples package under org/apache/pdfbox/examples/printing/Printing.java.

Interactive Forms

Large parts of the support for interactive forms (AcroForms) have been rewritten. The most notable change from 1.8.x is that there is now a clear distinction between fields and the annotations that represent them visually. Intermediate nodes in a field tree are now represented by the PDNonTerminalField class.

With PDFBox 2.0.0 the preferred way to iterate through the fields is now

for (PDField field : form.getFieldTree())
{
    // ... (do something)
}

Most PDField subclasses now accept standard Java types such as String as parameters instead of the former COSBase subclasses.
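
To make the distinction between intermediate nodes and terminal fields more concrete, here is a small sketch that builds a two-level field tree in memory and then walks it with getFieldTree(). It assumes PDFBox 2.0.x; the class FieldTreeSketch, its two helpers, and the field names "address" and "city" are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;
import org.apache.pdfbox.pdmodel.interactive.form.PDNonTerminalField;
import org.apache.pdfbox.pdmodel.interactive.form.PDTextField;

public class FieldTreeSketch {
    // Hypothetical helper: builds a form with one intermediate node and one text field.
    public static PDAcroForm buildForm(PDDocument document) {
        PDAcroForm form = new PDAcroForm(document);
        document.getDocumentCatalog().setAcroForm(form);

        PDNonTerminalField address = new PDNonTerminalField(form);
        address.setPartialName("address");
        PDTextField city = new PDTextField(form);
        city.setPartialName("city");

        address.setChildren(Arrays.<PDField>asList(city));
        form.setFields(Arrays.<PDField>asList(address));
        return form;
    }

    // Collects the names of terminal fields only, skipping intermediate nodes.
    public static List<String> terminalFieldNames(PDAcroForm form) {
        List<String> names = new ArrayList<String>();
        for (PDField field : form.getFieldTree()) {
            if (!(field instanceof PDNonTerminalField)) {
                names.add(field.getFullyQualifiedName());
            }
        }
        return names;
    }
}
```

The instanceof check is one possible way to separate the nodes that only group other fields from the fields that actually hold values.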