Apache PDFBox - ExtractText

ExtractText

This application will extract all text from the given PDF document.

usage: java -jar pdfbox-app-x.y.z.jar ExtractText [OPTIONS] <PDF file> [Text file]

Command Line Parameter	Type	Default Value	Description
-password <password>	string	None	The password to the PDF document.
-encoding <output encoding>	string	default encoding	The encoding type of the text file, e.g. ISO-8859-1, UTF-8, UTF-16BE.
-console	boolean	false	Send text to console instead of file.
-html	boolean	false	Output in HTML format instead of raw text.
-sort	boolean	false	Sort the text before writing
-ignoreBeads	boolean	false	Disables the separation by beads.
-force	boolean	false	Enables pdfbox to ignore corrupt objects.
-debug	boolean	false	Enables debug output about the time consumption of every stage.
-startPage <start page>	integer	1	The first page to extract, one based.
-endPage <end page>	integer	Integer.MAX_INT	The last page to extract, one based.
-nonSeq	boolean	false	Use the new non sequential parser.