==================================
Design document for PDF generation
==================================
:Author: kn

This is the design document for the PDF generation in the eZ Components
document component. PDF documents should be created from Docbook documents,
which can be generated from each markup available in the document component.

The requirements which should be designed in this document are specified in
the pdf_requirements.txt document.

Layout directives
=================

The PDF document will be created from the Docbook document object. It will
already contain basic formatting rules for a meaningful default document
layout. Additional rules may be passed to the document in various ways.

Generally a CSS like approach will be used to encode layout information. This
allows both the easily readable addressing of nodes in an XML tree, as already
known from CSS, and human readable formatting options.

A limited subset of CSS will be used for now for addressing elements inside
the Docbook XML tree. The grammar for those rules will be::

    Address     ::= Element ( Rule )*
    Rule        ::= '>'? Element
    Element     ::= ElementName ( '.' ClassName | '#' ElementId )?
    ClassName   ::= [A-Za-z_-]+
    ElementName ::= XMLName | '*'
    ElementId   ::= XMLName

XMLName refers to the Name production of the XML specification:
http://www.w3.org/TR/REC-xml/#NT-Name

The semantics of this simple subset of addressing directives are the same as
in CSS. A second level title could, for example, be addressed by::

    section title

The formatting options are also mostly the same as in CSS, but again only a
subset of the definitions available in CSS is used, together with some
additional formatting options which are especially relevant for PDF
rendering. Which formatting options are used depends on the renderer. Unknown
formatting options may issue errors or warnings.

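
The addressing grammar above can be illustrated with a minimal parser sketch.
The function name ``parseAddress`` and the returned array structure are
assumptions made for illustration and are not part of the design. The sketch
also assumes that all tokens, including ``>``, are separated by whitespace, and
it approximates XML names with a simple ASCII pattern.

```php
<?php
/**
 * Hypothetical helper, not part of the component: splits an address such as
 * "article > section title.big" into its steps.
 *
 * Assumption: tokens (including '>') are separated by whitespace.
 */
function parseAddress( string $address ): array
{
    $tokens = preg_split( '/\s+/', trim( $address ), -1, PREG_SPLIT_NO_EMPTY );
    $steps  = array();
    $direct = false;

    foreach ( $tokens as $token )
    {
        if ( $token === '>' )
        {
            // The next element has to be a direct child of the previous one.
            $direct = true;
            continue;
        }

        // ElementName, optionally followed by ".ClassName" or "#ElementId".
        // XML names are only approximated by an ASCII pattern here.
        $pattern = '/^(?P<name>\*|[A-Za-z_][A-Za-z0-9_-]*)'
            . '(?:\.(?P<class>[A-Za-z_-]+)|#(?P<id>[A-Za-z_][A-Za-z0-9_.-]*))?$/';
        if ( !preg_match( $pattern, $token, $match ) )
        {
            throw new InvalidArgumentException( "Invalid address element: $token" );
        }

        // Groups which did not take part in the match are missing or empty.
        $class = $match['class'] ?? '';
        $id    = $match['id'] ?? '';

        $steps[] = array(
            'name'   => $match['name'],
            'class'  => $class !== '' ? $class : null,
            'id'     => $id !== '' ? $id : null,
            'direct' => $direct,
        );
        $direct = false;
    }

    return $steps;
}

print_r( parseAddress( 'article > section title.big' ) );
```

Each step records whether it was preceded by ``>``, so a matcher could later
decide between the child and the descendant relation.
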
The PDF document wrapper class will implement Iterator and ArrayAccess to
access the layout directives, as the following example shows::

    $pdf = new ezcDocumentPdf();
    $pdf->createFromDocbook( $docbook );

    $pdf->styles['article > section title']['font-size'] = '1.6em';

Directives which are specified later will always overwrite earlier
directives, for each formatting option specified in the later directive. The
overwriting of formatting options will NOT depend on the complexity of the
node addressing, as it does in CSS.

Importing and exporting layout directives
-----------------------------------------

The layout directives can be exported to and imported from files, so that
users of the component may store a custom PDF layout. The storage format will
again look very much like a simplified variant of CSS::

    File        ::= Directive+
    Directive   ::= Address '{' Formatting* '}'
    Formatting  ::= Name ':' '"' Value '"' ';'
    Name        ::= [A-Za-z-]+
    Value       ::= [^"]+

C-style comments are allowed anywhere in the definition file, like
``/* ... */`` and ``// ...``.

Importing styles, for example, may be accomplished by::

    $pdf->styles->load( 'styles.pcss' );

List of formatting options
--------------------------

Some formatting options will be processed just as they are defined in CSS.
In addition there will be some custom options.

The options reused from CSS are:

- background-color
- background-image
- background-position
- background-repeat
- border-color
- border-width
- border-bottom-color
- border-bottom-width
- border-left-color
- border-left-width
- border-right-color
- border-right-width
- border-top-color
- border-top-width
- color
- direction
- font-family
- font-size
- font-style
- font-variant
- font-weight
- line-height
- list-style
- list-style-position
- list-style-type
- margin
- margin-bottom
- margin-left
- margin-right
- margin-top
- orphans
- padding
- padding-bottom
- padding-left
- padding-right
- padding-top
- page-break-after
- page-break-before
- text-align
- text-decoration
- text-indent
- white-space
- widows
- word-spacing

Custom properties are:

text-columns
    Number of text columns in one section.
text-column-spacing
    The margin between multiple text columns on one page.
page-size
    Size of pages.
page-orientation
    Orientation of pages.

Not all options can be applied to each element. The renderer might complain
about invalid options, depending on the configured error level.

Special layout elements
=======================

Footers & Headers
-----------------

Footers and headers are special layout elements, which can be rendered
manually by the user of the component. They can be considered as small
sub-documents, but their renderer receives additional information about the
page they are currently rendered on.

They can be set like::

    $pdf = new ezcDocumentPdf();
    $pdf->createFromDocbook( $docbook );

    $pdf->footer = new myDocumentPdfPart();

Each of those parts can render itself and calculate its appropriate bounding
box.

There might be extensions of the ezcDocumentPdfPart base class, which either
render small Docbook sub-documents into a header or footer, or just take a
string and replace placeholders with page dependent contents.

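
The string based variant can be illustrated with a minimal sketch. The class
name ``myStringPartSketch``, its ``render()`` method and the shape of the page
information array are assumptions made for illustration. Only the placeholder
style (``%pageNum``, ``%pageCount``) follows the example used later in this
document.

```php
<?php
/**
 * Hypothetical sketch of a string based page part. The class name and the
 * render() signature are assumptions, not part of the design.
 */
class myStringPartSketch
{
    private $template;

    public function __construct( string $template )
    {
        $this->template = $template;
    }

    /**
     * Replaces placeholders like %pageNum with the values for the current
     * page and returns the resulting text.
     */
    public function render( array $pageInfo ): string
    {
        $replacements = array();
        foreach ( $pageInfo as $name => $value )
        {
            $replacements['%' . $name] = (string) $value;
        }

        // strtr() prefers the longest matching key and never rescans
        // replaced text, so a value containing "%" is left untouched.
        return strtr( $this->template, $replacements );
    }
}

$part = new myStringPartSketch( '%title by %author - %pageNum / %pageCount' );
echo $part->render( array(
    'title'     => 'Introduction',
    'author'    => 'kn',
    'pageNum'   => 3,
    'pageCount' => 12,
) ), "\n";
```

A real part would additionally have to draw the text through the driver and
report its bounding box, which is left out here.
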
Possible implementations would be:

ezcDocumentPdfDocbookPart
    Receives a Docbook document and renders it using a defined style at the
    header or footer of the current page. Placeholders in the text,
    represented by entities, for example, might be replaced.

ezcDocumentPdfStringPart
    Receives a simple string, in which simple placeholders are replaced.

Other elements
--------------

There are various possible full page elements, which might be rendered before
or after the actual contents. Those are for example:

- Cover page
- Bibliography
- Back page

To add those to one PDF document you can create a PDF set, which is then
rendered into one file::

    $pdf = new ezcDocumentPdf();
    $pdf->createFromDocbook( $docbook );

    $set = new ezcDocumentPdfSet();
    $set->parts = array(
        new ezcDocumentPdfPdfPart( 'title.pdf' ),
        $customTableOfContents,
        $pdf,
        $bibliography,
    );
    $set->render( 'my.pdf' );

Some of the documents aggregated in one set can of course again be documents
created from Docbook documents. Each element in the set may contain custom
layout directives.

For the inclusion of other document parts into a PdfSet you are expected to
extend the PDF base class and implement your custom functionality there. This
could mean generating indexes, or a bibliography, from the content.

Drivers
=======

The actual PDF renderer calls methods on the driver, which abstracts the
quirks of the respective implementations. There will be drivers for at least:

- pecl/libharu
- TCPDF

Renderer
========

The renderer will be responsible for the actual typesetting. It will receive a
Docbook document, apply the given layout directives and calculate the
appropriate calls to the driver from those.

The renderer optionally receives a set of helper objects, which perform
relevant parts of the typesetting, like:

Hyphenator
    Class implementing hyphenation for a specific language. We might provide a
    default implementation, which reads standard hyphenation files.

The renderer state will be shared using an object representing the page
currently processed, which contains information about the already covered
areas and therefore the still available space.

Using such a state object, the state can easily be shared between different
renderers for different aspects of the rendering process. This should allow us
to write simpler rendering classes, which should be more maintainable than one
big renderer class whose methods would take care of all aspects.

Because the already covered space is encoded in this page state object, it
makes it possible, for example, to float text around images spanning multiple
paragraphs. All renderers for the different aspects can reuse this knowledge
and base their rendering on it.

The space already covered on a page will most probably be represented by a
list of bounding boxes.

Which renderer classes can be separated will show up during implementation,
but those could for example be:

ezcDocumentPdfParagraphRenderer
    Takes care of rendering the Docbook inline markup inside one paragraph.
    Respects orphans and widows and might be required to split paragraphs.

ezcDocumentPdfTableRenderer
    Renders tables. It might be useful to split this up further into a table
    row and a table cell renderer.

Additional renderer features
----------------------------

If the used driver class implements the respective interfaces, the renderer
will also offer to sign PDF documents, or to add write protection (or similar)
to the PDF document.

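
How a renderer could detect such optional driver features can be shown with a
short sketch. All interface, class and method names below are hypothetical and
only illustrate the idea of checking for an implemented interface. The design
does not define them.

```php
<?php
// Hypothetical names throughout; the design above does not define them.

/** Base driver interface, drawing methods omitted in this sketch. */
interface myPdfDriver
{
}

/** Optional capability: drivers which are able to sign documents. */
interface myPdfSigningDriver
{
    public function sign( string $certificatePath ): void;
}

class myPlainDriver implements myPdfDriver
{
}

class mySigningDriver implements myPdfDriver, myPdfSigningDriver
{
    public $signedWith = null;

    public function sign( string $certificatePath ): void
    {
        // A real driver would apply the signature here.
        $this->signedWith = $certificatePath;
    }
}

class myRendererSketch
{
    private $driver;

    public function __construct( myPdfDriver $driver )
    {
        $this->driver = $driver;
    }

    /** Signing is only offered if the driver implements the interface. */
    public function canSign(): bool
    {
        return $this->driver instanceof myPdfSigningDriver;
    }

    public function sign( string $certificatePath ): void
    {
        if ( !$this->canSign() )
        {
            throw new RuntimeException( 'The configured driver cannot sign documents.' );
        }

        $this->driver->sign( $certificatePath );
    }
}

$renderer = new myRendererSketch( new myPlainDriver() );
var_dump( $renderer->canSign() );
```

With ``myPlainDriver`` the check fails and the renderer does not offer signing.
With ``mySigningDriver`` the call is passed on to the driver.
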
Example
=======

A full example for the creation of a PDF document from an HTML page could look
like::

    <?php
    // Assumption: the class name used to load the HTML page is not given
    // in this document, ezcDocumentXhtml is used here
    $html = new ezcDocumentXhtml();
    $html->loadFile( 'http://ezcomponents.org/introduction' );

    $pdf = new ezcDocumentPdf();
    $pdf->createFromDocbook( $html->getAsDocbook() );

    // Load some custom layout directives
    $pdf->styles->load( 'my_styles.pcss' );
    $pdf->styles['article']['text-columns'] = 3;

    // Set a custom header
    $pdf->header = new ezcDocumentPdfStringPart(
        '%title by %author - %pageNum / %pageCount'
    );

    // Set a custom paragraph renderer
    $pdf->renderer->paragraph = new myPdfParagraphRenderer();

    // Use the hyphenator with a german dictionary
    $pdf->renderer->hyphenator = new myDictionaryHyphenator(
        '/path/to/german.dict'
    );

    // Store the generated PDF
    file_put_contents( 'my.pdf', $pdf );
    ?>

A file containing the layout directives could look like::

    article {
        page-size: "A4";
    }

    paragraph {
        font-family: "Bitstream Vera Sans";
        font-size: "1em";
    }

    article > title {
        font-weight: "bold";
    }

    section title {
        font-weight: "normal";
    }

Classes
=======

The classes implemented for the PDF generation are:

ezcDocumentPdf
    Base class, representing the PDF generation. Aggregates the style
    information, the Docbook source document, the renderer and page parts like
    footer and header.

ezcDocumentPdfSet
    Class aggregating multiple ezcDocumentPdf objects, to create one single
    PDF document from multiple parts, like a cover page, the actual content, a
    bibliography, etc.

ezcDocumentPdfStyles
    Class containing the PDF layout directives. It also implements loading and
    storing of those layout directives.

ezcDocumentPdfPart
    Abstract base class for page parts, like headers and footers. Renders the
    respective part and will be extended by multiple concrete implementations,
    which offer convenient rendering methods.

ezcDocumentPdfRenderer
    Basic renderer class, which aggregates renderers for distinct page
    elements, like paragraphs and tables, and dispatches the rendering to
    them. Also maintains the ezcDocumentPdfPage state object, which contains
    information about the already covered parts of the pages.

ezcDocumentPdfParagraphRenderer
    Example of a concrete, aspect specific renderer class. Such classes only
    implement the rendering of small parts of a document, like single
    paragraphs, tables, or table cell contents.

ezcDocumentPdfPage
    State object describing the current state of a single page in the PDF
    document, like the still available space.

ezcDocumentPdfHyphenator
    Abstract base class for hyphenation implementations for more accurate word
    wrapping.
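
The page state object listed above, with the covered space stored as a list of
bounding boxes, could roughly look like the following sketch. The class name
``myPageStateSketch``, its methods and the array based box representation are
assumptions for illustration. The actual ezcDocumentPdfPage interface is not
specified in this document.

```php
<?php
/**
 * Hypothetical sketch of a page state object. Method names and the box
 * representation are assumptions, not part of the design.
 */
class myPageStateSketch
{
    private $width;
    private $height;

    /** List of bounding boxes which are already covered on this page. */
    private $covered = array();

    public function __construct( float $width, float $height )
    {
        $this->width  = $width;
        $this->height = $height;
    }

    /** Marks a rectangle as covered, for example by an image. */
    public function cover( float $x, float $y, float $width, float $height ): void
    {
        $this->covered[] = array(
            'x' => $x, 'y' => $y, 'width' => $width, 'height' => $height,
        );
    }

    /** Checks that a rectangle lies on the page and overlaps no covered box. */
    public function isFree( float $x, float $y, float $width, float $height ): bool
    {
        if ( $x < 0 || $y < 0 ||
             $x + $width > $this->width ||
             $y + $height > $this->height )
        {
            return false;
        }

        foreach ( $this->covered as $box )
        {
            $overlaps = $x < $box['x'] + $box['width'] &&
                $box['x'] < $x + $width &&
                $y < $box['y'] + $box['height'] &&
                $box['y'] < $y + $height;

            if ( $overlaps )
            {
                return false;
            }
        }

        return true;
    }
}

$page = new myPageStateSketch( 210, 297 );
$page->cover( 10, 10, 80, 60 );               // an image in the top left corner
var_dump( $page->isFree( 100, 10, 100, 60 ) ); // text column next to the image
var_dump( $page->isFree( 10, 40, 190, 20 ) );  // full width line crossing the image
```

The first check succeeds and the second one fails, which is the information a
paragraph renderer would need to float text around the image.
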