============================================== Design document for ODT parsing and generation ============================================== :Author: ts The scope of this document is to define the design for a first implementaion of ODT (Open Document Text) support in the eZ Document component. The parts of the Document component designed in this document do not affect other Open Document formats like spreadsheets or graphics. The goal is to define the infrastructure for reading and writing ODT documents, i.e. to convert existing ODT documents into the internal representation of the Document component (DocBook XML) and to generate new ODT documents from the internal representation. ------------ Requirements ------------ The following sections describe the requirements for the ODT handling in the Document component. The first section defines requirements for reading ODT, the second for writing ODT and the third section defines requirements for later enhancements to be kept in mind during the initial implementation. Import ====== The Document component should be able to parse existing ODT documents and to convert them to the internal format used by the Document component (DocBook XML). Requirements for the import process are: - Read plain XML ODT files - Parse all necessary structural ODT elements - Convert ODT elements properly into equivalent or similar DocBook representations - Maintaining the content semantics provided by the ODT as good as possible - Maintain meta information provided by the ODT as good as possible - Develop a first heuristical approach of how ODT styling information can be used to determine semantics of an element. Export ====== The Document component should be able to generate new ODT documents from an existing internal representations (DocBook XML). Requirements for this process are: - Write plain XML ODT files - Convert DocBook representation elements to their corresponding ODT representations - Maintain the document structure - Maintain content and metadata semantics as good as possible - Styling of ODT elements. Later enhancements ================== In the first step of ODT integration only rudimentary features for import and export should be realized. The following ideas must be kept in mind during the design and implementation, to ensure future extensibility. - Reading / writing of ODT package files (ZIP) - ODF can be presented either as a single XML file or as a ZIP package containg multiple XML files and other related files (e.g. images) in addition. - Reading and writing this format is not necessary from the start, but since it is the default way for users to store ODT, it should be supported later on. - The handling of ZIP files requires a tie-in with the Archive component or similar. ------ Design ------ In the first development cycle, only the structural conversion between ODT and DocBook XML will be considered. In addition, rudimentary styling information will be taken into account. The reading and writing of ODF packages is not considered in this design. Import ====== Three different steps are necessary to import an ODT document and convert it into DocbookXml: 1. Read the XML data 2. Preprocess the ODT representation 3. Actual conversion to DocBook XML representation Step 1 will be performed through the DOM extension in PHP, the internal representation of an ODT will be a DOM treee. The second step performs pre-processing on this DOM tree. Pre-processing is e.g. needed to assign additional semantics to the ODT elements to achieve a better rendering. Finally, the pre-processed DOM tree will be visited, to achieve the actual creation of the DocBook XML representation. Pre-processing -------------- The step of pre-processing the ODT representation is necessary to assign DocBoox semantics to the ODT elements. ODT and DocBook XML have some similarities, but also differ widely in some parts. The pre-processing step performs manipultations on the ODT representation and potentially adds information which is utilized by the latter conversion step to create a correct semantical representation. This process works similar to filters in the XHTML document import. The class level design of this feature is inspired by the XHTML handling: Filters can be registered which pre-process the incoming ODT in the given order. A filter may process the following steps on a DOMElement: - Add type information to an XML element to determine into which DocBook XML element the element will be converted - Add attribute information to determine the attributes in the DocBook XML representation - Add additional elements or element hierarchies The resulting DOM tree must not necessarily be valid ODT anymore, to reflect the latter DocBook structure in a better way. The first implemented filter will only perform rudimentary operations on the DOM to assign basic semantical information to the elements. A second implementation will be an additional filter which takes some styling information into account to enhance this information. Futher filters can be implemented by third parties to extend or replace these mechanisms. Conversion ---------- The conversion process itself will mostly visit the DOM tree and utilize the information, attached to the elements in the pre-processing step, to generate a DocBook XML with the corresponding content. The filter pre-processing step is responsible to annotate all significant elements properly so that the conversion can use them. Flat ODT documents (consisting of only 1 XML file), which will purely be handled in the first version of ODT support, may contain image content embeded. To extract those, the user my specify a target directory or the system temp dir will be used as the default. The content will then be referenced in DocBook from this location. .. note:: We should check if it is possible to define and handle data URLs in docbook. May be problematic with other formats though. (kn) Export ====== .. note:: First sentence a bit unclear ;) (kn) The export process for ODT works similar to PDF rendering, except for that is a little bit less strict. The internal DocBook representation is converted to the desired ODT representation according to its semantics. Based on the DocBook XML elements, the user can define styles using a simplified CSS syntax (see PDF). Each of the style definitions is converted to an automatic style in the resulting ODT document. ODT elements affected by a certain style get this style applied. Styles ====== A style is defined for each styling information. There is no direct assignement of layouting elements to styling information, but always a style in between. The