========================
eZ Publish markup format
========================

Summary of the discussion results on the new internal eZ Publish markup
format.

Scope
=====

The discussed format will be used to store documents in the data backend and
therefore needs to be able to represent a sufficient superset of the markup
used by the various input and output formats.

Common use cases
----------------

Common use cases which should be covered by the document format.

1) Web content management

   In web content management the user will most likely edit the contents
   using some rich text editor [#]_ in the browser, and the contents will be
   transformed to (X)HTML for output on the website. Depending on the
   customer's preferences the output language might be anything from HTML 4
   to HTML 5, or X/HTML 1, 1.1, 2 or 5.

2) Content management

   Content management normally involves more formats, like the already known
   Office document import and export, as well as exporting documents to
   common print output formats like PDF and LaTeX. The storage format must be
   able to match the markup offered by those documents as closely as
   possible, so that as little document semantics as possible is lost.

3) Website styling

   Some users want to use web content management systems for easy editing and
   styling of their web contents, which includes formatting of contents
   besides pure semantic markup. It should be possible to store this markup
   in the backend as well, even though it should also be easy to filter out
   for later content cleaning.

4) Extensibility

   Content management and publication also means we must offer an easy way to
   integrate external contents (like images, videos or other external data
   providers). We cannot foresee which applications will evolve here, so the
   markup format should stay extensible with custom tags.

Document component
==================

In the `eZ Components`__ project we develop the `document component`__ which
aims to provide document conversions between all relevant markup formats.
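
As a rough illustration of what conversion between many formats can look
like, the following sketch routes every conversion through one shared
intermediate representation, so each format only needs a reader and a writer.
Everything in it is hypothetical: the registries, the format names ``plain``
and ``html`` and the toy list-of-blocks intermediate structure are assumptions
made for this example and are not the API of the document component.

```python
import html
from typing import Callable, Dict, List, Tuple

# Hypothetical intermediate representation: a list of (kind, text) blocks.
Block = Tuple[str, str]
Intermediate = List[Block]

# Hypothetical registries: one reader and one writer per format.
readers: Dict[str, Callable[[str], Intermediate]] = {}
writers: Dict[str, Callable[[Intermediate], str]] = {}


def read_plain(source: str) -> Intermediate:
    """Toy reader: every non-empty line becomes a paragraph block."""
    return [("para", line.strip())
            for line in source.splitlines() if line.strip()]


def write_plain(doc: Intermediate) -> str:
    """Toy writer: paragraphs separated by blank lines."""
    return "\n\n".join(text for kind, text in doc if kind == "para")


def write_html(doc: Intermediate) -> str:
    """Toy writer: paragraphs become escaped <p> elements."""
    return "\n".join(f"<p>{html.escape(text)}</p>"
                     for kind, text in doc if kind == "para")


readers["plain"] = read_plain
writers["plain"] = write_plain
writers["html"] = write_html


def convert(source: str, source_format: str, target_format: str) -> str:
    """Convert by reading into the intermediate form, then writing it out."""
    intermediate = readers[source_format](source)
    return writers[target_format](intermediate)


if __name__ == "__main__":
    print(convert("Hello\n\nA & B", "plain", "html"))
```

With such a layout, adding a format means writing one reader and one writer
instead of a dedicated converter for every other format.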
The current state is that we can convert documents in all directions between
RST__, Docbook__, XHTML 1 and HTML <= 4. Next we will work on integrating the
eZ Publish markup formats into the chain and then integrate `wiki markup
languages`__, as well as PDF__ and maybe other common markup formats like the
`Open Document Format`__.

The document component currently uses a subset of Docbook as the internal
conversion format, because an initial evaluation showed that it covers most
semantic markup structures of the formats in use and is easy to process,
since XML is one of its supported syntaxes.

So each format added to the document component is required to convert from
and to Docbook. This way we will be able to convert between all formats using
Docbook as an intermediate step.

The document component will offer a base for the conversions required by some
of the use cases mentioned above.

Format considerations
=====================

Given the use cases above and the background of already existing conversion
tools, the following markup languages are up for consideration.

RST / Wiki markup
-----------------

So-called "lightweight markup formats", which are easily editable by the user
and offer great flexibility, because they are commonly extensible by custom
plugins. They will be available as input and output formats through the
document component, but are not suitable as an internal storage format,
because:

- There are no common tools to parse such languages, so the parser has to be
  implemented in PHP, which is slower than established markup parser
  frameworks like libxml2, available through the XML extensions in PHP.

- RST is not even a context-free language, so common parser approaches do not
  work here.

- A common base for wiki syntaxes is evolving__ but not really defined yet,
  and a lot of different dialects of the language still exist.

- The general tool support is quite poor for both language flavors: only two
  tools are really able to parse RST (docutils__ and the document component),
  and most wiki markup parsers are dialect specific.

X/HTML 1 / X/HTML 5
-------------------

X/HTML is easy to parse, because it uses XML as its syntax, and it is widely
used in the web environment as a markup format for textual contents. A
dialect similar to XHTML 1.1 is already used in some versions of eZ Publish
as a markup language in the database.

X/HTML semantics
^^^^^^^^^^^^^^^^

X/HTML improves its semantic markup from version to version, and in version 5
of X/HTML there are several new elements introduced like