========================
eZ Publish markup format
========================

Summary of the discussion results on the new internal eZ Publish markup
format.

Scope
=====

The discussed format will be used for the storage of documents in the data
backend and therefore needs to be able to represent a sufficient superset of
the markup used by the various input and output formats.

Common use cases
----------------

The document format should cover the following common use cases.

1) Web content management

   In web content management the user will most likely edit the contents using
   some rich text editor [#]_ in the browser, and the contents will be
   transformed to (X)HTML for output on the website. Depending on the
   customer's preferences the output language might be anything from HTML 4 to
   HTML 5, or X/HTML 1, 1.1, 2 or 5.

2) Content management

   Content management normally involves more formats, like the already known
   Office document import and export, as well as export to known print output
   formats like PDF and LaTeX. The storage format must be able to match the
   markup offered by those formats as closely as possible, so that as little
   document semantics as possible is lost.

3) Website styling

   Some users want to use web content management systems for easy editing and
   styling of their web contents, which includes formatting of contents beside
   pure semantic markup. It should be possible to store this markup in the
   backend, although it should also be easy to filter out for later content
   cleaning.

4) Extensibility

   Content management and publication also means we must offer an easy way to
   integrate with external contents (like images, videos or other external
   data providers). We cannot foresee which applications will evolve here, so
   the markup format should stay extensible with custom tags.

Document component
==================

In the `eZ Components`__ project we develop the `document component`__ which
aims to provide document conversions between all relevant markup formats. The
current state is that we can convert documents in all directions between
RST__, Docbook__, XHTML 1 and HTML <=4.

Next we will work on integrating the eZ Publish markup formats into the chain
and then integrate `wiki markup languages`__, as well as PDF__ and maybe
other common markup languages like the `Open Document Format`__.

The document component currently uses a subset of Docbook as its internal
conversion format. An initial evaluation showed that Docbook covers most
semantic markup structures of the formats in use and is easy to process,
because XML is one of its supported syntaxes. Each format added to the
document component is therefore required to convert from and to Docbook. This
way we will be able to convert between all formats using Docbook as an
intermediate step.
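
The hub approach described above can be illustrated with a minimal sketch.
This is not the document component's API: the registry names, the toy
``plain`` and ``html`` formats, and the list of paragraphs standing in for
Docbook are all assumptions made for illustration only.

```python
from typing import Callable, Dict, List

# Stand-in for the Docbook hub document: a plain list of paragraph strings.
Hub = List[str]

# One reader (format -> hub) and one writer (hub -> format) per format.
# Both toy formats are hypothetical and ignore escaping and nesting.
READERS: Dict[str, Callable[[str], Hub]] = {
    "plain": lambda text: [p.strip() for p in text.split("\n\n") if p.strip()],
    "html": lambda text: [
        chunk.split("</p>")[0] for chunk in text.split("<p>")[1:]
    ],
}
WRITERS: Dict[str, Callable[[Hub], str]] = {
    "plain": lambda paragraphs: "\n\n".join(paragraphs),
    "html": lambda paragraphs: "".join(f"<p>{p}</p>" for p in paragraphs),
}


def convert(text: str, source: str, target: str) -> str:
    """Convert between two formats by going through the hub representation."""
    return WRITERS[target](READERS[source](text))


def converter_count(n_formats: int, use_hub: bool) -> int:
    """One-way converters needed for n formats (the hub counts as a format)."""
    if use_hub:
        return 2 * (n_formats - 1)
    return n_formats * (n_formats - 1)


html = convert("First paragraph\n\nSecond paragraph", "plain", "html")
```

For the four formats mentioned above, ``converter_count(4, use_hub=True)``
gives 6 converters, while direct conversion between every pair would need 12.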

The document component will offer a base for the conversions required by some
of the above-mentioned use cases.

Format considerations
=====================

With the use cases above and the background of already existing conversion
tools, the following markup languages are up for consideration.

RST / Wiki markup
-----------------

So-called "lightweight markup formats" are easily editable by the user and
offer great flexibility, because they are commonly extensible by custom
plugins. They will be available as input and output formats using the document
component, but are not suitable as an internal storage format, because:

- There are no common tools to parse such languages, so the parser is required
  to be implemented in PHP, which is slower than established markup parser
  frameworks like libxml2, available through the XML extensions in PHP.

- RST is not even a context-free language, so common parser approaches do not
  work here.

- A common base for wiki syntaxes is evolving__ but not really defined yet,
  and a lot of different dialects of the language still exist.

- The general tool support is quite poor for both language flavors: there are
  only two tools which are really able to parse RST (docutils__ and the
  document component) and most wiki markup parsers are dialect specific.

X/HTML 1 / X/HTML 5
-------------------

X/HTML is easy to parse, because it uses XML as its syntax, and it is widely
used in the web environment as a markup format for textual contents. A dialect
similar to XHTML 1.1 is already used in some versions of eZ Publish as a
markup language in the database.

X/HTML semantics
^^^^^^^^^^^^^^^^

X/HTML improves its semantic markup from version to version, and version 5 of
X/HTML introduces several new elements like <video>, <audio> and <section>.

Generally the X/HTML markup is centered on document representation and lacks
markup elements for structures often used in text semantics, like:

- Footnotes

  Footnotes are available in all other markup formats, like in RST__ and
  Docbook__, but cannot really be represented in X/HTML.

- Names, addresses, mail addresses, etc.

  Docbook already defines markup for many elements commonly used in various
  documents, which are only available in X/HTML through external extensions
  that are not yet settled, like microformats__.

X/HTML still includes a lot of markup which is used only or partly for
representation. The most common example is tables used to lay out websites.
But elements like <div> and <span>, or the attributes style="" and
on(load|click|...)="", are also used solely for representational purposes.
X/HTML is not designed for document-centric markup, but as a mix of
representational and semantic markup [CIT_IAN_2008]_.

    However, it lacks elements to express the semantics of many of the
    non-document types of content often seen on the Web. For instance, forum
    sites, auction sites, search engines, online shops, and the like, do not
    fit the document metaphor well, and are not covered by XHTML2

    -- Ian Hickson, HTML 5, W3C Working Draft 22 January 2008

X/HTML conversion benefits
^^^^^^^^^^^^^^^^^^^^^^^^^^

One might think that X/HTML offers the benefit of fewer conversions in the
most traditional use case, web content management. Considering the fourth use
case, however, X/HTML always needs to be processed on input and output as
well.

The input processing would need to filter representational elements from a
document to sanitize the contents stored in the data backend.

The output processing would need to transform custom extensions, like
<ezp:object node_id="23"/> or <mymodule:gallery/>, into valid X/HTML code, not
to mention the conversions still necessary from X/HTML 5 to X/HTML 1 or
HTML 4.
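
Such an output transformation can be sketched as follows. This is a minimal
sketch under stated assumptions, not the eZ Publish implementation: the
namespace URI, the function name ``render_custom_elements`` and the generated
``/content/view/`` link are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI bound to the "ezp" prefix in stored documents.
EZP_NS = "http://example.com/ns/ezp"


def render_custom_elements(xml_source: str) -> str:
    """Replace every <ezp:object node_id="..."/> with a plain X/HTML link."""
    root = ET.fromstring(xml_source)
    object_tag = f"{{{EZP_NS}}}object"
    # Snapshot the tree first, because children are replaced while looping.
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag != object_tag:
                continue
            node_id = child.get("node_id")
            # The link target is an assumption made for this sketch.
            link = ET.Element("a", {"href": f"/content/view/{node_id}"})
            link.text = f"Object {node_id}"
            link.tail = child.tail  # keep the text following the element
            parent[index] = link
    return ET.tostring(root, encoding="unicode")


stored = (
    f'<div xmlns:ezp="{EZP_NS}">'
    '<p>See <ezp:object node_id="23"/> for details.</p>'
    "</div>"
)
rendered = render_custom_elements(stored)
```

Once all custom elements are replaced, no element of the custom namespace
remains in the tree, so the serialized result contains plain X/HTML only.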

X/HTML editor integration
^^^^^^^^^^^^^^^^^^^^^^^^^

X/HTML integrates well with existing editors, even though they often do not
focus on semantically correct markup, but on representation-centric WYSIWYG
editing.

The rich text editors will probably be updated to generate X/HTML 5 sooner or
later, which could spare us the work of making the editors create a custom
markup.

Custom formatting
^^^^^^^^^^^^^^^^^

Custom user-defined formatting like colors, as mentioned in use case 3, is
offered in X/HTML by default. This may make it hard to filter later on,
because, as mentioned above, semantic and representational markup are mixed by
design in X/HTML. On the other hand no markup extensions are required.

A filter can still remove all elements and attributes not defined in a
whitelist for valid markup.
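
A whitelist filter of this kind could look like the following sketch. The
whitelist contents and function names are assumptions for illustration. Note
the design choice: a disallowed element is dropped together with its content,
and only the text following it is kept; unwrapping the element instead would
be an equally valid policy.

```python
import xml.etree.ElementTree as ET

# Hypothetical whitelist: element name -> attribute names allowed on it.
ALLOWED = {
    "p": set(),
    "em": set(),
    "strong": set(),
    "a": {"href"},
}


def sanitize(element: ET.Element) -> None:
    """Remove non-whitelisted attributes and child elements in place."""
    allowed_attributes = ALLOWED.get(element.tag, set())
    for name in list(element.attrib):
        if name not in allowed_attributes:
            del element.attrib[name]

    previous = None  # last child that was kept
    for child in list(element):
        if child.tag in ALLOWED:
            sanitize(child)
            previous = child
            continue
        # Drop the element, but keep the text that followed it.
        tail = child.tail or ""
        if previous is None:
            element.text = (element.text or "") + tail
        else:
            previous.tail = (previous.tail or "") + tail
        element.remove(child)


def sanitize_markup(xml_source: str) -> str:
    """Parse a well-formed fragment, filter it and serialize it again."""
    root = ET.fromstring(xml_source)
    sanitize(root)
    return ET.tostring(root, encoding="unicode")


dirty = (
    '<p style="color: red">Read '
    '<a href="/more" onclick="track()">more</a>'
    "<blink>now</blink> please.</p>"
)
clean = sanitize_markup(dirty)
```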

X/HTML 2
--------

X/HTML 2 is also a strong improvement over X/HTML 1, offering section
definitions similar to those in Docbook and X/HTML 5, along with other small
improvements. It still has many of the same drawbacks as X/HTML 5, as
mentioned in the sections `X/HTML conversion benefits`_, `X/HTML semantics`_
and `X/HTML editor integration`_.

X/HTML 1
--------

Besides the drawbacks mentioned for X/HTML 2 and 5, X/HTML 1 and 1.1 have
additional problems. They lack several of the markup structures introduced in
X/HTML 2 and 5, especially the <section> element, which makes it hard to
decide which block level element belongs to which section, as the following
example shows::

    <h1>Header 1</h1>
    <p>First paragraph...</p>
    <h2>Header 2</h2>
    <p>Second paragraph...</p>
    <p>Third paragraph...</p>

Here it is not decidable whether the third paragraph belongs to the first or
the second section, introduced by the respective headers. The same is true for
the second paragraph. The resulting document could look like::

    <section>
        <header>Header 1</header>
        <para>First paragraph...</para>
        <section>
            <header>Header 2</header>
            <para>Second paragraph...</para>
        </section>
        <para>Third paragraph...</para>
    </section>

Or::

    <section>
        <header>Header 1</header>
        <para>First paragraph...</para>
        <section>
            <header>Header 2</header>
            <para>Second paragraph...</para>
            <para>Third paragraph...</para>
        </section>
    </section>

This may be problematic when converting documents edited in the web interface
to output formats that are aware of those structures and style documents
accordingly.
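
Because the structure is ambiguous, a converter has to pick a heuristic. The
sketch below assumes the common one: every block belongs to the section of
the most recent heading, and a heading closes all open sections of the same
or a deeper level. The function name, the ``article`` wrapper element and the
flat ``body`` input are assumptions for illustration; the ``section``,
``header`` and ``para`` names follow the examples above.

```python
import xml.etree.ElementTree as ET

# Heading element name -> nesting level.
HEADINGS = {f"h{level}": level for level in range(1, 7)}


def nest_sections(flat_source: str) -> str:
    """Rebuild nested sections from a flat sequence of headings and blocks."""
    source = ET.fromstring(flat_source)
    root = ET.Element("article")  # hypothetical wrapper element
    stack = [(0, root)]  # (heading level, container element)

    for node in source:
        if node.tag in HEADINGS:
            level = HEADINGS[node.tag]
            # Close all sections of the same or a deeper level.
            while stack[-1][0] >= level:
                stack.pop()
            section = ET.SubElement(stack[-1][1], "section")
            ET.SubElement(section, "header").text = node.text
            stack.append((level, section))
        else:
            # Simplification: every other block becomes a text-only paragraph.
            ET.SubElement(stack[-1][1], "para").text = node.text

    return ET.tostring(root, encoding="unicode")


flat = (
    "<body>"
    "<h1>Header 1</h1><p>First paragraph...</p>"
    "<h2>Header 2</h2><p>Second paragraph...</p><p>Third paragraph...</p>"
    "</body>"
)
nested = nest_sections(flat)
```

With this heuristic the third paragraph always ends up in the second section,
which corresponds to the second of the two variants shown above. The first
variant cannot be produced from the flat input at all.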

Docbook
-------

Docbook is one of the most complete XML based markup languages with only
semantic markup.

Docbook semantics
^^^^^^^^^^^^^^^^^

Docbook is by far the most complete and established markup language,
comparable with LaTeX, but XML based. The only problems experienced so far
converting other markup languages to Docbook are documented in the
`documentation of the document component`__. None of the described problems is
really relevant from a semantic point of view; they are only small possible
conversion losses.

Docbook editor integration
^^^^^^^^^^^^^^^^^^^^^^^^^^

The rich text editor in use is required to create non-X/HTML elements to offer
the user a WYSIWYG experience with a Docbook markup format. The elements
created by the editor can be styled as usual using CSS, like
`documented here`__.

Another possibility would be to keep the editor creating X/HTML and convert it
to Docbook before storing the document in the database, as already supported
by the document component. This would, of course, reduce the set of features
of the markup language that can be used.

Custom formatting
^^^^^^^^^^^^^^^^^

Since Docbook is also XML, custom formatting and modules can be integrated
with the XML source using different XML namespaces, and be converted on output
to X/HTML including the required representational markup.
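
Keeping custom markup in separate namespaces also makes it easy to locate.
The following minimal sketch lists all elements of a document that are not in
the Docbook 5 namespace, so they could be dispatched to a module or filtered
out. The function name and the ``mymodule`` namespace URI are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

DOCBOOK_NS = "http://docbook.org/ns/docbook"  # Docbook 5 namespace


def foreign_elements(xml_source: str) -> List[Tuple[str, str]]:
    """Return (namespace, local name) for every non-Docbook element."""
    found = []
    for element in ET.fromstring(xml_source).iter():
        # ElementTree writes namespaced tags as "{namespace}local".
        if element.tag.startswith("{"):
            namespace, _, local = element.tag[1:].partition("}")
        else:
            namespace, local = "", element.tag
        if namespace != DOCBOOK_NS:
            found.append((namespace, local))
    return found


document = (
    f'<section xmlns="{DOCBOOK_NS}" '
    'xmlns:mymodule="http://example.com/ns/mymodule">'
    "<title>Holiday</title><para>Pictures:</para><mymodule:gallery/>"
    "</section>"
)
custom = foreign_elements(document)
```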

Conclusion
==========

All formats require conversions during input and output of contents, because
of the above-mentioned use cases. Even though there is progress in X/HTML 2
and 5, the markup offered by those languages is not nearly as complete as the
Docbook markup and still includes purely representational markup, which would
require us to define a subset of X/HTML which is valid to store. Also, the
X/HTML standards in versions 2 and 5 have not settled down yet and may be
subject to future modifications.

All formats offer enough capabilities to extend them with custom markup
directives.

The XML based formats should offer faster processing than the text based
formats, especially because of the integration of libxml2 with PHP 5.

Because of the above considerations Docbook seems the best choice for the
internal markup format in eZ Publish.

.. [#] Rich text editors on the web are commonly editors like TinyMCE__ or
   FCKEditor__, which offer WYSIWYG capabilities in web browsers.

.. [CIT_IAN_2008] `"HTML 5, 1.1.2. Relationship to XHTML2"`__. World Wide Web
   Consortium. Retrieved on 2008-07-19. “… XHTML2… defines a new HTML
   vocabulary with better features for hyperlinks, multimedia content,
   annotating document edits, rich metadata, declarative interactive forms,
   and describing the semantics of human literary works such as poems and
   scientific papers… However, it lacks elements to express the semantics of
   many of the non-document types of content often seen on the Web. For
   instance, forum sites, auction sites, search engines, online shops, and the
   like, do not fit the document metaphor well, and are not covered by XHTML2…
   This specification aims to extend HTML so that it is also suitable in these
   contexts…”

__ http://ezcomponents.org/
__ http://ezcomponents.org/docs/tutorials/Document
__ http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html
__ http://docbook.org/tdg/en/html/docbook.html
__ http://www.wikicreole.org/wiki/Engines
__ http://en.wikipedia.org/wiki/Portable_Document_Format
__ http://de.wikipedia.org/wiki/OpenDocument
__ http://www.wikicreole.org/wiki/Engines
__ http://docutils.sourceforge.net/
__ http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#footnotes
__ http://docbook.org/tdg/en/html/footnote.html
__ http://en.wikipedia.org/wiki/Microformat
__ http://ezcomponents.org/docs/api/trunk/Document_conversion.html
__ http://kore-nordmann.de/blog/the_long_way_to_semantic_web.html#id6
__ http://tinymce.moxiecode.com/
__ http://www.fckeditor.net/
__ http://www.w3.org/TR/2008/WD-html5-20080122/#relationship0