Title: Output Rewriting Pipelines (org.apache.sling.rewriter)

The Apache Sling Rewriter is a module for rewriting the output generated by a usual Sling rendering process. Some possible use cases include rewriting or checking all links in an HTML page, manipulating the HTML page, or using the generated output as the base for further transformation. An example of further transformation is to use XSLT to transform rendered XML to some output format like HTML or XSL:FO for generating PDF.

For supporting these use cases, the rewriter uses the concept for a processor. The processor is a component that is injected through a servlet filter into the response. By implementing the *Processor* interface one is able to rewrite the whole response in one go. A more convenient way of processing the output is by using a so called pipeline; the Apache Sling rewriter basically uses the same concept as the famous Apache Cocoon: an XML based pipeline for further post processing of the output. The pipeline is based on SAX events.

## SAX Pipelines

The rewriter allows to configure a pipeline for post processing of the generated response. Depending on how the pipeline is assembled the rewriting process might buffer the whole output in order to do proper post processing - for example this is required if an HTML response is "transformed" to XHTML or if XSLT is used to process the response.

As the pipeline is based on SAX events, there needs to be a component that generates these events and sends them through the pipeline. By default the Sling rendering scripts write to an output stream, so there is a need to parse this output and generate the SAX events.

The first component in the pipeline generating the initial SAX events is called a generator. The generator gets the output from Sling, generates SAX events (XML), and streams these events into the pipeline. The counterpart of the generator is the serializer which builds the end of the pipeline. The serializer collects all incomming SAX events, transforms them into the required response by writing into output stream of the response.

Between the generator and the serializer so called transformers can be placed in a chain. A transformer receives SAX events from the previous component in the pipeline and sends SAX events to the next component in the pipeline. A transformer can remove events, change events, add events or just pass on the events.

Sling contains a default pipeline which is executed for all HTML responses: it starts with an HTML generator, parsing the HTML output and sending events into the pipeline. An HTML serializer collects all events and serializes the output. 

The pipelines can be configured in the repository as a child node of */apps/APPNAME/config/rewriter* (or */libs/APPNAME/config/rewriter*). (In fact the configured search paths of the resource resolver are observed.) Each node can have the following properties:

* generatorType - the type of the generator (required)
* transformerTypes (multi value string) - the types of the transformers (optional)
* serializerType - the type of the serializer (required)
* paths (multi value string) - the paths this pipeline should run on (content paths)
* contentTypes (multi value string) - the content types this pipeline should be used for (optional)
* extensions (multi value string) - the extensions this pipeline should be used for (optional)
* resourceTypes (multi value string) - the resource types this pipeline should be used for (optional)
* unwrapResources (boolean) - check resource types of unwrapped resources as well (optional, since 1.1.0)
* selectors (multi value string) - a set of selectors the pipeline should be used for (optional, since 1.1.0)
* order (long) - the configurations are sorted by this order, order must be higher or equal to 0. The configuration with the highest order is tried first.
* enabled (boolean) - Is this configuration active? (default yes)

As you can see from the configuration there are several possibilities to define when a pipeline should be used for a response, like paths, extensions, content types, or resource types. It is possible to specify several of them at once. In this case all conditions must be met.

If a component needs a configuration, the configuration is stored in a child node which name is *\{componentType\}\-\{name\}*, e.g. to configure the HTML generator (named *html\-generator*), the node should have the name *generator-html-generator*. In the case that the pipeline contains the same transformer several times, the configuration child node should have the formant *\{componentType\}-\{index\}* where index is the index of the transformer starting with 1. For example if you have a pipeline with the following transformers, xslt, html-cleaner, xslt, link-checker, then the configuration nodes should be named *transformer-1* (for the first xslt), *transformer-html-cleaner*, *transformer-3* (for the second xslt), and *transformer-link-checker*.


### Default Pipeline

The default pipeline is configured for the *text/html* mime type and the *html* extensions and consists of the *html-generator* as the generator, and the *html-serializer* for generating the final response.
As the HTML generated by Sling is not required to be valid XHTML, the HTML parser is using an HTML parser to create valid SAX events. In order to perform this, the generator needs to buffer the whole response first.

## Implementing Pipeline Components

Each pipeline component type has a corresponding Java interface (Generator, Transformer, and Serializer) together with a factory interface (GeneratorFactory, TransformerFactory, and SerializerFactory). When implementing such a component, both interfaces need to be implemented. The factory has only one method which creates a new instance of that type for the current request. The factory has to be registered as a service. For example if you're using the Maven SCR plugin, it looks like this:

    ::java
    @scr.component metatype="no" 
    @scr.service interface="TransformerFactory"
    @scr.property value="pipeline.type" value="validator"

The factory needs to implement the according interface and should be registered as a service for this factory interface (this is a plain service and not a factory service in the OSGi sense). Each factory gets a unique name through the *pipeline.type* property. The pipeline configuration in the repository just references this unique name (like validator).

## Extending the Pipeline
With the possibilities from above, it is possible to define new pipelines and add custom components to the pipeline. However, in some cases it is required to just add a custom transformer to the existing pipeline. Therefore the rewriting can be configured with pre and post transformers that are simply added to each configured pipeline. This allows a more flexible way of customizing the pipeline without changing/adding a configuration in the repository.

The approach here is nearly the same. A transformer factory needs to be implemented, but instead of giving this factory a unique name, this factory is marked as a global factory:

    ::java
    @scr.component metatype="no"
    @scr.service interface="TransformerFactory"
    @scr.property name="pipeline.mode" value="global"
    @scr.property name="service.ranking" value="RANKING" type="Integer"

*RANKING* is an integer value (don't forget the type attribute otherwise the ranking is interpreted as zero!) specifying where to add the transformer in the pipeline. If the value is less than zero the transformer is added at the beginning of the pipeline right after the generator. If the ranking is equal or higher as zero, the transformer is added at the end of the pipeline before the serializer.

The *TransformerFactory* interface has just one method which returns a new transformer instance. If you plan to use other services in your transformer you might declare the references on the factory and pass in the instances into the newly created transformer.

Since the transformer carries information about the current response it is not advisable to reuse the same transformer instance among multiple calls of `TransformerFactory.createTransformer`.


## Implementing a Processor

A processor must conform to the Java interface *org.apache.sling.rewriter.Processor*. It gets initializd (method *init*) with the *ProcessingContext*. This context contains all necessary information for the current request (especially the output writer to write the rewritten content to).
The *getWriter* method should return a writer where the output is written to. When the output is written or an error occured *finished* is called.

Like the pipeline components a processor is generated by a factory which has to be registered as a service factory, like this:

    ::java
    @scr.component metatype="no" 
    @scr.service interface="ProcessorFactory"
    @scr.property value="pipeline.type" value="uniqueName"

## Configuring a Processor

The processors can be configured in the repository as a child node of */apps/APPNAME/config/rewriter* (or libs or any configured search path). Each node can have the following properties:

* processorType - the type of the processor (required) - this is the part from the scr factory information after the slash (in the example above this is *uniqueName*)
* paths (multi value string) - the paths this processor should run on (content paths)
* contentTypes (multi value string) - the content types this processor should be used for (optional)
* extensions (multi value string) - the extensions this processor should be used for (optional)
* resourceTypes (multi value string) - the resource types this processor should be used for (optional)
* unwrapResources (boolean) - check resource types of unwrapped resources as well (optional, since 1.1.0)
* selectors (multi value string) - a set of selectors the pipeline should be used for (optional, since 1.1.0)
* order (long) - the configurations are sorted by this order, order must be higher or equal to 0. The configuration with the highest order is tried first.
* enabled (boolean) - Is this configuration active? (default yes)