XPathScript - A Viable Alternative to XSLT?

Matt Sergeant
matt@sergeant.org

Copyright 2000 AxKit.com Ltd

This guide introduces XPathScript, the template processor in AxKit that provides full programming facilities alongside XPath-based node resolution. It also features code / template separation using the ASP <% %> paradigm.

Introduction

XPathScript is a stylesheet language for translating XML files into some other format. It has only a few features, but by combining those features with the power and flexibility of Perl, XPathScript becomes a very capable system. Like all XML stylesheet languages, including XSLT, an XPathScript stylesheet is always executed in the context of a source XML file. In many cases the source XML file will itself define which stylesheets to use via the <?xml-stylesheet?> processing instruction.

XPathScript was conceived as part of AxKit - an application server environment for Apache servers running mod_perl (XML.com ran my Introduction to AxKit article in May). Its primary goal was to achieve the sorts of transformations that XSLT can do, without being restricted by XSLT's XML-based syntax, and to provide full programming facilities within that environment. I also wanted XPathScript to be completely agnostic about output formats, without having to program in special after-effect filters. The result is a language for server-side transformation that provides the power and flexibility of XSLT combined with the full capabilities of the Perl language, and the ability to produce stylesheets in any ASP-capable editor or ordinary text editor.

The Introduction to AxKit article mentioned above is recommended reading before this guide.

The Syntax

XPathScript follows the basic ASP syntax of introducing code with the <% %> delimiters. Here's a brief example of a fully compatible XPathScript stylesheet:

    <html>
    <body>
    <%= 5+5 %>
    </body>
    </html>

This simply outputs the value 10 in an HTML document.
The delimiters used here are the <%= %> delimiters, which differ slightly from <% %> in that they send the result of the expression to the browser (or to the next processing stage in AxKit). Of course this example does absolutely nothing with the source XML file, which is completely separate from this stylesheet. Here's another example:

    <% $foo = 'World' %>
    Hello <%= $foo %> !!!

This outputs the text "Hello World !!!". Again, we're not actually doing anything here with our source document, so all XML files using this stylesheet will look identical. This seems rather uninteresting until we discover the library of functions that are accessible to our XPathScript stylesheets for accessing the source document contents.

The XPathScript API

Along with the code delimiters, XPathScript provides stylesheet developers with a full API for accessing and transforming the source XML file. This API can be used in conjunction with the delimiters above to provide a stylesheet language that is as powerful as XSLT, and yet provides all the features of a full programming language (in this case Perl, but I'm certain that other implementations such as Python or Java would be possible).

Extracting Values

A simple example to get us started is to use the API to bring in the title from a DocBook article. A DocBook article title looks like this:

    <article>
      <artheader>
        <title>XPathScript - A Viable Alternative to XSLT?</title>
        ...

The XPath expression to retrieve the text in the title element is:

    /article/artheader/title/text()

Putting this all together to make this text into the HTML title, we get the following XPathScript stylesheet:

    <html>
    <head>
      <title><%= findvalue("/article/artheader/title/text()") %></title>
    </head>
    <body>
      This was a DocBook Article.
      We're only extracting the title for now!
      <p>
      The title was: <%= findvalue("/article/artheader/title/text()") %>
      </p>
    </body>
    </html>

There are lots of features to the expression syntax we used to find that "node", and this syntax is called XPath. This is a W3C standard for finding and matching XML document nodes. The standard is fairly readable and is at http://www.w3.org/TR/xpath. Alternatively I can recommend Norm Walsh's XPath introduction, which covers a slightly older version of the specification, but I didn't notice anything in the article that is missing or different from the current recommendation.

Extracting Nodes

The above example showed us how to extract single values, but what if we have a list of things we wish to extract values from? Here's how we might get a table of contents from DocBook article sections:

    ...
    <%
    for my $sect1 (findnodes("/article/sect1")) {
        print $sect1->findvalue("title/text()"), "<br>\n";
        for my $sect2 ($sect1->findnodes("sect2")) {
            print " + ", $sect2->findvalue("title/text()"), "<br>\n";
            for my $sect3 ($sect2->findnodes("sect3")) {
                print " + + ", $sect3->findvalue("title/text()"), "<br>\n";
            }
        }
    }
    %>
    ...

This gives us a table of contents down to three levels (adding links to the actual part of the document is left as an exercise). The first call to findnodes gives us all sect1 nodes that are children of the root element (article). The XPath expressions following that are relative to the current node. You can see that by the absence of the leading /. Again, XPath is a very interesting query language, and you would be best to visit the XPath specification to learn more.

Note that in the above we don't use the global function findnodes() after finding the sect1 nodes. Instead we call the node method findnodes(), which does exactly the same thing but makes the node you are calling it on the context of the XPath expression.

Declarative Templates

The examples up to now have all covered the concept of a single global template with search/replace type functionality from the source XML document. This is a powerful concept in itself, especially when combined with loops and the ability to change the context of searches. But that style of template is limited in utility to well structured data, rather than processing large documents. To ease the processing of documents, XPathScript includes a declarative template processing model too, so that you can simply specify the format for a particular element and let XPathScript do the work for you.

To support this method, XPathScript introduces one more API function: apply_templates(). The name is intended to appeal to people already familiar with XSLT. The apply_templates() function takes either a list of start nodes, or an XPath expression (that must result in a node set) and an optional context. Starting at the start nodes, it traverses the document tree applying the templates defined by the $t hash reference.

First a simple example to introduce this feature. Let's assume for a moment that our source XML file is valid XHTML, and we want to change all anchor links to italics.
Here is the very simple XPathScript template that will do that:

    <%
    $t->{'a'}{pre} = '<i>';
    $t->{'a'}{post} = '</i>';
    $t->{'a'}{showtag} = 1;
    %>
    <%= apply_templates() %>

Note that apply_templates() has to be output using <%= %>. That's because apply_templates() returns a string representation of the transformation; it doesn't send the output to the browser for you.

The first thing this example does is set up a hash reference $t that XPathScript knows about (let's call it magical). The keys of $t are element names (including the namespace prefix if we are using namespaces). The hash can have the following sub-keys:

    pre
    post
    showtag
    testcode

We'll cover testcode in more depth later in the "testcode" section, but for now know that it is a placeholder for code that allows for more complex templates.

Unlike XSLT's declarative transformation syntax, the keys of $t do not specify XPath match expressions. Instead they are simple element names. This trades flexibility for speed of execution: Perl hash lookups are extremely quick compared to XPath matching. Luckily, because of the testcode option, more complex matches are quite possible with XPathScript.

The simple explanation for now is that pre specifies output to appear before the tag, post specifies output to appear after the tag, and showtag specifies that the tag itself should be output as well as the pre and post values.

A Complete Example

Now let's put all of these ideas together into an (almost) complete example. This is part of the stylesheet I use to process my DocBook articles online:

    <!--#include file="docbook_tags.xps"-->
    <%
    my %links;
    my $linkid = 0;
    $t->{'ulink'}{testcode} = sub {
        my $node = shift;
        my $t = shift;
        my $url = findvalue('@url', $node);
        if (!exists $links{$url}) {
            $linkid++;
            $links{$url} = $linkid;
        }
        my $link_number = $links{$url};
        $t->{pre} = "";
        $t->{post} = " [$link_number]";
        return 1;
    };
    %>
    <html>
    <head>
      <title><%= findvalue('/article/artheader/title/text()') %></title>
    </head>
    <body>
    <%
    # display title/TOC page
    print apply_templates('/article/artheader/*');
    %>

    <%
    # display particular page
    foreach my $section (findnodes("/article/sect1")) {
        print apply_templates($section);
    }
    %>

    <h3>List of Links</h3>
    <table>
    <tr><th>URL</th></tr>
    <% for my $link (sort {$links{$a} <=> $links{$b}} keys %links) { %>
    <tr><td><%= "[$links{$link}] $link" %></td></tr>
    <% } %>
    </table>
    </body>
    </html>

The very first line there imports a library of tags that is shared between this stylesheet and one that is easier for web viewing, with clickable links between sections (which can be downloaded here). The import system is based on Server Side Includes (SSI), although only SSI file includes are supported at this time (SSI virtual includes can be implemented using mod_include). Here is part of the docbook_tags.xps file:

    <%
    $t->{'attribution'}{pre} = "";
    $t->{'attribution'}{post} = "<br>\n";

    $t->{'para'}{pre} = '<p>';
    $t->{'para'}{post} = '</p>';

    $t->{'ulink'}{testcode} = sub {
        my $node = shift;
        my $t = shift;
        $t->{pre} = "<a href=\"" . findvalue('@url', $node) . "\">";
        $t->{post} = '</a>';
        return 1;
    };

    $t->{'title'}{testcode} = sub {
        my $node = shift;
        my $t = shift;
        if (findvalue('parent::blockquote', $node)) {
            $t->{pre} = "";
            $t->{post} = "<br>\n";
        }
        elsif (findvalue('parent::artheader', $node)) {
            $t->{pre} = "";
            $t->{post} = "";
        }
        else {
            my $parent = findvalue('name(..)', $node);
            if (my ($level) = $parent =~ m/sect(\d+)$/) {
                $t->{pre} = "";
                $t->{post} = "";
            }
        }
        return 1;
    };
    %>

We go into detail about what is happening in this example in the next section.

Stepping Through the Example

Careful readers will note that the first thing we see is a $t specification for <ulink> tags, and that the included docbook_tags.xps also contains a specification for <ulink>. The reason is to override the default behaviour for ulink tags in the print version of my articles, so that each link carries a reference number that we can use later in a list of links. We can also see that this specification uses a testcode parameter that we haven't encountered before. We'll see how and why that's used later in the "testcode" section.

Next we see the findvalue() function used exactly as we already saw in Extracting Values.

Then we have a section with a comment marked "display title/TOC page". This uses the apply_templates() function with an XPath expression. Note that rather than use the <%= %> delimiters around the apply_templates() call, we simply use the print function. This has the same effect, and is used here to show the flexibility in this approach.

The main part of the code loops through all sect1 tags, and calls apply_templates on those nodes. Note how this is another demonstration of Perl's TMTOWTDI (There's More Than One Way To Do It) approach - the same code could have been written:

    <%= apply_templates("/article/sect1") %>

Finally, because this is the print version of our article, we provide a list of links so that people viewing a printed version of this article can type in those links, and they can also refer to the link by reference number, as we saw earlier. We use the hash of links in the %links variable that we built in the testcode handler for our ulink template.

The other file, docbook_tags.xps, is included only in part here, to demonstrate a few of the transformations we're applying to various DocBook article tags.
We can see that we're turning <para> tags into <p> tags, and doing some more complex processing with testcode on <title> tags. We'll see in the "testcode" section exactly what testcode allows us to achieve.

The Template Hash

The apply_templates() function iterates over the nodes of your XML file, applying the templates in the $t hash reference. This is the most important feature of XPathScript, because it allows you to define the appearance of individual tags without having to do it programmatically. This is the declarative part of XPathScript. There is an important point to make here: XSLT is a purely declarative syntax, and people are having to work procedural code into XSLT via workarounds. XPathScript takes a much more pragmatic approach (much like Perl itself) - it is both declarative and procedural, allowing you the flexibility to use real code for real problems. It is important to note that apply_templates returns a string, so you must either use print apply_templates() from a Perl section of code, or <%= apply_templates() %>.

The keys of $t are the names of the elements, including namespace prefixes. When you call apply_templates(), every element visited is looked up in the $t hash, and the template items stored in that hash are applied to the node.

It's worth noting at this point that, unlike XSLT, XPathScript does not perform tree transformations from one tree to another. It simply sends its output to the browser directly. This has advantages and disadvantages, but they are beyond the scope of this guide.

The following sub-keys define the transformation:

    pre - the output to occur before the tag.
    post - the output to occur after the tag.
    prechildren - the output to occur before the children of this tag are output.
    postchildren - the output to occur after the children of this tag are output.
    prechild - the output to occur before each child of this tag.
    postchild - the output to occur after each child of this tag.
    showtag - set to a true value to display the tag as well as the pre and post values. If unset or false the tag itself is not displayed.
    testcode - code to execute upon visiting this tag. See below.

The showtag option is mostly equivalent to the XSLT <xsl:copy> tag, only less verbose. The pre and post options are useful because in transformations we generally want to specify what comes before and after a tag. For example, to change an HTML A tag to be in italics, but still have the link, we would use the following:

    $t->{A}{pre} = "<i>";
    $t->{A}{post} = "</i>";
    $t->{A}{showtag} = 1;

"testcode"

The testcode option is where we perform really powerful transformations. It's how we can do more complex tests on the node than are available in XPath, and locally modify the transformation based on what we find.

The value stored in testcode is simply a reference to a subroutine. In Perl these are incredibly simple to create using the anonymous sub keyword (note that these are often erroneously called closures, but they only become closures if they reference a lexical variable outside the scope of the subroutine itself). The sub is called every time one of these elements is visited. The subroutine is passed two parameters: the node itself, and an empty hash reference that you can populate using the pre, post, prechildren, prechild, postchildren, postchild and showtag values that we've discussed already. Unlike the global $t hashref, you don't have to first specify the element name as a key. Here's the <ulink> example from the global tags code above:

    $t->{'ulink'}{testcode} = sub {
        my ($node, $t) = @_;
        $t->{pre} = '<a href="' . findvalue('@url', $node) . '">';
        $t->{post} = '</a>';
        return 1;
    };

The equivalent XSLT code looks like this:

    <xsl:template match="ulink">
      <a href="{@url}"><xsl:apply-templates/></a>
    </xsl:template>

Note in the XPathScript above that the inner $t is lexically scoped, so changes to it don't affect the outer $t. To save some confusion we might have named that variable $localtransforms, but some people like myself hate typing... ;-)

The return value from the testcode is also important.
A return value of 1 means process this node and continue processing all the children of this node. A return value of -1 means process this node and stop, and a return value of 0 means do not process this node at all. This is useful in conditional tests, where you may not wish to process the nodes under certain conditions. You may also return a string that is an XPath expression. See the mini-reference below for more information.

It is important to note that we can do things here based on XPath lookups just as we can in XSLT. While it is a little more verbose than a simple XSLT pattern match, the trade-off is in performance. For example, in XSLT you might match artheader/title and elsewhere you might match title[name(..) != "artheader"]. In XPathScript we can only match "title" in the template hash. But we can use the testcode section to extend the match:

    $t->{'title'}{testcode} = sub {
        my $node = shift;
        my $t = shift;
        if (findvalue('parent::blockquote', $node)) {
            $t->{pre} = "";
            $t->{post} = "<br>\n";
        }
        elsif (findvalue('parent::artheader', $node)) {
            $t->{pre} = "";
            $t->{post} = "";
        }
        else {
            my $parent = findvalue('name(..)', $node);
            if (my ($level) = $parent =~ m/sect(\d+)$/) {
                $t->{pre} = "";
                $t->{post} = "";
            }
        }
        return 1;
    };

Here we check what the parent node is before performing our modification to the local $t hashref. Note in particular the utility of being able to use Perl regular expressions to extract values.

Copying styles

One really neat feature of XPathScript that is really hard to do with XSLT is the ability to copy a style completely:

    <%
    $t->{'foo'}{pre} = "";
    $t->{'foo'}{post} = "";
    $t->{'foo'}{showtag} = 1;
    $t->{'bar'} = $t->{'foo'};
    %>

While this would be possible in XSLT using entities, it's certainly not very practical or neat. With XPathScript many tags can share the same template. Be careful though - this is a reference copy, not a deep copy, so the following may not do what you think it should:

    <%
    $t->{'foo'}{pre} = "";
    $t->{'foo'}{post} = "";
    $t->{'foo'}{showtag} = 1;
    $t->{'bar'} = $t->{'foo'};
    $t->{'bar'}{post} = "<br>";
    %>

Because this is a reference, the last line there changes the values for 'foo' as well as 'bar'.

A "Catch All"?

Does XPathScript have a "catch all" option for elements that I don't have a $t entry for? Yes, of course! Simply set $t->{'*'} to the template you want to execute. You can even do some really clever things, such as using the testcode section to output a warning to the Apache error log about an unrecognised tag, rather than having to place some output in the resulting document and bother your users! This feature was introduced in AxKit 0.94.

Interpolation

Adding attributes or other data into the translated nodes is non-trivial using this setup. It requires you to drop down into testcode. Here's an example of turning <link url="..."> tags into HTML <a> tags:

    <%
    $t->{'link'}{testcode} = sub {
        my ($node, $t) = @_;
        $t->{pre} = '<a href="' . findvalue('@url', $node) . '">';
        $t->{post} = '</a>';
        return 1;
    };
    %>

This is obviously rather verbose. To make this a little simpler, XPathScript as of AxKit 1.1 supports interpolation of the replacement strings, much the same as you can do with XSLT attributes. Here is the appropriate $t entry as of AxKit 1.1:

    <%
    $t->{'link'}{pre} = '<a href="{@url}">';
    $t->{'link'}{post} = '</a>';
    %>

The curly brackets {} delimit an XPath expression on which findvalue is called using the current node as the context. Any XPath expression should be valid within those delimiters.

As a backwards compatibility measure, and to ensure the efficient behaviour is the default, interpolation only occurs when it has been enabled in your Apache configuration for the current request.

You can also turn off interpolation temporarily in your script using the global variable $XPathScript::DoNotInterpolate. Set that to a true value to turn off interpolation. Be careful to only do that locally (using the Perl local keyword) to ensure it doesn't remain set for the next invocation of the script.
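The advice about the local keyword can be illustrated outside AxKit. The following is a minimal sketch, not AxKit code: it declares a stand-in $XPathScript::DoNotInterpolate so that it runs on its own, and the two helper subroutines (interpolation_enabled and render_without_interpolation) are hypothetical names invented for the illustration.

```perl
use strict;
use warnings;

# Stand-in for the real global, declared here only so the sketch runs
# outside AxKit. Inside a stylesheet the variable already exists.
package XPathScript;
our $DoNotInterpolate = 0;

package main;

# Hypothetical helper: reports whether interpolation is currently on.
sub interpolation_enabled {
    return $XPathScript::DoNotInterpolate ? 0 : 1;
}

# Hypothetical helper: switches interpolation off for its own scope only.
sub render_without_interpolation {
    # local saves the old value and restores it when this scope exits,
    # even if the code dies part-way through.
    local $XPathScript::DoNotInterpolate = 1;
    return interpolation_enabled();
}

my $inside = render_without_interpolation();   # flag was set while rendering
my $after  = interpolation_enabled();          # flag is back to its old value

print "inside: $inside, after: $after\n";
```

A plain assignment in place of local would leave the flag set after the subroutine returns, which in a persistent mod_perl process is exactly the leak between invocations that the warning above is about.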
Writing Dynamic Content

Because XPathScript has full access to all the Perl builtins, you can very easily create dynamic content with XPathScript. There is one caveat though: the AxKit cache works on the basis of the timestamp of the original XML file. This means that your XPathScript code will only be executed when the XML resource that is being requested actually changes. To work around this limitation you simply need to tell AxKit that this stylesheet contains dynamic content, and therefore the output should not be cached. The syntax for this duplicates the Apache API for telling proxy servers not to cache the output:

    <%
    ...
    $r->no_cache(1);
    ...
    %>

An XPathScript Mini-Reference

Code is separated from output in XPathScript using the <% %> delimiters. Perl expression results can be sent to the browser either using print() inside a <% %> section, or via <%= code %>.

The following XPath functions are imported for your use:

    findnodes($path, [$context])
    findvalue($path, [$context])
    findnodes_as_string($path, [$context])
    apply_templates($path, [$context])
    apply_templates(@nodes)
    import_template($uri)

The first three functions are documented more completely in the XML::XPath manual pages.

apply_templates examines the contents of the $t hash reference for element names. For example, when encountering a <foo> element via apply_templates, XPathScript will try to find a transformation hash in the key $t->{'foo'}.

import_template can be used to pull in an external XPathScript template file. $uri should be a path to the stylesheet to be included. The function returns an anonymous subroutine that, when executed, will run the stylesheet. The anonymous subroutine takes two arguments, which makes it ideal to plug into a testcode entry, for example:

    $t->{BODY}{testcode} = import_template("/xps/bodystyle.xps");

Inside the imported stylesheet, you will be referencing the same $t as the parent stylesheet. You can get at the usual testcode version of $t by using $real_local_t. If you want to include a stylesheet anyway (not as part of a testcode setup), just write it as normal, and include a line like this in the parent stylesheet:

    import_template("/xps/bodystyle.xps")->();

The value in $t->{'foo'} above is a hash reference with the following optional keys:

    pre
    post
    prechildren
    postchildren
    prechild
    postchild
    showtag
    testcode

If a value is not found in $t for the current element, then the element is output verbatim and apply_templates is performed on all its children, unless a $t->{'*'} value exists, which is a "catch all" transformation specification. This might be a useful place to add some testcode to output a warning to the error log.

If a value is found in $t for the current element, then the tag itself is not displayed unless $t->{<element_name>}{showtag} is set to a true value.

testcode is a reference to a subroutine (often constructed as an anonymous subroutine). The subroutine is called with two parameters: the current node, and a localised hash reference in which to store new transformations for this node and this node only. The return value from this subroutine must be one of:

    1 - process this node and all children
    -1 - process this node but not the children of this node
    0 - do not process this node or its children
    'string' - any string (other than "1", "0" or "-1") is equivalent to 1, except that rather than processing the node's children, it processes the nodes found by executing findnodes('string', $node), where $node is the current node. Obviously 'string' has to be a valid XPath expression.

XPathScript stylesheets can be modularised using SSI #include directives. The code in #included files is added verbatim into the current code at the position of the include. You can use this fact to override defaults (as we saw in the first example, where the template for ulink is overridden).
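The lookup rules and testcode return values above can be sketched in plain Perl. This is an illustrative toy, not AxKit's implementation: the node structure (a hash with name and children), the render function and the way local testcode values are merged over the global entry are all assumptions made for the sketch, and the 'string' return value is left out because it needs a real XPath engine.

```perl
use strict;
use warnings;

# Toy node: { name => 'tag', children => [ nodes or plain strings ] }.
# This is NOT AxKit's data model, just enough structure for the sketch.
my $doc = {
    name     => 'para',
    children => [
        'See ',
        { name => 'ulink',   children => ['AxKit'] },
        { name => 'comment', children => ['hidden'] },
        { name => 'b',       children => ['!'] },
    ],
};

my $t = {
    para  => { pre => '<p>', post => '</p>' },
    ulink => {
        # testcode receives the node and a fresh hash for local overrides.
        testcode => sub {
            my ($node, $local) = @_;
            $local->{pre}  = '<i>';
            $local->{post} = '</i>';
            return 1;    # process this node and its children
        },
    },
    comment => { testcode => sub { return 0 } },    # skip node and children
};

sub render {
    my ($node) = @_;
    return $node unless ref $node;    # plain text child

    my $name = $node->{name};
    my $kids = sub { join '', map { render($_) } @{ $node->{children} } };

    # Look up the element name, falling back to the '*' catch-all.
    my $trans = $t->{$name} || $t->{'*'};

    # No template at all: output the element verbatim and recurse.
    return "<$name>" . $kids->() . "</$name>" unless $trans;

    my %local;
    my $ret = $trans->{testcode} ? $trans->{testcode}->($node, \%local) : 1;
    return '' if $ret eq '0';

    # Assumption: values set by testcode override the global entry.
    my %eff = (%$trans, %local);

    # -1 means process this node but not its children.
    my $inner = $ret eq '-1'
        ? ''
        : ($eff{prechildren} // '') . $kids->() . ($eff{postchildren} // '');
    my ($open, $close) = $eff{showtag} ? ("<$name>", "</$name>") : ('', '');

    return ($eff{pre} // '') . $open . $inner . $close . ($eff{post} // '');
}

my $output = render($doc);
print "$output\n";    # prints <p>See <i>AxKit</i><b>!</b></p>
```

The <b> element has no entry in $t and there is no '*' key, so it passes through verbatim, while <para> and <ulink> lose their own tags because showtag is not set.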
Using XPathScript to Write XSP TagLibs

XSP is an alternative server-side XML programming API. It is not a stylesheet system though - the XSP page is executed directly without a stylesheet. XSP was originally incorporated into the Cocoon application framework, and AxKit included XSP capabilities because it's a very interesting and useful tool.

One of the interesting things about XSP is the ability to write taglibs using some form of stylesheet transformation language. A taglib is a separate sheet of tags that have special meaning to your code. They can execute external functions or simply be used in a similar way to external parsed entities. Here's the classic example of taglib usage from the Cocoon documentation (slightly modified from the original):

To the best of my knowledge, it's now

Here the <example:time-of-day> tag gets converted at run time to the current time, using the strftime format specified in the format attribute.

A taglib implementation is a stylesheet that is evaluated against this file prior to passing it to the XSP processor. The stylesheet converts the tags that it recognises into pure XSP code (see http://xml.apache.org/cocoon/xsp.html for more information on XSP). While this seems a rather redundant feature, it allows even further separation between code and design. Designers can just introduce these special tags, without worrying about the logic behind them.

The Cocoon recommendation is to write taglibs using XSLT. This works well, but the code often looks confusing. My recommendation for AxKit is to use XPathScript. Here's our implementation of the time-of-day tag using XPathScript:

    <%
    $t->{'xsp:page'}{prechildren} = <<EOXML;
    <xsp:structure>
      <xsp:include>POSIX</xsp:include>
    </xsp:structure>
    EOXML

    $t->{'example:time-of-day'}{testcode} = sub {
        my ($node, $t) = @_;
        $t->{pre} = '<xsp:expr>
            POSIX::strftime("' . findvalue('@format', $node) . '", localtime)
            </xsp:expr>';
        return 1;
    };
    %>
    <%= apply_templates() %>

This is a rather trivial example of a taglib, but hopefully it introduces the possibilities of further extending your tag library. To enable this tag library, we simply make the taglib stylesheet the first in our stylesheet cascade:

To the best of my knowledge, it's now

Note that the XSP script is executed using the stylesheet processing instruction, with a stylesheet of ".". This stylesheet could be anything in the case of XSP, since there is actually no stylesheet associated with it, and the "." is merely a convention.

For comparison, here's the equivalent XSLT-based taglib:

POSIX

POSIX::strftime("", localtime)

Some people may find one version easier to work with than the other, although I personally prefer the simplicity of XPathScript.