Apache Jackrabbit : EncodingAndEscaping

Encoding and Escaping

This page covers the escaping and encoding of paths, names, and values in JCR-based web applications.

Why?

JCR node names may use a very broad character set: almost all of Unicode, minus a few special characters such as /, [, ], |, : and * (which are used in JCR to build paths, address same-name siblings, and so on). In addition, a name cannot be "." or "..".

For XPath queries, the underlying model treats the JCR repository as an XML document, so every path step in the XPath is seen as an XML name. XML names are more restrictive than JCR node names; most importantly, they cannot start with a digit. Such names can, however, be escaped following the ISO9075 rules.

Furthermore, XPath queries support full text search through "jcr:contains()", which has its own query string format; in Jackrabbit this is the Lucene query syntax.

Finally, JCR is often used in web applications that map URLs to JCR paths. Note that JCR node names allow more characters than URLs do, most notably the space.

There are utility methods for escaping/encoding in the org.apache.jackrabbit.util.ISO9075 and org.apache.jackrabbit.util.Text classes. Although developed under Jackrabbit, they are part of the JCR Commons module (jackrabbit-jcr-commons) which only depends on the JCR API.

Escaping paths

If you're building a path from user-supplied names, you need to escape illegal JCR characters (e.g. "item:1" becomes "item%3A1"):

String path = "/foo/" + Text.escapeIllegalJcrChars(name);

Such paths are useful for JCR methods like Node.addNode(...) and Session.getItem(...), but escaping is usually only needed when you create a node in the first place. Once the node exists, its name is already in the right form and can be passed around as-is; do not escape it again when accessing the node.
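To make the "item:1" to "item%3A1" mapping concrete, here is a minimal sketch of percent-style escaping written against plain Java. It is an illustration only, not the implementation of Text.escapeIllegalJcrChars: the class name JcrNameEscapeSketch and the exact set of escaped characters are assumptions, and special cases such as the names "." and ".." are not handled.

```java
// Hypothetical illustration; in real code prefer Text.escapeIllegalJcrChars(...).
final class JcrNameEscapeSketch {

    // Assumed set of characters to escape. '%' is included so that the
    // escaped form can be decoded again without ambiguity.
    private static final String ILLEGAL = "%/:[]*|";

    // Replaces each character from ILLEGAL with '%' plus two uppercase hex digits.
    static String escape(String name) {
        StringBuilder out = new StringBuilder(name.length() * 2);
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (ILLEGAL.indexOf(c) >= 0) {
                out.append('%').append(String.format("%02X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Reverses escape(): turns every "%HH" sequence back into its character.
    static String unescape(String escaped) {
        StringBuilder out = new StringBuilder(escaped.length());
        for (int i = 0; i < escaped.length(); i++) {
            char c = escaped.charAt(i);
            if (c == '%' && i + 2 < escaped.length()
                    && Character.digit(escaped.charAt(i + 1), 16) >= 0
                    && Character.digit(escaped.charAt(i + 2), 16) >= 0) {
                out.append((char) Integer.parseInt(escaped.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits just consumed
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this sketch, JcrNameEscapeSketch.escape("item:1") yields "item%3A1", and unescape(...) restores the original name.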

Encoding paths in queries

If you want to use paths in XPath queries, though, you need to escape them according to ISO9075 rules (e.g. "1hr0" becomes "_x0031_hr0"):

String query = "/jcr:root" + ISO9075.encodePath(node.getPath()) + "/" + ISO9075.encode(name);

For a user-supplied string, this could lead to something like ISO9075.encode(Text.escapeIllegalJcrChars(name)). In most cases, however, the path given to a query comes from a known node, so there is no need to escape it with Text.escapeIllegalJcrChars(name); the ISO9075 encoding alone is sufficient.
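The "_xHHHH_" form used above can be sketched in a few lines of plain Java. This is a simplified illustration for a single local name, not the code of org.apache.jackrabbit.util.ISO9075: the class name Iso9075Sketch is hypothetical, the XML name rules are approximated with Character.isLetter and Character.isLetterOrDigit, and namespace prefixes and path indexes are not handled.

```java
// Hypothetical illustration; in real code prefer ISO9075.encode(...) and ISO9075.encodePath(...).
final class Iso9075Sketch {

    // Encodes one local name: characters that may not appear at their position
    // in an XML name become "_xHHHH_" (four hex digits of the char code).
    static String encode(String name) {
        StringBuilder out = new StringBuilder(name.length() * 2);
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean allowed = (i == 0) ? isNameStart(c) : isNamePart(c);
            // An underscore that starts something looking like an escape sequence
            // must itself be escaped, otherwise decoding would be ambiguous.
            if (allowed && !(c == '_' && looksLikeEscape(name, i))) {
                out.append(c);
            } else {
                out.append(String.format("_x%04x_", (int) c));
            }
        }
        return out.toString();
    }

    // Approximation of the XML NameStartChar rule (assumption of this sketch).
    private static boolean isNameStart(char c) {
        return Character.isLetter(c) || c == '_';
    }

    // Approximation of the XML NameChar rule (assumption of this sketch).
    private static boolean isNamePart(char c) {
        return Character.isLetterOrDigit(c) || c == '_' || c == '-' || c == '.';
    }

    // True if name contains "_xHHHH_" starting at index i.
    private static boolean looksLikeEscape(String name, int i) {
        if (i + 7 > name.length() || name.charAt(i + 1) != 'x' || name.charAt(i + 6) != '_') {
            return false;
        }
        for (int j = i + 2; j < i + 6; j++) {
            if (Character.digit(name.charAt(j), 16) < 0) {
                return false;
            }
        }
        return true;
    }
}
```

For example, Iso9075Sketch.encode("1hr0") returns "_x0031_hr0", while a name such as "foo" is returned unchanged.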

Escaping values in queries

Values inserted into queries should be escaped to prevent incorrect values and query injection. Generally, if you enclose values in single quotes, you just need to replace any literal single quote character with '' (two consecutive single quote characters).
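The quote-doubling rule is small enough to wrap in a helper. The following is a minimal sketch; the class and method names are hypothetical, and the query shape is borrowed from the other examples on this page purely for illustration.

```java
// Hypothetical helper for embedding a string value in an XPath query.
final class XPathLiteralSketch {

    // Wraps the value in single quotes and doubles any embedded single quote.
    static String quote(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    // Example use: builds a query that filters on an assumed @itemID property.
    static String itemQuery(String itemId) {
        return "/jcr:root/foo/element(*, foo)[@itemID = " + quote(itemId) + "]";
    }
}
```

A value such as O'Reilly then becomes the literal 'O''Reilly', so an embedded quote can no longer terminate the string early.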

Escaping text in fulltext (contains) clauses

Jackrabbit Oak uses the Apache Lucene grammar for fulltext search. To use user-supplied text in contains, you therefore need to either filter out all the special characters or escape them. For example, to filter out the special characters, use:

String filteredContains = searchTerm.replaceAll("[\\Q+-&|!(){}[]^\"~*?:\\/\\E]", "");
String q =
  "/jcr:root/foo/element(*, foo)" +
  "[jcr:contains(@title, '" + filteredContains.replaceAll("'", "''") + "')]" +
  "[@itemID = '" + itemID.replaceAll("'", "''") + "']";

Only for Jackrabbit 2.x: use Text.escapeIllegalXpathSearchChars(...) for calls to jcr:contains(...) (see also JCR-1248):

String q =
  "/jcr:root/foo/element(*, foo)" +
  "[jcr:contains(@title, '" + Text.escapeIllegalXpathSearchChars(searchTerm).replaceAll("'", "''") + "')]" +
  "[@itemID = '" + itemID.replaceAll("'", "''") + "']";

Note that other special characters (like "@" or ".") are usually ignored by the Lucene parser; however, if a "*" (wildcard) is used, then they are not ignored. So:

jcr:contains(., 'hello.world') => works (all documents that contain both the exact words "hello" and "world")
jcr:contains(., 'hello world') => works (all documents that contain both the exact words "hello" and "world")

jcr:contains(., '*hello world*') => works (all documents that contain a word ending with "hello" and a word starting with "world")
jcr:contains(., '*hello.world*') => does not work (no results)
jcr:contains(., 'hello.world*') => does not work (no results)
jcr:contains(., '*hello.world') => does not work (no results)

If the search text only contains special characters, then all indexed nodes are returned:

/jcr:root//*[jcr:contains(., '.')] => *:*
/jcr:root//*[jcr:contains(., '..')] => *:*
/jcr:root//*[jcr:contains(., '/')] => *:*
/jcr:root//*[jcr:contains(., '//')] => *:*
/jcr:root//*[jcr:contains(., '*')] => :fulltext:*
/jcr:root//*[jcr:contains(., '**')] => :fulltext:**
/jcr:root//*[jcr:contains(., '○')] => :fulltext:â (○ is WHITE CIRCLE U+25CB)
/jcr:root//*[jcr:contains(., '○○')] => +:fulltext:â +:fulltext:â
/jcr:root//*[jcr:contains(., '☯︎')] => :fulltext:â (☯ is YIN YANG U+262F U+FE0E)
/jcr:root//*[jcr:contains(., '¥︎')] => :fulltext:â (¥ is YEN SIGN U+00A5)

On the other hand, if an empty string or just spaces are used, the query fails with "Invalid expression":

/jcr:root//*[jcr:contains(., '')]
/jcr:root//*[jcr:contains(., ' ')]
/jcr:root//*[jcr:contains(., '  ')]
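Because filtering can leave an empty or blank term, and such a term makes the query fail as shown above, it may be worth checking the filtered text before building the query. The following is a minimal sketch that reuses the filter expression from the earlier example; the class and method names are hypothetical.

```java
import java.util.Optional;

// Hypothetical helper that prepares user input for jcr:contains(...).
final class ContainsTermSketch {

    // Same character class as in the filtering example on this page.
    private static final String LUCENE_SPECIALS = "[\\Q+-&|!(){}[]^\"~*?:\\/\\E]";

    // Strips the special characters and surrounding whitespace. Returns an
    // empty Optional when nothing searchable is left, so the caller can skip
    // the jcr:contains clause instead of sending an invalid expression.
    static Optional<String> filter(String searchTerm) {
        String filtered = searchTerm.replaceAll(LUCENE_SPECIALS, "").trim();
        return filtered.isEmpty() ? Optional.empty() : Optional.of(filtered);
    }
}
```

Note that characters outside the filter set, such as ".", still pass through, so a term consisting only of those characters is not caught by this check.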

Escaping/encoding in URIs

There are further encoding/decoding methods in the Text class for dealing with URIs in a webapp. The characters allowed in JCR names include the URI set plus a few others (e.g. spaces), so the URI set is the more constrained of the two. Therefore, if you have a valid URI, you can map it directly onto a JCR path without having to worry about escaping (this is by design). If you go the other way, i.e. you have a JCR path and want to create a URI for it, you simply apply plain URI escaping. To keep things simple in the context of URIs, one suggestion is to only create JCR nodes with names that are valid URIs.
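As an illustration of the JCR path to URI direction, the sketch below relies on java.net.URI from the JDK rather than on the Text class; the multi-argument URI constructor quotes characters that are not legal in a URI path. The class and method names are hypothetical, and the sketch assumes an absolute JCR path.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper that turns an absolute JCR path into a URI path.
final class JcrPathToUriSketch {

    // Percent-encodes characters that are not allowed in a URI path
    // (for example the space) and leaves '/' separators untouched.
    static String toUriPath(String jcrPath) {
        try {
            return new URI(null, null, jcrPath, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not mappable to a URI path: " + jcrPath, e);
        }
    }
}
```

For example, toUriPath("/content/my page") returns "/content/my%20page", while a path that is already a valid URI path, such as "/content/page", is returned unchanged.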

See also