Apache Jackrabbit : SimilaritySearch

Starting with version 1.4, Jackrabbit allows you to search for nodes that are similar to an existing node.

Similarity is determined by looking up terms that are common to nodes. To limit the number of possibly relevant terms, a term is considered only if it meets all of the following conditions:

  • Only terms with at least 4 characters are considered.
  • Only terms that occur at least 2 times in the source node are considered.
  • Only terms that occur in at least 5 nodes are considered.
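
The three conditions above can be read as a simple filter over term statistics. The following is only an illustrative sketch, not Jackrabbit's actual implementation: the class name SimilarTermFilter, the method candidateTerms, and the two input maps are hypothetical, and it assumes the per-node term counts and per-index node counts are already available.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, for illustration only; not part of the Jackrabbit API.
final class SimilarTermFilter {
    // Thresholds taken from the three conditions listed above.
    static final int MIN_TERM_LENGTH = 4;
    static final int MIN_TERM_FREQ = 2;
    static final int MIN_NODE_FREQ = 5;

    /**
     * Returns the terms that pass all three conditions, in sorted order.
     *
     * @param termFreq number of occurrences of each term in the source node
     * @param nodeFreq number of nodes in which each term occurs
     */
    static List<String> candidateTerms(Map<String, Integer> termFreq,
                                       Map<String, Integer> nodeFreq) {
        List<String> result = new ArrayList<>();
        // Copy into a TreeMap so the iteration order is deterministic.
        for (Map.Entry<String, Integer> e : new TreeMap<>(termFreq).entrySet()) {
            String term = e.getKey();
            boolean longEnough = term.length() >= MIN_TERM_LENGTH;
            boolean frequentInSource = e.getValue() >= MIN_TERM_FREQ;
            boolean frequentInNodes = nodeFreq.getOrDefault(term, 0) >= MIN_NODE_FREQ;
            if (longEnough && frequentInSource && frequentInNodes) {
                result.add(term);
            }
        }
        return result;
    }
}
```

With such a filter, a short word like "the" is dropped because of its length, no matter how often it occurs, and a rare term is dropped even if it is long.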

Note: The similarity functionality requires supportHighlighting to be enabled. Make sure that the following parameter is set for the query handler in your workspace.xml:

<param name="supportHighlighting" value="true"/>

The functions are called rep:similar() (in XPath) and similar() (in SQL) and take two arguments:

  • relativePath: a relative path to a descendant node or . for the current node.
  • absoluteStringPath: a string literal that contains the path to the node for which to find similar nodes.

Examples:

//element(*, nt:resource)[rep:similar(., '/my:content/readme.txt/jcr:content')]

Finds nt:resource nodes that are similar to /my:content/readme.txt/jcr:content.

select * from nt:file where similar(jcr:content, '/my:content/readme.txt/jcr:content')

Finds files that contain content similar to /my:content/readme.txt.

Note: SQL only supports one path step for the relativePath parameter!
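
When the path of the reference node is only known at runtime, the query statements shown above can be assembled from their parts. The following is a minimal sketch under stated assumptions: the class SimilarityQueries and its methods are hypothetical helpers, not part of Jackrabbit, and the sketch assumes that a single quote inside a string literal is escaped by doubling it. The check in sql() mirrors the one-path-step restriction noted above.

```java
// Hypothetical helper, for illustration only; not part of the Jackrabbit API.
final class SimilarityQueries {

    // Assumption: a single quote inside a string literal is escaped by doubling it.
    static String quote(String path) {
        return "'" + path.replace("'", "''") + "'";
    }

    // Builds an XPath statement of the form used in the first example above.
    static String xpath(String nodeType, String relativePath, String absolutePath) {
        return "//element(*, " + nodeType + ")[rep:similar("
                + relativePath + ", " + quote(absolutePath) + ")]";
    }

    // Builds an SQL statement of the form used in the second example above.
    static String sql(String nodeType, String relativePath, String absolutePath) {
        // SQL supports only one path step for the relativePath argument.
        if (relativePath.contains("/")) {
            throw new IllegalArgumentException(
                    "SQL supports only one path step: " + relativePath);
        }
        return "select * from " + nodeType + " where similar("
                + relativePath + ", " + quote(absolutePath) + ")";
    }
}
```

The resulting string could then be passed to QueryManager.createQuery(statement, Query.XPATH) or QueryManager.createQuery(statement, Query.SQL) from the javax.jcr.query package and executed like any other query.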