 --------------------------------------------
 IndexReaders in Jackrabbit - The Big Picture
 --------------------------------------------

~~ Licensed to the Apache Software Foundation (ASF) under one or more
~~ contributor license agreements.  See the NOTICE file distributed with
~~ this work for additional information regarding copyright ownership.
~~ The ASF licenses this file to You under the Apache License, Version 2.0
~~ (the "License"); you may not use this file except in compliance with
~~ the License.  You may obtain a copy of the License at
~~
~~      http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.

IndexReaders in Jackrabbit - The Big Picture

  Jackrabbit uses Lucene as the underlying index implementation and provides
  several extensions and customizations that improve performance in an
  environment where changes to the index are frequent. The extensions also
  cover features that Lucene does not support, such as hierarchical queries.

[index-readers-per-segment.jpg] The readers in an index segment.

CachingIndexReader

  The <<>> is at the very bottom of the index reader stack in Jackrabbit.
  Its main purpose is to cache the parent relationship of a node. Each node
  is represented by a document in the index, and one of the fields is
  <<<_:PARENT>>>. The value of this field is the string representation of
  the parent node's UUID. For the root node, the parent field contains an
  empty string as its value.

  Several queries in Jackrabbit are hierarchical and check whether a node is
  a descendant of another node.
  For the very simple case, where one needs to know whether a node is the
  child of another node, we can just look up both nodes (Lucene documents)
  in the index and compare the parent field of one node with the
  <<<_:UUID>>> field of the other. If they match, the first node is a child
  of the second. Evaluating a descendant axis is much more expensive and
  causes many document lookups in Lucene. By caching the parent-child
  relationship of documents, hierarchical operations can be executed much
  faster.

  The cache consists of an array of <<>> instances. The length of this array
  corresponds to the number of documents accessible through the index
  reader. That is, every document in the index has a corresponding cache
  entry in the array. Initially the cache is empty; it is filled as it is
  accessed.

  There are two kinds of <<>>s: <<>> and <<>>. When the parent of a node
  resides in the same index segment, a <<>> is created, which simply
  contains the document number of the parent. If the parent resides in a
  different index segment, a <<>> is created, which contains the UUID of the
  parent node. When a <<>> is resolved, it is passed an index reader, which
  allows it to get the document number for the UUID and cache it for later
  reuse.

Overwriting DocId

  It may happen that a <<>> is present in the cache of a <<>> but must be
  considered invalid in the context of a call. <<>> may be called from a
  <<>> instance which has the target of the <<>> in its set of deleted
  documents. This indicates that the node has been deleted or modified and
  has therefore moved to another index segment. In this case the <<>> is
  overwritten with a <<>>.

  The opposite never happens. A <<>> is never overwritten with a <<>>,
  because a new <<>> is created when a document is added to an index.

SharedIndexReader

  The <<>> wraps a <<>> and adds a reference-counting facility. A <<>> is
  kept open for the entire lifetime of a <<>>.
  Even if documents are marked deleted in the underlying index (by another
  thread through <<>>), the <<>> is still kept open and treats the documents
  as valid.

  Reference counting is needed because a client of the <<>> may still be in
  use while the underlying <<>> is closed. This may happen when the index
  merger replaces indexes while a query still operates on the indexes to be
  deleted. With reference counts, closing the <<>> is delayed until all
  clients are finished with the <<>>.

ReadOnlyIndexReader

  The inconsistency introduced by the <<>> (it considers deleted documents
  as still valid) is corrected by the <<>>. Whenever a new instance of this
  reader is created, it copies the set of documents currently marked as
  deleted from the <<>>. At the same time, all methods that attempt to
  delete documents throw a <<>>.

CommittableIndexReader

  This is the index reader where documents are marked deleted in a <<>>. As
  with the <<>>, the <<>> is kept open for the entire lifetime of the <<>>.
  To achieve this, the <<>> exposes a method <<>>, which forces the
  underlying native Lucene index reader to commit changes. Committing
  changes without closing the index reader is otherwise not possible with
  the plain Lucene index reader.

Combining the index segments

[index-readers-per-query-handler.jpg] The readers in a query handler.

CachingMultiIndexReader

  The index for the content of a workspace consists of multiple segments,
  that is, multiple <<>>s. They are combined in a <<>> using a <<>>. To
  speed up lookups by UUID, the <<>> also has a <<>>. This cache uses an LRU
  algorithm to keep a limited number of UUID to document number mappings.

CombinedIndexReader

  This index reader is similar to the <<>>; in fact, both implement <<>> and
  <<>>. A <<>> is created when a query needs an index reader that spans both
  the workspace index and the <<>> index, where the version store resides.
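
  The bounded UUID lookup cache described under CachingMultiIndexReader can
  be sketched as follows. This is a minimal illustration built on
  <<<java.util.LinkedHashMap>>> in access order, not Jackrabbit's actual
  implementation; the class name <<<UuidDocNumberCache>>>, its methods, and
  the capacity used in <<<main>>> are hypothetical and chosen only for this
  example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical LRU cache that maps node UUID strings to document numbers.
 * Illustration only; names and structure are assumptions, not Jackrabbit code.
 */
final class UuidDocNumberCache {

    private final Map<String, Integer> entries;

    UuidDocNumberCache(final int maxEntries) {
        // accessOrder = true makes the map iterate from least recently
        // accessed to most recently accessed entry.
        this.entries = new LinkedHashMap<String, Integer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
                // Called after each insertion: drop the least recently
                // used mapping once the limit is exceeded.
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached document number, or null if the UUID is not cached. */
    synchronized Integer get(String uuid) {
        return entries.get(uuid);
    }

    /** Remembers the document number that was looked up for the UUID. */
    synchronized void put(String uuid, int docNumber) {
        entries.put(uuid, docNumber);
    }

    synchronized int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        UuidDocNumberCache cache = new UuidDocNumberCache(2);
        cache.put("uuid-1", 0);
        cache.put("uuid-2", 1);
        cache.get("uuid-1");      // touch uuid-1, so uuid-2 becomes the eldest
        cache.put("uuid-3", 2);   // exceeds the limit and evicts uuid-2
        System.out.println("uuid-1 -> " + cache.get("uuid-1"));
        System.out.println("uuid-2 -> " + cache.get("uuid-2"));
        System.out.println("uuid-3 -> " + cache.get("uuid-3"));
    }
}
```

  A cache miss (a <<<null>>> result) would be followed by a regular index
  lookup and a call to <<<put>>>. Bounding the number of entries keeps memory
  use predictable, at the cost of repeating the lookup for UUIDs that have
  not been requested recently.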