Apache Jackrabbit : Journal based Async Indexer

JIRA Issue OAK-6513
Oakathon August 2017 Presentation Journal based Async Indexer.pdf

The current async indexer design is based on NodeState diff. This has served us well so far, but lately it has not performed well when the rate of repository writes is high. When changes arrive faster than an index update can process them, the diffs grow larger and larger. Larger diffs make index updates slower, which in turn makes the next diff larger than the one before (assuming a constant ingestion rate).
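The feedback loop described above can be illustrated with a toy model. The sketch below is not taken from Oak; it assumes, purely for illustration, that a cycle costs a fixed overhead plus a constant amount of time per changed item, and that writes arrive at a constant rate. The class name IndexLagModel and all parameter names are hypothetical.

```java
// Toy model of the diff-based feedback loop (illustrative assumption: linear cost).
final class IndexLagModel {

    /**
     * Returns the duration (in seconds) of each of the first `cycles` index cycles.
     *
     * fixedCost     - assumed per-cycle overhead in seconds
     * costPerChange - assumed seconds needed to diff and index one change
     * writeRate     - assumed constant ingestion rate in changes per second
     * firstBacklog  - number of changes waiting when the first cycle starts
     */
    static double[] cycleDurations(double fixedCost, double costPerChange,
                                   double writeRate, double firstBacklog, int cycles) {
        double[] durations = new double[cycles];
        double backlog = firstBacklog;
        for (int i = 0; i < cycles; i++) {
            // Time to process everything written since the previous cycle started.
            durations[i] = fixedCost + costPerChange * backlog;
            // While this cycle runs, new changes pile up for the next one.
            backlog = writeRate * durations[i];
        }
        return durations;
    }
}
```

Under these assumptions the cycle time settles at a stable value as long as costPerChange * writeRate stays below 1, and grows without bound once that product reaches 1 or more, which corresponds to the situation where the indexer never catches up.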

In the current diff-based flow the indexer performs a complete diff of all changes that happened between two cycles. It may happen that many writes occur but little indexable content is written, in which case the diff is wasted effort.

In the 1.6 release we implemented journal-based indexing of external changes for NRT indexing (OAK-4808, OAK-5430). That approach can be generalized and used for async indexing.

Drawbacks

The diff-based design has the following drawbacks:

A default setup has 2 indexing lanes (async, fulltext-async), so all recently written nodes are read twice. This puts read pressure on storage (especially for DocumentNodeStore).
The diff-based approach suffers from the same problem as a full observation queue, i.e. once it starts lagging behind, the next cycle takes more time and the system may not recover.

Before discussing the journal-based approach, let's see how IndexEditor works currently.

IndexEditor

Currently any IndexEditor performs 2 tasks:

Identify which nodes are to be indexed, based on an index definition. The editor is invoked as part of the content diff, where it determines which NodeState is to be indexed.
Update the index for each node to be indexed.

For example, in oak-lucene we have LuceneIndexEditor, which identifies the NodeStates to be indexed, and LuceneDocumentMaker, which constructs the Lucene Document from the NodeState to be indexed. For the journal-based approach we can decouple these 2 parts and thus have:

IndexEditor - Identifies which paths need to be indexed for a given index definition
IndexUpdater - Updates the index based on a given NodeState and its path
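The split into two roles could look roughly like the following sketch. It is not Oak's API: the interface and class names (IndexablePathSelector, PathIndexUpdater, SketchNode and so on) are hypothetical, a node is reduced to a flat property map instead of a real NodeState, and the "index definition" is reduced to a single property name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a node's state: a flat property map (not Oak's NodeState).
final class SketchNode {
    final Map<String, String> properties;

    SketchNode(Map<String, String> properties) {
        this.properties = properties;
    }
}

// Role 1 (the "IndexEditor" part): decide whether a changed node is relevant to an index.
interface IndexablePathSelector {
    boolean isIndexable(String path, SketchNode node);
}

// Role 2 (the "IndexUpdater" part): update the index for a given path and node state.
interface PathIndexUpdater {
    void update(String path, SketchNode node);
}

// Illustrative selector: the index definition is just "nodes that carry this property".
final class PropertySelector implements IndexablePathSelector {
    private final String propertyName;

    PropertySelector(String propertyName) {
        this.propertyName = propertyName;
    }

    @Override
    public boolean isIndexable(String path, SketchNode node) {
        return node.properties.containsKey(propertyName);
    }
}

// Illustrative updater: the "index" is an in-memory map from path to the indexed value.
final class InMemoryUpdater implements PathIndexUpdater {
    final Map<String, String> index = new LinkedHashMap<>();
    private final String propertyName;

    InMemoryUpdater(String propertyName) {
        this.propertyName = propertyName;
    }

    @Override
    public void update(String path, SketchNode node) {
        String value = node.properties.get(propertyName);
        if (value == null) {
            index.remove(path); // property no longer present: drop the entry
        } else {
            index.put(path, value);
        }
    }
}
```

The point of the separation is that the selector needs no access to the index at all, so it can run cheaply at commit time, while the updater never has to walk a diff.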

Proposal

Session Commit Flow
1. Each index type would provide an IndexEditor that is invoked as part of the commit (like sync indexes). These IndexEditors would only determine which paths need to be indexed.
2. As part of the commit, the paths to be indexed would be written to the journal.
AsyncIndexUpdate flow
1. AsyncIndexUpdate would query this journal to fetch all paths to be indexed that were recorded between the 2 checkpoints
2. Based on the index path data, it would invoke the IndexUpdater to update the index for that path
3. Merge the index updates
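The two flows above can be sketched end to end with an in-memory journal. Everything in this example is a simplifying assumption rather than Oak code: PathJournal, SketchRepository and SketchAsyncIndexUpdate are hypothetical names, checkpoints are modelled as plain revision numbers, node content is a single string per path, and the indexer reads the current content instead of the content as of the checkpoint.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Predicate;

// Hypothetical in-memory journal: revision -> paths marked as indexable by that commit.
final class PathJournal {
    private final TreeMap<Long, List<String>> entries = new TreeMap<>();

    void append(long revision, List<String> paths) {
        if (!paths.isEmpty()) {
            entries.put(revision, new ArrayList<>(paths));
        }
    }

    // Distinct paths recorded in the revision range (fromExclusive, toInclusive].
    Set<String> pathsBetween(long fromExclusive, long toInclusive) {
        Set<String> result = new LinkedHashSet<>();
        for (List<String> paths
                : entries.subMap(fromExclusive, false, toInclusive, true).values()) {
            result.addAll(paths);
        }
        return result;
    }
}

// Hypothetical repository stand-in: current content, a revision counter and the journal.
final class SketchRepository {
    final Map<String, String> content = new TreeMap<>(); // path -> text value
    final PathJournal journal = new PathJournal();
    private final Predicate<String> indexablePath;        // commit-time "IndexEditor" role
    private long headRevision = 0;

    SketchRepository(Predicate<String> indexablePath) {
        this.indexablePath = indexablePath;
    }

    // Session commit flow: apply the changes, then journal only the indexable paths.
    long commit(Map<String, String> changes) {
        long revision = ++headRevision;
        List<String> toIndex = new ArrayList<>();
        for (Map.Entry<String, String> change : changes.entrySet()) {
            content.put(change.getKey(), change.getValue());
            if (indexablePath.test(change.getKey())) {
                toIndex.add(change.getKey());
            }
        }
        journal.append(revision, toIndex);
        return revision;
    }

    long head() {
        return headRevision;
    }
}

// AsyncIndexUpdate flow: read the journal between two checkpoints, update, merge.
final class SketchAsyncIndexUpdate {
    final Map<String, String> index = new TreeMap<>();
    private long lastIndexed = 0; // previous checkpoint

    // Runs one cycle and returns the number of distinct paths processed.
    int runCycle(SketchRepository repo) {
        long checkpoint = repo.head();
        Set<String> paths = repo.journal.pathsBetween(lastIndexed, checkpoint);
        Map<String, String> pending = new TreeMap<>();
        for (String path : paths) {
            pending.put(path, repo.content.get(path)); // "IndexUpdater" role
        }
        index.putAll(pending);    // merge the index updates
        lastIndexed = checkpoint; // advance only after the merge
        return paths.size();
    }
}
```

Note that a path written in several commits between two checkpoints is processed only once per cycle, and that commits touching no indexable content add nothing to the journal, so the indexer does no work for them.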

Benefits

Such a design would have the following impact:

More work is done as part of the write
Marking of indexable content is distributed, hence less work remains at indexing time
Indexing can progress in batches - as the indexer iterates over the journal it can commit changes in batches
The indexers can be called in parallel
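The last two points, batched commits and parallel indexers, can be sketched as follows. This is an illustration under assumptions, not a proposal for the actual implementation: BatchedIndexing and its methods are hypothetical names, each lane's index is modelled as a plain list of paths, and "committing a batch" is reduced to appending the batch to that list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class BatchedIndexing {

    // Splits the journalled paths into consecutive batches of at most batchSize entries.
    static List<List<String>> batches(List<String> paths, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> result = new ArrayList<>();
        for (int start = 0; start < paths.size(); start += batchSize) {
            int end = Math.min(start + batchSize, paths.size());
            result.add(new ArrayList<>(paths.subList(start, end)));
        }
        return result;
    }

    // Indexes one lane batch by batch and returns the number of batch commits made.
    static int indexLane(List<String> paths, int batchSize, List<String> laneIndex) {
        int commits = 0;
        for (List<String> batch : batches(paths, batchSize)) {
            laneIndex.addAll(batch); // stand-in for "update the index and commit this batch"
            commits++;
        }
        return commits;
    }

    // Runs every lane on its own thread; each lane owns its path list and its index.
    static List<Integer> indexLanesInParallel(List<List<String>> lanePaths,
                                              List<List<String>> laneIndexes,
                                              int batchSize) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, lanePaths.size()));
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (int i = 0; i < lanePaths.size(); i++) {
                final List<String> paths = lanePaths.get(i);
                final List<String> target = laneIndexes.get(i);
                tasks.add(() -> indexLane(paths, batchSize, target));
            }
            List<Integer> commits = new ArrayList<>();
            for (Future<Integer> future : pool.invokeAll(tasks)) {
                commits.add(future.get());
            }
            return commits;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("lane failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because each lane works from its own list of journalled paths rather than from a shared diff, the lanes do not need to coordinate with each other, and a failure partway through only costs the batches that were not yet committed.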

Journal Implementation

DocumentNodeStore currently has a built-in journal, which is used for NRT indexing. That feature can be exposed as an API.

This design is mostly required for scaling indexing in the cluster case. So we could keep both indexing implementations and use the journal-based one for DocumentNodeStore setups. Alternatively, we could look into implementing such a journal for SegmentNodeStore setups as well.

Open Points

Journal support in SegmentNodeStore
Handling deletes
Counter index support - Possibly we can get all changed paths from the journal in a separate iterator and use that to update the counter index, or counter index updates could be pushed to the journal as part of the commit
Reindexing - Here we can possibly continue to use the diff-based editor, which allows the use of includes and excludes to reduce the traversal. Or we can just traverse the whole NodeStore and call the new path-based indexers