Class WAH8DocIdSet
- java.lang.Object
-
- org.apache.lucene.search.DocIdSet
-
- org.apache.lucene.util.WAH8DocIdSet
-
public final class WAH8DocIdSet extends DocIdSet
DocIdSet
implementation based on word-aligned hybrid encoding on words of 8 bits.This implementation doesn't support random-access but has a fast
DocIdSetIterator
which can advance in logarithmic time thanks to an index.The compression scheme is simplistic and should work well with sparse and very dense doc id sets while being only slightly larger than a
FixedBitSet
for incompressible sets (overhead<2% in the worst case) in spite of the index.Format: The format is byte-aligned. An 8-bits word is either clean, meaning composed only of zeros or ones, or dirty, meaning that it contains between 1 and 7 bits set. The idea is to encode sequences of clean words using run-length encoding and to leave sequences of dirty words as-is.
Token Clean length+ Dirty length+ Dirty words 1 byte 0-n bytes 0-n bytes 0-n bytes - Token encodes whether clean means full of zeros or ones in the first bit, the number of clean words minus 2 on the next 3 bits and the number of dirty words on the last 4 bits. The higher-order bit is a continuation bit, meaning that the number is incomplete and needs additional bytes to be read.
- Clean length+: If clean length has its higher-order bit set,
you need to read a
vint
, shift it by 3 bits on the left side and add it to the 3 bits which have been read in the token. - Dirty length+ works the same way as Clean length+ but on 4 bits and for the length of dirty words.
- Dirty words are the dirty words, there are Dirty length of them.
This format cannot encode sequences of less than 2 clean words and 0 dirty word. The reason is that if you find a single clean word, you should rather encode it as a dirty word. This takes the same space as starting a new sequence (since you need one byte for the token) but will be lighter to decode. There is however an exception for the first sequence. Since the first sequence may start directly with a dirty word, the clean length is encoded directly, without subtracting 2.
There is an additional restriction on the format: the sequence of dirty words is not allowed to contain two consecutive clean words. This restriction exists to make sure no space is wasted and to make sure iterators can read the next doc ID by reading at most 2 dirty words.
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static class
WAH8DocIdSet.Builder
A builder forWAH8DocIdSet
s.
-
Field Summary
Fields Modifier and Type Field Description static int
DEFAULT_INDEX_INTERVAL
Default index interval.
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method Description int
cardinality()
Return the number of documents in thisDocIdSet
in constant time.static WAH8DocIdSet
intersect(Collection<WAH8DocIdSet> docIdSets)
Same asintersect(Collection, int)
with the default index interval.static WAH8DocIdSet
intersect(Collection<WAH8DocIdSet> docIdSets, int indexInterval)
Compute the intersection of the provided sets.boolean
isCacheable()
This method is a hint forCachingWrapperFilter
, if thisDocIdSet
should be cached without copying it.org.apache.lucene.util.WAH8DocIdSet.Iterator
iterator()
Provides aDocIdSetIterator
to access the set.long
ramBytesUsed()
Return the memory usage of this class in bytes.static WAH8DocIdSet
union(Collection<WAH8DocIdSet> docIdSets)
Same asunion(Collection, int)
with the default index interval.static WAH8DocIdSet
union(Collection<WAH8DocIdSet> docIdSets, int indexInterval)
Compute the union of the provided sets.
-
-
-
Field Detail
-
DEFAULT_INDEX_INTERVAL
public static final int DEFAULT_INDEX_INTERVAL
Default index interval.- See Also:
- Constant Field Values
-
-
Method Detail
-
intersect
public static WAH8DocIdSet intersect(Collection<WAH8DocIdSet> docIdSets)
Same asintersect(Collection, int)
with the default index interval.
-
intersect
public static WAH8DocIdSet intersect(Collection<WAH8DocIdSet> docIdSets, int indexInterval)
Compute the intersection of the provided sets. This method is much faster than computing the intersection manually since it operates directly at the byte level.
-
union
public static WAH8DocIdSet union(Collection<WAH8DocIdSet> docIdSets)
Same asunion(Collection, int)
with the default index interval.
-
union
public static WAH8DocIdSet union(Collection<WAH8DocIdSet> docIdSets, int indexInterval)
Compute the union of the provided sets. This method is much faster than computing the union manually since it operates directly at the byte level.
-
isCacheable
public boolean isCacheable()
Description copied from class:DocIdSet
This method is a hint forCachingWrapperFilter
, if thisDocIdSet
should be cached without copying it. The default is to returnfalse
. If you have an ownDocIdSet
implementation that does its iteration very effective and fast without doing disk I/O, override this method and returntrue
.- Overrides:
isCacheable
in classDocIdSet
-
iterator
public org.apache.lucene.util.WAH8DocIdSet.Iterator iterator()
Description copied from class:DocIdSet
Provides aDocIdSetIterator
to access the set. This implementation can returnnull
if there are no docs that match.
-
cardinality
public int cardinality()
Return the number of documents in thisDocIdSet
in constant time.
-
ramBytesUsed
public long ramBytesUsed()
Return the memory usage of this class in bytes.
-
-