package org.apache.lucene.search; /** * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ import java.io.IOException; import java.util.LinkedList; import org.apache.lucene.analysis.NumericTokenStream; // for javadocs import org.apache.lucene.document.NumericField; // for javadocs import org.apache.lucene.util.NumericUtils; import org.apache.lucene.util.ToStringUtils; import org.apache.lucene.util.StringHelper; import org.apache.lucene.index.IndexReader; import org.apache.lucene.index.Term; import org.apache.lucene.index.TermEnum; /** *
A {@link Query} that matches numeric values within a * specified range. To use this, you must first index the * numeric values using {@link NumericField} (expert: {@link * NumericTokenStream}). If your terms are instead textual, * you should use {@link TermRangeQuery}. {@link * NumericRangeFilter} is the filter equivalent of this * query.
* *You create a new NumericRangeQuery with the static * factory methods, eg: * *
* Query q = NumericRangeQuery.newFloatRange("weight", 0.3f, 0.10f, true, true);
*
*
* matches all documents whose float valued "weight" field
* ranges from 0.3 to 0.10, inclusive.
*
* The performance of NumericRangeQuery is much better * than the corresponding {@link TermRangeQuery} because the * number of terms that must be searched is usually far * fewer, thanks to trie indexing, described below.
* *You can optionally specify a precisionStep
* when creating this query. This is necessary if you've
* changed this configuration from its default (4) during
* indexing. Lower values consume more disk space but speed
* up searching. Suitable values are between 1 and
* 8. A good starting point to test is 4,
* which is the default value for all Numeric*
* classes. See below for
* details.
*
*
This query defaults to {@linkplain * MultiTermQuery#CONSTANT_SCORE_AUTO_REWRITE_DEFAULT} for * 32 bit (int/float) ranges with precisionStep ≤8 and 64 * bit (long/double) ranges with precisionStep ≤6. * Otherwise it uses {@linkplain * MultiTermQuery#CONSTANT_SCORE_FILTER_REWRITE} as the * number of terms is likely to be high. With precision * steps of ≤4, this query can be run with one of the * BooleanQuery rewrite methods without changing * BooleanQuery's default max clause count. * *
NOTE: This API is experimental and
* might change in incompatible ways in the next release.
*
*
See the publication about panFMP,
* where this algorithm was described (referred to as TrieRangeQuery):
*
*
Schindler, U, Diepenbroek, M, 2008. * Generic XML-based Framework for Metadata Portals. * Computers & Geosciences 34 (12), 1947-1955. * doi:10.1016/j.cageo.2008.02.023* *
A quote from this paper: Because Apache Lucene is a full-text * search engine and not a conventional database, it cannot handle numerical ranges * (e.g., field value is inside user defined bounds, even dates are numerical values). * We have developed an extension to Apache Lucene that stores * the numerical values in a special string-encoded format with variable precision * (all numerical values like doubles, longs, floats, and ints are converted to * lexicographic sortable string representations and stored with different precisions * (for a more detailed description of how the values are stored, * see {@link NumericUtils}). A range is then divided recursively into multiple intervals for searching: * The center of the range is searched only with the lowest possible precision in the trie, * while the boundaries are matched more exactly. This reduces the number of terms dramatically.
* *For the variant that stores long values in 8 different precisions (each reduced by 8 bits) that
* uses a lowest precision of 1 byte, the index contains only a maximum of 256 distinct values in the
* lowest precision. Overall, a range could consist of a theoretical maximum of
* 7*255*2 + 255 = 3825 distinct terms (when there is a term for every distinct value of an
* 8-byte-number in the index and the range covers almost all of them; a maximum of 255 distinct values is used
* because it would always be possible to reduce the full 256 values to one term with degraded precision).
* In practice, we have seen up to 300 terms in most cases (index with 500,000 metadata records
* and a uniform value distribution).
You can choose any precisionStep when encoding values.
* Lower step values mean more precisions and so more terms in index (and index gets larger).
* On the other hand, the maximum number of terms to match reduces, which optimized query speed.
* The formula to calculate the maximum term count is:
*
* n = [ (bitsPerValue/precisionStep - 1) * (2^precisionStep - 1 ) * 2 ] + (2^precisionStep - 1 ) **
(this formula is only correct, when bitsPerValue/precisionStep is an integer;
* in other cases, the value must be rounded up and the last summand must contain the modulo of the division as
* precision step).
* For longs stored using a precision step of 4, n = 15*15*2 + 15 = 465, and for a precision
* step of 2, n = 31*3*2 + 3 = 189. But the faster search speed is reduced by more seeking
* in the term enum of the index. Because of this, the ideal precisionStep value can only
* be found out by testing. Important: You can index with a lower precision step value and test search speed
* using a multiple of the original step value.
Good values for precisionStep are depending on usage and data type:
*
precisionStep is given.
* precisionStep). Using {@link NumericField NumericFields} for sorting
* is ideal, because building the field cache is much faster than with text-only numbers.
* These fields have one term per value and therefore also work with term enumeration for building distinct lists
* (e.g. facets / preselected values to search for).
* Sorting is also possible with range query optimized fields using one of the above precisionSteps.
* Comparisons of the different types of RangeQueries on an index with about 500,000 docs showed * that {@link TermRangeQuery} in boolean rewrite mode (with raised {@link BooleanQuery} clause count) * took about 30-40 secs to complete, {@link TermRangeQuery} in constant score filter rewrite mode took 5 secs * and executing this class took <100ms to complete (on an Opteron64 machine, Java 1.5, 8 bit * precision step). This query type was developed for a geographic portal, where the performance for * e.g. bounding boxes or exact date/time stamps is important.
* * @since 2.9 **/ public final class NumericRangeQueryNumericRangeQuery, that queries a long
* range using the given precisionStep.
* You can have half-open ranges (which are in fact </≤ or >/≥ queries)
* by setting the min or max value to null. By setting inclusive to false, it will
* match all documents excluding the bounds, with inclusive on, the boundaries are hits, too.
*/
public static NumericRangeQueryNumericRangeQuery, that queries a long
* range using the default precisionStep {@link NumericUtils#PRECISION_STEP_DEFAULT} (4).
* You can have half-open ranges (which are in fact </≤ or >/≥ queries)
* by setting the min or max value to null. By setting inclusive to false, it will
* match all documents excluding the bounds, with inclusive on, the boundaries are hits, too.
*/
public static NumericRangeQueryNumericRangeQuery, that queries a int
* range using the given precisionStep.
* You can have half-open ranges (which are in fact </≤ or >/≥ queries)
* by setting the min or max value to null. By setting inclusive to false, it will
* match all documents excluding the bounds, with inclusive on, the boundaries are hits, too.
*/
public static NumericRangeQueryNumericRangeQuery, that queries a int
* range using the default precisionStep {@link NumericUtils#PRECISION_STEP_DEFAULT} (4).
* You can have half-open ranges (which are in fact </≤ or >/≥ queries)
* by setting the min or max value to null. By setting inclusive to false, it will
* match all documents excluding the bounds, with inclusive on, the boundaries are hits, too.
*/
public static NumericRangeQueryNumericRangeQuery, that queries a double
* range using the given precisionStep.
* You can have half-open ranges (which are in fact </≤ or >/≥ queries)
* by setting the min or max value to null. By setting inclusive to false, it will
* match all documents excluding the bounds, with inclusive on, the boundaries are hits, too.
*/
public static NumericRangeQueryNumericRangeQuery, that queries a double
* range using the default precisionStep {@link NumericUtils#PRECISION_STEP_DEFAULT} (4).
* You can have half-open ranges (which are in fact </≤ or >/≥ queries)
* by setting the min or max value to null. By setting inclusive to false, it will
* match all documents excluding the bounds, with inclusive on, the boundaries are hits, too.
*/
public static NumericRangeQueryNumericRangeQuery, that queries a float
* range using the given precisionStep.
* You can have half-open ranges (which are in fact </≤ or >/≥ queries)
* by setting the min or max value to null. By setting inclusive to false, it will
* match all documents excluding the bounds, with inclusive on, the boundaries are hits, too.
*/
public static NumericRangeQueryNumericRangeQuery, that queries a float
* range using the default precisionStep {@link NumericUtils#PRECISION_STEP_DEFAULT} (4).
* You can have half-open ranges (which are in fact </≤ or >/≥ queries)
* by setting the min or max value to null. By setting inclusive to false, it will
* match all documents excluding the bounds, with inclusive on, the boundaries are hits, too.
*/
public static NumericRangeQuerytrue if the lower endpoint is inclusive */
public boolean includesMin() { return minInclusive; }
/** Returns true if the upper endpoint is inclusive */
public boolean includesMax() { return maxInclusive; }
/** Returns the lower value of this range query */
public T getMin() { return min; }
/** Returns the upper value of this range query */
public T getMax() { return max; }
@Override
public String toString(final String field) {
final StringBuilder sb = new StringBuilder();
if (!this.field.equals(field)) sb.append(this.field).append(':');
return sb.append(minInclusive ? '[' : '{')
.append((min == null) ? "*" : min.toString())
.append(" TO ")
.append((max == null) ? "*" : max.toString())
.append(maxInclusive ? ']' : '}')
.append(ToStringUtils.boost(getBoost()))
.toString();
}
@Override
public final boolean equals(final Object o) {
if (o==this) return true;
if (!super.equals(o))
return false;
if (o instanceof NumericRangeQuery) {
final NumericRangeQuery q=(NumericRangeQuery)o;
return (
field==q.field &&
(q.min == null ? min == null : q.min.equals(min)) &&
(q.max == null ? max == null : q.max.equals(max)) &&
minInclusive == q.minInclusive &&
maxInclusive == q.maxInclusive &&
precisionStep == q.precisionStep
);
}
return false;
}
@Override
public final int hashCode() {
int hash = super.hashCode();
hash += field.hashCode()^0x4565fd66 + precisionStep^0x64365465;
if (min != null) hash += min.hashCode()^0x14fa55fb;
if (max != null) hash += max.hashCode()^0x733fa5fe;
return hash +
(Boolean.valueOf(minInclusive).hashCode()^0x14fa55fb)+
(Boolean.valueOf(maxInclusive).hashCode()^0x733fa5fe);
}
// field must be interned after reading from stream
private void readObject(java.io.ObjectInputStream in) throws java.io.IOException, ClassNotFoundException {
in.defaultReadObject();
field = StringHelper.intern(field);
}
// members (package private, to be also fast accessible by NumericRangeTermEnum)
String field;
final int precisionStep, valSize;
final T min, max;
final boolean minInclusive,maxInclusive;
/**
* Subclass of FilteredTermEnum for enumerating all terms that match the
* sub-ranges for trie range queries.
*
* WARNING: This term enumeration is not guaranteed to be always ordered by
* {@link Term#compareTo}.
* The ordering depends on how {@link NumericUtils#splitLongRange} and
* {@link NumericUtils#splitIntRange} generates the sub-ranges. For
* {@link MultiTermQuery} ordering is not relevant.
*/
private final class NumericRangeTermEnum extends FilteredTermEnum {
private final IndexReader reader;
private final LinkedListfalse ends iterating the current enum
* and forwards to the next sub-range.
*/
@Override
protected boolean termCompare(Term term) {
return (term.field() == field && term.text().compareTo(currentUpperBound) <= 0);
}
/** Increments the enumeration to the next element. True if one exists. */
@Override
public boolean next() throws IOException {
// if a current term exists, the actual enum is initialized:
// try change to next term, if no such term exists, fall-through
if (currentTerm != null) {
assert actualEnum != null;
if (actualEnum.next()) {
currentTerm = actualEnum.term();
if (termCompare(currentTerm))
return true;
}
}
// if all above fails, we go forward to the next enum,
// if one is available
currentTerm = null;
while (rangeBounds.size() >= 2) {
assert rangeBounds.size() % 2 == 0;
// close the current enum and read next bounds
if (actualEnum != null) {
actualEnum.close();
actualEnum = null;
}
final String lowerBound = rangeBounds.removeFirst();
this.currentUpperBound = rangeBounds.removeFirst();
// create a new enum
actualEnum = reader.terms(termTemplate.createTerm(lowerBound));
currentTerm = actualEnum.term();
if (currentTerm != null && termCompare(currentTerm))
return true;
// clear the current term for next iteration
currentTerm = null;
}
// no more sub-range enums available
assert rangeBounds.size() == 0 && currentTerm == null;
return false;
}
/** Closes the enumeration to further activity, freeing resources. */
@Override
public void close() throws IOException {
rangeBounds.clear();
currentUpperBound = null;
super.close();
}
}
}