package org.apache.lucene.search; /** * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ import java.io.IOException; import java.util.LinkedList; import org.apache.lucene.analysis.NumericTokenStream; // for javadocs import org.apache.lucene.document.NumericField; // for javadocs import org.apache.lucene.util.NumericUtils; import org.apache.lucene.util.ToStringUtils; import org.apache.lucene.util.StringHelper; import org.apache.lucene.index.IndexReader; import org.apache.lucene.index.Term; import org.apache.lucene.index.TermEnum; /** *

A {@link Query} that matches numeric values within a * specified range. To use this, you must first index the * numeric values using {@link NumericField} (expert: {@link * NumericTokenStream}). If your terms are instead textual, * you should use {@link TermRangeQuery}. {@link * NumericRangeFilter} is the filter equivalent of this * query.

* *

You create a new NumericRangeQuery with the static * factory methods, eg: * *

 * Query q = NumericRangeQuery.newFloatRange("weight", 0.3f, 0.10f, true, true);
 * 
* * matches all documents whose float valued "weight" field * ranges from 0.3 to 0.10, inclusive. * *

The performance of NumericRangeQuery is much better * than the corresponding {@link TermRangeQuery} because the * number of terms that must be searched is usually far * fewer, thanks to trie indexing, described below.

* *

You can optionally specify a precisionStep * when creating this query. This is necessary if you've * changed this configuration from its default (4) during * indexing. Lower values consume more disk space but speed * up searching. Suitable values are between 1 and * 8. A good starting point to test is 4, * which is the default value for all Numeric* * classes. See below for * details. * *

This query defaults to {@linkplain * MultiTermQuery#CONSTANT_SCORE_AUTO_REWRITE_DEFAULT} for * 32 bit (int/float) ranges with precisionStep ≤8 and 64 * bit (long/double) ranges with precisionStep ≤6. * Otherwise it uses {@linkplain * MultiTermQuery#CONSTANT_SCORE_FILTER_REWRITE} as the * number of terms is likely to be high. With precision * steps of ≤4, this query can be run with one of the * BooleanQuery rewrite methods without changing * BooleanQuery's default max clause count. * *

NOTE: This API is experimental and * might change in incompatible ways in the next release. * *

How it works

* *

See the publication about panFMP, * where this algorithm was described (referred to as TrieRangeQuery): * *

Schindler, U, Diepenbroek, M, 2008. * Generic XML-based Framework for Metadata Portals. * Computers & Geosciences 34 (12), 1947-1955. * doi:10.1016/j.cageo.2008.02.023
* *

A quote from this paper: Because Apache Lucene is a full-text * search engine and not a conventional database, it cannot handle numerical ranges * (e.g., field value is inside user defined bounds, even dates are numerical values). * We have developed an extension to Apache Lucene that stores * the numerical values in a special string-encoded format with variable precision * (all numerical values like doubles, longs, floats, and ints are converted to * lexicographic sortable string representations and stored with different precisions * (for a more detailed description of how the values are stored, * see {@link NumericUtils}). A range is then divided recursively into multiple intervals for searching: * The center of the range is searched only with the lowest possible precision in the trie, * while the boundaries are matched more exactly. This reduces the number of terms dramatically.

* *

For the variant that stores long values in 8 different precisions (each reduced by 8 bits) that * uses a lowest precision of 1 byte, the index contains only a maximum of 256 distinct values in the * lowest precision. Overall, a range could consist of a theoretical maximum of * 7*255*2 + 255 = 3825 distinct terms (when there is a term for every distinct value of an * 8-byte-number in the index and the range covers almost all of them; a maximum of 255 distinct values is used * because it would always be possible to reduce the full 256 values to one term with degraded precision). * In practice, we have seen up to 300 terms in most cases (index with 500,000 metadata records * and a uniform value distribution).

* *

Precision Step

*

You can choose any precisionStep when encoding values. * Lower step values mean more precisions and so more terms in index (and index gets larger). * On the other hand, the maximum number of terms to match reduces, which optimized query speed. * The formula to calculate the maximum term count is: *

 *  n = [ (bitsPerValue/precisionStep - 1) * (2^precisionStep - 1 ) * 2 ] + (2^precisionStep - 1 )
 * 
*

(this formula is only correct, when bitsPerValue/precisionStep is an integer; * in other cases, the value must be rounded up and the last summand must contain the modulo of the division as * precision step). * For longs stored using a precision step of 4, n = 15*15*2 + 15 = 465, and for a precision * step of 2, n = 31*3*2 + 3 = 189. But the faster search speed is reduced by more seeking * in the term enum of the index. Because of this, the ideal precisionStep value can only * be found out by testing. Important: You can index with a lower precision step value and test search speed * using a multiple of the original step value.

* *

Good values for precisionStep are depending on usage and data type: *

* *

Comparisons of the different types of RangeQueries on an index with about 500,000 docs showed * that {@link TermRangeQuery} in boolean rewrite mode (with raised {@link BooleanQuery} clause count) * took about 30-40 secs to complete, {@link TermRangeQuery} in constant score filter rewrite mode took 5 secs * and executing this class took <100ms to complete (on an Opteron64 machine, Java 1.5, 8 bit * precision step). This query type was developed for a geographic portal, where the performance for * e.g. bounding boxes or exact date/time stamps is important.

* * @since 2.9 **/ public final class NumericRangeQuery extends MultiTermQuery { private NumericRangeQuery(final String field, final int precisionStep, final int valSize, T min, T max, final boolean minInclusive, final boolean maxInclusive ) { assert (valSize == 32 || valSize == 64); if (precisionStep < 1) throw new IllegalArgumentException("precisionStep must be >=1"); this.field = StringHelper.intern(field); this.precisionStep = precisionStep; this.valSize = valSize; this.min = min; this.max = max; this.minInclusive = minInclusive; this.maxInclusive = maxInclusive; // For bigger precisionSteps this query likely // hits too many terms, so set to CONSTANT_SCORE_FILTER right off // (especially as the FilteredTermEnum is costly if wasted only for AUTO tests because it // creates new enums from IndexReader for each sub-range) switch (valSize) { case 64: setRewriteMethod( (precisionStep > 6) ? CONSTANT_SCORE_FILTER_REWRITE : CONSTANT_SCORE_AUTO_REWRITE_DEFAULT ); break; case 32: setRewriteMethod( (precisionStep > 8) ? CONSTANT_SCORE_FILTER_REWRITE : CONSTANT_SCORE_AUTO_REWRITE_DEFAULT ); break; default: // should never happen throw new IllegalArgumentException("valSize must be 32 or 64"); } // shortcut if upper bound == lower bound if (min != null && min.equals(max)) { setRewriteMethod(CONSTANT_SCORE_BOOLEAN_QUERY_REWRITE); } } /** * Factory that creates a NumericRangeQuery, that queries a long * range using the given
precisionStep. * You can have half-open ranges (which are in fact </≤ or >/≥ queries) * by setting the min or max value to null. By setting inclusive to false, it will * match all documents excluding the bounds, with inclusive on, the boundaries are hits, too. */ public static NumericRangeQuery newLongRange(final String field, final int precisionStep, Long min, Long max, final boolean minInclusive, final boolean maxInclusive ) { return new NumericRangeQuery(field, precisionStep, 64, min, max, minInclusive, maxInclusive); } /** * Factory that creates a NumericRangeQuery, that queries a long * range using the default precisionStep {@link NumericUtils#PRECISION_STEP_DEFAULT} (4). * You can have half-open ranges (which are in fact </≤ or >/≥ queries) * by setting the min or max value to null. By setting inclusive to false, it will * match all documents excluding the bounds, with inclusive on, the boundaries are hits, too. */ public static NumericRangeQuery newLongRange(final String field, Long min, Long max, final boolean minInclusive, final boolean maxInclusive ) { return new NumericRangeQuery(field, NumericUtils.PRECISION_STEP_DEFAULT, 64, min, max, minInclusive, maxInclusive); } /** * Factory that creates a NumericRangeQuery, that queries a int * range using the given precisionStep. * You can have half-open ranges (which are in fact </≤ or >/≥ queries) * by setting the min or max value to null. By setting inclusive to false, it will * match all documents excluding the bounds, with inclusive on, the boundaries are hits, too. */ public static NumericRangeQuery newIntRange(final String field, final int precisionStep, Integer min, Integer max, final boolean minInclusive, final boolean maxInclusive ) { return new NumericRangeQuery(field, precisionStep, 32, min, max, minInclusive, maxInclusive); } /** * Factory that creates a NumericRangeQuery, that queries a int * range using the default precisionStep {@link NumericUtils#PRECISION_STEP_DEFAULT} (4). * You can have half-open ranges (which are in fact </≤ or >/≥ queries) * by setting the min or max value to null. By setting inclusive to false, it will * match all documents excluding the bounds, with inclusive on, the boundaries are hits, too. */ public static NumericRangeQuery newIntRange(final String field, Integer min, Integer max, final boolean minInclusive, final boolean maxInclusive ) { return new NumericRangeQuery(field, NumericUtils.PRECISION_STEP_DEFAULT, 32, min, max, minInclusive, maxInclusive); } /** * Factory that creates a NumericRangeQuery, that queries a double * range using the given precisionStep. * You can have half-open ranges (which are in fact </≤ or >/≥ queries) * by setting the min or max value to null. By setting inclusive to false, it will * match all documents excluding the bounds, with inclusive on, the boundaries are hits, too. */ public static NumericRangeQuery newDoubleRange(final String field, final int precisionStep, Double min, Double max, final boolean minInclusive, final boolean maxInclusive ) { return new NumericRangeQuery(field, precisionStep, 64, min, max, minInclusive, maxInclusive); } /** * Factory that creates a NumericRangeQuery, that queries a double * range using the default precisionStep {@link NumericUtils#PRECISION_STEP_DEFAULT} (4). * You can have half-open ranges (which are in fact </≤ or >/≥ queries) * by setting the min or max value to null. By setting inclusive to false, it will * match all documents excluding the bounds, with inclusive on, the boundaries are hits, too. */ public static NumericRangeQuery newDoubleRange(final String field, Double min, Double max, final boolean minInclusive, final boolean maxInclusive ) { return new NumericRangeQuery(field, NumericUtils.PRECISION_STEP_DEFAULT, 64, min, max, minInclusive, maxInclusive); } /** * Factory that creates a NumericRangeQuery, that queries a float * range using the given precisionStep. * You can have half-open ranges (which are in fact </≤ or >/≥ queries) * by setting the min or max value to null. By setting inclusive to false, it will * match all documents excluding the bounds, with inclusive on, the boundaries are hits, too. */ public static NumericRangeQuery newFloatRange(final String field, final int precisionStep, Float min, Float max, final boolean minInclusive, final boolean maxInclusive ) { return new NumericRangeQuery(field, precisionStep, 32, min, max, minInclusive, maxInclusive); } /** * Factory that creates a NumericRangeQuery, that queries a float * range using the default precisionStep {@link NumericUtils#PRECISION_STEP_DEFAULT} (4). * You can have half-open ranges (which are in fact </≤ or >/≥ queries) * by setting the min or max value to null. By setting inclusive to false, it will * match all documents excluding the bounds, with inclusive on, the boundaries are hits, too. */ public static NumericRangeQuery newFloatRange(final String field, Float min, Float max, final boolean minInclusive, final boolean maxInclusive ) { return new NumericRangeQuery(field, NumericUtils.PRECISION_STEP_DEFAULT, 32, min, max, minInclusive, maxInclusive); } @Override protected FilteredTermEnum getEnum(final IndexReader reader) throws IOException { return new NumericRangeTermEnum(reader); } /** Returns the field name for this query */ public String getField() { return field; } /** Returns true if the lower endpoint is inclusive */ public boolean includesMin() { return minInclusive; } /** Returns true if the upper endpoint is inclusive */ public boolean includesMax() { return maxInclusive; } /** Returns the lower value of this range query */ public T getMin() { return min; } /** Returns the upper value of this range query */ public T getMax() { return max; } @Override public String toString(final String field) { final StringBuilder sb = new StringBuilder(); if (!this.field.equals(field)) sb.append(this.field).append(':'); return sb.append(minInclusive ? '[' : '{') .append((min == null) ? "*" : min.toString()) .append(" TO ") .append((max == null) ? "*" : max.toString()) .append(maxInclusive ? ']' : '}') .append(ToStringUtils.boost(getBoost())) .toString(); } @Override public final boolean equals(final Object o) { if (o==this) return true; if (!super.equals(o)) return false; if (o instanceof NumericRangeQuery) { final NumericRangeQuery q=(NumericRangeQuery)o; return ( field==q.field && (q.min == null ? min == null : q.min.equals(min)) && (q.max == null ? max == null : q.max.equals(max)) && minInclusive == q.minInclusive && maxInclusive == q.maxInclusive && precisionStep == q.precisionStep ); } return false; } @Override public final int hashCode() { int hash = super.hashCode(); hash += field.hashCode()^0x4565fd66 + precisionStep^0x64365465; if (min != null) hash += min.hashCode()^0x14fa55fb; if (max != null) hash += max.hashCode()^0x733fa5fe; return hash + (Boolean.valueOf(minInclusive).hashCode()^0x14fa55fb)+ (Boolean.valueOf(maxInclusive).hashCode()^0x733fa5fe); } // field must be interned after reading from stream private void readObject(java.io.ObjectInputStream in) throws java.io.IOException, ClassNotFoundException { in.defaultReadObject(); field = StringHelper.intern(field); } // members (package private, to be also fast accessible by NumericRangeTermEnum) String field; final int precisionStep, valSize; final T min, max; final boolean minInclusive,maxInclusive; /** * Subclass of FilteredTermEnum for enumerating all terms that match the * sub-ranges for trie range queries. *

* WARNING: This term enumeration is not guaranteed to be always ordered by * {@link Term#compareTo}. * The ordering depends on how {@link NumericUtils#splitLongRange} and * {@link NumericUtils#splitIntRange} generates the sub-ranges. For * {@link MultiTermQuery} ordering is not relevant. */ private final class NumericRangeTermEnum extends FilteredTermEnum { private final IndexReader reader; private final LinkedList rangeBounds = new LinkedList(); private final Term termTemplate = new Term(field); private String currentUpperBound = null; NumericRangeTermEnum(final IndexReader reader) throws IOException { this.reader = reader; switch (valSize) { case 64: { // lower long minBound = Long.MIN_VALUE; if (min instanceof Long) { minBound = min.longValue(); } else if (min instanceof Double) { minBound = NumericUtils.doubleToSortableLong(min.doubleValue()); } if (!minInclusive && min != null) { if (minBound == Long.MAX_VALUE) break; minBound++; } // upper long maxBound = Long.MAX_VALUE; if (max instanceof Long) { maxBound = max.longValue(); } else if (max instanceof Double) { maxBound = NumericUtils.doubleToSortableLong(max.doubleValue()); } if (!maxInclusive && max != null) { if (maxBound == Long.MIN_VALUE) break; maxBound--; } NumericUtils.splitLongRange(new NumericUtils.LongRangeBuilder() { @Override public final void addRange(String minPrefixCoded, String maxPrefixCoded) { rangeBounds.add(minPrefixCoded); rangeBounds.add(maxPrefixCoded); } }, precisionStep, minBound, maxBound); break; } case 32: { // lower int minBound = Integer.MIN_VALUE; if (min instanceof Integer) { minBound = min.intValue(); } else if (min instanceof Float) { minBound = NumericUtils.floatToSortableInt(min.floatValue()); } if (!minInclusive && min != null) { if (minBound == Integer.MAX_VALUE) break; minBound++; } // upper int maxBound = Integer.MAX_VALUE; if (max instanceof Integer) { maxBound = max.intValue(); } else if (max instanceof Float) { maxBound = NumericUtils.floatToSortableInt(max.floatValue()); } if (!maxInclusive && max != null) { if (maxBound == Integer.MIN_VALUE) break; maxBound--; } NumericUtils.splitIntRange(new NumericUtils.IntRangeBuilder() { @Override public final void addRange(String minPrefixCoded, String maxPrefixCoded) { rangeBounds.add(minPrefixCoded); rangeBounds.add(maxPrefixCoded); } }, precisionStep, minBound, maxBound); break; } default: // should never happen throw new IllegalArgumentException("valSize must be 32 or 64"); } // seek to first term next(); } @Override public float difference() { return 1.0f; } /** this is a dummy, it is not used by this class. */ @Override protected boolean endEnum() { throw new UnsupportedOperationException("not implemented"); } /** this is a dummy, it is not used by this class. */ @Override protected void setEnum(TermEnum tenum) { throw new UnsupportedOperationException("not implemented"); } /** * Compares if current upper bound is reached, * this also updates the term count for statistics. * In contrast to {@link FilteredTermEnum}, a return value * of false ends iterating the current enum * and forwards to the next sub-range. */ @Override protected boolean termCompare(Term term) { return (term.field() == field && term.text().compareTo(currentUpperBound) <= 0); } /** Increments the enumeration to the next element. True if one exists. */ @Override public boolean next() throws IOException { // if a current term exists, the actual enum is initialized: // try change to next term, if no such term exists, fall-through if (currentTerm != null) { assert actualEnum != null; if (actualEnum.next()) { currentTerm = actualEnum.term(); if (termCompare(currentTerm)) return true; } } // if all above fails, we go forward to the next enum, // if one is available currentTerm = null; while (rangeBounds.size() >= 2) { assert rangeBounds.size() % 2 == 0; // close the current enum and read next bounds if (actualEnum != null) { actualEnum.close(); actualEnum = null; } final String lowerBound = rangeBounds.removeFirst(); this.currentUpperBound = rangeBounds.removeFirst(); // create a new enum actualEnum = reader.terms(termTemplate.createTerm(lowerBound)); currentTerm = actualEnum.term(); if (currentTerm != null && termCompare(currentTerm)) return true; // clear the current term for next iteration currentTerm = null; } // no more sub-range enums available assert rangeBounds.size() == 0 && currentTerm == null; return false; } /** Closes the enumeration to further activity, freeing resources. */ @Override public void close() throws IOException { rangeBounds.clear(); currentUpperBound = null; super.close(); } } }