LUCENE-2380: FieldCache.getStrings/Index --> FieldCache.getDocTerms/Index * The field values returned when sorting by SortField.STRING are now BytesRef. You can call value.utf8ToString() to convert back to string, if necessary. * In FieldCache, getStrings (returning String[]) has been replaced with getTerms (returning a FieldCache.DocTerms instance). DocTerms provides a getTerm method, taking a docID and a BytesRef to fill (which must not be null), and it fills it in with the reference to the bytes for that term. If you had code like this before: String[] values = FieldCache.DEFAULT.getStrings(reader, field); ... String aValue = values[docID]; you can do this instead: DocTerms values = FieldCache.DEFAULT.getTerms(reader, field); ... BytesRef term = new BytesRef(); String aValue = values.getTerm(docID, term).utf8ToString(); Note however that it can be costly to convert to String, so it's better to work directly with the BytesRef. * Similarly, in FieldCache, getStringIndex (returning a StringIndex instance, with direct arrays int[] order and String[] lookup) has been replaced with getTermsIndex (returning a FieldCache.DocTermsIndex instance). DocTermsIndex provides the getOrd(int docID) method to lookup the int order for a document, lookup(int ord, BytesRef reuse) to lookup the term from a given order, and the sugar method getTerm(int docID, BytesRef reuse) which internally calls getOrd and then lookup. If you had code like this before: StringIndex idx = FieldCache.DEFAULT.getStringIndex(reader, field); ... int ord = idx.order[docID]; String aValue = idx.lookup[ord]; you can do this instead: DocTermsIndex idx = FieldCache.DEFAULT.getTermsIndex(reader, field); ... int ord = idx.getOrd(docID); BytesRef term = new BytesRef(); String aValue = idx.lookup(ord, term).utf8ToString(); Note however that it can be costly to convert to String, so it's better to work directly with the BytesRef. DocTermsIndex also has a getTermsEnum() method, which returns an iterator (TermsEnum) over the term values in the index (ie, iterates ord = 0..numOrd()-1). * StringComparatorLocale is now more CPU costly than it was before (it was already very CPU costly since it does not compare using indexed collation keys; use CollationKeyFilter for better performance), since it converts BytesRef -> String on the fly. Also, the field values returned when sorting by SortField.STRING are now BytesRef. * FieldComparator.StringOrdValComparator has been renamed to TermOrdValComparator, and now uses BytesRef for its values. Likewise for StringValComparator, renamed to TermValComparator. This means when sorting by SortField.STRING or SortField.STRING_VAL (or directly invoking these comparators) the values returned in the FieldDoc.fields array will be BytesRef not String. You can call the .utf8ToString() method on the BytesRef instances, if necessary. LUCENE-1458, LUCENE-2111: Flexible Indexing Flexible indexing changed the low level fields/terms/docs/positions enumeration APIs. Here are the major changes: * Terms are now binary in nature (arbitrary byte[]), represented by the BytesRef class (which provides an offset + length "slice" into an existing byte[]). * Fields are separately enumerated (FieldsEnum) from the terms within each field (TermEnum). So instead of this: TermEnum termsEnum = ...; while(termsEnum.next()) { Term t = termsEnum.term(); System.out.println("field=" + t.field() + "; text=" + t.text()); } Do this: FieldsEnum fieldsEnum = ...; String field; while((field = fieldsEnum.next()) != null) { TermsEnum termsEnum = fieldsEnum.terms(); BytesRef text; while((text = termsEnum.next()) != null) { System.out.println("field=" + field + "; text=" + text.utf8ToString()); } * TermDocs is renamed to DocsEnum. Instead of this: while(td.next()) { int doc = td.doc(); ... } do this: int doc; while((doc = td.next()) != DocsEnum.NO_MORE_DOCS) { ... } Instead of this: if (td.skipTo(target)) { int doc = td.doc(); ... } do this: if ((doc=td.advance(target)) != DocsEnum.NO_MORE_DOCS) { ... } The bulk read API has also changed. Instead of this: int[] docs = new int[256]; int[] freqs = new int[256]; while(true) { int count = td.read(docs, freqs) if (count == 0) { break; } // use docs[i], freqs[i] } do this: DocsEnum.BulkReadResult bulk = td.getBulkResult(); while(true) { int count = td.read(); if (count == 0) { break; } // use bulk.docs.ints[i] and bulk.freqs.ints[i] } * TermPositions is renamed to DocsAndPositionsEnum, and no longer extends the docs only enumerator (DocsEnum). * Deleted docs are no longer implicitly filtered from docs/positions enums. Instead, you pass a Bits skipDocs (set bits are skipped) when obtaining the enums. Also, you can now ask a reader for its deleted docs. * The docs/positions enums cannot seek to a term. Instead, TermsEnum is able to seek, and then you request the docs/positions enum from that TermsEnum. * TermsEnum's seek method returns more information. So instead of this: Term t; TermEnum termEnum = reader.terms(t); if (t.equals(termEnum.term())) { ... } do this: TermsEnum termsEnum = ...; BytesRef text; if (termsEnum.seek(text) == TermsEnum.SeekStatus.FOUND) { ... } SeekStatus also contains END (enumerator is done) and NOT_FOUND (term was not found but enumerator is now positioned to the next term). * TermsEnum has an ord() method, returning the long numeric ordinal (ie, first term is 0, next is 1, and so on) for the term it's not positioned to. There is also a corresponding seek(long ord) method. Note that these methods are optional; in particular the MultiFields TermsEnum does not implement them. How you obtain the enums has changed. The primary entry point is the Fields class. If you know your reader is a single segment reader, do this: Fields fields = reader.Fields(); if (fields != null) { ... } If the reader might be multi-segment, you must do this: Fields fields = MultiFields.getFields(reader); if (fields != null) { ... } The fields may be null (eg if the reader has no fields). Note that the MultiFields approach entails a performance hit on MultiReaders, as it must merge terms/docs/positions on the fly. It's generally better to instead get the sequential readers (use oal.util.ReaderUtil) and then step through those readers yourself, if you can (this is how Lucene drives searches). If you pass a SegmentReader to MultiFields.fiels it will simply return reader.fields(), so there is no performance hit in that case. Once you have a non-null Fields you can do this: Terms terms = fields.terms("field"); if (terms != null) { ... } The terms may be null (eg if the field does not exist). Once you have a non-null terms you can get an enum like this: TermsEnum termsEnum = terms.iterator(); The returned TermsEnum will not be null. You can then .next() through the TermsEnum, or seek. If you want a DocsEnum, do this: Bits skipDocs = MultiFields.getDeletedDocs(reader); DocsEnum docsEnum = null; docsEnum = termsEnum.docs(skipDocs, docsEnum); You can pass in a prior DocsEnum and it will be reused if possible. Likewise for DocsAndPositionsEnum. IndexReader has several sugar methods (which just go through the above steps, under the hood). Instead of: Term t; TermDocs termDocs = reader.termDocs(); termDocs.seek(t); do this: String field; BytesRef text; DocsEnum docsEnum = reader.termDocsEnum(reader.getDeletedDocs(), field, text); Likewise for DocsAndPositionsEnum. * LUCENE-2600: remove IndexReader.isDeleted Instead of IndexReader.isDeleted, do this: import org.apache.lucene.util.Bits; import org.apache.lucene.index.MultiFields; Bits delDocs = MultiFields.getDeletedDocs(indexReader); if (delDocs.get(docID)) { // document is deleted... } * LUCENE-2674: A new idfExplain method was added to Similarity, that accepts an incoming docFreq. If you subclass Similarity, make sure you also override this method on upgrade, otherwise your customizations won't run for certain MultiTermQuerys. * LUCENE-2413: Lucene's core and contrib analyzers, along with Solr's analyzers, were consolidated into modules/analysis. During the refactoring some package names have changed: - o.a.l.analysis.KeywordAnalyzer -> o.a.l.analysis.core.KeywordAnalyzer - o.a.l.analysis.KeywordTokenizer -> o.a.l.analysis.core.KeywordTokenizer - o.a.l.analysis.LetterTokenizer -> o.a.l.analysis.core.LetterTokenizer - o.a.l.analysis.LowerCaseFilter -> o.a.l.analysis.core.LowerCaseFilter - o.a.l.analysis.LowerCaseTokenizer -> o.a.l.analysis.core.LowerCaseTokenizer - o.a.l.analysis.SimpleAnalyzer -> o.a.l.analysis.core.SimpleAnalyzer - o.a.l.analysis.StopAnalyzer -> o.a.l.analysis.core.StopAnalyzer - o.a.l.analysis.StopFilter -> o.a.l.analysis.core.StopFilter - o.a.l.analysis.WhitespaceAnalyzer -> o.a.l.analysis.core.WhitespaceAnalyzer - o.a.l.analysis.WhitespaceTokenizer -> o.a.l.analysis.core.WhitespaceTokenizer - o.a.l.analysis.PorterStemFilter -> o.a.l.analysis.en.PorterStemFilter - o.a.l.analysis.ASCIIFoldingFilter -> o.a.l.analysis.miscellaneous.ASCIIFoldingFilter - o.a.l.analysis.ISOLatin1AccentFilter -> o.a.l.analysis.miscellaneous.ISOLatin1AccentFilter - o.a.l.analysis.KeywordMarkerFilter -> o.a.l.analysis.miscellaneous.KeywordMarkerFilter - o.a.l.analysis.LengthFilter -> o.a.l.analysis.miscellaneous.LengthFilter - o.a.l.analysis.PerFieldAnalyzerWrapper -> o.a.l.analysis.miscellaneous.PerFieldAnalyzerWrapper - o.a.l.analysis.TeeSinkTokenFilter -> o.a.l.analysis.sinks.TeeSinkTokenFilter - o.a.l.analysis.CharFilter -> o.a.l.analysis.charfilter.CharFilter - o.a.l.analysis.BaseCharFilter -> o.a.l.analysis.charfilter.BaseCharFilter - o.a.l.analysis.MappingCharFilter -> o.a.l.analysis.charfilter.MappingCharFilter - o.a.l.analysis.NormalizeCharMap -> o.a.l.analysis.charfilter.NormalizeCharMap - o.a.l.analysis.CharArraySet -> o.a.l.analysis.util.CharArraySet - o.a.l.analysis.CharArrayMap -> o.a.l.analysis.util.CharArrayMap - o.a.l.analysis.ReusableAnalyzerBase -> o.a.l.analysis.util.ReusableAnalyzerBase - o.a.l.analysis.StopwordAnalyzerBase -> o.a.l.analysis.util.StopwordAnalyzerBase - o.a.l.analysis.WordListLoader -> o.a.l.analysis.util.WordListLoader * LUCENE-2514: The option to use a Collator's order (instead of binary order) for sorting and range queries has been moved to contrib/queries. The Collated TermRangeQuery/Filter has been moved to SlowCollatedTermRangeQuery/Filter, and the collated sorting has been moved to SlowCollatedStringComparator. Note: this functionality isn't very scalable and if you are using it, consider indexing collation keys with the collation support in the analysis module instead. To perform collated range queries, use a suitable collating analyzer: CollationKeyAnalyzer or ICUCollationKeyAnalyzer, and set qp.setAnalyzeRangeTerms(true). TermRangeQuery and TermRangeFilter now work purely on bytes. Both have helper factory methods (newStringRange) similar to the NumericRange API, to easily perform range queries on Strings. * LUCENE-2691: The near-real-time API has moved from IndexWriter to IndexReader. Instead of IndexWriter.getReader(), call IndexReader.open(IndexWriter) or IndexReader.reopen(IndexWriter). * LUCENE-2690: MultiTermQuery boolean rewrites per segment. Also MultiTermQuery.getTermsEnum() now takes an AttributeSource. FuzzyTermsEnum is both consumer and producer of attributes: MTQ.BoostAttribute is added to the FuzzyTermsEnum and MTQ's rewrite mode consumes it. The other way round MTQ.TopTermsBooleanQueryRewrite supplys a global AttributeSource to each segments TermsEnum. The TermsEnum is consumer and gets the current minimum competitive boosts (MTQ.MaxNonCompetitiveBoostAttribute). * LUCENE-2374: The backwards layer in AttributeImpl was removed. To support correct reflection of AttributeImpl instances, where the reflection was done using deprecated toString() parsing, you have to now override reflectWith() to customize output. toString() is no longer implemented by AttributeImpl, so if you have overridden toString(), port your customization over to reflectWith(). reflectAsString() would then return what toString() did before. * LUCENE-2236, LUCENE-2912: DefaultSimilarity can no longer be set statically (and dangerously) for the entire JVM. Instead, IndexWriterConfig and IndexSearcher now take a SimilarityProvider. Similarity can now be configured on a per-field basis. Similarity retains only the field-specific relevance methods such as tf() and idf(). Previously some (but not all) of these methods, such as computeNorm and scorePayload took field as a parameter, this is removed due to the fact the entire Similarity (all methods) can now be configured per-field. Methods that apply to the entire query such as coord() and queryNorm() exist in SimilarityProvider. * LUCENE-1076: TieredMergePolicy is now the default merge policy. It's able to merge non-contiguous segments; this may cause problems for applications that rely on Lucene's internal document ID assigment. If so, you should instead use LogByteSize/DocMergePolicy during indexing.