Lucene contrib change Log ======================= Lucene 3.2.0 ======================= Changes in backwards compatibility policy * LUCENE-2981: Removed the following contribs: ant, db, lucli, swing. (Robert Muir) Changes in runtime behavior * LUCENE-3086: ItalianAnalyzer now uses ElisionFilter with a set of Italian contractions by default. (Robert Muir) Bug Fixes * LUCENE-3045: fixed QueryNodeImpl.containsTag(String key) that was not lowercasing the key before checking for the tag (Adriano Crestani) * LUCENE-3026: SmartChineseAnalyzer's WordTokenFilter threw NullPointerException on sentences longer than 32,767 characters. (wangzhenghang via Robert Muir) * LUCENE-2939: Highlighter should try and use maxDocCharsToAnalyze in WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as when using CachingTokenStream. This can be a significant performance bug for large documents. (Mark Miller) * LUCENE-3043: GermanStemmer threw IndexOutOfBoundsException if it encountered a zero-length token. (Robert Muir) * LUCENE-3044: ThaiWordFilter didn't reset its cached state correctly, this only caused a problem if you consumed a tokenstream, then reused it, added different attributes to it, and consumed it again. (Robert Muir, Uwe Schindler) * LUCENE-3113: Fixed some minor analysis bugs: double-reset() in ReusableAnalyzerBase and ShingleAnalyzerWrapper, missing end() implementations in PrefixAwareTokenFilter and PrefixAndSuffixAwareTokenFilter, invocations of incrementToken() after it already returned false in CommonGramsQueryFilter, HyphenatedWordsFilter, ShingleFilter, and SynonymsFilter. (Robert Muir, Steven Rowe, Uwe Schindler) New Features * LUCENE-3016: Add analyzer for Latvian. (Robert Muir) * LUCENE-1421: create new grouping contrib module, enabling search results to be grouped by a single-valued indexed field. This module was factored out of Solr's grouping implementation, but it cannot group by function queries nor arbitrary queries. (Mike McCandless) * LUCENE-3098: add AllGroupsCollector, to collect all unique groups (but in unspecified order). (Martijn van Groningen via Mike McCandless) * LUCENE-3092: Added NRTCachingDirectory in contrib/misc, which caches small segments in RAM. This is useful, in the near-real-time case where the indexing rate is lowish but the reopen rate is highish, to take load off the IO system. (Mike McCandless) Optimizations * LUCENE-3040: Switch all analysis consumers (highlighter, morelikethis, memory, ...) over to reusableTokenStream(). (Robert Muir) ======================= Lucene 3.1.0 ======================= Changes in backwards compatibility policy * LUCENE-2100: All Analyzers in Lucene-contrib have been marked as final. Analyzers should be only act as a composition of TokenStreams, users should compose their own analyzers instead of subclassing existing ones. (Simon Willnauer) * LUCENE-2194, LUCENE-2201: Snowball APIs were upgraded to snowball revision 502 (with some local modifications for improved performance). Index backwards compatibility and binary backwards compatibility is preserved, but some protected/public member variables changed type. This does NOT affect java code/class files produced by the snowball compiler, but technically is a backwards compatibility break. (Robert Muir) * LUCENE-2226: Moved contrib/snowball functionality into contrib/analyzers. Be sure to remove any old obselete lucene-snowball jar files from your classpath! (Robert Muir) * LUCENE-2323: Moved contrib/wikipedia functionality into contrib/analyzers. Additionally the package was changed from org.apache.lucene.wikipedia.analysis to org.apache.lucene.analysis.wikipedia. (Robert Muir) * LUCENE-2581: Added new methods to FragmentsBuilder interface. These methods are used to set pre/post tags and Encoder. (Koji Sekiguchi) * LUCENE-2391: Improved spellchecker (re)build time/ram usage by omitting frequencies/positions/norms for single-valued fields, modifying the default ramBufferMBSize to match IndexWriterConfig (16MB), making index optimization an optional boolean parameter, and modifying the incremental update logic to work well with unoptimized spellcheck indexes. The indexDictionary() methods were made final to ensure a hard backwards break in case you were subclassing Spellchecker. In general, subclassing Spellchecker is not recommended. (Robert Muir) Changes in runtime behavior * LUCENE-2117: SnowballAnalyzer uses TurkishLowerCaseFilter instead of LowercaseFilter to correctly handle the unique Turkish casing behavior if used with Version > 3.0 and the TurkishStemmer. (Robert Muir via Simon Willnauer) * LUCENE-2055: GermanAnalyzer now uses the Snowball German2 algorithm and stopwords list by default for Version > 3.0. (Robert Muir, Uwe Schindler, Simon Willnauer) Bug fixes * LUCENE-2855: contrib queryparser was using CharSequence as key in some internal Map instances, which was leading to incorrect behavior, since some CharSequence implementors do not override hashcode and equals methods. Now the internal Maps are using String instead. (Adriano Crestani) * LUCENE-2068: Fixed ReverseStringFilter which was not aware of supplementary characters. During reverse the filter created unpaired surrogates, which will be replaced by U+FFFD by the indexer, but not at query time. The filter now reverses supplementary characters correctly if used with Version > 3.0. (Simon Willnauer, Robert Muir) * LUCENE-2035: TokenSources.getTokenStream() does not assign positionIncrement. (Christopher Morris via Mark Miller) * LUCENE-2055: Deprecated RussianTokenizer, RussianStemmer, RussianStemFilter, FrenchStemmer, FrenchStemFilter, DutchStemmer, and DutchStemFilter. For these Analyzers, SnowballFilter is used instead (for Version > 3.0), as the previous code did not always implement the Snowball algorithm correctly. Additionally, for Version > 3.0, the Snowball stopword lists are used by default. (Robert Muir, Uwe Schindler, Simon Willnauer) * LUCENE-2184: Fixed bug with handling best fit value when the proper best fit value is not an indexed field. Note, this change affects the APIs. (Grant Ingersoll) * LUCENE-2359: Fix bug in CartesianPolyFilterBuilder related to handling of behavior around the 180th meridian (Grant Ingersoll) * LUCENE-2404: Fix bugs with position increment and empty tokens in ThaiWordFilter. For matchVersion >= 3.1 the filter also no longer lowercases. ThaiAnalyzer will use a separate LowerCaseFilter instead. (Uwe Schindler, Robert Muir) * LUCENE-2615: Fix DirectIOLinuxDirectory to not assign bogus permissions to newly created files, and to not silently hardwire buffer size to 1 MB. (Mark Miller, Robert Muir, Mike McCandless) * LUCENE-2629: Fix gennorm2 task for generating ICUFoldingFilter's .nrm file. This allows you to customize its normalization/folding, by editing the source data files in src/data and regenerating a new .nrm with 'ant gennorm2'. (David Bowen via Robert Muir) * LUCENE-2653: ThaiWordFilter depends on the JRE having a Thai dictionary, which is not always the case. If the dictionary is unavailable, the filter will now throw UnsupportedOperationException in the constructor. (Robert Muir) * LUCENE-589: Fix contrib/demo for international documents. (Curtis d'Entremont via Robert Muir) * LUCENE-2246: Fix contrib/demo for Turkish html documents. (Selim Nadi via Robert Muir) * LUCENE-590: Demo HTML parser gives incorrect summaries when title is repeated as a heading (Curtis d'Entremont via Robert Muir) * LUCENE-591: The demo indexer now indexes meta keywords. (Curtis d'Entremont via Robert Muir) * LUCENE-2874: Highlighting overlapping tokens outputted doubled words. (Pierre Gossé via Robert Muir) * LUCENE-2943: Fix thread-safety issues with ICUCollationKeyFilter. (Robert Muir) * LUCENE-3087: Highlighter: fix case that was preventing highlighting of exact phrase when tokens overlap. (Pierre Gossé via Mike McCandless) API Changes * LUCENE-2867: Some contrib queryparser methods that receives CharSequence as identifier, such as QueryNode#unsetTag(CharSequence), were deprecated and will be removed on version 4. (Adriano Crestani) * LUCENE-2147: Spatial GeoHashUtils now always decode GeoHash strings with full precision. GeoHash#decode_exactly(String) was merged into GeoHash#decode(String). (Chris Male, Simon Willnauer) * LUCENE-2204: Change some package private classes/members to publicly accessible to implement custom FragmentsBuilders. (Koji Sekiguchi) * LUCENE-2055: Integrate snowball into contrib/analyzers. SnowballAnalyzer is now deprecated in favor of language-specific analyzers which contain things such as stopword lists and any language-specific processing in addition to stemming. Add Turkish and Romanian stopwords lists to support this. (Robert Muir, Uwe Schindler, Simon Willnauer) * LUCENE-2603: Add setMultiValuedSeparator(char) method to set an arbitrary char that is used when concatenating multiValued data. Default is a space (' '). It is applied on ANALYZED field only. (Koji Sekiguchi) * LUCENE-2626: FastVectorHighlighter: enable FragListBuilder and FragmentsBuilder to be set per-field override. (Koji Sekiguchi) * LUCENE-2712: FieldBoostMapAttribute in contrib/queryparser was changed from a Map to a Map. Per the CharSequence javadoc, CharSequence is inappropriate as a map key. (Robert Muir) * LUCENE-1937: Add more methods to manipulate QueryNodeProcessorPipeline elements. QueryNodeProcessorPipeline now implements the List interface, this is useful if you want to extend or modify an existing pipeline. (Adriano Crestani via Robert Muir) * LUCENE-2754, LUCENE-2757: Deprecated SpanRegexQuery. Use new SpanMultiTermQueryWrapper(new RegexQuery()) instead. (Robert Muir, Uwe Schindler) * LUCENE-2747: Deprecated ArabicLetterTokenizer. StandardTokenizer now tokenizes most languages correctly including Arabic. (Steven Rowe, Robert Muir) * LUCENE-2830: Use StringBuilder instead of StringBuffer across Benchmark, and remove the StringBuffer HtmlParser.parse() variant. (Shai Erera) * LUCENE-2920: Deprecated ShingleMatrixFilter as it is unmaintained and does not work with custom Attributes or custom payload encoders. (Uwe Schindler) New features * LUCENE-2500: Added DirectIOLinuxDirectory, a Linux-specific Directory impl that uses the O_DIRECT flag to bypass the buffer cache. This is useful to prevent segment merging from evicting pages from the buffer cache, since fadvise/madvise do not seem. (Michael McCandless) * LUCENE-2306: Add NumericRangeFilter and NumericRangeQuery support to XMLQueryParser. (Jingkei Ly, via Mark Harwood) * LUCENE-2102: Add a Turkish LowerCase Filter. TurkishLowerCaseFilter handles Turkish and Azeri unique casing behavior correctly. (Ahmet Arslan, Robert Muir via Simon Willnauer) * LUCENE-2039: Add a extensible query parser to contrib/misc. ExtendableQueryParser enables arbitrary parser extensions based on a customizable field naming scheme. (Simon Willnauer) * LUCENE-2067: Add a Czech light stemmer. CzechAnalyzer will now stem words when Version is set to 3.1 or higher. (Robert Muir) * LUCENE-2062: Add a Bulgarian analyzer. (Robert Muir, Simon Willnauer) * LUCENE-2206: Add Snowball's stopword lists for Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Russian, Spanish, and Swedish. These can be loaded with WordListLoader.getSnowballWordSet. (Robert Muir, Simon Willnauer) * LUCENE-2243: Add DisjunctionMaxQuery support for FastVectorHighlighter. (Koji Sekiguchi) * LUCENE-2218: ShingleFilter supports minimum shingle size, and the separator character is now configurable. Its also up to 20% faster. (Steven Rowe via Robert Muir) * LUCENE-2234: Add a Hindi analyzer. (Robert Muir) * LUCENE-2055: Add analyzers/misc/StemmerOverrideFilter. This filter provides the ability to override any stemmer with a custom dictionary map. (Robert Muir, Uwe Schindler, Simon Willnauer) * LUCENE-2399: Add ICUNormalizer2Filter, which normalizes tokens with ICU's Normalizer2. This allows for efficient combinations of normalization and custom mappings in addition to standard normalization, and normalization combined with unicode case folding. (Robert Muir) * LUCENE-1343: Add ICUFoldingFilter, a replacement for ASCIIFoldingFilter that does a more thorough job of normalizing unicode text for search. (Robert Haschart, Robert Muir) * LUCENE-2409: Add ICUTransformFilter, which transforms text in a context sensitive way, either from ICU built-in rules (such as Traditional-Simplified), or from rules you write yourself. (Robert Muir) * LUCENE-2414: Add ICUTokenizer, a tailorable tokenizer that implements Unicode Text Segmentation. This tokenizer is useful for documents or collections with multiple languages. The default configuration includes special support for Thai, Lao, Myanmar, and Khmer. (Robert Muir, Uwe Schindler) * LUCENE-2298: Add analyzers/stempel, an algorithmic stemmer with support for the Polish language. (Andrzej Bialecki via Robert Muir) * LUCENE-2400: ShingleFilter was changed to don't output all-filler shingles and unigrams, and uses a more performant algorithm to build grams using a linked list of AttributeSource.cloneAttributes() instances and the new copyTo() method. (Steven Rowe via Uwe Schindler) * LUCENE-2437: Add an Analyzer for Indonesian. (Robert Muir) * LUCENE-2393: The HighFreqTerms tool (in misc) can now optionally also include the total termFreq. (Tom Burton-West via Mike McCandless) * LUCENE-2463: Add a Greek inflectional stemmer. GreekAnalyzer will now stem words when Version is set to 3.1 or higher. (Robert Muir) * LUCENE-1287: Allow usage of HyphenationCompoundWordTokenFilter without dictionary. (Thomas Peuss via Robert Muir) * LUCENE-2464: FastVectorHighlighter: add SingleFragListBuilder to return entire field contents. (Koji Sekiguchi) * LUCENE-2503: Added lighter stemming alternatives for European languages. (Robert Muir) * LUCENE-2581: FastVectorHighlighter: add Encoder to FragmentsBuilder. (Koji Sekiguchi) * LUCENE-2624: Add Analyzers for Armenian, Basque, and Catalan, from snowball. (Robert Muir) * LUCENE-1938: PrecedenceQueryParser is now implemented with the flexible QP framework. This means that you can also add this functionality to your own QP pipeline by using BooleanModifiersQueryNodeProcessor, for example instead of GroupQueryNodeProcessor. (Adriano Crestani via Robert Muir) * LUCENE-2791: Added WindowsDirectory, a Windows-specific Directory impl that doesn't synchronize on the file handle. This can be useful to avoid the performance problems of SimpleFSDirectory and NIOFSDirectory. (Robert Muir, Simon Willnauer, Uwe Schindler, Michael McCandless) * LUCENE-2842: Add analyzer for Galician. Also adds the RSLP (Orengo) stemmer for Portuguese. (Robert Muir) * SOLR-1057: Add PathHierarchyTokenizer that represents file path hierarchies as synonyms of /something, /something/something, /something/something/else. (Ryan McKinley, Koji Sekiguchi) Build * LUCENE-2124: Moved the JDK-based collation support from contrib/collation into core, and moved the ICU-based collation support into contrib/icu. (Steven Rowe, Robert Muir) * LUCENE-2323: Moved contrib/regex into contrib/queries. Moved the queryparsers under contrib/misc and contrib/surround into contrib/queryparser. Moved contrib/fast-vector-highlighter into contrib/highlighter. Moved ChainedFilter from contrib/misc to contrib/queries. contrib/spatial now depends on contrib/queries instead of contrib/misc. (Robert Muir) * LUCENE-2333: Fix failures during contrib builds, when classes in core were changed without ant clean. This fix also optimizes the dependency management between contribs by a new ANT macro. (Uwe Schindler, Shai Erera) * LUCENE-2797: Upgrade contrib/icu's ICU jar file to ICU 4.6 (Robert Muir) * LUCENE-2833: Upgrade contrib/ant's jtidy jar file to r938 (Robert Muir) * LUCENE-2413: Moved the demo out of lucene core and into contrib/demo. (Robert Muir) Optimizations * LUCENE-2157: DelimitedPayloadTokenFilter no longer copies the buffer over itsself. Instead it sets only the length. This patch also optimizes the logic of the filter and uses NIO for IdentityEncoder. (Uwe Schindler) * LUCENE-2084: Change IndexableBinaryStringTools to work on byte[] and char[] directly, instead of Byte/CharBuffers, and modify ICUCollationKeyFilter to take advantage of this for faster performance. (Steven Rowe, Uwe Schindler, Robert Muir) * LUCENE-2194, LUCENE-2201, LUCENE-2288: Snowball stemmers in contrib/analyzers have been optimized to work on char[] and remove unnecessary object creation. (Shai Erera, Robert Muir) * LUCENE-2404: Improve performance of ThaiWordFilter by using a char[]-backed CharacterIterator (currently from javax.swing). (Uwe Schindler, Robert Muir) Test Cases * LUCENE-2115: Cutover contrib tests to use Java5 generics. (Kay Kay via Mike McCandless) Other * LUCENE-1845: Updated bdb-je jar from version 3.3.69 to 3.3.93. (Simon Willnauer via Mike McCandless) * LUCENE-2415: Use reflection instead of a shim class to access Jakarta Regex prefix. (Uwe Schindler) ================== Release 2.9.4 / 3.0.3 ==================== Bug Fixes * LUCENE-2277: QueryNodeImpl threw ConcurrentModificationException on add(List). (Frank Wesemann via Robert Muir) * LUCENE-2284: MatchAllDocsQueryNode toString() created an invalid XML tag. (Frank Wesemann via Robert Muir) * LUCENE-2278: FastVectorHighlighter: Highlighted term is out of alignment in multi-valued NOT_ANALYZED field. (Koji Sekiguchi) * LUCENE-2524: FastVectorHighlighter: use mod for getting colored tag. (Koji Sekiguchi) * LUCENE-2616: FastVectorHighlighter: out of alignment when the first value is empty in multiValued field (Koji Sekiguchi) * LUCENE-2731, LUCENE-2732: Fix (charset) problems in XML loading in HyphenationCompoundWordTokenFilter (partial bugfix-only in 2.9 and 3.0, full fix will be in later 3.1). (Uwe Schinder) Documentation * LUCENE-2055: Add documentation noting that the Dutch and French stemmers in contrib/analyzers do not implement the Snowball algorithm correctly, and recommend to use the equivalents in contrib/snowball if possible. (Robert Muir, Uwe Schindler, Simon Willnauer) * LUCENE-2653: Add documentation noting that ThaiWordFilter will not work as expected on all JRE's. For example, on an IBM JRE, it does nothing. (Robert Muir) ================== Release 2.9.3 / 3.0.2 ==================== No changes. ================== Release 2.9.2 / 3.0.1 ==================== New features * LUCENE-2108: Spellchecker now safely supports concurrent modifications to the spell-index. Threads can safely obtain term suggestions while the spell- index is rebuild, cleared or reset. Internal IndexSearcher instances remain open until the last thread accessing them releases the reference. (Simon Willnauer) Bug Fixes * LUCENE-2144: Fix InstantiatedIndex to handle termDocs(null) correctly (enumerate all non-deleted docs). (Karl Wettin via Mike McCandless) * LUCENE-2199: ShingleFilter skipped over tri-gram shingles if outputUnigram was set to false. (Simon Willnauer) * LUCENE-2211: Fix missing clearAttributes() calls: ShingleMatrix, PrefixAware, compounds, NGramTokenFilter, EdgeNGramTokenFilter, Highlighter, and MemoryIndex. (Uwe Schindler, Robert Muir) * LUCENE-2207, LUCENE-2219: Fix incorrect offset calculations in end() for CJKTokenizer, ChineseTokenizer, SmartChinese SentenceTokenizer, and WikipediaTokenizer. (Koji Sekiguchi, Robert Muir) * LUCENE-2266: Fixed offset calculations in NGramTokenFilter and EdgeNGramTokenFilter. (Joe Calderon, Robert Muir via Uwe Schindler) API Changes * LUCENE-2108: Add SpellChecker.close, to close the underlying reader. (Eirik Bjørsnøs via Mike McCandless) * LUCENE-2165: Add a constructor to SnowballAnalyzer that takes a Set of stopwords, and deprecate the String[] one. (Nick Burch via Robert Muir) ======================= Release 3.0.0 ======================= Changes in backwards compatibility policy * LUCENE-1257: Change some occurences of StringBuffer in public/protected APIs of contrib/surround to StringBuilder. (Paul Elschot via Uwe Schindler) Changes in runtime behavior * LUCENE-1966: Modified and cleaned the default Arabic stopwords list used by ArabicAnalyzer. You'll need to fully re-index any previously created indexes. (Basem Narmok via Robert Muir) API Changes * LUCENE-1936: Deprecated RussianLowerCaseFilter, because it transforms text exactly the same as LowerCaseFilter. Please use LowerCaseFilter instead, which has the same functionality. (Robert Muir) * LUCENE-2051: Contrib Analyzer setters were deprecated and replaced with ctor arguments / Version number. Also stop word lists were unified. (Simon Willnauer) Bug fixes * LUCENE-1781: Fixed various issues with the lat/lng bounding box distance filter created for radius search in contrib/spatial. (Bill Bell via Mike McCandless) * LUCENE-1939: IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method on exhausted streams. (Patrick Jungermann via Karl Wettin) * LUCENE-1359: French analyzer did not support null field names. (Andrew Lynch via Robert Muir) New features * LUCENE-1924: Added BalancedSegmentMergePolicy to contrib/misc, which is a merge policy that tries to avoid doing very large segment merges to give better search performance in a mixed indexing/searching environment. (John Wang via Mike McCandless) * LUCENE-1959: Add index splitting tools. The IndexSplitter tool works on multi-segment (non optimized) indexes and it can copy specific segments out of the index into a new index. It can also list the segments in the index, and delete specified segments. (Jason Rutherglen via Mike McCandless). MultiPassIndexSplitter can split any index into any number of output parts, at the cost of doing multiple passes over the input index. (Andrzej Bialecki) * LUCENE-1993: Add maxDocFreq setting to MoreLikeThis, to exclude from consideration terms that match more than the specified number of documents. (Christian Steinert via Mike McCandless) Optimizations * LUCENE-1965, LUCENE-1962: Arabic-, Persian- and SmartChineseAnalyzer loads default stopwords only once if accessed for the first time. Previous versions were loading the stopword files each time a new instance was created. This might improve performance for applications creating lots of instances of these Analyzers. (Simon Willnauer) Documentation * LUCENE-1916: Translated documentation in the smartcn hhmm package. (Patricia Peng via Robert Muir) Build * LUCENE-1904: Moved wordnet-based synonym support from contrib/memory into contrib/wordnet. (Robert Muir) * LUCENE-2031: Moved PatternAnalyzer from contrib/memory into contrib/analyzers/common, under miscellaneous. (Robert Muir) ======================= Release 2.9.1 ======================= Changes in backwards compatibility policy * LUCENE-2002: Add required Version matchVersion argument when constructing ComplexPhraseQueryParser and default (as of 2.9) enablePositionIncrements to true to match StandardAnalyzer's default. Also added required matchVersion to most of the analyzers (Uwe Schindler, Mike McCandless) Changes in runtime behavior * LUCENE-1963: ArabicAnalyzer now lowercases before checking the stopword list. This has no effect on Arabic text, but if you are using a custom stopword list that contains some non-Arabic words, you'll need to fully reindex. (DM Smith via Robert Muir) Bug fixes * LUCENE-1953: FastVectorHighlighter: small fragCharSize can cause StringIndexOutOfBoundsException. (Koji Sekiguchi) * LUCENE-1929: Highlighter throws exception on NumericRangeQuery and does not support deprecated RangeQuery. (Mark Miller) * LUCENE-2001: Wordnet Syns2Index incorrectly parses synonyms that contain a single quote. (Parag H. Dave via Robert Muir) * LUCENE-2003: Highlighter doesn't respect position increments other than 1 with PhraseQuerys. (Uwe Schindler, Mark Miller) * LUCENE-1954: InstantiatedIndexWriter: Fixed ClassCastException with NumericField because of incorrect unchecked cast: Document.getFields() returns List. (Bernd Fondermann via Uwe Schindler) * LUCENE-2014: SmartChineseAnalyzer did not properly clear attributes in WordTokenFilter. If enablePositionIncrements is set for StopFilter, then this could create invalid position increments, causing IndexWriter to crash. (Robert Muir, Uwe Schindler) * LUCENE-2013: SpanRegexQuery does not work with QueryScorer. (Benjamin Keil via Mark Miller) ======================= Release 2.9.0 ======================= Changes in runtime behavior * LUCENE-1505: Local lucene now uses org.apache.lucene.util.NumericUtils for all number conversion. You'll need to fully re-index any previously created indexes. This isn't a break in back-compatibility because local Lucene has not yet been released. (Mike McCandless) * LUCENE-1758: ArabicAnalyzer now uses the light10 algorithm, has a refined default stopword list, and lowercases non-Arabic text. You'll need to fully re-index any previously created indexes. This isn't a break in back-compatibility because ArabicAnalyzer has not yet been released. (Robert Muir) API Changes * LUCENE-1695: Update the Highlighter to use the new TokenStream API. This issue breaks backwards compatibility with some public classes. If you have implemented custom Fragmenters or Scorers, you will need to adjust them to work with the new TokenStream API. Rather than getting passed a Token at a time, you will be given a TokenStream to init your impl with - store the Attributes you are interested in locally and access them on each call to the method that used to pass a new Token. Look at the included updated impls for examples. (Mark Miller) * LUCENE-1460: Change contrib TokenStreams/Filters to use the new TokenStream API. (Robert Muir, Michael Busch) * LUCENE-1775, LUCENE-1903: Change remaining TokenFilters (shingle, prefix-suffix) to use the new TokenStream API. ShingleFilter is much more efficient now, it clones much less often and computes the tokens mostly on the fly now. Also added more tests. (Robert Muir, Michael Busch, Uwe Schindler, Chris Harris) * LUCENE-1685: The position aware SpanScorer has become the default scorer for Highlighting. The SpanScorer implementation has replaced QueryScorer and the old term highlighting QueryScorer has been renamed to QueryTermScorer. Multi-term queries are also now expanded by default. If you were previously rewriting the query for multi-term query highlighting, you should no longer do that (unless you switch to using QueryTermScorer). The SpanScorer API (now QueryScorer) has also been improved to more closely match the API of the previous QueryScorer implementation. (Mark Miller) * LUCENE-1793: Deprecate the custom encoding support in the Greek and Russian Analyzers. If you need to index text in these encodings, please use Java's character set conversion facilities (InputStreamReader, etc) during I/O, so that Lucene can analyze this text as Unicode instead. (Robert Muir) Bug fixes * LUCENE-1423: InstantiatedTermEnum#skipTo(Term) throws ArrayIndexOutOfBounds on empty index. (Karl Wettin) * LUCENE-1462: InstantiatedIndexWriter did not reset pre analyzed TokenStreams the same way IndexWriter does. Parts of InstantiatedIndex was not Serializable. (Karl Wettin) * LUCENE-1510: InstantiatedIndexReader#norms methods throws NullPointerException on empty index. (Karl Wettin, Robert Newson) * LUCENE-1514: ShingleMatrixFilter#next(Token) easily throws a StackOverflowException due to recursive invocation. (Karl Wettin) * LUCENE-1548: Fix distance normalization in LevenshteinDistance to not produce negative distances (Thomas Morton via Mike McCandless) * LUCENE-1490: Fix latin1 conversion of HALFWIDTH_AND_FULLWIDTH_FORMS characters to only apply to the correct subset (Daniel Cheng via Mike McCandless) * LUCENE-1576: Fix BrazilianAnalyzer to downcase tokens after StandardTokenizer so that stop words with mixed case are filtered out. (Rafael Cunha de Almeida, Douglas Campos via Mike McCandless) * LUCENE-1491: EdgeNGramTokenFilter no longer stops on tokens shorter than minimum n-gram size. (Todd Teak via Otis Gospodnetic) * LUCENE-1683: Fixed JavaUtilRegexCapabilities (an impl used by RegexQuery) to use Matcher.matches() not Matcher.lookingAt() so that the regexp must match the entire string, not just a prefix. (Trejkaz via Mike McCandless) * LUCENE-1792: Fix new query parser to set rewrite method for multi-term queries. (Luis Alves, Mike McCandless via Michael Busch) * LUCENE-1828: Fix memory index to call TokenStream.reset() and TokenStream.end(). (Tim Smith via Michael Busch) * LUCENE-1912: Fix fast-vector-highlighter issue when two or more terms are concatenated (Koji Sekiguchi via Mike McCandless) New features * LUCENE-1531: Added support for BoostingTermQuery to XML query parser. (Karl Wettin) * LUCENE-1435: Added contrib/collation, a CollationKeyFilter allowing you to convert tokens into CollationKeys encoded using IndexableBinaryStringTools. This allows for faster RangeQuery when a field needs to use a custom Collator. (Steven Rowe via Mike McCandless) * LUCENE-1591: EnWikiDocMaker, LineDocMaker, WriteLineDoc can now read/write bz2 using Apache commons compress library. This means you can download the .bz2 export from http://wikipedia.org and immediately index it. (Shai Erera via Mike McCandless) * LUCENE-1629: Add SmartChineseAnalyzer to contrib/analyzers. It improves on CJKAnalyzer and ChineseAnalyzer by handling Chinese sentences properly. SmartChineseAnalyzer uses a Hidden Markov Model to tokenize Chinese words in a more intelligent way. (Xiaoping Gao via Mike McCandless) * LUCENE-1676: Added DelimitedPayloadTokenFilter class for automatically adding payloads "in-stream" (Grant Ingersoll) * LUCENE-1578: Support for loading unoptimized readers to the constructor of InstantiatedIndex. (Karl Wettin) * LUCENE-1704: Allow specifying the Tidy configuration file when parsing HTML docs with contrib/ant. (Keith Sprochi via Mike McCandless) * LUCENE-1522: Added contrib/fast-vector-highlighter, a new alternative highlighter. (Koji Sekiguchi via Mike McCandless) * LUCENE-1740: Added "analyzer" command to Lucli, enabling changing the analyzer from the default StandardAnalyzer. (Bernd Fondermann via Mike McCandless) * LUCENE-1272: Add get/setBoost to MoreLikeThis. (Jonathan Leibiusky via Mike McCandless) * LUCENE-1745: Added constructors to JakartaRegexpCapabilities and JavaUtilRegexCapabilities as well as static flags to support configuring a RegexCapabilities implementation with the implementation-specific modifier flags. Allows for callers to customize the RegexQuery using the implementation-specific options and fine tune how regular expressions are compiled and matched. (Marc Zampetti zampettim@aim.com via Mike McCandless) * LUCENE-1567: Added a new QueryParser framework, that allows implementing a new query syntax in a flexible and efficient way. This new QueryParser will be moved to Lucene's core in release 3.0 and will then replace the current core QueryParser, which has been deprecated with this patch. (Luis Alves and Adriano Campos via Michael Busch) * LUCENE-1486: Added ComplexPhraseQueryParser, an extension of QueryParser that allows a subset of the Lucene query language to be embedded in PhraseQuerys. Wildcard, Range, and Fuzzy queries, as well as limited boolean logic, can be used within quote operators with this parser, ie: "(jo* -john) smyth~". (Mark Harwood via Mark Miller) * Added web-based demo of functionality in contrib's XML Query Parser packaged as War file (Mark Harwood) * LUCENE-1406: Added Arabic analyzer. (Robert Muir via Grant Ingersoll) * LUCENE-1628: Added Persian analyzer. (Robert Muir) * LUCENE-1813: Add option to ReverseStringFilter to mark reversed tokens. (Andrzej Bialecki via Robert Muir) Optimizations * LUCENE-1643: Re-use the collation key (RawCollationKey) for better performance, in ICUCollationKeyFilter. (Robert Muir via Mike McCandless) * LUCENE-1794: Implement TokenStream reuse for contrib Analyzers, and implement reset() for TokenStreams to support reuse. (Robert Muir) Documentation * LUCENE-1876: added missing package level documentation for numerous contrib packages. (Steven Rowe & Robert Muir) Build * LUCENE-1728: Split contrib/analyzers into common and smartcn modules. Contrib/analyzers now builds an additional lucene-smartcn Jar file. All smartcn classes are not included in the lucene-analyzers JAR file. (Robert Muir via Simon Willnauer) * LUCENE-1829: Fix contrib query parser to properly create javacc files. (Jan-Pascal and Luis Alves via Michael Busch) Test Cases ======================= Release 2.4.0 ======================= Changes in runtime behavior (None) API Changes 1. (None) Bug fixes 1. LUCENE-1312: Added full support for InstantiatedIndexReader#getFieldNames() and tests that assert that deleted documents behaves as they should (they did). (Jason Rutherglen, Karl Wettin) 2. LUCENE-1318: InstantiatedIndexReader.norms(String, b[], int) didn't treat the array offset right. (Jason Rutherglen via Karl Wettin) New features 1. LUCENE-1320: ShingleMatrixFilter, multidimensional shingle token filter. (Karl Wettin) 2. LUCENE-1142: Updated Snowball package, org.tartarus distribution revision 500. Introducing Hungarian, Turkish and Romanian support, updated older stemmers and optimized (reflectionless) SnowballFilter. IMPORTANT NOTICE ON BACKWARDS COMPATIBILITY: an index created using the 2.3.2 (or older) might not be compatible with these updated classes as some algorithms have changed. (Karl Wettin) 3. LUCENE-1016: TermVectorAccessor, transparent vector space access via stored vectors or by resolving the inverted index. (Karl Wettin) Documentation (None) Build (None) Test Cases (None)