Lucene Benchmark Contrib Change Log

The Benchmark contrib package contains code for benchmarking Lucene in a
variety of ways.

05/25/2011
  LUCENE-3137: ExtractReuters supports out-dir param suffixed by a slash.
  (Doron Cohen)

03/31/2011
  Updated ReadTask to the new method for obtaining a top-level deleted docs
  bitset. Also checks the bitset for null, which occurs when there are no
  deleted docs. (Steve Rowe, Mike McCandless)

  Updated NewAnalyzerTask and NewShingleAnalyzerTask to handle analyzers
  in the new org.apache.lucene.analysis.core package (KeywordAnalyzer,
  SimpleAnalyzer, etc.) (Steve Rowe, Robert Muir)

  Updated ReadTokensTask to convert tokens to their indexed forms
  (char[]->byte[]), just as the indexer does. This allows measurement
  of the conversion process, which is important for analysis components
  that customize it, e.g. (ICU)CollationKeyFilter. As a result, benchmarks
  that incorporate this task will no longer be directly comparable between
  3.X and 4.0. (Robert Muir, Steve Rowe)

03/24/2011
  LUCENE-2977: WriteLineDocTask now automatically detects how to write -
  GZip or BZip2 or Plain-text - according to the output file extension.
  Property bzip.compression of WriteLineDocTask was removed. (Doron Cohen)

03/23/2011
  LUCENE-2980: Benchmark's ContentSource no longer requires lower case file
  suffixes for detecting file type (gzip/bzip2/text). As part of this fix,
  worked around an issue with gzip input streams remaining open
  (see COMPRESS-127). (Doron Cohen)

03/22/2011
  LUCENE-2978: Upgrade benchmark's commons-compress from 1.0 to 1.1, as
  the move of gzip decompression in LUCENE-1540 from Java's GZIPInputStream
  to commons-compress 1.0 made it 15 times slower. In 1.1 no such slow-down
  is observed. (Doron Cohen)

03/21/2011
  LUCENE-2958: WriteLineDocTask improvements - allow emitting line docs for
  empty docs as well, and be flexible about which fields are added to the
  line file. For this, a header line was added to the line file. That
  header is examined by LineDocSource.
  Old line files which have no header line are handled as before, imposing
  the default header. (Doron Cohen, Shai Erera, Mike McCandless)

03/21/2011
  LUCENE-2964: Allow benchmark tasks from alternative packages,
  specified through a new property "alt.tasks.packages".
  (Doron Cohen, Shai Erera)

03/20/2011
  LUCENE-2963: Easier way to run benchmark, by calling
  Benchmark.exec(alg-file). (Doron Cohen)

03/10/2011
  LUCENE-2961: Removed lib/xml-apis.jar, since JVM 1.5+ already contains the
  JAXP 1.3 interface classes it provides.

02/05/2011
  LUCENE-1540: Improvements to contrib.benchmark for TREC collections.
  ContentSource can now process plain text files, gzip files, and bzip2
  files. TREC doc parsing now handles the TREC gov2 collection and TREC
  disks 4&5-CR collection (both used by many TREC tasks).
  (Shai Erera, Doron Cohen)

01/31/2011
  LUCENE-1591: Roll back to xerces-2.9.1-patched-XERCESJ-1257.jar to work
  around XERCESJ-1257, which we hit on current Wikipedia XML export
  (ENWIKI-20110115-pages-articles.xml) with xerces-2.10.0.jar.
  (Mike McCandless)

01/26/2011
  LUCENE-929: ExtractReuters first extracts to a tmp dir and then renames.
  That way, if a previous extract attempt failed, "ant extract-reuters"
  will still extract the files.
  (Shai Erera, Doron Cohen, Grant Ingersoll)

01/24/2011
  LUCENE-2885: Add WaitForMerges task (calls IndexWriter.waitForMerges()).
  (Mike McCandless)

10/10/2010
  The locally built patched version of the Xerces-J jar introduced
  as part of LUCENE-1591 is no longer required, because Xerces
  2.10.0, which contains a fix for XERCESJ-1257 (see
  http://svn.apache.org/viewvc?view=revision&revision=554069),
  was released earlier this year. Upgraded
  xerces-2.9.1-patched-XERCESJ-1257.jar and xml-apis-2.9.0.jar
  to xercesImpl-2.10.0.jar and xml-apis-2.10.0.jar. (Steven Rowe)

8/2/2010
  LUCENE-2582: You can now specify the default codec to use for
  writing new segments by adding
  default.codec = Pulsing (for example), in the alg file.
  (Mike McCandless)

4/27/2010
  WriteLineDocTask now supports multi-threading. Also,
  StringBufferReader was renamed to StringBuilderReader and works on
  StringBuilder now. In addition, LongToEnglishContentSource starts from 0
  (instead of Long.MIN_VALUE+10) and wraps around to Long.MIN_VALUE (if you
  ever hit Long.MAX_VALUE). (Shai Erera)

4/07/2010
  LUCENE-2377: Enable the use of NoMergePolicy and NoMergeScheduler by
  CreateIndexTask. (Shai Erera)

3/28/2010
  LUCENE-2353: Fixed bug in Config where Windows absolute path property
  values were incorrectly handled. (Shai Erera)

3/24/2010
  LUCENE-2343: Added support for benchmarking collectors.
  (Grant Ingersoll, Shai Erera)

2/21/2010
  LUCENE-2254: Add support to the quality package for running
  experiments with any combination of Title, Description, and Narrative.
  (Robert Muir)

1/28/2010
  LUCENE-2223: Add a benchmark for ShingleFilter. You can wrap any
  analyzer with ShingleAnalyzerWrapper and specify shingle parameters
  with the NewShingleAnalyzer task. (Steven Rowe via Robert Muir)

1/14/2010
  LUCENE-2210: TrecTopicsReader now properly reads descriptions and
  narratives from trec topics files. (Robert Muir)

1/11/2010
  LUCENE-2181: Add a benchmark for collation. This adds NewLocaleTask,
  which sets a Locale in the run data for collation to use, and can be
  used in the future for benchmarking localized range queries and sorts.
  Also add NewCollationAnalyzerTask, which works with both JDK and ICU
  Collator implementations. Fix ReadTokensTask to not tokenize fields
  unless they should be tokenized according to DocMaker config. The
  easiest way to run the benchmark is to run 'ant collation'.
  (Steven Rowe via Robert Muir)

12/22/2009
  LUCENE-2178: Allow multiple locations to add to the class path with
  -Dbenchmark.ext.classpath=...
  when running "ant run-task" (Steven Rowe via Mike McCandless)

12/17/2009
  LUCENE-2168: Allow negative relative thread priority for BG tasks.
  (Mike McCandless)

12/07/2009
  LUCENE-2106: ReadTask does not close its Reader when
  OpenReader/CloseReader are not used. (Mark Miller)

11/17/2009
  LUCENE-2079: Allow specifying delta thread priority after the "&";
  added log.time.step.msec to print per-time-period counts; fixed
  NearRealTimeReaderTask to print reopen times (in msec) of each reopen,
  at the end. (Mike McCandless)

11/13/2009
  LUCENE-2050: Added ability to run tasks within a serial sequence in
  the background, by appending "&". The tasks are stopped & joined at
  the end of the sequence. Also added Wait and RollbackIndex tasks.
  Genericized NearRealTimeReaderTask to only reopen the reader
  (previously it spawned its own thread, and also did searching).
  Also changed the API of PerfRunData.getIndexReader: it now returns a
  reference, and it's your job to decRef the reader when you're done
  using it. (Mike McCandless)

11/12/2009
  LUCENE-2059: Allow TrecContentSource not to change the docname.
  Previously, it would always append the iteration # to the docname.
  With the new option content.source.excludeIteration, you can disable
  this. The resulting index can then be used with the quality package to
  measure relevance. (Robert Muir)

11/12/2009
  LUCENE-2058: Specify trec_eval submission output from the command line.
  Previously, 4 arguments were required, but the third was unused. The
  third argument is now the desired location of submission.txt.
  (Robert Muir)

11/08/2009
  LUCENE-2044: Added delete.percent.rand.seed to seed the Random instance
  used by DeleteByPercentTask. (Mike McCandless)

11/07/2009
  LUCENE-2043: Fix CommitIndexTask to also commit pending IndexReader
  changes. (Mike McCandless)

11/07/2009
  LUCENE-2042: Added print.hits.field, to print each hit from the
  Search* tasks.
  (Mike McCandless)

11/04/2009
  LUCENE-2029: Added doc.body.stored and doc.body.tokenized; each
  falls back to the non-body variant as its default. (Mike McCandless)

10/28/2009
  LUCENE-1994: Fix thread safety of EnwikiContentSource and DocMaker
  when doc.reuse.fields is false. Also made doc.reuse.fields=true
  thread safe. (Mark Miller, Shai Erera, Mike McCandless)

8/4/2009
  LUCENE-1770: Add EnwikiQueryMaker. (Mark Miller)

8/04/2009
  LUCENE-1773: Add FastVectorHighlighter tasks. This change is a
  non-backwards compatible change in how subclasses of ReadTask define
  a highlighter. The methods doHighlight, isMergeContiguousFragments,
  maxNumFragments and getHighlighter are no longer used and have been
  marked deprecated and made package-private, so that existing subclasses
  get a compile-time error. Instead, the new getBenchmarkHighlighter
  method should return an appropriate highlighter for the task. The
  configuration of the highlighter tasks (maxFrags, mergeContiguous, etc.)
  is now accepted as params to the task. (Koji Sekiguchi via Mike McCandless)

8/03/2009
  LUCENE-1778: Add support for log.step setting per task type. Previously,
  if you included a log.step line in the .alg file, it applied to all
  tasks. Now, you can include log.step.AddDoc or log.step.DeleteDoc (for
  example) to control logging for just those tasks. If you want to omit
  logging for any other task, include log.step=-1. The syntax is
  "log.step." together with the Task's 'short' name (i.e., without the
  'Task' part). (Shai Erera via Mark Miller)

7/24/2009
  LUCENE-1595: Deprecate LineDocMaker and EnwikiDocMaker in favor of
  using DocMaker directly, with content.source = LineDocSource or
  EnwikiContentSource. NOTE: with this change, the "id" field from
  the Wikipedia XML export is now indexed as the "docname" field
  (previously it was indexed as "docid"). Additionally, the
  SearchWithSort task now accepts all types that SortField can accept
  and no longer falls back to SortField.AUTO, which has been deprecated.
  (Mike McCandless)

7/20/2009
  LUCENE-1755: Fix WriteLineDocTask to output a document if it contains
  either a title or body (or both). (Shai Erera via Mark Miller)

7/14/2009
  LUCENE-1725: Fix the example Sort algorithm - auto is now deprecated and
  no longer works with Benchmark. Benchmark will now throw an exception if
  you specify sort fields without a type. The example sort algorithm is
  now typed. (Mark Miller)

7/6/2009
  LUCENE-1730: Fix TrecContentSource to use ISO-8859-1 when reading the
  TREC files, unless a different encoding is specified. Additionally,
  ContentSource now supports a content.source.encoding parameter in the
  configuration file. (Shai Erera via Mark Miller)

6/26/2009
  LUCENE-1716: Added the following support:
    doc.tokenized.norms: specifies whether to store norms
    doc.body.tokenized.norms: special attribute for the body field
    doc.index.props: specifies whether DocMaker should index the
      properties set on DocData
    writer.info.stream: specifies the info stream to set on IndexWriter
      (supported values are: SystemOut, SystemErr and a file name).
  (Shai Erera via Mike McCandless)

6/23/09
  LUCENE-1714: WriteLineDocTask incorrectly normalized text, by replacing
  only occurrences of "\t" with a space. It now replaces "\r\n" in addition
  to that, so that LineDocMaker won't fail.
  (Shai Erera via Michael McCandless)

6/17/09
  LUCENE-1595: This issue breaks previous external algorithms. DocMaker
  has been replaced with a concrete class which accepts a ContentSource
  for iterating over a content source's documents. Most of the old
  DocMakers were changed to a ContentSource implementation, and DocMaker
  is now a default document creation impl that provides an easy way for
  reusing fields. When [doc.maker] is not defined in an algorithm, the
  new DocMaker is the default. If you have .alg files which specify a
  DocMaker (like ReutersDocMaker), you should change the [doc.maker]
  line to:
  [content.source=org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource]

  i.e.
  doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
  becomes
  content.source=org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource

  doc.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleDocMaker
  becomes
  content.source=org.apache.lucene.benchmark.byTask.feeds.SingleDocSource

  Also, PerfTask now logs a message in tearDown() rather than each Task
  doing its own logging. A new setting called [log.step] is consulted to
  determine how often to log. [doc.add.log.step] is no longer a valid
  setting. For easy migration of current .alg files, rename
  [doc.add.log.step] to [log.step] and [doc.delete.log.step] to
  [delete.log.step].

  Additionally, [doc.maker.forever] should be changed to
  [content.source.forever].
  (Shai Erera via Mark Miller)

6/12/09
  LUCENE-1539: Added DeleteByPercentTask which enables deleting a
  percentage of documents and searching on them. Changed CommitIndex
  to optionally accept a label (recorded as userData=