/* * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ using System; using Document = Lucene.Net.Documents.Document; using AlreadyClosedException = Lucene.Net.Store.AlreadyClosedException; using Directory = Lucene.Net.Store.Directory; using FSDirectory = Lucene.Net.Store.FSDirectory; using Lock = Lucene.Net.Store.Lock; using LockObtainFailedException = Lucene.Net.Store.LockObtainFailedException; using BitVector = Lucene.Net.Util.BitVector; using Analyzer = Lucene.Net.Analysis.Analyzer; using Similarity = Lucene.Net.Search.Similarity; using System.Collections; namespace Lucene.Net.Index { /// An IndexWriter creates and maintains an index. ///

/// The create argument to the constructor determines whether a new
/// index is created, or whether an existing index is opened. Note that
/// you can open an index with create=true even while readers are using
/// the index. The old readers will continue to search the "point in
/// time" snapshot they had opened, and won't see the newly created
/// index until they re-open. There are also constructors with no
/// create argument which will create a new index if there is not
/// already an index at the provided path and otherwise open the
/// existing index.
///
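/// For example, a minimal sketch (the path and the StandardAnalyzer
/// choice below are placeholders, not requirements):
/// 
///     // create a new index, overwriting any existing index at that path:
///     IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
///     writer.Close();
///     // append if an index already exists there, otherwise create one:
///     IndexWriter writer2 = new IndexWriter("/tmp/index", new StandardAnalyzer());
///     writer2.Close();
///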

/// In either case, documents are added with addDocument and removed
/// with deleteDocuments. A document can be updated with updateDocument
/// (which just deletes and then adds the entire document). When
/// finished adding, deleting and updating documents, close should be
/// called.
///
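/// A sketch of that lifecycle (field names, values and the id-based
/// update scheme are illustrative assumptions):
/// 
///     Document doc = new Document();
///     doc.Add(new Field("id", "42", Field.Store.YES, Field.Index.UN_TOKENIZED));
///     doc.Add(new Field("body", "hello world", Field.Store.NO, Field.Index.TOKENIZED));
///     writer.AddDocument(doc);                          // add
///     writer.DeleteDocuments(new Term("id", "41"));     // delete by term
///     writer.UpdateDocument(new Term("id", "42"), doc); // delete then re-add
///     writer.Close();                                   // flush and commit
///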

/// These changes are buffered in memory and periodically flushed to the
/// {@link Directory} (during the above method calls). A flush is
/// triggered when there are enough buffered deletes (see {@link
/// #setMaxBufferedDeleteTerms}) or enough added documents since the
/// last flush, whichever is sooner. For the added documents, flushing
/// is triggered either by RAM usage of the documents (see {@link
/// #setRAMBufferSizeMB}) or the number of added documents. The default
/// is to flush when RAM usage hits 16 MB. For best indexing speed you
/// should flush by RAM usage with a large RAM buffer. You can also
/// force a flush by calling {@link #flush}. When a flush occurs, both
/// pending deletes and added documents are flushed to the index. A
/// flush may also trigger one or more segment merges which by default
/// run with a background thread so as not to block the addDocument
/// calls (see below for changing the {@link MergeScheduler}).
///
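/// For example, to flush by RAM usage with a larger buffer (the 48 MB
/// figure is only an illustration):
/// 
///     writer.SetRAMBufferSizeMB(48.0);   // flush once buffered docs use ~48 MB
///     writer.Flush();                    // or force a flush explicitly
///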

/// The optional autoCommit argument to the constructors controls
/// visibility of the changes to {@link IndexReader} instances reading
/// the same index. When this is false, changes are not visible until
/// {@link #Close()} is called. Note that changes will still be flushed
/// to the {@link Lucene.Net.Store.Directory} as new files, but are not
/// committed (no new segments_N file is written referencing the new
/// files) until {@link #close} is called. If something goes terribly
/// wrong (for example the JVM crashes) before {@link #Close()}, then
/// the index will reflect none of the changes made (it will remain in
/// its starting state). You can also call {@link #Abort()}, which
/// closes the writer without committing any changes, and removes any
/// index files that had been flushed but are now unreferenced. This
/// mode is useful for preventing readers from refreshing at a bad time
/// (for example after you've done all your deletes but before you've
/// done your adds). It can also be used to implement simple
/// single-writer transactional semantics ("all or none").
///
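/// A sketch of the "all or none" pattern (the directory and documents
/// are assumed to exist already):
/// 
///     IndexWriter writer = new IndexWriter(dir, false, new StandardAnalyzer());
///     try {
///         writer.AddDocument(doc1);
///         writer.AddDocument(doc2);
///         writer.Close();    // commit: changes become visible to new readers
///     } catch (System.IO.IOException) {
///         writer.Abort();    // discard everything since the writer was opened
///     }
///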

/// When autoCommit is true then every flush is also a commit ({@link
/// IndexReader} instances will see each flush as changes to the index).
/// This is the default, to match the behavior before 2.2. When running
/// in this mode, be careful not to refresh your readers while optimize
/// or segment merges are taking place as this can tie up substantial
/// disk space.
///

/// Regardless of autoCommit, an {@link IndexReader} or {@link
/// Lucene.Net.Search.IndexSearcher} will only see the index as of the
/// "point in time" that it was opened. Any changes committed to the
/// index after the reader was opened are not visible until the reader
/// is re-opened.
///

/// If an index will not have more documents added for a while and
/// optimal search performance is desired, then the optimize method
/// should be called before the index is closed.
///

/// Opening an IndexWriter creates a lock file for the directory in use.
/// Trying to open another IndexWriter on the same directory will lead
/// to a {@link LockObtainFailedException}. The {@link
/// LockObtainFailedException} is also thrown if an IndexReader on the
/// same directory is used to delete documents from the index.
///
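/// A defensive sketch for the second-writer case (how to retry or back
/// off is left to the caller):
/// 
///     try {
///         IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer());
///     } catch (LockObtainFailedException) {
///         // another IndexWriter (or a deleting IndexReader) holds write.lock
///     }
///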
/// ///

/// Expert: IndexWriter allows an optional {@link IndexDeletionPolicy}
/// implementation to be specified. You can use this to control when
/// prior commits are deleted from the index. The default policy is
/// {@link KeepOnlyLastCommitDeletionPolicy} which removes all prior
/// commits as soon as a new commit is done (this matches behavior
/// before 2.2). Creating your own policy can allow you to explicitly
/// keep previous "point in time" commits alive in the index for some
/// time, to allow readers to refresh to the new commit without having
/// the old commit deleted out from under them. This is necessary on
/// filesystems like NFS that do not support "delete on last close"
/// semantics, which Lucene's "point in time" search normally relies on.
///
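/// For example, passing the default policy explicitly (a custom
/// IndexDeletionPolicy implementation would be supplied the same way):
/// 
///     IndexWriter writer = new IndexWriter(dir, true, new StandardAnalyzer(),
///                                          new KeepOnlyLastCommitDeletionPolicy());
///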

/// Expert: IndexWriter allows you to separately change the {@link
/// MergePolicy} and the {@link MergeScheduler}. The {@link MergePolicy}
/// is invoked whenever there are changes to the segments in the index.
/// Its role is to select which merges to do, if any, and return a
/// {@link MergePolicy.MergeSpecification} describing the merges. It
/// also selects merges to do for optimize(). (The default is {@link
/// LogByteSizeMergePolicy}.) Then, the {@link MergeScheduler} is
/// invoked with the requested merges and it decides when and how to
/// run the merges. The default is {@link ConcurrentMergeScheduler}.
///
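/// For example, setting the defaults explicitly (any other MergePolicy
/// or MergeScheduler implementation is installed the same way):
/// 
///     writer.SetMergePolicy(new LogByteSizeMergePolicy());
///     writer.SetMergeScheduler(new ConcurrentMergeScheduler());
///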
/* * Clarification: Check Points (and commits) * Being able to set autoCommit=false allows IndexWriter to flush and * write new index files to the directory without writing a new segments_N * file which references these new files. It also means that the state of * the in memory SegmentInfos object is different than the most recent * segments_N file written to the directory. * * Each time the SegmentInfos is changed, and matches the (possibly * modified) directory files, we have a new "check point". * If the modified/new SegmentInfos is written to disk - as a new * (generation of) segments_N file - this check point is also an * IndexCommitPoint. * * With autoCommit=true, every checkPoint is also a CommitPoint. * With autoCommit=false, some checkPoints may not be commits. * * A new checkpoint always replaces the previous checkpoint and * becomes the new "front" of the index. This allows the IndexFileDeleter * to delete files that are referenced only by stale checkpoints. * (files that were created since the last commit, but are no longer * referenced by the "front" of the index). For this, IndexFileDeleter * keeps track of the last non commit checkpoint. */ public class IndexWriter { private void InitBlock() { similarity = Similarity.GetDefault(); } /// Default value for the write lock timeout (1,000). /// /// public static long WRITE_LOCK_TIMEOUT = 1000; private long writeLockTimeout = WRITE_LOCK_TIMEOUT; /// Name of the write lock in the index. public const System.String WRITE_LOCK_NAME = "write.lock"; /// /// /// /// public static readonly int DEFAULT_MERGE_FACTOR; /// Value to denote a flush trigger is disabled public const int DISABLE_AUTO_FLUSH = - 1; /// Disabled by default (because IndexWriter flushes by RAM usage /// by default). Change using {@link #SetMaxBufferedDocs(int)}. /// public static readonly int DEFAULT_MAX_BUFFERED_DOCS = DISABLE_AUTO_FLUSH; /// Default value is 16 MB (which means flush when buffered /// docs consume 16 MB RAM). Change using {@link #setRAMBufferSizeMB}. /// public const double DEFAULT_RAM_BUFFER_SIZE_MB = 16.0; /// Disabled by default (because IndexWriter flushes by RAM usage /// by default). Change using {@link #SetMaxBufferedDeleteTerms(int)}. /// public static readonly int DEFAULT_MAX_BUFFERED_DELETE_TERMS = DISABLE_AUTO_FLUSH; /// /// /// /// public static readonly int DEFAULT_MAX_MERGE_DOCS; /// Default value is 10,000. Change using {@link #SetMaxFieldLength(int)}. public const int DEFAULT_MAX_FIELD_LENGTH = 10000; /// Default value is 128. Change using {@link #SetTermIndexInterval(int)}. public const int DEFAULT_TERM_INDEX_INTERVAL = 128; /// Absolute hard maximum length for a term. If a term /// arrives from the analyzer longer than this length, it /// is skipped and a message is printed to infoStream, if /// set (see {@link #setInfoStream}). /// public static readonly int MAX_TERM_LENGTH; // The normal read buffer size defaults to 1024, but // increasing this during merging seems to yield // performance gains. However we don't want to increase // it too much because there are quite a few // BufferedIndexInputs created during merging. See // LUCENE-888 for details. 
private const int MERGE_READ_BUFFER_SIZE = 4096; // Used for printing messages private static System.Object MESSAGE_ID_LOCK = new System.Object(); private static int MESSAGE_ID = 0; private int messageID = - 1; private Directory directory; // where this index resides private Analyzer analyzer; // how to analyze text private Similarity similarity; // how to normalize private bool commitPending; // true if segmentInfos has changes not yet committed private SegmentInfos rollbackSegmentInfos; // segmentInfos we will fallback to if the commit fails private SegmentInfos localRollbackSegmentInfos; // segmentInfos we will fallback to if the commit fails private bool localAutoCommit; // saved autoCommit during local transaction private bool autoCommit = true; // false if we should commit only on close private SegmentInfos segmentInfos = new SegmentInfos(); // the segments private DocumentsWriter docWriter; private IndexFileDeleter deleter; private System.Collections.Hashtable segmentsToOptimize = new System.Collections.Hashtable(); // used by optimize to note those needing optimization private Lock writeLock; private int termIndexInterval = DEFAULT_TERM_INDEX_INTERVAL; private bool closeDir; private bool closed; private bool closing; // Holds all SegmentInfo instances currently involved in // merges private System.Collections.Hashtable mergingSegments = new System.Collections.Hashtable(); private MergePolicy mergePolicy = new LogByteSizeMergePolicy(); private MergeScheduler mergeScheduler = new ConcurrentMergeScheduler(); private System.Collections.ArrayList pendingMerges = new System.Collections.ArrayList(); private System.Collections.Hashtable runningMerges = new System.Collections.Hashtable(); private System.Collections.IList mergeExceptions = new System.Collections.ArrayList(); private long mergeGen; private bool stopMerges; /// Used internally to throw an {@link /// AlreadyClosedException} if this IndexWriter has been /// closed. /// /// AlreadyClosedException if this IndexWriter is protected internal void EnsureOpen() { if (closed) { throw new AlreadyClosedException("this IndexWriter is closed"); } } /// Prints a message to the infoStream (if non-null), /// prefixed with the identifying information for this /// writer and the thread that's calling it. /// public virtual void Message(System.String message) { if (infoStream != null) infoStream.WriteLine("IW " + messageID + " [" + SupportClass.ThreadClass.Current().Name + "]: " + message); } private void SetMessageID() { lock (this) { if (infoStream != null && messageID == - 1) { lock (MESSAGE_ID_LOCK) { messageID = MESSAGE_ID++; } } } } /// Casts current mergePolicy to LogMergePolicy, and throws /// an exception if the mergePolicy is not a LogMergePolicy. /// private LogMergePolicy GetLogMergePolicy() { if (mergePolicy is LogMergePolicy) return (LogMergePolicy) mergePolicy; else throw new System.ArgumentException("this method can only be called when the merge policy is the default LogMergePolicy"); } ///

/// Get the current setting of whether newly flushed segments will use
/// the compound file format. Note that this just returns the value
/// previously set with setUseCompoundFile(boolean), or the default
/// value (true). You cannot use this to query the status of previously
/// flushed segments.
///
/// Note that this method is a convenience method: it just calls
/// mergePolicy.getUseCompoundFile as long as mergePolicy is an instance
/// of {@link LogMergePolicy}. Otherwise an IllegalArgumentException is
/// thrown.
///
/// /// public virtual bool GetUseCompoundFile() { return GetLogMergePolicy().GetUseCompoundFile(); } ///

/// Setting to turn on usage of a compound file. When on, multiple files
/// for each segment are merged into a single file when a new segment is
/// flushed.
///
/// Note that this method is a convenience method: it just calls
/// mergePolicy.setUseCompoundFile as long as mergePolicy is an instance
/// of {@link LogMergePolicy}. Otherwise an IllegalArgumentException is
/// thrown.
///
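/// For example (valid only while the merge policy is a LogMergePolicy,
/// such as the default LogByteSizeMergePolicy):
/// 
///     writer.SetUseCompoundFile(true);
///     bool usingCompound = writer.GetUseCompoundFile();   // now true
///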
public virtual void SetUseCompoundFile(bool value_Renamed) { GetLogMergePolicy().SetUseCompoundFile(value_Renamed); GetLogMergePolicy().SetUseCompoundDocStore(value_Renamed); } /// Expert: Set the Similarity implementation used by this IndexWriter. /// /// /// /// public virtual void SetSimilarity(Similarity similarity) { EnsureOpen(); this.similarity = similarity; } /// Expert: Return the Similarity implementation used by this IndexWriter. /// ///

/// This defaults to the current value of {@link Similarity#GetDefault()}.
///

public virtual Similarity GetSimilarity() { EnsureOpen(); return this.similarity; } /// Expert: Set the interval between indexed terms. Large values cause less /// memory to be used by IndexReader, but slow random-access to terms. Small /// values cause more memory to be used by an IndexReader, and speed /// random-access to terms. /// /// This parameter determines the amount of computation required per query /// term, regardless of the number of documents that contain that term. In /// particular, it is the maximum number of other terms that must be /// scanned before a term is located and its frequency and position information /// may be processed. In a large index with user-entered query terms, query /// processing time is likely to be dominated not by term lookup but rather /// by the processing of frequency and positional data. In a small index /// or when many uncommon query terms are generated (e.g., by wildcard /// queries) term lookup may become a dominant cost. /// /// In particular, numUniqueTerms/interval terms are read into /// memory by an IndexReader, and, on average, interval/2 terms /// must be scanned for each random term access. /// /// /// /// public virtual void SetTermIndexInterval(int interval) { EnsureOpen(); this.termIndexInterval = interval; } /// Expert: Return the interval between indexed terms. /// /// /// /// public virtual int GetTermIndexInterval() { EnsureOpen(); return termIndexInterval; } /// Constructs an IndexWriter for the index in path. /// Text will be analyzed with a. If create /// is true, then a new, empty index will be created in /// path, replacing the index already there, if any. /// /// /// the path to the index directory /// /// the analyzer to use /// /// true to create the index or overwrite /// the existing one; false to append to the existing /// index /// /// CorruptIndexException if the index is corrupt /// LockObtainFailedException if another writer /// has this index open (write.lock could not /// be obtained) /// /// IOException if the directory cannot be read/written to, or /// if it does not exist and create is /// false or if there is any other low-level /// IO error /// public IndexWriter(System.String path, Analyzer a, bool create) { InitBlock(); Init(FSDirectory.GetDirectory(path), a, create, true, null, true); } /// Constructs an IndexWriter for the index in path. /// Text will be analyzed with a. If create /// is true, then a new, empty index will be created in /// path, replacing the index already there, if any. /// /// /// the path to the index directory /// /// the analyzer to use /// /// true to create the index or overwrite /// the existing one; false to append to the existing /// index /// /// CorruptIndexException if the index is corrupt /// LockObtainFailedException if another writer /// has this index open (write.lock could not /// be obtained) /// /// IOException if the directory cannot be read/written to, or /// if it does not exist and create is /// false or if there is any other low-level /// IO error /// public IndexWriter(System.IO.FileInfo path, Analyzer a, bool create) { InitBlock(); Init(FSDirectory.GetDirectory(path), a, create, true, null, true); } /// Constructs an IndexWriter for the index in d. /// Text will be analyzed with a. If create /// is true, then a new, empty index will be created in /// d, replacing the index already there, if any. 
/// /// /// the index directory /// /// the analyzer to use /// /// true to create the index or overwrite /// the existing one; false to append to the existing /// index /// /// CorruptIndexException if the index is corrupt /// LockObtainFailedException if another writer /// has this index open (write.lock could not /// be obtained) /// /// IOException if the directory cannot be read/written to, or /// if it does not exist and create is /// false or if there is any other low-level /// IO error /// public IndexWriter(Directory d, Analyzer a, bool create) { InitBlock(); Init(d, a, create, false, null, true); } /// Constructs an IndexWriter for the index in /// path, first creating it if it does not /// already exist. Text will be analyzed with /// a. /// /// /// the path to the index directory /// /// the analyzer to use /// /// CorruptIndexException if the index is corrupt /// LockObtainFailedException if another writer /// has this index open (write.lock could not /// be obtained) /// /// IOException if the directory cannot be /// read/written to or if there is any other low-level /// IO error /// public IndexWriter(System.String path, Analyzer a) { InitBlock(); Init(FSDirectory.GetDirectory(path), a, true, null, true); } /// Constructs an IndexWriter for the index in /// path, first creating it if it does not /// already exist. Text will be analyzed with /// a. /// /// /// the path to the index directory /// /// the analyzer to use /// /// CorruptIndexException if the index is corrupt /// LockObtainFailedException if another writer /// has this index open (write.lock could not /// be obtained) /// /// IOException if the directory cannot be /// read/written to or if there is any other low-level /// IO error /// public IndexWriter(System.IO.FileInfo path, Analyzer a) { InitBlock(); Init(FSDirectory.GetDirectory(path), a, true, null, true); } /// Constructs an IndexWriter for the index in /// d, first creating it if it does not /// already exist. Text will be analyzed with /// a. /// /// /// the index directory /// /// the analyzer to use /// /// CorruptIndexException if the index is corrupt /// LockObtainFailedException if another writer /// has this index open (write.lock could not /// be obtained) /// /// IOException if the directory cannot be /// read/written to or if there is any other low-level /// IO error /// public IndexWriter(Directory d, Analyzer a) { InitBlock(); Init(d, a, false, null, true); } /// Constructs an IndexWriter for the index in /// d, first creating it if it does not /// already exist. Text will be analyzed with /// a. /// /// /// the index directory /// /// see above /// /// the analyzer to use /// /// CorruptIndexException if the index is corrupt /// LockObtainFailedException if another writer /// has this index open (write.lock could not /// be obtained) /// /// IOException if the directory cannot be /// read/written to or if there is any other low-level /// IO error /// public IndexWriter(Directory d, bool autoCommit, Analyzer a) { InitBlock(); Init(d, a, false, null, autoCommit); } /// Constructs an IndexWriter for the index in d. /// Text will be analyzed with a. If create /// is true, then a new, empty index will be created in /// d, replacing the index already there, if any. 
/// /// /// the index directory /// /// see above /// /// the analyzer to use /// /// true to create the index or overwrite /// the existing one; false to append to the existing /// index /// /// CorruptIndexException if the index is corrupt /// LockObtainFailedException if another writer /// has this index open (write.lock could not /// be obtained) /// /// IOException if the directory cannot be read/written to, or /// if it does not exist and create is /// false or if there is any other low-level /// IO error /// public IndexWriter(Directory d, bool autoCommit, Analyzer a, bool create) { InitBlock(); Init(d, a, create, false, null, autoCommit); } /// Expert: constructs an IndexWriter with a custom {@link /// IndexDeletionPolicy}, for the index in d, /// first creating it if it does not already exist. Text /// will be analyzed with a. /// /// /// the index directory /// /// see above /// /// the analyzer to use /// /// see above /// /// CorruptIndexException if the index is corrupt /// LockObtainFailedException if another writer /// has this index open (write.lock could not /// be obtained) /// /// IOException if the directory cannot be /// read/written to or if there is any other low-level /// IO error /// public IndexWriter(Directory d, bool autoCommit, Analyzer a, IndexDeletionPolicy deletionPolicy) { InitBlock(); Init(d, a, false, deletionPolicy, autoCommit); } /// Expert: constructs an IndexWriter with a custom {@link /// IndexDeletionPolicy}, for the index in d. /// Text will be analyzed with a. If /// create is true, then a new, empty index /// will be created in d, replacing the index /// already there, if any. /// /// /// the index directory /// /// see above /// /// the analyzer to use /// /// true to create the index or overwrite /// the existing one; false to append to the existing /// index /// /// see above /// /// CorruptIndexException if the index is corrupt /// LockObtainFailedException if another writer /// has this index open (write.lock could not /// be obtained) /// /// IOException if the directory cannot be read/written to, or /// if it does not exist and create is /// false or if there is any other low-level /// IO error /// public IndexWriter(Directory d, bool autoCommit, Analyzer a, bool create, IndexDeletionPolicy deletionPolicy) { InitBlock(); Init(d, a, create, false, deletionPolicy, autoCommit); } private void Init(Directory d, Analyzer a, bool closeDir, IndexDeletionPolicy deletionPolicy, bool autoCommit) { if (IndexReader.IndexExists(d)) { Init(d, a, false, closeDir, deletionPolicy, autoCommit); } else { Init(d, a, true, closeDir, deletionPolicy, autoCommit); } } private void Init(Directory d, Analyzer a, bool create, bool closeDir, IndexDeletionPolicy deletionPolicy, bool autoCommit) { this.closeDir = closeDir; directory = d; analyzer = a; this.infoStream = defaultInfoStream; SetMessageID(); if (create) { // Clear the write lock in case it's leftover: directory.ClearLock(IndexWriter.WRITE_LOCK_NAME); } Lock writeLock = directory.MakeLock(IndexWriter.WRITE_LOCK_NAME); if (!writeLock.Obtain(writeLockTimeout)) // obtain write lock { throw new LockObtainFailedException("Index locked for write: " + writeLock); } this.writeLock = writeLock; // save it try { if (create) { // Try to read first. This is to allow create // against an index that's currently open for // searching. 
In this case we write the next // segments_N file with no segments: try { segmentInfos.Read(directory); segmentInfos.Clear(); } catch (System.IO.IOException e) { // Likely this means it's a fresh directory } segmentInfos.Write(directory); } else { segmentInfos.Read(directory); } this.autoCommit = autoCommit; if (!autoCommit) { rollbackSegmentInfos = (SegmentInfos) segmentInfos.Clone(); } docWriter = new DocumentsWriter(directory, this); docWriter.SetInfoStream(infoStream); // Default deleter (for backwards compatibility) is // KeepOnlyLastCommitDeleter: deleter = new IndexFileDeleter(directory, deletionPolicy == null ? new KeepOnlyLastCommitDeletionPolicy() : deletionPolicy, segmentInfos, infoStream, docWriter); PushMaxBufferedDocs(); if (infoStream != null) { Message("init: create=" + create); MessageState(); } } catch (System.IO.IOException e) { this.writeLock.Release(); this.writeLock = null; throw e; } } /// Expert: set the merge policy used by this writer. public virtual void SetMergePolicy(MergePolicy mp) { EnsureOpen(); if (mp == null) throw new System.NullReferenceException("MergePolicy must be non-null"); if (mergePolicy != mp) mergePolicy.Close(); mergePolicy = mp; PushMaxBufferedDocs(); if (infoStream != null) { Message("setMergePolicy " + mp); } } /// Expert: returns the current MergePolicy in use by this writer. /// /// public virtual MergePolicy GetMergePolicy() { EnsureOpen(); return mergePolicy; } /// Expert: set the merge scheduler used by this writer. public virtual void SetMergeScheduler(MergeScheduler mergeScheduler) { EnsureOpen(); if (mergeScheduler == null) throw new System.NullReferenceException("MergeScheduler must be non-null"); if (this.mergeScheduler != mergeScheduler) { FinishMerges(true); this.mergeScheduler.Close(); } this.mergeScheduler = mergeScheduler; if (infoStream != null) { Message("setMergeScheduler " + mergeScheduler); } } /// Expert: returns the current MergePolicy in use by this /// writer. /// /// /// public virtual MergeScheduler GetMergeScheduler() { EnsureOpen(); return mergeScheduler; } ///

/// Determines the largest segment (measured by document count) that may
/// be merged with other segments. Small values (e.g., less than 10,000)
/// are best for interactive indexing, as this limits the length of
/// pauses while indexing to a few seconds. Larger values are best for
/// batched indexing and speedier searches.
///
/// The default value is {@link Integer#MAX_VALUE}.
///
/// Note that this method is a convenience method: it just calls
/// mergePolicy.setMaxMergeDocs as long as mergePolicy is an instance of
/// {@link LogMergePolicy}. Otherwise an IllegalArgumentException is
/// thrown.
///
/// The default merge policy ({@link LogByteSizeMergePolicy}) also
/// allows you to set this limit by net size (in MB) of the segment,
/// using {@link LogByteSizeMergePolicy#setMaxMergeMB}.
///
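/// For example (the cap of 100,000 documents is illustrative and, like
/// the setter itself, requires a LogMergePolicy-based merge policy):
/// 
///     writer.SetMaxMergeDocs(100000);   // never merge a segment past 100k docs
///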
public virtual void SetMaxMergeDocs(int maxMergeDocs) { GetLogMergePolicy().SetMaxMergeDocs(maxMergeDocs); } ///

/// Returns the largest segment (measured by document count) that may be
/// merged with other segments.
///
/// Note that this method is a convenience method: it just calls
/// mergePolicy.getMaxMergeDocs as long as mergePolicy is an instance of
/// {@link LogMergePolicy}. Otherwise an IllegalArgumentException is
/// thrown.
///
/// /// public virtual int GetMaxMergeDocs() { return GetLogMergePolicy().GetMaxMergeDocs(); } /// The maximum number of terms that will be indexed for a single field in a /// document. This limits the amount of memory required for indexing, so that /// collections with very large files will not crash the indexing process by /// running out of memory. This setting refers to the number of running terms, /// not to the number of different terms.

/// Note: this silently truncates large documents, excluding from the
/// index all terms that occur further in the document. If you know your
/// source documents are large, be sure to set this value high enough to
/// accommodate the expected size. If you set it to Integer.MAX_VALUE,
/// then the only limit is your memory, but you should anticipate an
/// OutOfMemoryError.

/// By default, no more than 10,000 terms will be indexed for a field. ///
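/// For example (the cap shown is arbitrary):
/// 
///     writer.SetMaxFieldLength(50000);        // index at most 50k terms per field
///     int cap = writer.GetMaxFieldLength();   // 50000
///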

public virtual void SetMaxFieldLength(int maxFieldLength) { EnsureOpen(); this.maxFieldLength = maxFieldLength; if (infoStream != null) Message("setMaxFieldLength " + maxFieldLength); } /// Returns the maximum number of terms that will be /// indexed for a single field in a document. /// /// /// public virtual int GetMaxFieldLength() { EnsureOpen(); return maxFieldLength; } /// Determines the minimal number of documents required /// before the buffered in-memory documents are flushed as /// a new Segment. Large values generally gives faster /// indexing. /// ///

/// When this is set, the writer will flush every maxBufferedDocs added
/// documents. Pass in {@link #DISABLE_AUTO_FLUSH} to prevent triggering
/// a flush due to number of buffered documents. Note that if flushing
/// by RAM usage is also enabled, then the flush will be triggered by
/// whichever comes first.
///
/// Disabled by default (writer flushes by RAM usage).
///
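/// For example, to flush strictly by document count (the count is
/// illustrative; note the order, since at least one flush trigger must
/// stay enabled at all times):
/// 
///     writer.SetMaxBufferedDocs(1000);                            // flush every 1000 added docs
///     writer.SetRAMBufferSizeMB(IndexWriter.DISABLE_AUTO_FLUSH);  // then disable RAM-based flushing
///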
/// IllegalArgumentException if maxBufferedDocs is /// enabled but smaller than 2, or it disables maxBufferedDocs /// when ramBufferSize is already disabled /// /// /// public virtual void SetMaxBufferedDocs(int maxBufferedDocs) { EnsureOpen(); if (maxBufferedDocs != DISABLE_AUTO_FLUSH && maxBufferedDocs < 2) throw new System.ArgumentException("maxBufferedDocs must at least be 2 when enabled"); if (maxBufferedDocs == DISABLE_AUTO_FLUSH && GetRAMBufferSizeMB() == DISABLE_AUTO_FLUSH) throw new System.ArgumentException("at least one of ramBufferSize and maxBufferedDocs must be enabled"); docWriter.SetMaxBufferedDocs(maxBufferedDocs); PushMaxBufferedDocs(); if (infoStream != null) Message("setMaxBufferedDocs " + maxBufferedDocs); } /// If we are flushing by doc count (not by RAM usage), and /// using LogDocMergePolicy then push maxBufferedDocs down /// as its minMergeDocs, to keep backwards compatibility. /// private void PushMaxBufferedDocs() { if (docWriter.GetMaxBufferedDocs() != DISABLE_AUTO_FLUSH) { MergePolicy mp = mergePolicy; if (mp is LogDocMergePolicy) { LogDocMergePolicy lmp = (LogDocMergePolicy) mp; int maxBufferedDocs = docWriter.GetMaxBufferedDocs(); if (lmp.GetMinMergeDocs() != maxBufferedDocs) { if (infoStream != null) Message("now push maxBufferedDocs " + maxBufferedDocs + " to LogDocMergePolicy"); lmp.SetMinMergeDocs(maxBufferedDocs); } } } } /// Returns the number of buffered added documents that will /// trigger a flush if enabled. /// /// /// public virtual int GetMaxBufferedDocs() { EnsureOpen(); return docWriter.GetMaxBufferedDocs(); } /// Determines the amount of RAM that may be used for /// buffering added documents before they are flushed as a /// new Segment. Generally for faster indexing performance /// it's best to flush by RAM usage instead of document /// count and use as large a RAM buffer as you can. /// ///

/// When this is set, the writer will flush whenever buffered documents
/// use this much RAM. Pass in {@link #DISABLE_AUTO_FLUSH} to prevent
/// triggering a flush due to RAM usage. Note that if flushing by
/// document count is also enabled, then the flush will be triggered by
/// whichever comes first.
///
/// The default value is {@link #DEFAULT_RAM_BUFFER_SIZE_MB}.
///
/// IllegalArgumentException if ramBufferSize is /// enabled but non-positive, or it disables ramBufferSize /// when maxBufferedDocs is already disabled /// public virtual void SetRAMBufferSizeMB(double mb) { if (mb != DISABLE_AUTO_FLUSH && mb <= 0.0) throw new System.ArgumentException("ramBufferSize should be > 0.0 MB when enabled"); if (mb == DISABLE_AUTO_FLUSH && GetMaxBufferedDocs() == DISABLE_AUTO_FLUSH) throw new System.ArgumentException("at least one of ramBufferSize and maxBufferedDocs must be enabled"); docWriter.SetRAMBufferSizeMB(mb); if (infoStream != null) Message("setRAMBufferSizeMB " + mb); } /// Returns the value set by {@link #setRAMBufferSizeMB} if enabled. public virtual double GetRAMBufferSizeMB() { return docWriter.GetRAMBufferSizeMB(); } ///

/// Determines the minimal number of delete terms required before the
/// buffered in-memory delete terms are applied and flushed. If there
/// are documents buffered in memory at the time, they are merged and a
/// new segment is created.
///
/// Disabled by default (writer flushes by RAM usage).
///
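/// For example (the threshold is arbitrary):
/// 
///     writer.SetMaxBufferedDeleteTerms(100);   // apply deletes once 100 terms are buffered
///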
/// IllegalArgumentException if maxBufferedDeleteTerms /// is enabled but smaller than 1 /// /// /// public virtual void SetMaxBufferedDeleteTerms(int maxBufferedDeleteTerms) { EnsureOpen(); if (maxBufferedDeleteTerms != DISABLE_AUTO_FLUSH && maxBufferedDeleteTerms < 1) throw new System.ArgumentException("maxBufferedDeleteTerms must at least be 1 when enabled"); docWriter.SetMaxBufferedDeleteTerms(maxBufferedDeleteTerms); if (infoStream != null) Message("setMaxBufferedDeleteTerms " + maxBufferedDeleteTerms); } /// Returns the number of buffered deleted terms that will /// trigger a flush if enabled. /// /// /// public virtual int GetMaxBufferedDeleteTerms() { EnsureOpen(); return docWriter.GetMaxBufferedDeleteTerms(); } /// Determines how often segment indices are merged by addDocument(). With /// smaller values, less RAM is used while indexing, and searches on /// unoptimized indices are faster, but indexing speed is slower. With larger /// values, more RAM is used during indexing, and while searches on unoptimized /// indices are slower, indexing is faster. Thus larger values (> 10) are best /// for batch index creation, and smaller values (< 10) for indices that are /// interactively maintained. /// ///

/// Note that this method is a convenience method: it just calls
/// mergePolicy.setMergeFactor as long as mergePolicy is an instance of
/// {@link LogMergePolicy}. Otherwise an IllegalArgumentException is
/// thrown.
///
/// This must never be less than 2. The default value is 10.
///
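/// For example (a larger factor suits batch indexing; 10 is the default):
/// 
///     writer.SetMergeFactor(30);   // merge less often, in larger batches
///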

public virtual void SetMergeFactor(int mergeFactor) { GetLogMergePolicy().SetMergeFactor(mergeFactor); } ///

/// Returns the number of segments that are merged at once and also
/// controls the total number of segments allowed to accumulate in the
/// index.
///
/// Note that this method is a convenience method: it just calls
/// mergePolicy.getMergeFactor as long as mergePolicy is an instance of
/// {@link LogMergePolicy}. Otherwise an IllegalArgumentException is
/// thrown.
///
/// /// public virtual int GetMergeFactor() { return GetLogMergePolicy().GetMergeFactor(); } /// If non-null, this will be the default infoStream used /// by a newly instantiated IndexWriter. /// /// /// public static void SetDefaultInfoStream(System.IO.TextWriter infoStream) { IndexWriter.defaultInfoStream = infoStream; } /// Returns the current default infoStream for newly /// instantiated IndexWriters. /// /// /// public static System.IO.TextWriter GetDefaultInfoStream() { return IndexWriter.defaultInfoStream; } /// If non-null, information about merges, deletes and a /// message when maxFieldLength is reached will be printed /// to this. /// public virtual void SetInfoStream(System.IO.TextWriter infoStream) { EnsureOpen(); this.infoStream = infoStream; SetMessageID(); docWriter.SetInfoStream(infoStream); deleter.SetInfoStream(infoStream); if (infoStream != null) MessageState(); } private void MessageState() { Message("setInfoStream: dir=" + directory + " autoCommit=" + autoCommit + " mergePolicy=" + mergePolicy + " mergeScheduler=" + mergeScheduler + " ramBufferSizeMB=" + docWriter.GetRAMBufferSizeMB() + " maxBuffereDocs=" + docWriter.GetMaxBufferedDocs() + " maxBuffereDeleteTerms=" + docWriter.GetMaxBufferedDeleteTerms() + " maxFieldLength=" + maxFieldLength + " index=" + SegString()); } /// Returns the current infoStream in use by this writer. /// /// public virtual System.IO.TextWriter GetInfoStream() { EnsureOpen(); return infoStream; } /// /// /// /// public virtual void SetWriteLockTimeout(long writeLockTimeout) { EnsureOpen(); this.writeLockTimeout = writeLockTimeout; } /// Returns allowed timeout when acquiring the write lock. /// /// public virtual long GetWriteLockTimeout() { EnsureOpen(); return writeLockTimeout; } /// Sets the default (for any instance of IndexWriter) maximum time to wait for a write lock (in /// milliseconds). /// public static void SetDefaultWriteLockTimeout(long writeLockTimeout) { IndexWriter.WRITE_LOCK_TIMEOUT = writeLockTimeout; } /// Returns default write lock timeout for newly /// instantiated IndexWriters. /// /// /// public static long GetDefaultWriteLockTimeout() { return IndexWriter.WRITE_LOCK_TIMEOUT; } /// Flushes all changes to an index and closes all /// associated files. /// ///

/// If an Exception is hit during close, e.g. due to disk full or some
/// other reason, then both the on-disk index and the internal state of
/// the IndexWriter instance will be consistent. However, the close will
/// not be complete even though part of it (flushing buffered documents)
/// may have succeeded, so the write lock will still be held.
///
/// If you can correct the underlying cause (e.g. free up some disk
/// space) then you can call close() again. Failing that, if you want to
/// force the write lock to be released (dangerous, because you may then
/// lose buffered docs in the IndexWriter instance) then you can do
/// something like this:
///
		/// try {
		///     writer.Close();
		/// } finally {
		///     if (IndexReader.IsLocked(directory)) {
		///         IndexReader.Unlock(directory);
		///     }
		/// }
		/// 
/// /// after which, you must be certain not to use the writer /// instance anymore.

///
/// CorruptIndexException if the index is corrupt /// IOException if there is a low-level IO error public virtual void Close() { Close(true); } /// Closes the index with or without waiting for currently /// running merges to finish. This is only meaningful when /// using a MergeScheduler that runs merges in background /// threads. /// /// if true, this call will block /// until all merges complete; else, it will ask all /// running merges to abort, wait until those merges have /// finished (which should be at most a few seconds), and /// then return. /// public virtual void Close(bool waitForMerges) { bool doClose; lock (this) { // Ensure that only one thread actually gets to do the closing: if (!closing) { doClose = true; closing = true; } else doClose = false; } if (doClose) CloseInternal(waitForMerges); // Another thread beat us to it (is actually doing the // close), so we will block until that other thread // has finished closing else WaitForClose(); } private void WaitForClose() { lock (this) { while (!closed && closing) { try { System.Threading.Monitor.Wait(this); } catch (System.Threading.ThreadInterruptedException ie) { } } } } private void CloseInternal(bool waitForMerges) { try { if (infoStream != null) Message("now flush at close"); docWriter.Close(); // Only allow a new merge to be triggered if we are // going to wait for merges: Flush(waitForMerges, true); mergePolicy.Close(); FinishMerges(waitForMerges); mergeScheduler.Close(); lock (this) { if (commitPending) { bool success = false; try { segmentInfos.Write(directory); // now commit changes success = true; } finally { if (!success) { if (infoStream != null) Message("hit exception committing segments file during close"); DeletePartialSegmentsFile(); } } if (infoStream != null) Message("close: wrote segments file \"" + segmentInfos.GetCurrentSegmentFileName() + "\""); deleter.Checkpoint(segmentInfos, true); commitPending = false; rollbackSegmentInfos = null; } if (infoStream != null) Message("at close: " + SegString()); docWriter = null; deleter.Close(); } if (closeDir) directory.Close(); if (writeLock != null) { writeLock.Release(); // release write lock writeLock = null; } closed = true; } finally { lock (this) { if (!closed) closing = false; System.Threading.Monitor.PulseAll(this); } } } /// Tells the docWriter to close its currently open shared /// doc stores (stored fields & vectors files). /// Return value specifices whether new doc store files are compound or not. /// private bool FlushDocStores() { lock (this) { System.Collections.IList files = docWriter.Files(); bool useCompoundDocStore = false; if (files.Count > 0) { System.String docStoreSegment; bool success = false; try { docStoreSegment = docWriter.CloseDocStore(); success = true; } finally { if (!success) { if (infoStream != null) Message("hit exception closing doc store segment"); docWriter.Abort(null); } } useCompoundDocStore = mergePolicy.UseCompoundDocStore(segmentInfos); if (useCompoundDocStore && docStoreSegment != null) { // Now build compound doc store file success = false; int numSegments = segmentInfos.Count; System.String compoundFileName = docStoreSegment + "." 
+ IndexFileNames.COMPOUND_FILE_STORE_EXTENSION; try { CompoundFileWriter cfsWriter = new CompoundFileWriter(directory, compoundFileName); int size = files.Count; for (int i = 0; i < size; i++) cfsWriter.AddFile((System.String) files[i]); // Perform the merge cfsWriter.Close(); for (int i = 0; i < numSegments; i++) { SegmentInfo si = segmentInfos.Info(i); if (si.GetDocStoreOffset() != - 1 && si.GetDocStoreSegment().Equals(docStoreSegment)) si.SetDocStoreIsCompoundFile(true); } Checkpoint(); success = true; } finally { if (!success) { if (infoStream != null) Message("hit exception building compound file doc store for segment " + docStoreSegment); // Rollback to no compound file for (int i = 0; i < numSegments; i++) { SegmentInfo si = segmentInfos.Info(i); if (si.GetDocStoreOffset() != - 1 && si.GetDocStoreSegment().Equals(docStoreSegment)) si.SetDocStoreIsCompoundFile(false); } deleter.DeleteFile(compoundFileName); DeletePartialSegmentsFile(); } } deleter.Checkpoint(segmentInfos, false); } } return useCompoundDocStore; } } /// Release the write lock, if needed. ~IndexWriter() { try { if (writeLock != null) { writeLock.Release(); // release write lock writeLock = null; } } finally { } } /// Returns the Directory used by this index. public virtual Directory GetDirectory() { EnsureOpen(); return directory; } /// Returns the analyzer used by this index. public virtual Analyzer GetAnalyzer() { EnsureOpen(); return analyzer; } /// Returns the number of documents currently in this index. public virtual int DocCount() { lock (this) { EnsureOpen(); int count = docWriter.GetNumDocsInRAM(); for (int i = 0; i < segmentInfos.Count; i++) { SegmentInfo si = segmentInfos.Info(i); count += si.docCount; } return count; } } /// The maximum number of terms that will be indexed for a single field in a /// document. This limits the amount of memory required for indexing, so that /// collections with very large files will not crash the indexing process by /// running out of memory.

/// Note that this effectively truncates large documents, excluding from
/// the index terms that occur further in the document. If you know your
/// source documents are large, be sure to set this value high enough to
/// accommodate the expected size. If you set it to Integer.MAX_VALUE,
/// then the only limit is your memory, but you should anticipate an
/// OutOfMemoryError.

/// By default, no more than 10,000 terms will be indexed for a field. /// ///

private int maxFieldLength = DEFAULT_MAX_FIELD_LENGTH; /// Adds a document to this index. If the document contains more than /// {@link #SetMaxFieldLength(int)} terms for a given field, the remainder are /// discarded. /// ///

/// Note that if an Exception is hit (for example disk full) then the
/// index will be consistent, but this document may not have been added.
/// Furthermore, it's possible the index will have one segment in
/// non-compound format even when using compound files (when a merge has
/// partially succeeded).
///
/// This method periodically flushes pending documents to the Directory
/// (every {@link #setMaxBufferedDocs}), and also periodically merges
/// segments in the index (every {@link #setMergeFactor} flushes). When
/// this occurs, the method will take more time to run (possibly a long
/// time if the index is large), and will require free temporary space
/// in the Directory to do the merging.
///
/// The amount of free space required when a merge is triggered is up to
/// 1X the size of all segments being merged, when no readers/searchers
/// are open against the index, and up to 2X the size of all segments
/// being merged when readers/searchers are open against the index (see
/// {@link #Optimize()} for details). The sequence of primitive merge
/// operations performed is governed by the merge policy.
///
/// Note that each term in the document can be no longer than 16383
/// characters, otherwise an IllegalArgumentException will be thrown.
///
/// CorruptIndexException if the index is corrupt /// IOException if there is a low-level IO error public virtual void AddDocument(Document doc) { AddDocument(doc, analyzer); } /// Adds a document to this index, using the provided analyzer instead of the /// value of {@link #GetAnalyzer()}. If the document contains more than /// {@link #SetMaxFieldLength(int)} terms for a given field, the remainder are /// discarded. /// ///

/// See {@link #AddDocument(Document)} for details on index and
/// IndexWriter state after an Exception, and flushing/merging temporary
/// free space requirements.
///
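/// For example, analyzing one document with a different analyzer (the
/// WhitespaceAnalyzer here is only an illustration):
/// 
///     writer.AddDocument(doc, new WhitespaceAnalyzer());
///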
/// CorruptIndexException if the index is corrupt /// IOException if there is a low-level IO error public virtual void AddDocument(Document doc, Analyzer analyzer) { EnsureOpen(); bool doFlush = false; bool success = false; try { doFlush = docWriter.AddDocument(doc, analyzer); success = true; } finally { if (!success) { if (infoStream != null) Message("hit exception adding document"); lock (this) { // If docWriter has some aborted files that were // never incref'd, then we clean them up here if (docWriter != null) { System.Collections.IList files = docWriter.AbortedFiles(); if (files != null) deleter.DeleteNewFiles(files); } } } } if (doFlush) Flush(true, false); } /// Deletes the document(s) containing term. /// the term to identify the documents to be deleted /// /// CorruptIndexException if the index is corrupt /// IOException if there is a low-level IO error public virtual void DeleteDocuments(Term term) { EnsureOpen(); bool doFlush = docWriter.BufferDeleteTerm(term); if (doFlush) Flush(true, false); } /// Deletes the document(s) containing any of the /// terms. All deletes are flushed at the same time. /// /// array of terms to identify the documents /// to be deleted /// /// CorruptIndexException if the index is corrupt /// IOException if there is a low-level IO error public virtual void DeleteDocuments(Term[] terms) { EnsureOpen(); bool doFlush = docWriter.BufferDeleteTerms(terms); if (doFlush) Flush(true, false); } /// Updates a document by first deleting the document(s) /// containing term and then adding the new /// document. The delete and then add are atomic as seen /// by a reader on the same index (flush may happen only after /// the add). /// /// the term to identify the document(s) to be /// deleted /// /// the document to be added /// /// CorruptIndexException if the index is corrupt /// IOException if there is a low-level IO error public virtual void UpdateDocument(Term term, Document doc) { EnsureOpen(); UpdateDocument(term, doc, GetAnalyzer()); } /// Updates a document by first deleting the document(s) /// containing term and then adding the new /// document. The delete and then add are atomic as seen /// by a reader on the same index (flush may happen only after /// the add). 
/// /// the term to identify the document(s) to be /// deleted /// /// the document to be added /// /// the analyzer to use when analyzing the document /// /// CorruptIndexException if the index is corrupt /// IOException if there is a low-level IO error public virtual void UpdateDocument(Term term, Document doc, Analyzer analyzer) { EnsureOpen(); bool doFlush = false; bool success = false; try { doFlush = docWriter.UpdateDocument(term, doc, analyzer); success = true; } finally { if (!success) { if (infoStream != null) Message("hit exception updating document"); lock (this) { // If docWriter has some aborted files that were // never incref'd, then we clean them up here System.Collections.IList files = docWriter.AbortedFiles(); if (files != null) deleter.DeleteNewFiles(files); } } } if (doFlush) Flush(true, false); } // for test purpose public /*internal*/ int GetSegmentCount() { lock (this) { return segmentInfos.Count; } } // for test purpose public /*internal*/ int GetNumBufferedDocuments() { lock (this) { return docWriter.GetNumDocsInRAM(); } } // for test purpose public /*internal*/ int GetDocCount(int i) { lock (this) { if (i >= 0 && i < segmentInfos.Count) { return segmentInfos.Info(i).docCount; } else { return - 1; } } } internal System.String NewSegmentName() { // Cannot synchronize on IndexWriter because that causes // deadlock lock (segmentInfos) { // Important to set commitPending so that the // segmentInfos is written on close. Otherwise we // could close, re-open and re-return the same segment // name that was previously returned which can cause // problems at least with ConcurrentMergeScheduler. commitPending = true; return "_" + SupportClass.Number.ToString(segmentInfos.counter++); } } /// If non-null, information about merges will be printed to this. private System.IO.TextWriter infoStream = null; private static System.IO.TextWriter defaultInfoStream = null; /// Requests an "optimize" operation on an index, priming the index /// for the fastest available search. Traditionally this has meant /// merging all segments into a single segment as is done in the /// default merge policy, but individaul merge policies may implement /// optimize in different ways. /// /// /// /// ///

/// It is recommended that this method be called upon completion of
/// indexing. In environments with frequent updates, optimize is best
/// done during low volume times, if at all.
///
/// See http://www.gossamer-threads.com/lists/lucene/java-dev/47895 for
/// more discussion.
///
/// Note that this can require substantial temporary free space in the
/// Directory (see LUCENE-764 for details).
///
/// The actual temporary usage could be much less than these figures (it
/// depends on many factors).
///
/// In general, once the optimize completes, the total size of the index
/// will be less than the size of the starting index. It could be quite
/// a bit smaller (if there were many pending deletes) or just slightly
/// smaller.
///
/// If an Exception is hit during optimize(), for example due to disk
/// full, the index will not be corrupt and no documents will have been
/// lost. However, it may have been partially optimized (some segments
/// were merged but not all), and it's possible that one of the segments
/// in the index will be in non-compound format even when using compound
/// file format. This will occur when the Exception is hit during
/// conversion of the segment into compound format.
///
/// This call will optimize those segments present in the index when the
/// call started. If other threads are still adding documents and
/// flushing segments, those newly created segments will not be
/// optimized unless you call optimize again.
///
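/// For example, a non-blocking partial optimize down to at most 5
/// segments (the segment count is arbitrary; with the default
/// ConcurrentMergeScheduler the merges continue on background threads):
/// 
///     writer.Optimize(5, false);   // returns once the merges are scheduled
///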
/// CorruptIndexException if the index is corrupt /// IOException if there is a low-level IO error public virtual void Optimize() { Optimize(true); } /// Optimize the index down to <= maxNumSegments. If /// maxNumSegments==1 then this is the same as {@link /// #Optimize()}. /// /// maximum number of segments left /// in the index after optimization finishes /// public virtual void Optimize(int maxNumSegments) { Optimize(maxNumSegments, true); } /// Just like {@link #Optimize()}, except you can specify /// whether the call should block until the optimize /// completes. This is only meaningful with a /// {@link MergeScheduler} that is able to run merges in /// background threads. /// public virtual void Optimize(bool doWait) { Optimize(1, true); } /// Just like {@link #Optimize(int)}, except you can /// specify whether the call should block until the /// optimize completes. This is only meaningful with a /// {@link MergeScheduler} that is able to run merges in /// background threads. /// public virtual void Optimize(int maxNumSegments, bool doWait) { EnsureOpen(); if (maxNumSegments < 1) throw new System.ArgumentException("maxNumSegments must be >= 1; got " + maxNumSegments); if (infoStream != null) Message("optimize: index now " + SegString()); Flush(); lock (this) { ResetMergeExceptions(); segmentsToOptimize = new System.Collections.Hashtable(); int numSegments = segmentInfos.Count; for (int i = 0; i < numSegments; i++) if (!segmentsToOptimize.ContainsKey(segmentInfos.Info(i))) segmentsToOptimize.Add(segmentInfos.Info(i), segmentInfos.Info(i)); // Now mark all pending & running merges as optimize // merge: System.Collections.IEnumerator it = pendingMerges.GetEnumerator(); while (it.MoveNext()) { MergePolicy.OneMerge merge = (MergePolicy.OneMerge) it.Current; merge.optimize = true; merge.maxNumSegmentsOptimize = maxNumSegments; } it = runningMerges.GetEnumerator(); while (it.MoveNext()) { MergePolicy.OneMerge merge = (MergePolicy.OneMerge)((DictionaryEntry)it.Current).Value; merge.optimize = true; merge.maxNumSegmentsOptimize = maxNumSegments; } } MaybeMerge(maxNumSegments, true); if (doWait) { lock (this) { while (OptimizeMergesPending()) { try { System.Threading.Monitor.Wait(this); } catch (System.Threading.ThreadInterruptedException ie) { } if (mergeExceptions.Count > 0) { // Forward any exceptions in background merge // threads to the current thread: int size = mergeExceptions.Count; for (int i = 0; i < size; i++) { MergePolicy.OneMerge merge = (MergePolicy.OneMerge) mergeExceptions[0]; if (merge.optimize) { System.IO.IOException err = new System.IO.IOException("background merge hit exception: " + merge.SegString(directory), merge.GetException()); throw err; } } } } } } // NOTE: in the ConcurrentMergeScheduler case, when // doWait is false, we can return immediately while // background threads accomplish the optimization } /// Returns true if any merges in pendingMerges or /// runningMerges are optimization merges. 
/// private bool OptimizeMergesPending() { lock (this) { System.Collections.IEnumerator it = pendingMerges.GetEnumerator(); while (it.MoveNext()) { if (((MergePolicy.OneMerge) it.Current).optimize) return true; } it = runningMerges.GetEnumerator(); while (it.MoveNext()) { if (((MergePolicy.OneMerge) ((DictionaryEntry)it.Current).Value).optimize) return true; } return false; } } /// Expert: asks the mergePolicy whether any merges are /// necessary now and if so, runs the requested merges and /// then iterate (test again if merges are needed) until no /// more merges are returned by the mergePolicy. /// /// Explicit calls to maybeMerge() are usually not /// necessary. The most common case is when merge policy /// parameters have changed. /// public void MaybeMerge() { MaybeMerge(false); } private void MaybeMerge(bool optimize) { MaybeMerge(1, optimize); } private void MaybeMerge(int maxNumSegmentsOptimize, bool optimize) { UpdatePendingMerges(maxNumSegmentsOptimize, optimize); mergeScheduler.Merge(this); } private void UpdatePendingMerges(int maxNumSegmentsOptimize, bool optimize) { lock (this) { System.Diagnostics.Debug.Assert(!optimize || maxNumSegmentsOptimize > 0); if (stopMerges) return ; MergePolicy.MergeSpecification spec; if (optimize) { spec = mergePolicy.FindMergesForOptimize(segmentInfos, this, maxNumSegmentsOptimize, segmentsToOptimize); if (spec != null) { int numMerges = spec.merges.Count; for (int i = 0; i < numMerges; i++) { MergePolicy.OneMerge merge = ((MergePolicy.OneMerge) spec.merges[i]); merge.optimize = true; merge.maxNumSegmentsOptimize = maxNumSegmentsOptimize; } } } else spec = mergePolicy.FindMerges(segmentInfos, this); if (spec != null) { int numMerges = spec.merges.Count; for (int i = 0; i < numMerges; i++) RegisterMerge((MergePolicy.OneMerge) spec.merges[i]); } } } /// Expert: the {@link MergeScheduler} calls this method /// to retrieve the next merge requested by the /// MergePolicy /// public /*internal*/ virtual MergePolicy.OneMerge GetNextMerge() { lock (this) { if (pendingMerges.Count == 0) return null; else { // Advance the merge from pending to running System.Object tempObject; tempObject = pendingMerges[0]; pendingMerges.RemoveAt(0); MergePolicy.OneMerge merge = (MergePolicy.OneMerge) tempObject; runningMerges.Add(merge, merge); return merge; } } } /* * Begin a transaction. During a transaction, any segment * merges that happen (or ram segments flushed) will not * write a new segments file and will not remove any files * that were present at the start of the transaction. You * must make a matched (try/finally) call to * commitTransaction() or rollbackTransaction() to finish * the transaction. * * Note that buffered documents and delete terms are not handled * within the transactions, so they must be flushed before the * transaction is started. 
*/ private void StartTransaction() { if (infoStream != null) Message("now start transaction"); System.Diagnostics.Debug.Assert(docWriter.GetNumBufferedDeleteTerms() == 0, "calling startTransaction with buffered delete terms not supported"); System.Diagnostics.Debug.Assert(docWriter.GetNumDocsInRAM() == 0, "calling startTransaction with buffered documents not supported"); localRollbackSegmentInfos = (SegmentInfos) segmentInfos.Clone(); localAutoCommit = autoCommit; if (localAutoCommit) { if (infoStream != null) Message("flush at startTransaction"); Flush(); // Turn off auto-commit during our local transaction: autoCommit = false; } // We must "protect" our files at this point from // deletion in case we need to rollback: else deleter.IncRef(segmentInfos, false); } /* * Rolls back the transaction and restores state to where * we were at the start. */ private void RollbackTransaction() { if (infoStream != null) Message("now rollback transaction"); // First restore autoCommit in case we hit an exception below: autoCommit = localAutoCommit; // Keep the same segmentInfos instance but replace all // of its SegmentInfo instances. This is so the next // attempt to commit using this instance of IndexWriter // will always write to a new generation ("write once"). segmentInfos.Clear(); segmentInfos.AddRange(localRollbackSegmentInfos); localRollbackSegmentInfos = null; // Ask deleter to locate unreferenced files we had // created & remove them: deleter.Checkpoint(segmentInfos, false); if (!autoCommit) // Remove the incRef we did in startTransaction: deleter.DecRef(segmentInfos); deleter.Refresh(); FinishMerges(false); stopMerges = false; } /* * Commits the transaction. This will write the new * segments file and remove and pending deletions we have * accumulated during the transaction */ private void CommitTransaction() { if (infoStream != null) Message("now commit transaction"); // First restore autoCommit in case we hit an exception below: autoCommit = localAutoCommit; bool success = false; try { Checkpoint(); success = true; } finally { if (!success) { if (infoStream != null) Message("hit exception committing transaction"); RollbackTransaction(); } } if (!autoCommit) // Remove the incRef we did in startTransaction. deleter.DecRef(localRollbackSegmentInfos); localRollbackSegmentInfos = null; // Give deleter a chance to remove files now: deleter.Checkpoint(segmentInfos, autoCommit); } /// Close the IndexWriter without committing /// any of the changes that have occurred since it was /// opened. This removes any temporary files that had been /// created, after which the state of the index will be the /// same as it was when this writer was first opened. This /// can only be called when this IndexWriter was opened /// with autoCommit=false. /// /// IllegalStateException if this is called when /// the writer was opened with autoCommit=true. 
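// Usage sketch for Abort(): a minimal "all or none" pattern, assuming a
// caller-supplied Directory "dir", Analyzer "analyzer", and document
// collection "docs" (all illustrative names), with the writer opened with
// autoCommit=false so nothing becomes visible until Close().
//
//   IndexWriter writer = new IndexWriter(dir, false, analyzer);  // autoCommit=false
//   try
//   {
//       foreach (Document doc in docs)
//           writer.AddDocument(doc);
//       writer.Close();       // commit: readers now see all of the changes
//   }
//   catch (System.Exception)
//   {
//       writer.Abort();       // discard every change made since the writer was opened
//       throw;
//   }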
/// /// IOException if there is a low-level IO error public virtual void Abort() { EnsureOpen(); if (autoCommit) throw new System.SystemException("abort() can only be called when IndexWriter was opened with autoCommit=false"); bool doClose; lock (this) { // Ensure that only one thread actually gets to do the closing: if (!closing) { doClose = true; closing = true; } else doClose = false; } if (doClose) { FinishMerges(false); // Must pre-close these two, in case they set // commitPending=true, so that we can then set it to // false before calling closeInternal mergePolicy.Close(); mergeScheduler.Close(); lock (this) { // Keep the same segmentInfos instance but replace all // of its SegmentInfo instances. This is so the next // attempt to commit using this instance of IndexWriter // will always write to a new generation ("write // once"). segmentInfos.Clear(); segmentInfos.AddRange(rollbackSegmentInfos); docWriter.Abort(null); // Ask deleter to locate unreferenced files & remove // them: deleter.Checkpoint(segmentInfos, false); deleter.Refresh(); } commitPending = false; CloseInternal(false); } else WaitForClose(); } private void FinishMerges(bool waitForMerges) { lock (this) { if (!waitForMerges) { stopMerges = true; // Abort all pending & running merges: System.Collections.IEnumerator it = pendingMerges.GetEnumerator(); while (it.MoveNext()) { MergePolicy.OneMerge merge = (MergePolicy.OneMerge) it.Current; if (infoStream != null) Message("now abort pending merge " + merge.SegString(directory)); merge.Abort(); MergeFinish(merge); } pendingMerges.Clear(); it = runningMerges.GetEnumerator(); while (it.MoveNext()) { MergePolicy.OneMerge merge = (MergePolicy.OneMerge)((DictionaryEntry)it.Current).Value; if (infoStream != null) Message("now abort running merge " + merge.SegString(directory)); merge.Abort(); } // These merges periodically check whether they have // been aborted, and stop if so. We wait here to make // sure they all stop. It should not take very long // because the merge threads periodically check if // they are aborted. while (runningMerges.Count > 0) { if (infoStream != null) Message("now wait for " + runningMerges.Count + " running merge to abort"); try { System.Threading.Monitor.Wait(this); } catch (System.Threading.ThreadInterruptedException ie) { SupportClass.ThreadClass.Current().Interrupt(); } } System.Diagnostics.Debug.Assert(0 == mergingSegments.Count); if (infoStream != null) Message("all running merges have aborted"); } else { while (pendingMerges.Count > 0 || runningMerges.Count > 0) { try { System.Threading.Monitor.Wait(this); } catch (System.Threading.ThreadInterruptedException ie) { } } System.Diagnostics.Debug.Assert(0 == mergingSegments.Count); } } } /* * Called whenever the SegmentInfos has been updated and * the index files referenced exist (correctly) in the * index directory. If we are in autoCommit mode, we * commit the change immediately. Else, we mark * commitPending. */ private void Checkpoint() { lock (this) { if (autoCommit) { segmentInfos.Write(directory); commitPending = false; if (infoStream != null) Message("checkpoint: wrote segments file \"" + segmentInfos.GetCurrentSegmentFileName() + "\""); } else { commitPending = true; } } } /// Merges all segments from an array of indexes into this index. /// ///

/// This may be used to parallelize batch indexing. A large document
/// collection can be broken into sub-collections. Each sub-collection can be
/// indexed in parallel, on a different thread, process or machine. The
/// complete index can then be created by merging sub-collection indexes
/// with this method.
///
/// NOTE: the index in each Directory must not be
/// changed (opened by a writer) while this method is
/// running. This method does not acquire a write lock in
/// each input Directory, so it is up to the caller to
/// enforce this.
///
/// After this completes, the index is optimized.
///
/// This method is transactional in how Exceptions are
/// handled: it does not commit a new segments_N file until
/// all indexes are added. This means if an Exception
/// occurs (for example disk full), then either no indexes
/// will have been added or they all will have been.
///
/// If an Exception is hit, it's still possible that all
/// indexes were successfully added. This happens when the
/// Exception is hit when trying to build a CFS file. In
/// this case, one segment in the index will be in non-CFS
/// format, even when using compound file format.
///
/// Also note that on an Exception, the index may still
/// have been partially or fully optimized even though none
/// of the input indexes were added.
///
/// Note that this requires temporary free space in the
/// Directory up to 2X the sum of all input indexes
/// (including the starting index). If readers/searchers
/// are open against the starting index, then temporary
/// free space required will be higher by the size of the
/// starting index (see {@link #Optimize()} for details).
///
/// Once this completes, the final size of the index
/// will be less than the sum of all input index sizes
/// (including the starting index). It could be quite a
/// bit smaller (if there were many pending deletes) or
/// just slightly smaller.
///
/// See LUCENE-702 for details.
///
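// Usage sketch for the parallel batch-indexing pattern described above: a
// minimal example, assuming the per-worker sub-indexes were built under
// paths "part0" and "part1" and that "analyzer" exists in the caller's code
// (all illustrative names).
//
//   Directory[] parts = new Directory[]
//   {
//       FSDirectory.GetDirectory("part0"),
//       FSDirectory.GetDirectory("part1")
//   };
//
//   IndexWriter writer = new IndexWriter(FSDirectory.GetDirectory("whole"), analyzer, true);
//   writer.AddIndexes(parts);   // transactional: a new segments_N is committed only if every part is added
//   writer.Close();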
/// CorruptIndexException if the index is corrupt /// IOException if there is a low-level IO error public virtual void AddIndexes(Directory[] dirs) { lock (this) { EnsureOpen(); if (infoStream != null) Message("flush at addIndexes"); Flush(); bool success = false; StartTransaction(); try { for (int i = 0; i < dirs.Length; i++) { SegmentInfos sis = new SegmentInfos(); // read infos from dir sis.Read(dirs[i]); for (int j = 0; j < sis.Count; j++) { segmentInfos.Add(sis.Info(j)); // add each info } } Optimize(); success = true; } finally { if (success) { CommitTransaction(); } else { RollbackTransaction(); } } } } private void ResetMergeExceptions() { lock (this) { mergeExceptions = new System.Collections.ArrayList(); mergeGen++; } } /// Merges all segments from an array of indexes into this index. ///

/// This is similar to addIndexes(Directory[]). However, no optimize()
/// is called either at the beginning or at the end. Instead, merges
/// are carried out as necessary.
///
/// NOTE: the index in each Directory must not be
/// changed (opened by a writer) while this method is
/// running. This method does not acquire a write lock in
/// each input Directory, so it is up to the caller to
/// enforce this.
///
/// This requires that this index not be among those to be added, and that
/// the upper bound of each added segment's doc count not exceed maxMergeDocs.
///
/// See {@link #AddIndexes(Directory[])} for
/// details on transactional semantics, temporary free
/// space required in the Directory, and non-CFS segments
/// on an Exception.
///
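// Usage sketch: same set-up as the AddIndexes(Directory[]) example above
// ("part0"/"part1" are illustrative paths), but without the implicit
// optimize; ordinary merges are triggered as the merge policy requires.
// Note that the target index must not itself be one of the inputs.
//
//   writer.AddIndexesNoOptimize(new Directory[]
//   {
//       FSDirectory.GetDirectory("part0"),
//       FSDirectory.GetDirectory("part1")
//   });
//   writer.Close();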
/// CorruptIndexException if the index is corrupt /// IOException if there is a low-level IO error public virtual void AddIndexesNoOptimize(Directory[] dirs) { lock (this) { EnsureOpen(); if (infoStream != null) Message("flush at addIndexesNoOptimize"); Flush(); bool success = false; StartTransaction(); try { for (int i = 0; i < dirs.Length; i++) { if (directory == dirs[i]) { // cannot add this index: segments may be deleted in merge before added throw new System.ArgumentException("Cannot add this index to itself"); } SegmentInfos sis = new SegmentInfos(); // read infos from dir sis.Read(dirs[i]); for (int j = 0; j < sis.Count; j++) { SegmentInfo info = sis.Info(j); segmentInfos.Add(info); // add each info } } MaybeMerge(); // If after merging there remain segments in the index // that are in a different directory, just copy these // over into our index. This is necessary (before // finishing the transaction) to avoid leaving the // index in an unusable (inconsistent) state. CopyExternalSegments(); success = true; } finally { if (success) { CommitTransaction(); } else { RollbackTransaction(); } } } } /* If any of our segments are using a directory != ours * then copy them over. Currently this is only used by * addIndexesNoOptimize(). */ private void CopyExternalSegments() { lock (this) { int numSegments = segmentInfos.Count; for (int i = 0; i < numSegments; i++) { SegmentInfo info = segmentInfos.Info(i); if (info.dir != directory) { MergePolicy.OneMerge merge = new MergePolicy.OneMerge(segmentInfos.Range(i, 1 + i), info.GetUseCompoundFile()); if (RegisterMerge(merge)) { pendingMerges.Remove(merge); runningMerges.Add(merge, merge); Merge(merge); } // This means there is a bug in the // MergeScheduler. MergeSchedulers in general are // not allowed to run a merge involving segments // external to this IndexWriter's directory in the // background because this would put the index // into an inconsistent state (where segmentInfos // has been written with such external segments // that an IndexReader would fail to load). else throw new MergePolicy.MergeException("segment \"" + info.name + " exists in external directory yet the MergeScheduler executed the merge in a separate thread"); } } } } /// Merges the provided indexes into this index. ///

/// After this completes, the index is optimized.
///
/// The provided IndexReaders are not closed.
///
/// See {@link #AddIndexes(Directory[])} for
/// details on transactional semantics, temporary free
/// space required in the Directory, and non-CFS segments
/// on an Exception.
///
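// Usage sketch: merging via already-open readers instead of Directories,
// assuming "writer", "dirA" and "dirB" exist in the caller's code
// (illustrative names). The readers are left open by AddIndexes and must be
// closed by the caller.
//
//   IndexReader[] readers = new IndexReader[]
//   {
//       IndexReader.Open(dirA),
//       IndexReader.Open(dirB)
//   };
//   writer.AddIndexes(readers);
//   foreach (IndexReader r in readers)
//       r.Close();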
/// CorruptIndexException if the index is corrupt /// IOException if there is a low-level IO error public virtual void AddIndexes(IndexReader[] readers) { lock (this) { EnsureOpen(); Optimize(); // start with zero or 1 seg System.String mergedName = NewSegmentName(); SegmentMerger merger = new SegmentMerger(this, mergedName, null); SegmentInfo info; IndexReader sReader = null; try { if (segmentInfos.Count == 1) { // add existing index, if any sReader = SegmentReader.Get(segmentInfos.Info(0)); merger.Add(sReader); } for (int i = 0; i < readers.Length; i++) // add new indexes merger.Add(readers[i]); bool success = false; StartTransaction(); try { int docCount = merger.Merge(); // merge 'em if (sReader != null) { sReader.Close(); sReader = null; } segmentInfos.RemoveRange(0, segmentInfos.Count); // pop old infos & add new info = new SegmentInfo(mergedName, docCount, directory, false, true, - 1, null, false); segmentInfos.Add(info); success = true; } finally { if (!success) { if (infoStream != null) Message("hit exception in addIndexes during merge"); RollbackTransaction(); } else { CommitTransaction(); } } } finally { if (sReader != null) { sReader.Close(); } } if (mergePolicy is LogMergePolicy && GetUseCompoundFile()) { bool success = false; StartTransaction(); try { merger.CreateCompoundFile(mergedName + ".cfs"); info.SetUseCompoundFile(true); } finally { if (!success) { if (infoStream != null) Message("hit exception building compound file in addIndexes during merge"); RollbackTransaction(); } else { CommitTransaction(); } } } } } // This is called after pending added and deleted // documents have been flushed to the Directory but before // the change is committed (new segments_N file written). internal virtual void DoAfterFlush() { } /// Flush all in-memory buffered updates (adds and deletes) /// to the Directory. ///

/// Note: if autoCommit=false, flushed data is still
/// not visible to readers until {@link #close} is called.
///
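// Usage sketch: forcing buffered changes out to the Directory, assuming a
// writer opened with autoCommit=true and an existing Document "doc"
// (illustrative); with autoCommit=true each flush also writes a segments_N
// that reopened readers can see, whereas with autoCommit=false the files are
// written but remain invisible until Close().
//
//   writer.AddDocument(doc);
//   writer.DeleteDocuments(new Term("id", "42"));   // illustrative field/value
//   writer.Flush();                                 // push buffered adds + deletes to disk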

/// CorruptIndexException if the index is corrupt /// IOException if there is a low-level IO error public void Flush() { Flush(true, false); } /// Flush all in-memory buffered udpates (adds and deletes) /// to the Directory. /// /// if true, we may merge segments (if /// deletes or docs were flushed) if necessary /// /// if false we are allowed to keep /// doc stores open to share with the next segment /// public /*protected internal*/ void Flush(bool triggerMerge, bool flushDocStores) { EnsureOpen(); if (DoFlush(flushDocStores) && triggerMerge) MaybeMerge(); } private bool DoFlush(bool flushDocStores) { lock (this) { // Make sure no threads are actively adding a document // Returns true if docWriter is currently aborting, in // which case we skip flushing this segment if (docWriter.PauseAllThreads()) { docWriter.ResumeAllThreads(); return false; } try { SegmentInfo newSegment = null; int numDocs = docWriter.GetNumDocsInRAM(); // Always flush docs if there are any bool flushDocs = numDocs > 0; // With autoCommit=true we always must flush the doc // stores when we flush flushDocStores |= autoCommit; System.String docStoreSegment = docWriter.GetDocStoreSegment(); if (docStoreSegment == null) flushDocStores = false; // Always flush deletes if there are any delete terms. // TODO: when autoCommit=false we don't have to flush // deletes with every flushed segment; we can save // CPU/IO by buffering longer & flushing deletes only // when they are full or writer is being closed. We // have to fix the "applyDeletesSelectively" logic to // apply to more than just the last flushed segment bool flushDeletes = docWriter.HasDeletes(); if (infoStream != null) { Message(" flush: segment=" + docWriter.GetSegment() + " docStoreSegment=" + docWriter.GetDocStoreSegment() + " docStoreOffset=" + docWriter.GetDocStoreOffset() + " flushDocs=" + flushDocs + " flushDeletes=" + flushDeletes + " flushDocStores=" + flushDocStores + " numDocs=" + numDocs + " numBufDelTerms=" + docWriter.GetNumBufferedDeleteTerms()); Message(" index before flush " + SegString()); } int docStoreOffset = docWriter.GetDocStoreOffset(); // docStoreOffset should only be non-zero when // autoCommit == false System.Diagnostics.Debug.Assert(!autoCommit || 0 == docStoreOffset); bool docStoreIsCompoundFile = false; // Check if the doc stores must be separately flushed // because other segments, besides the one we are about // to flush, reference it if (flushDocStores && (!flushDocs || !docWriter.GetSegment().Equals(docWriter.GetDocStoreSegment()))) { // We must separately flush the doc store if (infoStream != null) Message(" flush shared docStore segment " + docStoreSegment); docStoreIsCompoundFile = FlushDocStores(); flushDocStores = false; } System.String segment = docWriter.GetSegment(); // If we are flushing docs, segment must not be null: System.Diagnostics.Debug.Assert(segment != null || !flushDocs); if (flushDocs || flushDeletes) { SegmentInfos rollback = null; if (flushDeletes) rollback = (SegmentInfos) segmentInfos.Clone(); bool success = false; try { if (flushDocs) { if (0 == docStoreOffset && flushDocStores) { // This means we are flushing private doc stores // with this segment, so it will not be shared // with other segments System.Diagnostics.Debug.Assert(docStoreSegment != null); System.Diagnostics.Debug.Assert(docStoreSegment.Equals(segment)); docStoreOffset = - 1; docStoreIsCompoundFile = false; docStoreSegment = null; } int flushedDocCount = docWriter.Flush(flushDocStores); newSegment = new SegmentInfo(segment, flushedDocCount, 
directory, false, true, docStoreOffset, docStoreSegment, docStoreIsCompoundFile); segmentInfos.Add(newSegment); } if (flushDeletes) { // we should be able to change this so we can // buffer deletes longer and then flush them to // multiple flushed segments, when // autoCommit=false ApplyDeletes(flushDocs); DoAfterFlush(); } Checkpoint(); success = true; } finally { if (!success) { if (infoStream != null) Message("hit exception flushing segment " + segment); if (flushDeletes) { // Carefully check if any partial .del files // should be removed: int size = rollback.Count; for (int i = 0; i < size; i++) { System.String newDelFileName = segmentInfos.Info(i).GetDelFileName(); System.String delFileName = rollback.Info(i).GetDelFileName(); if (newDelFileName != null && !newDelFileName.Equals(delFileName)) deleter.DeleteFile(newDelFileName); } // Fully replace the segmentInfos since flushed // deletes could have changed any of the // SegmentInfo instances: segmentInfos.Clear(); segmentInfos.AddRange(rollback); } else { // Remove segment we added, if any: if (newSegment != null && segmentInfos.Count > 0 && segmentInfos.Info(segmentInfos.Count - 1) == newSegment) segmentInfos.RemoveAt(segmentInfos.Count - 1); } if (flushDocs) docWriter.Abort(null); DeletePartialSegmentsFile(); deleter.Checkpoint(segmentInfos, false); if (segment != null) deleter.Refresh(segment); } } deleter.Checkpoint(segmentInfos, autoCommit); if (flushDocs && mergePolicy.UseCompoundFile(segmentInfos, newSegment)) { success = false; try { docWriter.CreateCompoundFile(segment); newSegment.SetUseCompoundFile(true); Checkpoint(); success = true; } finally { if (!success) { if (infoStream != null) Message("hit exception creating compound file for newly flushed segment " + segment); newSegment.SetUseCompoundFile(false); deleter.DeleteFile(segment + "." + IndexFileNames.COMPOUND_FILE_EXTENSION); DeletePartialSegmentsFile(); } } deleter.Checkpoint(segmentInfos, autoCommit); } return true; } else { return false; } } finally { docWriter.ClearFlushPending(); docWriter.ResumeAllThreads(); } } } /// Expert: Return the total size of all index files currently cached in memory. /// Useful for size management with flushRamDocs() /// public long RamSizeInBytes() { EnsureOpen(); return docWriter.GetRAMUsed(); } /// Expert: Return the number of documents whose segments are currently cached in memory. 
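// Usage sketch: flushing by buffered RAM instead of by document count,
// assuming an illustrative 32 MB budget. (SetRAMBufferSizeMB is the usual way
// to have this happen automatically; the explicit check below only
// illustrates RamSizeInBytes.)
//
//   writer.AddDocument(doc);
//   if (writer.RamSizeInBytes() > 32 * 1024 * 1024)
//       writer.Flush();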
/// Useful when calling flush() /// public int NumRamDocs() { lock (this) { EnsureOpen(); return docWriter.GetNumDocsInRAM(); } } private int EnsureContiguousMerge(MergePolicy.OneMerge merge) { int first = segmentInfos.IndexOf(merge.segments.Info(0)); if (first == - 1) throw new MergePolicy.MergeException("could not find segment " + merge.segments.Info(0).name + " in current segments"); int numSegments = segmentInfos.Count; int numSegmentsToMerge = merge.segments.Count; for (int i = 0; i < numSegmentsToMerge; i++) { SegmentInfo info = merge.segments.Info(i); if (first + i >= numSegments || !segmentInfos.Info(first + i).Equals(info)) { if (segmentInfos.IndexOf(info) == - 1) throw new MergePolicy.MergeException("MergePolicy selected a segment (" + info.name + ") that is not in the index"); else { throw new MergePolicy.MergeException("MergePolicy selected non-contiguous segments to merge (" + merge + " vs " + SegString() + "), which IndexWriter (currently) cannot handle"); } } } return first; } /* FIXME if we want to support non-contiguous segment merges */ private bool CommitMerge(MergePolicy.OneMerge merge) { lock (this) { System.Diagnostics.Debug.Assert(merge.registerDone); // If merge was explicitly aborted, or, if abort() or // rollbackTransaction() had been called since our merge // started (which results in an unqualified // deleter.refresh() call that will remove any index // file that current segments does not reference), we // abort this merge if (merge.IsAborted()) { if (infoStream != null) Message("commitMerge: skipping merge " + merge.SegString(directory) + ": it was aborted"); System.Diagnostics.Debug.Assert(merge.increfDone); DecrefMergeSegments(merge); deleter.Refresh(merge.info.name); return false; } bool success = false; int start; try { SegmentInfos sourceSegmentsClone = merge.segmentsClone; SegmentInfos sourceSegments = merge.segments; start = EnsureContiguousMerge(merge); if (infoStream != null) Message("commitMerge " + merge.SegString(directory)); // Carefully merge deletes that occurred after we // started merging: BitVector deletes = null; int docUpto = 0; int numSegmentsToMerge = sourceSegments.Count; for (int i = 0; i < numSegmentsToMerge; i++) { SegmentInfo previousInfo = sourceSegmentsClone.Info(i); SegmentInfo currentInfo = sourceSegments.Info(i); System.Diagnostics.Debug.Assert(currentInfo.docCount == previousInfo.docCount); int docCount = currentInfo.docCount; if (previousInfo.HasDeletions()) { // There were deletes on this segment when the merge // started. The merge has collapsed away those // deletes, but, if new deletes were flushed since // the merge started, we must now carefully keep any // newly flushed deletes but mapping them to the new // docIDs. 
System.Diagnostics.Debug.Assert(currentInfo.HasDeletions()); // Load deletes present @ start of merge, for this segment: BitVector previousDeletes = new BitVector(previousInfo.dir, previousInfo.GetDelFileName()); if (!currentInfo.GetDelFileName().Equals(previousInfo.GetDelFileName())) { // This means this segment has had new deletes // committed since we started the merge, so we // must merge them: if (deletes == null) deletes = new BitVector(merge.info.docCount); BitVector currentDeletes = new BitVector(currentInfo.dir, currentInfo.GetDelFileName()); for (int j = 0; j < docCount; j++) { if (previousDeletes.Get(j)) System.Diagnostics.Debug.Assert(currentDeletes.Get(j)); else { if (currentDeletes.Get(j)) deletes.Set(docUpto); docUpto++; } } } else docUpto += docCount - previousDeletes.Count(); } else if (currentInfo.HasDeletions()) { // This segment had no deletes before but now it // does: if (deletes == null) deletes = new BitVector(merge.info.docCount); BitVector currentDeletes = new BitVector(directory, currentInfo.GetDelFileName()); for (int j = 0; j < docCount; j++) { if (currentDeletes.Get(j)) deletes.Set(docUpto); docUpto++; } } // No deletes before or after else docUpto += currentInfo.docCount; merge.CheckAborted(directory); } if (deletes != null) { merge.info.AdvanceDelGen(); deletes.Write(directory, merge.info.GetDelFileName()); } success = true; } finally { if (!success) { if (infoStream != null) Message("hit exception creating merged deletes file"); deleter.Refresh(merge.info.name); } } // Simple optimization: if the doc store we are using // has been closed and is in now compound format (but // wasn't when we started), then we will switch to the // compound format as well: System.String mergeDocStoreSegment = merge.info.GetDocStoreSegment(); if (mergeDocStoreSegment != null && !merge.info.GetDocStoreIsCompoundFile()) { int size = segmentInfos.Count; for (int i = 0; i < size; i++) { SegmentInfo info = segmentInfos.Info(i); System.String docStoreSegment = info.GetDocStoreSegment(); if (docStoreSegment != null && docStoreSegment.Equals(mergeDocStoreSegment) && info.GetDocStoreIsCompoundFile()) { merge.info.SetDocStoreIsCompoundFile(true); break; } } } success = false; SegmentInfos rollback = null; try { rollback = (SegmentInfos) segmentInfos.Clone(); ((System.Collections.IList) ((System.Collections.ArrayList) segmentInfos).GetRange(start, start + merge.segments.Count - start)).Clear(); segmentInfos.Insert(start, merge.info); Checkpoint(); success = true; } finally { if (!success && rollback != null) { if (infoStream != null) Message("hit exception when checkpointing after merge"); segmentInfos.Clear(); segmentInfos.AddRange(rollback); DeletePartialSegmentsFile(); deleter.Refresh(merge.info.name); } } if (merge.optimize) segmentsToOptimize.Add(merge.info, merge.info); // Must checkpoint before decrefing so any newly // referenced files in the new merge.info are incref'd // first: deleter.Checkpoint(segmentInfos, autoCommit); DecrefMergeSegments(merge); return true; } } private void DecrefMergeSegments(MergePolicy.OneMerge merge) { SegmentInfos sourceSegmentsClone = merge.segmentsClone; int numSegmentsToMerge = sourceSegmentsClone.Count; System.Diagnostics.Debug.Assert(merge.increfDone); merge.increfDone = false; for (int i = 0; i < numSegmentsToMerge; i++) { SegmentInfo previousInfo = sourceSegmentsClone.Info(i); // Decref all files for this SegmentInfo (this // matches the incref in mergeInit): if (previousInfo.dir == directory) deleter.DecRef(previousInfo.Files()); } } /// 
Merges the indicated segments, replacing them in the stack with a /// single segment. /// public /*internal*/ void Merge(MergePolicy.OneMerge merge) { System.Diagnostics.Debug.Assert(merge.registerDone); System.Diagnostics.Debug.Assert(!merge.optimize || merge.maxNumSegmentsOptimize > 0); bool success = false; try { try { if (merge.info == null) MergeInit(merge); if (infoStream != null) Message("now merge\n merge=" + merge.SegString(directory) + "\n index=" + SegString()); MergeMiddle(merge); success = true; } catch (MergePolicy.MergeAbortedException e) { merge.SetException(e); AddMergeException(merge); // We can ignore this exception, unless the merge // involves segments from external directories, in // which case we must throw it so, for example, the // rollbackTransaction code in addIndexes* is // executed. if (merge.isExternal) throw e; } } finally { lock (this) { try { if (!success && infoStream != null) Message("hit exception during merge"); MergeFinish(merge); // This merge (and, generally, any change to the // segments) may now enable new merges, so we call // merge policy & update pending merges. if (success && !merge.IsAborted() && !closed && !closing) UpdatePendingMerges(merge.maxNumSegmentsOptimize, merge.optimize); } finally { runningMerges.Remove(merge); // Optimize may be waiting on the final optimize // merge to finish; and finishMerges() may be // waiting for all merges to finish: System.Threading.Monitor.PulseAll(this); } } } } /// Checks whether this merge involves any segments /// already participating in a merge. If not, this merge /// is "registered", meaning we record that its segments /// are now participating in a merge, and true is /// returned. Else (the merge conflicts) false is /// returned. /// internal bool RegisterMerge(MergePolicy.OneMerge merge) { lock (this) { if (merge.registerDone) return true; int count = merge.segments.Count; bool isExternal = false; for (int i = 0; i < count; i++) { SegmentInfo info = merge.segments.Info(i); if (mergingSegments.Contains(info)) return false; if (segmentInfos.IndexOf(info) == - 1) return false; if (info.dir != directory) isExternal = true; } pendingMerges.Add(merge); if (infoStream != null) Message("add merge to pendingMerges: " + merge.SegString(directory) + " [total " + pendingMerges.Count + " pending]"); merge.mergeGen = mergeGen; merge.isExternal = isExternal; // OK it does not conflict; now record that this merge // is running (while synchronized) to avoid race // condition where two conflicting merges from different // threads, start for (int i = 0; i < count; i++) if (!mergingSegments.Contains(merge.segments.Info(i))) mergingSegments.Add(merge.segments.Info(i), merge.segments.Info(i)); // Merge is now registered merge.registerDone = true; return true; } } /// Does initial setup for a merge, which is fast but holds /// the synchronized lock on IndexWriter instance. /// internal void MergeInit(MergePolicy.OneMerge merge) { lock (this) { System.Diagnostics.Debug.Assert(merge.registerDone); if (merge.IsAborted()) return ; SegmentInfos sourceSegments = merge.segments; int end = sourceSegments.Count; EnsureContiguousMerge(merge); // Check whether this merge will allow us to skip // merging the doc stores (stored field & vectors). // This is a very substantial optimization (saves tons // of IO) that can only be applied with // autoCommit=false. 
Directory lastDir = directory; System.String lastDocStoreSegment = null; int next = - 1; bool mergeDocStores = false; bool doFlushDocStore = false; System.String currentDocStoreSegment = docWriter.GetDocStoreSegment(); // Test each segment to be merged: check if we need to // flush/merge doc stores for (int i = 0; i < end; i++) { SegmentInfo si = sourceSegments.Info(i); // If it has deletions we must merge the doc stores if (si.HasDeletions()) mergeDocStores = true; // If it has its own (private) doc stores we must // merge the doc stores if (- 1 == si.GetDocStoreOffset()) mergeDocStores = true; // If it has a different doc store segment than // previous segments, we must merge the doc stores System.String docStoreSegment = si.GetDocStoreSegment(); if (docStoreSegment == null) mergeDocStores = true; else if (lastDocStoreSegment == null) lastDocStoreSegment = docStoreSegment; else if (!lastDocStoreSegment.Equals(docStoreSegment)) mergeDocStores = true; // Segments' docScoreOffsets must be in-order, // contiguous. For the default merge policy now // this will always be the case but for an arbitrary // merge policy this may not be the case if (- 1 == next) next = si.GetDocStoreOffset() + si.docCount; else if (next != si.GetDocStoreOffset()) mergeDocStores = true; else next = si.GetDocStoreOffset() + si.docCount; // If the segment comes from a different directory // we must merge if (lastDir != si.dir) mergeDocStores = true; // If the segment is referencing the current "live" // doc store outputs then we must merge if (si.GetDocStoreOffset() != - 1 && currentDocStoreSegment != null && si.GetDocStoreSegment().Equals(currentDocStoreSegment)) doFlushDocStore = true; } int docStoreOffset; System.String docStoreSegment2; bool docStoreIsCompoundFile; if (mergeDocStores) { docStoreOffset = - 1; docStoreSegment2 = null; docStoreIsCompoundFile = false; } else { SegmentInfo si = sourceSegments.Info(0); docStoreOffset = si.GetDocStoreOffset(); docStoreSegment2 = si.GetDocStoreSegment(); docStoreIsCompoundFile = si.GetDocStoreIsCompoundFile(); } if (mergeDocStores && doFlushDocStore) { // SegmentMerger intends to merge the doc stores // (stored fields, vectors), and at least one of the // segments to be merged refers to the currently // live doc stores. // TODO: if we know we are about to merge away these // newly flushed doc store files then we should not // make compound file out of them... if (infoStream != null) Message("flush at merge"); Flush(false, true); } // We must take a full copy at this point so that we can // properly merge deletes in commitMerge() merge.segmentsClone = (SegmentInfos) merge.segments.Clone(); for (int i = 0; i < end; i++) { SegmentInfo si = merge.segmentsClone.Info(i); // IncRef all files for this segment info to make sure // they are not removed while we are trying to merge. if (si.dir == directory) deleter.IncRef(si.Files()); } merge.increfDone = true; merge.mergeDocStores = mergeDocStores; // Bind a new segment name here so even with // ConcurrentMergePolicy we keep deterministic segment // names. merge.info = new SegmentInfo(NewSegmentName(), 0, directory, false, true, docStoreOffset, docStoreSegment2, docStoreIsCompoundFile); // Also enroll the merged segment into mergingSegments; // this prevents it from getting selected for a merge // after our merge is done but while we are building the // CFS: mergingSegments.Add(merge.info, merge.info); } } /// Does fininishing for a merge, which is fast but holds /// the synchronized lock on IndexWriter instance. 
/// internal void MergeFinish(MergePolicy.OneMerge merge) { lock (this) { if (merge.increfDone) DecrefMergeSegments(merge); System.Diagnostics.Debug.Assert(merge.registerDone); SegmentInfos sourceSegments = merge.segments; int end = sourceSegments.Count; for (int i = 0; i < end; i++) mergingSegments.Remove(sourceSegments.Info(i)); if (merge.info != null) mergingSegments.Remove(merge.info); merge.registerDone = false; } } /// Does the actual (time-consuming) work of the merge, /// but without holding synchronized lock on IndexWriter /// instance /// private int MergeMiddle(MergePolicy.OneMerge merge) { merge.CheckAborted(directory); System.String mergedName = merge.info.name; SegmentMerger merger = null; int mergedDocCount = 0; SegmentInfos sourceSegments = merge.segments; SegmentInfos sourceSegmentsClone = merge.segmentsClone; int numSegments = sourceSegments.Count; if (infoStream != null) Message("merging " + merge.SegString(directory)); merger = new SegmentMerger(this, mergedName, merge); // This is try/finally to make sure merger's readers are // closed: bool success = false; try { int totDocCount = 0; for (int i = 0; i < numSegments; i++) { SegmentInfo si = sourceSegmentsClone.Info(i); IndexReader reader = SegmentReader.Get(si, MERGE_READ_BUFFER_SIZE, merge.mergeDocStores); // no need to set deleter (yet) merger.Add(reader); totDocCount += reader.NumDocs(); } if (infoStream != null) { Message("merge: total " + totDocCount + " docs"); } merge.CheckAborted(directory); mergedDocCount = merge.info.docCount = merger.Merge(merge.mergeDocStores); System.Diagnostics.Debug.Assert(mergedDocCount == totDocCount); success = true; } finally { // close readers before we attempt to delete // now-obsolete segments if (merger != null) { merger.CloseReaders(); } if (!success) { if (infoStream != null) Message("hit exception during merge; now refresh deleter on segment " + mergedName); lock (this) { AddMergeException(merge); deleter.Refresh(mergedName); } } } if (!CommitMerge(merge)) // commitMerge will return false if this merge was aborted return 0; if (merge.useCompoundFile) { success = false; bool skip = false; System.String compoundFileName = mergedName + "." + IndexFileNames.COMPOUND_FILE_EXTENSION; try { try { merger.CreateCompoundFile(compoundFileName); success = true; } catch (System.IO.IOException ioe) { lock (this) { if (segmentInfos.IndexOf(merge.info) == - 1) { // If another merge kicked in and merged our // new segment away while we were trying to // build the compound file, we can hit a // FileNotFoundException and possibly // IOException over NFS. We can tell this has // happened because our SegmentInfo is no // longer in the segments; if this has // happened it is safe to ignore the exception // & skip finishing/committing our compound // file creating. if (infoStream != null) Message("hit exception creating compound file; ignoring it because our info (segment " + merge.info.name + ") has been merged away"); skip = true; } else throw ioe; } } } finally { if (!success) { if (infoStream != null) Message("hit exception creating compound file during merge: skip=" + skip); lock (this) { if (!skip) AddMergeException(merge); deleter.DeleteFile(compoundFileName); } } } if (!skip) { lock (this) { if (skip || segmentInfos.IndexOf(merge.info) == - 1 || merge.IsAborted()) { // Our segment (committed in non-compound // format) got merged away while we were // building the compound format. 
deleter.DeleteFile(compoundFileName); } else { success = false; try { merge.info.SetUseCompoundFile(true); Checkpoint(); success = true; } finally { if (!success) { if (infoStream != null) Message("hit exception checkpointing compound file during merge"); // Must rollback: AddMergeException(merge); merge.info.SetUseCompoundFile(false); DeletePartialSegmentsFile(); deleter.DeleteFile(compoundFileName); } } // Give deleter a chance to remove files now. deleter.Checkpoint(segmentInfos, autoCommit); } } } } return mergedDocCount; } internal virtual void AddMergeException(MergePolicy.OneMerge merge) { lock (this) { if (!mergeExceptions.Contains(merge) && mergeGen == merge.mergeGen) mergeExceptions.Add(merge); } } private void DeletePartialSegmentsFile() { if (segmentInfos.GetLastGeneration() != segmentInfos.GetGeneration()) { System.String segmentFileName = IndexFileNames.FileNameFromGeneration(IndexFileNames.SEGMENTS, "", segmentInfos.GetGeneration()); if (infoStream != null) Message("now delete partial segments file \"" + segmentFileName + "\""); deleter.DeleteFile(segmentFileName); } } // Called during flush to apply any buffered deletes. If // flushedNewSegment is true then a new segment was just // created and flushed from the ram segments, so we will // selectively apply the deletes to that new segment. private void ApplyDeletes(bool flushedNewSegment) { System.Collections.Hashtable bufferedDeleteTerms = docWriter.GetBufferedDeleteTerms(); System.Collections.IList bufferedDeleteDocIDs = docWriter.GetBufferedDeleteDocIDs(); if (infoStream != null) Message("flush " + docWriter.GetNumBufferedDeleteTerms() + " buffered deleted terms and " + bufferedDeleteDocIDs.Count + " deleted docIDs on " + segmentInfos.Count + " segments."); if (flushedNewSegment) { IndexReader reader = null; try { // Open readers w/o opening the stored fields / // vectors because these files may still be held // open for writing by docWriter reader = SegmentReader.Get(segmentInfos.Info(segmentInfos.Count - 1), false); // Apply delete terms to the segment just flushed from ram // apply appropriately so that a delete term is only applied to // the documents buffered before it, not those buffered after it. ApplyDeletesSelectively(bufferedDeleteTerms, bufferedDeleteDocIDs, reader); } finally { if (reader != null) { try { reader.DoCommit(); } finally { reader.DoClose(); } } } } int infosEnd = segmentInfos.Count; if (flushedNewSegment) { infosEnd--; } for (int i = 0; i < infosEnd; i++) { IndexReader reader = null; try { reader = SegmentReader.Get(segmentInfos.Info(i), false); // Apply delete terms to disk segments // except the one just flushed from ram. ApplyDeletes(bufferedDeleteTerms, reader); } finally { if (reader != null) { try { reader.DoCommit(); } finally { reader.DoClose(); } } } } // Clean up bufferedDeleteTerms. docWriter.ClearBufferedDeletes(); } // For test purposes. public /*internal*/ int GetBufferedDeleteTermsSize() { lock (this) { return docWriter.GetBufferedDeleteTerms().Count; } } // For test purposes. public /*internal*/ int GetNumBufferedDeleteTerms() { lock (this) { return docWriter.GetNumBufferedDeleteTerms(); } } // Apply buffered delete terms to the segment just flushed from ram // apply appropriately so that a delete term is only applied to // the documents buffered before it, not those buffered after it. 
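// Illustration of the cutoff rule implemented below, phrased in terms of the
// public API (document contents and the resulting doc numbers are made up):
//
//   writer.AddDocument(docA);                       // buffered as doc 0
//   writer.DeleteDocuments(new Term("id", "A"));    // buffered delete records num == 1
//   writer.AddDocument(docAagain);                  // buffered as doc 1, also matches id:A
//   writer.Flush();
//   // Only doc 0 is deleted: the buffered term is applied solely to documents
//   // buffered before the DeleteDocuments call (doc < num), so doc 1 survives.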
private void ApplyDeletesSelectively(System.Collections.Hashtable deleteTerms, System.Collections.IList deleteIds, IndexReader reader) { System.Collections.IEnumerator iter = new System.Collections.Hashtable(deleteTerms).GetEnumerator(); while (iter.MoveNext()) { System.Collections.DictionaryEntry entry = (System.Collections.DictionaryEntry) iter.Current; Term term = (Term) entry.Key; TermDocs docs = reader.TermDocs(term); if (docs != null) { int num = ((DocumentsWriter.Num) entry.Value).GetNum(); try { while (docs.Next()) { int doc = docs.Doc(); if (doc >= num) { break; } reader.DeleteDocument(doc); } } finally { docs.Close(); } } } if (deleteIds.Count > 0) { iter = deleteIds.GetEnumerator(); while (iter.MoveNext()) { reader.DeleteDocument(((System.Int32) iter.Current)); } } } // Apply buffered delete terms to this reader. private void ApplyDeletes(System.Collections.Hashtable deleteTerms, IndexReader reader) { System.Collections.IEnumerator iter = new System.Collections.Hashtable(deleteTerms).GetEnumerator(); while (iter.MoveNext()) { System.Collections.DictionaryEntry entry = (System.Collections.DictionaryEntry) iter.Current; reader.DeleteDocuments((Term) entry.Key); } } // utility routines for tests public /*internal*/ virtual SegmentInfo NewestSegment() { return segmentInfos.Info(segmentInfos.Count - 1); } public virtual System.String SegString() { lock (this) { System.Text.StringBuilder buffer = new System.Text.StringBuilder(); for (int i = 0; i < segmentInfos.Count; i++) { if (i > 0) { buffer.Append(' '); } buffer.Append(segmentInfos.Info(i).SegString(directory)); } return buffer.ToString(); } } static IndexWriter() { DEFAULT_MERGE_FACTOR = LogMergePolicy.DEFAULT_MERGE_FACTOR; DEFAULT_MAX_MERGE_DOCS = LogDocMergePolicy.DEFAULT_MAX_MERGE_DOCS; MAX_TERM_LENGTH = DocumentsWriter.MAX_TERM_LENGTH; } } }