-----
 Indexer Design
 -----
 Brett Porter
 -----
 25 July 2006
 -----

~~ Copyright 2006 The Apache Software Foundation.
~~
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~      http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.

~~ NOTE: For help with the syntax of this file, see:
~~ http://maven.apache.org/guides/mini/guide-apt-format.html

Indexer Design

  <<Note: The current indexer design is under review. This document will grow into what it should be, and the code and
  tests will be refactored to match.>>

  ~~TODO: separate API design from Lucene implementation design

* Standard Artifact Index

  We currently want to index these elements from the repository:

    * for each artifact file: the artifact ID, version, group ID, classifier, type (extension), filename (including path
      from the repository base), checksums (md5, sha1) and size

    * for each artifact POM: the packaging, licenses, dependencies, build plugins, reporting plugins

    * plugin prefix

    * Java classes within a JAR artifact (delimited by \n)

    * filenames within an archive (delimited by \n)

    * the identifier of the source repository
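
  The two newline-delimited fields in the list above could be built from a JAR's entry names roughly as follows. This
  is a minimal sketch only: the class <<<ClassListSketch>>> and its methods are hypothetical names, and skipping inner
  classes is an assumption rather than documented indexer behaviour.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the indexer API.
class ClassListSketch {

    /** Builds the newline-delimited class list from the entry names of a JAR. */
    static String classField(List<String> entryNames) {
        List<String> classes = new ArrayList<>();
        for (String name : entryNames) {
            // Assumption: only top-level classes are listed, so names containing '$' are skipped.
            if (name.endsWith(".class") && !name.contains("$")) {
                String stripped = name.substring(0, name.length() - ".class".length());
                classes.add(stripped.replace('/', '.'));
            }
        }
        return String.join("\n", classes);
    }

    /** Builds the newline-delimited filename list; every entry is kept unchanged. */
    static String filesField(List<String> entryNames) {
        return String.join("\n", entryNames);
    }
}
```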

  Each record in the index refers to an artifact. Since the content for a record can come from various sources, the
  record may need to be updated when different files that are related to the same artifact are discovered (i.e., the
  POM, or for plugins the metadata that contains their prefix).

  To simplify this, the process for discovery is as follows:

    * When an artifact is discovered, its related POM and metadata are read from the repository and indexed with it,
      rather than relying on them being discovered separately. This ensures that partial discovery still yields
      correct results in all cases, and that the entire record can be constructed without reading back from the index.

    * POMs whose packaging is not <<<pom>>> are not sent to the indexer on their own, since they are read when the
      corresponding artifact is indexed.

  The result of this process is that an update to a POM or to repository metadata that is not accompanied by an update
  to the corresponding artifact(s) will not update the index. As POMs should not be modified, this is not a major
  concern. Likewise, updates to metadata should only accompany updates to the artifact itself, so they will not cause a
  problem.

  This approach may fail if discovery happens in the middle of a deployment made outside of the repository manager
  (where the artifact is present, but the metadata or POM is not yet). To avoid such cases, the discoverer should only
  detect changes that are more than a minute old (this blackout period should be configurable).
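
  The blackout check could look something like the sketch below. The class and method names are hypothetical, and the
  current time is passed in as a parameter purely to keep the example easy to test.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of the discovery blackout; not the actual discoverer code.
class BlackoutSketch {
    private final Duration blackout;

    BlackoutSketch(Duration blackout) {
        this.blackout = blackout;
    }

    /** Returns true only if the file was last modified strictly before (now - blackout). */
    boolean isEligible(Instant lastModified, Instant now) {
        return lastModified.isBefore(now.minus(blackout));
    }
}
```

  With a blackout of <<<Duration.ofMinutes(1)>>>, a file modified 30 seconds ago would be skipped and picked up on a
  later discovery run, by which time its POM and metadata should also be in place.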

  Other techniques were considered:

    * Processing each artifact file individually, updating each record as needed.  This would result in having to read
      back each index record before writing. This is quite costly in Lucene as it would be "read, delete, add". You
      must have a reader and writer open for that process, and it greatly complicates the code.

    * Maintaining three indices, one for each type of file. This would complicate searching (and may affect ranking of results, though this
      was not analysed). While Lucene is
      {{{http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-b11296f9e7b2a5e7496d67118d0a5898f2fd9823} capable of
      searching multiple indices}}, it is expected that the results would be in the form of a list of separate records
      rather than the "table join" this effectively is. A similar derivative of this technique would be to store
      everything in one index, using a field (previously, doctype) to identify each record.

  Records in the index are keyed by their path from the repository root. While this is longer than the dependency
  conflict ID, Lucene cannot delete by a combination of terms, so keying by the conflict ID would require storing an
  additional field in the index, whereas the file path is already stored.

  The plugin prefix could be found either from inside the plugin JAR (<<<META-INF/maven/plugin.xml>>>), or from the
  repository metadata for the plugin's group. For simplicity, the first approach will be used. This means that at
  present there is no need to index the repository metadata, though that may be considered in future.

  Note that archetypes currently don't have a packaging associated with them in Maven, so none is recorded in the POM.
  However, to be able to search by this type, the indexer will look for a <<<META-INF/maven/archetype.xml>>> file and,
  if it is found, set the record's packaging to <<<maven-archetype>>>. In the future, this handling will be deprecated,
  as the POMs can start using the appropriate packaging.
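
  The archetype handling described above amounts to a simple override of the packaging read from the POM. A minimal
  sketch, assuming the archive's entry names are already available as a collection (the class and method names are
  hypothetical):

```java
import java.util.Collection;

// Hypothetical sketch of the archetype packaging override; not the actual indexer code.
class PackagingSketch {
    static final String ARCHETYPE_DESCRIPTOR = "META-INF/maven/archetype.xml";

    /** Returns "maven-archetype" if the archive contains the archetype descriptor, else the POM's packaging. */
    static String effectivePackaging(String pomPackaging, Collection<String> entryNames) {
        if (entryNames.contains(ARCHETYPE_DESCRIPTOR)) {
            return "maven-archetype";
        }
        return pomPackaging;
    }
}
```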

  The index is shared among multiple repositories. The source repository is recorded in the index record. The
  discovery/conversion/reporting mechanisms are expected to deal with duplicates before reaching the indexer, so if the
  indexer encounters an artifact from a different repository than the one it was originally added from, it will simply
  replace the record.
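
  The keying and replacement behaviour can be pictured with an in-memory map standing in for the index. This is an
  illustration only: <<<SharedIndexSketch>>> is a hypothetical class, and the real index is a Lucene index rather than a
  map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the shared index; not the Lucene implementation.
class SharedIndexSketch {
    // Maps the path from the repository root to the identifier of the source repository.
    private final Map<String, String> repositoryByPath = new HashMap<>();

    /** Adds a record, silently replacing any existing record stored under the same path. */
    void addOrReplace(String path, String repositoryId) {
        repositoryByPath.put(path, repositoryId);
    }

    /** Deletes by the single key (the path); returns true if a record was removed. */
    boolean delete(String path) {
        return repositoryByPath.remove(path) != null;
    }

    String sourceRepository(String path) {
        return repositoryByPath.get(path);
    }

    int size() {
        return repositoryByPath.size();
    }
}
```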

  When indexing metadata from a POM, the POM should be loaded using the Maven project builder so that inheritance and
  interpolation are performed. This ensures that the record is as complete as possible, and that searching by
  fields that are inherited will reveal both the parent and the children in the search results.

* Reduced Size Index

  An additional index is maintained by the repository manager in the
  {{{../apidocs/org/apache/maven/archiva/indexing/MinimalArtifactIndexRecord.html} MinimalIndex}} class. This
  indexes all of the same artifacts as the first index, but stores them with shorter field names and less information to
  maintain a smaller size. This index is appropriate for use by certain clients such as IDE integration for fast
  searching. For a fuller interface to the repository information, the integration should use the XML-RPC interface.

  The following fields are in the reduced index:

    * <<<j>>>: The JAR filename

    * <<<s>>>: The JAR size

    * <<<d>>>: The last modified timestamp

    * <<<c>>>: A list of classes in the JAR (\n delimited)

    * <<<m>>>: The MD5 checksum of the JAR

    * <<<pk>>>: The primary key of the artifact

  Only JARs are indexed at present. The JAR filename is used as the key for later deleting entries.
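
  As a rough illustration of the field layout above, the following sketch maps a JAR's details onto the short field
  names. The class name, the parameter names and the choice to encode the size and timestamp as decimal strings are all
  assumptions made for the example, not a description of the actual record class.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the reduced index field layout; encodings are assumptions.
class MinimalRecordSketch {

    /** Maps the details of one JAR onto the short field names used by the reduced index. */
    static Map<String, String> toFields(String filename, long size, long lastModified,
                                        List<String> classes, String md5, String primaryKey) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("j", filename);                     // JAR filename, also the key for deletion
        fields.put("s", Long.toString(size));          // JAR size (assumed decimal string)
        fields.put("d", Long.toString(lastModified));  // last modified timestamp (assumed decimal string)
        fields.put("c", String.join("\n", classes));   // classes, newline delimited
        fields.put("m", md5);                          // MD5 checksum
        fields.put("pk", primaryKey);                  // primary key of the artifact
        return fields;
    }
}
```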

* Searching

  Searching will be reasonably flexible, though the general use case will be to enter a single parsed query that is
  applied to all fields in the index.

  Some features that will be available:

    * <Search through most fields for a particular keyword>: the general case described above.

    * <Search by a particular field (exact match)>: This would be needed for search by checksum.

    * <Search in a range of field values>: This would be needed for searching based on update time. Note that in
      Lucene it may be better to search by other fields (or return all), and then filter the results by dates rather
      than making dates part of a search query.

    * <Limit search to particular fields>: It will be useful to only search Java classes and packages, for example.
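
  The suggestion in the range-search item above, filtering results by date after the search rather than inside the
  query, might be sketched as follows. <<<Hit>>> is a hypothetical stand-in for a search result, and the inclusive
  bounds are an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of post-search date filtering; not the actual search API.
class DateFilterSketch {

    /** Hypothetical stand-in for a search result. */
    static final class Hit {
        final String path;
        final long lastModified;

        Hit(String path, long lastModified) {
            this.path = path;
            this.lastModified = lastModified;
        }
    }

    /** Keeps only the hits whose timestamp lies within [from, to], preserving their order. */
    static List<Hit> modifiedBetween(List<Hit> hits, long from, long to) {
        List<Hit> kept = new ArrayList<>();
        for (Hit hit : hits) {
            if (hit.lastModified >= from && hit.lastModified <= to) {
                kept.add(hit);
            }
        }
        return kept;
    }
}
```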

  Note also that, for performance reasons, it should be possible to compose the search results entirely from the index.
  Searching should not have to read any metadata files, or file properties such as size and checksum, from the disk.
  This enables searching a repository remotely without having the physical repository available, which is useful for
  IDE integration among other things.

  Note that to be able to do an exact match search, a field must be stored untokenized. Fields for which both
  tokenized and untokenized searches make sense will be stored twice. This currently includes: artifact ID, group ID,
  and version.
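
  The effect of storing a field twice can be illustrated without Lucene. In the sketch below, the class name and the
  tokenization rule (lower-casing and splitting on non-alphanumeric characters) are assumptions chosen for the
  example; the analyzer used by the real index may tokenize differently.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of a field stored both untokenized and tokenized; not Lucene code.
class DualFieldSketch {
    private final String exact;        // untokenized copy, used for exact matches
    private final Set<String> tokens;  // tokenized copy, used for keyword matches

    DualFieldSketch(String value) {
        this.exact = value;
        // Assumed tokenization rule: lower-case, then split on non-alphanumeric characters.
        this.tokens = new HashSet<>(Arrays.asList(value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")));
    }

    boolean matchesExact(String query) {
        return exact.equals(query);
    }

    boolean matchesKeyword(String query) {
        return tokens.contains(query.toLowerCase(Locale.ROOT));
    }
}
```

  Under these assumptions, a group ID of <<<org.apache.maven>>> matches the keyword <<<maven>>> through the tokenized
  copy, while an exact match succeeds only for the full value.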