 -----
 Indexer Design
 -----
 Brett Porter
 -----
 25 July 2006
 -----

~~ Copyright 2006 The Apache Software Foundation.
~~
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~      http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.

~~ NOTE: For help with the syntax of this file, see:
~~ http://maven.apache.org/guides/mini/guide-apt-format.html

Indexer Design

  <>

~~TODO: separate API design from Lucene implementation design

* Standard Artifact Index

  We currently want to index these elements from the repository:

   * for each artifact file: the artifact ID, version, group ID, classifier, type (extension), filename (including
     path from the repository base), checksums (md5, sha1) and size

   * for each artifact POM: the packaging, licenses, dependencies, build plugins, reporting plugins

   * plugin prefix

   * Java classes within a JAR artifact (delimited by \n)

   * filenames within an archive (delimited by \n)

   * the identifier of the source repository

  Each record in the index refers to an artifact. Since the content for a record can come from various sources, the
  record may need to be updated when different files that are related to the same artifact are discovered (i.e., the
  POM, or for plugins the metadata that contains their prefix).

  To simplify this, the process for discovery is as follows:

   * When an artifact is discovered, the related POM and metadata are read from the repository for indexing, rather
     than relying on them being discovered separately.
     This ensures that partial discovery still yields correct results in all cases, and that the entire record can
     be constructed without having to read back from the index.

   * POMs that do not have a packaging of POM are not sent to the indexer.

  The result of this process is that an update to a POM or to repository metadata that is not accompanied by an
  update to the corresponding artifact(s) will not update the index. As POMs should not be modified, this will not be
  a major concern. Likewise, updates to metadata will only accompany updates to the artifact itself, so will not
  cause a problem.

  The above process may have a problem if discovery happens in the middle of a deployment made outside of the
  repository manager (where the artifact is present, but the metadata or POM is not). To avoid such cases, the
  discoverer should only detect changes more than a minute old (this blackout should be configurable).

  Other techniques were considered:

   * Processing each artifact file individually, updating each record as needed. This would result in having to read
     back each index record before writing. This is quite costly in Lucene as it would be "read, delete, add". You
     must have a reader and writer open for that process, and it greatly complicates the code.

   * Having three indices, one for each type of file (artifact, POM and metadata). This would complicate searching
     (and may affect ranking of results, though this was not analysed). While Lucene is
     {{{http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-b11296f9e7b2a5e7496d67118d0a5898f2fd9823} capable of searching multiple indices}},
     it is expected that the results would be in the form of a list of separate records rather than the "table join"
     this effectively is. A similar derivative of this technique would be to store everything in one index, using a
     field (previously, doctype) to identify each record.

  Records in the index are keyed by their path from the repository root.
  While this is longer than using the dependency conflict ID, Lucene cannot delete by a combination of terms, so
  keying by the conflict ID would require storing an additional field in the index, whereas the file path is already
  stored there.

  The plugin prefix could be found either from inside the plugin JAR (<<>>), or from the repository metadata for the
  plugin's group. For simplicity, the first approach will be used. This means that at present there is no need to
  index the repository metadata, though that may be considered in future.

  Note that archetypes currently don't have a packaging associated with them in Maven, so it is not recorded in the
  POM. However, to be able to search by this type, the indexer will look for a <<>> file, and if found set its
  packaging to <<>>. In the future, this handling will be deprecated as the POMs can start using the appropriate
  packaging.

  The index is shared among multiple repositories. The source repository is recorded in the index record. The
  discovery/conversion/reporting mechanisms are expected to deal with duplicates before reaching the indexer, so if
  the indexer encounters an artifact from a repository other than the one it was already added from, it will simply
  replace the record.

  When indexing metadata from a POM, the POM should be loaded using the Maven project builder so that inheritance and
  interpolation are performed. This ensures that the record is as complete as possible, and that searching by fields
  that are inherited will reveal both the parent and the children in the search results.

* Reduced Size Index

  An additional index is maintained by the repository manager in the
  {{{../apidocs/org/apache/maven/archiva/indexing/MinimalArtifactIndexRecord.html} MinimalIndex}} class. This indexes
  all of the same artifacts as the first index, but stores them with shorter field names and less information to
  maintain a smaller size. This index is appropriate for use by certain clients such as IDE integration for fast
  searching.
  For a fuller interface to the repository information, the integration should use the XMLRPC interface.

  The following fields are in the reduced index:

   * <<>>: The JAR filename

   * <<>>: The JAR size

   * <<>>: The last modified timestamp

   * <<>>: A list of classes in the JAR (\n delimited)

   * <<>>: md5 checksum of the JAR

   * <<>>: the primary key of the artifact

  Only JARs are indexed at present. The JAR filename is used as the key for later deleting entries.

* Searching

  Searching will be reasonably flexible, though the general use case will be to enter a single parsed query that is
  applied to all fields in the index.

  Some features that will be available:

   * : the general case described above.

   * : This would be needed for search by checksum.

   * : This would be needed for searching based on update time. Note that in Lucene it may be better to search by
     other fields (or return all), and then filter the results by dates rather than making dates part of a search
     query.

   * : It will be useful to only search Java classes and packages, for example.

  Another thing to note is that the search results should be able to be composed entirely from the index for
  performance reasons. It should not have to read any metadata files or properties of files such as size and checksum
  from the disk. This enables searching a repository remotely without having the physical repository available, which
  is useful for IDE integration among other things.

  Note that to be able to do an exact match search, a field must be stored untokenized. For fields where it makes
  sense to search both tokenized and untokenized, they will be stored twice. This currently includes: artifact ID,
  group ID, and version.
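
  The idea of storing a field twice can be illustrated with a small, self-contained sketch. It is a toy in-memory
  index, not the Lucene-backed implementation: the class name <<<DualFieldIndex>>>, the <<<_u>>> suffix for the
  untokenized copy, the tokenizing rule and the sample keys are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative only: a toy in-memory index, not the actual Lucene-based indexer.
class DualFieldIndex {
    // Hypothetical suffix naming the untokenized (exact match) copy of a field.
    static final String EXACT_SUFFIX = "_u";

    // field name -> term -> keys of the records containing that term
    private final Map<String, Map<String, Set<String>>> postings = new HashMap<>();

    // Assumed tokenizer: lower-case, then split on anything that is not a letter or digit.
    static List<String> tokenize(String value) {
        List<String> tokens = new ArrayList<>();
        for (String token : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    private void post(String field, String term, String key) {
        postings.computeIfAbsent(field, f -> new HashMap<>())
                .computeIfAbsent(term, t -> new HashSet<>())
                .add(key);
    }

    // Store the value twice: once per token, and once verbatim under the suffixed field.
    void addField(String key, String field, String value) {
        for (String token : tokenize(value)) {
            post(field, token, key);
        }
        post(field + EXACT_SUFFIX, value, key);
    }

    // Matches records whose field contains the given word as a token.
    Set<String> searchTokenized(String field, String word) {
        return lookup(field, word.toLowerCase(Locale.ROOT));
    }

    // Matches records whose field equals the given value exactly.
    Set<String> searchExact(String field, String value) {
        return lookup(field + EXACT_SUFFIX, value);
    }

    private Set<String> lookup(String field, String term) {
        Map<String, Set<String>> terms = postings.get(field);
        if (terms == null) {
            return Collections.<String>emptySet();
        }
        Set<String> keys = terms.get(term);
        return keys == null ? Collections.<String>emptySet() : keys;
    }
}
```

  With a group ID of <<<org.apache.maven>>> indexed this way, a tokenized search for <<<maven>>> finds the record,
  while an exact search only matches the full value. The cost is a second copy of each such field, which is why the
  approach is limited to the few fields where both kinds of search make sense.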