Subversion on Berkeley DB -*- text -*- There are many different ways to implement the Subversion filesystem interface. You could implement it directly using ordinary POSIX filesystem operations; you could build it using an SQL server as a back end; you could build it on RCS; and so on. This implementation of the Subversion filesystem interface is built on top of Berkeley DB (http://www.sleepycat.com). Berkeley DB supports transactions and recoverability, making it well-suited for Subversion. Nodes and Node Revisions In a Subversion filesystem, a `node' corresponds roughly to an `inode' in a Unix filesystem: * A node is either a file or a directory. * A node's contents change over time. * When you change a node's contents, it's still the same node; it's just been changed. So a node's identity isn't bound to a specific set of contents. * If you rename a node, it's still the same node, just under a different name. So a node's identity isn't bound to a particular filename. A `node revision' refers to a node's contents at a specific point in time. Changing a node's contents always creates a new revision of that node. Once created, a node revision's contents never change. When we create a node, its initial contents are the initial revision of the node. As users make changes to the node over time, we create new revisions of that same node. When a user commits a change that deletes a file from the filesystem, we don't delete the node, or any revision of it --- those stick around to allow us to recreate prior revisions of the filesystem. Instead, we just remove the reference to the node from the directory. ID's Within the database, we refer to nodes and node revisions using a string of three unique identifiers (the "node ID", the "copy ID", and the "txn ID"), separated by periods. node_revision_id ::= node_id '.' copy_id '.' txn_id The node ID is unique to a particular node in the filesystem across all of revision history. That is, two node revisions who share revision history (perhaps because they are different revisions of the same node, or because one is a copy of the other, e.g.) have the same node ID, whereas two node revisions who have no common revision history will not have the same node ID. The copy ID is a key into the `copies' table (see `Copies' below), and identifies that a given node revision, or one of its ancestors, resulted from a unique filesystem copy operation. The txn ID is just an identifier that is unique to a single filesystem commit. All node revisions created as part of a commit share this txn ID (which, incidentally, gets its name from the fact that this id is the same id used as the primary key of Subversion transactions; see `Transactions' below). A directory entry identifies the file or subdirectory it refers to using a node revision ID --- not a node ID. This means that a change to a file far down in a directory hierarchy requires the parent directory of the changed node to be updated, to hold the new node revision ID. Now, since that parent directory has changed, its parent needs to be updated, and so on to the root. We call this process "bubble-up". If a particular subtree was unaffected by a given commit, the node revision ID that appears in its parent will be unchanged. When doing an update, we can notice this, and ignore that entire subtree. This makes it efficient to find localized changes in large trees. A Word About Keys Some of the Subversion database tables use base-36 numbers as their keys. Some debate exists about whether the use of base-36 (as opposed to, say, regular decimal values) is either necessary or good. It is outside the scope of this document to make a claim for or against this usage. As such, the reader will please note that for the majority of the document, the use of the term "number" when referring to keys of database tables should be interpreted to mean "a monotonically increasing unique key whose order with respect to other keys in the table is irrelevant". :-) To determine the actual type currently in use for the keys of a given table, you are invited to check out the "Appendix: Filesystem structure summary" section of this document. NODE-REVISION and HEADER: how we represent a node revision We represent a given revision of a file or directory node using a list skel (see skel.h for an explanation of skels). A node revision skel has the form: (HEADER PROP-KEY KIND-SPECIFIC ...) where HEADER is a header skel, whose structure is common to all nodes, PROP-KEY is the key of the representation that contains this node's properties list, and the KIND-SPECIFIC elements carry data dependent on what kind of node this is --- file, directory, etc. HEADER has the form: (KIND PREDECESSOR-ID PREDECESSOR-COUNT) where: * KIND indicates what sort of node this is. It must be one of the following: - "file", indicating that the node is a file (see FILE below). - "dir", indicating that the node is a directory (see DIR below). * PREDECESSOR-ID, if present, indicates the node revision which is the immediate ancestor of this node. * PREDECESSOR-COUNT, if present, indicates the number of predecessors the node revision has (recursively). Note that a node cannot change its kind from one revision to the next. A directory node is always a directory; a file node is always a file; etc. The fact that the node's kind is stored in each node revision, rather than in some revision-independent place, might suggest that it's possible for a node change kinds from revision to revision, but Subversion does not allow this. PROP-KEY is a key into the `representations' table (which see), whose value is a representation pointing to a string (see `strings' table) that is a PROPLIST skel. The KIND-SPECIFIC portions are discussed below. PROPLIST: a property list is a list skel of the form: (NAME1 VALUE1 NAME2 VALUE2 ...) where each NAMEi is the name of a property, and VALUEi is the value of the property named NAMEi. Every valid property list has an even number of elements. FILE: how files are represented. If a NODE-REVISION's header's KIND is "file", then the node-revision skel represents a file, and has the form: (HEADER PROP-KEY DATA-KEY [EDIT-DATA-KEY]) where DATA-KEY identifies the representation for the file's current contents, and EDIT-DATA-KEY identifies the a representation currently available for receiving new contents for the file. See discussion of representations later. DIR: how directories are represented. If the header's KIND is "dir", then the node-revision skel represents a directory, and has the form: (HEADER PROP-KEY ENTRIES-KEY) where ENTRIES-KEY identifies the representation for the directory's entries list (see discussion of representations later). An entries list has the form (ENTRY ...) where each entry is (NAME ID) where: * NAME is the name of the directory entry, in UTF-8, and * ID is the ID of the node revision to which this entry refers REPRESENTATIONS: where and how Subversion stores your data. Some parts of a node revision are essentially constant-length: for example, the KIND field and the REV. Other parts can have arbitrarily varying length: property lists, file contents, and directory entry lists. This variable-length data is often similar from one revision to the next, so Subversion stores just the deltas between them, instead of successive fulltexts. The HEADER portion of a node revision holds the constant-length stuff, which is never deltified. The rest of a node revision just points to data stored outside the node revision proper. This design makes the repository code easier to maintain, because deltification and undeltification are confined to a layer separate from node revisions, and makes the code more efficient, because Subversion can retrieve just the parts of a node it needs for a given operation. Deltifiable data is stored in the `strings' table, as mediated by the `representations' table. Here's how it works: The `strings' table stores only raw bytes. A given string could be any one of these: - a file's contents - a delta that reconstructs file contents, or part of a file's contents - a directory entry list skel - a delta that reconstructs a dir entry list skel, or part of same - a property list skel - a delta that reconstructs a property list skel, or part of same There is no way to tell, from looking at a string, what kind of data it is. A directory entry list skel is indistinguishable from file contents that just happen to look exactly like the unparsed form of a directory entry list skel. File contents that just happen to look like svndiff data are indistinguishable from delta data. The code is able to interpret a given string because Subversion a) knows whether to be looking for a property list or some kind-specific data, b) knows the `kind' of the node revision in question, c) always goes through the `representations' table to discover if any undeltification or other transformation is needed. The `representations' table is an intermediary between node revisions and strings. Node revisions never refer directly into the `strings' table; instead, they always refer into the `representations' table, which knows whether a given string is a fulltext or a delta, and if it is a delta, what it is a delta against. That, combined with the knowledge in (a) and (b) above, allows Subversion to retrieve the data and parse it appropriately. A representation has the form: (HEADER KIND-SPECIFIC) where HEADER is (KIND FLAG ...) The KIND is "fulltext" or "delta", and currently the only FLAG is "mutable". KIND-SPECIFIC varies considerably depending on the kind of representation. Here are the two forms currently recognized: (("fulltext" ...) KEY) The data is at KEY in the `strings' table. (("delta" ...) (OFFSET WINDOW) ...) Each OFFSET indicates the point in the fulltext that this element reconstructs, and WINDOW says how to reconstruct it: WINDOW ::= (DIFF SIZE CHECKSUM REP-KEY [REP-OFFSET]) ; DIFF ::= ("svndiff" VERSION STRING-KEY) Notice that a WINDOW holds only metadata. REP-KEY says what the window should be applied against, or none if this is a self-compressed delta; SIZE says how much data this window reconstructs; CHECKSUM is a checksum on just the portion of fulltext reconstructed by this window; VERSION says what version of the svndiff format is being used (currently only version 0 is supported); and STRING-KEY says which string contains the actual svndiff data (there is no diff data held directly in the representations table, of course). Note also that REP-KEY might refer to a representation that itself requires undeltification. We use a delta combiner to combine all the deltas needed to reproduce the fulltext from some stored plaintext. Branko says this is what REP-OFFSET is for: > The offsets embedded in the svndiff are stored in a string; > these offsets would be in the representation. The point is that > you get all the information you need to select the appropriate > windows from the rep skel -- without touching a single > string. This means a bit more space used in the repository, but > lots less memory used on the server. We'll see if it turns out to be necessary. In the future, there may be other representations, for example indicating that the text is stored elsewhere in the database, or perhaps in an ordinary Unix file. Let's work through an example node revision: (("file" REV COUNT) PROP-KEY "2345") The entry for key "2345" in `representations' is: (("delta" TXN) (0 (("svndiff" 0 "1729") 65 CHECKSUM "2343"))) and the entry for key "2343" in `representations' is: (("fulltext" TXN) "1001") while the entry for key "1729" in `strings' is: which, when applied to the fulltext at key "1001" in strings, results in this new fulltext: "((some text) (that looks) (deceptively like) (directory entries))" Et voila! Subversion knew enough, via the `representations' and `strings' tables, to undeltify and get that fulltext; and knew enough, because of the node revision's "file" type, to interpret the result as file contents, not as a directory entry list. (Note that the `strings' table stores multiple DB values per key. That is, although it's accurate to say there is one string per key, the string may be divided into multiple consecutive blocks, all sharing that key. You use a Berkeley DB cursor to find the desired value[s], when retrieving a particular offset+len in a string.) Representations know nothing about ancestry -- the `representations' table never refers to node revision id's, only to strings or to other representations. In other words, while the `nodes' table allows recovery of ancestry information, the `representations' and `strings' tables together handle deltification and undeltification *independently* of ancestry. At present, Subversion generally stores the youngest strings in "fulltext" form, and older strings as "delta"s against them. However, there's nothing magic about that particular arrangement. Other interesting alternatives: * We could store the N most recently accessed strings as fulltexts, letting access patterns determine the most appropriate representation for each revision. * We could occasionally store deltas against the N'th younger revision, storing larger jumps with a frequency inverse to the distance covered, yielding a tree-structured history. Since the filesystem interface doesn't expose these details, we can change the representation pretty much as we please to optimize whatever parameter we care about --- storage size, speed, robustness, etc. Representations never share strings. Every string is represented by exactly one representation; every representation represents exactly one string. This is so we can replace a string with deltified version of itself, change the representation referring to it, and know that we're not messing up any other reps by doing so. Further Notes On Deltifying: ---------------------------- When a representation is deltified, it is changed in place, along with its underlying string. That is, the node revision referring to that representation will not be changed; instead, the same rep key will now be associated with different value. That way, we get reader locking for free: if someone is reading a file while Subversion is deltifying that file, one of the two sides will get a DB_DEADLOCK and svn_fs__retry_txn() will retry. ### todo: add a note about cycle-checking here, too. The Berkeley DB "nodes" table The database contains a table called "nodes", which is a btree indexed by node revision ID's, mapping them onto REPRESENTATION skels. Node 0 is always the root directory, and node revision ID 0.0.0 is always the empty directory. We use the value of the key 'next-id' to indicate the next unused node ID. Assuming that we store the most recent revision on every branch as fulltext, and all other revisions as deltas, we can retrieve any node revision by searching for the last revision of the node, and then walking backwards to specific revision we desire, applying deltas as we go. REVISION: filesystem revisions, and the Berkeley DB "revisions" table We represent a filesystem revision using a skel of the form: ("revision" TXN) where TXN is the key into the `transactions' table (which see) whose value is the transaction that was committed to create this revision. The database contains a table called "revisions", which is a record-number table mapping revision numbers onto REVISION skels. Since Berkeley DB record numbers start with 1, whereas Subversion filesystem revision numbers start at zero, revision V is stored as record number V+1 in the `revisions' table. Filesystem revision zero always has node revision 0.0.0 as its root directory; that node revision is guaranteed to be an empty directory. Transactions Every transaction ends when it is either successfully committed, or aborted. We call a transaction which has been either committed or aborted "finished", and one which hasn't "unfinished". Transactions are identified by unique numbers, called transaction ID's. Currently, transaction ID's are never reused, though this is not mandated by the schema. In the database, we always represent a transaction ID in its shortest ASCII form. The Berkeley DB `transactions' table records both unfinished and committed transactions. Every key in this table is a transaction ID. Unfinished transactions have values that are skels of the form: ("transaction" ROOT-ID BASE-ID PROPLIST COPIES) where: * ROOT-ID is the node revision ID of the transaction's root directory. * BASE-ID is the node revision ID of the root of the transaction's base revision. * PROPLIST is a skel giving the revision properties for the transaction. * COPIES contains a list of keys into the `copies' table, referencing all the filesystem copies created inside of this transaction. If the transaction is aborted, these copies get removed from the `copies' table. Committed transaction, however, have values that are skels of the form: ("committed" ROOT-ID REV PROPLIST COPIES) where: * ROOT-ID is the node revision ID of the committed transaction's (or revision's) root node. * REVISION represents the revision that was created when the transaction was committed. * PROPLIST is a skel giving the revision properties for the committed transaction. * COPIES contains a list of keys into the `copies' table, referencing all the filesystem copies created by this committed transaction. Nothing currently uses this information for committed transactions, but it could be useful in the future. As the sole exception to the rule above, the `transactions' table always has one entry whose key is `next-id', and whose value is the lowest transaction ID that has never yet been used. We use this entry to allocate ID's for new transactions. The `transactions' table is a btree, with no particular sort order. Changes As modifications are made (files and dirs added or removed, text and properties changed, etc.) on Subversion transaction trees, the filesystem tracks the basic change made in the Berkeley DB `changes' table. The `changes' table is a btree with Berkeley's "duplicate keys" functionality (and with no particular sort order), and maps the one-to-many relationship of a transaction ID to a "change" item. Change items are skels of the form: ("change" PATH ID CHANGE-KIND TEXT-MOD PROP-MOD) where: * PATH is the path that was operated on to enact this change. * ID is the node revision ID of the node changed (may be a zero atom, but only in the "reset" kind case). * CHANGE-KIND is one of the following: - "add" : PATH/ID was added to the filesystem. - "delete" : PATH/ID was removed from the filesystem. - "replace" : PATH/ID was removed, then re-added to the filesystem. - "modify" : PATH/ID was otherwise modified. - "reset" : Ignore any previous changes for PATH/ID in this txn. * TEXT-MOD is a bit specifying whether or not the properties of this node where modified. * PROP-MOD is a bit specifying whether or not the properties of this node where modified. In order to fully describe the changes made to any given path as part of a single transaction, one must read all the change items associated with the transaction's ID, and "collapse" multiple entries that refer to that path. Copies Each time a filesystem copy operation is performed, Subversion records meta-data about that copy. Copies are identified by unique numbers called copy ID's. Currently, copy ID's are never reused, though this is not mandated by the schema. In the database, we always represent a copy ID in its shortest ASCII form. The Berkeley DB `copies' table records all filesystem copies. Every key in this table is copy ID, and every value is a skel of the form: ("copy" SRC-PATH SRC-TXN DST-NODE-ID) where: * SRC-PATH and SRC-TXN are the canonicalized absolute path and transaction ID, respectively, of the source of the copy. * DST-NODE-ID represents the new node revision created as a result of the copy. As the sole exception to the rule above, the `copies' table always has one entry whose key is `next-id', and whose value is the lowest copy ID that has never yet been used. We use this entry to allocate new copy ID's. The `copies' table is a btree, with no particular sort order. Merge rules The Subversion filesystem must provide the following characteristics: - clients can submit arbitrary rearrangements of the tree, to be performed as atomic changes to the filesystem tree - multiple clients can submit non-overlapping changes at the same time, without blocking - readers must never block other readers or writers - writers must never block readers - writers may block writers Merging rules: The general principle: a series of changes can be merged iff the final outcome is independent of the order you apply them in. Merging two nodes, A and B, with respect to a common ancestor ANCESTOR: - First, the merge fails unless A, B, and ANCESTOR are all the same kind of node. - If A and B are text files: - If A is an ancestor of B, then B is the merged result. - If A is identical to B, then B (arbitrarily) is the merged result. - Otherwise, the merge fails. - If A and B are both directories: - For every directory entry E in either A, B, or ANCESTOR, here are the cases: - E exists in neither ANCESTOR nor A. - E doesn't exist in ANCESTOR, and has been added to A. - E exists in ANCESTOR, but has been deleted from A. - E exists in both ANCESTOR and A ... - but refers to different nodes. - but refers to different revisions of the same node. - and refers to the same node revision. The same set of possible relationships with ANCESTOR holds for B, so there are thirty-six combinations. The matrix is symmetrical with A and B reversed, so we only have to describe one triangular half, including the diagonal --- 21 combinations. - (6) E exists in neither ANCESTOR nor A: - (1) E exists in neither ANCESTOR nor B. Can't occur, by assumption that E exists in either A, B, or ancestor. - (1) E has been added to B. Add E in the merged result. *** - (1) E has been deleted from B. Can't occur, by assumption that E doesn't exist in ANCESTOR. - (3) E exists in both ANCESTOR and B. Can't occur, by assumption that E doesn't exist in ancestor. - (5) E doesn't exist in ANCESTOR, and has been added to A. - (1) E doesn't exist in ANCESTOR, and has been added to B. Conflict. - (1) E exists in ANCESTOR, but has been deleted from B. Can't occur, by assumption that E doesn't exist in ANCESTOR. - (3) E exists in both ANCESTOR and B. Can't occur, by assumption that E doesn't exist in ANCESTOR. - (4) E exists in ANCESTOR, but has been deleted from A. - (1) E exists in ANCESTOR, but has been deleted from B. If neither delete was a result of a rename, then omit E from the merged tree. *** Otherwise, conflict. - E exists in both ANCESTOR and B ... - (1) but refers to different nodes. Conflict. - (1) but refers to different revisions of the same node. Conflict. - (1) and refers to the same node revision. Omit E from the merged tree. *** - (3) E exists in both ANCESTOR and A, but refers to different nodes. - (1) E exists in both ANCESTOR and B, but refers to different nodes. Conflict. - (1) E exists in both ANCESTOR and B, but refers to different revisions of the same node. Conflict. - (1) E exists in both ANCESTOR and B, and refers to the same node revision. Replace E with A's node revision. *** - (2) E exists in both ANCESTOR and A, but refers to different revisions of the same node. - (1) E exists in both ANCESTOR and B, but refers to different revisions of the same node. Try to merge A/E and B/E, recursively. *** - (1) E exists in both ANCESTOR and B, and refers to the same node revision. Replace E with A's node revision. *** - (1) E exists in both ANCESTOR and A, and refers to the same node revision. - (1) E exists in both ANCESTOR and B, and refers to the same node revision. Nothing has happened to ANCESTOR/E, so no change is necessary. *** == something actually happens Non-Historical Properties [[Yes, do tell.]] Layers In previous structurings of the code, I had trouble keeping track of exactly who has implemented which promises, based on which other promises from whom. I hope the arrangement below will help me keep things straight, and make the code more reliable. The files are arranged in order from low-level to high-level: each file depends only on services provided by the files before it. skel.c, id.c, dbt.c, convert-size.c Low-level utility functions. fs_skels.c Routines for marshaling between skels and native FS types. fs.c Creating and destroying filesystem objects. err.c Error handling. nodes-table.c, txn-table.c, rev-table.c, reps-table.c, strings-table.c Create and open particular database tables. Responsible for intra-record consistency. node-rev.c Creating, reading, and writing node revisions. Responsible for deciding what gets deltified when. reps-strings.c Retrieval and storage of represented strings. This will handle delta-based storage, dag.c Operations on the DAG filesystem. "DAG" because the interface exposes the filesystem's sharing structure. Enforce inter-record consistency. tree.c Operations on the tree filesystem. This layer is built on top of dag.c, but transparently distinguishes virtual copies, making the underlying DAG look like a real tree. This makes incomplete transactions behave like ordinary mutable filesystems. delta.c Computing deltas. Appendix: Filesystem structure summary ====================================== Berkeley DB tables ------------------ "nodes" : btree(ID -> NODE-REVISION) "revisions" : recno(REVISION) "transactions" : btree(TXN -> TRANSACTION, "next-id" -> TXN) "changes" : btree(TXN -> CHANGE) "copies" : btree(CPY -> COPY, "next-id" -> CPY) "strings" : btree(STR -> STRING, "next-id" -> STR) "representations" : btree(REP -> REPRESENTATION, "next-id" -> REP) Syntactic elements ------------------ Table keys: ID ::= NODE-REV-ID ; TXN ::= number-36 ; CPY ::= number-36 ; STR ::= number-36 ; REP ::= number-36 ; Property lists: PROPLIST ::= (PROP ...) ; PROP ::= atom atom ; Filesystem revisions: REVISION ::= ("revision" TXN) ; Transactions: TRANSACTION ::= UNFINISHED-TXN | COMMITTED-TXN UNFINISHED-TXN ::= ("transaction" ROOT-ID BASE-ID PROPLIST COPIES) ; COMMITTED-TXN ::= ("committed" ROOT-ID REV PROPLIST COPIES) ; ROOT-ID ::= NODE-REV-ID ; BASE-ID ::= NODE-REV-ID ; COPIES ::= (CPY ...) ; REV ::= number ; Changes: CHANGE ::= ("change" PATH ID CHANGE-KIND TEXT-MOD PROP-MOD) ; CHANGE-KIND ::= "add" | "delete" | "replace" | "modify" ; TEXT-MOD ::= atom ; PROP-MOD ::= atom ; Copies: COPY ::= ("copy" SRC-PATH SRC-TXN DST-NODE-ID) SRC-PATH ::= atom ; SRC-REV ::= TXN ; DST-NODE-ID ::= NODE-REV-ID ; Entries lists: ENTRIES ::= (ENTRY ...) ; ENTRY ::= (NAME ID) ; NAME ::= atom ; Node revisions: NODE-REVISION ::= FILE | DIR ; FILE ::= (HEADER PROP-KEY DATA-KEY [EDIT-DATA-KEY]) ; DIR ::= (HEADER PROP-KEY ENTRIES-KEY) ; HEADER ::= (KIND [PREDECESSOR-ID [PREDECESSOR-COUNT]]) ; KIND ::= "file" | "dir" ; PREDECESSOR-ID ::= NODE-REV-ID ; PREDECESSOR-COUNT ::= number ; PROP-KEY ::= atom ; REP-KEY ::= atom ; Representations: REPRESENTATION ::= FULLTEXT | DELTA ; FULLTEXT ::= (HEADER STRING-KEY) ; DELTA ::= (HEADER (OFFSET WINDOW) ...) ; WINDOW ::= (DIFF SIZE CHECKSUM REP-KEY [REP-OFFSET]) ; DIFF ::= ("svndiff" VERSION STRING-KEY) ; VERSION ::= number ; REP-KEY ::= atom ; STRING-KEY ::= atom ; OFFSET ::= number ; REP-OFFSET ::= number ; HEADER ::= (KIND TXN) ; KIND ::= "fulltext" | "delta" ; SIZE ::= number ; CHECKSUM ::= ("md5" BYTES) ; BYTES ::= atom ; Strings: STRING ::= RAWTEXT | LISTTEXT | DIFFTEXT RAWTEXT ::= /{anything.class}*/ ; LISTTEXT ::= list ; DIFFTEXT ::= /{anything.class}*/ ; Node revision IDs: NODE-REV-ID ::= NODE-ID '.' CPY '.' TXN ; NODE-ID ::= number ; Lexical elements ---------------- Numbers: number ::= /{digit.class}+/ ; number-36 ::= /{base36.class}+/ ; (Note: the following are described in skel.h) Skels: skel ::= atom | list; list ::= list.head list.body.opt list.tail ; atom ::= atom.imp-len | atom.exp-len ; list.head ::= '(' spaces.opt ; list.tail ::= spaces.opt ')' ; list.body.opt ::= | list.body ; list.body ::= skel | list.body spaces.opt skel ; atom.imp-len ::= /{name.class}[^\(\){ws.class}]*/ ; atom.exp-len ::= /({digit.class}+){ws.class}.{\1}/ ; spaces.opt ::= /{ws.class}*/ ; Character classes: ws.class ::= [\t\n\f\r\ ] ; digit.class ::= [0-9] ; name.class ::= [A-Za-z] ; base36.class ::= [a-z0-9] anything.class ::= anything at all ;