= How to interpret Subversion dumpfiles =

Version 1.1, 2013-02-02

== Introduction ==

The Subversion dumpfile format is a serialized description of the
actions required to (re)build a version history from scratch.

The goal of this document is that it be sufficient for people writing
dumpfile interpreters to emulate the actions the dumpfile describes on
a versioned filesystem-like store, such as another version-control
system.

It derives from and incorporates some incomplete notes from before
r39883.

=== Unresolved questions ===

1. In interpreting a Node record which has both a copyfrom source and
   a property section, it is possible that the copy source node itself
   has a property section. How are they to be combined?

2. The section on the semantics of kinds of operations documents a
   minor bug at r39883 in the behavior of "add". Has this been fixed?

Portions of text relevant to these questions are tagged with FIXME.

== Syntax ==

=== Encoding and delimiters ===

Subversion dumpfiles are plain byte streams. The structural parts are
ASCII. Text sections and property key/value pairs may be interpreted
as binary data in any encoding by client tools.

A dumpfile consists of four kinds of records. A record is a group of
RFC822-style header lines (each consisting of a key, followed by a
colon, followed by text data to end of line), followed by an empty
spacer line, followed optionally by a body section. If the body
section is present, another empty spacer line separates it from the
following record.

For forward compatibility, unrecognized headers are ignored.

=== Record types ===

Dumpfiles include four record types. Two, the version stamp and UUID
record, consist of single header lines. The bulk of a dumpfile
consists of Revision and Node records.

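
The record layout described above (header lines, then an empty spacer
line) can be illustrated with a minimal Python sketch. The function
name `read_headers` is hypothetical, and decoding header values as
UTF-8 is an assumption of this sketch; nothing here is taken from the
Subversion sources.

```python
import io


def read_headers(stream):
    """Read one group of 'Key: value' header lines from a binary stream.

    Returns a dict of headers, or None at end of input. Empty lines seen
    before the first header (spacers left by a previous record) are skipped.
    """
    headers = {}
    while True:
        line = stream.readline()
        if not line:                       # end of input
            return headers or None
        line = line.rstrip(b"\n")
        if not line:
            if headers:                    # spacer line ends the header group
                return headers
            continue                       # spacer before the record starts
        key, _, value = line.partition(b":")
        if value.startswith(b" "):         # drop the single separator space
            value = value[1:]
        # Assumption: keys are ASCII; values are decoded as UTF-8.
        headers[key.decode("ascii")] = value.decode("utf-8")
```

A caller would invoke `read_headers` repeatedly and, when a
Content-length header is present, read that many body bytes before
asking for the next group. Unrecognized keys simply end up in the
dictionary and can be ignored.
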

==== Version stamp records ====

A version stamp record is always the first line of the file and looks
like this:

-------------------------------------------------------------------
SVN-fs-dump-format-version: <N>\n
-------------------------------------------------------------------

where <N> is replaced by the dump format version. Except where
specified, the descriptions in this document apply to all versions of
the format.

==== UUID records ====

Versions 2 and later may have a UUID record following the version
stamp. It is of the form

-------------------------------------------------------------------
UUID: <UUID>
-------------------------------------------------------------------

where <UUID> is the UUID of the originating repository. An example
UUID is "7bf7a5ef-cabf-0310-b7d4-93df341afa7e".

As generated by Subversion, these UUIDs are "Version 1", incorporating
the MAC of the originating machine. The presentation is in RFC4122
form without the "urn:" or "uuid:" prefixes.

==== Revision records ====

A Revision record has three headers and is usually followed by a
property section. Expect the following form and sequence:

-------------------------------------------------------------------
Revision-number: <N>
[Prop-content-length: <P>]
Content-length: <L>
!
-------------------------------------------------------------------

with the Revision-number header always first and the '!' indicating a
mandatory empty spacer line.

<P> gives the length in bytes of the following property section.
<L> gives the body length of the entire Revision record. These two
numbers will be *identical* for a Revision record; the Content-length
header is added for the benefit of software that can parse RFC-822
messages.

A revision record is followed by one or more Node records (see below).

==== Node records ====

Each Revision record is followed by one or more Node records.

Node records have the following sequence of header lines:

-------------------------------------------------------------------
Node-path: <path>
[Node-kind: {file | dir}]
Node-action: {change | add | delete | replace}
[Node-copyfrom-rev: <rev>]
[Node-copyfrom-path: <path>]
[Text-copy-source-md5: <blob>]
[Text-copy-source-sha1: <blob>]
[Text-content-md5: <blob>]
[Text-content-sha1: <blob>]
[Text-content-length: <T>]
[Prop-content-length: <P>]
[Content-length: Y]
!
-------------------------------------------------------------------

Bracketing in [] indicates optional lines; { | } is an alternation
group. Dump decoders should be prepared for the optional lines after
Node-action to be in any order, except that Content-length is always
last if it is present.

A Node record describes an action on a path relative to the repository
root, and always begins with the Node-path specification.

The Node-kind line indicates whether the path is a file or directory.
The header value will be one of the strings "file" or "dir". This
header may be (and usually is) absent if the node action is a delete.

The Node-action line is always present and specifies the type of
operation for this node. The header value is one of the strings
"change", "add", "delete", or "replace". These operations will be
described in detail later in this document.

Either both the Node-copyfrom-rev and Node-copyfrom-path lines will be
present, or neither will be. They pair to describe a copy source for
the node. Copy-source semantics will be described in detail later in
this document.

The Text-content-{md5,sha1} and Text-copy-source-{md5,sha1} lines are
hash integrity checks and will be present only if Text-content-length
and the copyfrom pair (respectively) are also present. A decoder may
use them to verify that the source content they refer to has not been
corrupted.

Text-content-length will be present only when there is a text section.
Zero is a legal value for this length, indicating an empty file.

Prop-content-length will be present only when there is a properties
section.

Content-length will be present if there is either a text or a
properties section. This is not always the case. In particular, a
delete operation cannot have either. Some other operations that use
copyfrom sources may also not have either.

Again, the '!' stands in for a mandatory empty line following the
RFC822-style headers. A body may follow.

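
To make the role of the three length headers concrete, here is a small
Python sketch that divides a Node record body into its property and
text parts. It assumes that the property section comes first and that
Content-length equals the sum of the other two lengths, which is what
the worked example elsewhere in this document shows; the function name
`split_body` is hypothetical.

```python
def split_body(headers, body):
    """Split a Node record body into (props, text) byte strings.

    `headers` is a dict of header name -> string value. An element of the
    result is None when the matching length header is absent.
    """
    prop_len = headers.get("Prop-content-length")
    text_len = headers.get("Text-content-length")
    total = int(headers.get("Content-length", 0))
    p = int(prop_len) if prop_len is not None else 0
    t = int(text_len) if text_len is not None else 0
    # Assumption: the body is the property section followed by the text.
    if p + t != total or len(body) != total:
        raise ValueError("length headers disagree with the body")
    props = body[:p] if prop_len is not None else None
    text = body[p:p + t] if text_len is not None else None
    return props, text
```

Returning None rather than an empty byte string keeps an absent section
distinguishable from one that is present with zero length, such as the
text of an empty file.
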

=== Property sections ===

A Revision record *may* have a property section, and a Node record
*may* have a property section. Every record with a property section
has a Prop-content-length header.

A property section consists of pairs of key and value records and is
ended by a fixed trailer. Here is an example attached to a Revision
record:

-------------------------------------------------------------------
Revision-number: 1422
Prop-content-length: 80
Content-length: 80

K 6
author
V 7
sussman
K 3
log
V 33
Added two files, changed a third.
PROPS-END
-------------------------------------------------------------------

The fixed trailer is "PROPS-END\n" and its length is included in the
Prop-content-length. Before it, each K and V record consists of a
header line giving the length of the key or value content in bytes.
The content follows. The content is itself always followed by \n.

In version 3 of the format, a third type 'D' of property record is
introduced to describe property deletion. This feature will be
described later, in the specification of delta dumps.

== Semantics ==

=== The kinds of things ===

There are four kinds of things described by a dumpfile: paths,
properties, content, and flows. The distinctions among content,
paths, and flows matter for understanding some operations.

A path is a filesystem location (a file or directory). There are two
kinds of paths in a dumpfile: node paths and copy sources.

Properties are key-value pairs associated with revisions or paths.
Subversion interprets and reserves some properties, those beginning
with "svn:". Others are not interpreted by Subversion; they may be
set and read for the convenience of other applications, such as
repository browsers or translators.

A flow is a sequence of actions on a file or directory path that is
considered to be a single history for change-tracking purposes.
Creating a flow tells Subversion that you want to track the history
of the path or paths it contains.
Destroying a flow breaks the chain of history; changes will not be
tracked across the break, even if another flow is created at the same
path. A copy operation creates a new flow connected to the flow from
which it was copied.

Content is what file paths point at (one timewise slice of a flow).
It is the payload of program source code, documents, images, and so
forth that a version control system actually manages.

A Node record describes a change in properties, the addition or
deletion of a flow, or a change in content. It must do at least one
of these things, otherwise it would be a no-op and omitted.

When no copyfrom is present, and the action isn't an add or copy, then
the kind of the thing identified by (PATH, REVISION) must agree with
the kind of the thing identified by (PATH, -1+REVISION).

Terminological note: in Subversion-speak, the term "node" is
historically ambiguous. Sometimes it refers to what this document
calls a "flow", and sometimes it refers to the internal per-revision
structure that a Node record represents (that is, just one action in
a flow). For clarity, most of this document avoids the term "node" in
favor of the more specific "flow" and "Node record", but knowing about
this issue will help if you read the Ancient History section.

=== The kinds of operations ===

.File operations
|======================================================================
|                           | add      | delete | replace  | change   |
|Can have text section?     | optional | no     | optional | optional |
|Can have property section? | optional | no     | optional | optional |
|Can have copy source?      | optional | no     | optional | no       |
|Fails on existent path     | yes*     | no     | no       | no       |
|Fails on non-existent path | no       | yes    | yes      | yes      |
|======================================================================

FIXME: As of December 2011 there is a minor bug: Adding a file with
history twice _in two different revisions_ succeeds silently.


.Directory operations
|======================================================================
|                           | add      | delete | replace  | change   |
|Can have text section?     | no       | no     | no       | no       |
|Can have property section? | optional | no     | optional | required |
|Can have copy source?      | optional | no     | optional | no       |
|Fails on existent path     | yes      | no     | no       | no       |
|Fails on non-existent path | no       | yes    | yes      | yes      |
|======================================================================

A Node record represents an operation that does one of four things:
add, delete, change, or replace.

Node records can carry content in one (or both!) of two ways: from a
text section or from a copy source (that is, a copy-path and
copy-revision pair).

Giving a copy source appends the node to the flow of which that source
is part; when you 'add' or 'replace' with a copy source, the content
at the path becomes a copy of the source (but see below for a
qualification about directories). Giving a text section also changes
the content of the flow.

In the (unusual) case that a node has both a copy source and a text
section, the correct semantics is to attach the path to the source
flow and then change the content.

An add operation creates a new flow for a file or directory. See the
table above for possible operand combinations.

A delete operation deletes a flow and its content. If the path is a
file, the file is deleted. If the path is a directory, the directory
and all its children are deleted. A subsequent add at the same path
will create a new and different flow with its own history.

A change operation changes properties on a file or directory path.
See the table above for possible operand combinations.

A replace operation behaves exactly like a delete followed by an add
(destroying an old flow, producing a new one) when it has no copy
source. When a replace has a copy source, it produces a new flow with
history extending back through the copy source.
A Node record representing a replace operation may have a property
section.

The main reason "replace" exists is that it helps sequential
processors of the dump stream avoid possibly notifying about multiple
actions on the same path.

It is even possible to have a replace with a copyfrom source *and*
text, such as would result from this on the client side:

-------------------------------------------------------------------
$ svn rm dir/file.txt
$ svn cp otherdir/otherfile.txt dir/file.txt
$ echo "Replacement text" > dir/file.txt
$ svn ci -m "Replace dir/file.txt with a copy of otherdir/otherfile.txt and replace its text, too."
-------------------------------------------------------------------

Subversion filesystems do not allow the root directory ("/") to be
deleted or replaced.

=== Some details about copyfroms ===

The source and target of a copyfrom are always of like kind; that is,
Subversion dump will never generate a node with a source type of file
and a target type of directory or vice-versa.

Interpreting copyfrom_path for file copies is straightforward; the
target pathname gets the contents of the source pathname.

Directory copies (the primitive beneath branching and tagging) are
tricky. For each source path under the source directory, a new path
is generated by removing the head segment of the pathname that is the
source directory. That new path under the target directory gets the
content of the source path.

After this operation:

-------------------------------------------------------------------
Node-path: x/y/z
Node-kind: dir
Node-action: add
Node-copyfrom-rev: 10
Node-copyfrom-path: a/b/c
-------------------------------------------------------------------

the file a/b/c/d will have been copied to x/y/z/d.

A single revision may include multiple copyfrom Node records, even
multiple copyfroms to the same directory, even mixed directory and
file copies to the same directory.

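
The path rewriting for a directory copy can be sketched in a few lines
of Python. This is an illustration only: the function name
`copy_directory` is hypothetical, and the list of paths existing in the
copy-source revision is assumed to be supplied by the caller.

```python
def copy_directory(source_paths, src_dir, dst_dir):
    """Map each path under src_dir to its new location under dst_dir.

    `source_paths` is an iterable of the paths that exist in the
    copy-source revision. Paths outside src_dir are ignored.
    """
    prefix = src_dir.rstrip("/") + "/"
    target = dst_dir.rstrip("/") + "/"
    mapping = {}
    for path in source_paths:
        if path.startswith(prefix):
            # Remove the head segment (the source directory) and re-root
            # the remainder under the target directory.
            mapping[path] = target + path[len(prefix):]
    return mapping
```

With the Node record shown above, calling
`copy_directory(paths_at_r10, "a/b/c", "x/y/z")` maps a/b/c/d to
x/y/z/d, where `paths_at_r10` stands for whatever path list the
interpreter kept for revision 10.
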

=== Properties and persistence ===

The properties section of a Revision record consists of some (possibly
empty) subset of the three reserved revision properties: svn:author,
svn:date, and svn:log, along with any other revision properties.

The revision properties do not persist to later revisions. Each
revision has exactly the revision properties specified in its revision
record, or no revision properties if there is no property section.

The key thing to know about Node properties is that they are
persistent, once set, until modified by a future property section on
the same path.

Normally, a dumpfile re-lists the entire property set for a directory
or file in every Node record that changes any part of it. (But see
the material on delta dumps for an exception.)

This implies that to delete a given property from a path, a dumpfile
generator will issue a Node record with all other properties listed in
it; to delete all properties from a path, the dumpfile generator will
simply issue a node with an empty properties section. Note that this
is different from an *absent* properties section, which will change no
properties and will be associated with a change to content!

=== Representation of symbolic links ===

When the Subversion client sends a content blob representing a
symbolic link (that is, with the svn:special property), the content of
the blob is not just the link's target path. It will have the prefix
"link ". The client likewise interprets this prefix at checkout time.

In the future, other special blob formats with other prefix keywords
may be defined. None such yet exist as of revision 1441992
(February 2013).

=== Implementation pragmatics ===

Because directory operations with copyfroms don't specify all the file
paths they modify, an interpreter for this format must build a map of
the paths in the file store it is manipulating, and update that map as
it processes each Node record.

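
One possible shape for such a path map is sketched below in Python. It
is a simplification under stated assumptions: the map is a plain dict
from path to content, directories are stored with a content of None,
properties are not tracked, and the existence checks from the
operation tables are omitted. The function name `apply_node` and its
parameters are hypothetical.

```python
def apply_node(current, history, action, path, content=None, copyfrom=None):
    """Update the path map `current` in place for one Node record.

    `history` maps a revision number to the saved path map of that
    revision; `copyfrom` is an optional (revision, source_path) pair.
    """
    if action in ("delete", "replace"):
        # Remove the path itself and, for a directory, all its children.
        doomed = [p for p in current if p == path or p.startswith(path + "/")]
        for p in doomed:
            del current[p]
    if action in ("add", "replace") and copyfrom is not None:
        rev, src = copyfrom
        # Copy the source path and everything beneath it to the target.
        for p, data in history[rev].items():
            if p == src or p.startswith(src + "/"):
                current[path + p[len(src):]] = data
    if action != "delete" and (content is not None or path not in current):
        # A text section overrides copied content; a new path with no
        # text (a directory, in this sketch) is recorded with None.
        current[path] = content
```

After the last Node record of a revision, the interpreter would store a
snapshot such as `history[number] = dict(current)` so that later
copyfrom references to that revision can be resolved.
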

On a repository with thousands of commits, the per-revision list of
maps can become quite large. For space economy, the file map for each
revision can be discarded after it is processed *unless it is a source
revision for a copyfrom*.

== An example ==

Here's an example of revision 1422, which added a new directory "baz",
added a new file "bop" inside it, and modified the file "foo.c":

-------------------------------------------------------------------
Revision-number: 1422
Prop-content-length: 80
Content-length: 80

K 6
author
V 7
sussman
K 3
log
V 33
Added two files, changed a third.
PROPS-END

Node-path: bar/baz
Node-kind: dir
Node-action: add
Prop-content-length: 35
Content-length: 35

K 10
svn:ignore
V 4
TAGS
PROPS-END

Node-path: bar/baz/bop
Node-kind: file
Node-action: add
Prop-content-length: 76
Text-content-length: 54
Content-length: 130

K 14
svn:executable
V 2
on
K 12
svn:keywords
V 15
LastChangedDate
PROPS-END
Here is the text of the newly added 'bop' file.
Whee.

Node-path: bar/foo.c
Node-kind: file
Node-action: change
Text-content-length: 102
Content-length: 102

Here is the fulltext of my change to an
existing /bar/foo.c. Notice that this file has
no properties.
-------------------------------------------------------------------

== Format variants ==

=== Version 3 format ===

Version 3 format is a delta dump; text changes are represented as
diffs against the original file, and properties as incremental changes
to a persistent set (that is, a property section does not necessarily
implicitly clear the property set on a path before the new property
settings are evaluated).

This change is a space optimization. It requires additional computing
time to integrate the diff history.

Version 3 is generated by SVN versions 1.1.0-present, if requested by
the user.

This format is equivalent to the VERSION 2 format except for the
following:

1. The format starts with the new version number of the dump format
   ("SVN-fs-dump-format-version: 3\n").

2. There are several new optional headers for Node records:

-------------------------------------------------------------------
[Text-delta: true|false]
[Prop-delta: true|false]
[Text-delta-base-md5: blob]
[Text-delta-base-sha1: blob]
[Text-copy-source-sha1: blob]
[Text-content-sha1: blob]
-------------------------------------------------------------------

The default value for the boolean headers is "false". If the value is
set to "true", then the text and property contents will be treated as
deltas against the previous contents of the flow (as determined by
copy history for adds with history, or by the value in the previous
revision for changes--just as with commits).

Property deltas have the same format as regular property lists except
that (1) properties with the same value as in the previous contents of
the flow are not printed, and (2) deleted properties will be written
out as

-------------------------------------------------------------------
D <namelen>
<name>
-------------------------------------------------------------------

just as a regular property is printed, but with the "K " changed to a
"D " and with no value part.

Text deltas are written out as a series of svndiff0 windows. If
Text-delta-base-md5 is provided, it is the checksum of the base to
which the text delta is applied; note that older versions (pre-1.5) of
'svnadmin load' may ignore the checksum.

Text-delta-base-sha1, Text-copy-source-sha1, and Text-content-sha1 are
not currently used by the loader. They are written by 1.6-and-later
versions of Subversion so that future loaders can optionally choose
which checksum to use for checking for corruption.

=== Archaic version 1 format ===

There are actually two types of version 1 dump streams. The regular
ones are generated since r2634 (svn 0.14.0). Older ones also claim to
be version 1, but miss the Prop-content-length and Text-content-length
fields in the block header. In those days there *always* was a
properties block.
This note is included for historical completeness only, as it is
highly unlikely that any Subversion instances that old remain in
production.

== Implementation choices for optional behaviour ==

This section lists some of the ways existing implementations interpret
the optional aspects of the specification.

When a Revision record has no revision properties, svnadmin and
svnrdump write an empty properties section whereas svndumpfilter omits
the properties section. (At least in Subversion 1.0 through 1.8.)

== Ancient history ==

Old discussion:

(This file started as a proposal, preserved here for posterity.)

A proposal for an svn filesystem dump/restore format.

=== Two problems we want to solve ===

1. When we change our node-id schema, we need to migrate all of our
   data (by dumping and restoring).

2. Serves as a backup format. Could be read by other software tools
   someday.

=== Design Goals ===

A. Written as two new public functions in svn_fs.h. To be invoked by
   new 'svnadmin' subcommands.

B. Format uses only timeless fs concepts.
+
The dump format needs to reference concepts that we *know* are general
enough to never change. These concepts must exist independently of
any internal node-id schema, or any DB storage backend. In other
words, we're talking about the basic ideas in our original "design
spec" from May 2000.

=== Format Semantics ===

Here are the timeless semantics of our fs design -- the things that
would be stored in our dump format.

- A filesystem is an array of trees. Each tree is called a "revision"
  and has unversioned properties attached.

- A revision has a tree of "nodes" hanging off of it. Actually, the
  nodes in the filesystem form a DAG. A revision always points to an
  initial node that represents the 'root' of some tree.

- The majority of a tree's nodes are hard-links (references) to nodes
  that were created in earlier trees.

- A node contains
  * versioned text
  * versioned properties
  * predecessor history: "which node am I a variant of?"
  * copy history: "which node am I a copy of?"

The history values can be non-existent (meaning the node is completely
new), or can have a value of {revision, path}.

=== Refinement of proposal #2: ===

(after discussion with gstein)

Each node starts with RFC822-style headers at the top. The final
header is a 'Content-length:', followed by the content, so record
boundaries can be inferred.

The content section has two implicit parts: a property hash, and the
fulltext. The division between these two sections is implied by the
"PROPS-END\n" tag at the end of the prophash. In the case of a
directory node or a revision, only the prophash is present.

//End of document.