Merge Tracking Design

*** UNDER CONSTRUCTION ***

Subversion's merge tracking uses a layered design, with the user-visible operations based primarily on the information from the merge history.

Merge History
Data Structures
Merge Operations (TODO)
Audit Operations
Other Operations (TODO)

Merge History

Or, Tracking What Revisions Have Been Merged Where provides the information used by Subversion's merge tracking-related capabilities (history sensitive merging, etc.). The design is based on Dan Berlin's proposal (Message-Id: <1146242044.23718.1.camel@dberlin0-corp.corp.google.com>), and subsequent edits.

Goals

The goal of the Merge History portion of the design is to track the information needed by the operations outlined by the majority of the use cases (e.g. the revision numbers being merged by a merge operation), and keeping this information in the right places as various operations (copy, delete, add, etc.) are performed. This portion of the design does not encompass the operations themselves.

The goals:

To be able to track this down to what files in a working copy and be able to determine what files have had what revisions merged into them.
To not need to contact the server more than we already do now to determine which revisions have been merged in a file or directory (e.g. some contact is acceptable, asking the server about each file is not).
To be able to edit merge information in a human-readable form.
For the information to be stored in a space efficient manner, and to be able to determine the revisions merged into a given file/director in a time efficient manner.
Still getting a conservatively correct answer (not worse than what we have now) when no merge info is available.
To be able to collect, transmit, and keep this information up to date as much as possible on the client side.
To be able to index this information in the future order to answer queries.

Non-goals for the 1.5 design include:

Doing actual history sensitive merging. Subversion does not yet have sufficient support for creation of fully accurate changeset graphs, which are necessary to handle cyclic merging (e.g. of so-called "reflected" revisions, also known as repeated bi-directional mergeing). For details on this problem (caused by lack of tracking of multiple parents for a change), see this discussion thread, and especially this follow-up.
Curing cancer (aka being all things to all people).

Information Storage

The first question that many people ask is "where should we store the merge information?" (what we store will be covered next). A merge history property, named SVN_PROP_MERGE_INFO (e.g. svn:mergeinfo) stored in directory and file properties. Each will store the full, complete list of current merged-in changes, as far as it knows. This ensures that the merge algorithm and other consumers do not have to walk back through old revisions in order to compute a complete list of merge information for a path.

Directly merged into the item means changes from any merge that have affected this path, which includes merges into parents, grandparents, etc., that had some affect on the path.

Doing this storage of complete information at each point makes manual editing safe, because the changes to a path's merge info are localized to that path.

However, as a space optimization, if the information on a sub-directory or file is exactly the same as the merge information for its parent directory, it may be elided in favor of that parent information. This eliding may is done on the fly, but could also be done during a postpass (e.g. a svnadmin mergeinfo-optimize). Eliding information means that an implementation may have to walk parent directories in order to gather information about merge info (however, this may have been necessary anyways). It is expected that directory trees are not that deep, and the lookup of merge info properties quick enough (due to indexing, etc.), to make this eliding not affect performance too much.

Eliding will never affect the semantics of merge information, as it should only be performed in the case when it was exactly the same, and if it was exactly the same, it could not have had an effect on the merge results.

Any path may have merge info attached to it.

The way we choose which path's merge info to use in case of conflicts involves a simple system of inheritance [1], where the "most specific" place wins. This means that if the property is set on a file, that completely overrides the directory level properties for the directory containing the file. Non-inheritable merge info can be set on directories, signifying that the merge info applies only to the directory but not its children.

The way we choose which to store to depends on how much and where you merge, and will be covered in the semantics.

The reasoning for this system is to avoid having to either copy info everywhere, or crawl everywhere, in order to determine which revisions have been applied. At the same time, we want to be space and time efficient, so we can't just store the entire revision list everywhere.

As for what is stored:

A survey of the community shows a slight preference for a human editable storage format along the lines of how svnmerge.py stores its merge info (e.g. path name and list of revisions). Binary storage of such information would buy, on average, a 2-3 byte decrease per revision/range in size over ASCII [2], while making it not directly human-readable/editable.

The revisions we have merged into something are represented as a path, a colon, and then a comma separated revision list, containing one or more revision or revision ranges. Revision range end and beginning points are separated by "-".

Grammar

Token	Definition
revisionrange	REVISION "-" REVISION
revisioneelement	(revisionrange \| REVISION)"*"?
rangelist	revisioneelement (COMMA revisioneelement)*
revisionline	PATHNAME COLON rangelist
top	revisionline (NEWLINE revisionline)*

This merge history ("top"), existing on a path specifies all the changes that have ever been merged into this object (file, dir or repo) within this repository. It specifies the sources of the merges, (and thus two or more pathnames may be required to represent one source object at different revisions due to renaming).

The optional "*" following a revisionelement token signifies a non-inheritable revision/revision range.

This list will not be stored in a canonicalized minimal form for a path (e.g. it may contain single revision numbers that could fall into ranges). This is chiefly because the benefit of such a canonical format -- which is slightly easier for visual comparison, but not indexing -- is outweighed by the fact that generating a canonical form may require groveling through a lot of information to determine what that minimal canonical form is. In particular, it may be that the revision list "5,7,9" is, in minimal canonical form, "5-9", because 6 and 8 do not have any affect on the path name that 5 and 9 are from. Canonicalization could be done as a server side post pass because the information is stored in properties.

Note that this revision format will not scale on its own if you have a list of million revisions. None will easily. However, because it is stored in properties, one can change the WC and FS backends to simply do something different with this single property if they wanted to. Given the rates of change of various very active repositories, this will not be a problem we need to solve for many many years.

The payload of SVN_PROP_MERGE_INFO will be duplicated in an index separate from the FS which is created during svnadmin create, or on-demand for a pre-existing repository which has started using Subversion 1.5. This index will support fast querying, be populated during a merge or svnadmin load, and cough up its contents as needed during API calls. The contents of SVN_PROP_MERGE_INFO is stored redundantly in index (to the FS). Dan Berlin has prototyped a simple index using sqlite3. David James later proposed a more normalized schema design, some of the features of which may become useful for implementing Merge Tracking functionality in a more performant manner.

Information Updating

Each operation you can perform may update or copy the merge info associated with a path.

svn add: No change to merge info.

svn delete: No direct change to merge info (indirectly, because the props go away, so does the merge info for the file).

svn copy: Makes a full copy of any explicit merge info from the source path to the destination path. Also adds "implied" merge info from the source path.

svn rename: Same as svn copy.

svn merge: Adds or subtracts to the merge info, according to the following:

Where to store the info:
1. If the merge target is a single file, the merge info goes to the property SVN_PROP_MERGE_INFO set on that file.
2. If the merge target is a directory, the merge info goes to the property SVN_PROP_MERGE_INFO set on the shallowest directory of the merge (e.g. the topmost directory affected by the merge) that will require different info than the info already set on other directories.
The last clause of rule 2 is only meant to handle cherry picking and multiple merges. In the case that people repeatedly merge the same tree into the same tree, the information will end up only on the shallowest directory of the merge. If changes are selectively applied (e.g. all changes are applied to every directory but one), the information will be on the shallowest common ancestor of all those directories, as well as information being placed on the directory where the changes are not applied, so that it will override the information from that shallow directory. See cherry picking example for more details. Besides selective application, apply changes that affect some directory, and then applying different changes to subdirectories of that directory, will also produce merge info on multiple directories in a given path.
What info is stored:
1. If you are merging in reverse, revisions are subtracted from the revision lines, but we never write out anti-revisions. Thus, if you subtract all the merged revisions, you just get an empty list, and if you do a reverse merge from there, you still get an empty list.
2. If you are merging forward, the revision(s) you are merging is added to the range list in sorted order (such that all revisions and revision ranges in the list are monotonically increasing from left to right).
3. The path (known as PATHNAME in the grammar) used as the key to determine which revision line to change is the sub-directory path being merged from, relative to the repo root, with the repo URL stripped from it.
In the case that we are merging changes that themselves contain merge info, the merge info properties must be merged. The effect of this is that indirect merge info becomes direct merge info as it is integrated as part of the merge info now set on the property. The way this merge is performed is to merge the revision lists for each identical pathname, and to copy the rest. Blocking of merges and how this affects this information is not covered in this design. The indirect info merging is *in addition* to specifying the merge that we are now doing. See the repeated merge with indirect info example for an example.

Thus a merge of revisions 1-9 from http://example.com/repos-root/trunk would produce "/trunk:1-9"

Cross-repo merging is a bridge we can cross if we ever get there :).

Examples

Repeated merge

(I have assumed no renames here, and that all directories were added in rev 1, for simplicity. The pathname will never change, this should not cause any issues that need examples.)

Assume trunk has 9 revisions, 1-9.

A merge of /trunk into /branches/release will produce the merge info "/trunk: 1-9".

Assume trunk now has 6 additional revisions, 14-18.

A merge of /trunk into /branches/release should only merge 14-18 and produce the merge info "/trunk: 1-9,14-18". This merge info will be placed on /branches/release.

(Note the canonical minimal form of the above would be 1-18, as 9-14 do not affect that path. This is also an acceptable answer, as is any variant that represents the same information.)

Repeated merge with indirect info

Assume the repository is in the state it would be after the "Repeated merge" example. Assume additionally, we now have a branch /branches/next-release, with revisions 20-24 on it.

We wish to merge /branches/release into /branches/next-release.

A merge of /branches/release into /branches/next-release will produce the merge info:

"/branches/release: 1-24
 /trunk:1-9,14-18"

This merge info will be placed on /branches/next-release.

Note that the merge information about merges *to* /branches/release has been added to our merge info.

A future merge of /trunk into /branches/next-release, assuming no new revisions on /trunk, will merge nothing.

Cherry picking a change to a file

Assume the repository is in the state it would be after the "Repeated merge with indirect info" example.

Assume we have revision 25 on /trunk, which affects /trunk/foo.c and /trunk/foo/bar/bar.c

We wish to merge the portion of change affecting /trunk/foo.c

A merge of revision 25 of /trunk/foo.c into /branches/release/foo.c will produce the merge info:

"/trunk/foo.c:1-9,14-18,25".  
This merge information will be placed on /branches/release/foo.c

All other merge information will still be intact on /branches/release (ie there is information on /branches/release's directory).

(The cherry picking one directory case is the same as file, with files replaced with directories, hence i have not gone through the example).

Merging changes into parents and then merging changes into children

Assume the repository is in the state it would be after the "Repeated merge with indirect info" example. Assume we have revision 25 on /trunk, which affects /trunk/foo Assume we have revision 26 on /trunk, which affects /trunk/foo/baz We wish to merge revision 25 into /branches/release/foo, and merge revision 26 into /branches/release/foo/baz.

A merge of revision 25 of /trunk/foo into /branches/release/foo will produce the merge info:

"/trunk:1-9,14-18,25"

This merge information will be placed on /branches/release/foo

A merge of revision 26 of /trunk/foo/baz into /branches/release/foo/baz will produce the merge info:

"/trunk:1-9,14-18,26".

This merge information will be placed on /branches/release/foo/baz.

Note that if you instead merge revision 26 of /trunk/foo into /branches/release/foo, you will get the same effect, but the merge info will be:

"/trunk:1-9,14-18,25-26".

This merge information will be placed on /branches/releases/foo

Both are different "spellings" of the same merge information, and future merges should produce the same result with either merge info (one is of course, more space efficient, and transformation of one to the other could be done on the fly or as a postpass, if desired).

All other merge information will still be intact on /branches/release (e.g. there is information on /branches/release's directory).

FAQ

What happens if someone commits a merge with a non-merge tracking client?

It simply means the next time you merge, you may receive conflicts that you would have received if you were using a non-history-sensitive client.

What happens if the merge history is not there?

The same thing that happens if the merge history is not there now.

Are there many different "spellings" of the same merge info?

Yes. Depending on the URLs and target you specify for merges, I believe it is possible to end up with merge info in different places, or with slightly different revision lines that have the same semantic effect (e.g. info like /trunk:1-9 vs /trunk:1-8\n/trunk/bar:9 when revision 9 on /trunk only affected /trunk/bar), so you can end up with merge info in different places, even though the semantic result will be the same in all cases.

Can we do history sensitive WC-to-WC merges without contacting the server?

No. But you probably couldn't anyway without all repo merge data stored locally.

What happens if a user edits merge history incorrectly?

They get the results specified by their merge history.

What happens if a user manually edits a file and unmerges a revision (e.g. not using a "reverse merge" command), but doesn't update the merge info to match?

The merge info will believe the change has still been merged. This is a similar effect to performing a manual merge.

What happens if I svn move/rename a directory, and then merge it somewhere?

This doesn't change history, only the future, thus we will simply add the merge info for that directory as if it was a new directory. We will not do something like attempt to modify all merge info to specify the new directory, as that would be wrong.

I don't think only that copying info on svn copy is correct. What if you copy a dir with merge info into a dir where the dir has merge info -- won't it get the info wrong now?

No. Let's say you have:

a/foo (merge info: /trunk:5-9
a/branches/bar (merge info: /trunk:1-4)

If you copy a/foo into a/branches/bar, we now have:

a/branches/bar (merge info: /trunk:1-4)
a/branches/bar/foo (merge info: /trunk:5-9)

This is strictly correct. The only changes which have been merged into a/branches/bar/foo, are still 5-9. The only changes which have been merged into /branches/bar are 1-4. No merges have been performed by your copy, only copies have been performed. If you perform a merge of revisions 1-9 into bar, the results one would expect that the history sensitive merge algorithm will skip revisions 5-9 for a/branches/bar/foo, and skip revisions 1-4 for a/branches/bar. The above information gives the algorithm the information necessary to do this. So if you want to argue svn copy has the wrong merge info semantics, it's not because of the above, AFAIK :)

Footnotes

This is not going to be a full blown design for property inheritance, nor should this design depend on such a system being implemented.
Assuming 4 byte revision numbers, and repos with revisions numbering in the hundreds of thousands. You could do slightly better by variable length encoding of integers, but even that will generally be 4 bytes for hundreds of thousands of revs. Thus, we have strings like "102341" vs 4 byte numbers, meaning you save about 2 bytes for a 4 byte integer. Range lists in binary would need a distinguisher from single revisions, adding a single bit to both (meaning you'd get 31 bit integers), and thus, would require 8 bytes per range vs 12 bytes per range. While 30% is normally nothing to sneeze at space wise, it's also not significantly more efficient in time, as most of the time will not be spent parsing revision lists, but doing something with them. The space efficiency therefore does not seem to justify the cost you pay in not making them easily editable.

Data Structures

Merge Tracking is implemented using a few simple data structures.

merge info: An apr_hash_t mapping merge source paths to a range list.
range list: An apr_array_header_t containing svn_merge_range_t * elements.
svn_merge_range_t: A revision range, modelled using a start and end svn_revnum_t, which are identical if the range consists of only a single revision.

Audit Operations

As outlined in the requirements and use cases, merge tracking auditting is the ability to report information about merge operations. It consists of three sections:

Changeset Merge Availability (TODO)
Find Changeset (TODO)
Commutative Author and Revision Reporting

Commutative Author and Revision Reporting

Most of the logic for reporting will live in libsvn_repos, with appropriate changes to the API and additional parameters exposed to the client through the RA layer. We are looking at an API which takes one or more paths and a revision (range?), and returns the merge info which was added or removed in that revision. For the 'blame' case, we may also need to include some type of line number parameter, the handling of which is going to get ugly, since our FS isn't based on a weave.

Existing functions of interest are svn_repos_fs_get_merge_info() and svn_fs_merge_info__get_merge_info().

`svn log` Implementation

Prior to merge tracking, log messages had a linear relationship to one another. That is, the only information gleaned from the order in which a message was returned was where the revision number of that message fell in relation to the revision numbers of messages preceding and succeeding it.

The introduction of merge tracking changes that paradigm. Log messages for independent revisions are still linearly related as before, but log messages for merging revisions now have children. These children are log messages for the revisions which have been merged, and they may in turn also have children.

The result is a tree structure which the repository layer builds as it collects log message information. This tree structure then gets serialized and marshaled back to the client, which can then rebuilt the tree if needed. Additionally, less information needs to be explicitly given, as the tree structure itself implies revision relationships.

We currently use the svn_log_message_receiver_t interface to return log messages between application layers. To enable communication of the tree structure, we add another parameter, child_count. When child_count is zero, the node is a leaf node. When child_count is greater than zero, the node is an interior node, with the given number of children. These children may also have children and indicate such by their own child_count parameters. Children are returned in-band through the receiver interface immediately following their parents. Consumers of this API can be aware of the number of children and rebuild the tree, or pass the values farther up the application stack. In effect, this method implements a preorder traversal of the log message tree.

(For convenience, we may want to consolidate all the data parameters of svn_log_message_receiver_t into a single structure.)

A revision, R, is considered a merging revision if the mergeinfo for any path for which the log was requested changed between R and R-1. The difference in the mergeinfo, both added revisions and removed revisions, between R and R-1 indicates the revisions which are children of R.

The exception for this is the case of a copy which is the creation of a branch. When a branch is created, the mergeinfo for R-1 is empty, and the mergeinfo for R is 1:R-1. In this case, the revision should not be considered a merging revision, and none of the revisions R:R-1 should be shown as R's children.

`svn blame` Implementation

Even though the command line client doesn't consume both the original author and revision and the merging author and revision, the blame API should provide both for use by other clients.

$Date$