Merge Tracking Design

*** UNDER CONSTRUCTION ***

Subversion's merge tracking uses a layered design, with the user-visible operations based primarily on the information from the merge history.

Merge History
Merge Operations (TODO)
Audit Operations (TODO)
Other Operations (TODO)

Merge History

Or, Tracking What Revisions Have Been Merged Where provides the information used by Subversion's merge tracking-related capabilities (history sensitive merging, etc.). The design is based on Dan Berlin's proposal (Message-Id: <1146242044.23718.1.camel@dberlin0-corp.corp.google.com>).

The following content has been temporarily superseded by notes on the development branch until the design completely stabilizes.

Goals

The goal of the Merge History portion of the design is to track the information needed by the operations outlined by the use cases (e.g. the revision numbers being merged by a merge operation), and keeping this information in the right places as various operations (copy, delete, add, etc.) are performed. This portion of the design does not encompass the operations themselves.

The goals:

To be able to track this down to what files in a working copy and be able to determine what files have had what revisions merged into them.
To not need to contact the server more than we already do now to determine which revisions have been merged in a file or directory (ie some contact is acceptable, asking the server about each file is not).
To be able to edit merge information in a human editable form.
For the information to be stored in a space efficient manner, and to be able to determine the revisions merged into a given file/director in a time efficient manner.
Still getting a conservatively correct answer (not worse than what we have now) when no merge info is specified.
To be able to collect, transmit, and keep this information up to date as much as possible on the client side.
To be able to index this information in the future order to answer queries.

Pre-notes: Whether to store info in revprops, or just on dirs and files, is still an open question. It makes certain semantics clearer on what operations do below and how to proceed easier. The question is whether it is efficient enough time wise when we go to retrieve merge info, and whether it complicates what merge has to do too much. It also removes all of the listed prerequisites.

Information Storage

The first question that many people ask is "where should we store the merge information?" (what we store will be covered next). A merge history property, named SVN_MERGE_PROPERTY (e.g. svn:mergeinfo) stored in the revision properties, directory properties, and file properties. Each will store the full, complete list of current merged in changes, as far as it knows. This ensures that the merge algorithm and other consumers do not have to walk back revisions in order to get the transitive closure of the revision list.

The way we choose which of file, dir, revprop merge info to use in case of conflicts simple system of inheritance [1] where the "most specific" place wins. This means that if the property is set on a file, that completely overrides the directory and revision level properties.

The way we choose which to store to depends on how much and where you merge, and will be covered in the semantics.

The reasoning for this system is to avoid having to either copy info everywhere, or crawl everywhere, in order to determine which revisions have been applied. At the same time, we want to be space and time efficient, so we can't just store the entire revision list everywhere.

As for what is stored:

A survey of the community shows a slight preference for a human editable storage format along the lines of how svnmerge stores its merge info (e.g. path name and list of revisions). Binary storage of such information would buy, on average, a 2-3 byte decrease per revision/range in size over ASCII [2], while making it not directly human readable/editable.

The revisions we have merged into something are represented as a path, a colon, and then a comma separated revision list, containing one or more revision or revision ranges. Revision range end and beginning points are separated by "-".

Grammar

Token	Definition
revisionrange	REVISION "-" REVISION
revisioneelement	revisionrange \| REVISION
revisionlist	revisioneelement (COMMA revisioneelement)*
revisionline	PATHNAME COLON revisionlist
top	revisionline (NEWLINE revisionline)*

This list will not be stored in a canonicalized minimal form for a path (e.g. it may contain single revision numbers that could fall into ranges). This is chiefly because the benefit of such a canonical format (slightly easier for comparison, but not indexing) is outweighed by the fact that generating a canonical form may require groveling through a lot of information to determine what that minimal canonical form is. In particular, it may be that the revision list "5,7,9" is, in minimal canonical form, "5-9", because 6 and 8 do not have any affect on the path name that 5 and 9 are from. Canonicalization could be done as a server side post pass because the information is stored in properties.

Note that this revision format will not scale on its own if you have a list of million revisions. None will easily. However, because it is stored in properties, one can change the WC and FS backends to simply do something different with this single property if they wanted to. Given the rates of change of various very active repositories, this will not be a problem we need to solve for many many years.

The current thinking is that the payload of SVN_MERGE_PROPERTY will be stored in an index separate from the FS which is created during svnadmin create. This index will support fast querying, be populated during a merge or svnadmin load, and cough up its contents during a svn propget SVN_MERGE_PROPERTY or svnadmin dump. The contents of SVN_MERGE_PROPERTY will not be stored redundantly in the FS (only in the index). Dan Berlin is prototyping this index using sqlite3, and David James has a (generic) schema design underway.

Information Updating

Each operation you can perform may update or copy the merge info associated with a path, file, or revision.

svn add: No change to merge info

svn delete: No direct change to merge info (indirectly, because the props go away, so does the merge info for the file)

svn rename: No change to merge info

svn copy: Copies the merge info from the source path to the destination path, if any.

This includes copying info from revprops, if necessary, by determining if the merge info exists in a revprop for the last changed commit for the source path, and copying it to the new revprop if it does (someone probably needs to check if this is the right semantic :P)

All copies are full-copies of the merge information.

svn merge: Adds or subtracts to the merge info, according to the following:

Where to put the info:
1. If the merge target is a single file, the merge info goes to the property SVN_MERGE_PROPERTY set on that file.
2. If the merge target is a non-wc-root directory, the merge info goes to the property SVN_MERGE_PROPERTY set on the directory.
3. If the merge target is a wc-root directory, the merge info goes to the property SVN_MERGE_PROPERTY set on the revprop.
What info is put:
1. If you are merging in reverse, revisions are subtracted from the revision lines, but we never write out anti-revisions. Thus, if you subtract all the merged revisions, you just get an empty list, and if you do a reverse merge from there, you still get an empty list.
2. If you are merging forward, the revision(s) you are merging is added to the revision line in sorted order (such that all revisions and revision ranges in the list are monotonically increasing from left to right). The exact details of how the range is represented in terms of a list of single revs, or a revision range, is left as a quality of implementation detail. The only requirement is that the range be correct.
3. The path (known as PATHNAME in the grammar) used as the key to determine which revision line to change is the subdirectory path being merged from, relative to the repo root, with the repo url stripped from it.

Thus a merge of revisions 1-9 from http://foo.bar.com/reposroot/trunk would produce "/trunk:1-9"

cross-repo merging is a bridge we can cross if we ever get there :).

Prerequisites

Need to be able to set a revprop to be stored on commit (see issue 1976).
Need to be able to say to copy a revprop from a particular revision and only contact the server at commit time.
Need to be able to have auth treat SVN_MERGE_PROPERTY revprop differently from other revprops (either by special casing the cases users do care about controlling, or special casing props users don't care about controlling, etc.) so that people who don't have access to the revprops can still do history sensitive merges of directories they do have access to.

FAQ

What happens if someone commits a merge with a non-merge tracking client?

It simply means the next time you merge, you may receive conflicts that you would have received if you were using a non-history-sensitive client.

Can we do without the revprop portion of this design?

Technically yes, but it may require more crawling and querying at merge time.

Can we do history sensitive WC-to-WC merges without contacting the server?

No. But you probably couldn't anyway, even if the "revprop not being stored locally" issue were not here.

What happens if the merge history is not there?

The same thing that happens if the merge history is not there now.

What happens if a user edits merge history incorrectly?

They get the results specified by their merge history.

How does the revprop stay up to date?

We copy it from revision to revision.

What happens if a user manually edits a file and unmerges a revision (e.g. not using a "reverse merge" command), but doesn't update the merge info to match?

The merge info will believe the change has still been merged. This is a similar effect to performing a manual merge.

What happens if I svn move/rename a directory, and then merge it somewhere?

This doesn't change history, only the future, thus we will simply add the merge info for that directory as if it was a new directory. We will not do something like attempt to modify all merge info to specify the new directory, as that would be wrong.

I don't think only that copying info on svn copy is correct. What if you copy a dir with merge info into a dir where the dir has merge info -- won't it get the info wrong now?

No. Let's say you have:

a/foo (merge info: /trunk:5-9
a/branches/bar (merge info: /trunk:1-4)

If you copy a/foo into a/branches/bar, we now have:

a/branches/bar (merge info: /trunk:1-4)
a/branches/bar/foo (merge info: /trunk:5-9)

This is strictly correct. The only changes which have been merged into a/branches/bar/foo, are still 5-9. The only changes which have been merged into /branches/bar are 1-4. No merges have been performed by your copy, only copies have been performed. If you perform a merge of revisions 1-9 into bar, the results one would expect that the history sensitive merge algorithm will skip revisions 5-9 for a/branches/bar/foo, and skip revisions 1-4 for a/branches/bar. The above information gives the algorithm the information necessary to do this. So if you want to argue svn copy has the wrong merge info semantics, it's not because of the above, AFAIK :)

Footnotes

This is not going to be a full blown design for property inheritance, nor should this design depend on such a system being implemented.
Assuming 4 byte revision numbers, and repos with revisions numbering in the hundreds of thousands. You could do slightly better by variable length encoding of integers, but even that will generally be 4 bytes for hundreds of thousands of revs. Thus, we have strings like "102341" vs 4 byte numbers, meaning you save about 2 bytes for a 4 byte integer. Range lists in binary would need a distinguisher from single revisions, adding a single bit to both (meaning you'd get 31 bit integers), and thus, would require 8 bytes per range vs 12 bytes per range. While 30% is normally nothing to sneeze at space wise, it's also not significantly more efficient in time, as most of the time will not be spent parsing revision lists, but doing something with them. The space efficiency therefore does not seem to justify the cost you pay in not making them easily editable.