*** UNDER CONSTRUCTION ***
Subversion's merge tracking uses a layered design, with the user-visible operations based primarily on the information from the merge history.
Or, Tracking What Revisions Have Been Merged Where provides the information used by Subversion's merge tracking-related capabilities (history sensitive merging, etc.). The design is based on Dan Berlin's proposal (Message-Id: <1146242044.23718.1.camel@dberlin0-corp.corp.google.com>) and subsequent edits.
The goal of the Merge History portion of the design is to track the
information needed by the operations outlined by the majority of the
use cases (e.g. the revision numbers
being merged by a merge operation), and keeping this information in
the right places as various operations (copy
,
delete
, add
, etc.) are performed. This
portion of the design does not encompass the operations
themselves.
The goals:
Non-goals for the 1.5 design include:
The first question that many people ask is "where should we store
the merge information?" (what we store will be covered next). A merge
history property, named SVN_PROP_MERGEINFO
(e.g. svn:mergeinfo
) stored in directory and file
properties. Each will store the full, complete list
of current merged-in changes, as far as it knows. This ensures that
the merge algorithm and other consumers do not have to walk back
through old revisions in order to compute a complete list of merge
information for a path.
Directly merged into the item means changes from any merge that have affected this path, which includes merges into parents, grandparents, etc., that had some affect on the path.
Doing this storage of complete information at each point makes manual editing safe, because the changes to a path's merge info are localized to that path.
However, as a space optimization, if the information on a
sub-directory or file is exactly the same as the merge information for
its parent directory, it may be elided in favor of that
parent information. This eliding may is done on the fly, but could
also be done during a postpass (e.g. a svnadmin
mergeinfo-optimize
). Eliding information means that an
implementation may have to walk parent directories in order to gather
information about merge info (however, this may have been necessary
anyways). It is expected that directory trees are not that deep, and
the lookup of merge info properties quick enough (due to indexing,
etc.), to make this eliding not affect performance too much.
Eliding will never affect the semantics of merge information, as it should only be performed in the case when it was exactly the same, and if it was exactly the same, it could not have had an effect on the merge results.
Any path may have merge info attached to it.
The way we choose which path's merge info to use in case of conflicts involves a simple system of inheritance [1], where the "most specific" place wins. This means that if the property is set on a file, that completely overrides the directory level properties for the directory containing the file. Non-inheritable merge info can be set on directories, signifying that the merge info applies only to the directory but not its children.
The way we choose which to store to depends on how much and where you merge, and will be covered in the semantics.
The reasoning for this system is to avoid having to either copy info everywhere, or crawl everywhere, in order to determine which revisions have been applied. At the same time, we want to be space and time efficient, so we can't just store the entire revision list everywhere.
As for what is stored:
A survey of the community shows a slight preference for a human
editable storage format along the lines of how
svnmerge.py
stores its merge info (e.g. path name and
list of revisions). Binary storage of such information would buy, on
average, a 2-3 byte decrease per revision/range in size over ASCII [2], while making it not directly
human-readable/editable.
The revisions we have merged into something are represented as a path, a colon, and then a comma separated revision list, containing one or more revision or revision ranges. Revision range end and beginning points are separated by "-".
Token | Definition |
---|---|
revisionrange | REVISION "-" REVISION |
revisioneelement | (revisionrange | REVISION)"*"? |
rangelist | revisioneelement (COMMA revisioneelement)* |
revisionline | PATHNAME COLON rangelist |
top | revisionline (NEWLINE revisionline)* |
This merge history ("top"), existing on a path specifies all the changes that have ever been merged into this object (file, dir or repo) within this repository. It specifies the sources of the merges, (and thus two or more pathnames may be required to represent one source object at different revisions due to renaming).
The optional "*" following a revisionelement token signifies a non-inheritable revision/revision range.
This list will not be stored in a canonicalized minimal form for a path (e.g. it may contain single revision numbers that could fall into ranges). This is chiefly because the benefit of such a canonical format -- which is slightly easier for visual comparison, but not indexing -- is outweighed by the fact that generating a canonical form may require groveling through a lot of information to determine what that minimal canonical form is. In particular, it may be that the revision list "5,7,9" is, in minimal canonical form, "5-9", because 6 and 8 do not have any affect on the path name that 5 and 9 are from. Canonicalization could be done as a server side post pass because the information is stored in properties.
Note that this revision format will not scale on its own if you have a list of million revisions. None will easily. However, because it is stored in properties, one can change the WC and FS backends to simply do something different with this single property if they wanted to. Given the rates of change of various very active repositories, this will not be a problem we need to solve for many many years.
The payload of SVN_PROP_MERGEINFO
will be duplicated
in an index separate from the FS which is created during
svnadmin create
, or on-demand for a pre-existing
repository which has started using Subversion 1.5. This index will
support fast querying, be populated during a merge or svnadmin
load
, and cough up its contents as needed during API calls.
The contents of SVN_PROP_MERGEINFO
is stored redundantly
in index (to the FS). Dan Berlin has prototyped a simple index using
sqlite3. David James later proposed a more normalized schema design,
some of the features of which may become useful for implementing Merge
Tracking functionality in a more performant manner.
Each operation you can perform may update or copy the merge info associated with a path.
svn add
: No change to merge info.
svn delete
: No direct change to merge info
(indirectly, because the props go away, so does the merge info for the
file).
svn copy
: Makes a full copy of any explicit merge info
from the source path to the destination path. Also adds "implied"
merge info from the source path.
svn rename
: Same as svn copy
.
svn merge
: Adds or subtracts to the merge info,
according to the following:
SVN_PROP_MERGEINFO
set on that file.SVN_PROP_MERGEINFO
set on the shallowest
directory of the merge (e.g. the topmost directory affected by the
merge) that will require different info than the info already set
on other directories.Thus a merge of revisions 1-9 from http://example.com/repos-root/trunk would produce "/trunk:1-9"
Cross-repo merging is a bridge we can cross if we ever get there :).
(I have assumed no renames here, and that all directories were added in rev 1, for simplicity. The pathname will never change, this should not cause any issues that need examples.)
Assume trunk has 9 revisions, 1-9.
A merge of /trunk into /branches/release will produce the merge info "/trunk: 1-9".
Assume trunk now has 6 additional revisions, 14-18.
A merge of /trunk into /branches/release should only merge 14-18 and produce the merge info "/trunk: 1-9,14-18". This merge info will be placed on /branches/release.
(Note the canonical minimal form of the above would be 1-18, as 9-14 do not affect that path. This is also an acceptable answer, as is any variant that represents the same information.)
Assume the repository is in the state it would be after the "Repeated merge" example. Assume additionally, we now have a branch /branches/next-release, with revisions 20-24 on it.
We wish to merge /branches/release into /branches/next-release.
A merge of /branches/release into /branches/next-release will produce the merge info:
"/branches/release: 1-24 /trunk:1-9,14-18"
This merge info will be placed on /branches/next-release.
Note that the merge information about merges *to* /branches/release has been added to our merge info.
A future merge of /trunk into /branches/next-release, assuming no new revisions on /trunk, will merge nothing.
Assume the repository is in the state it would be after the "Repeated merge with indirect info" example.
Assume we have revision 25 on /trunk, which affects /trunk/foo.c and /trunk/foo/bar/bar.c
We wish to merge the portion of change affecting /trunk/foo.c
A merge of revision 25 of /trunk/foo.c into /branches/release/foo.c will produce the merge info:
"/trunk/foo.c:1-9,14-18,25". This merge information will be placed on /branches/release/foo.c
All other merge information will still be intact on /branches/release (ie there is information on /branches/release's directory).
(The cherry picking one directory case is the same as file, with files replaced with directories, hence i have not gone through the example).
Assume the repository is in the state it would be after the "Repeated merge with indirect info" example. Assume we have revision 25 on /trunk, which affects /trunk/foo Assume we have revision 26 on /trunk, which affects /trunk/foo/baz We wish to merge revision 25 into /branches/release/foo, and merge revision 26 into /branches/release/foo/baz.
A merge of revision 25 of /trunk/foo into /branches/release/foo will produce the merge info:
"/trunk/foo:1-9,14-18,25"
This merge information will be placed on /branches/release/foo
A merge of revision 26 of /trunk/foo/baz into /branches/release/foo/baz will produce the merge info:
"/trunk/foo/baz:1-9,14-18,26".
This merge information will be placed on /branches/release/foo/baz.
Note that if you instead merge revision 26 of /trunk/foo into /branches/release/foo, you will get the same effect, but the merge info will be:
"/trunk/foo:1-9,14-18,25-26".
This merge information will be placed on /branches/releases/foo
Both are different "spellings" of the same merge information, and future merges should produce the same result with either merge info (one is of course, more space efficient, and transformation of one to the other could be done on the fly or as a postpass, if desired).
All other merge information will still be intact on /branches/release (e.g. there is information on /branches/release's directory).
What happens if someone commits a merge with a non-merge tracking client?
It simply means the next time you merge, you may receive conflicts that you would have received if you were using a non-history-sensitive client.
What happens if the merge history is not there?
The same thing that happens if the merge history is not there now.
Are there many different "spellings" of the same merge info?
Yes. Depending on the URLs and target you specify for merges, I believe it is possible to end up with merge info in different places, or with slightly different revision lines that have the same semantic effect (e.g. info like /trunk:1-9 vs /trunk:1-8\n/trunk/bar:9 when revision 9 on /trunk only affected /trunk/bar), so you can end up with merge info in different places, even though the semantic result will be the same in all cases.
Can we do history sensitive WC-to-WC merges without contacting the server?
No. But you probably couldn't anyway without all repo merge data stored locally.
What happens if a user edits merge history incorrectly?
They get the results specified by their merge history.
What happens if a user manually edits a file and unmerges a revision (e.g. not using a "reverse merge" command), but doesn't update the merge info to match?
The merge info will believe the change has still been merged. This is a similar effect to performing a manual merge.
What happens if I svn
move
/rename
a directory, and then merge it
somewhere?
This doesn't change history, only the future, thus we will simply add the merge info for that directory as if it was a new directory. We will not do something like attempt to modify all merge info to specify the new directory, as that would be wrong.
I don't think only that copying info on svn
copy
is correct. What if you copy a dir with merge info into a
dir where the dir has merge info -- won't it get the info wrong
now?
No. Let's say you have:
a/foo (merge info: /trunk:5-9 a/branches/bar (merge info: /trunk:1-4)
If you copy a/foo into a/branches/bar, we now have:
a/branches/bar (merge info: /trunk:1-4) a/branches/bar/foo (merge info: /trunk:5-9)
This is strictly correct. The only changes which have been merged into a/branches/bar/foo, are still 5-9. The only changes which have been merged into /branches/bar are 1-4. No merges have been performed by your copy, only copies have been performed. If you perform a merge of revisions 1-9 into bar, the results one would expect that the history sensitive merge algorithm will skip revisions 5-9 for a/branches/bar/foo, and skip revisions 1-4 for a/branches/bar. The above information gives the algorithm the information necessary to do this. So if you want to argue svn copy has the wrong merge info semantics, it's not because of the above, AFAIK :)
Merge Tracking is implemented using a few simple data structures.
apr_hash_t
mapping merge source paths to a range
list.apr_array_header_t
containing
svn_merge_range_t *
elements.svn_merge_range_t
start
and
end
svn_revnum_t
, which are identical if
the range consists of only a single revision.The functional specification and the 1.5 release notes detail the command-line client behavior and conflict resolution API.
As outlined in the requirements and use cases, merge tracking auditting is the ability to report information about merge operations. It consists of three sections:
Most of the logic for reporting will live in libsvn_repos
, with
appropriate changes to the API and additional parameters exposed to the client
through the RA layer. We are looking at an API which takes one or more paths
and a revision (range?), and returns the merge info which was added or removed
in that revision. For the 'blame' case, we may also need to include some type
of line number parameter, the handling of which is going to get ugly, since our
FS isn't based on a weave.
Existing functions of interest are svn_repos_fs_get_mergeinfo()
and svn_fs_merge_info__get_mergeinfo()
.
svn log
ImplementationPrior to merge tracking, log messages had a linear relationship to one another. That is, the only information gleaned from the order in which a message was returned was where the revision number of that message fell in relation to the revision numbers of messages preceding and succeeding it.
The introduction of merge tracking changes that paradigm. Log messages for independent revisions are still linearly related as before, but log messages for merging revisions now have children. These children are log messages for the revisions which have been merged, and they may in turn also have children.
The result is a tree structure which the repository layer builds as it collects log message information. This tree structure then gets serialized and marshaled back to the client, which can then rebuilt the tree if needed. Additionally, less information needs to be explicitly given, as the tree structure itself implies revision relationships.
We currently use the svn_log_message_receiver_t
interface
to return log messages between application layers. To enable communication
of the tree structure, we add another parameter, child_count
.
When child_count
is zero, the node is a leaf node. When
child_count
is greater than zero, the node is an interior node,
with the given number of children. These children may also have children and
indicate such by their own child_count
parameters. Children
are returned in-band through the receiver interface immediately following their
parents. Consumers of this API can be aware of the number of children and
rebuild the tree, or pass the values farther up the application stack. In
effect, this method implements a preorder traversal of the log message tree.
(For convenience, we may want to consolidate all the data parameters of
svn_log_message_receiver_t
into a single structure.)
A revision, R, is considered a merging revision if the mergeinfo for any path for which the log was requested changed between R and R-1. The difference in the mergeinfo, both added revisions and removed revisions, between R and R-1 indicates the revisions which are children of R.
The exception for this is the case of a copy which is the creation of a branch. When a branch is created, the mergeinfo for R-1 is empty, and the mergeinfo for R is 1:R-1. In this case, the revision should not be considered a merging revision, and none of the revisions R:R-1 should be shown as R's children.
svn blame
ImplementationEven though the command line client doesn't consume both the original author and revision and the merging author and revision, the blame API should provide both for use by other clients.
$Date$