Merge Tracking Functional Specification

*** UNDER CONSTRUCTION ***

Merge tracking functional specification.

TODO: Incorporate CollabNet Summit findings. Describe how each requirement will actually function for Subversion. Remove redundancies.

Repeated Merge

There are two general schemes for solving the repeated merge problem.

The Most Recent Common Ancestor approach (MRCA)

In this scheme, you record an optional set of merge sources in each node-revision. When asked to do a merge with only one source (that is, just "svn merge URL", with no second argument), you compute the most recent ancestor and do a three-way merge between the common ancestor, the given URL, and the WC.

To compute the most recent ancestor, you chain off the immediate predecessors of each node-revision. The immediate predecessors are the direct predecessor (the most recent node-revision within the node) and the merge sources. An interleaved breadth-first search should find the most recent common ancestor.

The Ancestry Set approach (AS)

In this scheme, you record the full ancestry set for each node-revision -- that is, the set of all changes which are accounted for in that node-revision. (How you store this ancestry set is unimportant; the point is, you need a reasonably efficient way of determining it when asked.) If you are asked to "svn merge URL", you apply the changes present in URL's ancestry but absent in WC's ancestry. Note that this is not a single three-way merge; you may have to apply a large number of disjoint changes to the WC.

For a longer description of this approach, see the "Merging and Ancestry" section of the original design doc.

Ancestry-Sensitive Line-Based Merge

Make 'hunks' of contextually-merged text sensitive to ancestry.

A high-resolution version of repeated merge. Rather than tracking whole changesets, we track the lineage of specific lines of code within a file. The basic idea is that when re-merging a particular hunk of code, the contextual-merging process is aware that certain lines of code already represent the merging of particular lines of development. Jack Repenning has a great example of this from ClearCase (see ASCII diagram below).

See the variance adjusted patching document for an extended discussion of how to implement this by composing diffs; see svn_diff_diff4() for an implementation of same. We may be closer to ancestry-sensitive merging than we think.

Here's an example demonstrating how individual lines of code can be tracked. In this diagram, we're drawing the lineage of a single file, with time flowing downwards. The file begins life with three lines of text, "1\n2\n\3\n". The file then splits into two lines of development.

                    1     
                    2     
                    3     
                  /   \   
                 /     \  
                /       \ 
            one           1   
            two           2.5 
            three         3   
             |     \      |
             |      \     |   
             |       \    |            
             |        \   |            
             |         \ one                ## This node is a human's
             |           two-point-five     ## merge of two sides.
             |           three        
             |            |
             |            |
             |            |
            one          one
            Two          two-point-five
            three        newline       
               \         three  
                \         |   
                 \        |
                  \       |
                   \      |
                    \     |
                     \    |
                      \   |
                       \  |
                         one                ## This node is a human's
                         Two-point-five     ## merge of the changes
                         newline            ## since the last merge.
                         three

It's the second merge that's important here.

In a system like Subversion, the second merge of the left branch to the right will fail miserably: the whole file's contents will be placed within conflict markers. That's because it's trying to dumbly apply a patch that changes "1\n2\n3" to "one\nTwo\nthree", and the target file has no matching lines at all.

A smarter system (like Clearcase) would remember that the previous merge had happened, and specifically notice that the lines "one" and "three" are the results of that previous merge. Therefore, it would ask the human only to deal with the "Two" versus "two-point-five" conflict; the earlier changes ("1\n2\n3" to "one\ntwo\nthree") would already be accounted for.

Comparisons, Arguments, and Questions

AS allows you to merge changes from a branch out of order, without doing any bookkeeping. MRCA requires you to merge changes from a branch in order.

At first look, MRCA may be simpler to implement, since it results in a single three-way merge. However, it likely does not handle all edge cases. For instance, it may break down faster if the merging topology is not hierarchical.

MRCA may be easier for users to understand, even though AS is probably simpler to a mathematician.

Consistency with other modern version controls systems is desirable.

If a user asks to merge a directory, should we apply MRCA or AS to each subdirectory and file to determine what ancestor(s) to use? Or should we apply MRCA or AS just once, to the directory itself? The latter approach seems simpler and more efficient, but will break down quickly if the user wants to merge subdirectories of a branch in advance of merging in the whole thing.