Oh Most Noble and Virile Emacs, please be in -*- outline -*- mode!

Inversion: A Version Control System You Actually Want To Use
------------------------------------------------------------

($Revision: 1.41 $ of this document.)

* Introduction

This document describes Inversion, a replacement for CVS.  We may sometimes call it "Ivn", to save space.  It will be described largely in terms of how it differs, or doesn't differ, from CVS; if you don't already know CVS, this document will be pretty unreadable.

Where you see the word "version", it means the same thing "revision" does in the CVS documentation.  Everyone always ends up saying "version" anyway, then correcting themselves, so let's just give up already.  Besides, it's shorter to type.

"Project" and "repository" also mean what you think they do -- a repository is a place where multiple projects are stored.  Sometimes I might slip and say "repository" where I should have said "project", because CVS uses the r-word for both.  Bear with me.

* Requirements

** All CVS features are supported.

Everything CVS has, we want.  Versioning, folding of non-conflicting changes, detection of conflicting changes, branching, merging, historical diffs, log messages, line-by-line history ("cvs annotate"), the works.  Generally, Inversion's conceptual interface to a particular feature will be as similar to CVS's as possible, except when there's a compelling reason to do otherwise.

** Directories, renames, and file meta-data are versioned.

Naturally!

By the way, rename support means you can trace through ancestry by name *or* by entity.  For example, if you say "Give me version 12 of foo.c", do you mean version 12 of the file whose name is *now* foo.c (but which was perhaps named bar.c back at version 12), or the file whose name was foo.c in version 12 (a file which perhaps no longer exists, or has a different name now)?
Inversion offers a way to specify which interpretation you want, while defaulting to the most reasonable interpretation in context.  We haven't decided what "reasonable" means yet, but we're working on it.

** Version numbers are project-wide and atomic.

Unlike CVS, each Inversion commit results in a single new version number, which applies to the project as a whole (i.e., it applies to every object in the project, whether changed or not).  There are no longer separate per-file version numbers, at least not user-visible ones.

Note: Well, it may be useful to give users access to individual files' or directories' "commit count" (the same as the per-file "revision number" in CVS).

Commits are atomic -- either the entire commit succeeds, or the entire commit fails.  No one can ever retrieve a project tree in an inconsistent state.

** Text vs binary issues are handled better than in CVS

...depending on your definition of "better", of course.

Historically, binary files have been problematic in CVS for two unrelated reasons: keyword expansion, and line-end conversion.

Keyword expansion is, for example, when CVS expands "$ Revision$" into "$ Revision 1.1$", and each successive commit updates that to the next revision number.  There are a number of keywords in CVS: "$ Author$", "$ Date$", and so on.  (Note: the examples above contain an extra space to prevent keyword expansion from happening to them!)

Line-end conversion is when CVS gives plaintext files in the working copy the appropriate line-ending convention (LF, CRLF, or CR) for that platform (the repository always stores text files in Unix LF format).  Line-end conversion requires knowing that the file is plaintext, of course -- doing it to an image file or other binary would only mess things up.

Both keyword substitution and line-end conversion are possible only with plain text files.  CVS recognizes only two file types anyway: plain text, and binary.  CVS assumes files are plain text unless you tell it otherwise.
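As a toy illustration of the line-end rule just described (LF in the repository, platform convention in the working copy) -- this is a sketch only, with invented function names, not Inversion's actual interface; note that it must never be applied to binary files:

```python
# Hypothetical sketch of CVS-style line-end conversion.  The repository
# always stores text with LF; the working copy gets the local convention.
# Function names here are invented for illustration.

def repo_to_working(data: bytes, platform_eol: bytes) -> bytes:
    """Convert repository text (always LF) to the platform's convention."""
    return data.replace(b"\n", platform_eol)

def working_to_repo(data: bytes) -> bytes:
    """Normalize any of CRLF, CR, or LF back to the repository's LF form."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

Running either function over an image or other binary would corrupt it, which is exactly why the file's text-vs-binary type must be known before conversion happens.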
Inversion, for its initial release at least, will recognize the same two types (later, it may learn to do keyword substitution inside word-processor formats, etc).  The question is, how does Inversion determine whether a file is plain text or not?

Experience with CVS suggests that assuming text unless told otherwise is a losing strategy -- people frequently forget to mark images and other opaque formats as binary, and then have to spend time figuring out why CVS mangled their data.  So, Inversion assumes files are binary, unless they have a standard text extension (.c, .h, .pl, .html, .txt, and so on), or the user explicitly marks the file as texty.  In addition to a pre-defined list of text extensions, each project can have a custom extension list as well.  And, of course, a user can choose to mark a file as text or binary no matter what its extension.

Text files undergo line-end conversion by default.  Users can turn conversion on or off per project, per file pattern, or per file.

Text files do *not* undergo keyword substitution by default.  Users can turn substitution on or off per project, per file pattern, or per file.

(The reason keyword substitution is off by default is that I've seen it cause problems in files where variable names or string constants happened to contain keywords.  But these problems are somewhat rare, and usually easily remedied, so I'm not entirely convinced that defaulting to `on' would be all that bad either... thoughts?)

By the way, any of the above properties attached to a file are really attached to a particular version, and each new version inherits from the previous version except when told otherwise.  Thus type changes are recorded like any other historical data (in practice, the type will probably be stored in the entity's property list; see the implementation discussions farther down).

** I18N/Multilingual support.
Inversion has I18N support -- commands, user messages, and errors can be customized to the appropriate human language at build time.  There will also be I18N support for the *names* of versioned entities, not just their contents.

For purposes of keyword expansion and line-end conversion, Inversion also understands the UTF-* encodings (though this may come after the first release).  (It may be workable to just treat everything as UTF-8 at first; for non-UTF-8 text files with a few "meta" characters, this might at worst result in some number of bytes after the meta character being ignored... no big loss, except when a keyword code or line-end falls within the lost bytes, but that probably wouldn't happen too often...)

*** User-visible messages

Inversion can be localized so that its messages print out in a build-time-configured language.

** Branching and tagging are constant-time operations in Inversion.

CVS makes an unnecessary distinction between branches and tags, and then goes on to implement them in an inefficient way.  Inversion supports them both with one efficient operation: `clone'.  To clone a project is to create another project exactly like it, except that the new project "knows" that the old project is its ancestor.  Version numbers in the new project start at the ancestor's version number.

Clearly, at the moment of creation, a clone requires only a tiny amount of constant space -- the clone is like an "alias" for the old project.  If you never commit anything on the clone, then it's just like a CVS tag.  If you start committing on the clone, then it's a branch.  Voila!

This should also get us CVS "vendor branching" (since we'll have real rename and directory support).

(But note that it will be supported and normal to define the clone's name inside the original project's namespace, because people mentally think of tags and branches as being "inside" the project, not as fully independent top-level objects themselves.
And we certainly don't want cross-project name conflicts in tags and branches!)

[Thanks to Jim Blandy for pointing out that `clone' is a clean way to get both tagging and branching.]

** Repeated merges are handled gracefully.

Inversion merges are the same, in principle, as CVS merges: merging means taking the differences between version A and version B, and applying those differences as a patch to version X (which may simply be the working copy, rather than a formal repository version).  BUT, Ivn remembers what has already been merged, and doesn't merge anything twice unless forced to.  So if you merge the differences between A and B into X, and then a month later do it again (by which time the branch has moved on from B to some later version C), the second merge will only get the changes from B to C, *not* from A to C.

This will require a more complete specification, of course, but the above is the general idea.  Note that per-object merge records will need to be maintained on the server side, since people often need to merge project subsets.

(There are some theoretical problems with remembering merge sources -- knowing where the merged data came from implies some sort of universal repository registry.  However, our first goal is to make sure that multiple merges from branches made *in the same repository as the original project* compound gracefully.  Remembering merges from remote sources is harder, due to the difficulty of distinguishing remote sources, but there are good "90%" solutions that will work in practice.)

** Inversion has a flexible log message policy

A small matter, but one dear to our hearts...  Log messages are a matter of project policy, not version control software policy.  If you give no log message, then Inversion defaults to an empty message.  Maybe there can be a per-project configuration option requiring non-empty log messages for commits *in that project only*, though I have doubts about even the usefulness of that.
(CVS tries to require log messages, but fails: we've all seen empty log messages in CVS, where the user committed with deliberately empty quotes.  Let's please stop the madness now.)

** Conflicts are handled as in CVS, but a little better

For text files, Inversion resolves conflicts similarly to CVS, by folding repository changes into the working files.  For *both* text and binary files, Inversion also always puts the pristine repository version in one temporary file, and the pristine working copy version in another.  Thus, in a text conflict, the user has three files to choose from:

   1) the combined file, with conflict markers
   2) the original working copy file
   3) the repository revision from which the update was taken

CVS provides the first two, but doesn't directly provide the third.  It's true that one could retrieve the repository version from the repository (using update -p or whatever), but it's more convenient to have it right at hand.  When the conflict has been resolved and the working copy is committed, Inversion can automatically remove the two pristine files.

*** Content knowledge allows better conflict resolution strategies

Certain kinds of conflicts can be resolved without human intervention.  For example, files in the format of /etc/passwd just need to keep lines unique on username and/or user ID.  But when merging new repository data into a modified working copy of such a file, it's possible to get a textual conflict even with no "semantic conflict".  If the Inversion client knew something about the format of passwd files, it could merge in the new lines without creating a conflict.  A similar rule could be used for ChangeLogs, based on the date in the header lines.  And so on.  These features may not be in the first release of Inversion, but let's keep them in mind.

** Support for plug-in client-side diff programs

There is no need for Inversion to have every possible diff mechanism built in (or even any mechanism built in).
Instead, it invokes the appropriate client-side diff'er on the two versions of the file(s) locally.  For text files, the appropriate diff'er is the system `diff' program, or maybe "emacs -f ediff".  For image files, it might be some fancy visual comparison product.  Probably basic text diffing should be built into the client, but that doesn't obviate the need for clients to be able to specify alternate (or additional) difference viewers, even for text.

** Multisite / Local Repository

Two commonly requested enhancements for CVS are "multisite" and "local repositories".  Although the two terms are sometimes used interchangeably, they refer to two different things (explained below).  Inversion could eventually support both, but they are not priorities for the first release.  Nothing in the design prevents them from being added later.

*** Multisite

This is like the ClearCase multisite feature.  Essentially, it is a redundant distributed repository.  The repository exists on two or more cooperatively mirroring servers (each one presumably being close, network-wise, to its intended users).  Commits on any server are visible on all of them.  I'm not sure how conflict resolution is handled when people at different mirrors change the same file at the same time -- perhaps one of the commits is backed out and returned to that user?  Hmmm.  Anyway, CVS has gotten away with simply assuming decent Net access for all users; this doesn't satisfy everyone, but for the first release of Inversion we should probably do the same.

*** Local repository

This is one that people request a lot: the ability to commit changes first to a local "working repository" (not visible to the rest of the world), and then commit what's in the working repository to the real repository (with the several commits usually being folded into one).  Why do people want this?
I think mostly it's the psychological comfort of making a snapshot whenever one reaches a good stopping point, without necessarily wanting all those "comfort points" to become publicly-visible commits.  This is understandable, but of course the local repository will have to refuse to commit changes if any of the files turn out not to be up-to-date.  (There may also be a very cheap way to implement 90% of this feature, by making skeleton duplicates of the working copy's directory tree and copying the changed files into that skeleton...)

(Are there any other reasons people want this feature?  I remember seeing posts on info-cvs explaining what appeared to be other reasons, but frankly I could never understand them very well...)

For now, similar effects are achievable without involving the version control system, so this feature is not a priority for the first release of Inversion.

* Design

** Repository

*** The basic repository structure

[This design is drawn from Jim Blandy's "Subversion" spec, but with some changes from Ben and Karl (i.e., Jim is not to be held responsible for everything you read below :-) ).]

Suppose we have a new project, at version 1, looking like this (pardon the CVS syntax):

   prompt$ inv checkout myproj
   U myproj/
   U myproj/B
   U myproj/A
   U myproj/A/subA
   U myproj/A/subA/fish
   U myproj/A/subA/fish/foo
   prompt$

(Only the file `foo' is a regular file; everything else in myproj is a directory so far.)

Let's see what this looks like as an abstract data structure in the repository, and how that structure works in various operations (such as update, commit, and branch).  This data structure is described as though it lives in permanent heap memory; how we will actually accomplish permanent storage is an implementation detail (probably involving DBM files of some sort).
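Before walking through the diagrams, the abstract structure just described can be modeled in a few lines of Python.  This is purely an illustrative sketch with invented names (FileNode, DirNode, Repository): a version array pointing at root directory nodes, directory nodes holding dir_entries, file nodes holding contents, and every name living in the parent rather than in the node itself.

```python
# Toy model of the repository structure; all class and field names are
# invented for this sketch, not part of any actual Inversion design.

class FileNode:
    def __init__(self, contents: bytes):
        self.contents = contents
        self.back_link = None      # (version, older_node) once committed over

class DirNode:
    def __init__(self, entries=None):
        # dir_entries: name -> child node.  The child's name lives here,
        # in the parent, not in the child node itself.
        self.entries = dict(entries or {})
        self.back_link = None

class Repository:
    def __init__(self):
        self.versions = {}         # version number -> root DirNode

# Version 1 of `myproj': mostly directories, with one regular file `foo'.
repo = Repository()
foo = FileNode(b"int main(void) { return 0; }\n")
repo.versions[1] = DirNode({
    "A": DirNode({"subA": DirNode({"fish": DirNode({"foo": foo})})}),
    "B": DirNode(),
})
```

Retrieving a file at some version is then just a walk from the root node that the version array entry points to, following dir_entries downward.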
In the diagrams that follow, straight horizontal lines with arrowheads represent backwards time-flow (the regression of versions), and any other kind of path represents a parent-to-child connection in a directory hierarchy.

Boxes are "nodes".  A node is either a file or a directory -- a letter in the upper left indicates which kind.  A node's name is stored in the node's parent, not in the node itself, for reasons that will become obvious.  A file node has a byte-string for its content, whereas a directory node has a list of dir_entries, each pointing to another node.

At the top of the repository is an array of version numbers, stretching off to infinity.  Since the project is at version 1, only index 1 points to anything; it points to the root node of version 1 of the project:

   Figure 1: The `myproj' repository at version 1.

        ( myproj's version array )
    ______________________________________________________
   |___1_______2________3________4________5_________6_____...etc...
       |
       |
    ___|_____
   |D        |
   |         |
   |  A      |      /* Two dir_entries, `A' and `B'. */
   |   \     |
   | B  \    |
   |__/___\__|
     /     \
    |       \
    |        \
 ___|____     \________
|D       |    |D       |
|        |    |        |
|        |    | subA   |      /* One dir_entry, `subA'. */
|________|    |___\____|
                   \
                    \
                 ___\____
                |D       |
                |        |
                | fish   |      /* One dir_entry, `fish'. */
                |___\____|
                     \
                      \
                   ___\____
                  |D       |
                  |        |
                  | foo    |      /* One dir_entry, `foo'. */
                  |___\____|
                       \
                        \
                     ___\____
                    |F       |
                    |        |
                    |        |      /* (Contents of foo not shown.) */
                    |________|

What happens when we modify `foo' and commit?  First, we make a new `foo' node, containing the diff from version 2 of foo to version 1.  Let's use ":N" to express version relationships; so, this diff is the result of:

   diff foo:2 foo:1

Note the order there: this is a reverse diff.  The idea is that foo:2 will hold the full text of the new version, and foo:1 will become a diff.  This is done to save space; it does not affect the semantics of the implementation.  (Also note that we're assuming a diff program that can handle binary data!)
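The reverse-diff scheme can be illustrated with Python's difflib standing in for the binary-capable differ the note calls for (difflib itself is line-based, so this is only a toy; all names are invented for the sketch).  The head version keeps full text, and the older version is stored as an edit script that rebuilds it from the head:

```python
import difflib

# Toy reverse-delta storage: the newest version keeps full text; the
# older version is kept only as the edit script that turns the new text
# back into the old text.  All names here are invented for illustration.

def reverse_delta(new, old):
    """Edit script that rebuilds `old` from `new` (both lists of lines)."""
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=new, b=old).get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))       # reuse lines from `new`
        else:                                  # replace / delete / insert
            ops.append(("lines", old[j1:j2]))  # store the old lines themselves
    return ops

def apply_delta(new, ops):
    """Reconstruct the older version from the head text plus its delta."""
    out = []
    for op in ops:
        if op[0] == "copy":
            out.extend(new[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out
```

Storing deltas in this direction means the common case (fetching the newest version) needs no reconstruction at all; only historical retrievals walk back through deltas.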
The new node is not connected to anything yet, it's just hanging out there in space: ________ |F | | | | | |________| It doesn't even know what version it belongs to; we'll get to that in a moment. Next, link the new node into the tree, where the previous node was, and create a "back-link" (I forgot to tell you, nodes have a back-link field) from the new node to the old node. Back-links are shown as parenthesized numbers with an arrow leading to the older node: Figure 2: The `myproj' repository, in an intermediate state during the commit of version 2 (a modification to the file `foo'). ______________________________________________________ |___1_______2________3________4________5_________6_____...etc... | | ___|_____ |D | | | | A | | \ | | B \ | |__/___\__| / \ | \ | \ ___|____ \ ________ |D | |D | | | | | | | | subA | |________| |___\____| \ \ ___\____ |D | | | | fish | |___\____| \ \ ___\____ |D | | | | foo | |___\____| \ \ ___\____ ________ |F | |F | | | | | | (1)------------->| | | | | | | full | | diffy | |contents| |contents| |________| |________| A back-link is a version number attached to a pointer into the past. The version number tells you the version of what the links points to, not what it points from. So if you're looking for FILE:N, and when you get a node for that file you find a back-link pointing to a version >=N, then you know you must follow that link. But I digress. Right now, we're learning how to commit foo:2, and there's one more step needed to complete that commit -- register the new version at the top of the repository, thus making it externally visible. That's done by making entry 2 in the version array point to something: Figure 3: The `myproj' repository, after commit of version 2. ______________________________________________________ |___1_______2________3________4________5_________6_____...etc... 
| | | / ___|_____/ |D | | | | A | | \ | | B \ | |__/___\__| / \ | \ | \ ___|____ \ ________ |D | |D | | | | | | | | subA | |________| |___\____| \ \ ___\____ |D | | | | fish | |___\____| \ \ ___\____ |D | | | | foo | |___\____| \ \ ___\____ ________ |F | |F | | | | | | (1)------------->| | | | | | | full | | diffy | |contents| |contents| |________| |________| Version 2 points to exactly the same root node as version 1, because that directory hasn't changed at all. Nor has its child, nor has anything until you get all the way down to foo. That's why nodes don't store version numbers -- the exact same node may appear in many different versions. Here's how to retrieve foo:N, in general: 1. Go to the version table, find version N. 2. Walk down the tree in the obvious way, starting from the root node N points to. As you walk, anytime you get to a node with a back-link >=N, follow the link before continuing downward. (Follow this rule even when you get to a node for `foo'.) 3. When you have nowhere else to go, this is the droid you're looking for. Now watch what happens when we add a new file `bar' into `fish' (i.e., bar will be a sibling of `foo'). Here's the new tree, the intermediate steps not being shown: Figure 4: The `myproj' repository, after commit of version 3 (the addition of a sibling to `fish/foo'). ______________________________________________________ |___1_______2________3________4________5_________6_____...etc... 
| | / | / / ___|_____/ / |D |______/ | | | A | | \ | | B \ | |__/___\__| / \ | \ | \ ___|____ \ ________ |D | |D | | | | | | | | subA | |________| |___\____| \ \ ___\____ |D | | | | fish | |___\____| \ \ ___\____ ________ |D | |D | | | | | | (2)--------->| | | | | foo | | | |___/____| | | / | | / | bar | / / | / /| foo | / / |___\____| / / \ / / \ / / __\___/_ ________ ______/_ |F | |F | |F | | | | | | | | (1)------------->| | | | | | | | |________| | full | | diffy | |contents| |contents| |________| |________| Trace various retrievals in the above structure, and you will see that + fish:2 and fish:1 are the same node, as they should be. + But fish:3 is different from them, which is also as it should be. + foo:3 and foo:2 are the same node, as they should be. + But foo:1 is different from them, which is also as it should be. Thus, the traversal cost of retrieving version N of something is equal to its depth in the tree plus the number of changes it or its ancestors have undergone. Unchanged entities cost nothing. Whether fish:2 is stored as some sort of diff from fish:3 is an implementation detail. We could do it that way, but it may not be necessary, since new directory nodes would only be created when a new file or directory is added to the project, or a name is changed. Most commits tend to be edits to existing files. Just to drive the point mercilessly home, and to explore the theoretical limits of ASCII diagrams, here is myproj:4, in which a README was added to the project's top level directory: Figure 5: The `myproj' repository, with new top-level README: ______________________________________________________ |___1_______2________3________4________5_________6_____...etc... 
| / / / | / / / ____|____/ / / _________ |D |______/ / |D | | | / | | | (3)------------------/---------------->| | | | / _____|__ B | | README | / / | | | | |___________/ / | A | | B | | ____________/ |___|_____| | / |__|_________ / | | / A | \/ _________ | |_|___\___| /\ |F | | | \________ / \_____| | | | | / | | | | |/ | | | | / |_________| | | /| | | / | | | _______/ | | | | | | | | | | ___|__|_ __|_____ | |D | |D | | | | | |________________________| | | | subA | |________| |___\____| \ \ ___\____ |D | | | | fish | |___\____| \ \ ___\____ ________ |D | |D | | | | | | (2)--------->| | | | | foo | | | |___/____| | | / | | / | bar | / / | / /| foo | / / |___\____| / / \ / / \ / / __\___/_ ________ ______/_ |F | |F | |F | | | | | | | | (1)------------->| | | | | | | | |________| | full | | diffy | |contents| |contents| |________| |________| There, wasn't that pretty? I knew you would. *** Crash-proof repository mutation and locking We want to make repository changes in a such way that the repository is in a "sane" (unambiguous and readable) state at every step. At the same time, we want to keep locking to a minimum: no operation locks out readers, no read-only operation locks out anyone, and write operations lock out other writers for as little time as possible. These two issues are not directly related, but they must be synchronized with each other. This is complex, so instead of specifying the sequence of events formally right away, let's run through an example without locking first. Here's how to commit a new version of foo: Step 1: Create the new node for foo:2. ________ |F | | | | | |________| Nothing is attached to this new node, therefore the repository is in a sane state. If the server crashes at this moment, there will be no problem (except that a cleanup thread might, at its leisure, delete the unreachable node). The new node has the full contents of foo:2. Step 2: Create the back-link from foo:2 to foo:1. 
      ________                 ________
     |F       |               |F       |
     |  (1)------------------>|        |
     |        |               |        |
     |________|               |________|

Again, the repository doesn't yet know that the new node even exists.  The back-link is a pointer from foo:2 to foo:1, not the other way around, so foo:1 isn't even aware of the link.  Both nodes have the full contents of their respective versions; no diffs have been made yet.

Step 3: Change the link in the parent node to point to the new foo:2 node (which in turn points to foo:1).

           \
        ___\____
       |D       |
       |        |
       | foo    |
       |___\____|
            \
             \
          ___\____               ________
         |F       |             |F       |
         |  (1)---------------->|        |
         |        |             |        |
         |  full  |             |  full  |
         |contents|             |contents|
         |________|             |________|

The repository is still in a mostly sane state.  If the machine crashes now, there's this slightly weird situation in which a node for foo:2 is present in the tree, but the version `myproj:2' doesn't exist yet.  However, if you follow the three-step algorithm for retrieving FILE:N, you'll see that you still reach foo:1 just fine -- it's as though foo:2 isn't there.  Another way to look at it is that version 2 of the project does not exist yet, therefore you can't ask for it; if you ask for the head version, you'll get version 1.

Step 4: Hook up version 2 to the root node.

    ____________________________
   |___1_______2________3_______...etc...
       |       |
       |      /
    ___|_____/
   |D        |
   |         |
   |  A      |
   |   \     |
   | B  \    |
   |__/___\__|
     /     \
    |       \

You'll notice that nowhere in there did we give foo:1 diffy contents.  A separate thread gambols about the repository, taking care of such mundane tasks.  There's no reason to slow up commits with it.  (Of course, there's a "diff-bit" on each node, saying whether its contents are stored full-text or as a diff against the node that back-links to this one.)

Now let's generalize that procedure, and add locking...  The important thing to realize is that someone else may initiate a commit before or during our commit, and their commit might finish before or after ours.
We don't want to make their commit wait unnecessarily, so we're not actually trying for a particular version number -- we just want the next available version number at the time our commit finishes.

Below, a "lock" is a cookie consisting of user:num, where num is the same for all locks in this commit, and different from any other current locks (even ones with the same user).  Locks persist through server crashes (todo: can we implement that efficiently?).

1. Lock the nodes you're committing new versions of (if you're adding a new file, there's no node to lock).  These primary locks will last throughout the commit.  Block on all locks while you block on any, so your commit will be atomic.

2. Create the new, unattached nodes that will hold the new versions.  Fill them up with the new contents.  Create the back-links.

3. Lock the parents of the nodes you're committing (call these "secondary locks").  Again, block on all while you block on any.

4. Set dir_entries in the locked parents to point at the newly created child nodes.  Although this cannot be atomic if there are multiple parents, it is safe, because the server knows that a node doesn't really exist if its parent is locked.  (Amend the node-finding algorithm given earlier to always follow back-links from a node whose parent is locked, until you find a node with the same lock as that parent.)

5. Now create the new version number: just take the next available number and hook it up to the root node, but put a special "negative lock" on the version number first.  This "negative lock" has the same id number as all the other locks in this commit, but it is interpreted differently.  Its point is to invalidate the other locks while they're being removed -- the idea is that if you would block on a lock, you first check whether there's a corresponding negative lock on a version number somewhere.
If there is, then you don't have to block on the original lock, because all the important parts of its commit are complete, except for lock removal.

Now, we could actually stop right there.  The commit is done; all the information is in the right place in the repository.  Removing the locks, and then removing the negative lock, *could* be done by a separate thread, just like diffication.  But it probably makes sense to take care of lock removal right now, although we can still have a separate thread that looks for inoperative locks (to clean up after server crashes).  So:

6. Remove all the other locks.

7. Remove the negative lock from the version number.

Voila.

There's probably a variation whereby one *does* reserve the version number at the outset of the commit.  If there's any reason why getting the next available version number as of commit start is preferable to as of commit end, then we can do that.  But I don't see why it would be preferable -- and I like the mellow approach, where you just grab whatever version happens to be in line when you're ready for it.  Thoughts?

*** How renames work, and what they imply

Exactly how you think they do.  To rename an entity, you make a new node for its parent directory.  In this new node, the entity's dir_entry has the new name, but still points to the same node as the old dir_entry.  That's why a name is stored in the parent, not in the thing being named.

This implies that each entity (where "entity" means roughly "a most-recent node along with everything it back-links to") has some unique identifier aside from its name.  These identifiers are stored alongside the name in the parent's dir_entry, and are unique within the project.  Having both the human-visible name and an internal identifier allows Inversion to choose between following a history by name or by identity, which can result in two different retrieval scenarios (although such circumstances are unusual, we must support them).
These unique identifiers, once assigned, are _never_ changed.  When an entity is renamed, only the name changes, not the id number.

*** How removal works.

Exactly how you think it does.  You commit a new version of the parent node, which simply doesn't have an entry for the removed child.

*** How resurrection works.

Suppose you removed `foo' several versions ago, and now you want it back.  When the user in the working copy types

   $ inv resurrect foo

what happens?  Well, in the repository, the server starts at the base version's node for the current directory (i.e., foo's parent directory).  Note that's the "base" version, not the "latest" version -- the base is the one on which the current working copy is based.  Starting there, it follows the back-links, asking at each step whether there's a dir_entry for `foo'.  If there is, it follows that link -- that gets you the most recent `foo'.

Of course, there could have been several different entities in the history named `foo' in this directory.  Which is the one the user wants?  By default, the server assumes the user meant the most recent foo.  But the user could ask for one at a specific version:

   $ inv resurrect foo:5

In that case, the server follows back-links until it gets to the directory belonging to version 5, and then looks for a dir_entry pointing to foo.

Or, the user might first need to know how many different entities have been named `foo' here, before choosing one to resurrect:

   $ inv resurrect -l foo

   /* created      removed      versions   entity_name   entity_id */
   2000-08-01     2000-10-14    1-5        foo           1729
   2000-12-15     2001-05-29    7-12       foo           1952
   2001-11-29     2002-03-20    20-103     foo           1729

From this we see that an entity named `foo' was created in this directory and later removed; then another, entirely different entity named foo was added, and later removed; then the first foo was resurrected and lasted for 83 versions before being removed (had it never been removed, the third "removed" date would be an empty field).
The procedure by which the server generates this list should be obvious.

Now the user decides to resurrect the middle foo.  Usually she would want it at its latest revision, in this case 12:

   $ inv resurrect foo:12

Please note: the examples here were meant only to illustrate various resurrection scenarios, and should not be taken as client user-interface specifications.

*** How clones (tags and branches) work

Branches and tags are both implemented in terms of "clones".  Cloning a project is constant-time and constant-space -- you just make a new name that points back to the original project.  All clones are automatically tags; once you start committing on a clone, it becomes a branch as well.

Branches *never* affect the original project -- the original data structure remains untouched; it does not even know there's a branch attached to it.  As will become clear below, this forces us to do a little more node duplication than some other schemes would, but it wins overall, because it allows users to tag and branch projects to which they don't have write access.  (Try *that* with CVS!)

Let's make a clone of myproj, called `xproj', based on myproj:5 (as it happens, 5 is the highest version in myproj, but that's not a requirement -- a branch might sprout off any version in myproj's history):

           ______________________________________________________
   myproj |___1_______2________3________4________5_________6_____...etc...
                                               /
                                              /   .----(back to myproj)----->
                                             +
                                            /
           ____/_________________________________________________
   xproj  |___5_______6________7________8________9_________10____...etc...

The xproj branch starts at version 5; if you ask for version 5 or anything earlier, you will get it, but it will be the same as that version of myproj.  (The reasons for doing things this way will become clear later.)  Right after xproj is created, it has no nodes of its own -- every request is referred to myproj's repository.
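The constant-space "alias" behavior can be sketched as follows.  This is a toy model with invented names (Project, clone, root), not a specification: a fresh clone stores no nodes of its own, and any request for a version at or before the clone's root version is simply referred to the ancestor project.

```python
# Toy sketch of constant-time, constant-space cloning; all names invented.

class Project:
    def __init__(self, name):
        self.name = name
        self.versions = {}            # version number -> root node
        self.ancestor = None          # (ancestor project, root version) or None

    def clone(self, new_name, at_version):
        """O(1) time and space: the clone just remembers its ancestry."""
        child = Project(new_name)
        child.ancestor = (self, at_version)
        return child

    def root(self, version):
        """Find the root node for `version`, referring old requests back."""
        if version in self.versions:
            return self.versions[version]
        if self.ancestor is not None and version <= self.ancestor[1]:
            return self.ancestor[0].root(version)
        raise KeyError(version)
```

An uncommitted clone behaves exactly like a tag (every lookup is a referral); the moment local versions appear in its own table, it behaves like a branch.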
When you commit a change to `foo', here's what happens: in order to
commit a new version of foo, we'd have to change foo's parent node to
refer to the new foo node.  But we can't -- we don't have write access
to myproj!  So we have to make a local copy of foo's parent.  But that
leaves the parent's parent, over in myproj, still pointing at myproj's
version of the parent...  You can see where this is going, I'm sure.
The problem bubbles right up to the top, requiring us to create a root
node locally.  All its dir_entries that are not ancestors of foo just
refer back to nodes in myproj; those that are ancestors refer to new
local copies of nodes along the line of ancestry, all the way down to
foo:

            ______________________________________________________
   myproj  |___1_______2________3________4________5_________6_____ ...etc...
                                                 /
                                                /   .----(back to myproj)----->
                                               +
                                              /
            _________________________________/____________________
   xproj   |___5_______6________7________8________9_________10____ ...etc...
               |
               | xproj's new local nodes:
               |
               root (D)
                |-- README ---------> (points back to a node in myproj)
                |-- B --------------> (points back to a node in myproj)
                `-- A (D)
                     `-- subA (D)
                          `-- fish (D)
                               `-- foo (F)
                                    (5) ---> (points back to a node
                                              in myproj)

This "line-of-ancestry copying" looks a lot more expensive than it
really is.  Remember that this example project consists mostly of
directories, but a real project is mostly files, whose nodes never
need to be copied except when they change.  And directory nodes only
need to be copied the first time one of their descendants changes --
after that, the directory node is already there.  Thus, a branch fills
out over time, eventually holding a more-or-less complete skeleton of
the project's directories, plus any changed or added files.
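The bubble-up just described is essentially path copying on a tree.
The following toy sketch (all names are invented for illustration; the
real node structure carries entity ids, plists, locks, etc.) copies
exactly the nodes along the line of ancestry into the branch, while
every dir_entry off that path keeps pointing at the original project's
nodes:

```c
#include <stdlib.h>
#include <string.h>

enum { MAX_CHILDREN = 4 };

/* A drastically simplified node: just a tag saying which project owns
   it, plus (for directories) a few child slots. */
typedef struct toy_node {
    const char *project;                   /* "myproj" or "xproj" */
    struct toy_node *children[MAX_CHILDREN];
} toy_node;

/* Commit a change in branch `branch_name' to the node reached by
   following `path' (an array of child indexes, `depth' long) from
   `root'.  Every node along that line of ancestry is copied into the
   branch; entries off the path keep pointing at the original
   project's nodes.  Returns the branch's new root.  (Error checking
   and freeing are omitted; a real branch would also reuse nodes it
   had already copied on earlier commits, which this toy does not.) */
static toy_node *
commit_on_branch(const toy_node *root, const size_t *path, size_t depth,
                 const char *branch_name)
{
    toy_node *copy = malloc(sizeof *copy);
    /* Start as an exact copy: off-path children still point back. */
    memcpy(copy->children, root->children, sizeof copy->children);
    copy->project = branch_name;
    if (depth > 0) {
        size_t i = path[0];
        /* Only the on-path child is replaced by a fresh local copy. */
        copy->children[i] = commit_on_branch(root->children[i],
                                             path + 1, depth - 1,
                                             branch_name);
    }
    return copy;
}
```

Committing to /A/subA/fish/foo thus allocates one new node per level
of the path (root, A's subtree nodes, and foo itself), which matches
the "line-of-ancestry copying" cost argument above: proportional to
path depth, not project size.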
The exact method by which local entries "point back" to the original
project is not settled yet.  It might be some general URL-like syntax
that we can use in all sorts of circumstances, like

   user@inv.hostname.domain:repositoryname:myproj:5:/A/subA/fish/foo

(This will come in handy later, when we need a way to get unique
labels for the diffy chromosome chunks in partial merging... but more
about that when we get to it. :-) )

**** "Shallow" and "deep" clones (insurance against loss of trunk)

The cheapest kind of clone is constant-time and -space.  Call it a
"surface" clone, because it includes by reference wherever possible.
For tags and branches stored in the same repository as the original
project, surface is the way to go.

But if you're making the clone on another server, then you have to
make a political judgment involving the importance of your clone, of
the historical data antecedent to your clone, and of the original
repository's stability.  What if you think that source repository
might disappear on you?  Then you want something more than just a
surface clone.  Inversion offers two choices: "shallow" and "deep"
clones.  Shallow clones locally duplicate all the data for the version
in which this clone is rooted, but they don't duplicate anything older
than that.  Deep clones duplicate the *entire* project -- the whole
node structure.  (A deep clone still remembers its ancestry, however.)

Here are the three possible kinds of clones, for reference:

   1. Surface clones
      Refers to the original project wherever possible.

   2. Shallow clones
      Duplicates the node structure of the source version that roots
      this clone.  Older versions are included by reference, however.
      Ancestry is remembered.

   3. Deep clones
      Duplicates the entire node structure of the source root version,
      including back-links.  Ancestry is remembered.

fooo: [This is why clones start at the version number they're rooted at...
(todo: explain this)]

fooo: Explain why remembering ancestry is important even for shallow
and deep clones.

**** Merging branches

whoax

*** How meta-data is stored

Each node has a property list (key->value, key->value).  Among the
things we can store in the property list are file permissions, data
type override (text, binary, whatever), whether or not to do
platform-specific line-end conversions, etc.

It might be good to have two plists, actually: `inv_props' and
`user_props'.  Inversion only uses the ones in inv_props, and promises
a "store-and-ignore" policy for user_props.  And users should never
store anything in inv_props; only Inversion handles that.

*** A closer look at the `node' data structure (and friends)

   typedef struct dir_entry
   {
     char *name;        /* Child's name, multi-lingual I hope */
     node *child;       /* The target node.  Given here as a pointer,
                           but it might be a node_id in reality. */
   } dir_entry;

   typedef struct node
   {
     unsigned long int entity_id;  /* Unvarying unique id for this entity */
     unsigned long int node_id;    /* Unique id for this node.  (See
                                      struct dir_entry above about
                                      pointer vs node_id.) */
     int type;                     /* File, directory, symlink, etc */
     BOOL diffy;                   /* Holds full text or diff? */
     plist *inv_props;             /* Properties managed by Inversion. */
     plist *usr_props;             /* Properties ignored by Inversion. */

     /* I think we might need parent pointers to do multiple hard
        links correctly; but there might be another way... */
     struct node **parents;

     /* See notes about locking above.  This may be stored as
        "out-of-band" data outside the node. */
     lock *lock;

     /* I'm conjecturing that dir_entries will be stored as chains;
        but it may turn out to be more sensible to store them as text,
        just like regular contents. */
     union contents
     {
       struct dir_entry **dir_entries;
       char *data;
     } contents;
   } node;

None of this is set in stone, of course; the boolean `diffy' might
instead be a string `diff_method', for which null means full contents
at this node.
The contents might be a string even when this node represents a
directory -- a special line-by-line texty format would list its dir
entries, and that could make it easier for us to store directory
changes in a diffy way.

** Client

There's not much to say about the client; it just has to speak the
protocol and do the right thing.  However, the layout of working
copies needs to be specified.

*** Working copies

The tentative plan is to use an IVN/ subdirectory, just like the CVS/
subdirectory.  However, the IVN/ subdirectory will be at the top level
of the checked-out portion of the project.  The reason for this is
that then an IVN/ administrative subdir in a subdirectory of the
project signifies that this subtree comes from some other project
(perhaps from a different repository).  This makes project
"transclusion" easier.  If your project depends on a library that
someone else is maintaining, you can include a working copy of their
project inside yours, without taking over versioning in any way.

The contents of the IVN/ subdir are still under consideration, but it
should be possible to sketch generally what's in there:

   - A file indicating the base version of this working copy
   - Something like a CVS/Entries file
   - A base tree of the project, for client->server diff transmission
     (OR, rsync-like records in the Entries-like file :-) )

Of course, the format of the so-called Entries file will be very
different from CVS's.  This working copy administrative area has not
been given much thought, because it will largely come out in the wash
after the repository and protocol specifications are done.

** Other miscellaneous design goals

*** Natively client/server.

Even when accessing a local repository, the ivn client connects to an
ivn server and talks the ivn protocol.  This is for maintainability --
in CVS, it's easy to write a bug by implementing something correctly
for either the client/server case or the standalone case but not both
(and more often neither, but that's another story).
The protocol may be DAV, if DAV turns out to be suitable.  Still under
investigation.

*** The client/server protocol sends diffs in both directions

Inversion transmits diffs both ways (unlike CVS, which transmits diffs
from server to client, but full files in the other direction, thus
using more bandwidth than necessary).  Ivn accomplishes this either by
keeping pristine copies locally or by using the rsync algorithm; which
way we do it is an implementation detail that hasn't been decided yet.

*** Minimize network turnarounds

The CVS client/server protocol goes to great effort to make sure that
the client and server need only a minimum of exchanges per operation.
I believe it is "theoretically" possible to never need more than 2
turnarounds (i.e., client asks server preliminary questions, server
responds, client then does the real request, server responds), and
most things can be done with just 1.

*** Minimize database turnarounds (repository storage issues)

[Note: we're no longer sure there will even be a database, but the
concerns below apply to whatever static block of sludge the repository
data is stored in, be it Oracle, Berkeley DBM, or flat files on disk.]

We need to make sure that traversing tree structures does not require
N separate database accesses, where N is proportional to the depth or
breadth of the traversal.  For example, if retrieving an old version
requires chaining down from the present version, and each link on the
chain is another database access, that's bad.  Database accesses
should be thought of like network turnarounds -- one hopes to arrange
things so that any operation requires only a (small) constant number
of them, not N.

*** Times should be proportional to change size, not project size

The time to complete any operation should be proportional to the size
of the change that operation relates to, not the total amount of data
in the project or in the repository.
Committing three files should take roughly the same amount of time in
a project with ten files as in a project with forty thousand.  Same
with diffing three files, etc.

An exception is made for recency: retrieving an old version may take
longer than retrieving a recent one, because old versions take longer
to construct.  This is a tradeoff of using diff-based storage, and it
has worked out pretty well in practice.

** Scenarios

Each scenario should be mentally played out for a given design,
analyzing for performance and scalability issues (particularly in
network traffic, storage size, and storage accesses).  If you spot a
weakness in a design, add a scenario here that highlights it: even if
that particular design gets tossed, the scenario will probably reveal
similar weaknesses in other designs.  (On the other hand, if a
scenario never reveals weaknesses in anything, then remove it -- it's
not helpful if it only demonstrates the ease of implementing that
operation well.)  Currently this list is a brain dump; there's no
reason it can't be shortened.

TODO: incorporate the WebDAV/DeltaV scenarios list here too.

   + Commit from an up-to-date wcopy
   + Commit from a non-up-to-date wcopy, with the modified files not
     up-to-date
   + Commit from a non-up-to-date wcopy, with the modified files
     up-to-date
   + Updating to most recent version(s), with no conflicts
   + Updating to most recent version(s), with conflicts
   + Party A commits while Party B's commit is in progress, but no
     conflicts

     In this context, "conflicts" means the same entities were
     changed, even if the changes are non-overlapping.

   + Party A commits while Party B's commit is in progress, with
     conflicts

     Again, "conflicts" means the same entities were changed, even if
     the changes are non-overlapping.  Conflicts can mean changing the
     same file, or adding the same file, or one person modifying it
     while the other wants to delete it, or whatever.
     (I don't think it will matter in the end, as the implementation
     is mainly concerned with the idea that someone wants to do
     something with a non-up-to-date file.)

   + Add a new entity (directory or file)
   + Add a new entity (directory or file) from a non-up-to-date wcopy

     Both with the current directory up-to-date, and not.  (First
     determine what the desired behavior is!)

   + Rename an entity
   + Retrieve an old version of an entity that has been renamed
   + Remove an entity
   + Resurrect an entity
   + Move an entity to a new directory (may be different from rename)
   + Diff two particular versions of a file
   + Diff a working (uncommitted) version against an old version of a
     file
   + Retrieve version X of a file
   + Retrieve version X of a directory tree
   + Show annotations of both modified and old versions of files

     (Golly, we need some word other than "annotations".  To me it
     implies that one has added log messages to individual lines, or
     something like that, not that CVS is showing you who did what and
     when.)

   + Show annotations of a directory, similarly

     Who created the file, and who last committed to it.  Should this
     be a wish-list item, though?  CVS doesn't do it, but then again,
     CVS doesn't do directories either...

   + Make a branch
   + Make a tag (same as making a branch, in all proposed designs)
   + Merge a branch into trunk

     Not talking about the user interface to branch merging here, but
     rather the internal implementation: how the system discovers what
     is the same vs what needs merging, what conflicts, etc.

   + Merge a branch into trunk again
   + Merge a branch into another branch

     Is this fundamentally different from merging into trunk?  Is the
     trunk "special" in any way, implementation-wise, or is it just
     the first branch?

   + Change a previously committed log message
   + more to come...

* Meta-implementation

This section describes the process of implementing Inversion -- what
order we do things in, and how.
Once the design is ready, the parallelizable parts of the software
should be determined, and tasks divvied up.  At this point or very
soon after, the client/server protocol needs to be designed too.  We
will evaluate WebDAV (using the same examination-under-common-ops
technique as with the data design); if WebDAV is suitable, it would be
a good choice, because it already has some political acceptability.

During the initial coding phase, everything that can be farmed out to
external programs will be.  For example, the diff routines will just
invoke xdelta and diff (or whatever), and the client side will invoke
an external patch.  In the long run, these things will very likely be
internalized for speed, but their semantics won't change when we do
that, and it's more important to get the stuff up and running as soon
as we can, and optimize afterwards.

There should be an automated test suite.  It should not be written in
Bourne shell script (we can at least learn this lesson from CVS!).

Authentication questions will be put off until most of the software is
working, since they probably don't affect the rest of the design much.

* Jason's comments

Please note that these comments are based on an older incarnation of
the design.  I haven't gone over them yet to see which ones still
apply. -kff

--------------------8-<-------cut-here---------8-<-----------------------

>From: Jason Elliot Robbins
>Subject: Inversion feature outline (was: for meeting w/ Rational)
>To: kfogel@red-bean.com
>cc: jrobbins@collab.net, npm@collab.net, brian@collab.net, jon@collab.net
>Date: Thu, 02 Mar 2000 19:02:47 -0800
>
>Hi Karl,
>
>Here are some comments I came up with while reading your Inversion
>document on my last plane trip.  Overall this document looks like a
>good start, but there is a lot to work out.
>
>When would you like to put this document on the inversion.tigris.org
>web site?
>
>Here are my comments:
>
>I know this is long; the best comments are 1, 7, 12, 14, and 17.
>
>1.
>We should plan for web integration.  That could mean a lot of
>things.  One idea off the top of my head is to actually use a URL for
>the CVSROOT: I would make it an actual http:// scheme, rather than
>some new ivn: scheme, so that there is a clear place to host a web
>page with instructions about that repository.  Alternatively, we could
>use a new scheme like ivn:argouml.tigris.org, so long as the CVSROOT
>is very memorable to people who know the web address.  Right now the
>mapping between projects (e.g., www.webmacro.org, or java.tigris.org)
>and their CVSROOT varies a lot from site to site.  It would be nice to
>make this a no-brainer.

Well, I'm not so sure...  The reason this mapping (www<-->repository)
varies so much is political, not technical.  For example, it so
happens that the hostname argouml.tigris.org hosts one repository with
one project in it, so there would be no ambiguity in referencing that
project by the hostname.  But there are many servers that host
multiple projects, and not all of those projects have their own
domains.  In many (most?) cases, the repository server is on a
different host from the Web server.  Or: sometimes people want to
deliberately associate their project with an organization, so they
don't want a strong link between the project's name and its Web
address.

Also, as far as I can tell, people don't really have trouble figuring
out a project's web site given a working copy, or vice versa.  They
may have trouble *using* CVS to get the working copy, but that's a
different matter...

I think it's actually not our place to make assumptions about how
people want to present their project to the world.  It *is* our place
to make sure that our system provides no obstacles to whatever method
a user chooses to make her project web-visible, but "no obstacles"
doesn't mean special support for anything in particular.
If someone wants to keep their project's web pages in the project's
source tree, and point their web server at the right place, there's
nothing stopping them from doing that.  If they arrange it so the
names are the same, that's great -- and if they don't, there might be
a reason for that.  Hmmm, I can go into greater detail on this if you
want; my hands are tired of typing right now, so I'll leave this
question here.

>2. I would like to make the difficult things in CVS easier in Ivn.  I
>think you mention that you would like to make things like moving
>files and directories, etc. easier.  Actually, the second paragraph
>in Meta-implementation is definitely the right idea.  Looking at all
>the multi-step procedures outlined in your CVS book and making them
>one step in Ivn would be great.

Oh gosh you bet! :-)

>3. The concept of assigning a new version number to every file every
>time there is a commit is rather foreign to me.  Is this used in
>other vc systems?  Are other people familiar with this practice?
>Does that mean that it would be very awkward to verbally tell someone
>to check out a certain version (e.g., version 2378812, instead of
>4.7)?  Would it be possible to tell at a glance how many versions a
>file has gone through?

I don't know; I don't know; no, I don't think so; and yes. :-)

Perhaps we could make a distinction between a file's commit count and
its project's commit count -- i.e., two kinds of version numbers,
treated equally; which one you use depends on circumstances.  Let's
mull on that.  I do worry about the potential for confusion, though...

How big do project version numbers get?  Well, to give you an idea,
the entire CVS tree (dating from 3 Dec 1994) has 4254 commits.  4254
isn't such a bad version number -- I wouldn't mind having to deal with
a few more digits than that, personally.

I don't think of it as assigning a new version number to every file on
every commit -- that's not what's happening inside anyway.
Instead, I think of it as a single version number for every *commit*,
and a commit involves certain files.  As for files not involved in
that commit, if you ask for one of them at that version number, you'll
just get the most recent version not exceeding the one in question.
[My table description may seem not to match what I'm saying here, but
that table description is just a draft and will be updated.]

>4. Why make keyword expansion off by default?  If I want to do it, I
>will have to search through the documentation to find the right
>option.  I realize that it might mess up binary files; is there
>another way to make that safe?

Well, right now, if you don't want to do it, you have to search
through the documentation to turn it off. :-)

Seriously, I'm by no means sure that off is better; I just lean that
way because I've been nailed several times in the past by not turning
it off when I should have (not talking about binary files -- I mean
text files that happened to contain text that matched keywords, but I
didn't want that text to be expanded).

[todo: more to say here, fingers falling off, will respond more
later.]

>7. How about the idea of "local branches" that are stored on a
>developer's working machine in addition to the local working copy of
>the checked-out files?  This is basically a fraction of the total
>repository stored on each client.  The local repository would be
>filled in on demand, e.g., if you do a diff against an old version,
>that version would be stored in your local repository.  Developers
>could also commit to the local branch several times before eventually
>merging into the central server's main branch.  If local branches are
>accessible via the internet, then we could basically do away with
>patches.  If a developer does work that he wants to contribute back
>to an open source project, he could just email the URL of his local
>server.
>Also, the concept of local branches could be generalized to build a
>hierarchy of servers, with the global project home server at the top,
>various mirrors in the middle, LAN servers near the bottom, and each
>developer's local server at the very bottom.  This has the advantage
>that every developer can have commit access to some repository; we
>get a range of trust rather than a boolean permission to commit to
>the central server or not.

Sure!  I think what we have to do is design this in such a way that it
doesn't take any special support from us to create this concept.  That
is, if one knows the client/server protocol, one can implement local
branches without us changing anything.  So, we should think about this
as we go, but I don't think we should actually make it part of the
first release.  We should just try to have a design that doesn't make
this feature impossible.

[There are some "theoretical" problems that come up in schemes like
this, having to do with conflict resolution of course, which we have
to think through.  So we may have to back down from this idea to an
arbitrary degree...  I hope not too much, though, 'cause it's neat.]

>8. I would like to have some kind of plug-in permission system.  I am
>wary of yet another set of user account records and permissions.  We
>have found that half the tools we want to integrate into tigris have
>their own user account systems, and that makes a lot of duplicate
>data that needs to be kept in sync.

Yeah -- some way of saying "default to the system users and access
methods on this host", and then you can add Inversion-only users as
necessary.  (?)

>9. Some of the data in your tables is not as normalized as it could
>be.  I know this may not be a priority, but here are some things I
>noticed: the names of entities and their states could be in separate
>tables.
>Line mods could be a table of line ranges rather than text; I guess
>it depends on what format you were thinking of for that text field,
>but it seems like you might do a lot of parsing.  The names of
>properties in key-value pairs could be stored in a property-names
>table.

Yup, thanks.  Still a draft; we'll normalize.

>10. I am a little concerned about storing the parent container
>reference and a basename.  I would like to have permissions limited
>to a given scope that matches the complete path.  For example, I
>would like to grant a particular developer the ability to commit
>.*/dist/.*\.html but not any .java files or any file not in a
>subdirectory of some dir named api.  You have some of this in your
>ACCESSES table.  For the sourcexchange.com web site we developed a
>much more general permission scheme that you can read about at
>http://joist.tigris.org.

I'll look into this.

>11. I would like to see some audit mechanism.  I think in this
>version of the document you left off the basic ability to see who did
>each commit.  More generally, we should consider the repository
>administrator role in all its glory.  I don't want to require a lot
>of administration work, but whatever needs to be done should be put
>under the same "evaluation across operations" as we do for developer
>operations.

Oops... the problem referred to in your second sentence is due to an
error in transferring my tables from paper to disk.  Fixed now. :-)
In general, yes, I think I see what you mean about the admin role.

>12. It would be nice if the server could be installed without root
>access.  This would allow a lot more people to set up servers for
>their local work groups, etc.  As a student, I never had root access
>to a big machine, and a lot of people working on OS are students.
>Some small local server is also important to the local branch idea.
>Apache can be installed without having root access, and so can ezmlm
>mailing lists; I really like that power for everyone.
>The security implications are good, and it would surely simplify
>installation.

No reason why this can't happen.  Nothing in the design as it stands
right now requires root, AFAIK.

>13. We should consider the following levels of internet connectivity:
>high speed, low speed, disconnected (e.g., laptops and modems), and
>flaky connections.  Low-speed constant connections (e.g., in distant
>countries) might benefit from background filling-in of their local
>repository history.  Flaky connections might benefit from a resume
>capability (I guess that comes for free in CVS, but we need to make
>sure there are never any lock files left behind on either end).

Well, we will make sure that locking can handle connections broken at
any point, by doing things carefully ("carefully" in the same sense as
a filesystem that guarantees mv to be atomic even if the power goes
out).

>14. Would you consider implementing Inversion in Java?  The advantage
>is that it runs on all platforms, JDBC connects to all databases, and
>RMI would make the client-server implementation easier.  Web
>integration with Java servlets would certainly be easier.  Speed is
>somewhat of an issue, but Java is pretty fast now, and a lot of the
>speed will be determined by the database rather than our code.  I am
>not concerned about CPU-bound speed problems because processors are
>getting insanely fast and we can have fast machines in our racks.
>Speed could be a problem for small (student) projects that don't have
>good machines, but then again "small project" implies that they
>probably don't have that many concurrent accesses anyway.  I think
>selling Inversion as "version control that you actually want to use"
>trumps speed claims anyway; we should make it easy to use and easy to
>enhance.  Java would also give us no-brainer internationalization
>support.
In my experience (not nearly as great as yours), Java is the least
portable language: it is unusual in requiring *both* a compiler and a
run-time, and naturally I've had problems with both.  Maybe things are
different now...

I do think (as a political thing) that when people hear Java they
shudder in anticipation of the problems they're going to have getting
the thing running.  Even though they may be wrong, this reaction is
something to take into account.

I mean, I guess I can't say "No, I wouldn't even consider implementing
this in Java", but I certainly do lean against it...
Internationalization support would be nice, but is it that hard in
other languages, even C?

>15. Compression.  It would be great if the protocol batched up all
>the files that need to be transferred for a given transaction and
>then compressed the whole set of files using a standard zip algorithm
>or something.  Typically, compressing many similar files together can
>give better results than compressing each independently.  This is
>assuming dictionary-based compression, I think.

Good idea!

>16. What is the import/export format for this?  Would I just do a
>mysqldump?  Should there be an XML file format for versions and log
>entries and such?

Dunno. :-)

>17. It would be nice to have scripts executed for any possible
>action.  CVS has scripts that run on many actions, e.g., commit and
>tagging.  Actually, if we use Java, it would be possible to define a
>sandbox to run these scripts safely.  That would allow developers to
>upload "inversion agents" -- .class files invoked whenever an action
>is taken on the server -- that do some kind of testing or
>notification on behalf of that particular developer.  This could be a
>good hook for integrating Inversion with a server-farm build system,
>implementing the hierarchy of servers described in suggestion #7.

Yeah; one of the things I want to do is go over the commit process
step-by-step and figure out every possible place where we can give
users a hook to run a script.
CVS offers a few good places; we can offer more, I think.

>18. You have specified that the contents of the source files should
>be in the database.  An alternative would be to just have the name of
>some file in the file system.  Could you explore/explain that
>tradeoff?

Yes -- we'll do whichever is better. :-)

Seriously, I don't know how various databases perform with big text
chunks.  Unless it's a serious hit, though, I think we should try to
stay in the database.  I mean, why use it unless we're going to use
it?  The only reason I can see to implement an extra level of
indirection here would be if it gave us some big performance boost.

* Wish List

Put blue-sky requests here...