Oh Most Noble and Virile Emacs, please be in -*- outline -*- mode!

Inversion: A Version Control System You Actually Want To Use
------------------------------------------------------------

($Revision: 1.41 $ of this document.)

* Introduction

This document describes Inversion, a replacement for CVS.  We may sometimes call it "Ivn", to save space.  It will be described largely in terms of how it differs, or doesn't differ, from CVS; if you don't already know CVS, this document will be pretty unreadable.

Where you see the word "version", it means the same thing "revision" does in the CVS documentation.  Everyone always ends up saying "version" anyway, then correcting themselves, so let's just give up already.  Besides, it's shorter to type.

"Project" and "repository" also mean what you think they do -- a repository is a place where multiple projects are stored.  Sometimes I might slip and say "repository" where I should have said "project", because CVS uses the r-word for both.  Bear with me.

* Requirements

** All CVS features are supported.

Everything CVS has, we want.  Versioning, folding of non-conflicting changes, detection of conflicting changes, branching, merging, historical diffs, log messages, line-by-line history ("cvs annotate"), the works.  Generally, Inversion's conceptual interface to a particular feature will be as similar to CVS's as possible, except when there's a compelling reason to do otherwise.

** Directories, renames, and file meta-data are versioned.

Naturally!

By the way, rename support means you can trace through ancestry by name *or* by entity.  For example, if you say "Give me version 12 of foo.c", do you mean version 12 of the file whose name is *now* foo.c (but which was perhaps named bar.c back at version 12), or the file whose name was foo.c in version 12 (a file which perhaps no longer exists, or has a different name now)?
Inversion offers a way to specify which interpretation you want, while defaulting to the most reasonable interpretation in context.  We haven't decided what "reasonable" means yet, but we're working on it.

** Version numbers are project-wide and atomic.

Unlike CVS, each Inversion commit results in a single new version number, which applies to the project as a whole (i.e., it applies to every object in the project, whether changed or not).  There are no longer separate per-file version numbers, at least not user-visible ones.

Note: Well, it may be useful to give users access to individual files' or directories' "commit count" (the same as the per-file "revision number" in CVS).

Commits are atomic -- either the entire commit succeeds, or the entire commit fails.  No one can ever retrieve a project tree in an inconsistent state.

** Text vs binary issues are handled better than in CVS

...depending on your definition of "better", of course.

Historically, binary files have been problematic in CVS for two unrelated reasons: keyword expansion, and line-end conversion.

Keyword expansion is, for example, when CVS expands "$ Revision$" into "$ Revision 1.1$", and each successive commit updates that to the next revision number.  There are a number of keywords in CVS: "$ Author$", "$ Date$", and so on.  (Note: the examples above contain an extra space to prevent keyword expansion from happening to them!)

Line-end conversion is when CVS gives plaintext files in the working copy the appropriate line-ending convention (LF, CRLF, or CR) for that platform (the repository always stores text files in Unix LF format).  Line-end conversion requires knowing that the file is plaintext, of course -- doing it to an image file or other binary would only mess things up.

Both keyword substitution and line-end conversion are possible only with plain text files.  CVS recognizes only two file types anyway: plain text, and binary.  CVS assumes files are plain text unless you tell it otherwise.
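As a toy illustration of the line-end rule just described (LF in the repository, platform convention in the working copy) -- this is a sketch only, with invented function names, not Inversion's actual interface; note that it must never be applied to binary files:

```python
# Hypothetical sketch of CVS-style line-end conversion.  The repository
# always stores text with LF; the working copy gets the local convention.
# Function names here are invented for illustration.

def repo_to_working(data: bytes, platform_eol: bytes) -> bytes:
    """Convert repository text (always LF) to the platform's convention."""
    return data.replace(b"\n", platform_eol)

def working_to_repo(data: bytes) -> bytes:
    """Normalize any of CRLF, CR, or LF back to the repository's LF form."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

Running either function over an image or other binary would corrupt it, which is exactly why the file's text-vs-binary type must be known before conversion happens.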
Inversion, for its initial release at least, will recognize the same two types (later, it may learn to do keyword substitution inside word-processor formats, etc).  The question is, how does Inversion determine whether a file is plain text or not?

Experience with CVS suggests that assuming text unless told otherwise is a losing strategy -- people frequently forget to mark images and other opaque formats as binary, and then have to spend time figuring out why CVS mangled their data.  So, Inversion assumes files are binary, unless they have a standard text extension (.c, .h, .pl, .html, .txt, and so on), or the user explicitly marks the file as texty.  In addition to a pre-defined list of text extensions, each project can have a custom extension list as well.  And, of course, a user can choose to mark a file as text or binary no matter what its extension.

Text files undergo line-end conversion by default.  Users can turn conversion on or off per project, per file pattern, or per file.

Text files do *not* undergo keyword substitution by default.  Users can turn substitution on or off per project, per file pattern, or per file.

(The reason keyword substitution is off by default is that I've seen it cause problems in files where variable names or string constants happened to contain keywords.  But these problems are somewhat rare, and usually easily remedied, so I'm not entirely convinced that defaulting to `on' would be all that bad either... thoughts?)

By the way, any of the above properties attached to a file are really attached to a particular version, and each new version inherits from the previous version except when told otherwise.  Thus type changes are recorded like any other historical data (in practice, the type will probably be stored in the entity's property list; see the implementation discussions farther down).

** I18N/Multilingual support.
Inversion has I18N support -- commands, user messages, and errors can be customized to the appropriate human language at build time.  There will also be I18N support for the *names* of versioned entities, not just their contents.

For purposes of keyword expansion and line-end conversion, Inversion also understands the UTF-* encodings (though this may come after the first release).  (It may be workable to just treat everything as UTF-8 at first; for non-UTF-8 text files with a few "meta" characters, this might at worst result in some number of bytes after the meta character being ignored... no big loss, except when a keyword code or line-end falls within the lost bytes, but that probably wouldn't happen too often...)

*** User-visible messages

Inversion can be localized so that its messages print out in a build-time-configured language.

** Branching and tagging are constant-time operations in Inversion.

CVS makes an unnecessary distinction between branches and tags, and then goes on to implement them in an inefficient way.  Inversion supports them both with one efficient operation: `clone'.  To clone a project is to create another project exactly like it, except that the new project "knows" that the old project is its ancestor.  Version numbers in the new project start at the ancestor's version number.

Clearly, at the moment of creation, a clone requires only a tiny amount of constant space -- the clone is like an "alias" for the old project.  If you never commit anything on the clone, then it's just like a CVS tag.  If you start committing on the clone, then it's a branch.  Voila!

This should also get us CVS "vendor branching" (since we'll have real rename and directory support).

(But note that it will be supported and normal to define the clone's name inside the original project's namespace, because people mentally think of tags and branches as being "inside" the project, not as fully independent top-level objects themselves.
And we certainly don't want cross-project name conflicts in tags and branches!)

[Thanks to Jim Blandy for pointing out that `clone' is a clean way to get both tagging and branching.]

** Repeated merges are handled gracefully.

Inversion merges are the same, in principle, as CVS merges: merging means taking the differences between version A and version B, and applying those differences as a patch to version X (which may simply be the working copy, rather than a formal repository version).  BUT, Ivn remembers what has already been merged, and doesn't merge anything twice unless forced to.  So if you merge the differences between A and B into X, and then a month later do it again (by which time the branch has moved on from B to some later version C), the second merge will only get the changes from B to C, *not* from A to C.

This will require a more complete specification, of course, but the above is the general idea.  Note that per-object merge records will need to be maintained on the server side, since people often need to merge project subsets.

(There are some theoretical problems with remembering merge sources -- knowing where the merged data came from implies some sort of universal repository registry.  However, our first goal is to make sure that multiple merges from branches made *in the same repository as the original project* compound gracefully.  Remembering merges from remote sources is harder, due to the difficulty of distinguishing remote sources, but there are good "90%" solutions that will work in practice.)

** Inversion has a flexible log message policy

A small matter, but one dear to our hearts...  Log messages are a matter of project policy, not version control software policy.  If you give no log message, then Inversion defaults to an empty message.  Maybe there can be a per-project configuration option requiring non-empty log messages for commits *in that project only*, though I have doubts about even the usefulness of that.
(CVS tries to require log messages, but fails: we've all seen empty log messages in CVS, where the user committed with deliberately empty quotes.  Let's please stop the madness now.)

** Conflicts are handled as in CVS, but a little better

For text files, Inversion resolves conflicts similarly to CVS, by folding repository changes into the working files.  For *both* text and binary files, Inversion also always puts the pristine repository version in one temporary file, and the pristine working copy version in another.  Thus, in a text conflict, the user has three files to choose from:

   1) the combined file, with conflict markers
   2) the original working copy file
   3) the repository revision from which the update was taken

CVS provides the first two, but doesn't directly provide the third.  It's true that one could retrieve the repository version from the repository (using update -p or whatever), but it's more convenient to have it right at hand.  When the conflict has been resolved and the working copy is committed, Inversion can automatically remove the two pristine files.

*** Content knowledge allows better conflict resolution strategies

Certain kinds of conflicts can be resolved without human intervention.  For example, files in the format of /etc/passwd just need to keep lines unique on username and/or user ID.  But when merging new repository data into a modified working copy of such a file, it's possible to get a textual conflict even with no "semantic conflict".  If the Inversion client knew something about the format of passwd files, it could merge in the new lines without creating a conflict.  A similar rule could be used for ChangeLogs, based on the date in the header lines.  And so on.  These features may not be in the first release of Inversion, but let's keep them in mind.

** Support for plug-in client-side diff programs

There is no need for Inversion to have every possible diff mechanism built in (or even any mechanism built in).
Instead, it invokes the appropriate client-side diff'er on the two versions of the file(s) locally.  For text files, the appropriate diff'er is the system `diff' program, or maybe "emacs -f ediff".  For image files, it might be some fancy visual comparison product.  Probably basic text diffing should be built into the client, but that doesn't obviate the need for clients to be able to specify alternate (or additional) difference viewers, even for text.

** Multisite / Local Repository

Two commonly requested enhancements for CVS are "multisite" and "local repositories".  Although the two terms are sometimes used interchangeably, they refer to two different things (explained below).  Inversion could eventually support both, but they are not priorities for the first release.  Nothing in the design prevents them from being added later.

*** Multisite

This is like the ClearCase multisite feature.  Essentially, it is a redundant distributed repository.  The repository exists on two or more cooperatively mirroring servers (each one presumably being close, network-wise, to its intended users).  Commits on any server are visible on all of them.  I'm not sure how conflict resolution is handled when people at different mirrors change the same file at the same time -- perhaps one of the commits is backed out and returned to that user?  Hmmm.  Anyway, CVS has gotten away with simply assuming decent Net access for all users; this doesn't satisfy everyone, but for the first release of Inversion we should probably do the same.

*** Local repository

This is one that people request a lot: the ability to commit changes first to a local "working repository" (not visible to the rest of the world), and then commit what's in the working repository to the real repository (with the several commits usually being folded into one).  Why do people want this?
I think mostly it's the psychological comfort of making a snapshot whenever one reaches a good stopping point, without necessarily wanting all those "comfort points" to become publicly-visible commits.  This is understandable, but of course the local repository will have to refuse to commit changes if any of the files turn out not to be up-to-date.  (There may also be a very cheap way to implement 90% of this feature, by making skeleton duplicates of the working copy's directory tree and copying the changed files into that skeleton...)

(Are there any other reasons people want this feature?  I remember seeing posts on info-cvs explaining what appeared to be other reasons, but frankly I could never understand them very well...)

For now, similar effects are achievable without involving the version control system, so this feature is not a priority for the first release of Inversion.

* Design

** Repository

*** The basic repository structure

[This design is drawn from Jim Blandy's "Subversion" spec, but with some changes from Ben and Karl (i.e., Jim is not to be held responsible for everything you read below :-) ).]

Suppose we have a new project, at version 1, looking like this (pardon the CVS syntax):

   prompt$ inv checkout myproj
   U myproj/
   U myproj/B
   U myproj/A
   U myproj/A/subA
   U myproj/A/subA/fish
   U myproj/A/subA/fish/foo
   prompt$

(Only the file `foo' is a regular file; everything else in myproj is a directory so far.)

Let's see what this looks like as an abstract data structure in the repository, and how that structure works in various operations (such as update, commit, and branch).  This data structure is described as though it lives in permanent heap memory; how we will actually accomplish permanent storage is an implementation detail (probably involving DBM files of some sort).
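Before walking through the diagrams, the abstract structure just described can be modeled in a few lines of Python.  This is purely an illustrative sketch with invented names (FileNode, DirNode, Repository): a version array pointing at root directory nodes, directory nodes holding dir_entries, file nodes holding contents, and every name living in the parent rather than in the node itself.

```python
# Toy model of the repository structure; all class and field names are
# invented for this sketch, not part of any actual Inversion design.

class FileNode:
    def __init__(self, contents: bytes):
        self.contents = contents
        self.back_link = None      # (version, older_node) once committed over

class DirNode:
    def __init__(self, entries=None):
        # dir_entries: name -> child node.  The child's name lives here,
        # in the parent, not in the child node itself.
        self.entries = dict(entries or {})
        self.back_link = None

class Repository:
    def __init__(self):
        self.versions = {}         # version number -> root DirNode

# Version 1 of `myproj': mostly directories, with one regular file `foo'.
repo = Repository()
foo = FileNode(b"int main(void) { return 0; }\n")
repo.versions[1] = DirNode({
    "A": DirNode({"subA": DirNode({"fish": DirNode({"foo": foo})})}),
    "B": DirNode(),
})
```

Retrieving a file at some version is then just a walk from the root node that the version array entry points to, following dir_entries downward.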
In the diagrams that follow, straight horizontal lines with arrowheads represent backwards time-flow (the regression of versions), and any other kind of path represents a parent-to-child connection in a directory hierarchy.

Boxes are "nodes".  A node is either a file or a directory -- a letter in the upper left indicates which kind.  A node's name is stored in the node's parent, not in the node itself, for reasons that will become obvious.  A file node has a byte-string for its content, whereas a directory node has a list of dir_entries, each pointing to another node.

At the top of the repository is an array of version numbers, stretching off to infinity.  Since the project is at version 1, only index 1 points to anything; it points to the root node of version 1 of the project:

   Figure 1: The `myproj' repository at version 1.

        ( myproj's version array )
    ______________________________________________________
   |___1_______2________3________4________5_________6_____...etc...
       |
       |
    ___|_____
   |D        |
   |         |
   |  A      |      /* Two dir_entries, `A' and `B'. */
   |   \     |
   | B  \    |
   |__/___\__|
     /     \
    |       \
    |        \
 ___|____     \________
|D       |    |D       |
|        |    |        |
|        |    | subA   |      /* One dir_entry, `subA'. */
|________|    |___\____|
                   \
                    \
                 ___\____
                |D       |
                |        |
                | fish   |      /* One dir_entry, `fish'. */
                |___\____|
                     \
                      \
                   ___\____
                  |D       |
                  |        |
                  | foo    |      /* One dir_entry, `foo'. */
                  |___\____|
                       \
                        \
                     ___\____
                    |F       |
                    |        |
                    |        |      /* (Contents of foo not shown.) */
                    |________|

What happens when we modify `foo' and commit?  First, we make a new `foo' node, containing the diff from version 2 of foo to version 1.  Let's use ":N" to express version relationships; so, this diff is the result of:

   diff foo:2 foo:1

Note the order there: this is a reverse diff.  The idea is that foo:2 will hold the full text of the new version, and foo:1 will become a diff.  This is done to save space; it does not affect the semantics of the implementation.  (Also note that we're assuming a diff program that can handle binary data!)
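The reverse-diff scheme can be illustrated with Python's difflib standing in for the binary-capable differ the note calls for (difflib itself is line-based, so this is only a toy; all names are invented for the sketch).  The head version keeps full text, and the older version is stored as an edit script that rebuilds it from the head:

```python
import difflib

# Toy reverse-delta storage: the newest version keeps full text; the
# older version is kept only as the edit script that turns the new text
# back into the old text.  All names here are invented for illustration.

def reverse_delta(new, old):
    """Edit script that rebuilds `old` from `new` (both lists of lines)."""
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=new, b=old).get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))       # reuse lines from `new`
        else:                                  # replace / delete / insert
            ops.append(("lines", old[j1:j2]))  # store the old lines themselves
    return ops

def apply_delta(new, ops):
    """Reconstruct the older version from the head text plus its delta."""
    out = []
    for op in ops:
        if op[0] == "copy":
            out.extend(new[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out
```

Storing deltas in this direction means the common case (fetching the newest version) needs no reconstruction at all; only historical retrievals walk back through deltas.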
The new node is not connected to anything yet, it's just hanging out there in space: ________ |F | | | | | |________| It doesn't even know what version it belongs to; we'll get to that in a moment. Next, link the new node into the tree, where the previous node was, and create a "back-link" (I forgot to tell you, nodes have a back-link field) from the new node to the old node. Back-links are shown as parenthesized numbers with an arrow leading to the older node: Figure 2: The `myproj' repository, in an intermediate state during the commit of version 2 (a modification to the file `foo'). ______________________________________________________ |___1_______2________3________4________5_________6_____...etc... | | ___|_____ |D | | | | A | | \ | | B \ | |__/___\__| / \ | \ | \ ___|____ \ ________ |D | |D | | | | | | | | subA | |________| |___\____| \ \ ___\____ |D | | | | fish | |___\____| \ \ ___\____ |D | | | | foo | |___\____| \ \ ___\____ ________ |F | |F | | | | | | (1)------------->| | | | | | | full | | diffy | |contents| |contents| |________| |________| A back-link is a version number attached to a pointer into the past. The version number tells you the version of what the links points to, not what it points from. So if you're looking for FILE:N, and when you get a node for that file you find a back-link pointing to a version >=N, then you know you must follow that link. But I digress. Right now, we're learning how to commit foo:2, and there's one more step needed to complete that commit -- register the new version at the top of the repository, thus making it externally visible. That's done by making entry 2 in the version array point to something: Figure 3: The `myproj' repository, after commit of version 2. ______________________________________________________ |___1_______2________3________4________5_________6_____...etc... 
| | | / ___|_____/ |D | | | | A | | \ | | B \ | |__/___\__| / \ | \ | \ ___|____ \ ________ |D | |D | | | | | | | | subA | |________| |___\____| \ \ ___\____ |D | | | | fish | |___\____| \ \ ___\____ |D | | | | foo | |___\____| \ \ ___\____ ________ |F | |F | | | | | | (1)------------->| | | | | | | full | | diffy | |contents| |contents| |________| |________| Version 2 points to exactly the same root node as version 1, because that directory hasn't changed at all. Nor has its child, nor has anything until you get all the way down to foo. That's why nodes don't store version numbers -- the exact same node may appear in many different versions. Here's how to retrieve foo:N, in general: 1. Go to the version table, find version N. 2. Walk down the tree in the obvious way, starting from the root node N points to. As you walk, anytime you get to a node with a back-link >=N, follow the link before continuing downward. (Follow this rule even when you get to a node for `foo'.) 3. When you have nowhere else to go, this is the droid you're looking for. Now watch what happens when we add a new file `bar' into `fish' (i.e., bar will be a sibling of `foo'). Here's the new tree, the intermediate steps not being shown: Figure 4: The `myproj' repository, after commit of version 3 (the addition of a sibling to `fish/foo'). ______________________________________________________ |___1_______2________3________4________5_________6_____...etc... 
| | / | / / ___|_____/ / |D |______/ | | | A | | \ | | B \ | |__/___\__| / \ | \ | \ ___|____ \ ________ |D | |D | | | | | | | | subA | |________| |___\____| \ \ ___\____ |D | | | | fish | |___\____| \ \ ___\____ ________ |D | |D | | | | | | (2)--------->| | | | | foo | | | |___/____| | | / | | / | bar | / / | / /| foo | / / |___\____| / / \ / / \ / / __\___/_ ________ ______/_ |F | |F | |F | | | | | | | | (1)------------->| | | | | | | | |________| | full | | diffy | |contents| |contents| |________| |________| Trace various retrievals in the above structure, and you will see that + fish:2 and fish:1 are the same node, as they should be. + But fish:3 is different from them, which is also as it should be. + foo:3 and foo:2 are the same node, as they should be. + But foo:1 is different from them, which is also as it should be. Thus, the traversal cost of retrieving version N of something is equal to its depth in the tree plus the number of changes it or its ancestors have undergone. Unchanged entities cost nothing. Whether fish:2 is stored as some sort of diff from fish:3 is an implementation detail. We could do it that way, but it may not be necessary, since new directory nodes would only be created when a new file or directory is added to the project, or a name is changed. Most commits tend to be edits to existing files. Just to drive the point mercilessly home, and to explore the theoretical limits of ASCII diagrams, here is myproj:4, in which a README was added to the project's top level directory: Figure 5: The `myproj' repository, with new top-level README: ______________________________________________________ |___1_______2________3________4________5_________6_____...etc... 
| / / / | / / / ____|____/ / / _________ |D |______/ / |D | | | / | | | (3)------------------/---------------->| | | | / _____|__ B | | README | / / | | | | |___________/ / | A | | B | | ____________/ |___|_____| | / |__|_________ / | | / A | \/ _________ | |_|___\___| /\ |F | | | \________ / \_____| | | | | / | | | | |/ | | | | / |_________| | | /| | | / | | | _______/ | | | | | | | | | | ___|__|_ __|_____ | |D | |D | | | | | |________________________| | | | subA | |________| |___\____| \ \ ___\____ |D | | | | fish | |___\____| \ \ ___\____ ________ |D | |D | | | | | | (2)--------->| | | | | foo | | | |___/____| | | / | | / | bar | / / | / /| foo | / / |___\____| / / \ / / \ / / __\___/_ ________ ______/_ |F | |F | |F | | | | | | | | (1)------------->| | | | | | | | |________| | full | | diffy | |contents| |contents| |________| |________| There, wasn't that pretty? I knew you would. *** Crash-proof repository mutation and locking We want to make repository changes in a such way that the repository is in a "sane" (unambiguous and readable) state at every step. At the same time, we want to keep locking to a minimum: no operation locks out readers, no read-only operation locks out anyone, and write operations lock out other writers for as little time as possible. These two issues are not directly related, but they must be synchronized with each other. This is complex, so instead of specifying the sequence of events formally right away, let's run through an example without locking first. Here's how to commit a new version of foo: Step 1: Create the new node for foo:2. ________ |F | | | | | |________| Nothing is attached to this new node, therefore the repository is in a sane state. If the server crashes at this moment, there will be no problem (except that a cleanup thread might, at its leisure, delete the unreachable node). The new node has the full contents of foo:2. Step 2: Create the back-link from foo:2 to foo:1. 
      ________                 ________
     |F       |               |F       |
     |  (1)------------------>|        |
     |        |               |        |
     |________|               |________|

Again, the repository doesn't yet know that the new node even exists.  The back-link is a pointer from foo:2 to foo:1, not the other way around, so foo:1 isn't even aware of the link.  Both nodes have the full contents of their respective versions; no diffs have been made yet.

Step 3: Change the link in the parent node to point to the new foo:2 node (which in turn points to foo:1).

           \
        ___\____
       |D       |
       |        |
       | foo    |
       |___\____|
            \
             \
          ___\____               ________
         |F       |             |F       |
         |  (1)---------------->|        |
         |        |             |        |
         |  full  |             |  full  |
         |contents|             |contents|
         |________|             |________|

The repository is still in a mostly sane state.  If the machine crashes now, there's this slightly weird situation in which a node for foo:2 is present in the tree, but the version `myproj:2' doesn't exist yet.  However, if you follow the three-step algorithm for retrieving FILE:N, you'll see that you still reach foo:1 just fine -- it's as though foo:2 isn't there.  Another way to look at it is that version 2 of the project does not exist yet, therefore you can't ask for it; if you ask for the head version, you'll get version 1.

Step 4: Hook up version 2 to the root node.

    ____________________________
   |___1_______2________3_______...etc...
       |       |
       |      /
    ___|_____/
   |D        |
   |         |
   |  A      |
   |   \     |
   | B  \    |
   |__/___\__|
     /     \
    |       \

You'll notice that nowhere in there did we give foo:1 diffy contents.  A separate thread gambols about the repository, taking care of such mundane tasks.  There's no reason to slow up commits with it.  (Of course, there's a "diff-bit" on each node, saying whether its contents are stored full-text or as a diff against the node that back-links to this one.)

Now let's generalize that procedure, and add locking...  The important thing to realize is that someone else may initiate a commit before or during our commit, and their commit might finish before or after ours.
We don't want to make their commit wait unnecessarily, so we're not actually trying for a particular version number -- we just want the next available version number at the time our commit finishes.

Below, a "lock" is a cookie consisting of user:num, where num is the same for all locks in this commit, and different from any other current locks (even ones with the same user).  Locks persist through server crashes (todo: can we implement that efficiently?).

1. Lock the nodes you're committing new versions of (if you're adding a new file, there's no node to lock).  These primary locks will last throughout the commit.  Block on all locks while you block on any, so your commit will be atomic.

2. Create the new, unattached nodes that will hold the new versions.  Fill them up with the new contents.  Create the back-links.

3. Lock the parents of the nodes you're committing (call these "secondary locks").  Again, block on all while you block on any.

4. Set dir_entries in the locked parents to point at the newly created child nodes.  Although this cannot be atomic if there are multiple parents, it is safe, because the server knows that a node doesn't really exist if its parent is locked.  (Amend the node-finding algorithm given earlier to always follow back-links from a node whose parent is locked, until you find a node with the same lock as that parent.)

5. Now create the new version number: just take the next available number and hook it up to the root node, but put a special "negative lock" on the version number first.  This "negative lock" has the same id number as all the other locks in this commit, but it is interpreted differently.  Its point is to invalidate the other locks while they're being removed -- the idea is that if you would block on a lock, you first check whether there's a corresponding negative lock on a version number somewhere.
If there is, then you don't have to block on the original lock, because all the important parts of its commit are complete, except for lock removal.

Now, we could actually stop right there.  The commit is done; all the information is in the right place in the repository.  Removing the locks, and then removing the negative lock, *could* be done by a separate thread, just like diffication.  But it probably makes sense to take care of lock removal right now, although we can still have a separate thread that looks for inoperative locks (to clean up after server crashes).  So:

6. Remove all the other locks.

7. Remove the negative lock from the version number.

Voila.

There's probably a variation whereby one *does* reserve the version number at the outset of the commit.  If there's any reason why getting the next available version number as of commit start is preferable to as of commit end, then we can do that.  But I don't see why it would be preferable -- and I like the mellow approach, where you just grab whatever version happens to be in line when you're ready for it.  Thoughts?

*** How renames work, and what they imply

Exactly how you think they do.  To rename an entity, you make a new node for its parent directory.  In this new node, the entity's dir_entry has the new name, but still points to the same node as the old dir_entry.  That's why a name is stored in the parent, not in the thing being named.

This implies that each entity (where "entity" means roughly "a most-recent node along with everything it back-links to") has some unique identifier aside from its name.  These identifiers are stored alongside the name in the parent's dir_entry, and are unique within the project.  Having both the human-visible name and an internal identifier allows Inversion to choose between following a history by name or by identity, which can result in two different retrieval scenarios (although such circumstances are unusual, we must support them).
These unique identifiers, once assigned, are _never_ changed.  When an entity is renamed, only the name changes, not the id number.

*** How removal works.

Exactly how you think it does.  You commit a new version of the parent node, which simply doesn't have an entry for the removed child.

*** How resurrection works.

Suppose you removed `foo' several versions ago, and now you want it back.  When the user in the working copy types

   $ inv resurrect foo

what happens?  Well, in the repository, the server starts at the base version's node for the current directory (i.e., foo's parent directory).  Note that's the "base" version, not the "latest" version -- the base is the one on which the current working copy is based.  Starting there, it follows the back-links, asking at each step whether there's a dir_entry for `foo'.  If there is, it follows that link -- that gets you the most recent `foo'.

Of course, there could have been several different entities in the history named `foo' in this directory.  Which is the one the user wants?  By default, the server assumes the user meant the most recent foo.  But the user could ask for one at a specific version:

   $ inv resurrect foo:5

In that case, the server follows back-links until it gets to the directory belonging to version 5, and then looks for a dir_entry pointing to foo.

Or, the user might first need to know how many different entities have been named `foo' here, before choosing one to resurrect:

   $ inv resurrect -l foo

   /* created      removed      versions   entity_name   entity_id */
   2000-08-01     2000-10-14    1-5        foo           1729
   2000-12-15     2001-05-29    7-12       foo           1952
   2001-11-29     2002-03-20    20-103     foo           1729

From this we see that an entity named `foo' was created in this directory and later removed; then another, entirely different entity named foo was added, and later removed; then the first foo was resurrected and lasted for 83 versions before being removed (had it never been removed, the third "removed" date would be an empty field).
The procedure by which the server generates this list should be obvious.

Now the user decides to resurrect the middle foo.  Usually she would want it at its latest revision, in this case 12:

   $ inv resurrect foo:12

Please note: the examples here were meant only to illustrate various resurrection scenarios, and should not be taken as client user-interface specifications.

*** How clones (tags and branches) work

Branches and tags are both implemented in terms of "clones".  Cloning a project is constant-time and constant-space -- you just make a new name that points back to the original project.  All clones are automatically tags; once you start committing on a clone, it becomes a branch as well.

Branches *never* affect the original project -- the original data structure remains untouched; it does not even know there's a branch attached to it.  As will become clear below, this forces us to do a little more node duplication than some other schemes would, but it wins overall, because it allows users to tag and branch projects to which they don't have write access.  (Try *that* with CVS!)

Let's make a clone of myproj, called `xproj', based on myproj:5 (as it happens, 5 is the highest version in myproj, but that's not a requirement -- a branch might sprout off any version in myproj's history):

           ______________________________________________________
   myproj |___1_______2________3________4________5_________6_____...etc...
                                               /
                                              /   .----(back to myproj)----->
                                             +
                                            /
           ____/_________________________________________________
   xproj  |___5_______6________7________8________9_________10____...etc...

The xproj branch starts at version 5; if you ask for version 5 or anything earlier, you will get it, but it will be the same as that version of myproj.  (The reasons for doing things this way will become clear later.)  Right after xproj is created, it has no nodes of its own -- every request is referred to myproj's repository.
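The constant-space "alias" behavior can be sketched as follows.  This is a toy model with invented names (Project, clone, root), not a specification: a fresh clone stores no nodes of its own, and any request for a version at or before the clone's root version is simply referred to the ancestor project.

```python
# Toy sketch of constant-time, constant-space cloning; all names invented.

class Project:
    def __init__(self, name):
        self.name = name
        self.versions = {}            # version number -> root node
        self.ancestor = None          # (ancestor project, root version) or None

    def clone(self, new_name, at_version):
        """O(1) time and space: the clone just remembers its ancestry."""
        child = Project(new_name)
        child.ancestor = (self, at_version)
        return child

    def root(self, version):
        """Find the root node for `version`, referring old requests back."""
        if version in self.versions:
            return self.versions[version]
        if self.ancestor is not None and version <= self.ancestor[1]:
            return self.ancestor[0].root(version)
        raise KeyError(version)
```

An uncommitted clone behaves exactly like a tag (every lookup is a referral); the moment local versions appear in its own table, it behaves like a branch.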
When you commit a change to `foo', here's what happens: in order to
commit a new version of foo, we'd have to change foo's parent node to
refer to the new foo node.  But we can't -- we don't have write access
to myproj!  So we have to make a local copy of foo's parent.  But that
leaves the parent's parent, over in myproj, still pointing at myproj's
version of the parent...  You can see where this is going, I'm sure.
The problem bubbles right up to the top, requiring us to create a root
node locally.  All its dir_entries that are not ancestors of foo just
refer back to nodes in myproj; those that are ancestors refer to new
local copies of nodes along the line of ancestry, all the way down to
foo:

            ______________________________________________________
   myproj  |___1_______2________3________4________5_________6_____ ...etc...
                                                 /
                                                /   .----(back to myproj)----->
                                               +
                                              /
            _________________________________/____________________
   xproj   |___5_______6________7________8________9_________10____ ...etc...
               |
               | xproj's new local nodes:
               |
               root (D)
                |-- README ---------> (points back to a node in myproj)
                |-- B --------------> (points back to a node in myproj)
                `-- A (D)
                     `-- subA (D)
                          `-- fish (D)
                               `-- foo (F)
                                    (5) ---> (points back to a node
                                              in myproj)

This "line-of-ancestry copying" looks a lot more expensive than it
really is.  Remember that this example project consists mostly of
directories, but a real project is mostly files, whose nodes never
need to be copied except when they change.  And directory nodes only
need to be copied the first time one of their descendants changes --
after that, the directory node is already there.  Thus, a branch fills
out over time, eventually holding a more-or-less complete skeleton of
the project's directories, plus any changed or added files.
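The bubble-up just described is essentially path copying on a tree.
The following toy sketch (all names are invented for illustration; the
real node structure carries entity ids, plists, locks, etc.) copies
exactly the nodes along the line of ancestry into the branch, while
every dir_entry off that path keeps pointing at the original project's
nodes:

```c
#include <stdlib.h>
#include <string.h>

enum { MAX_CHILDREN = 4 };

/* A drastically simplified node: just a tag saying which project owns
   it, plus (for directories) a few child slots. */
typedef struct toy_node {
    const char *project;                   /* "myproj" or "xproj" */
    struct toy_node *children[MAX_CHILDREN];
} toy_node;

/* Commit a change in branch `branch_name' to the node reached by
   following `path' (an array of child indexes, `depth' long) from
   `root'.  Every node along that line of ancestry is copied into the
   branch; entries off the path keep pointing at the original
   project's nodes.  Returns the branch's new root.  (Error checking
   and freeing are omitted; a real branch would also reuse nodes it
   had already copied on earlier commits, which this toy does not.) */
static toy_node *
commit_on_branch(const toy_node *root, const size_t *path, size_t depth,
                 const char *branch_name)
{
    toy_node *copy = malloc(sizeof *copy);
    /* Start as an exact copy: off-path children still point back. */
    memcpy(copy->children, root->children, sizeof copy->children);
    copy->project = branch_name;
    if (depth > 0) {
        size_t i = path[0];
        /* Only the on-path child is replaced by a fresh local copy. */
        copy->children[i] = commit_on_branch(root->children[i],
                                             path + 1, depth - 1,
                                             branch_name);
    }
    return copy;
}
```

Committing to /A/subA/fish/foo thus allocates one new node per level
of the path (root, A's subtree nodes, and foo itself), which matches
the "line-of-ancestry copying" cost argument above: proportional to
path depth, not project size.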
The exact method by which local entries "point back" to the original
project is not settled yet.  It might be some general URL-like syntax
that we can use in all sorts of circumstances, like

   user@inv.hostname.domain:repositoryname:myproj:5:/A/subA/fish/foo

(This will come in handy later, when we need a way to get unique
labels for the diffy chromosome chunks in partial merging... but more
about that when we get to it. :-) )

**** "Shallow" and "deep" clones (insurance against loss of trunk)

The cheapest kind of clone is constant-time and -space.  Call it a
"surface" clone, because it includes by reference wherever possible.
For tags and branches stored in the same repository as the original
project, surface is the way to go.

But if you're making the clone on another server, then you have to
make a political judgment involving the importance of your clone, of
the historical data antecedent to your clone, and of the original
repository's stability.  What if you think that source repository
might disappear on you?  Then you want something more than just a
surface clone.  Inversion offers two choices: "shallow" and "deep"
clones.  Shallow clones locally duplicate all the data for the version
in which this clone is rooted, but they don't duplicate anything older
than that.  Deep clones duplicate the *entire* project -- the whole
node structure.  (A deep clone still remembers its ancestry, however.)

Here are the three possible kinds of clones, for reference:

   1. Surface clones
      Refers to the original project wherever possible.

   2. Shallow clones
      Duplicates the node structure of the source version that roots
      this clone.  Older versions are included by reference, however.
      Ancestry is remembered.

   3. Deep clones
      Duplicates the entire node structure of the source root version,
      including back-links.  Ancestry is remembered.

fooo: [This is why clones start at the version number they're rooted at...
(todo: explain this)]

fooo: Explain why remembering ancestry is important even for shallow
and deep clones.

**** Merging branches

whoax

*** How meta-data is stored

Each node has a property list (key->value, key->value).  Among the
things we can store in the property list are file permissions, data
type override (text, binary, whatever), whether or not to do
platform-specific line-end conversions, etc.

It might be good to have two plists, actually: `inv_props' and
`user_props'.  Inversion only uses the ones in inv_props, and promises
a "store-and-ignore" policy for user_props.  And users should never
store anything in inv_props; only Inversion handles that.

*** A closer look at the `node' data structure (and friends)

   typedef struct dir_entry
   {
     char *name;        /* Child's name, multi-lingual I hope */
     node *child;       /* The target node.  Given here as a pointer,
                           but it might be a node_id in reality. */
   } dir_entry;

   typedef struct node
   {
     unsigned long int entity_id;  /* Unvarying unique id for this entity */
     unsigned long int node_id;    /* Unique id for this node.  (See
                                      struct dir_entry above about
                                      pointer vs node_id.) */
     int type;                     /* File, directory, symlink, etc */
     BOOL diffy;                   /* Holds full text or diff? */
     plist *inv_props;             /* Properties managed by Inversion. */
     plist *usr_props;             /* Properties ignored by Inversion. */

     /* I think we might need parent pointers to do multiple hard
        links correctly; but there might be another way... */
     struct node **parents;

     /* See notes about locking above.  This may be stored as
        "out-of-band" data outside the node. */
     lock *lock;

     /* I'm conjecturing that dir_entries will be stored as chains;
        but it may turn out to be more sensible to store them as text,
        just like regular contents. */
     union contents
     {
       struct dir_entry **dir_entries;
       char *data;
     } contents;
   } node;

None of this is set in stone, of course; the boolean `diffy' might
instead be a string `diff_method', for which null means full contents
at this node.
The contents might be a string even when this node represents a
directory -- a special line-by-line texty format would list its dir
entries, and that could make it easier for us to store directory
changes in a diffy way.

** Client

There's not much to say about the client; it just has to speak the
protocol and do the right thing.  However, the layout of working
copies needs to be specified.

*** Working copies

The tentative plan is to use an IVN/ subdirectory, just like the CVS/
subdirectory.  However, the IVN/ subdirectory will be at the top level
of the checked-out portion of the project.  The reason for this is
that then an IVN/ administrative subdir in a subdirectory of the
project signifies that this subtree comes from some other project
(perhaps from a different repository).  This makes project
"transclusion" easier.  If your project depends on a library that
someone else is maintaining, you can include a working copy of their
project inside yours, without taking over versioning in any way.

The contents of the IVN/ subdir are still under consideration, but it
should be possible to sketch generally what's in there:

   - A file indicating the base version of this working copy
   - Something like a CVS/Entries file
   - A base tree of the project, for client->server diff transmission
     (OR, rsync-like records in the Entries-like file :-) )

Of course, the format of the so-called Entries file will be very
different from CVS's.  This working copy administrative area has not
been given much thought, because it will largely come out in the wash
after the repository and protocol specifications are done.

** Other miscellaneous design goals

*** Natively client/server.

Even when accessing a local repository, the ivn client connects to an
ivn server and talks the ivn protocol.  This is for maintainability --
in CVS, it's easy to write a bug by implementing something correctly
for either the client/server case or the standalone case but not both
(and more often neither, but that's another story).
The protocol may be DAV, if DAV turns out to be suitable.  Still under
investigation.

*** The client/server protocol sends diffs in both directions

Inversion transmits diffs both ways (unlike CVS, which transmits diffs
from server to client, but full files in the other direction, thus
using more bandwidth than necessary).  Ivn accomplishes this either by
keeping pristine copies locally or by using the rsync algorithm; which
way we do it is an implementation detail that hasn't been decided yet.

*** Minimize network turnarounds

The CVS client/server protocol goes to great effort to make sure that
the client and server need only a minimum of exchanges per operation.
I believe it is "theoretically" possible to never need more than 2
turnarounds (i.e., client asks server preliminary questions, server
responds, client then does the real request, server responds), and
most things can be done with just 1.

*** Minimize database turnarounds (repository storage issues)

[Note: we're no longer sure there will even be a database, but the
concerns below apply to whatever static block of sludge the repository
data is stored in, be it Oracle, Berkeley DBM, or flat files on disk.]

We need to make sure that traversing tree structures does not require
N separate database accesses, where N is proportional to the depth or
breadth of the traversal.  For example, if retrieving an old version
requires chaining down from the present version, and each link on the
chain is another database access, that's bad.  Database accesses
should be thought of like network turnarounds -- one hopes to arrange
things so that any operation requires only a (small) constant number
of them, not N.

*** Times should be proportional to change size, not project size

The time to complete any operation should be proportional to the size
of the change that operation relates to, not the total amount of data
in the project or in the repository.
Committing three files should take roughly the same amount of time in
a project with ten files as in a project with forty thousand.  Same
with diffing three files, etc.

An exception is made for recency: retrieving an old version may take
longer than retrieving a recent one, because old versions take longer
to construct.  This is a tradeoff of using diff-based storage, and it
has worked out pretty well in practice.

** Scenarios

Each scenario should be mentally played out for a given design,
analyzing for performance and scalability issues (particularly in
network traffic, storage size, and storage accesses).  If you spot a
weakness in a design, add a scenario here that highlights it: even if
that particular design gets tossed, the scenario will probably reveal
similar weaknesses in other designs.  (On the other hand, if a
scenario never reveals weaknesses in anything, then remove it -- it's
not helpful if it only demonstrates the ease of implementing that
operation well.)  Currently this list is a brain dump; there's no
reason it can't be shortened.

TODO: incorporate the WebDAV/DeltaV scenarios list here too.

   + Commit from an up-to-date wcopy
   + Commit from a non-up-to-date wcopy, with the modified files not
     up-to-date
   + Commit from a non-up-to-date wcopy, with the modified files
     up-to-date
   + Updating to most recent version(s), with no conflicts
   + Updating to most recent version(s), with conflicts
   + Party A commits while Party B's commit is in progress, but no
     conflicts

     In this context, "conflicts" means the same entities were
     changed, even if the changes are non-overlapping.

   + Party A commits while Party B's commit is in progress, with
     conflicts

     Again, "conflicts" means the same entities were changed, even if
     the changes are non-overlapping.  Conflicts can mean changing the
     same file, or adding the same file, or one person modifying it
     while the other wants to delete it, or whatever.
     (I don't think it will matter in the end, as the implementation
     is mainly concerned with the idea that someone wants to do
     something with a non-up-to-date file.)

   + Add a new entity (directory or file)
   + Add a new entity (directory or file) from a non-up-to-date wcopy

     Both with the current directory up-to-date, and not.  (First
     determine what the desired behavior is!)

   + Rename an entity
   + Retrieve an old version of an entity that has been renamed
   + Remove an entity
   + Resurrect an entity
   + Move an entity to a new directory (may be different from rename)
   + Diff two particular versions of a file
   + Diff a working (uncommitted) version against an old version of a
     file
   + Retrieve version X of a file
   + Retrieve version X of a directory tree
   + Show annotations of both modified and old versions of files

     (Golly, we need some word other than "annotations".  To me it
     implies that one has added log messages to individual lines, or
     something like that, not that CVS is showing you who did what and
     when.)

   + Show annotations of a directory, similarly

     Who created the file, and who last committed to it.  Should this
     be a wish-list item, though?  CVS doesn't do it, but then again,
     CVS doesn't do directories either...

   + Make a branch
   + Make a tag (same as making a branch, in all proposed designs)
   + Merge a branch into trunk

     Not talking about the user interface to branch merging here, but
     rather the internal implementation: how the system discovers what
     is the same vs what needs merging, what conflicts, etc.

   + Merge a branch into trunk again
   + Merge a branch into another branch

     Is this fundamentally different from merging into trunk?  Is the
     trunk "special" in any way, implementation-wise, or is it just
     the first branch?

   + Change a previously committed log message
   + more to come...

* Meta-implementation

This section describes the process of implementing Inversion -- what
order we do things in, and how.
Once the design is ready, the parallelizable parts of the software
should be determined, and tasks divvied up.  At this point or very
soon after, the client/server protocol needs to be designed too.  We
will evaluate WebDAV (using the same examination-under-common-ops
technique as with the data design); if WebDAV is suitable, it would be
a good choice, because it already has some political acceptability.

During the initial coding phase, everything that can be farmed out to
external programs will be.  For example, the diff routines will just
invoke xdelta and diff (or whatever), and the client side will invoke
an external patch.  In the long run, these things will very likely be
internalized for speed, but their semantics won't change when we do
that, and it's more important to get the stuff up and running as soon
as we can, and optimize afterwards.

There should be an automated test suite.  It should not be written in
Bourne shell script (we can at least learn this lesson from CVS!).

Authentication questions will be put off until most of the software is
working, since they probably don't affect the rest of the design much.

* Jason's comments

Please note that these comments are based on an older incarnation of
the design.  I haven't gone over them yet to see which ones still
apply. -kff

--------------------8-<-------cut-here---------8-<-----------------------

>From: Jason Elliot Robbins
>Subject: Inversion feature outline (was: for meeting w/ Rational)
>To: kfogel@red-bean.com
>cc: jrobbins@collab.net, npm@collab.net, brian@collab.net, jon@collab.net
>Date: Thu, 02 Mar 2000 19:02:47 -0800
>
>Hi Karl,
>
>Here are some comments I came up with while reading your Inversion
>document on my last plane trip.  Overall this document looks like a
>good start, but there is a lot to work out.
>
>When would you like to put this document on the inversion.tigris.org
>web site?
>
>Here are my comments:
>
>I know this is long; the best comments are 1, 7, 12, 14, and 17.
>
>1.
>We should plan for web integration.  That could mean a lot of
>things.  One idea off the top of my head is to actually use a URL for
>the CVSROOT: I would make it an actual http:// scheme, rather than
>some new ivn: scheme, so that there is a clear place to host a web
>page with instructions about that repository.  Alternatively, we could
>use a new scheme like ivn:argouml.tigris.org, so long as the CVSROOT
>is very memorable to people who know the web address.  Right now the
>mapping between projects (e.g., www.webmacro.org, or java.tigris.org)
>and their CVSROOT varies a lot from site to site.  It would be nice to
>make this a no-brainer.

Well, I'm not so sure...  The reason this mapping (www<-->repository)
varies so much is political, not technical.  For example, it so
happens that the hostname argouml.tigris.org hosts one repository with
one project in it, so there would be no ambiguity in referencing that
project by the hostname.  But there are many servers that host
multiple projects, and not all of those projects have their own
domains.  In many (most?) cases, the repository server is on a
different host from the Web server.  Or: sometimes people want to
deliberately associate their project with an organization, so they
don't want a strong link between the project's name and its Web
address.

Also, as far as I can tell, people don't really have trouble figuring
out a project's web site given a working copy, or vice versa.  They
may have trouble *using* CVS to get the working copy, but that's a
different matter...

I think it's actually not our place to make assumptions about how
people want to present their project to the world.  It *is* our place
to make sure that our system provides no obstacles to whatever method
a user chooses to make her project web-visible, but "no obstacles"
doesn't mean special support for anything in particular.
If someone wants to keep their project's web pages in the project's
source tree, and point their web server at the right place, there's
nothing stopping them from doing that.  If they arrange it so the
names are the same, that's great -- and if they don't, there might be
a reason for that.  Hmmm, I can go into greater detail on this if you
want; my hands are tired of typing right now, so I'll leave this
question here.

>2. I would like to make the difficult things in CVS easier in Ivn.  I
>think you mention that you would like to make things like moving
>files and directories, etc. easier.  Actually, the second paragraph
>in Meta-implementation is definitely the right idea.  Looking at all
>the multi-step procedures outlined in your CVS book and making them
>one step in Ivn would be great.

Oh gosh you bet! :-)

>3. The concept of assigning a new version number to every file every
>time there is a commit is rather foreign to me.  Is this used in
>other vc systems?  Are other people familiar with this practice?
>Does that mean that it would be very awkward to verbally tell someone
>to check out a certain version (e.g., version 2378812, instead of
>4.7)?  Would it be possible to tell at a glance how many versions a
>file has gone through?

I don't know; I don't know; no, I don't think so; and yes. :-)

Perhaps we could make a distinction between a file's commit count and
its project's commit count -- i.e., two kinds of version numbers,
treated equally; which one you use depends on circumstances.  Let's
mull on that.  I do worry about the potential for confusion, though...

How big do project version numbers get?  Well, to give you an idea,
the entire CVS tree (dating from 3 Dec 1994) has 4254 commits.  4254
isn't such a bad version number -- I wouldn't mind having to deal with
a few more digits than that, personally.

I don't think of it as assigning a new version number to every file on
every commit -- that's not what's happening inside anyway.
Instead, I think of it as a single version number for every *commit*,
and a commit involves certain files.  As for files not involved in
that commit, if you ask for one of them at that version number, you'll
just get the most recent version not exceeding the one in question.
[My table description may seem not to match what I'm saying here, but
that table description is just a draft and will be updated.]

>4. Why make keyword expansion off by default?  If I want to do it, I
>will have to search through the documentation to find the right
>option.  I realize that it might mess up binary files; is there
>another way to make that safe?

Well, right now, if you don't want to do it, you have to search
through the documentation to turn it off. :-)

Seriously, I'm by no means sure that off is better; I just lean that
way because I've been nailed several times in the past by not turning
it off when I should have (not talking about binary files -- I mean
text files that happened to contain text that matched keywords, but I
didn't want that text to be expanded).

[todo: more to say here, fingers falling off, will respond more
later.]

>7. How about the idea of "local branches" that are stored on a
>developer's working machine in addition to the local working copy of
>the checked-out files?  This is basically a fraction of the total
>repository stored on each client.  The local repository would be
>filled in on demand, e.g., if you do a diff against an old version,
>that version would be stored in your local repository.  Developers
>could also commit to the local branch several times before eventually
>merging into the central server's main branch.  If local branches are
>accessible via the internet, then we could basically do away with
>patches.  If a developer does work that he wants to contribute back
>to an open source project, he could just email the URL of his local
>server.
>Also, the concept of local branches could be generalized to build a
>hierarchy of servers, with the global project home server at the top,
>various mirrors in the middle, LAN servers near the bottom, and each
>developer's local server at the very bottom.  This has the advantage
>that every developer can have commit access to some repository; we
>get a range of trust rather than a boolean permission to commit to
>the central server or not.

Sure!  I think what we have to do is design this in such a way that it
doesn't take any special support from us to create this concept.  That
is, if one knows the client/server protocol, one can implement local
branches without us changing anything.  So, we should think about this
as we go, but I don't think we should actually make it part of the
first release.  We should just try to have a design that doesn't make
this feature impossible.

[There are some "theoretical" problems that come up in schemes like
this, having to do with conflict resolution of course, which we have
to think through.  So we may have to back down from this idea to an
arbitrary degree...  I hope not too much, though, 'cause it's neat.]

>8. I would like to have some kind of plug-in permission system.  I am
>wary of yet another set of user account records and permissions.  We
>have found that half the tools we want to integrate into tigris have
>their own user account systems, and that makes a lot of duplicate
>data that needs to be kept in sync.

Yeah -- some way of saying "default to the system users and access
methods on this host", and then you can add Inversion-only users as
necessary.  (?)

>9. Some of the data in your tables is not as normalized as it could
>be.  I know this may not be a priority, but here are some things I
>noticed: the names of entities and their states could be in separate
>tables.
>Line mods could be a table of line ranges rather than text; I guess
>it depends on what format you were thinking of for that text field,
>but it seems like you might do a lot of parsing.  The names of
>properties in key-value pairs could be stored in a property-names
>table.

Yup, thanks.  Still a draft; we'll normalize.

>10. I am a little concerned about storing the parent container
>reference and a basename.  I would like to have permissions limited
>to a given scope that matches the complete path.  For example, I
>would like to grant a particular developer the ability to commit
>.*/dist/.*\.html but not any .java files or any file not in a
>subdirectory of some dir named api.  You have some of this in your
>ACCESSES table.  For the sourcexchange.com web site we developed a
>much more general permission scheme that you can read about at
>http://joist.tigris.org.

I'll look into this.

>11. I would like to see some audit mechanism.  I think in this
>version of the document you left off the basic ability to see who did
>each commit.  More generally, we should consider the repository
>administrator role in all its glory.  I don't want to require a lot
>of administration work, but whatever needs to be done should be put
>under the same "evaluation across operations" as we do for developer
>operations.

Oops... the problem referred to in your second sentence is due to an
error in transferring my tables from paper to disk.  Fixed now. :-)
In general, yes, I think I see what you mean about the admin role.

>12. It would be nice if the server could be installed without root
>access.  This would allow a lot more people to set up servers for
>their local work groups, etc.  As a student, I never had root access
>to a big machine, and a lot of people working on OS are students.
>Some small local server is also important to the local branch idea.
>Apache can be installed without having root access, and so can ezmlm
>mailing lists; I really like that power for everyone.
>The security implications are good, and it would surely simplify
>installation.

No reason why this can't happen.  Nothing in the design as it stands
right now requires root, AFAIK.

>13. We should consider the following levels of internet connectivity:
>high speed, low speed, disconnected (e.g., laptops and modems), and
>flaky connections.  Low-speed constant connections (e.g., in distant
>countries) might benefit from background filling-in of their local
>repository history.  Flaky connections might benefit from a resume
>capability (I guess that comes for free in CVS, but we need to make
>sure there are never any lock files left behind on either end).

Well, we will make sure that locking can handle connections broken at
any point, by doing things carefully ("carefully" in the same sense as
a filesystem that guarantees mv to be atomic even if the power goes
out).

>14. Would you consider implementing Inversion in Java?  The advantage
>is that it runs on all platforms, JDBC connects to all databases, and
>RMI would make the client-server implementation easier.  Web
>integration with Java servlets would certainly be easier.  Speed is
>somewhat of an issue, but Java is pretty fast now, and a lot of the
>speed will be determined by the database rather than our code.  I am
>not concerned about CPU-bound speed problems because processors are
>getting insanely fast and we can have fast machines in our racks.
>Speed could be a problem for small (student) projects that don't have
>good machines, but then again "small project" implies that they
>probably don't have that many concurrent accesses anyway.  I think
>selling Inversion as "version control that you actually want to use"
>trumps speed claims anyway; we should make it easy to use and easy to
>enhance.  Java would also give us no-brainer internationalization
>support.
In my experience (not nearly as great as yours), Java is the least
portable language: it is unusual in requiring *both* a compiler and a
run-time, and naturally I've had problems with both.  Maybe things are
different now...

I do think (as a political thing) that when people hear Java they
shudder in anticipation of the problems they're going to have getting
the thing running.  Even though they may be wrong, this reaction is
something to take into account.

I mean, I guess I can't say "No, I wouldn't even consider implementing
this in Java", but I certainly do lean against it...
Internationalization support would be nice, but is it that hard in
other languages, even C?

>15. Compression.  It would be great if the protocol batched up all
>the files that need to be transferred for a given transaction and
>then compressed the whole set of files using a standard zip algorithm
>or something.  Typically, compressing many similar files together can
>give better results than compressing each independently.  This is
>assuming dictionary-based compression, I think.

Good idea!

>16. What is the import/export format for this?  Would I just do a
>mysqldump?  Should there be an XML file format for versions and log
>entries and such?

Dunno. :-)

>17. It would be nice to have scripts executed for any possible
>action.  CVS has scripts that run on many actions, e.g., commit and
>tagging.  Actually, if we use Java, it would be possible to define a
>sandbox to run these scripts safely.  That would allow developers to
>upload "inversion agents" -- .class files invoked whenever an action
>is taken on the server -- that do some kind of testing or
>notification on behalf of that particular developer.  This could be a
>good hook for integrating Inversion with a server-farm build system,
>implementing the hierarchy of servers described in suggestion #7.

Yeah; one of the things I want to do is go over the commit process
step-by-step and figure out every possible place where we can give
users a hook to run a script.
CVS offers a few good places; we can offer more, I think.

>18. You have specified that the contents of the source files should
>be in the database.  An alternative would be to just have the name of
>some file in the file system.  Could you explore/explain that
>tradeoff?

Yes -- we'll do whichever is better. :-)

Seriously, I don't know how various databases perform with big text
chunks.  Unless it's a serious hit, though, I think we should try to
stay in the database.  I mean, why use it unless we're going to use
it?  The only reason I can see to implement an extra level of
indirection here would be if it gave us some big performance boost.

* Wish List

Put blue-sky requests here...