This file contains (very old) design notes for the cvs2svn repository
converter.  We've not yet deleted them, because they may contain
suggestions for future improvements in design.

=====================================================


An email from John Gardiner Myers <jgmyers@speakeasy.net> about some
considerations for the tool.

------
From: John Gardiner Myers <jgmyers@speakeasy.net>                     
Subject: Thoughts on CVS to SVN conversion
To: gstein@lyra.org                                  
Date: Sun, 15 Apr 2001 17:47:10 -0700

Some things you may want to consider for a CVS to SVN conversion utility:

If converting a CVS repository to SVN takes days, it would be good for     
the conversion utility to keep its progress state on disk.  If the
conversion fails halfway through due to a network outage or power
failure, that would allow the conversion to be resumed where it left off
instead of having to start over from an empty SVN repository.

It is a short step from there to allowing periodic updates of a
read-only SVN repository from a read/write CVS repository.  This allows
the more relaxed conversion procedure:

1) Create SVN repository writable only by the conversion tool.
2) Update SVN repository from CVS repository.
3) Announce the time of CVS to SVN cutover.
4) Repeat step (2) as needed.
5) Disable commits to CVS repository, making it read-only.
6) Repeat step (2).
7) Enable commits to SVN repository.
8) Wait for developers to move their workspaces to SVN.
9) Decomission the CVS repository.

You may forward this message or parts of it as you seem fit.
------

------------------------------------------------------------------------

Further design thoughts from Greg Stein <gstein@lyra.org>

* timestamp the beginning of the process. ignore any commits that
  occur after that timestamp; otherwise, you could miss portions of a
  commit (e.g. scan A; commit occurs to A and B; scan B; create SVN
  revision for items in B; we missed A)

* the above timestamp can also be used for John's "grab any updates
  that were missed in the previous pass."

* for each file processed, watch out for simultaneous commits. this
  may cause a problem during the reading/scanning/parsing of the file,
  or the parse succeeds but the results are garbaged. this could be
  fixed with a CVS lock, but I'd prefer read-only access.

  algorithm: get the mtime before opening the file. if an error occurs
  during reading, and the mtime has changed, then restart the file. if
  the read is successful, but the mtime changed, then restart the
  file.

* dump file metadata to a separate log file(s). in particular, we want 
  the following items for each commit:
  - MD5 hash of the commit message
  - author
  - timestamp

  The above three items are used to coalesce the commit. Remember to
  use a fudge factor for the timestamp. (the fudge cannot be fixed
  because a commit could occur over an arbitrary length of time, based
  on size of commit and the network connection used for the commit;
  figure out an algorithm here)

  All other metadata needs to be preserved, but that can probably
  happen when we re-read the file to generate the SVN revisions.

  We would sort the log file generated above (GNU sort can handle
  arbitrarily large files). Then scan the file progressively,
  generating the commit groups.

* use a separate log to track unique branches and non-branched forks
  of revision history (Q: is it possible to create, say, 1.4.1.3
  without a "real" branch?). this log can then be used to create a
  /branches/ directory in the SVN repository.

  Note: we want to determine some way to coalesce branches across
  files. It can't be based on name, though, since the same branch name
  could be used in multiple places, yet they are semantically
  different branches. Given files R, S, and T with branch B, we can
  tie those files' branch B into a "semantic group" whenever we see
  commit groups on a branch touching multiple files. Files that are
  have a (named) branch but no commits on it are simply ignored. For
  each "semantic group" of a branch, we'd create a branch based on
  their common ancestor, then make the changes on the children as
  necessary. For single-file commits to a branch, we could use
  heuristics (pathname analysis) to add these to a group (and log what
  we did), or we could put them in a "reject" kind of file for a human
  to tell us what to do (the human would edit a config file of some
  kind to instruct the converter).

* if we have access to the CVSROOT/history, then we could process tags
  properly. otherwise, we can only use heuristics or configuration
  info to group up tags (branches can use commits; there are no
  commits associated with tags)

* ideally, we store every bit of data from the ,v files to enable a
  complete restoration of the CVS repository. this could be done by
  storing properties with CVS revision numbers and stuff (i.e. all
  metadata not already embodied by SVN would go into properties)

* how do we track the "states"? I presume "dead" is simply deleting
  the entry from SVN. what are the other legal states, and do we need
  to do anything with them?

* where do we put the "description"? how about locks, access list,
  keyword flags, etc.

* note that using something like the SourceForge repository will be an
  ideal test case. people *move* their repositories there, which means
  that all kinds of stuff can be found in those repositories, from
  wherever people used to run them, and under whatever development
  policies may have been used.

  For example: I found one of the projects with a "permissions 644;"
  line in the "gnuplot" repository. Most RCS releases issue warnings
  about that (although they properly handle/skip the lines).