Anyone who has worked on the libsvn_wc code will have discovered that
the current code is a complicated mess of special cases, and that it
is difficult to understand, inconsistent, slow and buggy. I know this
because I wrote some of it. It's possible that the libsvn_wc code
will gradually evolve into an elegant, efficient, code base; on the
other hand comments like "when we rewrite libsvn_wc" regularly appear
on the dev list. This document is *not* a plan or design for a
rewrite, it's just some of the thoughts of a libsvn_wc hacker.

From Past to Present
====================

The original code for libsvn_wc used an implementation that stored
more or less all state information in the .svn area on disk, so
during most operations the entries files were read and written many
times. This led to the development of an API that passed around a
lot of path parameters (as svn_string_t* originally, as const char*
now) and to the development of the svn_io_xxx functions, which also
accept path parameters. The implementation was slow, and didn't
scale particularly well as working copies got larger.

To improve things the current access baton and entries caching system
was gradually hacked in, and libsvn_wc is now faster and scales a bit
better, but problems still remain.

A good example of the problems caused by the "path as parameter" API
is svn_wc_text_modified_p. Its basic function is to determine if the
base text file and the working file are the same or different, but
physical IO operations have to be repeated because they are buried
behind several layers of API. It's difficult to fix without
rewriting, or duplicating, a number of svn_io_xxx and svn_wc_xxx
functions. Aside from the repeated IO itself, each IO operation also
has to repeat the UTF-8 to native path conversion.

The current entries caching makes things faster than in the past, but
has its own problems.
Most operations now cache the entire entries hierarchy in memory
which limits the size of the working copies that can be handled. The
problem is difficult to solve as some operations make multiple
passes--commit for instance makes a first pass searching for
modifications, a second pass reporting to the repository, and a third
pass to do post-commit processing.

The original code also did not always make a distinction between the
versioned hierarchy in the entries file and the physical hierarchy on
disk. Things like using stat() or svn_io_check_path() calls to
determine whether an item was versioned as file or directory do not
work when the working copy on disk is obstructed or incomplete.

The Future
==========

Some of these ideas are trivial, some of them are difficult to
implement, some of them may not work at all.

- Have an svn_wc_t context object, opaque outside the library, that
  would replace the access batons. This would get passed through
  most of the libsvn_wc functions and could read/cache the entries
  files on demand as the working copy was traversed. It could also
  cache the UTF-8 xlate handle.

- Have an API to svn_wc_entry_t, perhaps make the struct opaque, so
  that things like URL need not be constructed when the entries file
  is read but can be created on demand if required and possibly
  cached once created. The aim would be to reduce the memory used by
  the entries cache.

- Consider caching physical IO results in svn_wc_entry_t/svn_wc_t.
  Should we really stat() any file more than once? This becomes less
  important as we reduce the number of IO operations.

- Consider caching UTF-8 to native path conversions either in
  svn_wc_t, or svn_wc_entry_t, or locally in functions and using
  svn_io_xxx equivalents that accept native paths. This becomes less
  important as we reduce the number of IO operations.

- Make interfaces pass svn_wc_entry_t* rather than simple paths. The
  public API using const char* paths would remain to be used by
  libsvn_client et al.

- Maintain a clear distinction between the versioned hierarchy and
  the physical hierarchy when writing code, it's usually a mistake to
  use one when the other should be used. To this end, audit the use
  of svn_io_check_path().

- Avoid using stat() to determine if an item is present on disk
  before using the item, just use it straight away and handle the
  error if it doesn't exist.

- Search out and destroy functions that read and discard entries
  files e.g. the apparently "simple" functions like svn_wc_is_wc_root
  or check_wc_root. Such overhead is expensive when used by
  operations that are not going to do much other work, running status
  on a single file for example. The overhead may not matter to a
  command line client, but it can matter to a GUI that makes many
  such calls.

- Consider supporting out of tree .svn directories.

- In the present code most operations are IO bound and have CPU to
  spare. Perhaps compressed text-bases would make things faster
  rather than slower, by trading spare CPU for reduced IO?

- Keep track of the last text time written into an entries file and
  store it in svn_wc_t. Then when we come to do a timestamp sleep we
  can do it from that time rather than the current time.

- Store working file size in the entries file and use it as another
  shortcut to detect modifications. This should not need any extra
  system calls, the stat() for timestamp can also return the size.
  When it triggers it will be much faster than possibly detranslating
  and then doing a byte-by-byte comparison.

  ### Problem: This doesn't work when the file needs translation, because the
  ### file might be modified in such a way that these modifications disappear
  ### when the file is detranslated.

- Make the entries file smaller. The properties committed-date
  committed-rev and last-author are really only needed for keyword
  expansion, so only store them if the appropriate svn:keywords value
  is present.
  Note that committed-rev has a more general use as rPREV, however
  just about all uses of rPREV involve repository access so rPREV
  could be determined via an RA call. Removing the three properties
  could reduce entries file size by as much as one third, it's
  possible that might make reading, writing and parsing faster. It
  would reduce the memory used to cache the entries, an ABI change to
  svn_wc_entry_t might reduce it further.

- Look at calls to svn_wc__get_keywords and svn_wc__get_eol_style.
  Each of those reads the properties file. If they occur together
  then consider replacing them with a single call to
  svn_wc_prop_list, and perhaps write some functions that accept the
  properties hash as an argument. Alternatively, consider caching
  the existence of these two properties in the entries file to avoid
  reading the props file at all in some cases.

- Optimise update/incomplete handling to reduce the number of times
  the entry file gets written.
  http://svn.haxx.se/dev/archive-2005-03/0060.shtml

  * Avoid adding incomplete="true" if the revision is not changing.

  * Don't write incomplete="true" immediately, cache it in the access
    baton and only write it when next writing the entries file.

  * Combine removing incomplete="true", and revision bumping, with
    the last change due to the update.

- The svn_wc_t context could work in conjunction with a more advanced
  svn_wc_crawl_revisions system. This would provide a way of
  plugging multiple callbacks into a queue, probably with some sort
  of ordering and filtering ability, the aim being to replace
  most/all of the existing explicit loops. This would put more of
  the pool handling in one central location, it may even be possible
  to provide different entry caching schemes. I don't know how
  practical this idea is, or even if it is desirable.

- Have a .svn/deleted directory so that schedule delete directories
  can be moved out of the working copy.
  At present a skeleton hierarchy of schedule delete directories
  remains in the working copy until the delete is committed.

- When handling a delete received during update/switch perhaps do it
  in two stages. First move the item into a holding area within .svn
  and finally delete all such items at the end of the update. This
  would allow adds-with-history to use the deleted item and so might
  be a way to handle moves (implemented as delete plus add) in the
  presence of local modifications. Thought would have to be given to
  the revision of the local deleted item, what happens if it doesn't
  match the copyfrom revision? Perhaps we could get diffs, rather
  than full text, for adds-with-history if the copyfrom source is
  reported to the repository?

- Consider implementing atomic move for wc-to-wc moves, rather than
  using copy+delete. This would be considerably faster for big
  directories, would lead to better revert behaviour, and avoid
  case-insensitivity problems (and if we ever get atomic mv in
  libsvn_fs then the wc code would be ready for it).

- Consider writing some libsvn_wc compiled C regression tests to
  allow more complete coverage. Most of the current libsvn_wc
  testing is done via the command line client and it can be hard to
  get a working copy into the state necessary to test all code paths.

- There are some basic features that are fragile. Switch has some
  bugs that can break a working copy, see issue 1906. I don't know
  how the system is supposed to work in theory, let alone how it
  should be implemented. Non-recursive checkout is broken, see issue
  695; this probably applies to non-recursive update and switch as
  well.

- Use absolute paths within libsvn_wc so that "." is not
  automatically a wc root.

- Read notes/entries-caching for some details of the logging/caching
  in the current libsvn_wc. It's important that writing the entries
  file is handled efficiently.
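
As a rough illustration of the "store working file size" idea in the
list above, here is a minimal sketch of how a size check might sit
alongside the timestamp check using a single stat() call. It is not
libsvn_wc code: recorded_state_t, check_result_t and
quick_modified_check are names made up for this note, and the only
real API used is POSIX stat(). It assumes the recorded size is only
trusted as proof of modification when the file needs no translation,
which is the problem noted with that idea.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical: what an entries file might record for one working
   file.  This is NOT svn_wc_entry_t. */
typedef struct recorded_state_t
{
  off_t size;             /* working file size when last known clean */
  time_t text_time;       /* working file mtime recorded at that time */
  int needs_translation;  /* non-zero if keyword/eol translation applies */
} recorded_state_t;

typedef enum check_result_t
{
  CHECK_UNMODIFIED,     /* size and timestamp both match the record */
  CHECK_MODIFIED,       /* size differs and no translation is involved */
  CHECK_NEEDS_COMPARE,  /* shortcuts inconclusive, compare contents */
  CHECK_ERROR           /* stat() failed, errno says why */
} check_result_t;

/* Classify PATH against REC using one stat() call and no file reads. */
check_result_t
quick_modified_check(const char *path, const recorded_state_t *rec)
{
  struct stat st;

  assert(path != NULL && rec != NULL);

  /* No separate existence test: just stat() and report the failure. */
  if (stat(path, &st) != 0)
    return CHECK_ERROR;

  /* A size change only proves a modification when the working file is
     stored untranslated; a translated file may differ in size and still
     detranslate to the base text. */
  if (!rec->needs_translation && st.st_size != rec->size)
    return CHECK_MODIFIED;

  /* The same stat() result also serves the timestamp shortcut. */
  if (st.st_size == rec->size && st.st_mtime == rec->text_time)
    return CHECK_UNMODIFIED;

  /* Anything else still needs the expensive content comparison. */
  return CHECK_NEEDS_COMPARE;
}
```

A caller would fall back to detranslating and comparing byte-by-byte
only when the result is CHECK_NEEDS_COMPARE, so the expensive path is
skipped whenever either shortcut gives a definite answer.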