"I have a cunning plan" or Entries Caching in the Access Batons 0. Preamble -------- Entries caching appears to be a good idea for making the client and WC libraries faster. There has been some discussion about this, but how it could be implemented has never really been written down. I have a mental picture of how entries caching could work, but the picture is a little blurred in places, and I'm a bit worried that the dark bit in the corner (probably my thumb) may be obscuring important details. The solution being developed as part of issue 749 is to use the svn_wc_adm_access_t access batons to cache the results of svn_wc_entries_read, so that the .svn/entries file does not need to read and parsed repeatedly. What makes this hard is that the entries file is currently accessed in a large number of places in the code. If we attempt to introduce caching gradually there is a danger that we will mix code that uses the cache with code that access the entries file directly. Such mixing is not a good idea, as it is possible that the cache and entries file may get out of sync. Even if we could ensure that each client operation used the cache consistently (could we do that?) it would make future development hard, as we would need to ensure that such consistency didn't break. Introducing caching everywhere in a single step is better, but the code changes to do it would be gigantic. 1. Caching Interface ----------------- The plan is to identify the places where there needs to be an access baton, and then make all the changes required to pass access batons around within the code, but without attempting to introduce the caching code. This is being done in stages. Once the access baton is in place, I hope that it will then be possible to start using caching everywhere in a single step. The basic functions to retrieve entries are svn_wc_entries_read and svn_wc_entry. The function svn_wc__entries_write is used to update the entries file on disk. 
Simple really, only three functions, and once the access baton gets
this far we are more or less done! The trouble is that these functions
are used everywhere, so the batons have to be passed through a large
number of other functions.

The basic caching read interface will consist of svn_wc_entry for a
single entry and svn_wc_entries_read for a hash of all entries, just as
it does now. Initially these functions will work exactly as they do
right now, except they will have gained an additional access baton
parameter. Once the functions support caching, switching caching on
should just involve very localised changes, as the entry interface is
the same with and without caching.

In the longer term it may be that svn_wc_entries_read will be removed
in favour of providing a set of functions that access the underlying
cache, thus allowing the access baton to track changes made. However,
initially I do not think this will be required: if the current code
gets a hash from svn_wc_entries_read and expects it to remain valid,
then that expectation should still apply when caching is implemented.

PROBLEM: I have identified one place in mark_tree where the hash is
retrieved and one entry is extracted and modified. I can work around
this case, but it shows that we really need a more robust interface.
Perhaps svn_wc_entries_read in its current form should be removed, and
replaced by some functions returning const svn_wc_entry_t pointers.

PARTIAL SOLUTION: Audited all uses of svn_wc_entries_read and changed
them to use const svn_wc_entry_t* where possible.

At present access batons have a fairly strict interface: they must be
passed directory names, and the code always "knows" whether it is
supposed to have a baton for a particular directory or not (and thus it
knows whether to call svn_wc_adm_open or svn_wc_adm_retrieve).
One tricky point is that svn_wc_entry is often called first, before any
access batons are opened, to determine if a given path represents a
versioned file or a versioned directory. However, svn_wc_entry falls
back on checking the physical working copy, so this functionality will
probably be copied or moved into an access baton convenience function
that allows opening an access baton without requiring knowledge of
whether the path is a file or a directory.

The basic caching write interface is svn_wc__entries_write. Initially
this will write directly to the entries file, just as it currently
does. Later on, modifications may be cached until an explicit
entries_flush call is made. I haven't yet determined whether this would
be a significant benefit in terms of speed, or whether it would risk
losing changes if a process is interrupted.

There is a definite speed advantage to be had here: operations like
checkout are still writing the entries file repeatedly, even if they
don't need to read/parse it. However, delayed writing is tricky: if a
command is interrupted, the cache would not get flushed to disk.
Perhaps the solution is to write lots of little files, and combine them
during cleanup processing.

The function svn_wc__entry_modify is written in terms of entries_read
and entries_write and has already been converted to take an access
baton.


2. Access Baton Sets
   -----------------

Each access baton represents a directory. Access batons can associate
together in sets. Given an access baton in a set, it is possible to
retrieve any other access baton in the set. When an access baton in a
set is closed, all other access batons in the set that represent
subdirectories are also closed.

The set is implemented as a hash table "owned" by one baton in any set,
but shared by all batons in the set. At present in the code, access
batons are opened in a parent->child order. This works well with the
shared hash being owned by the first baton in each set.
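
The set behaviour described above can be sketched in a few lines of
standalone C. This is only a toy under stated assumptions, not the
libsvn_wc code: a fixed-size array stands in for the shared hash table,
malloc stands in for pools, and every name (set_baton, baton_set,
set_baton_open, set_baton_retrieve, set_baton_close) is hypothetical.
Refusing to close the owning baton while unrelated batons remain is
this sketch's own choice.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SET_MAX 16        /* toy limit on batons per set */
#define PATH_MAX_LEN 256  /* toy limit on path length */

typedef struct set_baton set_baton;

/* The shared table.  One baton owns it; every member points at it. */
typedef struct baton_set {
  set_baton *members[SET_MAX];
  int count;
  set_baton *owner;
} baton_set;

struct set_baton {
  char path[PATH_MAX_LEN];  /* the directory this baton represents */
  baton_set *set;
};

/* Open a baton for directory PATH.  With ASSOCIATED == NULL the new
   baton starts, and owns, a fresh set; otherwise it joins ASSOCIATED's
   set.  Returns NULL on failure. */
set_baton *set_baton_open(set_baton *associated, const char *path)
{
  baton_set *set = associated ? associated->set : NULL;
  set_baton *b;

  assert(path != NULL);
  if (strlen(path) >= PATH_MAX_LEN || (set && set->count == SET_MAX))
    return NULL;
  b = calloc(1, sizeof(*b));
  if (b == NULL)
    return NULL;
  strcpy(b->path, path);
  if (set == NULL) {
    set = calloc(1, sizeof(*set));
    if (set == NULL) {
      free(b);
      return NULL;
    }
    set->owner = b;
  }
  b->set = set;
  set->members[set->count++] = b;
  return b;
}

/* Given any baton in a set, find the baton for PATH, or NULL. */
set_baton *set_baton_retrieve(set_baton *associated, const char *path)
{
  baton_set *set = associated->set;
  int i;

  for (i = 0; i < set->count; i++)
    if (strcmp(set->members[i]->path, path) == 0)
      return set->members[i];
  return NULL;
}

/* True if PATH names something strictly below directory DIR. */
static int is_below(const char *dir, const char *path)
{
  size_t n = strlen(dir);
  return strncmp(dir, path, n) == 0 && path[n] == '/';
}

/* Close BATON and every baton in the set below it.  Returns the number
   of batons closed, or -1 (closing nothing) if that would destroy the
   shared table while other batons still use it. */
int set_baton_close(set_baton *baton)
{
  baton_set *set = baton->set;
  int goes[SET_MAX];
  int i, kept = 0, closed = 0, owner_goes = 0, survivors = 0;

  for (i = 0; i < set->count; i++) {
    set_baton *b = set->members[i];
    goes[i] = (b == baton || is_below(baton->path, b->path));
    if (goes[i] && b == set->owner)
      owner_goes = 1;
    if (!goes[i])
      survivors++;
  }
  if (owner_goes && survivors > 0)
    return -1;
  for (i = 0; i < set->count; i++) {
    if (goes[i]) {
      free(set->members[i]);
      closed++;
    } else
      set->members[kept++] = set->members[i];
  }
  set->count = kept;
  if (kept == 0)
    free(set);  /* the last member, hence the owner, is gone */
  return closed;
}
```

With batons opened for wc, wc/a, wc/a/b and wc/c in one set, closing
the wc/a baton closes two batons (wc/a and wc/a/b) and leaves wc and
wc/c retrievable from each other.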
There is code to detect if closing a baton will destroy the hash while
other batons are using it; as far as I know it doesn't currently
trigger. If it turns out that this needs to be supported, it should be
possible to transfer the hash information to another baton.


3. Caching Mechanism
   -----------------

Each access baton will cache the two possible hashes returned by
svn_wc_entries_read, so that subsequent calls will not need to parse
the entries file. If the full hash, the one containing deleted entries,
is available when a request for the truncated hash is made, then the
truncated hash will be constructed from the full hash. The function
svn_wc__entries_write will cause the full hash cache to be filled and
the truncated hash cache to be cleared.

PROBLEM: memory use is going to be a problem if we simply repeatedly
allocate from the access baton pool, as happens at present. As the
entries get updated we need to find a way to reuse the cache memory,
otherwise memory usage for checkout is going to increase with the
number of items in the working copy.

PARTIAL SOLUTION: ensure that the access baton pool is used for as
little as possible. Use a "normal" pool for all allocations that do not
need to be part of the cache.


4. Access Baton Conversion
   -----------------------

Given a function

   svn_error_t *foo (const char *path);

if PATH is always a directory then the change that gets made is usually

   svn_error_t *foo (svn_wc_adm_access_t *adm_access);

Within foo, the original const char* can be obtained using

   const char *svn_wc_adm_access_path(svn_wc_adm_access_t *adm_access);

The above case sometimes occurs as

   svn_error_t *foo(const char *name, const char *dir);

where NAME is a single path component, and DIR is a directory.
Conversion is again simple in this case

   svn_error_t *foo (const char *name, svn_wc_adm_access_t *adm_access);

The more difficult case is

   svn_error_t *foo (const char *path);

where PATH can be a file or a directory. This occurs a lot in the
current code.
In the long term these may get converted to

   svn_error_t *foo (const char *name, svn_wc_adm_access_t *adm_access);

where NAME is a single path component. However, this involves more
changes to the code calling foo than are strictly necessary, so
initially they get converted to

   svn_error_t *foo (const char *path, svn_wc_adm_access_t *adm_access);

where PATH is passed unchanged and an additional access baton is
passed. This interface is less than ideal, since there is duplicate
information in the path and baton, but since it involves fewer changes
in the calling code it makes a reasonable intermediate step.


5. Logging
   -------

As well as caching, the other problem that needs to be addressed is the
issue of logging. Modifications to the working copy are supposed to use
the log file mechanism to ensure that multiple changes that need to be
atomic cannot be partially completed. If the individual changes that
may need to be logged are all forced to use an access baton, then the
access baton may be able to identify when the log file mechanism should
be used. Combine this with an access baton state that tracks whether a
log file is being run and we may be able to automatically identify
those places that are failing to use the log file mechanism.


6. Status
   ------

Entries caching to avoid repeated reading and parsing of the entries
file is now in place. The problem of delaying and combining writing to
the file has not been addressed.
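
For reference, the two-hash scheme from section 3 can be sketched as
standalone C. This is a toy under stated assumptions rather than the
code in libsvn_wc: fixed-size arrays stand in for the hashes and pools,
an in-memory struct stands in for the .svn/entries file, and all names
(cache_baton, cache_entries_read, cache_entries_write and so on) are
hypothetical. The parse_count field exists only to show when the "file"
would have been parsed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ENTRIES 8  /* toy limit on entries per directory */

typedef struct toy_entry {
  const char *name;
  int deleted;
} toy_entry;

typedef struct toy_entries {
  toy_entry items[MAX_ENTRIES];
  size_t count;
  int valid;  /* for the two caches: is this copy usable? */
} toy_entries;

typedef struct cache_baton {
  toy_entries file;       /* stand-in for the on-disk entries file */
  toy_entries full;       /* cache: all entries, deleted ones included */
  toy_entries truncated;  /* cache: deleted entries left out */
  int parse_count;        /* how often the "file" had to be parsed */
} cache_baton;

void cache_baton_init(cache_baton *b)
{
  memset(b, 0, sizeof(*b));
}

/* Copy the non-deleted entries of SRC into DST and mark DST valid. */
static void filter_deleted(const toy_entries *src, toy_entries *dst)
{
  size_t i;

  assert(src->count <= MAX_ENTRIES);
  dst->count = 0;
  for (i = 0; i < src->count; i++)
    if (!src->items[i].deleted)
      dst->items[dst->count++] = src->items[i];
  dst->valid = 1;
}

/* Return the full hash if SHOW_DELETED, else the truncated one.  The
   "file" is parsed only when no suitable cache exists; the truncated
   hash is built from the full hash whenever the full hash is cached. */
const toy_entries *cache_entries_read(cache_baton *b, int show_deleted)
{
  if (show_deleted) {
    if (!b->full.valid) {
      b->full = b->file;  /* "parse" the file */
      b->full.valid = 1;
      b->parse_count++;
    }
    return &b->full;
  }
  if (!b->truncated.valid) {
    if (b->full.valid)
      filter_deleted(&b->full, &b->truncated);
    else {
      filter_deleted(&b->file, &b->truncated);
      b->parse_count++;
    }
  }
  return &b->truncated;
}

/* Write ENTRIES to the "file", fill the full cache from them and clear
   the truncated cache. */
void cache_entries_write(cache_baton *b, const toy_entries *entries)
{
  b->file = *entries;
  b->full = *entries;
  b->full.valid = 1;
  b->truncated.valid = 0;
}
```

Writing three entries, one of them deleted, and then reading both
hashes never touches the "file": the full hash comes from the write and
the truncated hash is derived from it. Returning const pointers follows
the direction suggested in section 1.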