"I have a cunning plan" or Entries Caching in the Access Batons 0. Preamble -------- Entries caching appears to be a good idea for making the client and WC libraries faster. There has been some discussion about this, but how it could be implemented has never really been written down. I have a mental picture of how entries caching could work, but the picture is a little blurred in places, and I'm a bit worried that the dark bit in the corner (probably my thumb) may be obscuring important details. The solution being developed as part of issue 749 is to use the svn_wc_adm_access_t access batons to cache the results of svn_wc_entries_read, so that the .svn/entries file does not need to read and parsed repeatedly. What makes this hard is that the entries file is currently accessed in a large number of places in the code. If we attempt to introduce caching gradually there is a danger that we will mix code that uses the cache with code that access the entries file directly. Such mixing is not a good idea, as it is possible that the cache and entries file may get out of sync. Even if we could ensure that each client operation used the cache consistently (could we do that?) it would make future development hard, as we would need to ensure that such consistency didn't break. Introducing caching everywhere in a single step is better, but the code changes to do it would be gigantic. 1. Caching Interface ----------------- The plan is to identify the places where there needs to be an access baton, and then make all the changes required to pass access batons around within the code, but without attempting to introduce the caching code. This is being done in stages. Once the access baton is in place, I hope that it will then be possible to start using caching everywhere in a single step. The basic functions to retrieve entries are svn_wc_entries_read and svn_wc_entry. The function svn_wc__entries_write is used to update the entries file on disk. 
Simple really, only three functions, and once the access baton gets this
far we are more or less done! The trouble is that these functions are
used everywhere, so the batons have to be passed through a large number
of other functions.

The basic caching read interface will consist of svn_wc_entry for a
single entry and svn_wc_entries_read for a hash of all entries, just as
it does now. Initially these functions will work exactly as they do
right now, except that they will have gained an additional access baton
parameter. Once the functions support caching, switching caching on
should involve only very localised changes, as the entry interface is
the same with and without caching.

In the longer term it may be that svn_wc_entries_read will be removed in
favour of a set of functions that access the underlying cache, thus
allowing the access baton to track changes made. However, initially I
do not think this will be required: if the current code gets a hash from
svn_wc_entries_read and expects it to remain valid, then that
expectation should still apply when caching is implemented.

At present access batons have a fairly strict interface: they must be
passed directory names, and the code always "knows" whether it is
supposed to have a baton for a particular directory or not (and thus it
knows whether to call svn_wc_adm_open or svn_wc_adm_retrieve). One
tricky point is that svn_wc_entry is often called first, before any
access batons are opened, to determine whether a given path represents a
versioned file or a versioned directory. However, svn_wc_entry falls
back on checking the physical working copy, so this functionality will
probably be copied or moved into an access baton convenience function
that allows opening an access baton without requiring knowledge of
whether the path is a file or a directory.

The basic caching write interface is svn_wc__entries_write. Initially
this will write directly to the entries file, just as it currently does.
Later on, modifications may be cached until an explicit entries_flush
call is made. I haven't yet determined whether this would be a
significant benefit in terms of speed, or whether it would risk losing
changes if a process is interrupted.

The function svn_wc__entry_modify is written in terms of entries_read
and entries_write and has already been converted to take an access
baton.


2. Caching Mechanism
   -----------------

Each access baton represents a directory. Access batons can associate
together in sets. Given an access baton in a set, it is possible to
retrieve any other access baton in the set. When an access baton in a
set is closed, all other access batons in the set that represent
subdirectories are also closed. The set is implemented as a hash table
"owned" by one baton in the set, but shared by all batons in the set.

Caching will be similar. The cache hash tables will be "owned" by one
baton in the set, but shared by all batons. Caching will be lazy: the
cache will not be populated until required (need to see how the
TREE_LOCK behaviour in svn_wc_adm_open interacts here). Only entries
covered by an access baton will be available in the cache; when an
access baton is closed its entries will be removed from the cache.

At present in the code, access batons are opened in a parent->child
order. This works well with the shared hash being owned by the first
baton in each set. There is code to detect if closing a baton will
destroy the hash while other batons are using it; as far as I know it
doesn't currently trigger. If it turns out that this needs to be
supported, it should be possible to transfer the hash information to
another baton.


3. Access Baton Conversion
   -----------------------

Given a function

   svn_error_t *foo (const char *path);

if PATH is always a directory then the change that gets made is usually

   svn_error_t *foo (svn_wc_adm_access_t *adm_access);

Within foo, the original const char* can be obtained using

   const char *svn_wc_adm_access_path (svn_wc_adm_access_t *adm_access);

The above case sometimes occurs as

   svn_error_t *foo (const char *name, const char *dir);

where NAME is a single path component, and DIR is a directory.
Conversion is again simple in this case

   svn_error_t *foo (const char *name, svn_wc_adm_access_t *adm_access);

The more difficult case is

   svn_error_t *foo (const char *path);

where PATH can be a file or a directory. This occurs a lot in the
current code. In the long term these may get converted to

   svn_error_t *foo (const char *name, svn_wc_adm_access_t *adm_access);

where NAME is a single path component. However, this involves more
changes to the code calling foo than are strictly necessary, so
initially they get converted to

   svn_error_t *foo (const char *path, svn_wc_adm_access_t *adm_access);

where PATH is passed unchanged and an additional access baton is passed.
This interface is less than ideal, since there is duplicate information
in the path and baton, but since it involves fewer changes in the
calling code it makes a reasonable intermediate step.


4. Logging
   -------

As well as caching, the other problem that needs to be addressed is the
issue of logging. Modifications to the working copy are supposed to use
the log file mechanism to ensure that multiple changes that need to be
atomic cannot be partially completed. If the individual changes that
may need to be logged are all forced to use an access baton, then the
access baton may be able to identify when the log file mechanism should
be used.
Combine this with an access baton state that tracks whether a log file
is being run and we may be able to automatically identify those places
that are failing to use the log file mechanism.


5. Status
   ------

Now: I'm currently working on a patch to pass the access baton to
svn_wc_entries_read, only one regression test failure at present! I've
cheated a bit, because svn_wc_entry is currently passing NULL to
svn_wc_entries_read; I really need to convert svn_wc_entry as well to
complete the patch.

Next: svn_wc__entries_write should be simple once svn_wc_entries_read is
done.

Then: After the above is complete, the caching stuff might start.
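

To make the lazy, shared cache of section 2 a little more concrete, here
is a toy model in C. It is only a sketch under stated assumptions, not
Subversion code: every toy_* name is hypothetical, a fixed array stands
in for the real hash tables, and "parsing" an entries file is simulated
by bumping a counter. It shows one cache owned by the first baton of a
set and shared by the rest, populated only on first read, with a baton's
entries dropped when that baton is closed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_MAX_DIRS 8

/* One cached directory; stands in for the hash of entries. */
typedef struct {
  const char *dir;   /* directory cached in this slot, NULL if free */
  int n_entries;     /* pretend payload */
} toy_slot_t;

/* Cache shared by every baton in a set. */
typedef struct {
  toy_slot_t slots[TOY_MAX_DIRS];
  int parse_count;   /* number of simulated entries-file reads */
} toy_cache_t;

/* Hypothetical stand-in for an access baton. */
typedef struct {
  const char *path;
  toy_cache_t *cache;
  int owns_cache;    /* non-zero for the baton that created the cache */
} toy_baton_t;

/* Open a baton.  With ASSOCIATED == NULL the baton starts a new set and
   owns the cache; otherwise it shares the cache of ASSOCIATED.
   Returns 0 on success, -1 on allocation failure. */
int toy_open (toy_baton_t *baton, const char *path, toy_baton_t *associated)
{
  int i;

  baton->path = path;
  if (associated != NULL)
    {
      baton->cache = associated->cache;
      baton->owns_cache = 0;
      return 0;
    }

  baton->cache = malloc (sizeof (*baton->cache));
  if (baton->cache == NULL)
    return -1;
  for (i = 0; i < TOY_MAX_DIRS; i++)
    baton->cache->slots[i].dir = NULL;
  baton->cache->parse_count = 0;
  baton->owns_cache = 1;
  return 0;
}

/* Return the cached entries for the baton's directory, "parsing" them
   on first use only.  Returns NULL if the toy cache is full. */
const toy_slot_t *toy_entries_read (toy_baton_t *baton)
{
  toy_slot_t *free_slot = NULL;
  int i;

  assert (baton->cache != NULL);
  for (i = 0; i < TOY_MAX_DIRS; i++)
    {
      toy_slot_t *slot = &baton->cache->slots[i];
      if (slot->dir != NULL && strcmp (slot->dir, baton->path) == 0)
        return slot;                      /* cache hit, no parse */
      if (slot->dir == NULL && free_slot == NULL)
        free_slot = slot;
    }
  if (free_slot == NULL)
    return NULL;

  baton->cache->parse_count++;            /* simulated read and parse */
  free_slot->n_entries = (int) strlen (baton->path);  /* fake payload */
  free_slot->dir = baton->path;
  return free_slot;
}

/* Close a baton, removing its entries from the shared cache.  The owner
   frees the cache, so in this toy it must be closed last; closing
   subdirectory batons automatically is not modelled. */
void toy_close (toy_baton_t *baton)
{
  int i;

  for (i = 0; i < TOY_MAX_DIRS; i++)
    {
      toy_slot_t *slot = &baton->cache->slots[i];
      if (slot->dir != NULL && strcmp (slot->dir, baton->path) == 0)
        slot->dir = NULL;
    }
  if (baton->owns_cache)
    free (baton->cache);
  baton->cache = NULL;
}
```

Opening a parent and then a child against it gives two batons with one
cache; repeated reads through either baton cost at most one simulated
parse per directory. The parent->child opening order matters here for
the same reason as in section 2: the first baton owns the shared data.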