Deltified Storage ----------------- Mike and I are reworking the deltification/undeltification code right now, to be a lot more efficient than the first implementation. We're using Branko's proposed scheme; below is his description of it, plus some related correspondence. ========================================================================= Branko's original mail, edited for space but no content changed: ========================================================================= Date: Fri, 27 Jul 2001 06:11:04 +0200 From: Branko =?ISO-8859-2?Q?=C8ibej?= Content-Type: text/plain; charset=ISO-8859-2; format=flowed Subject: RFC: Delta indexing and composition Here's the "more on that later" I promised. Now that I've safely postponed the delta combiner, I'd like to share my ideas for improving the deltification code in the filesystem. The scheme we have in place now works (kudos to Mike and Karl here!), but wastes both time and space, and is probably a mild disaster for non-sequential access to large files or old versions. The reasons for this are obvious: 1. It reads/resurrects the whole text from the beginning up to the end of the interesting region 2. It keeps applying all the deltas from the version to the fulltext every time an old version is accessed, throwing away the result. My suggested solution to these two problems is based on recognizing exactly what a delta window is: A delta window defines a contiguous region of text A. It may depend on a contiguous region of text B, but is independent of any other delta window in the delta representation of A. Given this definition, two things become obvious: 1. To reconstruct the region defined by window N, we only have to read that window. 2. Different windows within a delta may depend on different source texts. The first item implies that we can solve the first problem my indexing the windows within a delta, and accessing them directly. But we knew that already. The second item implies that every time we reconstruct a region of text, we can replace its defining delta window with a single diff from the fulltext, eliminating the intermediate reconstruction steps the next time this region is accessed -- thus solving the second problem. So, here's my proposal 1) Change the delta representation to index and store delta windows separately DELTA ::= (("delta" FLAG ...) (OFFSET WINDOW) ...) ; WINDOW ::= DIFF SIZE CHECKSUM [REP-KEY REP-OFFSET] ; OFFSET ::= number ; REP-OFFSET ::= number; The REP-KEY and REP-OFFSET in WINDOW are optional because, if the differences between two file revisions is large enough, the diff could in fact be larger than a compression-only vdelta of the text region. In that case it makes more sense to compress the window than to store a diff. 2) Change the undeltifier to use the new structure The undeltifier will stay essentially the same as it is now, except that it will use OFFSET and REP-OFFSET to access the necessary bits directly. The place where the delta combiner will fit stays the same, too. The major addition comes after the text is reconstructed. Using some suitable heuristic -- probably based on the number of jumps from the representation to the fulltext, the size of the diff from the fulltext, etc. -- we can decide to: a) replace the window with a single diff from the fulltext, b) replace it with a compressed version of the region, or c) do nothing. The disadvantage of this proposal is, of course, more space used in the repository. We can reduce the increase somewhat by compressing the window index, and possibly improving the svndiff encoding. But I think it's a fair price to pay, because this scheme reduces the number of disk accesses, total memory use and average processing time needed to reconstruct a region of text. That's it. Let's see you find holes in my proposal. :-) Brane ========================================================================= Then Mike and I asked Branko some questions, here is his response: ========================================================================= From: Branko =?ISO-8859-2?Q?=C8ibej?= Subject: Re: deltification semi-rewrite starting now To: kfogel@collab.net CC: dev@subversion.tigris.org Date: Mon, 24 Sep 2001 23:45:48 +0200 kfogel@collab.net wrote: >Mike Pilato and I have just reviewed Branko's deltification proposal, >found at > > notes/delta-indexing-and-composition.txt > >and like what we see :-). We have a couple of questions that probably >Branko can answer quickly, but basically we're going to start >implementing it now, completion anticipated in 2 weeks max (thank >goodness all the strings/reps separation is already done, so that >whole wheel doesn't need to be reinvented). > >The plan is that we'll also implement a new `svnadmin' subcommand for >deltifying and undeltifying revisions, or particular paths within >revisions. That way, administrators have a way to make certain trees >very efficient to retrieve -- for example, one might want to do this >to a tagged release -- and also gives us an obvious way to deltify the >storage of the current svn repository without perturbing the revision >numbers. :-) > >Branko, a couple of questions regarding your lovely design: > >>So, here's my proposal >> >>1) Change the delta representation to index and store delta windows >>separately >> >> DELTA ::= (("delta" FLAG ...) (OFFSET WINDOW) ...) ; >> WINDOW ::= DIFF SIZE CHECKSUM [REP-KEY REP-OFFSET] ; >> OFFSET ::= number ; >> REP-OFFSET ::= number; >> >> >>The REP-KEY and REP-OFFSET in WINDOW are optional because, if the >>differences between two file revisions is large enough, the diff could >>in fact be larger than a compression-only vdelta of the text region. In >>that case it makes more sense to compress the window than to store a diff. >> > >We're not sure what REP-OFFSET is for. > >We're pretty sure we understand OFFSET. It's the offset into the >reconstructed fulltext. The OFFSETs increase with each WINDOW in a >DELTA, and you can tell a given window's reconstruction range either >by adding OFFSET + SIZE, or by subtracting one OFFSET from the next. > >Hopefully that's a correct summary. :-) Yes, that is exactly right. >But what is REP-OFFSET? We understand the REP-KEY that precedes it. >That's simply the representation against whose fulltext this delta >applies, right? Let me think ... Yes. > But why would we want an offset into that rep? We >had thought the relevant offset(s) are part of the svndiff encoding. >Is it a way of magically jumping over a certain number of windows and >landing on the right one, in next-most-immediate source >representation, or is it something else? Although the offset is implicit in the svndiff, in real life you want to find the source (fulltext) *before* decoding the window. Also, as I noted, you might want to just use a (self-referencing) vdelta compress instead of a diff, if the result of the compression is smaller than the diff. Hmm. It's been a long time since I wrote that, and as usual I left some of the reasoning out. I'll have to think about this again. I sort of remember it had to do with true random access to the text. >We're still thinking about this, but maybe you can put us out of our >misery quickly. :-) Thanks, you just got me worrying about it. :-) >Also, did you mean > > WINDOW ::= (DIFF SIZE CHECKSUM [REP-KEY REP-OFFSET]) ; > >i.e., with parens, rather than without? Yes, it would work without >being a sublist, but for maintainability a sublist might be >preferable... I meant without params, but obviously it doesn't hurt to make a sublist out of it. Use whatever you find more aesthetically pleasing. :-) >Anyway, we can start coding right away, while awaiting clarification. >Found no holes in the proposal; agree that there is a slight storage >penalty, but the memory usage and speed gains are so overwhelming that >it would be petty to complain about the *very* gently-sloped, albeit >linear, increase in storage per deltified file. Wonderful. Now I /really/ have to dust off and finish the delta combiner. >The replacing of distant diffs with ones nearer the fulltext is a >great idea; we'll probably wait on that until after the basic rewrite >is done, however, as it is an optimization, though a very effective >one. Yes, it's an optimization only. What's more, it can be done entirely off-line. ========================================================================= Commentary: ========================================================================= We're going to just ignore REP-OFFSET for now, we can do everything without it. Maybe it will be used in a true delta-combiner later. Also, yes, we'll wrap WINDOW in an extra pair of parens, purely for aesthetic reasons. So: DELTA ::= (("delta" FLAG ...) (OFFSET WINDOW) ...) ; WINDOW ::= (DIFF SIZE CHECKSUM [REP-KEY [REP-OFFSET]]) ; OFFSET ::= number ; REP-OFFSET ::= number;