[ This paper makes reference to the "Sfio" library. See http://www.research.att.com/sw/tools/sfio/ for more. -kff] Network Services Research Lab David Korn and Kiem-Phong Vo AT&T Labs Submission: March 09, 2000 Expiration: September 09, 2000 The VCDIFF Generic Differencing and Compression Data Format Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet- Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Abstract This memo describes a general and efficient data format suitable for encoding compressed and/or differencing data so that they can be easily transported among computers. Table of Contents: 1. Executive Summary ............................................ 1 2. Algorithm Conventions ........................................ 3 3. Delta Instructions ........................................... 3 4. Vcdiff Encoding Format ....................................... 4 5. Instruction Code Tables ...................................... 13 6. Compression and Differencing Algorithms ...................... 18 7. Summary ...................................................... 23 ACKNOWLEDGEMENTS ............................................. 23 REFERENCES ................................................... 23 AUTHOR'S ADDRESS ............................................. 24 1. Executive Summary Compression and differencing techniques can greatly improve storage and transmission of files and file versions. Since files are often transported across machines with distinct architectures and performance characteristics, such data should be encoded in a form that is portable and can be decoded with little or no knowledge of the encoders. This document describes Vcdiff, a new compact portable encoding format that is independent of encoding algorithms and also efficient to decode. Data differencing is the process of computing a compact and invertible RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000 encoding of a "target file" given a "source file". Data compression is similar but without the use of source data. The UNIX utilities diff, compress, and gzip are well-known examples of data differencing and compression tools. For data differencing, the computed encoding is called a "delta file", and, for data compression, it is called a "compressed file". Delta and compressed files are good for storage and transmission because they are often smaller than the originals. Data differencing and data compression are traditionally treated as distinct types of data processing. However, compression can be thought of as a special case of differencing in which the source data is empty. This blending of differencing and compression was first introduced in the Vdelta technique [1] by Korn and Vo. The basic idea is to unify the string parsing scheme used in the Lempel-Ziv'77 style compressors [2], and the block-move technique of Tichy [3]. Loosely speaking, this works as follows: a. Concatenate source and target data. b. Parse the data from left to right just like LZ'77 but make sure that a parsed segment starts target data. c. Start to output when reaching target data. Parsing is based on string matching algorithms such as suffix trees [4] or hashing with different time and space performance characteristics. Vdelta uses a new string matching algorithm that performs competitive to other techniques [5]. However, even with a memory-efficient and fast string matching algorithm, the computing resource requirements can be prohibitive for processing large files. The standard way to deal with this is to partition input files into "windows" to be processed separately. Here, except for some unpublished work by Vo, little has been done on designing effective windowing schemes. Current techniques, including Vdelta, simply use windows of the same size with corresponding addresses across source and target files. String matching and windowing algorithms have large influence on the compression rate of delta and compressed files. However, it is desirable to have a portable encoding format that is independent of such algorithms. This enables construction of client-server applications in which a server may serve clients with unknown computing characteristics. Unfortunately, all current differencing and compressing tools, including Vdelta, fall short in this resspect. Their storage formats are closely intertwined with the implemented algorithms. The encoding format Vcdiff proposed here addresses the above issues. Vcdiff achieves the below characteristics: Output compactness: The basic encoding format compactly represents compressed or delta files. In specific cases, applications can further extend it with "secondary encoders" (e.g., a Huffman or arithmetic encoder) to achieve more compression. Data portability: The basic encoding format is free from byte order and word size issues for integer representations. Algorithm genericity: The decoding algorithm for the basic encoding format is independent from string matching and windowing algorithms. Decoding efficiency: The decoding algorithm for the basic encoding format runs in time proportional to the size of the target file and uses space -2- RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000 proportional to the maximal window size. The Vcdiff data format and the algorithms for decoding data shall be described next. Since Vcdiff treats compression as a special case of differencing, we shall use the term "delta file" to indicate the compressed output for both cases. 2. Algorithm Conventions Algorithms to encode and decode the Vcdiff data format will be presented in the ANSI-C language. To ease the presentation, we shall generally omit error checking. The below conventions will be observed: a. Tokens with all upper cases letters will be C macros. b. Variables with capitalized names are global. c. Code fragments will be presented with line numbers to be used in the subsequent COMMENTS sections. Data will be encoded in byte units. For portability, control data generated by Vcdiff shall be limited to the lower eight bits of a byte even on machines with larger bytes. The bits in a byte are named from right to left so that the first bit has the lowest value, 1<<0 or 1, and the eighth bit has value 1<<7 or 128. To facilitate the algorithms, we shall assume a few types and functions for I/O on streams. To be definite, we shall use interfaces similar to that provided in the Sfio library. Below are descriptions of some of these types and functions. Others can be found in reference [6]. uchar_t: This is the type "unsigned char". int_t: This is the largest integer type on the platform in use. Sfio_t: This is the type of a stream. Sfio_t* sfstropen(uchar_t* buf, int n): This is not an Sfio function but it can be easily implemented on top of the Sfio primitive sfnew(). sfstropen() creates a stream from a given buffer with a given size. We shall assume that such a stream is both readable and writable. As with Sfio, a write stream will extend the buffer as necessary to accomodate output data. 3. Delta Instructions A target file is partitioned into non-overlapping sections or windows to be processed separately. A target window T of length n may be compared against some source data segment S of length m. Such a source data segment may come from some earlier part of the target file or it may come from the source file, if there is one. It is assumed that there is sufficient memory so that both T and S can be processed in main memory. For string processing, we treat S and T as substrings of a superstring U formed by concatenating T and S like this: -3- RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000 s[0]s[1]...s[m-1]t[0]t[1]...t[n-1] The index of a byte in S or T is referred to by its location in U. For example, T[i] is referred to as U[m+i]. The instructions to encode and direct the reconstruction of a target window are called delta instructions. There are three types: ADD: This instruction has two arguments, a size s and a sequence of s bytes to be copied to reconstruct the target window. COPY: This instruction has two arguments, a size s and an address p in the string U. The arguments specify the substring of U that must be copied. We shall assert that such a substring must be entirely contained in either S or T. RUN: This instruction has two arguments, a size s and a byte c that will be repeated s times. Below are example source and target strings and the delta instructions that encode the target string in terms of the source string. a b c d e f g h i j k l m n o p a b c d w x y z e f g h e f g h e f g h e f g h z z z z COPY 4, 0 ADD 4, w x y z COPY 4, 4 COPY 12, 24 RUN 4, z Note that the third COPY instruction copies data from T itself since address 24 is position 8 in T. In addition, parts of the data to be copied by this instruction will be reconstructed during its execution. This allows efficient encoding of periodic sequences, i.e., sequences with regularly repeated subsequences. The RUN instruction is a compact way to encode a sequence repeating the same byte even though such a sequence can be thought of as a periodic sequence with period 1. 4. Vcdiff Encoding Format A large target file is partitioned into non-overlapping sections called "target windows". In practice, window sizes should be chosen so that each window can be completely processed in memory during both encoding and decoding. A target window may be compared against some source data segment or none. In the compression case, such a source segment would be from some earlier part of the target file while, for differencing, it may also be from the source file. 4.1 Delta File Layout A Vcdiff delta file is organized into sections as follows: 4-byte header -4- RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000 Section 1 Section 2 ... Header: The header of a delta file consists of 4 bytes to identify it as a Vcdiff delta file. The first three bytes are: 0xd6, 0xc3, 0xc4, i.e., the ASCII letters 'V', 'C' and 'D' or-ed with the eighth bit. The fourth byte indicates a version number which is currently 0. Sections: Following the 4-byte header are a number of sections. Each section encodes either an instruction code table (Section 5) or a window. Each section starts with an indicator byte. If the 8th bit of this byte is on, the section encodes an instruction table; otherwise, it encodes a window. The remaining bits of this byte are used to encode other information as necessary. 4.1.1 The Format of a Window Each window is organized as below. Note that items bracketed are present only when certain corresponding bits in the indicator byte are on. Indicator byte Length of target window Length of instruction dataset Length of raw dataset [Index of code table] [Secondary instruction compressor] [Length of compressed instruction dataset] [Secondary data compressor] [Length of compressed raw dataset] [Source segment size] [Source segment position] Instruction dataset Raw dataset Indicator byte: The bits of the indicator byte for a window are as follows: 0 T I D WT WS X X 0: This is the 0-bit indicating that this section is a window of data to be decoded. T: This bit, if on, indicates that a non-default instruction code table should be used for decoding. In this case, the item [Index of code table] is a single byte indicating the index of this code table. I: This bit, if on, indicates that the instruction dataset has been compressed with a secondary compressor. In this case, the item [Secondary instruction compressor] is a single byte indicating the secondary compressor while the item [Length of compressed instruction dataset] has the length of the compressed instruction dataset encoded as a variable-sized integer. D: This bit is similar to the I-bit but it is for the raw -5- RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000 dataset. WT: This bit, if on, indicates that a source data segment was used to compare the target window with. Further, this source segment is from an earlier part of the target file. In this case, the size and position of the source segment are given in [Source segment size] and [Source segment position]. Both of these items are encoded as variable-sized integers. WS: This bit is similar to WT but, if on, indicates that the source data segment is from the source file. X: These bits are currently unused. Length of target window: This is a variable-sized integer indicating the size of the target window. Lengths of instruction and raw datasets: The delta instructions ADD and RUN have associated raw data (unmatched bytes for ADD and the repeating byte for RUN). The Vcdiff encoding format maintains two separate datasets, one for the raw data and one for the instructions. The lengths of the "instruction dataset" and the "raw dataset" are stored respectively as two variable-sized integers. Instruction dataset: This is the set of bytes that define the delta instructions. If I was 1, the dataset has been encoded by the indicated secondary encoder. Raw dataset: This is the set of raw data accompanying the ADD and RUN instructions. If D was 1, the dataset has been encoded by the indicated secondary encoder. 4.1.2 Processing a Delta File Below is the basic algorithm to process a delta file: 1. sfgetc(Delta); 2. sfgetc(Delta); 3. sfgetc(Delta); 4. sfgetc(Delta); 6. for(;;) 7. { if((indi = sfgetc(Delta)) < 0) 8. break; 9. if((indi & (1<<7)) != 0) 10. code_define(); 11. else 12. { n_tar = sfgetu(Delta); 13. tar = malloc(n_tar); 14. win_inflate(indi, tar, n_tar, NULL, 0); 15. sfwrite(Target, tar, n_tar); 16. free(tar); 17. } 18. } -6- RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000 COMMENTS. 1-4: These lines read the 4-byte header. 7-8: These lines read the indicator byte and terminate if reached end-of-file. 9-10: These lines define a new code table. 12-16: These lines decode the target window and output it to the target file. Next is the function to recompute a target window: 1. win_inflate(int indi, uchar_t* tar, int n_tar, uchar_t* src, int n_src) 2. { int n_inst, n_data; 3. uchar_t *inst, *data; 4. int s_inst, s_data; 5. int p_src, i_code, f_inst, f_data; 6. Sfio_t *sf, *instf, *dataf; 7. n_inst = sfgetu(Delta); 8. n_data = sfgetu(Delta); 9. if(indi & (1<<6)) 10. i_code = sfgetc(Delta); 11. if(indi & (1<<5)) 12. { f_inst = sfgetc(Delta); 13. s_inst = sfgetu(Delta); 14. } 15. if(indi & (1<<4)) 16. { f_data = sfgetc(Delta); 17. s_data = sfgetu(Delta); 18. } 19. if(indi & ((1<<3)|(1<<2)) ) 20. { n_src = sfgetu(Delta); 21. p_src = sfgetu(Delta); 22. } 23. inst = malloc(n_inst); 24. if(indi & (1<<5)) 25. { sfread(Delta, inst, s_inst); 26. decompress(inst, n_inst, s_inst, f_inst); 27. } 28. else sfread(Delta, inst, n_inst); 29. instf = sfstropen(inst, n_inst); 30. data = malloc(n_data); 31. if(indi & (1<<4)) 32. { sfread(Delta, data, s_data); 33. decompress(data, n_data, s_data, f_data); 34. } 35. else sfread(Delta, data, n_data); 36. dataf = sfstropen(data, n_data); 37. if(indi & ((1<<3)|(1<<2)) ) 38. { src = malloc(n_src); 39. sf = (indi & (1<<3)) ? Target : Source; 40. sfseek(sf, p_src, SEEK_SET); 41. sfread(sf, src, n_src); -7- RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000 42. } 43. win_decode(tar, n_tar, src, n_src, i_code, instf, dataf); 44. sfclose(instf); sfclose(dataf); free(inst); free(data); 45. } COMMENTS. 7-8: These lines read the sizes of the instruction and raw datasets. 9-10: These lines read the index of the instruction code table if needed. 11-14: These lines read the secondary instruction compressor indicator and the size of the compressed instruction dataset. 15-18: These lines read the secondary data compressor indicator and the size of the compressed raw dataset. 19-22: These lines read the size and position of the source data segment. Note that win_inflate() may also be called to decode an instruction code table (Section 5). In that case, "src" will be given and the WT and WS bits should be zero. 23-36: These lines read the instruction and raw datasets. If these were compressed, decompress() is called to reconstruct them. The function decompress() uses the method index f_inst or f_data to decide on the proper operations. Note that these datasets are then made into streams via the sfstropen() calls to ease data accessing later. 37-42: These lines construct the source data segment if any. 43: This line calls win_decode() (Section 4.4) to decode the delta instructions. 44: This line frees resources allocated for processing. 4.2 Integer Encoding Formats Vcdiff compactly encodes integer values using the same formats introduced in the Sfio library [6]. The code presented below is not quite correct with respect to type handling as in Sfio but they show the ideas. 4.2.1 Encoding Integers Using a Variable-Sized Format The variable-sized encoding of integers follows the same algorithm used in the Sfio library. This treats an integer as a number in base 128 so that each digit can be coded in one byte. The eighth bit of such a byte is the continuation bit. It is 1 if there are more digits to come or 0 if it is the last digit. +---------------------+ | 11100000 | 00111001 | +---------------------+ byte 0 byte 1 Table 1 Table 1 shows the variable sized encoding of 12345. The bytes in the encoding are presented in binary to make clear the use of the continuation bit. Omitting the continuation bit, the encoding of 12345 uses two bytes with values 96 and 57. Below are the algorithms: 1. int sfputu(Sfio_t* f, int_t v) 2. { uchar_t c[2*sizeof(int_t)], *s; -8- RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000 3. s = &c[sizeof(c)-1]; 4. *s = v & 127; 5. while((v >>= 7) != 0) 6. *--s = (v & 127) | (1<<7); 7. sfwrite(f, s, (&c[sizeof(c)]) - s); 8. return &c[sizeof(c)]-s; 9. } 10. int_t sfgetu(Sfio_t* f) 11. { int_t b, v; 12. for(v = 0;; ) 13. { b = sfgetc(f); 14. v = (v << 7) | (b & 127); 15. if(!(b & 128) ) 16. return v; 17. } 18. } COMMENTS. 1: This line declares the formal arguments to sfputu(), a stream f to store the encoding and the value v to be encoded. 2: This line declares an array large enough to store the encoding. 4-6: These lines extract digits in base 128 and store them in the array. Note that the right-most digit is extracted first and does not carry a continuation bit. 7-8: These lines write the encoding out to the given stream f and return the length of that encoding. 10-18: These lines define the decoding algorithm. 4.2.2 Encoding Integers with a Given Range The address of a COPY instruction is an unsigned integer with a known range, i.e., it cannot be larger than the current position. Such a value can be encoded more compactly than as done in the Sfio scheme. For example, if a value is in the range 0-255, it can be encoded with a single byte. On the other hand, if the same value happens to be larger than 127, it would require two bytes to encode in the variable-sized scheme. Below are the algorithms to encode integers with known ranges: 1. int sfputm(Sfio_t* f, int_t v, int_t max) 2. { uchar_t c[sizeof(int_t)], *s; 3. s = &c[sizeof(c)-1]; 4. *s = v & 255; 5. while((max >>= 8) != 0) 6. { v >>= 8; 7. *--s = v & 255; 8. } 9. sfwrite(f, s, (&c[sizeof(c)]) - s); 10. return (&c[sizeof(c)]) - s; 11. } 12. int_t sfgetm(Sfio_t* f, int_t max) 13. { int_t v; 14. for(v = 0;;) -9- RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000 15. { v = (v << 8) | sfgetc(f); 16. if((max >>= 8) == 0) 17. return v; 18. } 19. } 4.3 Delta Instruction Coding Format Delta instructions are encoded as control bytes with associated data. Each control byte is an index into an instruction code table of 256 entries. Each code entry is of the type Vcdcode_t as defined below: 1. typedef struct _vcdinst_s 2. { unsigned char type; /* instruction type */ 3. unsigned char size; /* >0 if size is coded here */ 4. unsigned char mode; /* address mode for COPYs */ 5. } Vcdinst_t; 6. typedef struct _vcdcode_s 7. { Vcdinst_t inst1; /* first instruction */ 8. Vcdinst_t inst2; /* second instruction */ 9. } Vcdcode_t; Thus, each control byte identifies up to two instructions. Below are the different instruction types: 1. #define VCD_NOOP 0 /* not an instruction */ 2. #define VCD_ADD 1 /* an ADD instruction */ 3. #define VCD_RUN 2 /* a RUN instruction */ 4. #define VCD_COPY 3 /* a COPY instruction */ Each instruction has a size n of the data involved. If the field Vcdinst_t.size is non-zero, it is the value for n. Otherwise, n is encoded next in the instruction dataset as a variable-sized integer. If the instruction is a COPY, the copy address will follow next in the instruction dataset. Its encoding depends on some addressing scheme to be discussed next. A COPY address can be encoded in different ways. The field Vcdinst_t.mode has values in the range [0-7] and defines the below address encoding modes: 1. #define VCD_SELF 0 /* coded as itself */ 2. #define VCD_HERE 1 /* from current position */ 3. #define VCD_SAME 2 /* index into "same" cache */ 4. #define VCD_NEAR 3 /* index into "near" cache */ VCD_SAME and VCD_NEAR indicate two address caching methods designed to take advantage of the heuristic that successive copying addresses tend to be the same or fairly close to one another. Note that as discussed below, there are 4 VCD_NEAR addresses corresponding to Vcdinst_t.mode values in the range [VCD_NEAR,VCD_NEAR+3]. Let A be the COPY address and H the current location in the target data. Below are the encodings: VCD_SELF: A is the next integer in the instruction dataset that is encoded in the range [0-H]. -10- RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000 VCD_HERE: Let B be the next byte in the instruction dataset. Then A is H-B. That is, the distance from A to H, H-A, must have been in the range [0-255]. VCD_SAME: The "same" cache keeps 256 addresses. Let B be the next byte in the instruction dataset. Then A is same[B]. VCD_NEAR: The "near" cache keeps 4 addresses. Let B be the next byte in the instruction dataset and t be Vcdinst_t.mode-VCD_NEAR. Then, A is near[t]+B-127. Below are the algorithms to maintain address caches. 1. typedef struct _cache_s 2. { int n; 3. int near[4]; 4. int same[256]; 5. } Cache_t; 6. cache_init(Cache_t* ka) 7. { int i; 8. for(i = 0; i < 4; ++i) 9. ka->near[i] = 0; 10. ka->n = 0; 11. for(i = 0; i < 256; ++i) 12. ka->same[i] = 0; 13. } 14. cache_update(Cache_t* ka, int_t addr) 15. { ka->near[ka->n] = addr; 16. if((ka->n += 1) >= 4) 17. ka->n = 0; 18. ka->same[addr&255] = addr; 19. } COMMENTS. 1-5: These lines define the caches. As discussed, the "near" cache has 4 addresses and the "same" cache has 256 addresses. 6-13: These lines initialize addresses in the caches to zero. 14-19: These lines update the caches with a given address. Note that the "near" cache is updated in a round-robin manner and the lower eight bits of an address is its index in the "same" cache. Below is the function to encode a COPY address: 1. int addr_encode(Cache_t* ka, int addr, int here, int* best) 2. { int i, d, mode = -1; 3. if((d = (here-addr)) < 256) 4. { *best = d; 5. mode = VCD_HERE; 6. } 7. else if(ka->same[d = addr&255] == addr) 8. { *best = d; -11- RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000 9. mode = VCD_SAME; 10. } 11. else 12. { for(i = 0; i < 4; ++i) 13. { if((d = addr - ka->k_near[i]) >= -127 && d <= 128) 14. { *best = d + 127; 15. mode = VCD_NEAR+i; 16. } 17. } 18. } 19. if(mode < 0) 20. { *best = addr; 21. mode = VCD_SELF; 22. } 23. cache_update(ka,addr); 24. return mode; 25. } COMMENTS. 1: This lines declare the formal arguments. "addr" is the address to be encoded. "here" is the current location in the target data. "best" is used to return the value to be used to encode "addr". 3-6: If "addr" is within the range [0-255] away from the current location "here", then the addressing mode is VCD_HERE and the address is encoded as a single byte showing this distance. 7-10: Using the lower eight bits of "addr" to index the "same" cache, if the address in the cache exactly matches "addr", then VCD_SAME is used and "addr" is encoded as this index. 12-18: Check each address in the "near" cache to see if "addr" is within [-127,128] away from such an address. If there is one, the addressing mode is VCD_NEAR plus the index of the address and "addr" is encoded as the distance plus 127 (so the encoded value is in the range [0-255]). 19-22: If none of the above addressing modes applies, then VCD_SELF is used and "addr" is encoded as a value in the range [0-here]. 23: This line updates the address caches. Below is the function to decode a COPY address: 1. int addr_decode(Cache_t* ka, int here, int type, Sfio_t* instf) 2. { int addr; 3. if(type == VCD_SELF) 4. addr = sfgetm(instf, here); 5. else if(type == VCD_HERE) 6. addr = here - sfgetc(instf); 7. else if(type == VCD_SAME) 8. addr = ka->same[sfgetc(instf)]; 9. else addr = ka->near[type] + sfgetc(instf) - 127; 10. cache_update(ka, addr); 11. return addr; 12. } 4.4 Decoding A Target Window -12- RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000 The algorithm to decode a target window is as follows: 1. win_decode(uchar_t* tar, int n_tar, uchar_t* src, int n_src, int i_code, Sfio_t* instf, Sfio_t* dataf) 2. { int_t here, size, addr, i; 3. Cache_t ka; 4. Vcdinst_t *inst; 5. Vcdcode_t *code, *table = Code[i_code]; 6. cache_init(&ka); 7. for(here = 0; here < n_tar; ) 8. { code = &table[sfgetc(instf)]; 9. for(i = 1; i <= 2; ++i) 10. { inst = i == 1 ? &code->inst1 : &code->inst2; 11. if(inst->type == VCD_NOOP) 12. continue; 13. if((size = inst->size) == 0) 14. size = sfgetu(instf); 15. if(inst->type == VCD_ADD) 16. { for(; size > 0; --size) 17. tar[here++] = sfgetc(dataf); 18. } 19. else if(inst->type == VCD_RUN) 20. { int c = sfgetc(dataf); 21. for(; size > 0; --size) 22. tar[here++] = c; 23. } 24. else if(inst->type == VCD_COPY) 25. { uchar_t* from; 26. addr = addr_decode(&ka,here,inst->mode,instf); 27. from = addr < nsrc ? src+addr : tar+addr-nsrc; 28. for(; size > 0; --size) 29. tar[here++] = *from++; 30. } 31. } 32. } 33. } COMMENTS. 5: "table" is initialized to be the instruction code table to be used. We assume that "Code" is a global variable pointing to the list of 256 possible code tables. 6: This line initializes the address caches. 8: This line reads a control byte from the instruction dataset and gets the corresponding code table to decode instructions. 9-32: These lines process the given pair of delta instructions. Note that the data for VCD_ADD and VCD_RUN are read from the raw dataset. 5. Instruction Code Tables The delta instructions are encoded based on some instruction code table. The Vcdiff format allows applications to tailor such code tables to the particular data characteristics to enhance output compactness. Up to 256 instruction tables with indices in the range [0,255] are allowed. However, the first 8 indices [0,7] are reserved by Vcdiff. 5.1 The Encoding of a Code Table in a Delta File -13- RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000 It is desirable to compactly encode a code table in a delta file. To accomplish this, we first represent a table to be output as a string. Since each table entry encodes two instructions and each is 3 bytes, the total string size is 6*256 or 1536 bytes. Such a string can then be compared against some other code table string (e.g., that of table 0) using the window encoding algorithm win_deflate() to reduce the output data. Thus, the format for a code table in a delta file looks like this: Code table to compare with Encoded data Code table to compare with: This is a single byte indicating the code table used to compare the current table with. Encoded data: The data uses the same format as that of a window (Section 4.1.1) where the target data is the code string of the table being encoded and the source data segment is the code string of the code table with the above index. Two functions tab2str() and str2tab() are used for converting between a code table and its code string. Below is the description of tab2str(). str2tab() is just as straightforward so its description will be omitted. Note that bytes of the same type are grouped together to induce more matching when code strings are compared. 1. tab2str(Vcdcode_t* tab, uchar_t data[6*256]) 2. { int i, n; 3. n = 0; 4. for(i = 0; i < 256; ++i) 5. data[n++] = tab[i].inst1.type; 6. for(i = 0; i < 256; ++i) 7. data[n++] = tab[i].inst2.type; 8. for(i = 0; i < 256; ++i) 9. data[n++] = tab[i].inst1.size; 10. for(i = 0; i < 256; ++i) 11. data[n++] = tab[i].inst2.size; 12. for(i = 0; i < 256; ++i) 13. data[n++] = tab[i].inst1.mode; 14. for(i = 0; i < 256; ++i) 15. data[n++] = tab[i].inst2.mode; 16. } Below is the function code_define() to retrieve an encoded instruction code table from the delta file. 1. code_define() 2. { uchar_t src[6*256], tar[6*256]; 3. int index; 4. index = sfgetc(Delta); 5. tab2str(Code[sfgetc(Delta)], src); 8. win_inflate(0, tar, 1536, src, 1536); 9. Code[index] = str2tab(data); 10. } -14- RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000 COMMENTS. 4: This line reads the index to install the new code table. 5: This line constructs the source string to be compared with. 8-9: These lines construct the new table and install it. 5.2 Reserved Instruction Code Tables A COPY instruction with small data size may use more encoding space than the data itself. Thus, it is good to restrict COPY instructions to some minimum sizes. In turn, a code table can take advantage of such a restriction to enhance compactness. Vcdiff provides two reserved instruction code tables with indices 0 and 1 for minimium COPY sizes of 4 and 3 respectively. As noted, the starting code table has index 0 so it is optimized for minimum COPY size of 4. Note also that even when a table optimized for a particular minimum COPY size, this does not mean that a COPY instruction with a smaller size cannot be encoded. It simply means that the size of such an instruction must be coded separately. TYPE SIZE MODE TYPE SIZE MODE INDEX ------------------------------------------------------------- a. RUN 0 NOOP 0 b. ADD 0 NOOP 1 c. ADD [1,14] NOOP [2,15] d. COPY 0 [0,6] NOOP [16,22] e. COPY [M,M+10] [0,6] NOOP [23,99] f. COPY 0 [0,6] ADD [1,4] [100,127] g. COPY [M,M+3] [0,6] ADD [1,4] [128,239] h. COPY [M,M+3] SELF COPY [M,M+3] SELF [240,255] ------------------------------------------------------------- Table 2 For a given minimum COPY size M, Table 2 shows how the 256 code values are partitioned into different categories of indices: a. Index 0 defines a RUN instruction whose size is always coded separately as an unsigned integer. This is signified by having the "size" field being 0. b. Index 1 defines an ADD instruction with size coded separately. c. Indices 2-15 define 14 ADD instructions with sizes in the range [1,14]. d. Indices 16-22 define 7 COPY instructions for the 7 different address encoding modes discusses in Section 4.3. The sizes of these instructions are coded separately. e. Indices 23-99 define 77 COPY instructions using all 7 addressing modes and having sizes in the range [M,M+10]. f. Indices 100-127 defines 28 pairs of COPY and ADD instructions. The COPY sizes are coded separately while the ADD sizes are in the range [1,4]. g. Indices 128-239 define 112 pairs of COPY and ADD instructions whose sizes are in the ranges [M,M+3] and [1,4] respectively. h. Finally, indices 240-255 defines 16 pairs of COPY and COPY instructions with sizes in the range [M,M+3] and both using the same VCD_SELF addressing mode. Next we present the algorithms to construct an instruction code table. -15- RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000 Where appropriate, the algorithms also construct the inverted tables usable for data encoding. Any part of a table not explicitly set will be assumed to be initialized to be either zero or the null pointer as the case warranted. 1. mk_codetab(Vcdcode_t tab[256], int add[16], int copy[15][7], int copyadd[8][7][5], int copycopy[8][8], int M) 2. { 3. tab[0].inst1.type = VCD_RUN; 4. tab[0].inst1.size = 0; 5. tab[0].inst2.type = VCD_NOOP; 6. mk_add(tab, add); 7. mk_copy(tab, copy, M); 8. mk_copyadd(tab, copyadd, M); 9. mk_copycopy(tab, copycopy, M); 10. } COMMENTS. 1: This line declares the arguments for mk_codetab(). The inverted tables add[], copy[][], copyadd[][][], and copycopy[][] are designed to be indexed by sizes and address modes. The argument M is the minimum COPY size used for the table. 3-5: These lines construct the RUN instruction. 6-9: These lines call various subroutines to construct different parts of the instruction table. Below is the algorithm to construct the single ADD instructions: 1. mk_add(Vcdcode_t tab[256], int add[16]) 2. { int s, n; 3. tab[1].inst1.type = VCD_ADD; 4. tab[1].inst1.size = 0; 5. tab[1].inst2.type = VCD_NOOP; 6. add[0] = 1; 7. for(n = 2, s = 1; s <= 14; ++s, ++n) 8. { tab[n].inst1.type = VCD_ADD; 9. tab[n].inst1.size = s; 10. tab[n].inst2.type = VCD_NOOP; 11. add[s] = n; 12. } 13. } COMMENTS. 3-6: These lines construct the ADD instruction whose size is to be coded separately. Location 0 in the inverted table add[] points to this code. 7-12: These lines construct the ADD instructions with size in the range [1-14] as discussed earlier. Below is the function to construct the single COPY instructions: 1. mk_copy(Vcdcode_t tab[256], int copy[15][7], int M) -16- RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000 2. { int s, m, n; 3. for(n = 16, m = 0; m < 7; ++m, ++n) 4. { tab[n].inst1.type = VCD_COPY; 5. tab[n].inst1.size = 0; 6. tab[n].inst1.mode = m; 7. tab[n].inst2.type = VCD_NOOP; 8. copy[0][m] = n; 9. } 10. for(s = M; s <= M+10; ++s) 11. { for(m = 0; m < 7; ++m, ++n) 12. { tab[n].inst1.type = VCD_COPY; 13. tab[n].inst1.size = s; 14. tab[n].inst1.mode = m; 15. tab[n].inst2.type = VCD_NOOP; 16. copy[s][m] = n; 17. } 18. } 20. } COMMENTS. 3-9: These lines construct the 7 COPY instructions with whose data sizes are coded separately. 10-18: These lines construct the 77 COPY instructions whose data sizes are in the range [M,M+10]. The below function constructs the COPY/ADD instruction pairs: 1. mk_copyadd(Vcdcode_t tab[256], int copyadd[8][7][5], int M) 2. { int s, m, a, n; 3. for(n = 100, m = 0; m < 7; ++m) 4. { for(a = 1; a <= 4; ++a, ++n) 5. { tab[n].inst1.type = VCD_COPY; 6. tab[n].inst1.size = 0; 7. tab[n].inst1.mode = m; 8. tab[n].inst2.type = VCD_ADD; 9. tab[n].inst2.size = a; 10. copyadd[0][m][a] = n; 11. } 12. } 13. for(s = M; s <= M+3; ++s) 14. { for(m = 0; m < 7; ++m) 15. { for(a = 1; a <= 4; ++a, ++n) 16. { tab[n].inst1.type = VCD_COPY; 17. tab[n].inst1.size = s; 18. tab[n].inst1.mode = m; 19. tab[n].inst2.type = VCD_ADD; 20. tab[n].inst2.size = a; 21. copyadd[s][m][a] = n; 22. } 23. } 24. } 25. } Finally, the algorithm to construct the 16 COPY/COPY pairs: -17- RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000 1. mk_copycopy(Vcdcode_t tab[256], int copycopy[8][8], int M) 2. { int s, c, n; 3. for(n = 240, s = M; s <= M+3; ++s) 4. { for(c = M; c <= M+3; ++c, ++n) 5. { tab[n].inst1.type = VCD_COPY; 6. tab[n].inst1.size = s; 7. tab[n].inst1.mode = VCD_SELF; 8. tab[n].inst2.type = VCD_COPY; 9. tab[n].inst2.size = c; 10. tab[n].inst2.mode = VCD_SELF; 11. copycopy[s][c] = n; 12. } 13. } 14. } 6. Compression and Differencing Algorithms Sections 4 and 5 describe how to reconstruct a target file from data in a delta file. In this section, we discuss briefly general architectures for compressors and differencers. A few abstract functions are assumed: int win_target(uchar_t** tar); This function may be called many times to get sequential windows of data from the target file to be compressed. It returns the length of the window while the window itself is returned in "*tar". int win_match(uchar_t* tar, int n_tar, int p_tar, uchar_t** src, int* n_src, int_t* p_src); This function computes a source data segment, if any, to be compared with a target window. It returns 0, 1 or 2 for no source segment, source segment from the source file, or source segment from the target file. The arguments "src", "n_src" and "p_src" are used to return the source data segment, its size, and its position in either the source file or target file. int str_match(uchar_t* tar, int n_tar, int p_tar, uchar_t* src, int n_src, int* match); This function computes a string in either the source data segment or the target window that matches with a prefix of the target data being processed. Such a matched string can be encoded as a COPY instruction. The argument "p_tar" indicates the current position in the target window to be processed. The function returns the length of the match and, if that is positive, "*match" is set to the match position. int run_check(uchar_t* tar, int n_tar, int p_tar); This function checks to see if there is a run of the same bytes starting at the current location in the target string. The performance of an encoder depends on what algorithms are used to implement the above functions. However, as shown earlier, decoding performance is completely independent of such choices. Below is the algorithm for encoding a target file. We shall assume -18- RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000 that there is no secondary encoding for the resulting instruction and raw datasets. 1. sfputc(Delta,0xd6); 2. sfputc(Delta,0xc3); 3. sfputc(Delta,0xc4); 4. sfputc(Delta,0); 5. for(p_tar = 0; ; p_tar += n_tar) 6. { if((n_tar = win_target(&tar)) <= 0) 7. break; 9. win_deflate(tar, n_tar); 10. } COMMENTS. 1-4: These lines output the 4-byte header for a delta file. 6: This line calls win_target() to get a target window. If the target file has been completely processed, the return value is non-positive and the loop terminates. 8: This line computes a source data segment to be compared with the given target window. 9: This line calls win_deflate() to compute the delta instructions and output the data to the delta file. Below is the function win_deflate(): 1. win_deflate(uchar_t* tar, int n_tar) 2. { 3. int type, indi, n_src; 4. int_t p_src; 5. uchar_t *src; 6. Sfio_t *instf, *dataf; 7. type = win_match(tar, n_tar, &src, &n_src, &p_src); 8. instf = sfstropen(NULL,0); 9. dataf = sfstropen(NULL,0); 10. win_encode(tar, n_tar, src, n_src, instf, dataf); 11. indi = type == 0 ? (1<<2) : type == 1 ? (1<<3) : 0; 12. sfputc(Delta, indi); 13. sfputu(Delta, n_tar); 14. sfputu(Delta, sfsize(instf)); 15. sfputu(Delta, sfsize(dataf)); 16. sfmove(instf,Delta,-1,-1); 17. sfmove(dataf,Delta,-1,-1); 18. sfclose(instf); sfclose(dataf); 18. } COMMENTS. 7: This line calls win_match() to obtain a source data segment. 8-9: These lines create two streams to output instructions and raw data. 10: This line calls win_encode() to compute the delta instructions. 11-12: These lines output the indicator byte. -19- RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000 13-15: These lines output the sizes of the target window and instruction and raw datasets. 16-17: These lines use the Sfio function sfmove() to output the instruction and raw datasets to the delta file. Below is the function to compute encoding of a target window: 1. int win_encode(uchar_t* tar, int n_tar, uchar_t* src, int n_src, Sfio_t* instf, Sfio_t* dataf) 2. { int add, here, addr, s, size; 3. Cache_t ka; 4. cache_init(&ka); 5. for(here = 0, size = 0, add = -1; here < n_tar; ) 6. { if( (s = run_check(tar, n_tar, here)) > 0) 7. addr = -1; 8. else s = str_match(tar,n_tar,here,src,n_src,&addr); 9. if(s > 0) 10. { if(add >= 0) 11. size += add_encode(here-add,tar+here-add,instf,dataf); 12. if(addr < 0) 13. size += run_encode(s,tar[here],instf,dataf); 14. else size += copy_encode(&ka,s,addr,here,instf); 15. add = -1; 16. here += s; 17. } 18. else 19. { if(add < 0) 20. add = here; 21. here += 1; 22. } 23. } 24. if(add >= 0) 25. size += add_encode(here-add,tar+here-add,instf,dataf); 26. else size += copy_encode(0,0,0,0); 27. return size; 28. } COMMENTS. 4: This line initializes the address caches. 6-8: These lines check to see if there is a RUN or a COPY instruction. The function run_check() is straightforward to implement so its description will be omitted. 9-17: These lines output the appropriate delta instruction. 19-22: These lines are executed if there is no RUN or COPY instruction. The "add" variable is set to collect the data for an ADD instruction. 24-25: These lines output the last unmatched part of the target window in an ADD instruction. 26: This line outputs the last saved COPY instruction, if any. See the function copy_encode() below. 27: This line returns the number of bytes used for encoding the data. To finish up the encoding algorithm, we need to describe the functions add_encode(), run_encode() and copy_encode(). We shall assume that -20- RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000 the default code instruction table 0 is used. The below code declares the tables and define a few convenience macro functions. 1. Vcdcode_t Tab[256]; 2. Vcdcode_t* Add[16]; 3. Vcdcode_t* Copy[15][7]; 4. Vcdcode_t* Copyadd[8][7][5]; 5. Vcdcode_t* Copycopy[8][8]; 6. #define INDEXADD(a) ((a >= 1 && a <= 14) ? a : 0) 7. #define INDEXCOPY(c) ((c >= 4 && c <= 14) ? c : 0) 8. #define INDEXCOPYADD(c) ((c >= 4 && c <= 7) ? c : 0) 9. #define ISTINYADD(a) ((a >= 1 && a <= 4) ? 1 : 0) 10. #define ISTINYCOPY(c) ((c >= 4 && c <= 7) ? 1 : 0) COMMENTS. 6-8: These lines define functions to compute the index into the tables Add, Copy, and Copyadd for any given ADD or COPY size. 9-10: These lines define functions to test if a given size is small. These are used to determine if certain pair of instructions could be coded using a single byte code. Below is the function to encode a COPY instruction. 1. int Size, Addr, Here, Mode; 2. int copy_encode(Cache_t* ka,int size,int addr,int here,Sfio_t* instf) 3. { int code, mode = 0, best = 0, ndel = 0; 4. if(size > 0) 5. mode = addr_encode(ka, addr, here, &best); 6. if(Size > 0) 7. { if(Mode == mode && mode == VCD_SELF && 8. ISTINYCOPY(Size) && ISTINYCOPY(size) ) 9. { code = Copycopy[Size][size]; 10. ndel += sfputc(instf, code); 11. ndel += sfputm(instf, Addr, Here); 12. ndel += sfputm(instf, addr, here); 13. Size = -1; 14. return ndel; 15. } 16. code = Copy[INDEXCOPY(Size)][Mode]; 17. ndel += sfputc(instf, code); 18. if(Tab[code].inst1.size == 0) 19. ndel += sfputu(instf, Size); 20. if(Mode == VDC_SELF) 21. ndel += sfputm(instf, Addr, Here); 22. else ndel += sfputc(instf, Addr); 23. } 24. Size = size; 25. Addr = best; 26. Mode = mode; 27. Here = here; 28. return ndel; 29. } -21- RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000 COMMENTS. 1: This line defines variables to save a COPY instruction so that it may be coded jointly with another instruction later. 4-5: These lines compute the mode and value to encode the given COPY address if there is one (i.e., "size" is positive). 6-23: These lines process a previously saved COPY instruction. 7-15: These lines output a code that encode both COPY instructions. Note that this is done only when the data sizes are small and both instructions are using the VCD_SELF addressing mode (Table 2). 16-22: These lines output the previously saved COPY instruction singly. 24-27: These lines save the current COPY instruction if it has not been merged to a previous COPY instruction as discussed. Below is the function to output an ADD instruction: 1. int add_encode(int size, uchar_t* data, Sfio_t* instf, Sfio_t* dataf) 2. { int code, ndel = 0; 3. if(Size > 0) 4. { if(ISTINYADD(size)) 5. { code = Copyadd[INDEXCOPYADD(Size)][Mode][size]; 6. ndel += sfputc(instf, code); 7. if(Tab[code].inst1.size == 0) 8. ndel += sfputu(instf, Size); 9. if(Mode == VCD_SELF) 10. ndel += sfputm(instf, Addr, Here); 11. else ndel += sfputc(instf, Addr); 12. for(; size > 0; --size, ++data) 13. ndel += sfputc(dataf, *data); 14. return ndel; 15. } 16. else ndel += copy_encode(0,0,0,0); 17. } 18. code = Add[INDEXADD(size)]; 19. ndel += sfputc(instf, code); 20. if(Tab[code].inst1.size == 0) 21. ndel += sfputu(instf, size); 22. ndel += size; 23. for(; size > 0; --size, ++data) 24. sfputc(dataf, *data); 25. return ndel; 26. } COMMENTS. 3-17: These lines consider the case when there is a previously saved COPY instruction. Then, if the ADD size is small, a single code is output for both COPY and ADD instructions. Otherwise, the COPY instruction is output by itself. 18-24: These lines output an ADD instruction by itself. Below is the function to output a RUN instruction: -22- RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000 1. int run_encode(int size, int byte, Sfio_t* instf, Sfio_t* dataf) 2. { int ndel = 0; 3. if(Size > 0) 4. ndel += copy_encode(0,0,0,0); 5. ndel += sfputc(instf,0); 6. ndel += sfputu(instf,size); 7. ndel += sfputc(dataf,byte); 8. return ndel; 9. } COMMENTS. 3-4: These lines output a previously saved COPY instruction, if any. 5-7: These lines output the RUN instruction. 7. Summary We have described a general and portable encoding format for compression and differencing. The format is compact. For compression only, using the same LZ-77 string parsing strategy and without any secondary encoders, the typical compression rate is better than Unix compress and close to gzip. For differencing, the same data format is better than all known methods. Ignoring application-specific secondary encoder issues, the decoding algorithms run in linear time and require working space proportional to window sizes. ACKNOWLEDGEMENTS Thanks are due to Balachander Krishnamurthy, Jeff Mogul and Arthur Van Hoff who provided much encouragement to publicize Vcdiff. REFERENCES [1] D.G. Korn and K.P. Vo, Vdelta: Differencing and Compression, Practical Reusable Unix Software, Editor B. Krishnamurthy, John Wiley & Sons, Inc., 1995. [2] J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data Compression, IEEE Transactions on Information Theory, 23(3):337-343, May 1977. [3] W. Tichy, The String-to-String Correction Problem with Block Moves, ACM Transactions on Computer Systems, 2(4):309-321, November 1984. [4] E.M. McCreight, A Space-Economical Suffix Tree Construction Algorithm, Journal of the ACM, 23:262-272, 1976. [5] J.J. Hunt, K.P. Vo, W. Tichy, An Empirical Study of Delta Algorithms, IEEE Software Configuration and Maintenance Workshop, 1996. [6] G.S. Fowler, D.G. Korn, K.P. Vo, Sfio: A buffered I/O Library, Accepted for publication in Software Practice & Experience, 1999. AUTHOR'S ADDRESS -23- RFC 1 VCDIFF Generic Differencing and Compression Data Format March 2000 Kiem-Phong Vo (main contact) AT&T Labs, Room D223 180 Park Avenue Florham Park, NJ 07932 Phone: 973-360-8630 Email: kpv@research.att.com David G. Korn AT&T Labs, Room D237 180 Park Avenue Florham Park, NJ 07932 Phone: 973-360-8602 Email: dgk@research.att.com EXPIRATION DATE September 09, 2000 -24-