This file describes the design, data model, and file formats of FSFS index files. Design ====== For each pack and each rev file using logical addressing, there is exactly two index files. One, the log-to-phys index, maps the (rev, item_index) pairs to absolute file offsets. The other, phys-to-log, is a reverse index that gives basic information on any file location. This is enough to read and cache any data without traversing DAGs. Rev and pack files are immutable, so the same is true for index files. During a transaction or while packing a file, a proto index file gets written (actually, one log-to-phys and one phys-to-log). Its format is a simple concatenation of runtime structs and as such, an implementation detail subject to change. A proto index basically aggregates all the information that must later be transformed into the final index. General design concerns ----------------------- In Subversion, there is no limit to the size of a revision; even practical limits are in the order of millions of changes at least. Index data for these would be multiple megabytes in size with pack file indexes possibly approaching 1 GB. To ensure we still get roughly O(1) access time, we need a hierarchical data structure. Therefore, the indexes will start with a header containing an array of references to sub-sections or pages. The length of these pages varies but is limited to a size configurable in fsfs.conf. Finally, it is assumed that whole pages can be cached efficiently and with a high cache hit rate. So, although a page may have a thousand or more entries, the access time is still amortized O(1) in many scenarios. Items and item types -------------------- The index implementation treats item_index and item type as simple ints, except for SVN_FS_FS__ITEM_INDEX_UNUSED and SVN_FS_FS__ITEM_TYPE_UNUSED. Since they have been defined as 0, the code may test for "used" etc. by simply comparing with 0. See section "addressing modes" in structure to a list of item types and pre-defined item_index values. Encoding -------- The final index file format is tuned for space and decoding efficiency. Indexes are stored as a sequence of variable integers. The encoding is as follows: * Unsigned integers are stored in little endian order with a variable length 7b/8b encoding. If most significant bit a byte has been set, the next byte has also belongs to the same value. 0x00 .. 0x7f -> 0x00 .. 0x7f ( 7 bits stored in 8 bits) 0x80 .. 0xff -> 0x80 0x01 .. 0xff 0x01 (14 bits stored in 16 bits) 0x100 .. 0x3fff -> 0x80 0x02 .. 0xff 0x7f (14 bits stored in 16 bits) 0x100000000 -> 0x80 0x80 0x80 0x80 0x10 (35 bits stored in 40 bits) Technically, we can represent integers of arbitrary lengths. Currently, we only generate and parse up to 64 bits. * Signed integers are mapped onto the unsigned value space as follows: x >= 0 -> 2 * x x < 0 -> -2 * x - 1 Again, we can represent arbitrary length numbers that way but the code is currently restricted to 64 bits. Most data is unsigned by nature but will be stored differentially using signed integers. Log-to-phys index ================= This index has to map (rev, item_index) -> offset. It assumes that the item_index values per revision are dense and start at 0. There may be unused item_index values, though; the data structure simply gets less space-efficient when the more sparse the value space gets. Index data model ---------------- hierarchy: header -> per-revision info -> page -> offset There is one entry per revision in the header. Per revision there are one or more pages (exclusive to that revision) containing up to a known, fixed limit of entries (= page size). The total access path is: pages = header->pages[revision]; offsets = page = pages[item_index / page_size]; offset = offsets[item_index % page_size]; Different log-to-phys indexes in the same repository may have different page sizes but within any given index file, the page size is the same and immutable. header: ... first revision covered by this index file ... number of revision covered by this index file ... maximum number of entries per page ... array, for each revision containing the index in of the first page that belongs to this revision. This has +1 entries to terminate the last revision. ... array of page headers. It has [] entries. page table: ... absolute position of the page contents within the index file ... number of offset entries in the page. Must match
. unless this is the last page for the respective revision. ... length in bytes of the on-disk page description. Note that this is redundant with the . page: ... number of offset entries in the page. Must match
. unless this is the last page for the respective revision. Redundant with . ... array of absolute file positions within the rev / pack file. This has entries. Index file format ----------------- file := header revisions pages offsets header := u(
.) \ u(
.) \ u(
.) \ u(s(
.)) revisions := u(
.[k+1] -
.[k]), for k in 0 ..
.-1 pages := u(
.[k].) \ u(
.[k].), for k in 0 .. s(
.)-1 offsets := i(
.[k].[0]) \ i(
.[k].[l] \ -
.[k].[l - 1]), for l in 1 .. s(
.[k].)-1, for k in 0 .. s(
.)-1 u(x) ... unsigned int x in 7b/8b encoding i(x) ... signed int x in 7b/8b encoding s(x) ... number of entries in array x Proto index file format ----------------------- The index will be created from a "revision-less" proto index file containing () pairs only. All mappings belonging to the same revision must be written in one go but there is no restriction on the order of those entries. To signify that a new revision begins, a (0, 0) mapping must be written. A (0, 0) entry at the beginning of the file is optional and will be ignored. /* begin of proto index file for revision r and following */ (0, 0) /* mark start of revision r, optional for first rev */ (off, item)* /* zero to many mappings in random order */ (0, 0) /* mark start of revision r + 1 */ (off, item)* /* zero to many mappings in random order */ (0, 0) /* mark start of revision r + 2 */ (off, item)* /* zero to many mappings in random order */ ... /* end of file. */ All entries are pairs of 64 bit unsigned integers in machine endianess. Phys-to-log index ================= This index has to map offset -> (rev, item_index, type, len, checksum). Index data model ---------------- hierarchy: header -> page -> item info Logically, the index splits up index rev / pack file into pages of a fixed size. That page size may differ from the FS's block size. The index will describe each rev / pack file page with one index page. page = header->pages[offset % page_size]; item info = binary_search(page.data, offset) Note that the current implementation will not return any data if the offset is does not match any actual item start. To simplify the lookup, the last index page will have an "unused item" entry for the section behind EOF. Holes aren't allowed as well, i.e. every byte of the rev / pack is expected to be covered by the index file. Also, there may be items stretching across page borders or even over multiple pages. The data model solves this issue by storing the item descriptions as a "primary array" and then representing the pages as ranges within that array. Thus multiple pages may reference the same item description. header: ... first revision covered by this index file ... size of the rev / pack file in bytes ... number of bytes in the rev / pack file covered by each index page ... number of pages ... array of page offsets, i.e. locations the page data within the index. This array has + 1 entries. page: ... array of item descriptions, ordered by offset. First and last entry may cross page boundaries. entry: ... absolute position in the pack / rev file ... on-disk size of the item in the pack / rev file ... item type ... modified 32 bit FNV-1a checksum of that section of the pack / rev file (see below). 0 for empty or zero-length items ... revision that this item belongs to ... item_index within that revision Index file format ----------------- file := header pages items header := u(
.) \ u(
.) \ u(
.) \ u(
.) pages := u(
.[k+1] -
.[k]), for k in 0 ..
. -1 items := u([0].) \ u([l].) \ i(c([l]) - c([l-1])) \ i( [l]. - [l-1].), \ u(FNV checksum) for l in 0 .. s()-1, for k in 0 ..
.-1 u(x) ... unsigned int x in 7b/8b encoding i(x) ... signed int x in 7b/8b encoding s(x) ... number of entries in collection x c(x) ... compound function := x. * 8 + x. Access to negative indexes gives a 0 value. are in strict ascending offset order. Items that started after the begin of a given page and overlap with the next page will not be stored in the start page. The runtime representation will duplicate items overlapping page boundaries; the on-disk representation will not. Proto index file format ----------------------- The index will be created from a proto index file containing simple instances of the in-memory representation of svn_fs_fs__p2l_entry_t. Page table and header information, except start revision and page size, can easily be derived from that information. All entries must be written in strict offset order. Overlapping entries are not allowed; zero-length items are. In transactions, the final revision number may not be known when writing the proto index file (e.g. while still writing the proto rev file). Items with revision set to SVN_INVALID_REVNUM will therefore be automatically updated when creating the index file. This is possible in conjunction with rev files but not for pack files. FNV checksum ------------ FNV-1a can be found here: http://www.isthe.com/chongo/tech/comp/fnv/ For performance reasons we use a modified version: * split the input byte stream [b0 .. bN] into 4 sub-streams of equal length and up to 3 remnants: [b0 b4 b8 ..], [b1 b5 b9 ..], [b2 b6 b10 ..], [b3 b7 b11 ..], [remnant] * calculate 32 bit FNV-1a checksums for the 4 substreams: h0 = fnv_1a([b0 b4 b8 ..]), ..., h3 = fnv_1a([b3 b7 b11 ..]) * combine the big endian representation of these checksums plus the remnant of the original stream into a 12 to 15 byte long intermediate [i0 .. iK], 12 <= K+1 <= 15 * FNV checksum = fnv_1a([i0 .. iK]) in big endian representation