perf tests:
  - GRRR -- indexing MUCH slower now?
      trunk:
        Indexer: finished (960889 msec)
        Indexer: net bytes indexed 9635556306
        Indexer: 33.62065752061142 GB/hour plain text
      branch:
        Indexer: finished (1048065 msec)
        Indexer: net bytes indexed 9635556306
        Indexer: 30.824156883707392 GB/hour plain text

  - try *larger* maxItemsInBlock: could give better net perf?  ie
    less seeking and more scanning

what to do about short terms that "force" a block to mark itself as hasTerms!!??
  - maybe instead of "isLeafBlock" bit we encode
    "countUntilNextTerm"?  this way, for a block that only has the
    empty string term we can stop scanning quickly?
  - maybe make a "terms block cache" that holds low-prefix LRU term
    blocks in ram...?
  - maybe a cache holding all short-length terms will be a big perf
    boost?  it saves having to scan the low-depth blocks... or...
    maybe a bit noting whether this block contains any terms != empty
    string suffix; or, we separately hold all 'short'/'straggler'
    terms in a map, enabling the low-depth blocks to then 'lie' and
    say they have no terms?
  - hmm -- should I do something "special" for prefix terms?  ie short
    terms like 'a' that force a "fake" block (having only the one term
    'a').  if i don't do something special, any time we seek a* we
    will have to scan this block?  try forcing no hasTerms if depth < 2?

LATER:
  - test if cutting over prefix query to .intersect is faster
  - maybe blocks should NOT store sub-block pointers?  it's redundant
    w/ the index...
  - hmm: maybe switch PKLookupTask to intersect!?  do we have a fast
    string builder?
  - hmm -- fix DOT when there are multiple outputs!?  oh, maybe not --
    it just works?
  - maybe we should provide a "terms dict rewriter" tool?  ie can
    rewrite terms dict w/ new settings after segment was already
    created
  - intersect - can have an "allow terms out of order" mode... eg w/
    the IntersectedTermsEnum?  that could be HUGE gain
  - would be nice to bake into FST outputs this ability to pack bits
    (ie multiple outputs) into a single long output... instead of app
    having to do its own packing
  - maybe: allow more bytes to be spent on index WITHOUT changing the
    blocking?  ie add next-byte into index, but don't change the term
    blocks
      - ie, allow the index to "reach in" and index first/2nd/etc.
        bytes of the prefixes w/in a block?
      - ie, if block 'foo' has 22 entries, but they all start with
        either 'a' or 'e' then i can safely store fooa/fooe in the
        index, pointing to the same block
  - should we re-shuffle the blocks into "depth-first" order...?
  - if entire terms index shares a certain prefix (eg 0000) then
    optimize this case -- pull out a common prefix, once, so don't do
    arc-by-arc scan for that
      - ooh: for this case, instead of the "empty" block, we should
        store the 0000 block as the "root"
  - TERMS DICT should store min, max term, common prefix for fast
    NOT_FOUND case?
  - specialize the "onlyExact" case high up, so we don't sprinkle if's
    all throughout
  - must remove var gap terms index writer/reader
  - should we "align" our term dict blocks w/ disk blocks!?
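
sketch for the "app having to do its own packing" item: a minimal example of what that app-side packing looks like today, ie squeezing a flag plus a file pointer into the one long output.  The class name PackedOutput, its methods, and the bit layout (low bit = hasTerms, upper 63 bits = block file pointer) are all hypothetical, picked only for illustration; this is not the actual terms dict encoding nor any FST API.

```java
// Hypothetical app-side packing of two values into one long output.
// Assumed layout: bit 0 = hasTerms flag, bits 1..63 = block file pointer.
final class PackedOutput {
    // Largest pointer that still fits after shifting left by one bit.
    private static final long MAX_POINTER = Long.MAX_VALUE >>> 1;

    static long pack(long filePointer, boolean hasTerms) {
        if (filePointer < 0 || filePointer > MAX_POINTER) {
            throw new IllegalArgumentException("file pointer out of range: " + filePointer);
        }
        return (filePointer << 1) | (hasTerms ? 1L : 0L);
    }

    // Recover the file pointer by dropping the flag bit.
    static long filePointer(long packed) {
        return packed >>> 1;
    }

    // Recover the flag from the low bit.
    static boolean hasTerms(long packed) {
        return (packed & 1L) != 0;
    }

    public static void main(String[] args) {
        long packed = pack(1234L, true);
        System.out.println("fp=" + filePointer(packed) + " hasTerms=" + hasTerms(packed));
    }
}
```

every caller has to agree on the layout and repeat the shift/mask, which is the boilerplate the item above would like the FST outputs to absorb.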
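
sketch for the "min, max term, common prefix for fast NOT_FOUND" item: one way the check could look, assuming terms are byte[] compared in unsigned byte order.  TermBoundsSketch and all of its methods are hypothetical names, not existing code.  Since terms are sorted, the common prefix of every term equals the common prefix of min and max, so here it is derived from those two rather than stored separately; the prefix test is then just a cheap early exit ahead of the full range compare.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical holder for per-field term bounds; rejects lookups without touching the index.
final class TermBoundsSketch {
    private final byte[] min;
    private final byte[] max;
    private final byte[] prefix; // longest prefix shared by min and max

    TermBoundsSketch(byte[] min, byte[] max) {
        this.min = min;
        this.max = max;
        int n = 0;
        int limit = Math.min(min.length, max.length);
        while (n < limit && min[n] == max[n]) {
            n++;
        }
        this.prefix = Arrays.copyOf(min, n);
    }

    // true means the term cannot be in the dictionary; false means "must look it up".
    boolean definitelyAbsent(byte[] term) {
        if (!startsWith(term, prefix)) {
            return true;
        }
        return compareUnsigned(term, min) < 0 || compareUnsigned(term, max) > 0;
    }

    int commonPrefixLength() {
        return prefix.length;
    }

    private static boolean startsWith(byte[] term, byte[] p) {
        if (term.length < p.length) {
            return false;
        }
        for (int i = 0; i < p.length; i++) {
            if (term[i] != p[i]) {
                return false;
            }
        }
        return true;
    }

    // Lexicographic compare treating bytes as unsigned; a shorter array sorts first on ties.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        TermBoundsSketch b = new TermBoundsSketch(utf8("0000123"), utf8("0000987"));
        System.out.println(b.definitelyAbsent(utf8("apple")));   // rejected by the prefix test
        System.out.println(b.definitelyAbsent(utf8("0000500"))); // in range, needs a real lookup
    }
}
```

a false result only means the term lies inside [min, max]; the normal seek still decides whether it actually exists.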