Increasing the row limit above 32000 rows

Author: Eike Rathke
Document state: CWS rowlimit integrated as of SRC680_m42; CWS rowlimit2 integrated as of SRC680_m52; bugfixes as of SRC680_m59
TODO: beyond 64k plan
Last changed: Monday, December 6, 2004

Table of content

Preliminary
History
Document structure overview
Position accessing methods
List of structs and classes having an USHORT (or short or UINT16 or INT16) row value as member variable
List of defines using a limited row value
Some specials about row numbers
Strategy
Road map and time frame
Table of steps

Preliminary

Currently the spreadsheet application is limited to 32000 rows per sheet. As we all know that is not enough for some uses. In this document I will describe the code changes which have to be done to increase the limit. Where necessary I will also insert some remarks about why the code is like it is, pitfalls and things to have in mind when changing the code. This document is under development, the goal is to have it completed by the end of the year, so changing the code may start next year (apart from some minor changes here and there which can be done on the fly). [Editor's note: that was back in 2001. The goal was shifted twice because of other tasks to be completed with higher priority. Hopefully we'll start modifying code by the end of July / start of August 2002]. You may wonder about that time frame, but increasing the row limit is not a task of changing one defined constant or similar. You will see as this document evolves.

History

Bear in mind that most document structures were designed back in old Win31 days and the number of rows was limited to 8192 then. Win31 couldn't address more than 64kB in one contiguous memory area without huge effort (it could have been 16384 instead of 8192 (what the unnamed competitors program implemented in its version 5) because with size_of_ptr==4 an array of 64kB could hold 16384 pointers to rows, but anyways, it wasn't). In general, addressing a spreadsheet position is done using unsigned short type variables for column, row, sheet (AKA table). Theoretically this would allow 65536 rows to be addressed, but at some places there are short type variables used where relative addressing is needed, for example in cell references. Some special values like USHRT_MAX or MAXROW+1 are used to denote an invalid row position, and furthermore the number of rows has to be dividable by a reasonable number producing a whole-numbered result which is necessary for the mechanism of broadcasting changes in areas where formulas are listening to. Hence the change from 16384 to 32000 instead of 32768.

Document structure overview

Roughly said, a spreadsheet document may consist of up to 256 sheets, each sheet having 256 columns with each column containing a dynamic array with elements containing the row number and a pointer to the cell for a given row. Empty cells are not stored. The array is binary searchable and for evenly distributed filled columns an interpolating search is used. A similar array holds the cell style attributes (font, alignment, number format and so on) with each entry specifying an end row for that attribute, thus formatting an entire column with identical attributes results in only one entry.

Position accessing methods

There are a lot of class ScDocument methods (see inc/document.hxx and source/data/documen*.cxx accessing a cell position by
MethodName( USHORT nCol, USHORT nRow, USHORT nTab );
where at least the nRow parameter will have to be changed to long, but it should be evaluated if all methods taking separated col/row/tab parameters couldn't be changed to take one ScAddress (see below) parameter instead. Similar, the class ScTable methods (see inc/table.hxx and source/core/data/table*.cxx)
MethodName( USHORT nCol, USHORT nRow );
and the class ScColumn methods (see inc/column.hxx and source/core/data/column*.cxx)
MethodName( USHORT nRow );and the class ScAttrArray methods (see inc/attarray.hxx and source/core/data/attarray.cxx)
MethodName( USHORT nRow );
row parameters have to be changed, also all
MethodName( ..., USHORT nStartRow, USHORT nEndRow, ... );
of all of those classes.
Search( USHORT nRow, short& nIndex );
of ScColumn and ScAttrArray and similar are special in a way that the short& nIndex reference is used to return a position within an array where the position may at most be the number of rows used. This of course has to be changed as well.

Additionally, a class ScAddress (see inc/global.hxx and source/core/data/global2.cxx) contains column/row/table values encoded into one UINT32 value such that the row value occupies 16 bit and column and table values each occupy 8 bit. Changing these would also affect the old binary file format, so appropriate conversion methods would have to be provided. It should be evaluated how a packing of information into any whatsoever sized variable space could be accomplished, since an ScAddress is quite often stored in memory. For example, one INT32 may hold the row value and column and sheet positions may be stored in INT16 values each. INT16 because at least the sheet number shouldn't be limited to 256 anymore (this as a side effect of all the row limit changes, though a change of the real sheet accessing methods would have to be carried out in a second step).

A cell reference (struct SingleRefData and struct ComplRefData, see inc/refdata.hxx and source/core/tool/refdata.cxx) in a formula contains an absolute position and an relative position. Both values are INT16 values. If those are to be changed it should be thought over if it would be really necessary to keep both the absolute value and the relative value or if it wouldn't be sufficient to keep either one (which of course would imply some changes in the class ScInterpreter (see source/core/inc/interpre.hxx and source/core/tool/interpr*.cxx) and class ScCompiler (see inc/compiler.hxx and source/core/tool/compiler.cxx position accessing methods). This would save a significant amount of memory if a lot of references were used in a lot of formulas (which usually is the case).

There is also a class ScTripel which partly implements features found in class ScAddress, and a class ScRefTripel which is similar to the SingleRefData with the exception that it is never used in formula data and has slightly different capabilities. Both, ScTripel and ScRefTripel (see inc/global.hxx and source/core/data/global.cxx), are leftovers from prior versions and should be replaced by ScAddress and a new (to be created) class ScRefAddress respectively throughout the entire application.

List of structs and classes having an `USHORT` (or `short` or `UINT16` or `INT16`) row value as member variable

Note: this list was created in 2001 and is incomplete [2004-01-26]

inc/*.hxx

class ScAddress
UINT32 nAddress; encoded row in lower 16 bit

struct ScAttrEntry
nRow; the end row of the range spanned by the attribute
class ScAttrArray
nCount; number of used ScAttrEntry entries (max: number of used rows)
nLimit; number of allocated entries (ditto)
class ScAttrIterator
short nPos; holds the index position to a ScAttrArray element obtained by ScAttrArray::Search()
nRow;
nEndRow;
class ScMergeAttr
INT16 nRowMerge;
class ScFormulaCell
nMatRows;
class ScChartPositionMap
nRowCount;
class ScChartArray
nStartRow;
class ScChangeTrack
nContentRowsPerSlot;
nContentSlots;
struct ColEntry
nRow;
class ScColumn
nCount; number of used ColEntry entries (max: number of used rows)
nLimit; number of allocated entries (ditto)
struct ScReferenceEntry
nRow;
class ScConsData
nRowCount;
class ScDBData
nStartRow;
nEndRow;
nSortDestRow;
nQueryDestRow;
class ScDocumentIterator
nRow;
class ScValueIterator
nStartRow;
nEndRow;
nRow;
nColRow;
nNextRow;
nAttrEndRow;
class ScQueryValueIterator
nRow;
nColRow;
nAttrEndRow;
class ScCellIterator
nStartRow;
nEndRow;
nRow;
nColRow;
class ScQueryCellIterator
nRow;
nColRow;
nAttrEndRow;
class ScDocAttrIterator
nStartRow;
nEndRow;
class ScAttrRectIterator
nStartRow;
nEndRow;
class ScHorizontalCellIterator
nEndRow;
USHORT* pNextRows;
nRow;
class ScHorizontalAttrIterator
nStartRow;
nEndRow;
USHORT* pNextEnd;
nRow;
class ScUsedAreaIterator
nNextRow;
nCellRow;
nAttrRow;
nFoundRow;
struct RowInfo
nRowNo;
struct ScSymbolStringCellEntry
nRow;
class ScDocument
nSrcMaxRow;
class ScTableRowsObj
nStartRow;
nEndRow;
class ScDPOutput
nTabStartRow;
nMemberStartRow;
nDataStartRow;
nTabEndRow;
class ScDPSaveData
nRowGrandMode;
struct ScDPTableIteratorParam
nRowCount;
class ScEditUtil
nRow;
struct ScImportParam
nRow1;
nRow2;
class ScTripel
nRow;
NOTE: Replace usage of ScTripel with ScAddress and delete ScTripel code. Delete ScTripel conversion code from ScAddress.
struct ScQueryParam
nRow1;
nRow2;
nDestRow;
struct ScSubTotalParam
nRow1;
nRow2;
struct ScConsolidateParam
nRow;
struct ScPivotParam
nRow;
nRowCount;
struct ScMarkEntry
nRow;
class ScMarkArray
nCount; number of used ScMarkEntry entries (max: number of used rows)
nLimit; number of allocated entries (ditto)
class ScPivot
nSrcRow1;
nSrcRow2;
nDestRow1;
nDestRow2;
nDataStartRow;
short nRowCount; a count which in program flow may not exceed a value of 8. Just a nice trap for a grep.
short nDataRowCount;
nRowIndex;
This class is some old stuff almost not needed anymore, was used for the old ... (pssst, we can't use the word here since it is a registered trademark, guess by whom) what is called DataPilot in our application. The class is only used for import/export of the old binary file format.
class ScArea
nRowStart;
nRowEnd;
struct SingleRefData
INT16 nRow;
INT16 nRelRow;
struct ScExtTabOptions
UINT16 nTopRow;
UINT16 nTopSplitRow;
class ScExtDocOptions
UINT16 nCurRow;
struct ScSortParam
nRow1;
nRow2;
nDestRow;

source/core/inc/*.hxx

class ScMatrix
nAnzRow;

source/filter/inc/*.hxx

struct ScEEParseEntry
nRow;
nRowOverlap;
class ScEEParser
nRowCnt;
nRowMax;
class XclExpTableOp
nFirstRow;
nLastRow;
nColInpRow;
nRowInpCol;
nRowInpRow;
struct ScHTMLTableStackEntry
nRowCnt;
struct ScHTMLAdjustStackEntry
nNextRow;
nCurRow;
class ScHTMLTableData
nFirstRow;
nLastRow;
nRowSpan;
nDocRow;
struct Sc10ColData
Row;
NOTE: Calc 1.0 import filter, there shouldn't be any changes necessary.
class XclExpCachedValueList
nRows;
class XclImpChart_LinkedData
nMinRowVal;

source/filter/xml/*.hxx

class ScMyNotEmptyCellsIterator
sal_uInt16 nCellRow;

source/ui/inc/*.hxx

class ScRedlinData
nRow;
class ScDataGrid
nNumberOfRows;
class ScGridWindow
nFilterBoxRow;
class ScNavigatorDlg
nCurRow;
class ScOutputData
nEditRow;
struct ScPrintState
nStartRow;
nEndRow;
class ScPageRowEntry
nStartRow;
nEndRow;
class ScPrintFunc
nRepeatStartRow;
nRepeatEndRow;
nStartRow;
nEndRow;
class ScSpellingEngine
nOrgRow;
nOldRow;
class ScTabPageSortFields
nFirstRow;
class ScUndoCursorAttr
nRow;
class ScUndoEnterData
nRow;
class ScUndoPageBreak
nRow;
class ScUndoThesaurus
nRow;
class ScUndoSubTotals
nNewEndRow;
class ScUndoImportData
nEndRow;
class ScUndoRepeatDB
nNewEndRow;
class ScViewData
nEditRow;
nEditEndRow;

List of defines using a limited row value

#define MAXROW 31999 inc/global.hxx
THE limit.
#define BCA_BRDCST_ALWAYS ScAddress( 0, 32767, 0 ) inc/global.hxx
A special address (outside the normal valid address range) used to signal an “always changed” function or cell.
#define ASCIIDLG_MAXROWS 32000 source/ui/dbgui/asciiopt.cxx
The maximum number of rows displayed in the ASCII import preview dialog. It is questionable if this should be increased since arrays for holding the imported data are created.
NumericField ED_ROW: Maximum = 32000; Last = 32000; source/ui/navipi/navipi.src, Window RID_SCDLG_NAVIGATORThe navigator resource defines an upper limit for the row edit field and its spin button.

Some specials about row numbers

A couple of iterators (see inc/dociter.hxx and source/core/data/dociter.cxx) make direct use of internal data structures like those of ScColumn and ScAttrArray using row values.

Reference updating methods like those of class ScRefUpdate (see inc/refupdat.hxx and source/core/tool/refupdat.cxx) make extensive use of USHORT and short variables and the MAXROW limit.

Each sheet contains two arrays, pRowHeight and pRowFlags (see inc/table.hxx and source/core/data/table1.cxx), sized MAXROW+1 each. If the number of rows is to be increased significantly a new mechanism would have to be implemented since keeping those row infos for a million rows or so isn't desirable. Even now these two arrays allocate about 90kB per sheet.

class ScBroadcastAreaSlot (see source/core/data/bcaslot.cxx) makes use of an upper row limit in the way that the rows are distributed over a number of slots. The number of slots is related to the number of rows.

class ScChangeTrack uses a similar technique to combine positions into slot buckets for faster access.

Currently, both algorithms of ScBroadcastAreaSlot and ScChangeTrack can't work with an arbitrary changing row limit, they do need a fixed upper limit set at compile time. Both algorithms would also not work very well (regarding speed, memory footprint and efficiency) if the upper limit would be too high.

ScChartArray::GlueState() (see source/core/tool/chartarr.cxx) creates a temporary array which may become as big as the spanning area of a (multi-) selection. Currently the maximum limit this implies is 8MB (256 columns x 32000 rows), this will automatically grow if more rows are available and the user spans a selection from upper left to lower right. However, the upcoming future chart implementation will not need this mechanism anymore, but we'll have to keep it for backwards compatibility.

The drawing layer is not directly related to the row limit, but the area it covers is of course determined by the number of available rows.

Strategy

It would be challenging to offer an almost unlimited number of rows, say a million or two or hundred (if your system had that much memory). However, before that could be accomplished the above mentioned mechanisms of the drawing layer needed to be changed and row flags, ScBroadcastAreaSlot and ScChangeTrack needed to be changed to be less memory consuming.

Therefor the changes should be carried out in several steps:

Change all variables used for row numbers from USHORT and short to INT32 using a SCROW typedef. This includes any changes needed or type conversion in import/export filter, and to compensate for any side effect that might occur. The change should be carried out successively instead of in one big move in order to maintain a usable application and not to render it useless for weeks. To help in finding all necessary places a special access operators and methods converter class SCROW should be used that forces the compiler to throw error diagnostics whenever an USHORT or short is used in combination with SCROW. A similar class is already available (on the writer's machine ;-) and just needs some modification.
Find a solution for the memory footprint problem of the pRowHeight and pRowFlags arrays. Similar to the mechanism used to manage row entries of a column array of used rows, dynamic arrays of applied row heights and flags should be maintained.
For MS-Excel compatibility increase the limit to 64k rows. It should still be ok to use the same drawing layer and ScBroadcastAreaSlot and ScChangeTrack mechanisms at this stage, even if the latter two would result in some more memory consumption, but those are on a per document basis and not a per sheet basis.
At this point, a version may be released to provide an interim solution for those who need it, for example for extended MS-Excel compatibility.
Find solutions for the ScBroadcastAreaSlot and ScChangeTrack problems which have to be really scalable. The first might involve a complete redesign of the broadcaster/listener concept of formula cells, a heavy change! But it would be necessary anyways, because there are some performance issues with it.
Wait for a reimplementation of the drawing layer (which will come) and use that because it will do away some quirks related to anchor positions, different map modes, rounding errors, and the area covered.
Finally set an arbitrary row limit. Or don't have an arbitrary limit at all?

Road map and time frame

Since we'll never touch the binary file format again in the sense of modifying anything of it's structure to store new features, and the fact that we're working on a method to strip the binary filters from the core code and have separate filters to convert SO5 file format to XML file format, we won't have to care about the binary file stream operators or code accessing file streams if we wait for the new filter code to be functional before touching the binary file format code. This will ease things a lot. A second mile stone we want to achieve before carrying out massive changes on the source code are the accessibility and CTL changes for the next release that are going on in the 642 builds. Both changes are expected to be finished by the end of July. The following table uses week x for the availability of the filter and main feature completeness on accessibility and CTL, and references to week x+1 and so on. For calculating the duration of tasks it is assumed that a week has about 2.5 developer's days, not counting time needed for other tasks such as fixing bugs, code reviews, and participating in mailing lists, meetings and conference calls. Time lines of best case – worst case are given, this includes assumptions of best case getting help when appropriate and worst case when I'd to do everything on my own. Vacations not included.

NOTE: Previous versions of this document stated a "starting date" of August 2002, based on the assumption that OOo1.1/SO6.1 were feature complete and that the binary filters were stripped until end of July. Unfortunately that wasn't the case, and people kept asking me about the progress and why this task wasn't even started. At the moment I cannot (and I don't want to) give yet another estimate of things that aren't even the slightest bit under my control to be nailed down on it. So I just state that waiting for other tasks is ongoing, without giving any due date.

NOTE2: Started in week 48 of 2003 without waiting for the binary filter to be stripped. Based on the 2.5 days per week calculation and including 2 weeks holidays the approximated end date will be between April and June of 2004.

Towards 64k

Step	What	Start week	Effort (days)	Duration (weeks)		End week	Status	Days used
Step	What	Start week	Effort (days)	min	max	End week	Status	Days used
	Wait for new SO5 binary filter.	past	-	-	-	???	done	-
	Start work on branch cws_src680_rowlimit, don't wait for stripped binary filter anymore.	48	-	-	-	-	done	-
1	Isolate `class ScAddress` in an own header file (move from `inc/global.hxx` to `inc/address.hxx`), in the same header file add the following typedefs and include that whenever needed. `typedef USHORT SCROW; typedef USHORT SCCOL; typedef USHORT SCTAB; // temporarily for signed short types: typedef short SCsROW; typedef short SCsCOL; typedef short SCsTAB;` This can be done at almost any time. I'll do it.	0	2	0.8	0.8	0-1	done	2
2	Replace all `USHORT` and `UINT16 for row variables` with `SCROW`. Replace `short` and `INT16` with `SCsROW` for row variables. This is just to identify used signed row variables and mark them for easier retrieval later on when they are to be changed using a `typedef INT32 SCROW;` Maybe those are places where some special handling will be necessary (for example, in reference updating code) if the width of the variable type is changed later on. Additionally, while we're at it, we should change column `USHORT` to `SCCOL` and change table `USHORT` to `SCTAB` (and `SCsCOL` and `SCsTAB` as above) for future reference. This work could be parallelized in greater portions, I guess up to 4 people could work on it together. This is a task where one or two community members willing to dive deeper into the spreadsheet code could participate in a coordinated effort.	1	12	2.4	4.8	3-5	done	10.5
2	Keeping this old note for documentation: Except for the source/filter/excel/ directory the first wave is completed. Reason to leave out the excel directory are heavy changes in the calcrtl and calc18 childworkspaces that would end up in doing all the work twice. We're waiting for integration of those. Furthermore, we're waiting for integration of the cac CWS that also introduces quite some changes to the formula-compiler-interpreter related code, where a second run will be necessary.
3	Restructure `class ScAddress` to use `SCROW nRow;` `SCCOL nCol; SCTAB nTab;`and provide constructors and access operators to set and get old `ULONG`/`UINT32`/`USHORT` values with internal conversion so that no existing source code using it has to be changed. My task.	3-6	2	0.8	0.8	3-6	done	2
4	Implement functionality of `class ScTripel` at `class ScAddress` (`GetText()`, `GetColRowString()`). Create `class ScRefAddress` similar to `class ScRefTripel`. Replace all usage of `class ScTripel` and `class ScRefTripel` with `class ScAddress` and `class ScRefAddress`. My task.	4-7	4	1.6	1.6	5-8	done	4
5	Using a special converter class included in `inc/address.hxx` identify pieces of code where an implicit conversion of type `SCROW` to type `USHORT` or `UINT16` is performed. Change the code found to use `SCROW` or explicit conversion and take measures for error handling if the value passed doesn't fit. Same for `SCsROW` and `short` and `INT16`. I'll provide the converter class and would appreciate the help of another developer familiar with the code. This is a nasty task because it means to compile every source file with the converter class header, decide about modifications, make changes, compile again, ...	5-9	9	1.8	3.6	6-12	done	16
6	Look for places where the signed `SCsROW` is used and add special handling if necessary. Task for two.	7-13	6	1.2	2.4	8-15	done	2
6	This turned out to be unnecessary as we replaced all unsigned types with signed types. Instead, use of nRow<=MAXROW and similar were replaced with ValidRow(nRow), as also negative values might be encountered now.
7	Implement solution for `pRowHeight` and `pRowFlags` arrays. This may get a bit more complicated since it is woven with UI issues.	8-16	9	3	4	10-19	done	13
8	Throw the switch to `typedef INT32 SCROW;` and `typedef INT32 SCsROW;` Test and fix bugs.	11-20	3	1.2	1.2	12-21	done	2
9	Change `MAXROW` to 64k and `BCA_BRDCST_ALWAYS` to `ScAddress(0,SCROW_MAX,0)`. Adjust `ScBroadcastAreaSlot` and `ScChangeTrack` slots. Test and fix bugs.	12-21	9	2	4	14-24	done	9
In order to provide an EarlyAccess version in time, the rowlimit CWS will be QA'ed at this point and integrated into the master. After that, I'll create a new CWS (rowlimit2) to work on the remaining step7, addressing memory consumption. Note that columns "start week", "duration in weeks", and "end week" aren't aligned anymore. CWS rowlimit2 was integrated as of SRC680_m52, additional bugfixes went into SRC680_m59.
10	Maybe `MAXROW` could be increased to 128k or 256k with the current slot implementations, would have to be investigated.	15-25	3	1.2	2	16-26
	Be happy.	16-26	?	?
	Sum		59	16	25.2	week (x+16)-(x+26) == 12-21 (2004)

Breaking the limit

To be done.