Supporting Formula in Pocket Excel plugin for xMerge

Supporting Formula in Pocket Excel plugin for xMerge

Overview

The difficulty in supporting formula conversion from xml in StarOffice to binary equivalents in Excel comes from the fact that there are very basic differences between the two formats. Pocket Excel uses obviously binary records to store formula, StarOffice uses a String. In the case of Pocket Excel, postfix (Reverse Polish) notation is used and in StarOffice the formula is stored infix notation. This means that we must identify the operators and operands (as well as functions) in the formula in order to convert from one notation to another. This process we call parsing and in many ways is similar to what a compiler does when interpreting source code. The next step is to convert to and from postfix notation. This can be done in three ways using either stacks, binary trees or combinations of both. Once this has been done all that remains is token substitution where a String is converted to or from a series of bytes.

The Issues

Parsing

Parsing is the first operation to be performed on any formula encountered in either Pocket Excel or StarOffice. Parsing is the single most important aspect of formula conversion because all tokens must be correctly identified in order for the following two steps to be successful. The parser must be able to correctly identify operators, operands. Operands in there simplest terms consist of either a numeric value or a cell reference. Operators include binary operators (+,-,*,/) or unary operators(-). These are the simplest as

the operands are easily identified. A more complicated type of operator however is the function. These are more difficult as they can contain any number of parameters. The parameters can be of almost any type, numeric, cell references, cell ranges, or other functions. Consider the following formula :-

=SUM(AA11:AA22, 1000, C3+4, AVERAGE(D44:D55))

This is a valid formula and the parser needs to be able to identify each token

in order for the RPN conversion to be correctly completed.

Pocket Excel Formula Parsing

In the case of pocket excel the formula must be parsed from a series of bytes in postfix (Reverse Polish) notation to a string in infix notation. This is not really a parser as we simply convert the bytes to their equivalent string representations. Tokens in this way can easily be identified and no lexical scanning is required. Postfix notation does not have to concern itself with operator precedence or parenthesis hence this must be taken care of during paring. For example consider the following postfix statement

=AB+C*

This would be translated to infix notation

=A+B*C

However because of operator precedence it is required the we used parenthesis

to ensure that it is calculated in the right order.

=(A+B)*C

StarCalc Formula Parsing

In the case of StarOffice formula Parsing the formula must be parsed from a String to a series of bytes. Again operators precedence is important along with the use of parenthesis. Also the parser must have the ability to parse nested functions and complex arguments (eg. =A1*3-(SUM(C1:A1, D1+40, AVERAGE(D2:D5))) ).

RPN Conversion

Consider the formula

SUM(A1, A2, 3+4*SUM(B3:B9),D1)

this notation is called infix and is the one that is used in StarCalc. However in Pocket Excel this formula would appear in postfix notation (also called Reverse Polish)

A1 A2 3 4 B3:B9 SUM * + D1 SUM

There are two ways in which postfix can be converted from/to infix. One is to use a Binary tree and read the tree from right to left to get postfix and from left to right to get infix. The problem with this method is it doesn't take into account functions which can have more than two parameters.

The other way of implementing conversion is to use Stacks. This solves the problem of function parameters but the code is more complex. For example in order to take into account that natural operator precedence might be overruled by parenthesis the Stack method will need to take into account parenthesis (possibly nested).

Token substitution

Token substitution must take account of the different ways in which data is stored in each format. For example in StarOffice cell references are stored as Strings in the same way as they appear in the spreadsheet (e.g. A12, C23, B4:B12). However in Pocket Excel Cell references are stored as 0 based indexes so A1 is represented in binary as

44 00 C0 00

Cell Ref. Row Column (bits 0-7)

This is just a substitution for a simple cell reference. There are other types of cell references, cell ranges and functions. This needs to well designed code as it will need to be expanded as additional features are supported. Also due to the large amount of objects it is not practical to have an object for each operator and operand. Obviously token substitution needs to be supported in both directions.

Proposed Solution

Parsing

We have written a top-down parser that will be able to handle infix to postfix parsing with one addition. This addition is handling parameters within functions. This is not a trivial addition as there can be any number of parameters and the parameters themselves can be any combination of cell references, cell ranges, functions or other formula. The BNF notation for this parser is

<expression> ::= <unary op> <term> [<addop> <term>]

<term> ::= <factor> [<mulop> factor]

<factor> ::= <integer> | <variable> | <expression> | <function>

<function> ::= <ident> [ <expression> ]

Parsing from postfix to infix is made simpler by the fact that we will be reading in a series of bytes which will give the exact content of each token. In this case what is required is a converter as opposed to a parser. This will generate a series of objects (see token substitution) which are passed to the RPN converter.

RPN Conversion

In order to overcome the problem with Binary trees and function parameters we have decided to write a Stack implementation. This means the code to implement this is more complicated but is flexible enough to handle most formula.

From Infix to Postfix

In this case we push operators on to a Stack. When we come across an operator we check for precedence with the next operand and if it is greater we also push this onto the stack.

=A1+SUM(3*B1+4, C1)

First token is A1 which we put into our output string
the next token is + which we push on the stack
The next token is the function SUM with two arguments which again is an operator which we put on the stack
Next is 3 which as an operand we put in the output string
Next is * which goes on the stack
B1 is an operand we put in the output string
+ but this has less precedence so we pop * off the stack into the output string and + goes on the stack
4 operand in the output strings

From Postfix to Infix

From postfix to infix we again use stack. In this case however we use it as a temporary storage space for operands. We push operands onto the stack and when we come across a operator we pop the operands off the stack and use them as operands for that operator. If the completed statement is a parameter to a function it needs to be pushed back onto the stack and the remainder of the function parsed.

=A1 A2 3 4 B3:B9 SUM * + D1 SUM 3 +

First token A1 is an operand it goes on the stack
next is A2 again on the stack
next is 3 again on the stack
next is 4 again on the stack
next is B3:B9 again on the stack
next is SUM with 1 argument so pop one operand off the stack B3:B9 and insert it into the SUM function
The next is a * operator so we prepend that to the SUM statement and pop another operand 4 off the stack
The next is a + operator so we prepend that to the SUM statement and pop another operand 3 off the stack
The next is another operand so know that parameter is finished so we push it onto the stack and then push D1 onto the stack as well.
The next is SUM with 3 arguments so we pop 3 arguments of the stack
The next is an operand so we know to push the statement completed in the last step needs to be pushed onto the stack and then we push on 3 as well
the last token is a plus so we end up with

=SUM(A1, A2, 3+4*SUM(BS:B9), D1)+3

Token Substitution

The best solution is to use generic objects for tokens. We will have two basic object classes. One called OperatorToken and one called OperandToken. Each of these will have a value variable as well as type. The type will describe forms of an operator or operand e.g. + , A1, SUM(). The parser will parse a string or a series of bytes into a Vector of these objects which will be used by the RPN converter to convert to or from postfix notation. These classes will be derived from a Token interface to facilitate this.

We will also have to describe a Constants interface to describe the various hex codes that exist for each object. One area we have not decided what to do yet is converting to or from bytes. Obviously we need a separate one for each operator but it is not practical to have one object for each operator.