Derby Type System

The Derby type system is mainly contained in the org.apache.derby.iapi.types package. The two main classes are DataValueDescriptor and DataTypeDescriptor.

DataValueDescriptor

Values in Derby are always represented by instances of org.apache.derby.iapi.types.DataValueDescriptor, which might have been better named DataValue. DataValueDescriptor, or DVD for short, is mainly used to represent SQL data values, though it is also used for other internal types. DataValueDescriptor is a Java interface, and in general all values are manipulated through interfaces rather than through the implementing Java classes such as SQLInteger. DVDs are mutable (their value can change) and can represent either NULL or a valid value. Note that SQL NULL is represented by a DataValueDescriptor in a NULL state, not by a null Java reference to a DataValueDescriptor.

Generally the Derby engine works on an array of DVDs that represents a row, which can correspond to a row in a table, a row in a ResultSet to be returned to the application, or an intermediate row in a query. The DVDs within this array are re-used for each row processed, which is why they are mutable. For example, when reading rows from the store, a single DVD is used to read a column's value for all the rows processed. This benefits performance: in a table scan of one million rows, Derby does not create one million objects, which would be the case if the type system were immutable, like the Java object wrappers such as java.lang.Integer.
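
The reuse pattern can be illustrated with a toy sketch (these are not Derby's actual classes; MutableInt is a hypothetical stand-in for a mutable DVD):

```java
// Toy illustration of the reuse pattern: one mutable holder object serves
// every row of a scan instead of allocating a fresh object per row.
public class ScanDemo {
    static class MutableInt {                // stand-in for a mutable DVD
        private int v;
        void setValue(int v) { this.v = v; }
        int getInt() { return v; }
    }

    static long scanSum(int rows) {
        MutableInt col = new MutableInt();   // single holder for the column
        long sum = 0;
        for (int row = 0; row < rows; row++) {
            col.setValue(row);               // "read" the next row's value
            sum += col.getInt();             // use it before it is overwritten
        }
        return sum;                          // no per-row allocation occurred
    }

    public static void main(String[] args) {
        System.out.println(scanSum(1_000_000)); // one holder for a million rows
    }
}
```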

The methods in DataValueDescriptor can be broken into several groups.

Type Specific Interfaces

To support operators specific to a type, or to a set of types, there are Java interfaces that extend DataValueDescriptor, such as NumberDataValue for the numeric types.

Language Compilation

Much of the generated code for the language layer involves the type system. E.g. SQL operators are converted to method calls on interfaces within the type system, such as DataValueDescriptor or NumberDataValue; thus all this generated code makes its calls through interface methods. The language has a policy/style of generating fields holding holder objects for the result of any operation. This holder DataValueDescriptor is then re-used for all invocations of the operation within that query execution, saving object creation when the operation is applied to multiple rows. The generated code does not create the initial value for the field; instead the operator method or DataValueFactory methods create an instance the first time the result is passed in as null. The approximate Java code for this would be (note that the generator emits bytecode directly):

    // instance field to hold the result of the minus
    private NumberDataValue f7;

        ...

    // code within a generated method
    f7 = value.minus(f7);
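
A toy sketch of the pattern this generated code relies on (these are not Derby's actual implementations; the names mirror the interfaces described above): the operator allocates the result holder only when it is passed null, then reuses it on every later row.

```java
public class MinusSketch {
    interface NumberDataValue {
        int getInt();
        void setValue(int v);
        NumberDataValue minus(NumberDataValue result);
    }

    static class SQLIntegerLike implements NumberDataValue {
        private int value;
        SQLIntegerLike(int v) { value = v; }
        public int getInt() { return value; }
        public void setValue(int v) { value = v; }
        // Unary minus: negate the receiver into the (possibly null) holder.
        public NumberDataValue minus(NumberDataValue result) {
            if (result == null) {            // first row: create the holder
                result = new SQLIntegerLike(0);
            }
            result.setValue(-getInt());      // later rows: just overwrite it
            return result;
        }
    }

    public static void main(String[] args) {
        NumberDataValue value = new SQLIntegerLike(5);
        NumberDataValue f7 = null;           // the generated field starts null
        f7 = value.minus(f7);                // first call creates the holder
        NumberDataValue first = f7;
        f7 = value.minus(f7);                // second call reuses the same object
        System.out.println(f7.getInt() + " " + (f7 == first));
    }
}
```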

Interaction with Store

The store knows little about how values represent themselves in bytes; all of that knowledge is contained within the DVD implementations.
The exception is SQL NULL handling: the store handles NULL values consistently, as a null bit in the status byte for a field. Thus readExternal and writeExternal are never called for a DataValueDescriptor that is NULL.
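
The store-side convention can be sketched as follows (a simplified illustration, not Derby's actual format code; the status-byte layout is hypothetical): the status byte carries the null bit, and the value's own serialization runs only for non-null values.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class FieldWriteSketch {
    static final int NULL_BIT = 0x01;   // hypothetical status-byte layout

    // Write a field as: status byte, then the value's bytes only if non-null.
    static byte[] writeField(byte[] valueBytes) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        boolean isNull = (valueBytes == null);
        out.writeByte(isNull ? NULL_BIT : 0);
        if (!isNull) {
            out.write(valueBytes);      // the DVD's writeExternal would run here;
        }                               // it is never reached for NULL
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(writeField(null).length);              // status byte only
        System.out.println(writeField(new byte[] {1, 2}).length); // status + value
    }
}
```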

Delayed Object Creation

When a value reads itself from its byte representation, it should perform the least amount of work needed to obtain a useful representation of the value. This is because the value being read from disk may never be returned to the application, or may be returned but never used by the application. The first case can occur when a qualification in the SQL statement is executed at the language layer rather than pushed down to the store, so the row is fetched from the store but filtered out at the language layer. Taking SQLDecimal as an example, the byte format is a representation of a java.math.BigInteger instance along with a scale. If SQLDecimal took the simple approach of always using a java.math.BigDecimal, these are the steps that would occur when reading a DECIMAL column:
  1. Read BigInteger format into byte array, read scale
  2. New BigInteger instance from byte array - 2 object creations and byte array copy
  3. New BigDecimal instance from BigInteger and scale - 1 object creation
Now consider a million-row table scan with a DECIMAL column that returns 1% of the rows to the application, filtering at the language layer.

This simple SQLDecimal implementation would create 3 million objects and perform 1 million byte array copies.

The smart (and current) implementation of SQLDecimal delays steps 2 and 3 until there is an actual need for a BigDecimal object, e.g. when the application calls ResultSet.getBigDecimal. Assuming the application calls getBigDecimal for every row it receives, then, since only 1% of the rows are returned, 30,000 objects are created and 10,000 byte array copies are made, saving 2,970,000 object creations, 990,000 byte array copies, and the garbage collection overhead of those short-lived objects.

This delayed object creation increases the complexity of the DataValueDescriptor implementation, but the performance benefit is well worth it. The complexity comes from the implementation maintaining dual state: in SQLDecimal's case the value is represented either by the raw value or by a BigDecimal object. Care is taken in the implementation to always access the value through methods, not through the fields directly. String-based values such as SQLChar also perform this delayed object creation to String, as creating a String object requires two object creations and a char array copy. In the case of SQLChar, though, the raw value is maintained as a char array rather than a byte array, because the char[] can be used as-is as the value, e.g. in string comparisons.
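
The dual-state pattern can be sketched like this (a simplified illustration, not Derby's SQLDecimal): the value lives either as the raw bytes read from the store or as a materialized BigDecimal, and the object is built only on first use.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class LazyDecimal {
    private byte[] rawBytes;      // BigInteger two's-complement bytes (step 1)
    private int rawScale;
    private BigDecimal value;     // materialized form, created on demand

    void readFromStore(byte[] bytes, int scale) {
        rawBytes = bytes;         // step 1 only: keep the raw form
        rawScale = scale;
        value = null;             // drop any previously materialized value
    }

    // Steps 2 and 3 happen here, and only if the value is actually needed.
    BigDecimal getBigDecimal() {
        if (value == null && rawBytes != null) {
            value = new BigDecimal(new BigInteger(rawBytes), rawScale);
        }
        return value;
    }

    public static void main(String[] args) {
        LazyDecimal d = new LazyDecimal();
        d.readFromStore(new BigDecimal("12.34").unscaledValue().toByteArray(), 2);
        // No BigDecimal existed until this call:
        System.out.println(d.getBigDecimal());
    }
}
```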

DataValueFactory

Specific instances of DataValueDescriptor are mostly created through the DataValueFactory interface. This hides the implementation of the types from the JDBC, language and store layers. The interface includes methods for constructing and converting values of each SQL type.

DataTypeDescriptor

The SQL type of a column, value or expression is represented by an instance of org.apache.derby.iapi.types.DataTypeDescriptor. DataTypeDescriptor contains three key pieces of information:
  1. The fundamental SQL type, e.g. INTEGER, DECIMAL, represented by a org.apache.derby.iapi.types.TypeId.
  2. Any length, precision or scale attributes, e.g. length for CHAR, precision & scale for DECIMAL.
  3. Whether the type is nullable
Note that a DataValueDescriptor is not tied to any DataTypeDescriptor, so setting a value into a DataValueDescriptor that does not conform to the intended DataTypeDescriptor is allowed. The value is checked in an explicit normalization phase. As an example, an application can use setBigDecimal() to set 199.0 into a parameter that is marked as DECIMAL(4,2). Only during the execute phase will the out-of-range exception be raised.
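
A minimal illustration of the kind of range check the normalization phase performs (the helper below is hypothetical, not Derby's code): a DECIMAL(4,2) allows at most two integral digits, so 199.0 overflows while 12.345 merely rounds.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class NormalizeDemo {
    // Hypothetical helper mirroring a DECIMAL(precision, scale) check.
    static BigDecimal normalize(BigDecimal v, int precision, int scale) {
        BigDecimal rounded = v.setScale(scale, RoundingMode.HALF_UP);
        // Integral digits must fit in (precision - scale).
        if (rounded.precision() - rounded.scale() > precision - scale) {
            throw new ArithmeticException("value out of range for DECIMAL("
                    + precision + "," + scale + ")");
        }
        return rounded;
    }

    public static void main(String[] args) {
        System.out.println(normalize(new BigDecimal("12.345"), 4, 2)); // rounds, fits
        try {
            normalize(new BigDecimal("199.0"), 4, 2); // needs 3 integral digits
        } catch (ArithmeticException e) {
            System.out.println("overflow: " + e.getMessage());
        }
    }
}
```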

Issues

Interfaces or Classes

Matching the interface type hierarchy is an implementation (class) hierarchy, complete with abstract types; for example DataType (again badly named) is the abstract root for all implementations of DataValueDescriptor, and NumberDataType for NumberDataValue. Code would be smaller and faster if the interfaces were removed and the official API became the public methods of these abstract classes. The work involved here is fixing the code generation involving types: regular Java code would be compiled correctly after any such change, but the generated code needs to be changed by hand, to turn interface calls into class method calls. Any change like this should probably also rename the abstract classes to short descriptive names, like DataValue and NumberValue.

DataValueFactory

There is a demonstrated need to hide the implementation of DECIMAL, as J2ME, J2SE and J2SE5 require different versions; thus a type implementation factory is required. However, it seems too generic to support different implementations of INTEGER, BIGINT and other fundamental types. The code could perhaps be simplified to allow direct use of SQLInteger, SQLLong and the others, at least for the SQL types that are implemented using Java primitives.

Result Holder Generation

The dynamic creation of result holders (see the language section) means that every operator has to check whether the result reference passed in is null and, if so, create a new instance of the desired type. This check seems inefficient, as it is performed once per operation; again, imagine the million-row query. In addition, the field that holds the result holder in the generated code is reassigned to the same value each time, which is also inefficient. The code using the type system, generated or hand-written, could instead set up the result holder at initialization time, removing the need for the check and the field assignment and leading to faster, smaller code.
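
The proposed alternative might look like this sketch (class names are illustrative, not Derby's): the holder is allocated once up front, so the per-row call needs no null check and no field reassignment.

```java
public class PreallocatedHolder {
    static class IntHolder {                   // toy stand-in for a DVD
        private int v;
        void setValue(int v) { this.v = v; }
        int getInt() { return v; }
    }

    private final IntHolder f7 = new IntHolder();  // created once, at init time

    int minusRow(int input) {
        f7.setValue(-input);   // per-row work: no null check, no reassignment
        return f7.getInt();
    }

    public static void main(String[] args) {
        PreallocatedHolder q = new PreallocatedHolder();
        System.out.println(q.minusRow(5));     // -5
        System.out.println(q.minusRow(-3));    // 3
    }
}
```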

NULL and operators

The operators typically have to check for incoming NULL values and set the result to NULL if any input is NULL. Combined with the result holder generation issue, this leads to a lot of duplicated code checking whether the inputs are NULL. It is currently hard to do this in a single method, as the code needs to determine whether the inputs are NULL, generate a result holder, and return two values (whether the result is NULL and what the result holder is). Splitting the operator methods in two would help: at least the NULL checks could live in the superclass shared by all the types, rather than in each implementation. It would also make it possible to generate a more efficient operator when the inputs are not nullable. E.g. for the + operator there could be plus() and plusNotNull() methods, with plus() implemented in the NumberDataType class, handling NULL inputs and calling plusNotNull(), and plusNotNull() implemented in each specific type.
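
The proposed split can be sketched as follows (this is not existing Derby code; the classes are toy stand-ins): the NULL propagation is written once in the abstract superclass, and each concrete type implements only the not-null arithmetic.

```java
public class NullSplitSketch {
    static abstract class NumberType {
        private boolean isNull = true;
        private int value;               // toy payload; real types differ

        boolean isNull() { return isNull; }
        int getInt() { return value; }
        void setNull() { isNull = true; }
        void setValue(int v) { value = v; isNull = false; }

        // plus(): generic NULL handling, shared by all numeric types
        NumberType plus(NumberType right, NumberType result) {
            if (isNull() || right.isNull()) {
                result.setNull();        // NULL in, NULL out
                return result;
            }
            return plusNotNull(right, result);
        }

        // plusNotNull(): type-specific arithmetic, no NULL checks needed;
        // code generated for non-nullable inputs could call this directly
        abstract NumberType plusNotNull(NumberType right, NumberType result);
    }

    static class IntType extends NumberType {
        NumberType plusNotNull(NumberType right, NumberType result) {
            result.setValue(getInt() + right.getInt());
            return result;
        }
    }

    public static void main(String[] args) {
        NumberType a = new IntType(), b = new IntType(), r = new IntType();
        a.setValue(2);
        b.setValue(3);
        System.out.println(a.plus(b, r).getInt());   // 5
        b.setNull();
        System.out.println(a.plus(b, r).isNull());   // true
    }
}
```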

Operators and self

It seems the operator methods should almost always act on their own value; e.g. the plus() method should take only one input, with the result being the value of the receiver (self) added to the input. Currently plus() takes two inputs, and in most if not all cases the left input is probably the receiver. The result would be smaller and possibly faster code, as the method calls on self would not go through an interface.