Derby Type System
The Derby type system is mainly contained in the org.apache.derby.iapi.types
package. The main two classes are DataValueDescriptor
and
DataTypeDescriptor.
DataValueDescriptor
Values in Derby are always represented by instances of org.apache.derby.iapi.types.DataValueDescriptor,
which might have
been better named DataValue.
DataValueDescriptor, or
DVD for short, is
mainly used to represent SQL data values, though it is used for other
internal types. DataValueDescriptor
is a Java
interface and in general all values are manipulated through interfaces
and not the Java class implementations such as SQLInteger.
DVDs are mutable
(their value can change) and can represent NULL or a valid value. Note
that SQL NULL is
represented by a DataValueDescriptor
with a state of NULL,
not a null Java
reference to a DataValueDescriptor.
Generally the Derby
engine works upon an array of DVD's that represent a row, which can
correspond to a row in a table, a row in a ResultSet to be returned to
the application or an intermediate row in a query. The DVD's within
this array are re-used for each row processed, this is why they are
mutable. For example in reading rows from the store a single DVD
is used to read a column's value for all the rows processed. This is to
benefit performance, thus in a table scan of one million rows Derby
does not create one million objects, which would be the case if the
type system was immutable, like the Java object wrappers
java.lang.Integer
etc.
The methods in DataValueDescriptor
can be broken into
these groups
- getXXX methods
to help implement java.sql.ResultSet.getXXX
methods. Thus a ResultSet.getInt()
corresponds to the DataValueDescriptor.getInt()
method.
- setValue methods to help implement java.sql.PreparedStatement.setXXX
methods. Thus a PreparedStatement.setInt()
corresponds to the DataValueDescriptor.setValue(int)
method. These methods perform overflow and other range checks, e.g.
setting a long into a SQLInteger checks
to see that the value is within the range of an int, if not
an exception is thrown.
These methods are also used to implement casts and other data type
conversion, thus ensuring consistent type conversions for Derby within
SQL and JDBC.
- Methods to support SQL operators, e.g. isNull.
- Methods to read and write to disk, or strictly to convert the
value into a byte representation, e.g.writeExternal.
Type Specific
Interfaces
To support operators specific to a type, or set of types, Java
interfaces that extend DataValueDescriptor exist:
- NumberDataValue -
Methods for operators on numeric types, such as INTEGER, DECIMAL, REAL, e.g. plus for the SQL +
operator.
- StringDataValue -
Methods for operators on character types, such as CHAR, VARCHAR.
- BitDataValue -
Methods for operators on binary types, such as CHAR FOR BIT DATA, BLOB.
- DateTimeDataValue -
Methods for operators on character types, such as CHAR, VARCHAR.
- BooleanDataValue -
Methods for operators on BOOLEAN
type.
Language Compilation
Much of the generate code for language involves the type system. E.g.
SQL operators are converted to method calls on interfaces within the
type system, such as DataValueDesciptor
or NumberDataValue. Thus all
this generated code makes method calls through interface method calls.
The language has a policy/style of generating fields with holder
objects for the result of any operation. This holder
DataValueDescriptor is
then re-used for all the operations within that
query execution, thus saving object creation when the operation is
called on multiple rows. The generated code does not create the initial
value for the field, instead the operator method or DataValueFactory
methods create instance the first time that the result is passed
in as
null. The approximate Java
code for this would be (note the generator
generates to byte code directly).
// instance field to
hold the result of the minus
private
NumberDataValue f7;
...
// code within
a generated method
f7 = value.minus(f7);
Interaction with
Store
The store knows little about how values represent themselves in bytes,
all that knowledge is contained within the DVD implementation.
The exception is SQL NULL handling,
the store handles NULL values
consistently, as a null bit in the status byte for a field. Thus
readExternal and writeExternal are never called
for a
DataValueDescriptor that
is NULL.
Delayed Object Creation
When a value reads itself from its byte representation it is required
that the least amount of work is performed to obtain a useful
representation of a value. This is because the value being read from
disk may never be returned to the application, or returned but never
used by the application. The first case can occur when a qualification
in the SQL statement is executed at the language layer and not pushed
down to the store, thus the row is fetched from the store but filtered
out at the language layer. Taking SQLDecimal
as an example, the byte
format is a representation of a java.math.BigInteger instance
along with a scale. Taking the simple approach that SQLDecimal would
always use a java.math.BigDecimal, then this is the steps that would
occur when reading a DECIMAL column:
- Read BigInteger format
into byte array, read scale
- New BigInteger instance
from byte array - 2 object creations
and byte array copy
- New BigDecimal instance
from BigInteger and scale - 1 object
creation
Now think about a million row table scan with a DECIMAL column that
returns 1% of the rows to the application, filtering at the language
layer.
This simple SQLDecimal implementation
will create 3 million objects and
do 1 million byte array copies.
The smart (and current) implementation of SQLDecimal will delay steps 2
and 3 until there is an actual need for a BigDecimal object, e.g when
the application calls ResultSet.getBigDecimal.
So assuming the
application calls getBigDecimal for
every row it receives, then, since
only 1% of the rows are returned, 30,000 objects are created and 10,000
byte copies are made, thus saving 2,970,000 object creations and
990,000 byte array copies and the garbage collection overhead of those
short lived objects.
This delayed object creation increases the complexity of the
DataValueDescriptor implementation,
but the performance benefit is well
worth it. The complexity comes from the implementation maintaining dual
state, in SQLDecimal case
the value is represented by either the raw
value, or by a BigDecimal object.
Care is taken in the implementation
to always access the value through methods, and not the fields
directly. String based values such as SQLChar also perform this
delayed
object creation to String, as creating a String object requires two
object creations and a char array copy. In the case of SQLChar though,
the raw value is maintained as a char array and not a byte array, this
is because the char[] can be used as-is as the value, e.g. in string
comparisons.
DataValueFactory
Specific instances of DataValueDescriptor
are mostly created through
the DataValueFactory interface.
This hides the implementation of types
from the JDBC, language and store layers. This interface includes
methods to:
- generate new NULL values
for specific SQL types.
- generate specific types from Java primitves or Java objects (such
as String). The returned type corresponds to the JDBC mapping for the
Java type, e.g. SQLInteger for
int. Where the Java
type can map to
multiple SQL types there are specific methods such as getChar,
getVarchar.
DataTypeDescriptor
The SQL type of a column, value or expression is represented by an
instance of org.apache.derby.iapi.types.DataTypeDescriptor. DataTypeDescriptor contains three key
pieces of information:
- The fundamental SQL type, e.g. INTEGER, DECIMAL, represented by a
org.apache.derby.iapi.types.TypeId.
- Any length, precision or scale attributes, e.g. length for CHAR, precision & scale for
DECIMAL.
- Is the type nullable
Note that a DataValueDescriptor is
not tied to any DataTypeDescriptor,
thus setting a value into a DataValueDescriptor
that does not conform to the intended DataTypeDescriptor is allowed.
The value is checked in an explict normalization phase. As an example,
an application can use setBigDecimal()
to set 199.0 to a
parameter that is marked as being DECIMAL(4,2).
Only on the execute phase will the out of range exception be raised.
Issues
Interfaces or Classes
Matching the interface type hierachy is a implementation (class)
hierachy complete with abstract types, for example DataType (again
badly named) is the abstract root for all implementations of
DataValueDescriptor, and NumberDataType for NumberDataValue. Code would
be smaller and faster if the interfaces were removed and the official
api became the public methods of these abstract classes. The work
involved here is fixing the code generation involving types, regular
java code would be compiled correctly with any change, but the
generated code needs to be change by hand, to change interface calls to
method calls. Any change like this should probably rename the abstract
classes to short descriptive names, liker DataValue and NumberValue.
DataValueFactory
There is demonstrated need to hide the implementation of DECIMAL as
J2ME, J2SE and J2SE5 require different versions, thus a type
implementation factory is required. However it seems to be too generic
to have the ability to support different implementations of INTEGER,
BIGINT and some other fundemental types. Thus maybe the code could be
simplified to allow use of SQLInteger, SQLLong and others directly. At
least the SQL types that are implemented using Java primitives.
Result Holder Generation
The dynamic creation of result holders (see language section) means
that all operators have to check for the result reference being passed
in being null, and if so create a new instance of the desired type.
This check seems inefficient as it will be performed once per
operation, again, imagine the million row query. In addition the field
that holds the result holder in the generated code is assigned each
time to the same value, inefficient. It seems that the code using the
type system, generated or coded, can set up the result holder at
initialization time, thus removing the need for the check and field
assignment, leading to faster smaller code.
NULL and operators
The operators typically have to check for incoming NULL values and
assign the result to be NULL if any of the inputs are NULL. This
combined with the result holder generation issue leads to a lot of
duplicate code checking to see if the inputs are NULL. It's hard to
currently do this in a single method as the code needs to determine if
the inputs are NULL, generate a result holder and return two values (is
the result NULL and what is the result holder). Splitting the operator
methods into two would help as at least the NULL checks could be in the
super-class for all the types, rather than in each implementation. In
addition this would lead to the ability to generate to a more efficient
operator if the inputs are not nullable. E.g for the + operator there
could be plus() and plusNotNull() methods, the plus() being implemented in the
NumberDataType class, handling NULL inputs and calling plusNotNull(),
with the plusNotNull() implemented in the specific type.
Operators and self
It seems the operator methods should almost always be acting on thier
own value, e.g. the plus() method should only take one input and the
result is the value of the receiver (self) added to the input.
Currently the plus takes two inputs and probably in most if not all
cases the left input is the receiver. The result would be smaller code
and possible faster, as the method calls on self would not be through
an interface.