Joins and Performance

Joins, SQL statements in which Derby selects data from two or more tables using one or more key columns from each table, can vary widely in performance. Factors that affect the performance of joins are join order, indexes, and join strategy.

Join Order Overview

The Derby optimizer usually makes a good choice about join order. This section discusses the performance implications of join order.

In a join operation involving two tables, Derby scans the tables in a particular order. Derby accesses rows in one table first, and this table is now called the outer table.

Then, for each qualifying row in the outer table, Derby looks for matching rows in the second table, which is called the inner table.

Derby accesses the outer table once, and the inner table probably many times (depending on how many rows in the outer table qualify).

This leads to a few general rules of thumb about join order:

Join Strategies

The most common join strategy in Derby is called a nested loop. For each qualifying row in the outer table, Derby uses the appropriate access path (index or table) to find the matching rows in the inner table.

Another type of join in Derby is called a hash join. For joins of this type, Derby constructs a hash table representing all the selected columns of the inner table. For each qualifying row in the outer table, Derby does a quick lookup on the hash table to get the inner table data. Derby has to scan the inner table or index only once, to build the hash table.

Nested loop joins are preferable in most situations.

Hash joins are preferable in situations in which the inner table values are unique and there are many qualifying rows in the outer table. Hash joins require that the statement's WHERE clause be an optimizable equijoin:

The hash table for a hash join is held in memory and if it gets big enough, it can cause the JVM to run out of memory. The optimizer makes a very rough estimate of the amount of memory required to make the hash table. If it estimates that the amount of memory required would exceed the system-wide limit of memory use for a table, the optimizer chooses a nested loop join instead.

If memory use is not a problem for your environment, set this property to a high number; allowing the optimizer the maximum flexibility in considering a join strategy queries involving large queries leads to better performance. It can also be set to smaller values for more limited environments.

Note:
Derby allows multiple columns as hash keys.


[ Previous Page
Next Page
Table of Contents
Index ]