Chapter 11. Performance Tuning

Table of Contents

Prefetching
Data Rows
Iterated Queries
Paginated Queries
Caching and Fresh Data
Turning off Synchronization of ObjectContexts

Prefetching

Prefetching is a technique that allows bringing back in one query not only the queried objects, but also objects related to them. In other words, it is a controlled eager relationship resolution mechanism. Prefetching is discussed in the "Performance Tuning" chapter because it is a powerful performance optimization method. However, another common application of prefetching is refreshing stale object relationships, so more generally it can be viewed as a technique for managing subsets of the object graph.

Prefetching example:

SelectQuery query = new SelectQuery(Artist.class);

// this instructs Cayenne to prefetch one of Artist's relationships
query.addPrefetch("paintings");

// the query is executed as usual, but the resulting Artists will have
// their paintings "inflated"
List<Artist> artists = context.performQuery(query);

All types of relationships can be prefetched - to-one, to-many, flattened.

A prefetch can span multiple relationships:

 query.addPrefetch("paintings.gallery");

A query can have multiple prefetches:

query.addPrefetch("paintings");
query.addPrefetch("paintings.gallery"); 

If a query fetches DataRows, all "disjoint" prefetches are ignored and only "joint" prefetches are executed (see the prefetching semantics discussion below for what disjoint and joint prefetches mean).
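For example, to fetch DataRows together with prefetched related data, the prefetch must use joint semantics (a sketch based on the Artist example above; semantics configuration is covered in the next section):

```java
SelectQuery query = new SelectQuery(Artist.class);
query.setFetchingDataRows(true);

// disjoint prefetches would be silently ignored in DataRow mode,
// so explicitly request joint semantics
query.addPrefetch("paintings").setSemantics(
        PrefetchTreeNode.JOINT_PREFETCH_SEMANTICS);

List<DataRow> rows = context.performQuery(query);
```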

Prefetching Semantics

Prefetching semantics defines a strategy for prefetching relationships. Depending on the semantics, Cayenne generates different types of queries. The end result is the same - query root objects with related objects fully resolved. However, semantics can affect performance, in some cases significantly. There are three types of prefetch semantics, all defined as constants in org.apache.cayenne.query.PrefetchTreeNode:

PrefetchTreeNode.JOINT_PREFETCH_SEMANTICS
PrefetchTreeNode.DISJOINT_PREFETCH_SEMANTICS
PrefetchTreeNode.DISJOINT_BY_ID_PREFETCH_SEMANTICS

Each query has a default prefetch semantics, so generally users do not have to worry about changing it, except when performance is a concern or in a few special cases where the default semantics can't produce the correct result. SelectQuery uses DISJOINT_PREFETCH_SEMANTICS by default. Semantics can be changed as follows:

SelectQuery query = new SelectQuery(Artist.class);
query.addPrefetch("paintings").setSemantics(
                PrefetchTreeNode.JOINT_PREFETCH_SEMANTICS); 

There's no limitation on mixing different types of semantics in the same SelectQuery; each of multiple prefetches can have its own semantics.

SQLTemplate and ProcedureQuery both use JOINT_PREFETCH_SEMANTICS, and this cannot be changed due to the nature of these two queries.

Disjoint Prefetching Semantics

This semantics (only applicable to SelectQuery) results in Cayenne generating one SQL statement for the main objects, and a separate statement for each prefetch path (hence "disjoint" - related objects are not fetched with the main query). Each additional SQL statement uses the qualifier of the main query plus a set of joins traversing the prefetch path between the main and related entity.

This strategy has the advantage of efficient JVM memory use and faster overall result processing by Cayenne, but it requires (1 + N) SQL statements to be executed, where N is the number of prefetched relationships.

Disjoint-by-ID Prefetching Semantics

This is a variation of disjoint prefetch where related objects are matched against a set of IDs derived from the fetched main objects (or from intermediate objects in a multi-step prefetch). Cayenne limits the size of the generated WHERE clause, as most databases can't parse arbitrarily large SQL, so prefetch queries are broken into smaller queries. The maximum WHERE clause size is controlled by the DI property Constants.SERVER_MAX_ID_QUALIFIER_SIZE_PROPERTY (the default number of conditions in the generated WHERE clause is 10000). Cayenne will generate (1 + N * M) SQL statements for each query using disjoint-by-ID prefetches, where N is the number of relationships to prefetch, and M is the number of queries for a given prefetch, which depends on the number of objects in the result (ideally M = 1).

The advantage of this type of prefetch is that matching database rows by ID may be much faster than matching the qualifier of the original query. Moreover, this is the only type of prefetch that can handle SelectQueries with a fetch limit. Both joint and regular disjoint prefetches may produce invalid results or generate inefficient fetch-the-entire-table SQL when a fetch limit is in effect.

The disadvantage is that query SQL can get unwieldy for large result sets, as each object will have to have its own condition in the WHERE clause of the generated SQL.
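For instance, when a query uses a fetch limit, disjoint-by-ID is the safe choice (a sketch reusing the Artist example from above):

```java
SelectQuery query = new SelectQuery(Artist.class);

// with a fetch limit, joint and plain disjoint prefetches may misbehave,
// so switch the prefetch to disjoint-by-ID semantics
query.setFetchLimit(100);
query.addPrefetch("paintings").setSemantics(
        PrefetchTreeNode.DISJOINT_BY_ID_PREFETCH_SEMANTICS);

List<Artist> artists = context.performQuery(query);
```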

Joint Prefetching Semantics

Joint semantics results in a single SQL statement for the root objects and any number of jointly prefetched paths. Cayenne processes in memory a Cartesian product of the entities involved, converting it to an object tree. It uses OUTER joins to connect prefetched entities.

Joint is the most efficient prefetch type of the three as far as generated SQL goes - there's always just one SQL query generated. Its downsides are the potentially increased amount of data that needs to cross the network between the application server and the database, and the extra data processing that needs to be done on the Cayenne side.

Data Rows

Converting result set data to Persistent objects and registering these objects in the ObjectContext can be an expensive operation, comparable to the time spent running the query (and frequently exceeding it). Internally Cayenne builds the result as a list of DataRows that are later converted to objects. Skipping the last step and using data in the form of DataRows can significantly increase performance.

A DataRow is simply a map of values keyed by their DB column names. It is the ubiquitous representation of DB data used internally by Cayenne, and in many cases it can be quite usable as-is in the application. So performance-sensitive selects should consider DataRows - they save memory and CPU cycles. All selecting queries support the DataRows option, e.g.:

SelectQuery query = new SelectQuery(Artist.class);
query.setFetchingDataRows(true);

List<DataRow> rows = context.performQuery(query);

SQLTemplate query = new SQLTemplate(Artist.class, "SELECT * FROM ARTIST");
query.setFetchingDataRows(true);

List<DataRow> rows = context.performQuery(query);

Moreover, DataRows may be converted to Persistent objects later as needed. So, e.g., you may implement some in-memory filtering, converting only a subset of the fetched objects:

// you need to cast ObjectContext to DataContext to get access to 'objectFromDataRow'
DataContext dataContext = (DataContext) context;

for(DataRow row : rows) {
    if(row.get("DATE_OF_BIRTH") != null) {
        Artist artist = dataContext.objectFromDataRow(Artist.class, row);
        // do something with the Artist...
    }
}

Iterated Queries

While contemporary hardware may easily allow applications to fetch hundreds of thousands or even millions of objects into memory, it doesn't mean this is always a good idea to do so. You can optimize processing of very large result sets with two techniques discussed in this and the following chapter - iterated and paginated queries.

An iterated query is not actually a special query. Any selecting query can be executed in iterated mode by the DataContext (as in the previous example, a cast to DataContext is needed). DataContext returns an object called ResultIterator that is backed by an open ResultSet. Data is read from the ResultIterator one row at a time until it is exhausted. Data comes as DataRows regardless of whether the originating query was configured to fetch DataRows or not. A ResultIterator must be explicitly closed to avoid a JDBC resource leak.

Iterated queries provide constant memory performance for arbitrarily large ResultSets. This is true at least on the Cayenne end, as the JDBC driver may still decide to bring the entire ResultSet into JVM memory.

Here is a full example:

// you need to cast ObjectContext to DataContext to get access to 'performIteratedQuery'
DataContext dataContext = (DataContext) context;

// create a regular query
SelectQuery q = new SelectQuery(Artist.class);

// ResultIterator operations all throw checked CayenneException
// moreover 'finally' is required to close it
try {

    ResultIterator it = dataContext.performIteratedQuery(q);

    try {
        while(it.hasNextRow()) {
            // normally we'd read a row, process its data, and throw it away
            // this gives us constant memory performance
            Map row = (Map) it.nextRow();
            
            // do something with the row...
        }
    }
    finally {
        it.close();
    }
}
catch(CayenneException e) {
   e.printStackTrace();
}

Common sense also tells us that ResultIterators should be processed and closed as soon as possible to release the DB connection. E.g., storing open iterators between HTTP requests for an unpredictable length of time would quickly exhaust the connection pool.

Paginated Queries

Enabling query pagination allows loading very large result sets into a Java app with very little memory overhead (much smaller than even the DataRows option discussed above). Moreover, it is completely transparent to the application - the user gets what appears to be a list of Persistent objects; there's no iterator to close or DataRows to convert to objects:

SelectQuery query = new SelectQuery(Artist.class);
query.setPageSize(50);

// the fact that result is paginated is transparent
List<Artist> artists = context.performQuery(query);

Having said that, the DataRows option can be combined with pagination, providing the best of both worlds:

SelectQuery query = new SelectQuery(Artist.class);
query.setPageSize(50);
query.setFetchingDataRows(true);

List<DataRow> rows = context.performQuery(query);

Internally, pagination works by first fetching a list of IDs for the root entity of the query. This is very fast and initially takes very little memory. Then, when an object is requested at an arbitrary index in the list, that object and the adjacent objects (a "page" of objects, determined by the query's pageSize parameter) are fetched together by ID. Subsequent requests for objects of this "page" are served from memory.

An obvious limitation of pagination is that if you eventually access all objects in the list, memory use will end up the same as with no pagination. However, it is still a very useful approach. With some lists (e.g., multi-page search results) only a few top objects are normally accessed. At the same time, pagination allows estimating the full list size without fetching all the objects. And again - it is completely transparent and looks like a normal query.
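The lazy behavior described above can be sketched as follows (reusing the Artist example; only the ID list is fetched up front):

```java
SelectQuery query = new SelectQuery(Artist.class);
query.setPageSize(50);

List<Artist> artists = context.performQuery(query);

// size() is known from the initial ID fetch - no object page is resolved yet
int totalMatches = artists.size();

// accessing an element resolves only the 50-object page containing index 75;
// subsequent access to objects on that page is served from memory
Artist artist = artists.get(75);
```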

Caching and Fresh Data

Object Caching

Query Result Caching

Turning off Synchronization of ObjectContexts

By default, when a single ObjectContext commits its changes, all other contexts in the same runtime receive an event that contains all the committed changes. This allows them to update their cached object state to match the latest committed data. There are, however, many problems with this ostensibly helpful feature. In short, it works well in environments with few contexts and in unclustered scenarios, such as single-user desktop applications or simple webapps with only a few users. More specifically:

  • The performance of synchronization is (probably worse than) O(N), where N is the number of peer ObjectContexts in the system. In a typical webapp N can be quite large. Moreover, due to locking during synchronization, a given context's own performance depends not only on the queries it runs, but also on external events it does not control. This is unacceptable in most situations.

  • Commit events are untargeted - even contexts that do not hold a given updated object will receive the full event and will have to process it.

  • Clustering between JVMs doesn't scale - apps with large volumes of commits will quickly saturate the network with events, while most of those events will be thrown away on the receiving end, as mentioned above.

  • Some contexts may not want to be refreshed. A refresh in the middle of an operation may lead to unpredictable results.

  • Synchronization will interfere with optimistic locking.

So we've made a good case for disabling synchronization in most webapps. To do that, set the DI property Constants.SERVER_CONTEXTS_SYNC_PROPERTY to "false", using one of the standard Cayenne DI approaches. E.g., from the command line:

java -Dcayenne.server.contexts_sync_strategy=false

Or by changing the standard properties Map in a custom extensions module:

public class MyModule implements Module {

    @Override
    public void configure(Binder binder) {
        binder.bindMap(Constants.PROPERTIES_MAP).put(Constants.SERVER_CONTEXTS_SYNC_PROPERTY, "false");
    }
}