Chapter 11. Performance Tuning

Table of Contents

Prefetching
Data Rows
Specific Attributes and Relationships with EJBQL
Iterated Queries
Paginated Queries
Caching and Fresh Data
Turning off Synchronization of ObjectContexts

Prefetching

Prefetching is a technique that allows bringing back in one query not only the queried objects, but also the objects related to them. In other words, it is a controlled eager relationship resolving mechanism. Prefetching is discussed in the "Performance Tuning" chapter because it is a powerful performance optimization method. However, another common application of prefetching is to refresh stale object relationships, so more generally it can be viewed as a technique for managing subsets of the object graph.

Prefetching example:

ObjectSelect<Artist> query = ObjectSelect.query(Artist.class);

// instructs Cayenne to prefetch one of Artist's relationships
query.prefetch(Artist.PAINTINGS.disjoint());

// the above line is equivalent to the following: 
// query.prefetch("paintings", PrefetchTreeNode.DISJOINT_PREFETCH_SEMANTICS);

// query is executed as usual, but the resulting Artists will have
// their paintings "inflated"
List<Artist> artists = query.select(context);

All types of relationships can be prefetched - to-one, to-many, flattened. A prefetch can also span multiple relationships:

query.prefetch(Artist.PAINTINGS.dot(Painting.GALLERY).disjoint());

A query can have multiple prefetches:

query.prefetch(Artist.PAINTINGS.disjoint());
query.prefetch(Artist.PAINTINGS.dot(Painting.GALLERY).disjoint());

If a query fetches DataRows, all "disjoint" prefetches are ignored and only "joint" prefetches are executed (see the prefetching semantics discussion below for what disjoint and joint prefetches mean).
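
For instance, a minimal sketch (reusing the Artist model and context from the examples above) of a DataRow query that still resolves paintings via a joint prefetch:

// DataRow fetch with a joint prefetch; a disjoint prefetch on this
// query would be silently ignored
List<DataRow> rows = ObjectSelect.dataRowQuery(Artist.class)
    .prefetch(Artist.PAINTINGS.joint())
    .select(context);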

Prefetching Semantics

Prefetching semantics defines the strategy used to prefetch relationships. Depending on the semantics, Cayenne generates different types of queries. The end result is the same - query root objects with related objects fully resolved. However, the semantics can affect performance, in some cases significantly. There are 3 types of prefetch semantics, all defined as constants in org.apache.cayenne.query.PrefetchTreeNode:

PrefetchTreeNode.JOINT_PREFETCH_SEMANTICS
PrefetchTreeNode.DISJOINT_PREFETCH_SEMANTICS
PrefetchTreeNode.DISJOINT_BY_ID_PREFETCH_SEMANTICS

There's no limitation on mixing different types of semantics in the same query; each prefetch can have its own semantics. SelectQuery uses DISJOINT_PREFETCH_SEMANTICS by default. ObjectSelect requires explicit semantics, as we've seen above. SQLTemplate and ProcedureQuery both use JOINT_PREFETCH_SEMANTICS, and this cannot be changed due to the nature of those two queries.
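
For instance, a sketch (assuming the Artist/Painting model used above) of mixing semantics within a single ObjectSelect:

ObjectSelect<Artist> query = ObjectSelect.query(Artist.class);

// paintings are fetched by a separate disjoint query...
query.prefetch(Artist.PAINTINGS.disjoint());

// ...while the gallery path is resolved via an OUTER join in the main query
query.prefetch(Artist.PAINTINGS.dot(Painting.GALLERY).joint());

List<Artist> artists = query.select(context);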

Disjoint Prefetching Semantics

This semantics results in Cayenne generating one SQL statement for the main objects, and a separate statement for each prefetch path (hence "disjoint" - related objects are not fetched with the main query). Each additional SQL statement uses the qualifier of the main query plus a set of joins traversing the prefetch path between the main and related entity.

This strategy has an advantage of efficient JVM memory use, and faster overall result processing by Cayenne, but it requires (1+N) SQL statements to be executed, where N is the number of prefetched relationships.

Disjoint-by-ID Prefetching Semantics

This is a variation of disjoint prefetch where related objects are matched against a set of IDs derived from the fetched main objects (or from intermediate objects in a multi-step prefetch). Since most databases can't parse arbitrarily large SQL, Cayenne limits the size of the generated WHERE clause and breaks prefetch queries into smaller ones. The size of each query is controlled by the DI property Constants.SERVER_MAX_ID_QUALIFIER_SIZE_PROPERTY (the default number of conditions in the generated WHERE clause is 10000). Cayenne will generate (1 + N * M) SQL statements for each query using disjoint-by-ID prefetches, where N is the number of prefetched relationships and M is the number of queries per prefetch, which depends on the number of objects in the result (ideally M = 1).
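
As a sketch, this limit can be tuned via the standard properties map in a custom module, in the same way as the other DI properties shown later in this chapter (the value 5000 is just an arbitrary example):

ServerRuntimeBuilder
  .builder()
  .addModule((binder) ->
     binder.bindMap(Constants.PROPERTIES_MAP)
           .put(Constants.SERVER_MAX_ID_QUALIFIER_SIZE_PROPERTY, "5000")
  )
  .build();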

The advantage of this type of prefetch is that matching database rows by ID may be much faster than matching the qualifier of the original query. Moreover, this is the only type of prefetch that can handle queries with a fetch limit; both joint and regular disjoint prefetches may produce invalid results or generate inefficient fetch-the-entire-table SQL when a fetch limit is in effect.
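
For example, a sketch (again using the Artist model from above) combining a fetch limit with a disjoint-by-ID prefetch:

// only disjoint-by-ID prefetching is safe to combine with a fetch limit
List<Artist> artists = ObjectSelect.query(Artist.class)
    .prefetch(Artist.PAINTINGS.disjointById())
    .limit(50)
    .select(context);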

The disadvantage is that the query SQL can get unwieldy for large result sets, as each object needs its own condition in the WHERE clause of the generated SQL.

Joint Prefetching Semantics

Joint semantics results in a single SQL statement for root objects and any number of jointly prefetched paths. Cayenne processes in memory a cartesian product of the entities involved, converting it to an object tree. It uses OUTER joins to connect prefetched entities.

Joint is the most efficient prefetch type of the three as far as generated SQL goes - there is always just one SQL query. Its downsides are the potentially increased amount of data that has to travel across the network between the application server and the database, and the extra data processing that needs to be done on the Cayenne side.
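
A minimal sketch of the earlier prefetch resolved with joint semantics:

// a single SQL statement with an OUTER join to the painting table
List<Artist> artists = ObjectSelect.query(Artist.class)
    .prefetch(Artist.PAINTINGS.joint())
    .select(context);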

Similar Behaviours Using EJBQL

It is possible to achieve similar behaviours with EJBQLQuery queries by employing the "FETCH" keyword.

SELECT a FROM Artist a LEFT JOIN FETCH a.paintings

In this case, the Paintings that exist for the Artist will be obtained at the same time as the Artists are fetched. Refer to third-party query language documentation for further detail on this mechanism.
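
As a sketch, the same EJBQL can be executed via EJBQLQuery (assuming the context variable used in the earlier examples):

EJBQLQuery query = new EJBQLQuery(
    "SELECT a FROM Artist a LEFT JOIN FETCH a.paintings");

// the returned Artists have their paintings already resolved
List<Artist> artists = context.performQuery(query);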

Data Rows

Converting result set data to Persistent objects and registering these objects in the ObjectContext can be an expensive operation, comparable to the time spent running the query (and frequently exceeding it). Internally Cayenne builds the result as a list of DataRows that are later converted to objects. Skipping this last step and using the data in the form of DataRows can significantly increase performance.

A DataRow is simply a map of values keyed by their DB column names. It is a ubiquitous representation of DB data used internally by Cayenne, and in many cases it is quite usable as is in the application. So performance-sensitive selects should consider DataRows - they save memory and CPU cycles. All selecting queries support the DataRows option, e.g.:

ObjectSelect<DataRow> query = ObjectSelect.dataRowQuery(Artist.class);
List<DataRow> rows = query.select(context);

SQLSelect<DataRow> query = SQLSelect.dataRowQuery("SELECT * FROM ARTIST");
List<DataRow> rows = query.select(context);

Individual DataRows may be converted to Persistent objects as needed. So, for example, you may implement some in-memory filtering, converting only a subset of the fetched rows to objects:

// you need to cast ObjectContext to DataContext to get access to 'objectFromDataRow'
DataContext dataContext = (DataContext) context;

for(DataRow row : rows) {
    if(row.get("DATE_OF_BIRTH") != null) {
        Artist artist = dataContext.objectFromDataRow(Artist.class, row);
        // do something with Artist...
        ...
    }
}

Specific Attributes and Relationships with EJBQL

It is possible to fetch specific attributes and relationships from a model using EJBQLQuery. The following example would return a java.util.List of String objects:

SELECT a.name FROM Artist a

The following will yield a java.util.List containing Object[] instances, each of which would contain the name followed by the dateOfBirth value.

SELECT a.name, a.dateOfBirth FROM Artist a
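
A sketch of running such a query and unpacking the result (assuming the same context variable as in the earlier examples):

EJBQLQuery query = new EJBQLQuery("SELECT a.name, a.dateOfBirth FROM Artist a");

List<Object[]> rows = context.performQuery(query);
for (Object[] row : rows) {
    String name = (String) row[0];
    Object dateOfBirth = row[1];
    // do something with the values...
}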

Refer to third-party query language documentation for further detail on this mechanism.

Iterated Queries

While contemporary hardware may easily allow applications to fetch hundreds of thousands or even millions of objects into memory, that doesn't mean it is always a good idea to do so. You can optimize processing of very large result sets with the two techniques discussed in this and the following section - iterated and paginated queries.

An iterated query is not actually a special query. Any selecting query can be executed in iterated mode by an ObjectContext. The ObjectContext creates an object called ResultIterator that is backed by an open ResultSet. The iterator provides constant memory performance for arbitrarily large ResultSets. This is true at least on the Cayenne end, as the JDBC driver may still decide to bring the entire ResultSet into the JVM memory.

Data is read from the ResultIterator one row/object at a time until it is exhausted. There are two styles of accessing a ResultIterator - direct access, which requires explicit closing to avoid JDBC resource leaks, and a callback that lets Cayenne handle resource management. In both cases iteration can be performed with a "for" loop, as ResultIterator is "Iterable".

Direct access. Here common sense tells us that ResultIterator instances should be processed and closed as soon as possible to release the DB connection. E.g. storing open iterators between HTTP requests for an unpredictable length of time would quickly exhaust the connection pool.

try(ResultIterator<Artist> it = ObjectSelect.query(Artist.class).iterator(context)) {
    for(Artist a : it) {
       // do something with the object...
       ...
    }           
}

Same thing with a callback:

ObjectSelect.query(Artist.class).iterate(context, (Artist a) -> {
    // do something with the object...
    ...
});

Another option is a batch iterator that allows processing more than one object per iteration. This is a common scenario in various data processing jobs - read a batch of objects, process them, commit the results, and then repeat. Batching allows further optimization of processing (e.g. by avoiding frequent commits).

try(ResultBatchIterator<Artist> it = ObjectSelect.query(Artist.class).batchIterator(context, 100)) {
    for(List<Artist> list : it) {
       // do something with each list
       ...
       // possibly commit your changes
       context.commitChanges();
    }           
}

Paginated Queries

Enabling query pagination allows loading very large result sets into a Java app with very little memory overhead (much smaller than even the DataRows option discussed above). Moreover, it is completely transparent to the application - the user gets what appears to be a list of Persistent objects; there's no iterator to close or DataRows to convert to objects:

// the fact that result is paginated is transparent
List<Artist> artists = 
    ObjectSelect.query(Artist.class).pageSize(50).select(context);

Having said that, the DataRows option can be combined with pagination, providing the best of both worlds:

List<DataRow> rows = 
    ObjectSelect.dataRowQuery(Artist.class).pageSize(50).select(context);

Internally, pagination first fetches a list of IDs for the root entity of the query. This is very fast and initially takes very little memory. Then, when an object is requested at an arbitrary index in the list, this object and the adjacent objects (a "page" of objects whose size is determined by the query's pageSize parameter) are fetched together by ID. Subsequent requests for the objects of this "page" are served from memory.

An obvious limitation of pagination is that if you eventually access all objects in the list, memory use will end up being the same as with no pagination. However, it is still a very useful approach. With some lists (e.g. multi-page search results) only a few top objects are normally accessed. At the same time, pagination allows estimating the full list size without fetching all the objects. And again - it is completely transparent and looks like a normal query.
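
A short sketch of these points (assuming the Artist model used throughout):

List<Artist> artists =
    ObjectSelect.query(Artist.class).pageSize(50).select(context);

// resolved from the initially fetched ID list, without materializing objects
int total = artists.size();

// touching an element fetches just that element's page of 50 objects
Artist first = artists.get(0);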

Caching and Fresh Data

Object Caching

Query Result Caching

Cayenne supports mostly transparent caching of query results. There are two levels of cache: local (i.e. results cached by the ObjectContext) and shared (i.e. results cached at the stack level and shared between all contexts). The local cache is much faster than the shared one, but it is limited to a single context. It is often used with a shared read-only ObjectContext.

To take advantage of query result caching, the first step is to mark your queries appropriately. Here is an example for an ObjectSelect query; other types of queries have a similar API:

ObjectSelect.query(Artist.class).localCache("artists");

This tells Cayenne that the query created here would like to use the local cache of the context it is executed against. The vararg parameter of the localCache() (or sharedCache()) method contains so-called "cache groups". These are arbitrary names that allow you to categorize queries for the purpose of setting cache policies or explicitly invalidating the cache. More on that below.
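
For example, a sketch of the shared (cross-context) flavor of the same API using the "artists" cache group:

// repeated identical queries with the same cache settings are served
// from the "artists" cache group instead of hitting the database
List<Artist> artists = ObjectSelect.query(Artist.class)
    .sharedCache("artists")
    .select(context);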

The above API is enough for caching to work, but by default your cache is an unmanaged LRU map. You can't control its size, expiration policies, etc. For a managed cache, you will need to explicitly use one of the more advanced cache providers. One such provider available in Cayenne is the provider for EhCache. It can be enabled on ServerRuntime startup in a custom Module:

ServerRuntimeBuilder
  .builder()
  .addModule((binder) -> 
     binder.bind(QueryCache.class).to(EhCacheQueryCache.class)
  )
  .build();

By default EhCache reads a file called "ehcache.xml" located on the classpath. You can put your cache configuration in that file, e.g.:

<ehcache xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:noNamespaceSchemaLocation="ehcache.xsd" updateCheck="false"
	monitoring="off" dynamicConfig="false">

	<defaultCache maxEntriesLocalHeap="1000" eternal="false"
		overflowToDisk="false" timeToIdleSeconds="3600" timeToLiveSeconds="3600">
	</defaultCache>
	
	<cache name="artists" timeToLiveSeconds="20" maxEntriesLocalHeap="100" />
</ehcache>

The example shows how to configure default cache settings ("defaultCache") as well as settings for a named cache group ("artists"). For many other things you can put in "ehcache.xml" refer to EhCache docs.

Often "passive" cache expiration policies similar to shown above are not sufficient, and the users want real-time cache invalidation when the data changes. So in addition to those policies, the app can invalidate individual cache groups explicitly with RefreshQuery:

RefreshQuery refresh = new RefreshQuery("artists");
context.performGenericQuery(refresh);

The above can be used, e.g., to build a UI for manual cache invalidation. It is also possible to automate cache refresh when certain entities are committed. This requires including the cayenne-lifecycle.jar dependency. From that library you will need two things: the @CacheGroups annotation to mark entities that generate cache invalidation events, and the CacheInvalidationFilter that catches updates to the annotated objects and generates the appropriate invalidation events:

// configure filter on startup
ServerRuntimeBuilder
  .builder()
  .addModule((binder) -> 
     binder.bindList(Constants.SERVER_DOMAIN_FILTERS_LIST).add(CacheInvalidationFilter.class)
  )
  .build();

Now you can associate entities with cache groups, so that commits to those entities automatically invalidate the groups:

@CacheGroups("artists")
public class Artist extends _Artist {
}

Finally you may cluster cache group events. They are very small and can be efficiently sent over the wire to other JVMs running Cayenne. An example of Cayenne setup with event clustering is available on GitHub.

Turning off Synchronization of ObjectContexts

By default, when a single ObjectContext commits its changes, all other contexts in the same runtime receive an event that contains all the committed changes. This allows them to update their cached object state to match the latest committed data. There are, however, many problems with this ostensibly helpful feature. In short, it works well in environments with few contexts and in unclustered scenarios, such as single-user desktop applications or simple webapps with only a few users. More specifically:

  • The performance of synchronization is (probably worse than) O(N), where N is the number of peer ObjectContexts in the system. In a typical webapp N can be quite large. Besides, due to locking during synchronization, any given context's performance will depend not only on the queries it runs, but also on external events it does not control. This is unacceptable in most situations.

  • Commit events are untargeted - even contexts that do not hold a given updated object will receive the full event that they will have to process.

  • Clustering between JVMs doesn't scale - apps with large volumes of commits will quickly saturate the network with events, while most of those will be thrown away on the receiving end as mentioned above.

  • Some contexts may not want to be refreshed. A refresh in the middle of an operation may lead to unpredictable results.

  • Synchronization will interfere with optimistic locking.

So we've made a good case for disabling synchronization in most webapps. To do that, set the Constants.SERVER_CONTEXTS_SYNC_PROPERTY DI property to "false", using one of the standard Cayenne DI approaches. E.g. from the command line:

$ java -Dcayenne.server.contexts_sync_strategy=false

Or by changing the standard properties Map in a custom extensions module:

public class MyModule implements Module {

    @Override
    public void configure(Binder binder) {
        binder.bindMap(Constants.PROPERTIES_MAP).put(Constants.SERVER_CONTEXTS_SYNC_PROPERTY, "false");
    }
}