Design for Hybrid SQL/CLFS-Based Store in Qpid
==============================================

CLFS (Common Log File System) is a new facility in recent Windows versions.
CLFS is an ARIES-compliant log intended to support high performance and
transactional applications. CLFS is available in Windows Server 2003R2 and
higher, as well as Windows Vista and Windows 7.

There is currently an all-SQL store in Qpid. The new hybrid SQL-CLFS store
moves the message, message-to-queue mapping, and transaction aspects
of the SQL store into CLFS logs. Records of queues, exchanges, bindings,
and configurations will remain in SQL. The main goal of this change is
to yield higher performance on the time-critical messaging operations.
CLFS and, therefore, the new hybrid store are not available on Windows XP
or on Windows Server prior to 2003R2; these platforms will need to run the
all-SQL store.

Note for future consideration: it is possible to maintain all durable
objects in CLFS, which would remove the need for SQL completely. It would
require added log handling as well as the logic to ensure referential
integrity between exchanges and queues via bindings as SQL does today.
Also, the CLFS store counts on the SQL-stored queue records being correct
when recovering messages; if a message operation in the log refers to a queue
ID that's unknown, the CLFS store assumes the queue was deleted in the
previous broker session and the log wasn't updated. That sort of assumption
would need to be revisited if all content moves to a log.

CLFS Capabilities
-----------------

This section explains some of the key CLFS concepts that are important
in order to understand the designed use of CLFS for the store. It is
not a complete description of CLFS. Please see the CLFS documentation at
MSDN for complete details
(http://msdn.microsoft.com/en-us/library/bb986747%28v=VS.85%29.aspx).

CLFS provides logs; each log can be dedicated or multiplexed. A multiplexed
log has multiple streams of independent log records; a dedicated log has
only one stream. Each log uses containers to hold the actual data; a log
requires a minimum of two containers, each of which must be at least 512KB.
Thus, the smallest possible log is 1MB. Logs can, of course, be larger, but
with a 1MB minimum size they should not be created casually.
The maximum number of streams per log is approximately 100.

As records are written to the log, CLFS assigns Log Sequence Numbers (LSNs).
The first valid LSN in a log stream is called the Base, or Tail. CLFS
can automatically reclaim and reuse container space for the log as the
base LSN is moved when records are no longer needed. When a log is multiplexed,
a stream which doesn't move its tail can prevent CLFS from reclaiming space
and cause the log to grow indefinitely. Thus, mixing streams that rarely
update (and, thus, rarely move their tails) with streams that are very dynamic
in a single log will probably cause the log to continue to expand even though
much of the space will be unused.

CLFS provides three LSN types that are used to chain records together:

- Next: This is a forward sequence maintained by CLFS itself, in the order
  in which records are put into the stream.
- Undo-next, Undo-prev: These are backward-looking chains that are used
  to link a new record to some previous record(s) in the same stream.

Also note that although log files are ordinary files that are easy to locate
in the file system, the streams within a log cannot easily be discovered or
listed unless the application records the stream names somewhere itself.

Log Usage
---------

There are two logs in use.

- Message: Each message will be represented by a chain of log records. All
  messages will be intermixed in the same dedicated stream. Each portion of
  a message content (sometimes they are written in multiple chunks) as well
  as each operation involving a message (enqueue, dequeue, etc.) will be
  in a log record chained to the others related to the same message.

- Transaction: Each transaction, local and distributed, will be represented
  by a chain of log records. The record content will denote the transaction
  as local or distributed.

Both transaction and message logs use the LSN of the first record for a
given object (message or transaction) as the persistence ID for that object.
The LSN is a CLFS-maintained, always-increasing value that is 64 bits long,
the same as a persistence ID.

Log records that relate to a transaction or message previously logged use the
log record undo-prev LSN to indicate which transaction/message the record
relates to.

Message Log Records
-------------------

Message log records will be one of the following types:

- Message-Start: the first (and possibly only) section of message content
- Message-Chunk: second and succeeding message content chunks
- Message-Delete: marks the end of the message's lifetime
- Message-Enqueue: records the message's placement on a queue
- Message-Dequeue: records the message's removal from a queue

The LSN of the Message-Start record is the persistence ID for the message.
The log record undo-prev LSN is used to link each subsequent record for that
message to the Message-Start record.
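
The chaining described above can be illustrated with a small in-memory
stand-in for the message stream. This is only a sketch: FakeMessageLog,
MsgRecord, kNoLsn, and persistenceIdOf are hypothetical names invented for
the example, LSNs are modeled as a simple counter, and no real CLFS calls
are made.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Lsn = std::uint64_t;
const Lsn kNoLsn = 0;  // assumption: 0 never names a real record in this sketch

// The five message record types listed above.
enum class MsgRecType { Start, Chunk, Delete, Enqueue, Dequeue };

struct MsgRecord {
    Lsn lsn;       // assigned by the log when the record is appended
    Lsn undoPrev;  // Message-Start LSN for every record except Message-Start
    MsgRecType type;
};

// Hypothetical stand-in for the dedicated message stream.
class FakeMessageLog {
public:
    // Append a record and return the LSN assigned to it.
    Lsn append(MsgRecType type, Lsn undoPrev = kNoLsn) {
        // Only a Message-Start record may omit the back link.
        assert((type == MsgRecType::Start) == (undoPrev == kNoLsn));
        Lsn lsn = next_++;
        records_.push_back(MsgRecord{lsn, undoPrev, type});
        return lsn;
    }

    const std::vector<MsgRecord>& records() const { return records_; }

private:
    Lsn next_ = 1;  // always-increasing, like a real LSN
    std::vector<MsgRecord> records_;
};

// Persistence ID of the message a record belongs to.
inline Lsn persistenceIdOf(const MsgRecord& r) {
    return r.type == MsgRecType::Start ? r.lsn : r.undoPrev;
}
```

Records for different messages are interleaved in the one stream; the
undo-prev link is what lets a reader regroup them by message.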

A message's sequence of log records is extended for each operation on that
message, until the message is deleted whereupon a Message-Delete record is
written. When the Message-Delete is written, the log's base LSN can be moved
up to the next earliest message if the deleted one opens up a set of
records at the tail of the log that are no longer needed. To help maintain
the order and know when the base can be moved, the store keeps message
information in an STL map whose key is the message ID (Message-Start LSN).
Thus, the first entry in the map is the earliest ID/LSN in use.
During recovery, messages still residing in the log can be ignored when the
record sequence for the message ends with Message-Delete. Similarly, there
may be log records for deleted messages whose Message-Start record has
already been reclaimed; in this case the undo-prev LSN won't be one that's
still within the log, so no Message-Start record will have been recovered
and the record can be ignored.
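
A minimal sketch of the tail bookkeeping follows. MessageIndex and
MessageInfo are hypothetical names, and the choice of the last written LSN
as the new base when no messages remain is an assumption of this example,
not something the design specifies.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

using Lsn = std::uint64_t;

struct MessageInfo {};  // placeholder for per-message state (queues, etc.)

// Hypothetical index keyed by Message-Start LSN. std::map keeps keys
// sorted, so begin() is always the earliest message still in use.
class MessageIndex {
public:
    void add(Lsn id) { messages_.emplace(id, MessageInfo{}); }

    // Remove a deleted message. Returns true and sets newBase when the
    // deleted message was the oldest one, so the base LSN can advance.
    bool remove(Lsn id, Lsn lastWritten, Lsn& newBase) {
        auto it = messages_.find(id);
        assert(it != messages_.end());
        bool wasOldest = (it == messages_.begin());
        messages_.erase(it);
        if (!wasOldest)
            return false;  // an older message still pins the tail
        // Assumption: with no live messages, everything up to the last
        // written record is reclaimable.
        newBase = messages_.empty() ? lastWritten : messages_.begin()->first;
        return true;
    }

private:
    std::map<Lsn, MessageInfo> messages_;
};
```

Deleting a message from the middle of the map leaves the base alone; only
deleting the first entry frees records at the tail of the log.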

Transaction Log Records
-----------------------

Transaction log records will be one of the following types:

- Dtx-Start: Start of a distributed transaction
- Tx-Start: Start of a local transaction
- End: End of the transaction
- Rollback: Marks that the transaction is rolled back
- Prepare: Marks the dtx as prepared
- Commit: Marks the transaction as committed
- Delete: Notes that the transaction is no longer valid

Transactions are also identified by the LSN of the start (Dtx-Start or
Tx-Start) record. Successive records associated with the same transaction
are linked backwards using the undo-prev LSN.

The association between messages and transactions is maintained in the
message log; if the message enqueue/dequeue operation is part of a transaction,
the operation includes a transaction ID. The transaction log maintains the
state of the transaction itself. Thus, each operation (enqueue, dequeue,
prepare, rollback, commit) is a single log record.

A few notes:
- The transactions need to be recovered and sorted out prior to recovering
  the messages. The message recovery needs to know if an enqueue/dequeue
  associated with a transaction can be discarded or should be acted on.

- Transaction IDs need to remain valid as long as any messages exist that
  refer to them. This prevents the problem of trying to recover a message
  with a transaction ID that doesn't exist - was it finalized? was it aborted?
  Reference to a missing transaction ID can be ignored with assurance that
  the message was deleted further along or the transaction would still be there.

- Transaction IDs needing to be valid requires that a refcount be kept on each
  transaction at run time. As messages are deleted, the transaction set can
  be notified that the message is gone. To enforce this, Message objects have
  a boost::shared_ptr to each Transaction they're associated with. When the
  Message is destroyed, refs to Transactions go down too. When a Transaction
  is destroyed, it is finished, so its Delete record is written to the log.
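
The reference-counting scheme in the last note might look roughly like the
sketch below. It uses std::shared_ptr in place of boost::shared_ptr so the
example has no external dependencies, and TxnLog with its writeDelete
method is a hypothetical stand-in for the transaction log, not the store's
real interface.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

using Lsn = std::uint64_t;

// Hypothetical sink standing in for the transaction log.
struct TxnLog {
    std::vector<Lsn> deleted;
    void writeDelete(Lsn id) { deleted.push_back(id); }
};

class Transaction {
public:
    Transaction(Lsn id, TxnLog& log) : id_(id), log_(log) {}

    // Runs when the last shared_ptr goes away: no message refers to this
    // transaction any more, so its Delete record can be written.
    ~Transaction() { log_.writeDelete(id_); }

    Transaction(const Transaction&) = delete;
    Transaction& operator=(const Transaction&) = delete;

    Lsn id() const { return id_; }

private:
    Lsn id_;
    TxnLog& log_;
};

struct Message {
    // Each reference keeps the transaction ID valid in the log.
    std::vector<std::shared_ptr<Transaction>> txns;
};
```

With this arrangement the Delete record cannot be written while any
message still refers to the transaction, which is the property the notes
above require.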

In-Memory Objects
-----------------

The store holds the message and transaction relationships in memory. CLFS is
a backing store for that information so it can be reliably reconstructed in
the event of a failure. This is a change from the SQL-only store where all
of the information is maintained in SQL and none is kept in memory. The
CLFS-using store is designed for high-throughput operation where it is assumed
that messages will transit the broker (and, therefore, the store) quickly.

- Message list: this is a map of persistence ID (message LSN) to a list of
  queues where the message is located and an indication that there is
  (or isn't) a transaction involved and in which direction (enqueue/dequeue)
  so a dequeued message doesn't get deleted while a transacted enqueue is
  pending.

- Transaction list: also probably a map of id/LSN to a transaction object.
  The transaction object needs to keep a list of messages/queues that are
  impacted as well as the transaction state and Xid (for dtx).

- Right now log records are written as needed with no preallocation or
  reservation. It may be better to pre-reserve records in some cases, such
  as a transaction prepare where the space for commit or rollback may be
  reserved at the same time. This may be the only case where losing a
  record may be an issue - needs some more thought.

Recovery
--------

During recovery, the store needs to verify that the queues of recovered
messages exist. If there's a failure after a queue's deletion is final but
before its messages are recorded as dequeued (and possibly deleted), the
remaining dequeues (and possible message deletions) need to be handled
during recovery by not restoring those messages for the broker, and also by
logging their deletion. The store could also skip the logging of deletion
and let the normal tail-maintenance eventually move up over the old message
entries. Since the invalid messages won't be kept in the message map, their
IDs won't be taken into account when maintaining the tail; the tail will
move up over them as soon as enough messages come and go.
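
The queue check can be sketched as a filter over the recovered messages.
RecoveredMessages and pruneUnknownQueues are hypothetical names, and the
sketch assumes that a message left with no queues is dropped from the map
entirely; whether its deletion is also logged is left to the caller.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

using Lsn = std::uint64_t;
using QueueId = std::uint64_t;

// Recovered message ID -> queues it is enqueued on (hypothetical shape).
using RecoveredMessages = std::map<Lsn, std::vector<QueueId>>;

// Remove references to queues that the SQL side no longer knows about.
// Messages left with no queue are erased from the map, and their IDs are
// returned so the caller can log their deletion if it chooses to.
std::vector<Lsn> pruneUnknownQueues(RecoveredMessages& msgs,
                                    const std::set<QueueId>& knownQueues) {
    std::vector<Lsn> dropped;
    for (auto it = msgs.begin(); it != msgs.end();) {
        std::vector<QueueId>& qs = it->second;
        qs.erase(std::remove_if(qs.begin(), qs.end(),
                                [&](QueueId q) {
                                    return knownQueues.count(q) == 0;
                                }),
                 qs.end());
        if (qs.empty()) {
            dropped.push_back(it->first);
            it = msgs.erase(it);  // not restored for the broker
        } else {
            ++it;
        }
    }
    return dropped;
}
```

Because dropped messages never enter the message map, they do not hold
back the tail, matching the behavior described above.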

Plugin Options
--------------

The command-line options added by the CLFS plugin are:

  --connect             The SQL connect string for the SQL parts; same as the
                        SQL plugin.
  --catalog             The SQL database (catalog) name; same as the SQL plugin.
  --store-dir           The directory to store the logs in. Defaults to the
                        broker --data-dir value. If --no-data-dir is specified,
                        --store-dir must be specified as well.
  --container-size      The size of each container in the log, in bytes. The
                        minimum size is 512K (smaller sizes will be rounded up).
                        Additionally, the size will be rounded up to a multiple
                        of the sector size on the disk holding the log. Once
                        the log is created, each newly added container will
                        be the same size as the initial container(s). Default
                        is 1MB.
  --initial-containers  The number of containers to populate a new log with
                        if a new log is created. Ignored if the log exists.
                        Default is 2.
  --max-write-buffers   The maximum number of write buffers that the plugin can
                        use before CLFS automatically flushes the log to disk.
                        Lower values flush more often; higher values have
                        higher performance. Default is 10.
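
As an illustration of the --container-size rounding rules, the arithmetic
can be sketched as follows. effectiveContainerSize is a hypothetical
helper, and applying the 512K minimum before the sector rounding is an
assumption of this sketch; the actual rounding may be done by the plugin
or by CLFS itself.

```cpp
#include <cassert>
#include <cstdint>

const std::uint64_t kMinContainerSize = 512 * 1024;  // 512K minimum

// Round a requested container size up to the minimum, then up to the
// next multiple of the disk's sector size.
std::uint64_t effectiveContainerSize(std::uint64_t requested,
                                     std::uint64_t sectorSize) {
    assert(sectorSize > 0);
    std::uint64_t size =
        requested < kMinContainerSize ? kMinContainerSize : requested;
    std::uint64_t rem = size % sectorSize;
    return rem == 0 ? size : size + (sectorSize - rem);
}
```

For example, a request of 524289 bytes on a disk with 4096-byte sectors
would come out as 528384 bytes under these assumptions.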

  Maybe need an option to hold messages of a certain size in memory? I think
  maybe the broker proper holds the message content, so the store need not.

Testing
-------

More tests will need to be written to stress the log container extension
capability and ensure that moving the base LSN works properly and the store
doesn't continually grow the log without bounds.

Note that running "qpid-perftest --durable yes" stresses the log extension
and tail maintenance. It doesn't get run as a normal regression test but should
be run when playing with the container/tail maintenance logic to ensure it's
not broken.