Design for Hybrid SQL/CLFS-Based Store in Qpid
==============================================

CLFS (Common Log File System) is a new facility in recent Windows
versions. CLFS is an ARIES-compliant log intended to support high
performance and transactional applications. CLFS is available in
Windows Server 2003R2 and higher, as well as Windows Vista and
Windows 7.

There is currently an all-SQL store in Qpid. The new hybrid SQL-CLFS
store moves the message, message-to-queue mapping, and transaction
aspects of the SQL store into CLFS logs. Records of queues, exchanges,
bindings, and configurations will remain in SQL. The main goal of this
change is to yield higher performance on the time-critical messaging
operations. CLFS, and therefore the new hybrid store, is not available
on Windows XP or on Windows Server versions prior to 2003R2; these
platforms will need to run the all-SQL store.

Note for future consideration: it is possible to maintain all durable
objects in CLFS, which would remove the need for SQL completely. It
would require added log handling as well as the logic to ensure
referential integrity between exchanges and queues via bindings, as SQL
does today. Also, the CLFS store counts on the SQL-stored queue records
being correct when recovering messages; if a message operation in the
log refers to a queue ID that's unknown, the CLFS store assumes the
queue was deleted in the previous broker session and the log wasn't
updated. That sort of assumption would need to be revisited if all
content moves to a log.

CLFS Capabilities
-----------------

This section explains some of the key CLFS concepts that are important
in order to understand the designed use of CLFS for the store. It is
not a complete explanation and is not feature-complete. Please see the
CLFS documentation at MSDN for complete details
(http://msdn.microsoft.com/en-us/library/bb986747%28v=VS.85%29.aspx).

CLFS provides logs; each log can be dedicated or multiplexed.
A multiplexed log has multiple streams of independent log records; a
dedicated log has only one stream. Each log uses containers to hold the
actual data; a log requires a minimum of two containers, each of which
must be at least 512KB. Thus, the smallest log possible is 1MB. Logs
can, of course, be larger, but with 1MB as the minimum size, logs
should not be created casually. The maximum number of streams per log
is approximately 100.

As records are written to the log, CLFS assigns Log Sequence Numbers
(LSNs). The first valid LSN in a log stream is called the Base, or
Tail. CLFS can automatically reclaim and reuse container space for the
log as the base LSN is moved when records are no longer needed. When a
log is multiplexed, a stream which doesn't move its tail can prevent
CLFS from reclaiming space and cause the log to grow indefinitely.
Thus, mixing streams which rarely update (and, thus, rarely move their
tails) with streams that are very dynamic in a single log will probably
cause the log to continue to expand even though much of the space will
be unused.

CLFS provides three LSN types that are used to chain records together:

- Next: This is a forward sequence maintained by CLFS itself by the
  order records are put into the stream.
- Undo-next, Undo-prev: These are backward-looking chains that are used
  to link a new record to some previous record(s) in the same stream.

Also note that although log files are ordinary files that are easy to
locate in the file system, the streams within a log cannot easily be
discovered or listed unless the application records the stream names
somewhere.

Log Usage
---------

There are two logs in use.

- Message: Each message will be represented by a chain of log records.
  All messages will be intermixed in the same dedicated stream. Each
  portion of a message's content (content is sometimes written in
  multiple chunks) as well as each operation involving a message
  (enqueue, dequeue, etc.)
  will be in a log record chained to the others related to the same
  message.
- Transaction: Each transaction, local and distributed, will be
  represented by a chain of log records. The record content will denote
  the transaction as local or distributed.

Both transaction and message logs use the LSN of the first record for a
given object (message or transaction) as the persistence ID for that
object. The LSN is a CLFS-maintained, always-increasing value that is
64 bits long, the same as a persistence ID.

Log records that relate to a transaction or message previously logged
use the log record undo-prev LSN to indicate which transaction/message
the record relates to.

Message Log Records
-------------------

Message log records will be one of the following types:

- Message-Start: the first (and possibly only) section of message
  content
- Message-Chunk: second and succeeding message content chunks
- Message-Delete: marks the end of the message's lifetime
- Message-Enqueue: records the message's placement on a queue
- Message-Dequeue: records the message's removal from a queue

The LSN of the Message-Start record is the persistence ID for the
message. The log record undo-prev LSN is used to link each subsequent
record for that message to the Message-Start record.

A message's sequence of log records is extended for each operation on
that message until the message is deleted, whereupon a Message-Delete
record is written. When the Message-Delete is written, the log's base
LSN can be moved up to the next earliest message if the deleted one
opens up a set of records at the tail of the log that are no longer
needed. To help maintain the order and know when the base can be moved,
the store keeps message information in an STL map whose key is the
message ID (Message-Start LSN). Thus, the first entry in the map is the
earliest ID/LSN in use.

During recovery, messages still residing in the log can be ignored when
the record sequence for the message ends with Message-Delete.
Similarly, the log may still hold records for a deleted message whose
Message-Start record has already been dropped from the log. In this
case the record's previous LSN is no longer within the log, so no
Message-Start record will have been recovered for it and the record can
be ignored.

Transaction Log Records
-----------------------

Transaction log records will be one of the following types:

- Dtx-Start: Start of a distributed transaction
- Tx-Start: Start of a local transaction
- End: End of the transaction
- Rollback: Marks that the transaction is rolled back
- Prepare: Marks the dtx as prepared
- Commit: Marks the transaction as committed
- Delete: Notes that the transaction is no longer valid

Transactions are also identified by the LSN of the start (Dtx-Start or
Tx-Start) record. Successive records associated with the same
transaction are linked backwards using the undo-prev LSN.

The association between messages and transactions is maintained in the
message log; if the message enqueue/dequeue operation is part of a
transaction, the operation includes a transaction ID. The transaction
log maintains the state of the transaction itself. Thus, each operation
(enqueue, dequeue, prepare, rollback, commit) is a single log record.

A few notes:

- The transactions need to be recovered and sorted out prior to
  recovering the messages. The message recovery needs to know if an
  enqueue/dequeue associated with a transaction can be discarded or
  should be acted on.
- Transaction IDs need to remain valid as long as any messages exist
  that refer to them. This prevents the problem of trying to recover a
  message with a transaction ID that doesn't exist - was it finalized?
  was it aborted? Reference to a missing transaction ID can be ignored
  with assurance that the message was deleted further along or the
  transaction would still be there.
- Keeping transaction IDs valid requires that a refcount be kept on
  each transaction at run time. As messages are deleted, the
  transaction set can be notified that the message is gone.
  To enforce this, Message objects have a boost::shared_ptr to each
  Transaction they're associated with. When the Message is destroyed,
  the reference counts on its Transactions go down too. When a
  Transaction is destroyed, nothing refers to it any longer, so it
  writes its Delete record to the log.

In-Memory Objects
-----------------

The store holds the message and transaction relationships in memory.
CLFS is a backing store for that information so it can be reliably
reconstructed in the event of a failure. This is a change from the
SQL-only store where all of the information is maintained in SQL and
none is kept in memory. The CLFS-using store is designed for
high-throughput operation where it is assumed that messages will
transit the broker (and, therefore, the store) quickly.

- Message list: this is a map of persistence ID (message LSN) to a list
  of queues where the message is located and an indication that there
  is (or isn't) a transaction involved and in which direction
  (enqueue/dequeue) so a dequeued message doesn't get deleted while a
  transacted enqueue is pending.
- Transaction list: also probably a map of id/LSN to a transaction
  object. The transaction object needs to keep a list of
  messages/queues that are impacted as well as the transaction state
  and Xid (for dtx).
- Right now log records are written as needed with no preallocation or
  reservation. It may be better to pre-reserve records in some cases,
  such as a transaction prepare where the space for commit or rollback
  may be reserved at the same time. This may be the only case where
  losing a record may be an issue - needs some more thought.

Recovery
--------

During recovery, the store needs to verify that the queues of recovered
messages still exist. If there is a failure after a queue's deletion is
final but before its messages are recorded as dequeued (and possibly
deleted), the remaining dequeues (and possibly message deletions) need
to be handled during recovery by not restoring those messages for the
broker and by logging their deletion.
Alternatively, recovery could skip logging the deletion and let the
normal tail maintenance eventually move up over the old message
entries. Since the invalid messages won't be kept in the message map,
their IDs won't be taken into account when maintaining the tail - the
tail will move up over them as soon as enough messages come and go.

Plugin Options
--------------

The command-line options added by the CLFS plugin are:

  --connect
      The SQL connect string for the SQL parts; same as the SQL plugin.

  --catalog
      The SQL database (catalog) name; same as the SQL plugin.

  --store-dir
      The directory to store the logs in. Defaults to the broker
      --data-dir value. If --no-data-dir is specified, --store-dir
      must be specified as well.

  --container-size
      The size of each container in the log, in bytes. The minimum
      size is 512K (smaller sizes will be rounded up). Additionally,
      the size will be rounded up to a multiple of the sector size on
      the disk holding the log. Once the log is created, each newly
      added container will be the same size as the initial
      container(s). Default is 1MB.

  --initial-containers
      The number of containers to populate a new log with if a new log
      is created. Ignored if the log exists. Default is 2.

  --max-write-buffers
      The maximum number of write buffers that the plugin can use
      before CLFS automatically flushes the log to disk. Lower values
      flush more often; higher values give higher performance.
      Default is 10.

Maybe need an option to hold messages of a certain size in memory? I
think maybe the broker proper holds the message content, so the store
need not.

Testing
-------

More tests will need to be written to stress the log container
extension capability and ensure that moving the base LSN works properly
and the store doesn't grow the log without bound.

Note that running "qpid-perftest --durable yes" stresses the log
extension and tail maintenance.
It doesn't get run as a normal regression test but should be run
whenever the container/tail maintenance logic is changed, to ensure
that logic is not broken.
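
As an illustration of the tail maintenance described under Message Log
Records, the following is a minimal sketch of how an ordered map keyed
by Message-Start LSN can indicate when the base LSN may be advanced. It
is not the store's actual code: the names Lsn, MessageInfo, and
MessageIndex are hypothetical, a plain 64-bit integer stands in for the
CLFS LSN type, and no CLFS calls are made. A real store would follow a
"true" result from remove() by asking CLFS to advance the log base.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

// Hypothetical stand-in for a CLFS LSN (a 64-bit, always-increasing value).
typedef std::uint64_t Lsn;

// Hypothetical per-message bookkeeping; a real store would also track the
// queues holding the message and any transaction involvement.
struct MessageInfo {
    int enqueues;
    MessageInfo() : enqueues(0) {}
};

class MessageIndex {
public:
    // Called when a Message-Start record is written; its LSN is the ID.
    void add(Lsn id) {
        bool inserted =
            messages_.insert(std::make_pair(id, MessageInfo())).second;
        assert(inserted);  // LSNs are unique, so an ID is never added twice
        (void)inserted;
    }

    // Called when a Message-Delete record is written. Returns true when
    // the removed message was the earliest one in the map, which means the
    // records at the tail of the log may no longer be needed and the base
    // LSN is a candidate for being moved up.
    bool remove(Lsn id) {
        std::map<Lsn, MessageInfo>::iterator it = messages_.find(id);
        if (it == messages_.end())
            return false;  // unknown ID: nothing to do
        bool wasOldest = (it == messages_.begin());
        messages_.erase(it);
        return wasOldest;
    }

    // Reports the earliest ID/LSN still in use. Returns false when no
    // messages remain; the caller then decides where the base should go.
    bool oldest(Lsn& out) const {
        if (messages_.empty())
            return false;
        out = messages_.begin()->first;  // std::map iterates in key order
        return true;
    }

private:
    std::map<Lsn, MessageInfo> messages_;
};
```

Because std::map keeps its keys sorted, finding the earliest live
message is just a look at the first entry. Deleting a message from the
middle of the map leaves the tail where it is; only deleting the first
entry opens up space at the tail of the log.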