[djg: comments like this are from dean] This past summer, Alexei and I wrote a spec for an I/O Filters API... this proposal addresses one part of that -- 'stacked' I/O with buff.c. We have a couple of options for stacked I/O: we can either use existing code, such as sfio, or we can rewrite buff.c to do it. We've gone over the first possibility at length, though, and there were problems with each implemenation which was mentioned (licensing and compatibility, specifically); so far as I know, those remain issues. Btw -- sfio will be supported w/in this model... it just wouldn't be the basis for the model's implementation. -- Ed Korthof | Web Server Engineer -- -- ed@organic.com | Organic Online, Inc -- -- (415) 278-5676 | Fax: (415) 284-6891 -- --------------------------------------------------------------------------- Stacked I/O With BUFFs Sections: 1.) Overview 2.) The API User-supplied structures API functions 3.) Detailed Description The bfilter structure The bbottomfilter structure The BUFF structure Public functions in buff.c 4.) Efficiency Considerations Buffering Memory copies Function chaining writev 5.) Code in buff.c Default Functions Heuristics for writev Writing Reading Flushing data Closing stacks and filters Flags and Options ************************************************************************* Overview The intention of this API is to make Apache's BUFF structure modular while retaining high efficiency. Basically, it involves rewriting buff.c to provide 'stacked' I/O -- where the data passed through a series of 'filters', which may modify it. There are two parts to this, the core code for BUFF structures, and the "filters" used to implement new behavior. "filter" is used to refer to both the sets of 5 functions, as shown in the bfilter structure in the next section, and to BUFFs which are created using a specific bfliter. These will also be occasionally refered to as "user-supplied", though the Apache core will need to use these as well for basic functions. The user-supplied functions should use only the public BUFF API, rather than any internal details or functions. One thing which may not be clear is that in the core BUFF functions, the BUFF pointer passed in refers to the BUFF on which the operation will happen. OTOH, in the user-supplied code, the BUFF passed in is the next buffer down the chain, not the current one. ************************************************************************* The API User-supplied structures First, the bfilter structure is used in all filters: typedef struct { int (*writev)(BUFF *, void *, struct iovect *, int); int (*read)(BUFF *, void *, char *, int); int (*write)(BUFF *, void *, const char *, int); int (*flush)(BUFF *, void *, const char *, int, bfilter *); int (*transmitfile)(BUFF *, void *, file_info_ptr *); void (*close)(BUFF *, void *); } bfilter; bfilters are placed into a BUFF structure along with a user-supplied void * pointer. Second, the following structure is for use with a filter which can sit at the bottom of the stack: typedef struct { void *(*bgetfileinfo)(BUFF *, void *); void (*bpushfileinfo)(BUFF *, void *, void *); } bbottomfilter; BUFF API functions The following functions are new BUFF API functions: For filters: BUFF * bcreatestack(pool *p, int flags, struct bfilter *, struct bbottomfilter *, void *); BUFF * bpushfilter (BUFF *, struct bfilter *, void *); BUFF * bpushbuffer (BUFF *, BUFF *); BUFF * bpopfilter(BUFF *); BUFF * bpopbuffer(BUFF *); void bclosestack(BUFF *); For BUFFs in general: int btransmitfile(BUFF *, file_info_ptr *); int bsetstackopts(BUFF *, int, const void *); int bsetstackflags(BUFF *, int, int); Note that a new flag is needed for bsetstackflags: B_MAXBUFFERING The current bcreate should become BUFF * bcreatebuffer (pool *p, int flags, struct bfilter *, void *); ************************************************************************* Detailed Explanation bfilter structure The void * pointer used in all these functions, as well as those in the bbottomfilter structure and the filter API functions, is always the same pointer w/in an individual BUFF. The first function in a bfilter structure is 'writev'; this is only needed for high efficiency writing, generally at the level of the system interface. In it's absence, multiple writes will be done w/ 'write'. Note that defining 'writev' means you must define 'write'. The second is 'write'; this is the generic writing function, taking a BUFF * to which to write, a block of text, and the length of that block of text. The expected return is the number of characters (out of that block of text) which were successfully processed (rather than the number of characters actually written). The third is 'read'; this is the generic reading function, taking a BUFF * from which to read data, and a void * buffer in which to put text, and the number of characters to put in that buffer. The expected return is the number of characters placed in the buffer. The fourth is 'flush'; this is intended to force the buffer to spit out any data it may have been saving, as well as to clear any data the BUFF code was storing. If the third argument is non-null, then it contains more text to be printed; that text need not be null terminated, but the fourth argument contains the length of text to be processed. The expected return value should be the number of characters handled out from the third argument (0 if there are none), or -1 on error. Finally, the fifth argument is a pointer to the bfilter struct containing this function, so that it may use the write or writev functions in it. Note that general buffering is handled by BUFF's internal code, and module writers should not store data for performance reasons. The fifth is 'transmitfile', which takes as its arguments a buffer to which to write (if non-null), the void * pointer containing configuration (or other) information for this filter, and a system-dependent pointer (the file_info_ptr structure will be defined on a per-system basis) containing information required to print the 'file' in question. This is intended to allow zero-copy TCP in Win32. The sixth is 'close'; this is what is called when the connection is being closed. The 'close' should not be passed on to the next filter in the stack. Most filters will not need to use this, but if database handles or some other object is created, this is the point at which to remove it. Note that flush is called automatically before this. bbottomfilter Structure The first function, bgetfileinfo, is designed to allow Apache to get information from a BUFF struct regarding the input and output sources. This is currently used to get the input file number to select on a socket to see if there's data waiting to be read. The information returned is platform specific; the void * pointer passed in holds the void * pointer passed to all user-supplied functions. The second function, bpushfileinfo, is used to push file information onto a buffer, so that the buffer can be fully constructed and ready to handle data as soon as possible after a client has connected. The first void * pointer holds platform specific information (in Unix, it would be a pair of file descriptors); the second holds the void * pointer passed to all user-supplied functions. [djg: I don't think I really agree with the distinction here between the bottom and the other filters. Take the select() example, it's valid for any layer to define a fd that can be used for select... in fact it's the topmost layer that should really get to make this definition. Or maybe I just have your top and bottom flipped. In any event I think this should be part of the filter structure and not separate.] The BUFF structure A couple of changes are needed for this structure: remove fd and fd_in; add a bfilter structure; add a pointer to a bbottomfilter; add three pointers to the next BUFFs: one for the next BUFF in the stack, one for the next BUFF which implements write, and one for the next BUFF which implements read. Public functions in buff.c BUFF * bpushfilter (BUFF *, struct bfilter *, void *); This function adds the filter functions from bfilter, stacking them on top of the BUFF. It returns the new top BUFF, or NULL on error. BUFF * bpushbuffer (BUFF *, BUFF *); This function places the second buffer on the top of the stack that the first one is on. It returns the new top BUFF, or NULL on error. BUFF * bpopfilter(BUFF *); BUFF * bpopbuffer(BUFF *); Unattaches the top-most filter from the stack, and returns the new top-level BUFF, or NULL on error or when there are no BUFFs remaining. The two are synonymous. void bclosestack(BUFF *); Closes the I/O stack, removing all the filters in it. BUFF * bcreatestack(pool *p, int flags, struct bfilter *, struct bbottomfilter *, void *); This creates an I/O stack. It returns NULL on error. BUFF * bcreatebuffer(pool *p, int flags, struct bfilter *, void *); This creates a BUFF for later use with bpushbuffer. The BUFF is not set up to be used as an I/O stack, however. It returns NULL on error. int bsetstackopts(BUFF *, int, const void *); int bsetstackflags(BUFF *, int, int); These functions, respectively, set options on all the BUFFs in a stack. The new flag, B_MAXBUFFERING is used to disable a feature described in the next section, whereby only the first and last BUFFs will buffer data. ************************************************************************* Efficiency Considerations Buffering All input and output is buffered by the standard buffering code. People writing code to use buff.c should not concern themselves with buffering for efficiency, and should not buffer except when necessary. The write function will typically be called with large blocks of text; the read function will attempt to place the specified number of bytes into the buffer. Dean noted that there are possible problems w/ multiple buffers; further, some applications must not be buffered. This can be partially dealt with by turning off buffering, or by flushing the data when appropriate. However, some potential problems arise anyway. The simplest example involves shrinking transformations; suppose that you have a set of filters, A, B, and C, such that A outputs less text than it recieves, as does B (say A strips comments, and B gzips the result). Then after a write to A which fills the buffer, A writes to B. However, A won't write enough to fill B's buffer, so a memory copy will be needed. This continues till B's buffer fills up, then B will write to C's buffer -- with the same effect. [djg: I don't think this is the issue I was really worried about -- in the case of shrinking transformations you are already doing non-trivial amounts of CPU activity with the data, and there's no copying of data that you can eliminate anyway. I do recognize that there are non-CPU intensive filters -- such as DMA-capable hardware crypto cards. I don't think they're hard to support in a zero-copy manner though.] The maximum additional number of bytes which will be copied in this scenario is on the order of nk, where n is the total number of bytes, and k is the number of filters doing shrinking transformations. There are several possible solutions to this issue. The first is to turn off buffering in all but the first filter and the last filter. This reduces the number of unnecessary byte copies to at most one per byte, however it means that the functions in the stack will get called more frequently; but it is the default behavior, overridable by setting the B_MAXBUFFERING with bsetstackflags. Most filters won't involve a net shrinking transformation, so even this will rarely be an issue; however, if the filters do involve a net shrinking transformation, for the sake of network-efficiency (sending reasonably sized blocks), it may be more efficient anyway. A second solution is more general use of writev for communication between different buffers. This complicates the programing work, however. Memory copies Each write function is passed a pointer to constant text; if any changes are being made to the text, it must be copied. However, if no changes are made to the text (or to some smaller part of it), then it may be sent to the next filter without any additional copying. This should provide the minimal necessary memory copies. [djg: Unfortunately this makes it hard to support page-flipping and async i/o because you don't have any reference counts on the data. But I go into a little detail that already in docs/page_io.] Function chaining In order to avoid unnecessary function chaining for reads and writes, when a filter is pushed onto the stack, the buff.c code will determine which is the next BUFF which contains a read or write function, and reads and writes, respectively, will go directly to that BUFF. writev writev is a function for efficient writing to the system; in terms of this API, however, it also works for dealing with multiple blocks of text without doing unnecessary byte copies. It is not required. Currently, the system level writev is used in two contexts: for chunking and when a block of text is writen which, combined with the text already in the buffer, would make the buffer overflow. writev would be implemented both by the default bottom level filter and by the chunking filter for these operations. In addition, writev may, be used, as noted above, to pass multiple blocks of text w/o copying them into a single buffer. Note that if the next filter does not implement writev, however, this will be equivalent to repeated calls to write, which may or may not be more efficient. Up to IOV_MAX-2 blocks of text may be passed along in this manner. Unlike the system writev call, the writev in this API should be called only once, with a array with iovec's and a count as to the number of iovecs in it. If a bfilter defines writev, writev will be called whether or not NO_WRITEV is set; hence, it should deal with that case in a reasonable manner. [djg: We can't guarantee atomicity of writev() when we emulate it. Probably not a problem, just an observation.] ************************************************************************* Code in buff.c Default Functions The default actions are generally those currently performed by Apache, save that they they'll only attempt to write to a buffer, and they'll return an error if there are no more buffers. That is, you must implement read, write, and flush in the bottom-most filter. Except for close(), the default code will simply pass the function call on to the next filter in the stack. Some samples follow. Heuristics for writev Currently, we call writev for chunking, and when we get a enough so that the total overflows the buffer. Since chunking is going to become a filter, the chunking filter will use writev; in addition, bwrite will trigger bwritev as shown (note that system specific information should be kept at the filter level): in bwrite: if (fb->outcnt > 0 && nbyte + fb->outcnt >= fb->bufsiz) { /* build iovec structs */ struct iovec vec[2]; vec[0].iov_base = (void *) fb->outbase; vec[0].iov_len = fb->outcnt; fb->outcnt = 0; vec[1].iov_base = (void *)buff; vec[1].iov_length = nbyte; return bwritev (fb, vec, 2); } else if (nbye >= fb->bufsiz) { return write_with_errors(fb,buff,nbyte); } Note that the code above takes the place of large_write (as well as taking code from it). So, bwritev would look something like this (copying and pasting freely from the current source for writev_it_all, which could be replaced): ----- int bwritev (BUFF * fb, struct iovec * vec, int nvecs) { if (!fb) return -1; /* the bottom level filter implemented neither write nor * writev. */ if (fb->bfilter.bwritev) { return bf->bfilter.writev(fb->next, vec, nvecs); } else if (fb->bfilter.write) { /* while it's nice an easy to build the vector and crud, it's painful * to deal with partial writes (esp. w/ the vector) */ int i = 0,rv; while (i < nvecs) { do { rv = fb->bfilter.write(fb, vec[i].iov_base, vec[i].iov_len); } while (rv == -1 && (errno == EINTR || errno == EAGAIN) && !(fb->flags & B_EOUT)); if (rv == -1) { if (errno != EINTR && errno != EAGAIN) { doerror (fb, B_WR); } return -1; } fb->bytes_sent += rv; /* recalculate vec to deal with partial writes */ while (rv > 0) { if (rv < vec[i].iov_len) { vec[i].iov_base = (char *)vec[i].iov_base + rv; vec[i].iov_len -= rv; rv = 0; if (vec[i].iov_len == 0) { ++i; } } else { rv -= vec[i].iov_len; ++i; } } if (fb->flags & B_EOUT) return -1; } /* if we got here, we wrote it all */ return 0; } else { return bwritev(fb->next,vec,nvecs); } } ----- The default filter's writev function will pretty much like writev_it_all. Writing The general case for writing data is significantly simpler with this model. Because special cases are not dealt with in the BUFF core, a single internal interface to writing data is possible; I'm going to assume it's reasonable to standardize on write_with_errors, but some other function may be more appropriate. In the revised bwrite (which I'll ommit for brievity), the following must be done: check for error conditions check to see if any buffering is done; if not, send the data directly to the write_with_errors function check to see if we should use writev or write_with_errors as above copy the data to the buffer (we know it fits since we didn't need writev or write_with_errors) The other work the current bwrite is doing is ifdef'ing around NO_WRITEV numerous decisions regarding whether or not to send chunks Generally, buff.c has a number of functions whose entire purpose is to handle particular special cases wrt chunking, all of which could be simplified with a chunking filter. write_with_errors would not need to change; buff_write would. Here is a new version of it: ----- /* the lowest level writing primitive */ static ap_inline int buff_write(BUFF *fb, const void *buf, int nbyte) { if (fb->bfilter.write) return fb->bfilter.write(fb->next_writer,buff,nbyte); else return bwrite(fb->next_writer,buff,nbyte); } ----- If the btransmitfile function is called on a buffer which doesn't implement it, the system will attempt to read data from the file identified by the file_info_ptr structure and use other methods to write to it. Reading One of the basic reading functions in Apache 1.3b3 is buff_read; here is how it would look within this spec: ----- /* the lowest level reading primitive */ static ap_inline int buff_read(BUFF *fb, void *buf, int nbyte) { int rv; if (!fb) return -1; /* the bottom level filter is not set up properly */ if (fb->bfilter.read) return fb->bfilter.read(fb->next_reader,buf,nbyte,fb->bfilter_info); else return bread(fb->next_reader,buff,nbyte); } ----- The code currently in buff_read would become part of the default filter. Flushing data flush will get passed on down the stack automatically, with recursive calls to bflush. The user-supplied flush function will be called then, and also before close is called. The user-supplied flush should not call flush on the next buffer. [djg: Poorly written "expanding" filters can cause some nastiness here. In order to flush a layer you have to write out your current buffer, and that may cause the layer below to overflow a buffer and flush it. If the filter is expanding then it may have to add more to the buffer before flushing it to the layer below. It's possible that the layer below will end up having to flush twice. It's a case where writev-like capabilities are useful.] Closing Stacks and Filters When a filter is removed from the stack, flush will be called then close will be called. When the entire stack is being closed, this operation will be done automatically on each filter within the stack; generally, filters should not operate on other filters further down the stack, except to pass data along when flush is called. Flags and Options Changes to flags and options using the current functions only affect one buffer. To affect all the buffers on down the chain, use bsetstackopts or bsetstackflags. bgetopt is currently only used to grab a count of the bytes sent; it will continue to provide that functionality. bgetflags is used to provide information on whether or not the connection is still open; it'll continue to provide that functionality as well. The core BUFF operations will remain, though some operations which are done via flags and options will be done by attaching appropriate filters instead (eg. chunking). [djg: I'd like to consider filesystem metadata as well -- we only need a few bits of metadata to do HTTP: file size and last modified. We need an etag generation function, it is specific to the filters in use. You see, I'm envisioning a bottom layer which pulls data out of a database rather than reading from a file.]