From akosut@leland.Stanford.EDU Thu Jul 23 09:38:40 1998
Date: Sun, 19 Jul 1998 00:12:37 -0700 (PDT)
From: Alexei Kosut
To: new-httpd@apache.org
Subject: Apache 2.0 - an overview

For those not at the Apache meeting in SF, and even for those who were, here's a quick overview of (my understanding of) the Apache 2.0 architecture that we came up with. I present this to make sure that I have it right, and to get opinions from the rest of the group. Enjoy.

1. "Well, if we haven't released 2.0 by Christmas of 1999, it won't matter anyway."

A couple of notes about this plan: I'm looking at this right now from a design standpoint, not an implementation one. If the plan herein were actually coded as-is, you'd get a very inefficient web server. But as Donald Knuth (Professor emeritus at Stanford, btw... :) points out, "premature optimization is the root of all evil." Rest assured there are plenty of ways to make sure Apache 2.0 is much faster than Apache 1.3. Taking out all the "slowness" code, for example... :)

Also, the main ideas in this document mainly come from Dean Gaudet, Simon Spero, Cliff Skolnick and a bunch of other people, from the Apache Group's meeting in San Francisco, July 2 and 3, 1998. The other ideas come from other people. I'm being vague because I can't quite remember. We should have videotaped it. I've titled the sections of this document with quotes from our meeting, but they are paraphrased from memory, so don't take them too seriously.

2. "But Simon, how can you have a *middle* end?"

One of the main goals of Apache 2.0 is protocol independence (i.e., serving HTTP/1.1, HTTP-NG, and maybe FTP or gopher or something). Another is to rid the server of the belief that everything is a file. Towards this end, we divide the server up into three parts, the front end, the middle end, and the back end.

The front end is essentially a combination of http_main and http_protocol today.
It takes care of all network and protocol matters, interpreting the request, putting it into a protocol-neutral form, and (possibly) passing it off to the rest of the server. This is approximately equivalent to the part of Apache contained in Dean's flow stuff, and it also works very well in certain non-Unix-like architectures such as clustered mainframes. In addition, part of this front-end might be optionally run in kernel space, giving a very fast server indeed...

The back end is what generates the content. At the back of the back end we have backing stores (Cliff's term), which contain actual data. These might represent files on a disk, entries in a database, CGI scripts, etc... The back end also consists of other modules, which can alter the request in various fashions. The objects the server acts on can be thought of (Cliff again) as a filehandle and a set of key/value pairs (metainformation). The modules are set up as filters that can alter either one of those, stacking I/O routines onto the stream of data, or altering the metainformation.

The middle end is what comes between the front and back ends. Think of http_request. This section takes care of arranging the modules, backing stores, etc... in such a manner that the path of the request will result in the correct entity being delivered to the front end and sent to the client.

3. "I won't embarrass you guys with the numbers for how well Apache performs compared to IIS." (on NT)

For a server that was designed to handle flat files, Apache does it surprisingly poorly, compared with other servers that have been optimized for it. And the performance for non-static files is, of course, worse. While Apache is still more than fast enough for 95% of Web servers, we'd be remiss to dismiss those other 5% (they're the fun ones anyway). Another problem Apache has is its lack of a good caching proxy module.
Put these together, along with the work Dean has done with the flow and mod_mmap_static stuff, and we realize the most important part of Apache 2.0: a built-in, all-pervasive cache. Every part of the request process will involve caching. In the path outlined above, between each layer of the request, between each module, sits the cache, which can (when it is useful) cache the response and its metainformation - including its variance, so it knows when it is safe to give out the cached copy. This gives every opportunity to increase the speed of the server by making sure it never has to dynamically create content more than it needs to, and renders accelerators such as Squid unnecessary.

This also allows what I alluded to earlier: a kernel (or near-to-kernel) based web server component, which could read the request, consult the cache to find the requested object, and spit it back out, without so much as an interrupt in the way. Of course, the rest of Apache (with all its modules - it's generally a bad idea to let unknown, untrusted code insert itself into the kernel) sits up in user-space, ready to handle any request the micro-Apache can't.

A built-in cache also makes a real working HTTP/1.1 proxy server trivially easy to write.

4. "Stop asking about backwards compatibility with the API. We'll write a compatibility module... later."

If modules are as described above, then obviously they are very much distinct from how Apache's current modules function. The only module function that is similar to the current model is the handler, or backing store, that actually provides the basic stream of data that the server alters to produce a response entity.

The basic module's approach to its job is to stack a filter onto the output. But it's better to think of the modules not as a stack that the request flows through (a layer cake with cache icing between the layers), but more of a mosaic (pretend I didn't use that word. I wrote collage.
You can't prove anything), with modules stuck onto various sides of the request at different points, altering the request/response.

Today's Apache modules take an all-or-nothing approach to request handlers. They tell Apache what they can do, overestimating, and then are supposed to DECLINE if they don't pass a number of checks they are supposed to make. Most modules don't do this correctly. The better approach is to allow the modules to inform Apache exactly of what they can do, and have Apache (the middle-end) take care of invoking them when appropriate.

The final goal of all of this, of course, is simply to allow CGI output to be parsed for server-side includes. But don't tell Dean that.

5. "Will Apache run without any of the normal Unix binaries installed, only the BSD/POSIX libraries?"

Another major issue is, of course, configuration of the server. There are a number of distinct opinions on this, both as to what should be configured and how it should be done. We talked mainly about the latter, but we did touch on the former.

Obviously, with a radically distinct module API, the configuration is radically different. We need a good way to specify how the modules are supposed to interact, and to control what they can do, when and how, balancing what the user asks the server to do, and what the module (author) wants the server to do. We didn't really come up with a good answer to this.

However, we did make some progress on the other side of the issue: We agreed that the current configuration system is definitely taking the right approach. Having a well-defined repository of the configuration scheme, containing the possible directives, when they are applicable, what their parameters are, etc... is the right way to go. We agreed that more information and stronger typing (no RAW_ARGS!) would be good, and may enable on-the-fly generated configuration managers.
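
To make the "well-defined repository" idea a little more concrete, here is a minimal sketch in C of what a more strongly typed directive table might look like. It is an illustration only, not something drawn up at the meeting and not the Apache API: the type names, the context flags, the directive names, and both lookup functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical parameter types: every directive declares one, so
 * nothing is handed to a module as an unparsed string. */
typedef enum { ARG_FLAG, ARG_INT, ARG_STRING } arg_type;

/* Hypothetical contexts in which a directive may appear (bit flags). */
enum { CTX_SERVER = 1, CTX_VHOST = 2, CTX_DIR = 4 };

typedef struct {
    const char *name;     /* directive name as written in the config */
    arg_type    type;     /* type of its single parameter */
    int         contexts; /* OR of CTX_* values where it is applicable */
    const char *help;     /* one-line description for a config tool */
} directive_desc;

/* Illustrative entries only; these directive names are made up. */
static const directive_desc directives[] = {
    { "ExampleTimeout", ARG_INT,    CTX_SERVER | CTX_VHOST,
      "timeout in seconds" },
    { "ExampleCache",   ARG_FLAG,   CTX_SERVER | CTX_VHOST | CTX_DIR,
      "enable caching" },
    { "ExampleRoot",    ARG_STRING, CTX_SERVER | CTX_VHOST,
      "root of the served tree" },
};

#define N_DIRECTIVES (sizeof directives / sizeof directives[0])

/* Look up a directive by exact (case-sensitive) name.
 * Returns NULL if the table has no such entry. */
const directive_desc *find_directive(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < N_DIRECTIVES; i++) {
        if (strcmp(directives[i].name, name) == 0)
            return &directives[i];
    }
    return NULL;
}

/* Return nonzero if the named directive exists and may be used in
 * the given context (one of the CTX_* values). */
int directive_allowed(const char *name, int ctx)
{
    const directive_desc *d = find_directive(name);

    return d != NULL && (d->contexts & ctx) != 0;
}
```

A tool could walk a table like this to learn which directives exist, what type of value each takes, and where each is legal, without having to parse free-form argument strings.
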
We agreed that such a configuration manager, probably external to Apache, would generate a configuration and pass it to Apache, either via a standard config file, or by calling Apache API functions. It is desirable to be able to go the other way, pulling current configuration from Apache to look at, and perhaps change it on the fly, but unfortunately it is unlikely this information would always be available; modules may perform optimizations on their configuration that make the original configuration unavailable.

For the language and specification of the configuration, we thought perhaps XML might be a good approach, and agreed it should be looked into. Other issues, such as SNMP, were brought up and laughed at.

6. "So you're saying that the OS that controls half the banks, and 90% of the airlines, doesn't even have memory protection for separate processes?"

Obviously, there are a lot more items that have to be part of Apache 2.0, and we talked about a number of them. However, the four points above, I think, represent the core of the architecture we agreed on as a starting point.

-- Alexei Kosut
   Stanford University, Class of 2001 * Apache *