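This document describes the design of the POIFS system. It covers the design considerations, the classes and interfaces, and the scenarios in which those classes and interfaces are used.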
This document is written as part of an iterative process. As that process is not yet complete, neither is this document.
The design of POIFS is not dependent on the code written for the proof-of-concept prototype POIFS package.
As usual, the primary considerations in the design of POIFS involve the classic space-time tradeoff. In this case, the main consideration is minimizing the memory footprint of POIFS. POIFS may be called upon to create relatively large documents; in a web application server, it may be called upon to create several documents simultaneously, and it will likely coexist with other Serializer systems, competing with them for space on the server.
We've addressed the risk of being too slow through a proof-of-concept prototype. This prototype read an existing file, decomposed it into its constituent documents, composed a new POIFS from those documents, wrote the POIFS file back to disk, and verified that the output file, while not necessarily a byte-for-byte image of the input file, could be read by the application that generated the input file. The prototype proved to be quite fast, reading, decomposing, and regenerating a large (300K) file in 2 to 2.5 seconds.
While the POIFS format allows great flexibility in laying out the documents and the other internal data structures, the layout of the filesystem will be kept as simple as possible.
The design of POIFS is broken down into two parts: a discussion of the classes and interfaces, and a discussion of how these classes and interfaces will be used to convert an appropriate Java InputStream (such as an XML stream) to a POIFS output stream containing an HSSF document.
Classes and Interfaces
The classes and interfaces used in the POIFS are broken down as follows:
Package | Contents |
---|---|
net.sourceforge.poi.poifs.storage | Block classes and interfaces |
net.sourceforge.poi.poifs.property | Property classes and interfaces |
net.sourceforge.poi.poifs.filesystem | Filesystem classes and interfaces |
net.sourceforge.poi.util | Utility classes and interfaces |
The block classes and interfaces are shown in the following class diagram.
Class/Interface | Description |
---|---|
BATBlock | The BATBlock class represents a single big block containing 128 BAT entries. Its _fields array is used to read and write the BAT entries into the _data array. Its createBATBlocks method is used to create an array of BATBlock instances from an array of int BAT entries. Its calculateStorageRequirements method calculates the number of BAT blocks necessary to hold the specified number of BAT entries. |
BigBlock | The BigBlock class is an abstract class representing the common big block of 512 bytes. It implements BlockWritable, trivially delegating the writeBlocks method of BlockWritable to its own abstract writeData method. |
BlockWritable | The BlockWritable interface defines a single method, writeBlocks, that is used to write an implementation's block data to an OutputStream. |
DocumentBlock | The DocumentBlock class is used by a Document to hold its raw data. It also retains the number of bytes read, as this is used by the Document class to determine the total size of the data, and is also used internally to determine whether the block was filled by the InputStream or not. The DocumentBlock constructor is passed an InputStream from which to fill its _data array. The size method returns the number of bytes read (_bytes_read) when the instance was constructed. The partiallyRead method returns true if the _data array was not completely filled, which may be interpreted by the Document as having reached the end of file point. Typical use of the DocumentBlock class is a while (true) loop that constructs blocks from the InputStream until partiallyRead returns true. |
HeaderBlock | The HeaderBlock class is used to contain the data found in a POIFS header. Its IntegerField members are used to read and write the appropriate entries into the _data array. Its setBATBlocks, setPropertyStart, and setXBATStart methods are used to set the appropriate fields in the _data array. The calculateXBATStorageRequirements method is used to determine how many XBAT blocks are necessary to accommodate the specified number of BAT blocks. |
PropertyBlock | The PropertyBlock class is used to contain Property instances for the PropertyTable class. It contains an array, _properties, of 4 Property instances, which together comprise the 512 bytes of a BigBlock. The createPropertyBlockArray method is used to convert a List of Property instances into an array of PropertyBlock instances. The number of Property instances is rounded up to a multiple of 4 by creating empty anonymous inner class extensions of Property. |
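The read loop described for DocumentBlock can be sketched as follows. This is a minimal illustration under stated assumptions, not the POIFS implementation: the class names SketchDocumentBlock and DocumentReadSketch and the helper methods readBlocks and totalSize are hypothetical, and only the 512-byte block size, the size method, and the partiallyRead method come from the descriptions above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the DocumentBlock described above (not the POIFS class itself).
final class SketchDocumentBlock {
    static final int BLOCK_SIZE = 512; // big block size given in the table

    private final byte[] data = new byte[BLOCK_SIZE];
    private final int bytesRead;

    SketchDocumentBlock(InputStream stream) throws IOException {
        int total = 0;
        // A single read() may return fewer bytes than requested, so keep
        // reading until the block is full or the stream is exhausted.
        while (total < BLOCK_SIZE) {
            int n = stream.read(data, total, BLOCK_SIZE - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        bytesRead = total;
    }

    // Number of bytes read when the instance was constructed.
    int size() {
        return bytesRead;
    }

    // True if the block was not completely filled, i.e. the stream ended.
    boolean partiallyRead() {
        return bytesRead != BLOCK_SIZE;
    }
}

final class DocumentReadSketch {
    // Reads the stream into blocks, stopping at the first partially filled block.
    static List<SketchDocumentBlock> readBlocks(InputStream stream) {
        List<SketchDocumentBlock> blocks = new ArrayList<>();
        try {
            while (true) {
                SketchDocumentBlock block = new SketchDocumentBlock(stream);
                blocks.add(block);
                if (block.partiallyRead()) {
                    break;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return blocks;
    }

    // The document size is the sum of the bytes read into each block.
    static int totalSize(List<SketchDocumentBlock> blocks) {
        int total = 0;
        for (SketchDocumentBlock block : blocks) {
            total += block.size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<SketchDocumentBlock> blocks =
            readBlocks(new ByteArrayInputStream(new byte[1300]));
        System.out.println(blocks.size() + " blocks, " + totalSize(blocks) + " bytes");
    }
}
```

In this sketch, an input whose length is an exact multiple of 512 bytes ends with one extra, empty block, because the end of the stream is only detected by a block that reads fewer than 512 bytes. Whether the real Document class keeps or discards such a block is not stated in this design.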
The property classes and interfaces are shown in the following class diagram.
Class/Interface | Description |
---|---|
Directory | The Directory interface is implemented by the RootProperty class. It is not strictly necessary for the initial POIFS implementation, but when the POIFS supports directory elements, this interface will be more widely implemented, and so is included in the design at this point to ease the eventual support of directory elements. Its methods are a getter/setter pair: getChildren, returning an Iterator of Property instances, and addChild, which will allow the caller to add another Property instance to the Directory's children. |
DocumentProperty | The DocumentProperty class is a trivial extension of Property and is used by Document to keep track of its associated entry in the PropertyTable. Its constructor takes a name and the document size, on the assumption that the Document will not create a DocumentProperty until after it has created the storage for the document data and therefore knows how much data there is. |
File | The File interface specifies the behavior of reading and writing the next and previous child fields of a Property. |
Property | The Property class is an abstract class that defines the basic data structure of an element of the Property Table. Its ByteField, ShortField, and IntegerField members are used to read and write data into the appropriate locations in the _raw_data array. The _index member is used to hold a Property instance's index in the List of Property instances maintained by PropertyTable, which is used to populate the child property of parent Directory properties and the next property and previous property of sibling File properties. The _name, _next_file, and _previous_file members are used to help fill the appropriate fields of the _raw_data array. Setters are provided for some of the fields (name, property type, node color, child property, size, index, start block), as well as a few getters (index, child property). The preWrite method is abstract and is used by the owning PropertyTable to iterate through its Property instances and prepare each for writing. The shouldUseSmallBlocks method returns true if the Property's size is sufficiently small - how small is none of the caller's business. |
PropertyBlock | See the description in PropertyBlock. |
PropertyTable | The PropertyTable class holds all of the DocumentProperty instances and the RootProperty instance for a Filesystem instance. It maintains a List of its Property instances (_properties), and when prepared to write its data by a call to preWrite, it gets and holds an array of PropertyBlock instances (_blocks). It also maintains its start block in its _start_block member. It has a method, getRoot, to get the RootProperty, returning it as an implementation of Directory; a method to add a Property, addProperty; and a method to get its start block, getStartBlock. |
RootProperty | The RootProperty class acts as the Directory for all of the DocumentProperty instances. As such, it is more of a pure directory entry than a proper root entry in the Property Table, but the initial POIFS implementation does not warrant the additional complexity of a full-blown root entry, and so it is not modeled in this design. It maintains a List of its children, _children, in order to perform its directory-oriented duties. |
The filesystem classes and interfaces are shown in the following class diagram.
Class/Interface | Description |
---|---|
Filesystem | The Filesystem class is the top-level class that manages the creation of a POIFS document. It maintains a PropertyTable instance in its _property_table member, a HeaderBlock instance in its _header_block member, and a List of its Document instances in its _documents member. It provides a method for a client to create a document (createDocument), and a method to write the Filesystem to an OutputStream (writeFilesystem). |
BATBlock | See the description in BATBlock |
BATManaged | The BATManaged interface defines common behavior for objects whose location in the written file is managed by the Block Allocation Table. It defines methods to get a count of the implementation's BigBlock instances (countBlocks), and to set an implementation's start block (setStartBlock). |
BlockAllocationTable | The BlockAllocationTable is an implementation of the POIFS Block Allocation Table. It is only created when the Filesystem is about to be written to an OutputStream. It contains an IntList of block numbers for all of the BATManaged implementations owned by the Filesystem, _entries, which is filled by calls to allocateSpace. It fills its array, _blocks, of BATBlock instances when its createBATBlocks method is called. This method has to take into account its own storage requirements, as well as those of the XBAT blocks, and so calls BATBlock.calculateStorageRequirements and HeaderBlock.calculateXBATStorageRequirements repeatedly until the counts returned by those methods stabilize. The countBlocks method returns the number of BATBlock instances created by the preceding call to createBATBlocks. |
BlockWritable | See the description in BlockWritable |
Document | The Document class is used to contain a document, such as an HSSF workbook. It has its own DocumentProperty (_property) and stores its data in a collection of DocumentBlock instances (_blocks). It has a method, getDocumentProperty, to get its DocumentProperty. |
DocumentBlock | See the description in DocumentBlock |
DocumentProperty | See the description in DocumentProperty |
HeaderBlock | See the description in HeaderBlock |
PropertyTable | See the description in PropertyTable |
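The repeated calculation described for createBATBlocks can be sketched as a small fixed-point loop. This is only an illustration under stated assumptions: the class name BatSizingSketch and the method stabilize are hypothetical, the figure of 128 entries per BAT block comes from the BATBlock description, and the header capacity of 109 BAT block references and the XBAT capacity of 127 references per block are assumptions taken from the underlying compound document format rather than from this design.

```java
final class BatSizingSketch {
    static final int ENTRIES_PER_BAT_BLOCK = 128;   // stated in the BATBlock description
    static final int BAT_REFS_IN_HEADER = 109;      // assumption: header capacity in the file format
    static final int BAT_REFS_PER_XBAT_BLOCK = 127; // assumption: 128 slots minus one chain link

    // Number of BAT blocks needed to hold the given number of BAT entries.
    static int calculateStorageRequirements(int entryCount) {
        return (entryCount + ENTRIES_PER_BAT_BLOCK - 1) / ENTRIES_PER_BAT_BLOCK;
    }

    // Number of XBAT blocks needed for BAT blocks that do not fit in the header.
    static int calculateXBATStorageRequirements(int batBlockCount) {
        int overflow = batBlockCount - BAT_REFS_IN_HEADER;
        if (overflow <= 0) {
            return 0;
        }
        return (overflow + BAT_REFS_PER_XBAT_BLOCK - 1) / BAT_REFS_PER_XBAT_BLOCK;
    }

    // Returns {batBlocks, xbatBlocks} for a filesystem whose BATManaged
    // objects occupy managedBlocks blocks. The BAT and XBAT blocks need
    // BAT entries of their own, so the counts are recomputed until they
    // stop changing.
    static int[] stabilize(int managedBlocks) {
        int batBlocks = 0;
        int xbatBlocks = 0;
        while (true) {
            int entries = managedBlocks + batBlocks + xbatBlocks;
            int newBat = calculateStorageRequirements(entries);
            int newXbat = calculateXBATStorageRequirements(newBat);
            if (newBat == batBlocks && newXbat == xbatBlocks) {
                return new int[] { batBlocks, xbatBlocks };
            }
            batBlocks = newBat;
            xbatBlocks = newXbat;
        }
    }

    public static void main(String[] args) {
        int[] counts = stabilize(128);
        System.out.println(counts[0] + " BAT blocks, " + counts[1] + " XBAT blocks");
    }
}
```

With 128 managed blocks, one BAT block would be exactly full, but the BAT block itself needs an entry, so the loop settles on two BAT blocks. The counts never decrease from one pass to the next, which is why the loop terminates.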
The utility classes and interfaces are shown in the following class diagram.
Class/Interface | Description |
---|---|
BitField | The BitField class is used primarily by HSSF code to manage bit-mapped fields of HSSF records. It is not likely to be used in the POIFS code itself and is only included here for the sake of complete documentation of the POI utility classes. |
ByteField | The ByteField class is an implementation of FixedField for the purpose of managing reading and writing to a byte-wide field in an array of bytes. |
FixedField | The FixedField interface defines a set of methods for reading a field from an array of bytes or from an InputStream, and for writing a field to an array of bytes. Implementations typically require an offset in their constructors that, for the purposes of reading and writing to an array of bytes, makes sure that the correct bytes in the array are read or written. |
HexDump | The HexDump class is a debugging class that can be used to dump an array of bytes to an OutputStream. The static method dump takes an array of bytes, a long offset that is used to label the output, an open OutputStream, and an int index that specifies the starting index within the array of bytes. The data is displayed 16 bytes per line, with each byte displayed in hexadecimal format and again in printable form, if possible (a byte is considered printable if its value is in the range of 32 ... 126). Here is an example of a small array of bytes with an offset of 0x110: 00000110 C8 00 00 00 FF 7F 90 01 00 00 00 00 00 00 05 01 ................ |
IntegerField | The IntegerField class is an implementation of FixedField for the purpose of managing reading and writing to an integer-wide field in an array of bytes. |
IntList | The IntList class is a work-around for functionality missing in Java (see http://developer.java.sun.com/developer/bugParade/bugs/4487555.html for details); it is a simple growable array of ints that gets around the requirement of wrapping and unwrapping ints in Integer instances in order to use the java.util.List interface. IntList mimics the functionality of the java.util.List interface as much as possible. |
LittleEndian | The LittleEndian class provides a set of static methods for reading and writing shorts, ints, longs, and doubles in and out of byte arrays, and out of InputStreams, preserving the Intel byte ordering and encoding of these values. |
LittleEndianConsts | The LittleEndianConsts interface defines the width of a short, int, long, and double as stored by Intel processors. |
LongField | The LongField class is an implementation of FixedField for the purpose of managing reading and writing to a long-wide field in an array of bytes. |
ShortField | The ShortField class is an implementation of FixedField for the purpose of managing reading and writing to a short-wide field in an array of bytes. |
ShortList | The ShortList class is a work-around for functionality missing in Java (see http://developer.java.sun.com/developer/bugParade/bugs/4487555.html for details); it is a simple growable array of shorts that gets around the requirement of wrapping and unwrapping shorts in Short instances in order to use the java.util.List interface. ShortList mimics the functionality of the java.util.List interface as much as possible. |
StringUtil | The StringUtil class manages the processing of Unicode strings. |
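The relationship between LittleEndian and a FixedField implementation such as IntegerField can be sketched as follows. This is a minimal sketch, not the POI utility code: the class names LittleEndianSketch and IntegerFieldSketch and all method names are hypothetical, chosen only to show how a field that remembers its offset can delegate the byte-order handling to static helpers.

```java
final class LittleEndianSketch {
    static final int INT_SIZE = 4; // width of an int in bytes

    // Writes value into data[offset..offset+3], least significant byte first.
    static void putInt(byte[] data, int offset, int value) {
        data[offset]     = (byte) (value & 0xFF);
        data[offset + 1] = (byte) ((value >>> 8) & 0xFF);
        data[offset + 2] = (byte) ((value >>> 16) & 0xFF);
        data[offset + 3] = (byte) ((value >>> 24) & 0xFF);
    }

    // Reads an int stored least significant byte first.
    static int getInt(byte[] data, int offset) {
        return (data[offset] & 0xFF)
            | ((data[offset + 1] & 0xFF) << 8)
            | ((data[offset + 2] & 0xFF) << 16)
            | ((data[offset + 3] & 0xFF) << 24);
    }
}

// Hypothetical integer-wide field bound to a fixed offset in a byte array.
final class IntegerFieldSketch {
    private final int offset;
    private int value;

    IntegerFieldSketch(int offset) {
        this.offset = offset;
    }

    // Stores the value and writes it to the field's location in data.
    void set(int newValue, byte[] data) {
        value = newValue;
        LittleEndianSketch.putInt(data, offset, newValue);
    }

    // Refreshes the value from the field's location in data.
    int readFromBytes(byte[] data) {
        value = LittleEndianSketch.getInt(data, offset);
        return value;
    }

    int get() {
        return value;
    }
}
```

Read this way, the first four bytes of the HexDump example above (C8 00 00 00) decode to the int value 200.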
This section describes scenarios showing how the POIFS classes and interfaces will be used to convert an appropriate XML stream to a POIFS output stream containing an HSSF document.
It is broken down as suggested by the following scenario diagram:
Step | Description |
---|---|
1 | The Filesystem is created by the client application. |
2 | The client application tells the Filesystem to create a document, providing an InputStream and the name of the document. This may be repeated several times. |
3 | The client application asks the Filesystem to write its data to an OutputStream. |
Initialization of the POIFS system is shown in the following scenario diagram:
Step | Description |
---|---|
1 | The Filesystem object, which is created for each request to convert an appropriate XML stream to a POIFS output stream containing an HSSF document, creates its PropertyTable. |
2 | The PropertyTable creates its RootProperty instance, making the RootProperty the first Property in its List of Property instances. |
3 | The Filesystem creates its HeaderBlock instance. It should be noted that the decision to create the HeaderBlock at Filesystem initialization is arbitrary; creation of the HeaderBlock could easily and harmlessly be postponed to the appropriate moment in writing the filesystem. |
Creating and adding a document to a POIFS system is shown in the following scenario diagram:
Step | Description |
---|---|
1 | The Filesystem instance creates a new Document instance. It will store the newly created Document in a List of BATManaged instances. |
2 | The Document reads data from the provided InputStream, storing the data in DocumentBlock instances. It keeps track of the byte count as it reads the data. |
3 | The Document creates a DocumentProperty to keep track of its property data. The byte count is stored in the newly created DocumentProperty instance. |
4 | The Filesystem requests the newly created DocumentProperty from the newly created Document instance. |
5 | The Filesystem sends the newly created DocumentProperty to the Filesystem's PropertyTable so that the PropertyTable can add the DocumentProperty to its List of Property instances. |
6 | The Filesystem gets the RootProperty from its PropertyTable. |
7 | The Filesystem adds the newly created DocumentProperty to the RootProperty. |
Although typical deployment of the POIFS system will only entail adding a single Document (the workbook) to the Filesystem, there is nothing in the design to prevent multiple Documents from being added to the Filesystem. This flexibility can be employed to write summary information document(s) in addition to the workbook.
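The seven steps of the document-creation scenario can be sketched with a handful of stand-in classes. Everything here is hypothetical and heavily simplified (the Scenario* class names, the field names, and the placeholder root name are assumptions for illustration); the sketch only mirrors the order of the interactions, and it discards the document data instead of storing it in DocumentBlock instances.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Property and DocumentProperty: a name and a size.
class ScenarioProperty {
    final String name;
    final int size;

    ScenarioProperty(String name, int size) {
        this.name = name;
        this.size = size;
    }
}

// Hypothetical stand-in for RootProperty, acting as the Directory.
class ScenarioRootProperty extends ScenarioProperty {
    final List<ScenarioProperty> children = new ArrayList<>();

    ScenarioRootProperty() {
        super("root", 0); // placeholder name, not taken from the design
    }

    void addChild(ScenarioProperty child) {
        children.add(child);
    }
}

class ScenarioPropertyTable {
    final List<ScenarioProperty> properties = new ArrayList<>();
    private final ScenarioRootProperty root = new ScenarioRootProperty();

    ScenarioPropertyTable() {
        properties.add(root); // the RootProperty is the first Property
    }

    void addProperty(ScenarioProperty property) {
        properties.add(property);
    }

    ScenarioRootProperty getRoot() {
        return root;
    }
}

class ScenarioDocument {
    private final ScenarioProperty property;

    ScenarioDocument(String name, InputStream in) {
        int count = 0;
        byte[] buffer = new byte[512];
        try {
            int n;
            while ((n = in.read(buffer)) >= 0) {
                count += n; // step 2: track the byte count while reading
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        property = new ScenarioProperty(name, count); // step 3
    }

    ScenarioProperty getDocumentProperty() {
        return property;
    }
}

class ScenarioFilesystem {
    final ScenarioPropertyTable propertyTable = new ScenarioPropertyTable();
    final List<ScenarioDocument> documents = new ArrayList<>();

    void createDocument(InputStream in, String name) {
        ScenarioDocument document = new ScenarioDocument(name, in); // steps 1-3
        documents.add(document);
        ScenarioProperty property = document.getDocumentProperty(); // step 4
        propertyTable.addProperty(property);                        // step 5
        propertyTable.getRoot().addChild(property);                 // steps 6-7
    }

    public static void main(String[] args) {
        ScenarioFilesystem fs = new ScenarioFilesystem();
        fs.createDocument(new ByteArrayInputStream(new byte[700]), "Workbook");
        System.out.println(fs.propertyTable.properties.size() + " properties");
    }
}
```

Calling createDocument more than once simply appends further documents and properties, which matches the remark that nothing in the design prevents multiple Documents.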
Writing the filesystem is shown in the following scenario diagram:
Step | Description | Notes |
---|---|---|
1 | The Filesystem adds the PropertyTable to its List of BATManaged instances and calls the PropertyTable's preWrite method. The action taken by the PropertyTable is shown in the PropertyTable preWrite scenario diagram. | |
2 | The Filesystem creates the BlockAllocationTable. | |
3 | The Filesystem gets the block count from the BATManaged instance. | Steps 3 through 5 are repeated for each BATManaged instance in the Filesystem's List of BATManaged instances (i.e., the Documents, in order of their addition to the Filesystem, followed by the PropertyTable). |
4 | The Filesystem sends the block count to the BlockAllocationTable, which adds the appropriate entries to its IntList of entries, returning the starting block for the newly added entries. | |
5 | The Filesystem gives the start block number to the BATManaged instance. If the BATManaged instance is a Document, it sets the start block field in its DocumentProperty. | |
6 | The Filesystem tells the BlockAllocationTable to create its BATBlock instances. | |
7 | The Filesystem gives the BAT information to the HeaderBlock so that it can set its BAT fields and, if necessary, create XBAT blocks. | |
8 | If the filesystem is unusually large (over 7MB), the HeaderBlock will create XBAT blocks to contain the BAT data that it cannot hold directly. In this case, the Filesystem tells the HeaderBlock where those additional blocks will be stored. | |
9 | The Filesystem gives the PropertyTable start block to the HeaderBlock. | |
10 | The Filesystem tells the BlockWritable instance to write its blocks to the provided OutputStream. This step is repeated for each BlockWritable instance, in this order: | |
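Steps 3 through 5 above hinge on allocateSpace returning a start block for each BATManaged instance. A minimal sketch of that bookkeeping follows, under stated assumptions: the class name BlockAllocationSketch is hypothetical, a java.util.List of Integer stands in for the IntList of the design, each entry is assumed to hold the index of the next block in the chain, and the end-of-chain marker of -2 is an assumption taken from the underlying file format rather than from this document. The sketch also assumes every caller asks for at least one block.

```java
import java.util.ArrayList;
import java.util.List;

final class BlockAllocationSketch {
    static final int END_OF_CHAIN = -2; // assumption: chain terminator of the file format

    // Stand-in for the IntList named _entries in the design.
    private final List<Integer> entries = new ArrayList<>();

    // Appends a chain of blockCount entries and returns the index of the
    // first block in that chain. Assumes blockCount is at least 1.
    int allocateSpace(int blockCount) {
        int start = entries.size();
        for (int k = 0; k < blockCount; k++) {
            boolean last = (k == blockCount - 1);
            // Each entry points at the next block; the last one ends the chain.
            entries.add(last ? END_OF_CHAIN : start + k + 1);
        }
        return start;
    }

    List<Integer> entries() {
        return entries;
    }

    public static void main(String[] args) {
        BlockAllocationSketch table = new BlockAllocationSketch();
        int documentStart = table.allocateSpace(3);      // a Document of 3 blocks
        int propertyTableStart = table.allocateSpace(2); // then the PropertyTable
        System.out.println(documentStart + ", " + propertyTableStart
            + ", " + table.entries());
    }
}
```

Because space is handed out in the order of the calls, allocating the Documents first and the PropertyTable last reproduces the ordering described in the note for steps 3 through 5.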
PropertyTable preWrite scenario diagram
Step | Description |
---|---|
1 | The PropertyTable calls setIndex for each of its Property instances, so that each Property now knows its index within the PropertyTable's List of Property instances. |
2 | The PropertyTable requests the PropertyBlock class to create an array of PropertyBlock instances. |
3 | The PropertyBlock calculates the number of empty Property instances it needs to create and creates them. The calculation starts from the block count: block_count = (properties.size() + 3) / 4; the number of empty Property instances is then block_count * 4 - properties.size(). |
4 | The PropertyBlock creates the required number of PropertyBlock instances from the List of Property instances, including the newly created empty Property instances. |
5 | The PropertyTable calls preWrite on each of its Property instances. For DocumentProperty instances, this call is a no-op. For the RootProperty, the action taken is shown in the RootProperty preWrite scenario diagram. |
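The rounding in step 3 can be checked with a few lines of integer arithmetic. This is a small sketch; the class name PropertyBlockMathSketch and the method names are hypothetical, while the formula for the block count and the figure of 4 Property instances per block come from the text above.

```java
final class PropertyBlockMathSketch {
    static final int PROPERTIES_PER_BLOCK = 4; // stated in the PropertyBlock description

    // Integer division rounds down, so adding 3 first rounds up to whole blocks.
    static int blockCount(int propertyCount) {
        return (propertyCount + PROPERTIES_PER_BLOCK - 1) / PROPERTIES_PER_BLOCK;
    }

    // Empty Property instances needed to fill the last block completely.
    static int emptyPropertyCount(int propertyCount) {
        return blockCount(propertyCount) * PROPERTIES_PER_BLOCK - propertyCount;
    }

    public static void main(String[] args) {
        for (int count = 1; count <= 5; count++) {
            System.out.println(count + " properties: " + blockCount(count)
                + " block(s), " + emptyPropertyCount(count) + " empty");
        }
    }
}
```

For the typical case of one RootProperty and one DocumentProperty, this gives a single PropertyBlock padded with two empty Property instances.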
RootProperty preWrite scenario diagram
Step | Description | Notes |
---|---|---|
1 | The RootProperty sets its child property with the index of the child Property that is first in its List of children. | |
2 | The RootProperty sets its child's next property field with the index of the child's next sibling in the RootProperty's List of children. If the child is the last in the List, its next property field is set to -1. | Steps 2 and 3 are repeated for each File in the RootProperty's List of children. |
3 | The RootProperty sets its child's previous property field with a value of -1. | |
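The three steps of the RootProperty preWrite scenario amount to linking the children into a simple forward chain. The following is a minimal sketch under stated assumptions: the class names LinkedPropertySketch and RootPreWriteSketch and the field names are hypothetical, and leaving the root's child field at -1 when there are no children is an assumption, since the scenario does not cover that case.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Property that already knows its index.
final class LinkedPropertySketch {
    static final int NO_INDEX = -1; // value used in the table for "no property"

    final int index;
    int child = NO_INDEX;
    int next = NO_INDEX;
    int previous = NO_INDEX;

    LinkedPropertySketch(int index) {
        this.index = index;
    }
}

final class RootPreWriteSketch {
    static void preWrite(LinkedPropertySketch root, List<LinkedPropertySketch> children) {
        if (children.isEmpty()) {
            return; // assumption: a root without children keeps child = -1
        }
        // Step 1: the root's child property is the index of its first child.
        root.child = children.get(0).index;
        for (int k = 0; k < children.size(); k++) {
            LinkedPropertySketch current = children.get(k);
            // Step 2: next points at the following sibling, or -1 for the last child.
            current.next = (k + 1 < children.size())
                ? children.get(k + 1).index
                : LinkedPropertySketch.NO_INDEX;
            // Step 3: the previous property is always -1 in this design.
            current.previous = LinkedPropertySketch.NO_INDEX;
        }
    }

    public static void main(String[] args) {
        LinkedPropertySketch root = new LinkedPropertySketch(0);
        List<LinkedPropertySketch> children = new ArrayList<>();
        children.add(new LinkedPropertySketch(1));
        children.add(new LinkedPropertySketch(2));
        preWrite(root, children);
        System.out.println("root.child=" + root.child
            + ", first.next=" + children.get(0).next);
    }
}
```

The result is a singly linked list of siblings rather than a balanced tree, which is in keeping with the stated goal of keeping the filesystem layout as simple as possible.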