This document describes the design of the POIFS system. It is organized as follows:
This document is written as part of an iterative process. As that process is not yet complete, neither is this document.
The design of POIFS is not dependent on the code written for the proof-of-concept prototype POIFS package.
As usual, the primary considerations in the design of POIFS involve the classic space-time tradeoff. In this case, the main consideration is minimizing the memory footprint of POIFS. POIFS may be called upon to create relatively large documents; in a web application server, it may be called upon to create several documents simultaneously, and it will likely coexist with other Serializer systems, competing with those systems for space on the server.
We've addressed the risk of being too slow through a proof-of-concept prototype. This prototype read an existing file, decomposed it into its constituent documents, composed a new POIFS from those documents, and wrote the POIFS file back to disk. The output file, while not necessarily a byte-for-byte image of the input file, was then verified to be readable by the application that generated the input file. The prototype proved to be quite fast, reading, decomposing, and regenerating a large (300K) file in 2 to 2.5 seconds.
While the POIFS format allows great flexibility in laying out the documents and the other internal data structures, the layout of the filesystem will be kept as simple as possible.
The design of the POIFS is broken down into two parts: discussion of the classes and interfaces, and discussion of how these classes and interfaces will be used to convert an appropriate Java InputStream (such as an XML stream) to a POIFS output stream containing an HSSF document.
Classes and Interfaces
The classes and interfaces used in the POIFS are broken down as follows:
Package | Contents |
---|---|
net.sourceforge.poi.poifs.storage | Block classes and interfaces |
net.sourceforge.poi.poifs.property | Property classes and interfaces |
net.sourceforge.poi.poifs.filesystem | Filesystem classes and interfaces |
net.sourceforge.poi.util | Utility classes and interfaces |
The block classes and interfaces are shown in the following class diagram.
Class/Interface | Description |
---|---|
BATBlock | The BATBlock class represents a single big block containing 128 BAT entries. Its _fields array is used to read and write the BAT entries into the _data array. Its createBATBlocks method is used to create an array of BATBlock instances from an array of int BAT entries. Its calculateStorageRequirements method calculates the number of BAT blocks necessary to hold the specified number of BAT entries. |
BigBlock | The BigBlock class is an abstract class representing the common big block of 512 bytes. It implements BlockWritable, trivially delegating the writeBlocks method of BlockWritable to its own abstract writeData method. |
BlockWritable | The BlockWritable interface defines a single method, writeBlocks, that is used to write an implementation's block data to an OutputStream. |
DocumentBlock | The DocumentBlock class is used by a Document to hold its raw data. It also retains the number of bytes read, as this is used by the Document class to determine the total size of the data, and is also used internally to determine whether the block was filled by the InputStream or not. The DocumentBlock constructor is passed an InputStream from which to fill its _data array. The size method returns the number of bytes read (_bytes_read) when the instance was constructed. The partiallyRead method returns true if the _data array was not completely filled, which may be interpreted by the Document as having reached the end of file point. In typical use, a Document constructs DocumentBlock instances from its InputStream until partiallyRead returns true. |
HeaderBlock | The HeaderBlock class is used to contain the data found in a POIFS header. Its IntegerField members are used to read and write the appropriate entries into the _data array. Its setBATBlocks, setPropertyStart, and setXBATStart methods are used to set the appropriate fields in the _data array. The calculateXBATStorageRequirements method is used to determine how many XBAT blocks are necessary to accommodate the specified number of BAT blocks. |
PropertyBlock | The PropertyBlock class is used to contain Property instances for the PropertyTable class. It contains an array, _properties, of 4 Property instances, which together comprise the 512 bytes of a BigBlock. The createPropertyBlockArray method is used to convert a List of Property instances into an array of PropertyBlock instances. The number of Property instances is rounded up to a multiple of 4 by creating empty anonymous inner class extensions of Property. |
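The storage calculation described for BATBlock amounts to a ceiling division. The following is a minimal sketch, assuming that each BAT entry is a 4-byte int so that a 512-byte big block holds the 128 entries stated above; the class name BatStorageSketch is hypothetical and is not part of the POIFS packages.

```java
// Hypothetical sketch of BATBlock.calculateStorageRequirements; not the POIFS source.
class BatStorageSketch {
    static final int BIG_BLOCK_SIZE = 512;   // bytes per big block (from the BigBlock description)
    static final int BAT_ENTRY_SIZE = 4;     // assumption: one int per BAT entry
    static final int ENTRIES_PER_BLOCK = BIG_BLOCK_SIZE / BAT_ENTRY_SIZE; // 128

    // Number of BAT blocks needed to hold entryCount entries (rounds up).
    static int calculateStorageRequirements(int entryCount) {
        return (entryCount + ENTRIES_PER_BLOCK - 1) / ENTRIES_PER_BLOCK;
    }

    public static void main(String[] args) {
        for (int entries : new int[] {0, 1, 128, 129}) {
            System.out.println(entries + " entries need "
                    + calculateStorageRequirements(entries) + " BAT block(s)");
        }
    }
}
```

A partially filled final block still costs a whole block, which is why 129 entries require two BAT blocks.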
The property classes and interfaces are shown in the following class diagram.
Class/Interface | Description |
---|---|
Directory | The Directory interface is implemented by the RootProperty class. It is not strictly necessary for the initial POIFS implementation, but when the POIFS supports directory elements, this interface will be more widely implemented, and so is included in the design at this point to ease the eventual support of directory elements. Its methods are a getter/setter pair: getChildren, returning an Iterator of Property instances; and addChild, which will allow the caller to add another Property instance to the Directory's children. |
DocumentProperty | The DocumentProperty class is a trivial extension of Property and is used by Document to keep track of its associated entry in the PropertyTable. Its constructor takes a name and the document size, on the assumption that the Document will not create a DocumentProperty until after it has created the storage for the document data and therefore knows how much data there is. |
File | The File interface specifies the behavior of reading and writing the next and previous child fields of a Property. |
Property | The Property class is an abstract class that defines the basic data structure of an element of the Property Table. Its ByteField, ShortField, and IntegerField members are used to read and write data into the appropriate locations in the _raw_data array. The _index member is used to hold a Property instance's index in the List of Property instances maintained by PropertyTable, which is used to populate the child property of parent Directory properties and the next property and previous property of sibling File properties. The _name, _next_file, and _previous_file members are used to help fill the appropriate fields of the _raw_data array. Setters are provided for some of the fields (name, property type, node color, child property, size, index, start block), as well as a few getters (index, child property). The preWrite method is abstract and is used by the owning PropertyTable to iterate through its Property instances and prepare each for writing. The shouldUseSmallBlocks method returns true if the Property's size is sufficiently small; how small is none of the caller's business. |
PropertyBlock | See the description in PropertyBlock. |
PropertyTable | The PropertyTable class holds all of the DocumentProperty instances and the RootProperty instance for a Filesystem instance. It maintains a List of its Property instances (_properties), and when prepared to write its data by a call to preWrite, it gets and holds an array of PropertyBlock instances (_blocks). It also maintains its start block in its _start_block member. It has a method, getRoot, to get the RootProperty, returning it as an implementation of Directory; a method to add a Property, addProperty; and a method to get its start block, getStartBlock. |
RootProperty | The RootProperty class acts as the Directory for all of the DocumentProperty instances. As such, it is more of a pure directory entry than a proper root entry in the Property Table, but the initial POIFS implementation does not warrant the additional complexity of a full-blown root entry, and so it is not modeled in this design. It maintains a List of its children, _children, in order to perform its directory-oriented duties. |
The filesystem classes and interfaces are shown in the following class diagram.
Class/Interface | Description |
---|---|
Filesystem | The Filesystem class is the top-level class that manages the creation of a POIFS document. It maintains a PropertyTable instance in its _property_table member, a HeaderBlock instance in its _header_block member, and a List of its Document instances in its _documents member. It provides a method for a client to create a document (createDocument), and a method to write the Filesystem to an OutputStream (writeFilesystem). |
BATBlock | See the description in BATBlock |
BATManaged | The BATManaged interface defines common behavior for objects whose location in the written file is managed by the Block Allocation Table. It defines methods to get a count of the implementation's BigBlock instances (countBlocks), and to set an implementation's start block (setStartBlock). |
BlockAllocationTable | The BlockAllocationTable is an implementation of the POIFS Block Allocation Table. It is only created when the Filesystem is about to be written to an OutputStream. It contains an IntList of block numbers for all of the BATManaged implementations owned by the Filesystem, _entries, which is filled by calls to allocateSpace. It fills its array, _blocks, of BATBlock instances when its createBATBlocks method is called. This method has to take into account its own storage requirements, as well as those of the XBAT blocks, and so calls BATBlock.calculateStorageRequirements and HeaderBlock.calculateXBATStorageRequirements repeatedly until the counts returned by those methods stabilize. The countBlocks method returns the number of BATBlock instances created by the preceding call to createBATBlocks. |
BlockWritable | See the description in BlockWritable |
Document | The Document class is used to contain a document, such as an HSSF workbook. It has its own DocumentProperty (_property) and stores its data in a collection of DocumentBlock instances (_blocks). It has a method, getDocumentProperty, to get its DocumentProperty. |
DocumentBlock | See the description in DocumentBlock |
DocumentProperty | See the description in DocumentProperty |
HeaderBlock | See the description in HeaderBlock |
PropertyTable | See the description in PropertyTable |
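The repeated calculation described for BlockAllocationTable can be sketched as a fixed-point loop: the BAT must also have entries for its own blocks and for any XBAT blocks, so the counts are recomputed until they stop changing. This is only a sketch under stated assumptions. The header capacity of 109 BAT block references and the XBAT capacity of 127 references per block are assumptions taken from the commonly documented compound file layout, not from this design, and the class and method names are hypothetical.

```java
// Hypothetical sketch of the "repeat until stable" sizing; not the POIFS source.
class BatSizingSketch {
    static final int ENTRIES_PER_BAT_BLOCK = 128; // from the BATBlock description
    static final int BAT_REFS_IN_HEADER = 109;    // assumption: BAT blocks referenced directly by the header
    static final int BAT_REFS_PER_XBAT = 127;     // assumption: BAT blocks referenced by one XBAT block

    static int ceilDiv(int value, int divisor) {
        return (value + divisor - 1) / divisor;
    }

    // Stand-in for BATBlock.calculateStorageRequirements.
    static int batBlocksFor(int entryCount) {
        return ceilDiv(entryCount, ENTRIES_PER_BAT_BLOCK);
    }

    // Stand-in for HeaderBlock.calculateXBATStorageRequirements.
    static int xbatBlocksFor(int batBlockCount) {
        int overflow = Math.max(0, batBlockCount - BAT_REFS_IN_HEADER);
        return ceilDiv(overflow, BAT_REFS_PER_XBAT);
    }

    // Returns {batBlocks, xbatBlocks} for the given number of already allocated blocks.
    static int[] stabilize(int managedBlockCount) {
        int bat = 0;
        int xbat = 0;
        while (true) {
            // Every block needs a BAT entry, including the BAT and XBAT blocks themselves.
            int newBat = batBlocksFor(managedBlockCount + bat + xbat);
            int newXbat = xbatBlocksFor(newBat);
            if (newBat == bat && newXbat == xbat) {
                return new int[] {bat, xbat};
            }
            bat = newBat;
            xbat = newXbat;
        }
    }

    public static void main(String[] args) {
        int[] counts = stabilize(20000);
        System.out.println("BAT blocks: " + counts[0] + ", XBAT blocks: " + counts[1]);
    }
}
```

The loop terminates because both counts only grow, and each added BAT block provides room for far more entries than it consumes.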
The utility classes and interfaces are shown in the following class diagram.
Class/Interface | Description |
---|---|
BitField | The BitField class is used primarily by HSSF code to manage bit-mapped fields of HSSF records. It is not likely to be used in the POIFS code itself and is only included here for the sake of complete documentation of the POI utility classes. |
ByteField | The ByteField class is an implementation of FixedField for the purpose of managing reading and writing to a byte-wide field in an array of bytes. |
FixedField | The FixedField interface defines a set of methods for reading a field from an array of bytes or from an InputStream, and for writing a field to an array of bytes. Implementations typically require an offset in their constructors that, for the purposes of reading and writing to an array of bytes, makes sure that the correct bytes in the array are read or written. |
HexDump | The HexDump class is a debugging class that can be used to dump an array of bytes to an OutputStream. The static method dump takes an array of bytes, a long offset that is used to label the output, an open OutputStream, and an int index that specifies the starting index within the array of bytes. The data is displayed 16 bytes per line, with each byte displayed in hexadecimal format and again in printable form, if possible (a byte is considered printable if its value is in the range of 32 ... 126). Here is an example of a small array of bytes with an offset of 0x110: |
IntegerField | The IntegerField class is an implementation of FixedField for the purpose of managing reading and writing to an integer-wide field in an array of bytes. |
IntList | The IntList class is a work-around for functionality missing in Java (see https://developer.java.sun.com/developer/bugParade/bugs/4487555.html for details); it is a simple growable array of ints that gets around the requirement of wrapping and unwrapping ints in Integer instances in order to use the java.util.List interface. IntList mimics the functionality of the java.util.List interface as much as possible. |
LittleEndian | The LittleEndian class provides a set of static methods for reading and writing shorts, ints, longs, and doubles in and out of byte arrays, and out of InputStreams, preserving the Intel byte ordering and encoding of these values. |
LittleEndianConsts | The LittleEndianConsts interface defines the width of a short, int, long, and double as stored by Intel processors. |
LongField | The LongField class is an implementation of FixedField for the purpose of managing reading and writing to a long-wide field in an array of bytes. |
ShortField | The ShortField class is an implementation of FixedField for the purpose of managing reading and writing to a short-wide field in an array of bytes. |
ShortList | The ShortList class is a work-around for functionality missing in Java (see https://developer.java.sun.com/developer/bugParade/bugs/4487555.html for details); it is a simple growable array of shorts that gets around the requirement of wrapping and unwrapping shorts in Short instances in order to use the java.util.List interface. ShortList mimics the functionality of the java.util.List interface as much as possible. |
StringUtil | The StringUtil class manages the processing of Unicode strings. |
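The relationship between LittleEndian and the FixedField implementations can be illustrated with a small sketch: static helpers that store an int least significant byte first, and a field object bound to a fixed offset that delegates to them. The names LittleEndianSketch and IntFieldSketch and their method signatures are hypothetical; the actual POI utility classes may be shaped differently.

```java
// Hypothetical helpers in the spirit of LittleEndian; not the POI utility source.
class LittleEndianSketch {
    // Writes value into data[offset..offset+3], least significant byte first.
    static void putInt(byte[] data, int offset, int value) {
        data[offset]     = (byte) (value & 0xFF);
        data[offset + 1] = (byte) ((value >>> 8) & 0xFF);
        data[offset + 2] = (byte) ((value >>> 16) & 0xFF);
        data[offset + 3] = (byte) ((value >>> 24) & 0xFF);
    }

    // Reads an int stored least significant byte first.
    static int getInt(byte[] data, int offset) {
        return (data[offset] & 0xFF)
                | ((data[offset + 1] & 0xFF) << 8)
                | ((data[offset + 2] & 0xFF) << 16)
                | ((data[offset + 3] & 0xFF) << 24);
    }
}

// Hypothetical IntegerField-like class: the offset is fixed at construction time.
class IntFieldSketch {
    private final int offset;
    private int value;

    IntFieldSketch(int offset) {
        this.offset = offset;
    }

    // Stores the value and writes it into the array at this field's offset.
    void set(int newValue, byte[] data) {
        value = newValue;
        LittleEndianSketch.putInt(data, offset, newValue);
    }

    // Refreshes the value from the array at this field's offset.
    void readFromBytes(byte[] data) {
        value = LittleEndianSketch.getInt(data, offset);
    }

    int get() {
        return value;
    }

    public static void main(String[] args) {
        byte[] block = new byte[8];
        IntFieldSketch field = new IntFieldSketch(4);
        field.set(0x12345678, block);
        System.out.println(Integer.toHexString(block[4] & 0xFF)); // lowest byte is stored first
    }
}
```

Binding the offset in the constructor means callers such as a header or property class never repeat offset arithmetic when reading or writing a field.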
This section describes the scenarios of how the POIFS classes and interfaces will be used to convert an appropriate XML stream to a POIFS output stream containing an HSSF document.
It is broken down as suggested by the following scenario diagram:
Step | Description |
---|---|
1 | The Filesystem is created by the client application. |
2 | The client application tells the Filesystem to create a document, providing an InputStream and the name of the document. This may be repeated several times. |
3 | The client application asks the Filesystem to write its data to an OutputStream. |
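From the client's side, the three steps above might look roughly like the following. ToyFilesystem is a hypothetical stand-in and not the POIFS Filesystem: it mirrors only the createDocument and writeFilesystem entry points named in this design, with an assumed parameter order, and it writes each document padded to 512-byte blocks instead of the real file layout.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for Filesystem; it does not produce a real POIFS file.
class ToyFilesystem {
    private static final int BLOCK_SIZE = 512;
    private final Map<String, byte[]> documents = new LinkedHashMap<>();

    // Step 2: read the whole stream and remember it under the given name.
    void createDocument(InputStream stream, String name) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[BLOCK_SIZE];
            int n;
            while ((n = stream.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            documents.put(name, buffer.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Step 3: write every document, padded with zero bytes to a whole number of blocks.
    void writeFilesystem(OutputStream out) {
        try {
            for (byte[] data : documents.values()) {
                out.write(data);
                int padding = (BLOCK_SIZE - data.length % BLOCK_SIZE) % BLOCK_SIZE;
                out.write(new byte[padding]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class ClientSketch {
    static byte[] run() {
        ToyFilesystem fs = new ToyFilesystem();                                   // step 1
        fs.createDocument(new ByteArrayInputStream(new byte[600]), "Workbook");   // step 2
        fs.createDocument(new ByteArrayInputStream(new byte[10]), "Summary");     // step 2, repeated
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        fs.writeFilesystem(out);                                                  // step 3
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(run().length + " bytes written");
    }
}
```

The document names used here are placeholders chosen for the example.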
Initialization of the POIFS system is shown in the following scenario diagram:
Step | Description |
---|---|
1 | The Filesystem object, which is created for each request to convert an appropriate XML stream to a POIFS output stream containing an HSSF document, creates its PropertyTable. |
2 | The PropertyTable creates its RootProperty instance, making the RootProperty the first Property in its List of Property instances. |
3 | The Filesystem creates its HeaderBlock instance. It should be noted that the decision to create the HeaderBlock at Filesystem initialization is arbitrary; creation of the HeaderBlock could easily and harmlessly be postponed to the appropriate moment in writing the filesystem. |
Creating and adding a document to a POIFS system is shown in the following scenario diagram:
Step | Description |
---|---|
1 | The Filesystem instance creates a new Document instance. It will store the newly created Document in a List of BATManaged instances. |
2 | The Document reads data from the provided InputStream, storing the data in DocumentBlock instances. It keeps track of the byte count as it reads the data. |
3 | The Document creates a DocumentProperty to keep track of its property data. The byte count is stored in the newly created DocumentProperty instance. |
4 | The Filesystem requests the newly created DocumentProperty from the newly created Document instance. |
5 | The Filesystem sends the newly created DocumentProperty to the Filesystem's PropertyTable so that the PropertyTable can add the DocumentProperty to its List of Property instances. |
6 | The Filesystem gets the RootProperty from its PropertyTable. |
7 | The Filesystem adds the newly created DocumentProperty to the RootProperty. |
Although typical deployment of the POIFS system will only entail adding a single Document (the workbook) to the Filesystem, there is nothing in the design to prevent multiple Documents from being added to the Filesystem. This flexibility can be employed to write summary information document(s) in addition to the workbook.
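Step 2 of the scenario above, in which the Document fills DocumentBlock instances from its InputStream, can be sketched as a loop that stops at the first partially read block. DocumentBlockSketch is a hypothetical stand-in for DocumentBlock; skipping a trailing empty block is an assumption of this sketch, not something this design specifies.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for DocumentBlock; not the POIFS source.
class DocumentBlockSketch {
    static final int BLOCK_SIZE = 512;
    private final byte[] data = new byte[BLOCK_SIZE];
    private final int bytesRead;

    // Fills the block from the stream, stopping early only at end of stream.
    DocumentBlockSketch(InputStream stream) throws IOException {
        int total = 0;
        while (total < BLOCK_SIZE) {
            int n = stream.read(data, total, BLOCK_SIZE - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        bytesRead = total;
    }

    int size() {
        return bytesRead;
    }

    boolean partiallyRead() {
        return bytesRead != BLOCK_SIZE;
    }

    // Reads blocks until one is only partially filled, which signals end of stream.
    static List<DocumentBlockSketch> readAll(InputStream stream) {
        List<DocumentBlockSketch> blocks = new ArrayList<>();
        try {
            while (true) {
                DocumentBlockSketch block = new DocumentBlockSketch(stream);
                if (block.size() > 0) {
                    blocks.add(block); // assumption: an empty trailing block is not kept
                }
                if (block.partiallyRead()) {
                    break;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return blocks;
    }

    // The byte count a Document would hand to its DocumentProperty.
    static int totalSize(List<DocumentBlockSketch> blocks) {
        int total = 0;
        for (DocumentBlockSketch block : blocks) {
            total += block.size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<DocumentBlockSketch> blocks = readAll(new ByteArrayInputStream(new byte[1300]));
        System.out.println(blocks.size() + " blocks, " + totalSize(blocks) + " bytes");
    }
}
```

The inner read loop matters because a single InputStream read call may return fewer bytes than requested before the stream ends.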
Writing the filesystem is shown in the following scenario diagram:
Step | Description | Notes |
---|---|---|
1 | The Filesystem adds the PropertyTable to its List of BATManaged instances and calls the PropertyTable's preWrite method. The action taken by the PropertyTable is shown in the PropertyTable preWrite scenario diagram. | |
2 | The Filesystem creates the BlockAllocationTable. | |
3 | The Filesystem gets the block count from the BATManaged instance. | Steps 3 to 5 are repeated for each BATManaged instance in the Filesystem's List of BATManaged instances (i.e., the Documents, in order of their addition to the Filesystem, followed by the PropertyTable). |
4 | The Filesystem sends the block count to the BlockAllocationTable, which adds the appropriate entries to its IntList of entries, returning the starting block for the newly added entries. | |
5 | The Filesystem gives the start block number to the BATManaged instance. If the BATManaged instance is a Document, it sets the start block field in its DocumentProperty. | |
6 | The Filesystem tells the BlockAllocationTable to create its BATBlock instances. | |
7 | The Filesystem gives the BAT information to the HeaderBlock so that it can set its BAT fields and, if necessary, create XBAT blocks. | |
8 | If the filesystem is unusually large (over 7MB), the HeaderBlock will create XBAT blocks to contain the BAT data that it cannot hold directly. In this case, the Filesystem tells the HeaderBlock where those additional blocks will be stored. | |
9 | The Filesystem gives the PropertyTable start block to the HeaderBlock. | |
10 | The Filesystem tells the BlockWritable instance to write its blocks to the provided OutputStream. This step is repeated for each BlockWritable instance, in this order: | |
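Steps 3 to 5 above can be sketched as follows: each managed object reports a block count, the allocation table appends a chain of entries, and the returned start block is handed back. This is a sketch under assumptions. The end-of-chain marker of -2 is taken from the commonly documented file format rather than from this design, a java.util.List stands in for IntList, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of BlockAllocationTable.allocateSpace; not the POIFS source.
class AllocationSketch {
    static final int END_OF_CHAIN = -2; // assumption: end-of-chain marker used by the file format

    // Stand-in for the IntList of entries; entry i holds the index of the block that follows block i.
    private final List<Integer> entries = new ArrayList<>();

    // Appends a chain of blockCount entries and returns the index of its first block.
    int allocateSpace(int blockCount) {
        int start = entries.size();
        for (int i = 0; i < blockCount; i++) {
            boolean last = (i == blockCount - 1);
            entries.add(last ? END_OF_CHAIN : start + i + 1);
        }
        return start;
    }

    List<Integer> entries() {
        return entries;
    }

    public static void main(String[] args) {
        AllocationSketch table = new AllocationSketch();
        int[] blockCounts = {3, 2}; // e.g. a Document with 3 blocks, then a PropertyTable with 2
        for (int count : blockCounts) {
            int startBlock = table.allocateSpace(count); // steps 3 and 4
            System.out.println("start block " + startBlock); // step 5 would pass this to setStartBlock
        }
        System.out.println(table.entries());
    }
}
```

Because each allocation is appended after the previous one, the managed objects end up stored contiguously in the order in which they were allocated.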
Step | Description |
---|---|
1 | The PropertyTable calls setIndex for each of its Property instances, so that each Property now knows its index within the PropertyTable's List of Property instances. |
2 | The PropertyTable requests the PropertyBlock class to create an array of PropertyBlock instances. |
3 | The PropertyBlock calculates the number of empty Property instances it needs to create and creates them. The number to create is whatever is needed to round the total number of Property instances up to the next multiple of 4 (zero if the count is already a multiple of 4). |
4 | The PropertyBlock creates the required number of PropertyBlock instances from the List of Property instances, including the newly created empty Property instances. |
5 | The PropertyTable calls preWrite on each of its Property instances. For DocumentProperty instances, this call is a no-op. For the RootProperty, the action taken is shown in the RootProperty preWrite scenario diagram. |
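The padding calculation in step 3 follows from the PropertyBlock description, which holds 4 Property instances per block. A minimal sketch, with hypothetical class and method names:

```java
// Hypothetical sketch of the empty-Property calculation; not the POIFS source.
class PropertyPaddingSketch {
    static final int PROPERTIES_PER_BLOCK = 4; // from the PropertyBlock description

    // Number of empty Property instances needed to reach a multiple of 4.
    static int emptyPropertiesNeeded(int propertyCount) {
        int remainder = propertyCount % PROPERTIES_PER_BLOCK;
        return (PROPERTIES_PER_BLOCK - remainder) % PROPERTIES_PER_BLOCK;
    }

    // Number of PropertyBlock instances after padding.
    static int blockCount(int propertyCount) {
        return (propertyCount + emptyPropertiesNeeded(propertyCount)) / PROPERTIES_PER_BLOCK;
    }

    public static void main(String[] args) {
        for (int count : new int[] {1, 4, 5}) {
            System.out.println(count + " properties: pad with " + emptyPropertiesNeeded(count)
                    + ", giving " + blockCount(count) + " block(s)");
        }
    }
}
```

The outer modulo keeps the result at zero when the count is already a multiple of 4, so no extra block of empty entries is created.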
Step | Description | Notes |
---|---|---|
1 | The RootProperty sets its child property with the index of the child Property that is first in its List of children. | |
2 | The RootProperty sets its child's next property field with the index of the child's next sibling in the RootProperty's List of children. If the child is the last in the List, its next property field is set to -1. | Steps 2 and 3 are repeated for each File in the RootProperty's List of children. |
3 | The RootProperty sets its child's previous property field with a value of -1. | |
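The RootProperty preWrite steps link the children into a simple forward chain. The sketch below assumes each child already knows its index in the PropertyTable; the Entry class, the link method, and the choice of -1 as the child property of a root without children are hypothetical and are not taken from this design.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of RootProperty.preWrite; not the POIFS source.
class SiblingChainSketch {
    static final int NO_INDEX = -1;

    // Minimal stand-in for a File-like Property with next and previous fields.
    static class Entry {
        final int index;
        int next = NO_INDEX;
        int previous = NO_INDEX;

        Entry(int index) {
            this.index = index;
        }
    }

    // Links the children in list order and returns the root's child property value.
    static int link(List<Entry> children) {
        for (int i = 0; i < children.size(); i++) {
            Entry child = children.get(i);
            boolean last = (i == children.size() - 1);
            child.next = last ? NO_INDEX : children.get(i + 1).index; // step 2
            child.previous = NO_INDEX;                                 // step 3
        }
        // Step 1: the root points at its first child (assumption: -1 when there are none).
        return children.isEmpty() ? NO_INDEX : children.get(0).index;
    }

    public static void main(String[] args) {
        List<Entry> children = new ArrayList<>();
        children.add(new Entry(1));
        children.add(new Entry(2));
        children.add(new Entry(3));
        System.out.println("root child = " + link(children));
        for (Entry e : children) {
            System.out.println(e.index + ": next=" + e.next + ", previous=" + e.previous);
        }
    }
}
```

Leaving every previous field at -1 turns the sibling structure into a singly linked list, in keeping with the stated goal of laying out the filesystem as simply as possible.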