WOOD (William's Object Oriented Database)
-----------------------------------------
Bill St. Clair
bill@cambridge.apple.com
Wood is a persistent object store for MCL 2.0. Its goal is to provide
a way to save/restore Lisp objects to/from disk. A secondary goal is
to remain as simple as possible so that it can be completed in a few
months.
This document is intended to be a starting point for using Wood.
A more complete document may appear later, but hey, you've got source
code.
Wood's file format is new. It is not intended to be compatible with
anything. A Wood file is called a persistent heap.
A persistent heap has one distinguished root object. All objects in the
heap must be accessible from the root. Users may want to build other
access mechanisms on top of this. For instance, you may prefer to
create unique identifiers, enter them in an index, and make the index
be the root object of the persistent heap.
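The index-as-root idea can be sketched with the btree functions described later in this document. This is only an illustration: INIT-INDEX, SAVE-UNDER-ID, and FETCH-BY-ID are made-up names, the identifiers are assumed to be strings, the stored objects are assumed to be non-immediate, and P-BTREE-LOOKUP is assumed to return NIL when no default is given and the key is missing.

```lisp
;; Sketch only; requires Wood to be loaded.

(defun init-index (pheap)
  "Make a new btree and install it as the root object of PHEAP."
  (setf (wood:root-object pheap) (wood:p-make-btree pheap)))

(defun save-under-id (pheap id object)
  "Enter OBJECT in the root index of PHEAP under the string ID."
  (wood:p-btree-store (wood:root-object pheap) id object))

(defun fetch-by-id (pheap id)
  "Return an in-memory copy of the object stored under ID, or NIL."
  (let ((found (wood:p-btree-lookup (wood:root-object pheap) id)))
    (and found (wood:p-load found))))
```

Every object entered this way is reachable from the root through the index, which is what the persistent heap requires.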
My plans for the first version of Wood are to support nested transactions
and recovery for single-user access. Later versions may add multi-user
support. The transaction log operates at the block file I/O level. It
knows nothing about Lisp objects. Object locking has disk page
resolution.
A disk page defaults to 512 bytes, but this is a parameter of persistent
heap creation.
Wood files can be as large as 4 gigabytes, but file sizes over 256 megs
will cons lots of bignum addresses.
B*-trees are used for indexing. EQ hash tables are implemented on top of
the B*-trees.
Consing areas provide some user control over making related objects
close to each other on disk. This is an idea from the Lisp machine.
Basically, each of the consing primitives takes an optional area
parameter telling where to create the storage.
This version must be garbage collected off-line. Future versions may
support incremental garbage collection if there is great demand and
I have time.

How goes the implementation
---------------------------

This is version 0.6. See the file "@Release Notes 0.6" for new features
and a list of bug fixes.
I do not presently have time to work on anything but bug fixes for
Wood. If I do get some time, I'll probably do a garbage collector
first.
I have done quite a few consers, predicates, and accessors. I'll do more
as I need them and people ask for them. They're usually quite easy.
P-LOAD & P-STORE work for all Lisp objects.
The B*-trees work (I needed them for interning symbols). All of the other
P-xxx functions described below work except where noted.
I have decided to do a simple undo/redo transaction log. It requires that
the log be forced any time an uncommitted page is written to disk (see
the "Questions" section below).
I have not yet written a garbage collector. The first one will probably
be a copying GC. I may do a mark, sweep, compact GC if there are lots
of requests.
I have not yet defined functions for saving/restoring Mac heap objects.
Alan Ruttenberg suggested a stream-based protocol for transferring bytes
to/from largish persistent objects containing random bytes.
I have not yet made a background function to flush dirty pages while
your machine is idle.
There should be a print-object method for PPTR's that tells at least
a little bit about the type and size of the object.

How to try it out
-----------------

Wood will not work in MCL 2.0b1. You must have 2.0f2 or later. It works
in MCL 2.0 final.
Wood also requires a patch to MCL's hash table implementation. This
patch is included as the file "hash-table-patch.fasl" in the "Patches"
folder. It is also part of patch 1 for MCL 2.0. MCL patches are
available for anonymous FTP from cambridge.apple.com in the directory
"/pub/mcl2/patches/". They are also posted in the "MCL Discussion" area
on AppleLink.
To load the alpha implementation (assuming the "WOOD" folder is a subfolder
of your "CCL" folder):
(require "WOOD" "ccl:wood;wood")
All of the WOOD functions are exported from the package named "WOOD".
Note that because there is no recovery yet and because I haven't put
WITHOUT-INTERRUPTS or UNWIND-PROTECT in some of the places that need it,
crashing your machine or aborting at the wrong time may corrupt your
persistent heap.

(defun test-pheap ()
  ;; First run: create the file and store a 10-element array as the root.
  (unless (directory "temp.pheap")
    (let* ((pheap (wood:open-pheap "temp.pheap" :if-does-not-exist :create))
           (a (wood:p-make-array pheap 10)))
      (setf (wood:root-object pheap) a)
      (dotimes (i 10)
        (setf (wood:p-aref a i) (list i)))
      (wood:close-pheap pheap)))
  ;; Every run: reopen the file and check each element of the root array.
  (let* ((pheap (wood:open-pheap "temp.pheap"))
         (a (wood:root-object pheap)))
    (dotimes (i 10)
      (let ((value (print (wood:p-load (wood:p-aref a i)))))
        (unless (and (listp value)
                     (eql (car value) i)
                     (null (cdr value)))
          (cerror "Continue."
                  "SB: ~s, WAS: ~s" (list i) value))))
    (wood:close-pheap pheap)))

The file "example.lisp" contains a more extended example.

Opening and Closing persistent heaps
------------------------------------

A persistent heap is represented by an instance of the PHEAP class.

OPEN-PHEAP filename &key if-does-not-exist if-exists
area-segment-size page-size max-pages
Open a persistent heap from filename.
IF-DOES-NOT-EXIST defaults to :error
IF-EXISTS defaults to :overwrite
AREA-SEGMENT-SIZE is the default allocation block size for consing areas
PAGE-SIZE defaults to 512
MAX-PAGES defaults to (ceiling 100000 page-size),
  i.e., the disk-cache for this persistent heap will keep
  up to 100000 bytes of the file in memory.
Returns the PHEAP instance. AREA-SEGMENT-SIZE and PAGE-SIZE will be
ignored if the file already exists.
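The MAX-PAGES default is easy to work out by hand. The small function below is plain Common Lisp, not part of Wood; its name and the CACHE-BYTES parameter are made up for illustration.

```lisp
(defun default-max-pages (page-size &optional (cache-bytes 100000))
  "Number of pages needed to hold at least CACHE-BYTES bytes."
  ;; CEILING returns two values; VALUES keeps only the quotient.
  (values (ceiling cache-bytes page-size)))

(default-max-pages 512)    ; 196 pages with the default page size
(default-max-pages 4096)   ; 25 pages with a 4096-byte page
```

Because of the rounding up, 196 pages of 512 bytes cover 100352 bytes, slightly more than the nominal 100000.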
CLOSE-PHEAP pheap
Close the persistent heap after flushing any unwritten blocks
to disk. Quitting from MCL will close any open persistent heaps,
though (after recovery is implemented) it will not abort active
transactions (this will be done the next time the persistent heap
is opened).
WITH-OPEN-PHEAP (pheap filename &rest options) &body body
WITH-OPEN-PHEAP is to OPEN-PHEAP & CLOSE-PHEAP as
WITH-OPEN-FILE is to OPEN & CLOSE. I.e., execute BODY with
PHEAP bound to the result of calling OPEN-PHEAP with the
FILENAME and the OPTIONS. Call CLOSE-PHEAP on exiting the
dynamic-extent of the WITH-OPEN-PHEAP form, whether normally
or abnormally.
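As a usage sketch, the read-back half of TEST-PHEAP above could be written with WITH-OPEN-PHEAP. It assumes Wood is loaded and that the file was created by TEST-PHEAP, so the root is a vector whose elements are non-immediate; PRINT-ROOT-ARRAY is a made-up name.

```lisp
;; Sketch only; requires Wood to be loaded.
(defun print-root-array (filename)
  "Print each element of the root vector stored in the pheap FILENAME."
  (wood:with-open-pheap (pheap filename)
    (let ((a (wood:root-object pheap)))
      (dotimes (i (wood:p-length a))
        ;; P-AREF returns a PPTR; P-LOAD brings the element into memory.
        (print (wood:p-load (wood:p-aref a i)))))))

;; (print-root-array "temp.pheap")
```

The pheap is closed when control leaves the form, including when the body is aborted.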
ROOT-OBJECT pheap
Return the root object of the given persistent heap. This will be a
PPTR (see below) unless the root is an immediate object (why anyone
would save an immediate object as the root I can't imagine).
(SETF ROOT-OBJECT) new-root pheap
Change the root object of PHEAP to NEW-ROOT.
FLUSH-PHEAP pheap
Write all dirty pages to disk.

Informational functions
-----------------------

PHEAP-STREAM pheap
Returns the stream that PHEAP is using for I/O. Writing to this
stream will likely corrupt the persistent heap.
PHEAP-PATHNAME pheap
Returns the pathname for the file in which the persistent heap is
stored.

Access paradigms
----------------

Wood will support three access paradigms. The first version will
support the intermediate and low level access paradigms. Later versions
may add support for real persistent objects.
1. Real persistent objects
==========================
This level provides a persistent-object metaclass. Objects with this
metaclass are stored on disk. SLOT-VALUE reads a slot into memory;
(SETF SLOT-VALUE) writes a slot onto disk. At this level, the programmer
no longer needs to worry about whether an object is persistent or not.
The main reason I won't provide this immediately is that our CLOS
implementation does not yet support SLOT-VALUE-USING-CLASS.
2. Intermediate level access
============================
At this level, operations are done primarily on in-memory objects, and
WOOD is used to transfer objects to and from disk. There is also support
for large disk-based tables and arrays.
The names of the operators at this level begin with "P-".
There is usually a "P-" operator for each "DC-" operator in the
low level.
Pointers into the persistent heap are represented by a PPTR
instance. PPTR instances are interned (in a weak hash table),
so you will never cons more than one PPTR for a single disk address.
Two weak hash tables cache the correspondence between objects
in memory and (tagged) addresses in the persistent heap.
These operators try to be "smart" in that they will move objects from
memory to disk as necessary. Only P-LOAD conses disk objects in memory.
All the other accessors return PPTR instances or immediate Lisp objects.
Objects are copied between memory and disk with the following two
functions.
P-STORE pheap object &optional descend
Store the lisp OBJECT in the given persistent heap. DESCEND controls
what to do if an object that is already in the pheap is encountered:
  :DEFAULT  the default. Recursive descent will stop when an object
            that is already on disk is encountered.
  NIL       recursive descent will stop as for :DEFAULT, but newly
            consed objects will not be stored in the cache. This allows
            storing of stack-consed or reusable objects.
  T         will do a complete recursive descent of the object possibly
            overwriting values on the disk with updated values from
            memory.
P-STORE returns a PPTR (unless OBJECT is an immediate object that
requires no consing to store on disk, in which case OBJECT will be
returned). All disk consing will be done in the current area.
In order to prevent saving the entire Lisp heap, P-STORE does not
recursively descend symbols, packages, or classes. When P-STORE saves
a CLOS instance, it saves only the instance slots; class slots are
ignored.
P-LOAD pptr &optional depth
Load an object from disk into memory. The depth argument controls
the depth to which the object will be converted to its in-memory
representation:
  :DEFAULT  Recursively descends uvectors & conses stopping when it finds
            cached values. This is the default.
  NIL       Look the pointer up in the cache. Do no other conversion.
  :SINGLE   Translates a single level.
            E.g. will cons an array, but its non-immediate components
            will be PPTR's.
  <fixnum>  Same as :SINGLE, but will do no conversion unless the length
            of a vector is less than the depth value. Will also stop
            converting a list when it becomes this long leaving a PPTR
            as the final CDR.
  T         Converts all levels. If an object is encountered that
            has already been converted, its storage will be overwritten,
            and possibly changed, by the values on disk.
When P-LOAD restores a CLOS instance it sets only the instance slots.
If there are class slots, they will get the default value (from the
DEFCLASS or INITIALIZE-INSTANCE method).
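The DESCEND and DEPTH arguments can be exercised with a short session like the one below. It is a sketch based only on the descriptions above: it assumes Wood is loaded, the file name "scratch.pheap" and the function name STORE-AND-RELOAD are made up, and no return values are shown because they have not been checked against a running Wood.

```lisp
;; Sketch only; requires Wood to be loaded.
(defun store-and-reload ()
  (wood:with-open-pheap (pheap "scratch.pheap" :if-does-not-exist :create)
    (let* ((data (list 1 (vector "a" "b") 3))
           ;; DESCEND defaults to :DEFAULT: store DATA and everything
           ;; reachable from it that is not already on disk.
           (pptr (wood:p-store pheap data)))
      (setf (wood:root-object pheap) pptr)
      ;; Change the in-memory list. As described above, the default
      ;; descent stops at objects that are already on disk, so a descend
      ;; argument of T is used to write the update.
      (setf (third data) 4)
      (wood:p-store pheap data t)
      ;; :SINGLE translates one level only; non-immediate components
      ;; should come back as PPTRs.
      (wood:p-load pptr :single)
      ;; T converts all levels, overwriting already-converted objects
      ;; with the values on disk.
      (wood:p-load pptr t))))
```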

Consers
-------

Values will be P-STORE'd as necessary.

P-MAKE-AREA &key segment-size flags
Make a new consing area. SEGMENT-SIZE is the default size of a segment.
It will be rounded up to a multiple of the block size. FLAGS, a fixnum,
is currently unused, though it is stored in the area.
WITH-CONSING-AREA area &body body
Macro. Executes BODY with the default consing area bound to the given
AREA (a PPTR as returned from P-MAKE-AREA).
P-CONS pheap car cdr &optional area
P-LIST pheap &rest elements
Make a list in the default consing area
P-LIST-IN-AREA pheap area &rest elements
Make a list in an explicit area. The elements will be P-STORE'd in the
default consing area. If AREA is NIL, the default area will be used.
P-MAKE-LIST pheap size &key initial-element area
Again, use the default consing area
P-MAKE-UVECTOR pheap length subtype &key initial-element area
Cons a uvector of the given length in the persistent heap. All data
types except fixnums, symbols, floats, cons cells, characters, &
some other internal immediate values are represented as uvectors.
The length must be a fixnum. The subtype is one of the $V_xxx values in
"WOODEQU.LISP". INITIAL-ELEMENT will default appropriately for vectors
that need to be initialized.
P-MAKE-ARRAY pheap dimensions &key area element-type initial-element
Does not yet support initial-contents, adjustable, fill-pointer,
displaced-to, or displaced-index-offset.
You can, however, save, with P-STORE, a memory array made by MAKE-ARRAY
with those keywords.
P-VECTOR pheap &rest elements
Cons the vector in the default consing area.

Predicates
----------

P-LISTP object
P-CONSP object
P-ATOM object
P-UVECTORP object
P-PACKAGEP object
P-SYMBOLP object
P-STRINGP object
P-SIMPLE-STRING-P object
P-VECTORP object
P-SIMPLE-VECTOR-P object
P-ARRAYP object

Accessors
---------

P-CxR list
(SETF P-CxR) value cons
These are supported for up to four levels.
P-UVSIZE uvector
P-UVREF uvector index
(SETF P-UVREF) value uvector index
P-SVREF simple-vector index
(SETF P-SVREF) value simple-vector index
P-%SVREF simple-vector index
(SETF P-%SVREF) value simple-vector index
Do no type or range checking. Use at your own risk.
P-LENGTH list-or-vector
P-AREF array &rest indices
(SETF P-AREF) value array &rest indices
P-ARRAY-RANK array
P-ARRAY-DIMENSIONS array
P-ARRAY-DIMENSION array dimension

Symbols and Packages
--------------------

P-INTERN pheap string &key package area
Will make a new persistent package if PACKAGE is a package or the
name of a package in memory.
P-FIND-SYMBOL pheap string &optional package
This takes a PHEAP arg as it is common to want to look for a
symbol in the persistent heap without P-STORE'ing the string.
P-FIND-PACKAGE pheap package
PACKAGE can be a package or package name (string or symbol) either
in memory or on the disk.
P-MAKE-PACKAGE pheap name &key nicknames
Currently, persistent heap packages do not support inheritance or
external symbols. They are simply a way to associate a string
with a symbol. This is because of the problems of maintaining
two parallel package hierarchies (most of us have enough problems
with one).
P-SYMBOL-NAME symbol
P-SYMBOL-PACKAGE symbol
P-SYMBOL-VALUE symbol
(SETF P-SYMBOL-VALUE) value symbol
P-PACKAGE-NAME package
P-PACKAGE-NICKNAMES package
P-STRING string-or-symbol

BTREEs
------

P-MAKE-BTREE pheap &key area type
TYPE is currently unused.
P-BTREE-LOOKUP btree key-string &optional default
The key-string must be a string. Comparison is done with string< &
string=, i.e. it is case sensitive.
P-BTREE-STORE btree key-string value
You can also use SETF with P-BTREE-LOOKUP
P-BTREE-DELETE btree key-string
Returns true if there was an entry for KEY-STRING
P-CLEAR-BTREE btree
Remove all entries from the BTREE.
P-MAP-BTREE btree function &optional from to
For each BTREE entry whose key is (if FROM is specified and non-NIL) >= FROM
and (if TO is specified and non-NIL) <= TO, calls FUNCTION with two
arguments: the key and its value. Deals correctly with insertion or
deletion during mapping. The first (key) argument to FUNCTION will be a
string allocated with dynamic-extent. Hence, if you wish to store
it anywhere for longer than FUNCTION's dynamic-extent you will need
to copy it (with, e.g., COPY-SEQ).
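The dynamic-extent caveat matters as soon as keys are collected. A minimal sketch, assuming Wood is loaded; BTREE-KEYS is a made-up helper, not a Wood function.

```lisp
;; Sketch only; requires Wood to be loaded.
(defun btree-keys (btree &optional from to)
  "Return fresh copies of the keys of BTREE between FROM and TO,
in the order P-MAP-BTREE visited them."
  (let ((keys '()))
    (wood:p-map-btree btree
                      #'(lambda (key value)
                          (declare (ignore value))
                          ;; KEY is only valid during this call, so copy
                          ;; it before keeping a reference.
                          (push (copy-seq key) keys))
                      from to)
    (nreverse keys)))
```

Pushing KEY itself instead of the copy would leave the list holding strings whose storage is no longer valid once the mapping function returns.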
P-BTREE-P btree
Returns true if BTREE is a btree.

Hash tables
-----------

P-MAKE-HASH-TABLE &key test weak
TEST must be EQ or #'EQ.
WEAK must be NIL (the default), :KEY, or :VALUE.
P-GETHASH key hash-table &optional default
(SETF P-GETHASH) value key hash-table
P-REMHASH key hash-table
P-CLRHASH hash-table
P-HASH-TABLE-SIZE hash-table
P-MAPHASH function hash-table
Deals correctly with insertion or deletion during mapping.
P-HASH-TABLE-P hash-table
Returns true if hash-table is a hash table (either in-memory
or on disk).

CLOS hooks
----------

There are some generic functions that allow you to customize the way
that WOOD saves/restores CLOS instances. You can decide to save/restore
only a subset of an instance's slots and you can do conversion of the
slot values on the way out/in.
WOOD-SLOT-NAMES-VECTOR object
Returns a vector of the names of the slots to save. The default
method returns a vector containing the names of all the instance slots.
It is important that multiple calls to this generic function on
instances of the same class return the same (EQ) vector.
WOOD-SLOT-VALUE object slot-name
Called for each slot of an instance when it is being saved to disk.
Allows you to convert the slot's value to some other form for being
saved in the persistent heap. The default method calls SLOT-VALUE.
(SETF WOOD-SLOT-VALUE) value object slot-name
Called when an instance is being read into memory. Allows you to
reverse the conversion provided by WOOD-SLOT-VALUE. The default method
calls (SETF SLOT-VALUE).
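A sketch of how these three hooks might be used together. The ACCOUNT class, its slots, and the cents conversion are invented for illustration, and the code assumes Wood is loaded and exports the three generic functions. What value an unsaved slot (SCRATCH here) holds after loading is not stated in this document.

```lisp
;; Sketch only; requires Wood to be loaded.
(defclass account ()
  ((owner   :initarg :owner)
   (balance :initarg :balance)    ; a rational, e.g. 25/2
   (scratch :initform nil)))      ; transient; not saved

;; Created once, so every call returns the same (EQ) vector.
(defvar *account-slot-names* (vector 'owner 'balance))

(defmethod wood:wood-slot-names-vector ((object account))
  *account-slot-names*)

;; On the way out: save BALANCE as an integer number of cents.
(defmethod wood:wood-slot-value ((object account)
                                 (slot-name (eql 'balance)))
  (round (* 100 (call-next-method))))

;; On the way in: turn the cents back into a rational.
(defmethod (setf wood:wood-slot-value) (value (object account)
                                        (slot-name (eql 'balance)))
  (call-next-method (/ value 100) object slot-name))
```

Specializing on the slot name with an EQL specializer leaves every other slot to the default methods.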

Objects that call a function when P-LOAD'ed
-------------------------------------------

Sometimes you need more than just saving the data associated with
one of your CLOS or structure instances. Wood has a hook very similar
to Common Lisp's MAKE-LOAD-FORM to allow this.
P-MAKE-LOAD-FUNCTION object                          [Generic Function]
P-MAKE-LOAD-FUNCTION (object t)                      [Primary method]
Called when a CLOS or structure instance is P-STORE'd to disk.
The provided method returns NIL, meaning just save the slots normally.
If you write a method, it should return two values:
1) load-function.args
2) init-function.args
Each of these should be a list whose CAR is suitable as a first argument
to APPLY and whose CDR is suitable as a last argument to APPLY.
init-function.args can also be NIL to mean that there is no initialization
beyond consing the object.
(apply (car load-function.args) (cdr load-function.args))
should create a possibly empty copy of OBJECT, call it OBJECT-COPY.
(apply (car init-function.args) OBJECT-COPY (cdr init-function.args))
should fill in the slots of OBJECT-COPY.
The following example does the same thing as the default (but will
take up more space on disk):

(defclass foo ()
  ((x :initarg :x)
   (y :initarg :y)))

(defmethod p-make-load-function ((object foo))
  (values '(allocate-instance-of-named-class foo)
          `(set-slot-values (x y)
                            (,(slot-value object 'x)
                             ,(slot-value object 'y)))))

(defun allocate-instance-of-named-class (name)
  (allocate-instance (find-class name)))

(defun set-slot-values (object slots values)
  (loop
    (unless slots (return))
    (setf (slot-value object (pop slots)) (pop values))))

P-MAKE-LOAD-FUNCTION-OBJECT load-function.args init-function.args
Allows you to explicitly create a LOAD-FUNCTION disk object.
P-LOAD'ing the result of this function will behave as follows:

(let ((object (apply (car load-function.args) (cdr load-function.args))))
  (when init-function.args
    (apply (car init-function.args) object (cdr init-function.args)))
  object)

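The load-time behavior described above can be modeled in plain Common Lisp without Wood. The sketch below is not Wood's code: RUN-LOAD-FUNCTION, the POINT class, ALLOCATE-POINT, and FILL-POINT are all made up, and serve only to show how the two argument lists are applied.

```lisp
(defclass point ()
  ((x :initarg :x :accessor point-x)
   (y :initarg :y :accessor point-y)))

(defun allocate-point ()
  "Load function: create an empty POINT."
  (allocate-instance (find-class 'point)))

(defun fill-point (object x y)
  "Init function: fill in the slots of OBJECT."
  (setf (point-x object) x
        (point-y object) y)
  object)

(defun run-load-function (load-function.args init-function.args)
  "Model of the documented behavior: apply the load function, then,
if INIT-FUNCTION.ARGS is non-NIL, apply the init function to the result."
  (let ((object (apply (car load-function.args) (cdr load-function.args))))
    (when init-function.args
      (apply (car init-function.args) object (cdr init-function.args)))
    object))

(defparameter *restored*
  (run-load-function '(allocate-point) '(fill-point 3 4)))
```

Here the object is created empty by the load function and only then filled in by the init function, which receives it as its first argument.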
3. Low level access
===================
This level is mostly used as the internals for implementing
the intermediate level. It can also be used by speed critical
code.
Values read or written at this level are either addresses (currently
integers) or immediate Lisp objects (fixnums, characters, short
floats, and a few internal markers). Each piece of data has an
associated flag saying whether it is an address or an immediate.
Hence, each reader at this level returns two values and each writer
takes an optional "immediate?" flag (for each pointer argument). It
is very easy to corrupt a persistent heap by using the low level
accessors. Usually, they are not what you want.
All data is tagged in the low three bits. You will need to know about
these tags to use Wood at this level. They are defined in the
file "woodequ.lisp".
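Extracting a three-bit tag from an integer address takes one LDB call. The sketch below is plain Common Lisp with made-up function names; the actual tag values and their meanings are those in "woodequ.lisp" and are not reproduced here.

```lisp
(defun address-tag (address)
  "Return the low three bits of ADDRESS, a number from 0 to 7."
  (ldb (byte 3 0) address))

(defun clear-address-tag (address)
  "Return ADDRESS with its low three bits set to zero."
  (logandc2 address 7))

(address-tag #b101011)        ; low bits are 011, so the tag is 3
(clear-address-tag #b101011)  ; #b101000, i.e. 40
```

Three bits allow at most eight distinct tags.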
The names of low-level accessors begin with "DC-" for "DISK-CACHE". A
DISK-CACHE is the name of the data structure that controls cached
access to the bytes in the file. Transaction logging and recovery are
done at the DISK-CACHE level.
DC-xxx functions exist for most of the P-xxx functions and a few more.

Questions
---------

Should P-STORE allow the user fine-grained control over the consing area?
If so, how should it do this?
Three choices for the representation of (double) floats.
1) 8 bytes anywhere in memory.
This is the most space efficient, but makes it impossible
for a memory walker to distinguish floats and cons cells.
It will NOT cause the garbage collector any problems.
This is what the current implementation does.
2) Store as a vector. Requires 16 bytes per float.
(Float vectors will still need only eight bytes per entry).
This frees up a tag, but I have no use in mind for it.
3) Cons floats in a special area.
This requires only 8 bytes per float, but we need to allocate
a page full of floats at a time. This allows a memory walker
to distinguish floats and conses.
P-INTERN currently creates a new package if a package with that
name exists in memory. Is this correct or should P-STORE be
responsible for that?
2 choices for the first recovery method. I plan on redo-undo
recovery with in-place database updating.
1) Never force the log
Recovery requires scanning the entire log since the last backup
to restore uncommitted modified blocks. The advantage is that
no unnecessary I/O is ever done during normal operation.
2) Force the log whenever an uncommitted block is written to disk
This allows checkpointing and the log only needs to be scanned
(and maintained) back to the beginning of the transactions that
were active at the last checkpoint. This will reduce log space
and recovery time, but will require more disk-head movement during
normal operation (I think).
I'm leaning towards option 2 at the moment. Remember that the log for
option 1 will tend to be 1.5 to 2 times as big as the persistent heap
file (I think).
How important is it to integrate disk-based objects with the Common Lisp
type and class system? Does this even make sense? Do I need to do
P-TYPEP & P-TYPE-OF in other than the simple way (which may cons its
brains out):

(defun p-typep (thing type)
  (typep (p-load thing) type))