WOOD (William's Object Oriented Database)
-----------------------------------------
Bill St. Clair
bill@cambridge.apple.com

Wood is a persistent object store for MCL 2.0. Its goal is to provide a way to save/restore Lisp objects to/from disk. A secondary goal is to remain as simple as possible so that it can be completed in a few months.

This document is intended to be a starting point for using Wood. A more complete document may appear later, but hey, you've got source code.

Wood's file format is new. It is not intended to be compatible with anything. A Wood file is called a persistent heap. A persistent heap has one distinguished root object. All objects in the heap must be accessible from the root. Users may want to build other access mechanisms on top of this. For instance, you may prefer to create unique identifiers, enter them in an index, and make the index be the root object of the persistent heap.

My plans for the first version of Wood are to support nested transactions and recovery for single-user access. Later versions may add multi-user support. The transaction log operates at the block file I/O level. It knows nothing about Lisp objects. Object locking has disk page resolution. A disk page defaults to 512 bytes, but this is a parameter of persistent heap creation.

Wood files can be as large as 4 gigabytes, but file sizes over 256 megs will cons lots of bignum addresses.

B*-trees are used for indexing. EQ hash tables are implemented on top of the B*-trees.

Consing areas provide some user control over making related objects close to each other on disk. This is an idea from the Lisp machine. Basically, each of the consing primitives takes an optional area parameter telling where to create the storage.

This version must be garbage collected off-line. Future versions may support incremental garbage collection if there is great demand and I have time.

How goes the implementation
---------------------------
This is version 0.6.
See the file "@Release Notes 0.6" for new features and a list of bug fixes.

I do not presently have time to work on anything but bug fixes for Wood. If I do get some time, I'll probably do a garbage collector first.

I have done quite a few consers, predicates, and accessors. I'll do more as I need them and people ask for them. They're usually quite easy.

P-LOAD & P-STORE work for all Lisp objects.

The B*-trees work (I needed them for interning symbols).

All of the other P-xxx functions described below work except where noted.

I have decided to do a simple undo/redo transaction log. It requires that the log be forced any time an uncommitted page is written to disk (see the "Questions" section below).

I have not yet written a garbage collector. The first one will probably be a copying GC. I may do a mark, sweep, compact GC if there are lots of requests.

I have not yet defined functions for saving/restoring Mac heap objects. Alan Ruttenberg suggested a stream-based protocol for transferring bytes to/from largish persistent objects containing random bytes.

I have not yet made a background function to flush dirty pages while your machine is idle.

There should be a print-object method for PPTR's that tells at least a little bit about the type and size of the object.

How to try it out
-----------------
Wood will not work in MCL 2.0b1. You must have 2.0f2 or later. It works in MCL 2.0 final.

Wood also requires a patch to MCL's hash table implementation. This patch is included as the file "hash-table-patch.fasl" in the "Patches" folder. It is also part of patch 1 for MCL 2.0. MCL patches are available for anonymous FTP from cambridge.apple.com in the directory "/pub/mcl2/patches/". They are also posted in the "MCL Discussion" area on AppleLink.

To load the alpha implementation (assuming the "WOOD" folder is a subfolder of your "CCL" folder):

  (require "WOOD" "ccl:wood;wood")

All of the WOOD functions are exported from the package named "WOOD".
Note that because there is no recovery yet and because I haven't put WITHOUT-INTERRUPTS or UNWIND-PROTECT in some of the places that need it, crashing your machine or aborting at the wrong time may corrupt your persistent heap.

  (defun test-pheap ()
    ;; Create the file on first use and fill the root array with (0), (1), ...
    (unless (directory "temp.pheap")
      (let* ((pheap (wood:open-pheap "temp.pheap" :if-does-not-exist :create))
             (a (wood:p-make-array pheap 10)))
        (setf (wood:root-object pheap) a)
        (dotimes (i 10)
          (setf (wood:p-aref a i) (list i)))
        (wood:close-pheap pheap)))
    ;; Reopen the file and check that every element comes back as (i).
    (let* ((pheap (wood:open-pheap "temp.pheap"))
           (a (wood:root-object pheap)))
      (dotimes (i 10)
        (let ((value (print (wood:p-load (wood:p-aref a i)))))
          (unless (and (listp value)
                       (eql (car value) i)
                       (null (cdr value)))
            (cerror "Continue." "SB: ~s, WAS: ~s" (list i) value))))
      (wood:close-pheap pheap)))

The file "example.lisp" contains a more extended example.

Opening and Closing persistent heaps
------------------------------------
A persistent heap is represented by an instance of the PHEAP class.

OPEN-PHEAP filename &key if-does-not-exist if-exists area-segment-size page-size max-pages
  Open a persistent heap from filename.
  IF-DOES-NOT-EXIST defaults to :error
  IF-EXISTS defaults to :overwrite
  AREA-SEGMENT-SIZE is the default allocation block size for consing areas
  PAGE-SIZE defaults to 512
  MAX-PAGES defaults to (ceiling 100000 page-size)
  I.e., the disk-cache for this persistent heap will keep up to 100000 bytes of the file in memory.
  Returns the PHEAP instance.
  AREA-SEGMENT-SIZE and PAGE-SIZE will be ignored if the file already exists.

CLOSE-PHEAP pheap
  Close the persistent heap after flushing any unwritten blocks to disk. Quitting from MCL will close any open persistent heaps, though (after recovery is implemented) it will not abort active transactions (this will be done the next time the persistent heap is opened).

WITH-OPEN-PHEAP (pheap filename &rest options) &body body
  WITH-OPEN-PHEAP is to OPEN-PHEAP & CLOSE-PHEAP as WITH-OPEN-FILE is to OPEN & CLOSE. I.e.,
  execute BODY with PHEAP bound to the result of calling OPEN-PHEAP with the FILENAME and the OPTIONS. Call CLOSE-PHEAP on exiting the dynamic-extent of the WITH-OPEN-PHEAP form, whether normally or abnormally.

ROOT-OBJECT pheap
  Return the root object of the given persistent heap. This will be a PPTR (see below) unless the root is an immediate object (why anyone would save an immediate object as the root I can't imagine).

(SETF ROOT-OBJECT) new-root pheap
  Change the root object of PHEAP to NEW-ROOT.

FLUSH-PHEAP pheap
  Write all dirty pages to disk.

Informational functions
-----------------------
PHEAP-STREAM pheap
  Returns the stream that PHEAP is using for I/O. Writing to this stream will likely corrupt the persistent heap.

PHEAP-PATHNAME pheap
  Returns the pathname for the file in which the persistent heap is stored.

Access paradigms
----------------
Wood will support three access paradigms. The first version will support the intermediate and low level access paradigms. Later versions may add support for real persistent objects.

1. Real persistent objects
==========================
This level provides a persistent-object metaclass. Objects with this metaclass are stored on disk. SLOT-VALUE reads a slot into memory; (SETF SLOT-VALUE) writes a slot onto disk. At this level, the programmer no longer needs to worry about whether an object is persistent or not. The main reason I won't provide this immediately is that our CLOS implementation does not yet support SLOT-VALUE-USING-CLASS.

2. Intermediate level access
============================
At this level, operations are done primarily on in-memory objects, and WOOD is used to transfer objects to and from disk. There is also support for large disk-based tables and arrays. The names of the operators at this level begin with "P-". There is usually a "P-" operator for each "DC-" operator in the low level.

Pointers into the persistent heap are represented by a PPTR instance.
PPTR instances are interned (in a weak hash table), so you will never cons more than one PPTR for a single disk address. Two weak hash tables cache the correspondence between objects in memory and (tagged) addresses in the persistent heap.

These operators try to be "smart" in that they will move objects from memory to disk as necessary. Only P-LOAD conses disk objects in memory. All the other accessors return PPTR instances or immediate Lisp objects.

Objects are copied between memory and disk with the following two functions.

P-STORE pheap object &optional descend
  Store the lisp OBJECT in the given persistent heap. DESCEND controls what to do if an object that is already in the pheap is encountered:

  :DEFAULT   The default. Recursive descent will stop when an object that is already on disk is encountered.
  NIL        Recursive descent will stop as for :DEFAULT, but newly consed objects will not be stored in the cache. This allows storing of stack-consed or reusable objects.
  T          Will do a complete recursive descent of the object, possibly overwriting values on the disk with updated values from memory.

  P-STORE returns a PPTR (unless OBJECT is an immediate object that requires no consing to store on disk, in which case OBJECT will be returned). All disk consing will be done in the current area. In order to prevent saving the entire Lisp heap, P-STORE does not recursively descend symbols, packages, or classes. When P-STORE saves a CLOS instance, it saves only the instance slots; class slots are ignored.

P-LOAD pptr &optional depth
  Load an object from disk into memory. The depth argument controls the depth to which the object will be converted to its in-memory representation:

  :DEFAULT   Recursively descends uvectors & conses, stopping when it finds cached values. This is the default.
  NIL        Look the pointer up in the cache. Do no other conversion.
  :SINGLE    Translates a single level. E.g. will cons an array, but its non-immediate components will be PPTR's.
  <fixnum>   Same as :SINGLE, but will do no conversion unless the length of a vector is less than the depth value. Will also stop converting a list when it becomes this long, leaving a PPTR as the final CDR.
  T          Converts all levels. If an object is encountered that has already been converted, its storage will be overwritten, and possibly changed, by the values on disk.

  When P-LOAD restores a CLOS instance it sets only the instance slots. If there are class slots, they will get the default value (from the DEFCLASS or INITIALIZE-INSTANCE method).

Consers
-------
Values will be P-STORE'd as necessary.

P-MAKE-AREA &key segment-size flags
  Make a new consing area. SEGMENT-SIZE is the default size of a segment. It will be rounded up to a multiple of the block size. FLAGS, a fixnum, is currently unused, though it is stored in the area.

WITH-CONSING-AREA area &body body
  Macro. Executes BODY with the default consing area bound to the given AREA (a PPTR as returned from P-MAKE-AREA).

P-CONS pheap car cdr &optional area

P-LIST pheap &rest elements
  Make a list in the default consing area.

P-LIST-IN-AREA pheap area &rest elements
  Make a list in an explicit area. The elements will be P-STORE'd in the default consing area. If AREA is NIL, the default area will be used.

P-MAKE-LIST pheap size &key initial-element area
  Again, use the default consing area.

P-MAKE-UVECTOR pheap length subtype &key initial-element area
  Cons a uvector of the given length in the persistent heap. All data types except fixnums, symbols, floats, cons cells, characters, & some other internal immediate values are represented as uvectors. The length must be a fixnum. The subtype is one of the $V_xxx values in "WOODEQU.LISP". INITIAL-ELEMENT will default appropriately for vectors that need to be initialized.

P-MAKE-ARRAY pheap dimensions &key area element-type initial-element
  Does not yet support initial-contents, adjustable, fill-pointer, displaced-to, or displaced-index-offset.
  You can, however, save, with P-STORE, a memory array made by MAKE-ARRAY with those keywords.

P-VECTOR pheap &rest elements
  Cons the vector in the default consing area.

Predicates
----------
P-LISTP object
P-CONSP object
P-ATOM object
P-UVECTORP object
P-PACKAGEP object
P-SYMBOLP object
P-STRINGP object
P-SIMPLE-STRING-P object
P-VECTORP object
P-SIMPLE-VECTOR-P object
P-ARRAYP object

Accessors
---------
P-CxR list
(SETF P-CxR) value cons
  These are supported for up to four levels.

P-UVSIZE uvector
P-UVREF uvector index
(SETF P-UVREF) value uvector index
P-SVREF simple-vector index
(SETF P-SVREF) value simple-vector index

P-%SVREF simple-vector index
(SETF P-%SVREF) value simple-vector index
  Do no type or range checking. Use at your own risk.

P-LENGTH list-or-vector
P-AREF array &rest indices
(SETF P-AREF) value array &rest indices
P-ARRAY-RANK array
P-ARRAY-DIMENSIONS array
P-ARRAY-DIMENSION array dimension

Symbols and Packages
--------------------
P-INTERN pheap string &key package area
  Will make a new persistent package if PACKAGE is a package or the name of a package in memory.

P-FIND-SYMBOL pheap string &optional package
  This takes a PHEAP arg as it is common to want to look for a symbol in the persistent heap without P-STORE'ing the string.

P-FIND-PACKAGE pheap package
  PACKAGE can be a package or package name (string or symbol) either in memory or on the disk.

P-MAKE-PACKAGE pheap name &key nicknames
  Currently, persistent heap packages do not support inheritance or external symbols. They are simply a way to associate a string with a symbol. This is because of the problems of maintaining two parallel package hierarchies (most of us have enough problems with one).

P-SYMBOL-NAME symbol
P-SYMBOL-PACKAGE symbol
P-SYMBOL-VALUE symbol
(SETF P-SYMBOL-VALUE) value symbol
P-PACKAGE-NAME package
P-PACKAGE-NICKNAMES package
P-STRING string-or-symbol

BTREEs
------
P-MAKE-BTREE pheap &key area type
  TYPE is currently unused.
P-BTREE-LOOKUP btree key-string &optional default
  The key-string must be a string. Comparison is done with string< & string=, i.e. it is case sensitive.

P-BTREE-STORE btree key-string value
  You can also use SETF with P-BTREE-LOOKUP.

P-BTREE-DELETE btree key-string
  Returns true if there was an entry for KEY-STRING.

P-CLEAR-BTREE btree
  Remove all entries from the BTREE.

P-MAP-BTREE btree function &optional from to
  For each BTREE entry whose key is (if FROM is specified and non-NIL) >= FROM and (if TO is specified and non-NIL) <= TO, calls FUNCTION with two arguments: the key and its value. Deals correctly with insertion or deletion during mapping. The first (key) argument to FUNCTION will be a string allocated with dynamic-extent. Hence, if you wish to store it anywhere for longer than FUNCTION's dynamic-extent you will need to copy it (with, e.g., COPY-SEQ).

P-BTREE-P btree
  Returns true if BTREE is a btree.

Hash tables
-----------
P-MAKE-HASH-TABLE &key test weak
  TEST must be EQ or #'EQ. WEAK must be NIL (the default), :KEY, or :VALUE.

P-GETHASH key hash-table &optional default
(SETF P-GETHASH) value key hash-table
P-REMHASH key hash-table
P-CLRHASH hash-table
P-HASH-TABLE-SIZE hash-table

P-MAPHASH function hash-table
  Deals correctly with insertion or deletion during mapping.

P-HASH-TABLE-P hash-table
  Returns true if hash-table is a hash table (either in-memory or on disk).

CLOS hooks
----------
There are some generic functions that allow you to customize the way that WOOD saves/restores CLOS instances. You can decide to save/restore only a subset of an instance's slots and you can do conversion of the slot values on the way out/in.

WOOD-SLOT-NAMES-VECTOR object
  Returns a vector of the names of the slots to save. The default method returns a vector containing the names of all the instance slots. It is important that multiple calls to this generic function on instances of the same class return the same (EQ) vector.
WOOD-SLOT-VALUE object slot-name
  Called for each slot of an instance when it is being saved to disk. Allows you to convert the slot's value to some other form for being saved in the persistent heap. The default method calls SLOT-VALUE.

(SETF WOOD-SLOT-VALUE) value object slot-name
  Called when an instance is being read into memory. Allows you to reverse the conversion provided by WOOD-SLOT-VALUE. The default method calls (SETF SLOT-VALUE).

Objects that call a function when P-LOAD'ed
-------------------------------------------
Sometimes you need more than just saving the data associated with one of your CLOS or structure instances. Wood has a hook very similar to Common Lisp's MAKE-LOAD-FORM to allow this.

P-MAKE-LOAD-FUNCTION object          [Generic Function]
P-MAKE-LOAD-FUNCTION (object t)      [Primary method]
  Called when a CLOS or structure instance is P-STORE'd to disk. The provided method returns NIL, meaning just save the slots normally. If you write a method, it should return two values:

  1) load-function.args
  2) init-function.args

  Each of these should be a list whose CAR is suitable as a first argument to APPLY and whose CDR is suitable as a last argument to APPLY. init-function.args can also be NIL to mean that there is no initialization beyond consing the object.

    (apply (car load-function.args) (cdr load-function.args))

  should create a possibly empty copy of OBJECT, call it OBJECT-COPY.

    (apply (car init-function.args) OBJECT-COPY (cdr init-function.args))

  should fill in the slots of OBJECT-COPY.
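
To make the two APPLY steps concrete, here is a minimal sketch in plain Common Lisp. It makes no Wood calls at all. DEMO-POINT, MAKE-EMPTY-DEMO-POINT, FILL-DEMO-POINT and RUN-LOAD-FUNCTION are hypothetical names invented for this illustration, and RUN-LOAD-FUNCTION only imitates in memory what the description above says should happen when the object is loaded.

```lisp
;; Plain Common Lisp, no Wood required. Every name defined here is
;; hypothetical and exists only for this illustration.
(defstruct demo-point (x 0) (y 0))

(defun make-empty-demo-point ()
  "Load function: cons a blank DEMO-POINT."
  (make-demo-point))

(defun fill-demo-point (point x y)
  "Init function: gets the new object first, then the saved arguments."
  (setf (demo-point-x point) x
        (demo-point-y point) y)
  point)

(defun run-load-function (load-function.args init-function.args)
  "Imitate the two APPLY steps: build the object, then optionally fill it in."
  (let ((object (apply (car load-function.args) (cdr load-function.args))))
    (when init-function.args
      (apply (car init-function.args) object (cdr init-function.args)))
    object))

;; The two lists a method might return for a point with x = 3 and y = 4.
(defparameter *restored-point*
  (run-load-function '(make-empty-demo-point) '(fill-demo-point 3 4)))
```

A real P-MAKE-LOAD-FUNCTION method would return the two lists as its two values rather than passing them to a helper like RUN-LOAD-FUNCTION.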
The following example does the same thing as the default (but will take up more space on disk):

  (defclass foo ()
    ((x :initarg :x)
     (y :initarg :y)))

  (defmethod p-make-load-function ((object foo))
    (values '(allocate-instance-of-named-class foo)
            `(set-slot-values (x y)
                              (,(slot-value object 'x) ,(slot-value object 'y)))))

  (defun allocate-instance-of-named-class (name)
    (allocate-instance (find-class name)))

  (defun set-slot-values (object slots values)
    (loop
      (unless slots (return))
      (setf (slot-value object (pop slots)) (pop values))))

P-MAKE-LOAD-FUNCTION-OBJECT load-function.args init-function.args
  Allows you to explicitly create a LOAD-FUNCTION disk object. P-LOAD'ing the result of this function will behave as follows:

    (let ((object (apply (car load-function.args)
                         (cdr load-function.args))))
      (when init-function.args
        (apply (car init-function.args)
               object
               (cdr init-function.args)))
      object)

3. Low level access
===================
This level is mostly used as the internals for implementing the intermediate level. It can also be used by speed critical code. Values read or written at this level are either addresses (currently integers) or immediate Lisp objects (fixnums, characters, short floats, and a few internal markers). Each piece of data has an associated flag saying whether it is an address or an immediate. Hence, each reader at this level returns two values and each writer takes an optional "immediate?" flag (for each pointer argument).

It is very easy to corrupt a persistent heap by using the low level accessors. Usually, they are not what you want.

All data is tagged in the low three bits. You will need to know about these tags to use Wood at this level. They are defined in the file "woodequ.lisp".

The names of low-level accessors begin with "DC-" for "DISK-CACHE". DISK-CACHE is the name of the data structure that controls cached access to the bytes in the file. Transaction logging and recovery are done at the DISK-CACHE level.
DC-xxx functions exist for most of the P-xxx functions and a few more.

Questions
---------
Should P-STORE allow the user fine-grained control over the consing area? If so, how should it do this?

Three choices for the representation of (double) floats.

1) 8 bytes anywhere in memory. This is the most space efficient, but makes it impossible for a memory walker to distinguish floats and cons cells. It will NOT cause the garbage collector any problems. This is what the current implementation does.

2) Store as a vector. Requires 16 bytes per float. (Float vectors will still need only eight bytes per entry). This frees up a tag, but I have no use in mind for it.

3) Cons floats in a special area. This requires only 8 bytes per float, but we need to allocate a page full of floats at a time. This allows a memory walker to distinguish floats and conses.

P-INTERN currently creates a new package if a package with that name exists in memory. Is this correct or should P-STORE be responsible for that?

2 choices for the first recovery method. I plan on redo-undo recovery with in-place database updating.

1) Never force the log
   Recovery requires scanning the entire log since the last backup to restore uncommitted modified blocks. The advantage is that no unnecessary I/O is ever done during normal operation.

2) Force the log whenever an uncommitted block is written to disk
   This allows checkpointing and the log only needs to be scanned (and maintained) back to the beginning of the transactions that were active at the last checkpoint. This will reduce log space and recovery time, but will require more disk-head movement during normal operation (I think).

I'm leaning towards option 2 at the moment. Remember that the log for option 1 will tend to be 1.5 to 2 times as big as the persistent heap file (I think).

How important is it to integrate disk-based objects with the Common Lisp type and class system? Does this even make sense?
Do I need to do P-TYPEP & P-TYPE-OF in other than the simple way (which may cons its brains out):

  (defun p-typep (thing type)
    (typep (p-load thing) type))
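
One possible middle ground, offered only as a sketch: answer the common type questions with the P- predicates listed earlier and fall back to P-LOAD for everything else. P-TYPEP-SKETCH is a hypothetical name, not part of Wood. The sketch assumes that Wood is loaded, that each P- predicate mirrors the Common Lisp predicate of the same name, and that the predicates do not cons the object into memory. None of that is confirmed here.

```lisp
;; Sketch only. Assumes Wood is loaded, e.g. (require "WOOD" "ccl:wood;wood").
;; P-TYPEP-SKETCH is a hypothetical name, not part of Wood.
(defun p-typep-sketch (thing type)
  "Like (typep (p-load thing) type), but asks Wood's P- predicates first
for a few common type specifiers instead of loading THING."
  (case type
    ;; Assumed mapping from type specifier to the documented P- predicate.
    (cons    (wood:p-consp thing))
    (list    (wood:p-listp thing))
    (atom    (wood:p-atom thing))
    (symbol  (wood:p-symbolp thing))
    (package (wood:p-packagep thing))
    (string  (wood:p-stringp thing))
    (vector  (wood:p-vectorp thing))
    (array   (wood:p-arrayp thing))
    ;; Any other specifier, compound ones included, takes the simple,
    ;; consing route from the definition above.
    (t (typep (wood:p-load thing) type))))
```

This only trims the consing for the handful of specifiers listed. Class names and compound specifiers still load the whole object, so the question above stays open.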