Document Storage Formats – An Introduction
Document storage databases are all the buzz these days. Interestingly enough, they are not a very
recent invention: 20 years ago there were already object-oriented databases using the very same
concepts.
In this four-part mini blog series I will take a bird’s-eye view of the connections and relationships
between object-oriented databases, object stores, serialization and document storage. I will present
and explain the most common document storage formats and try to find the reason why object stores
are all the buzz today but were not 20 years ago.
Episode 1 - Introduction and definitions
Document Storage: A document, in the context of software development, is plain and simple a computer
data file or data set. Depending on its purpose, the file or set may hold different content. Often,
when the content of the document requires a predictable structure, metadata is introduced to define
how the document’s data is formatted. In that case the document’s data is considered structured,
otherwise unstructured. An example of unstructured data is a Word document; an example of structured
data is an XML document.
An application has different options as to where to store its documents’ data. Most commonly,
applications employ a combination of disk storage and memory caching. Obviously, the choice of
storage will impact the performance and scalability of your application. More recently there are
also options to store the data in the cloud or on solid state disk (SSD).
Object Storage: Within the application, all object instances live in random access memory (unless
they are memory mapped to, e.g., your swap file, which for now we assume they are not). If the power
goes down, your random access memory loses all its data and therefore all of the application’s
object instance data (amongst other data). If you want to hold on to that data you have to persist
it; that is where Object Storage comes into play, as a place to persist your application’s object
instance data.
Most software and web development frameworks (for example .NET) provide ready-to-use boilerplate
source code or complete implementations of interfaces for your custom classes, which enable you to
easily persist their instance data to a document of a particular structure. Such code automatically
takes care of finding your instance data and converting it to the appropriate format (e.g. for
date/time data types). Some of them even take enumerations, complex data types, object hierarchies
and arrays into consideration. Implementations for a binary format and perhaps XML are common, but
online you can find boilerplate source code to persist your objects to pretty much any format you
have ever heard of.
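To make the idea concrete, here is a minimal sketch in Python (my choice of language for
illustration; the post itself only mentions .NET). The Order class, the Status enumeration and the
to_json/from_json helpers are hypothetical names invented for this example. It shows the kind of
conversion work such boilerplate does for date/time values and enumerations, which JSON has no
native representation for.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime
from enum import Enum


class Status(Enum):
    ACTIVE = "active"
    CLOSED = "closed"


@dataclass
class Order:
    # Hypothetical class used only for this illustration.
    order_id: int
    created: datetime
    status: Status
    items: list = field(default_factory=list)


def to_json(order: Order) -> str:
    """Serialize an Order, converting the types JSON cannot represent directly."""
    def convert(value):
        if isinstance(value, datetime):
            return value.isoformat()      # date/time becomes an ISO 8601 string
        if isinstance(value, Enum):
            return value.value            # enumeration becomes its underlying value
        raise TypeError(f"cannot serialize {type(value).__name__}")
    return json.dumps(asdict(order), default=convert)


def from_json(text: str) -> Order:
    """Rebuild an Order, reversing the conversions done in to_json."""
    raw = json.loads(text)
    return Order(
        order_id=raw["order_id"],
        created=datetime.fromisoformat(raw["created"]),
        status=Status(raw["status"]),
        items=raw["items"],
    )


original = Order(1, datetime(2012, 5, 1, 12, 30), Status.ACTIVE, ["a", "b"])
restored = from_json(to_json(original))
```

Note that the reading side has to know which fields hold dates and enumerations; the JSON text
itself does not say so. That is exactly the bookkeeping a framework's ready-made implementation
takes off your hands.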
Some challenges for Object Storage are dealing with object versioning and object inheritance.
Handling object references rather than instances can also be tricky. Don’t assume that those
“advanced features” are implemented in every development framework. Always verify before using and
benchmark after coding; default implementations are often not the best performing ones.
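The reference problem is easy to demonstrate. The following sketch uses Python's standard pickle
and json modules, with plain dictionaries standing in for object instances (an assumption made to
keep the example short). Two "person" records share one "address" object; one format keeps that
sharing across a round trip, the other silently duplicates the data.

```python
import json
import pickle


def roundtrip_pickle(data):
    """Serialize and deserialize with pickle, which tracks object identity."""
    return pickle.loads(pickle.dumps(data))


def roundtrip_json(data):
    """Serialize and deserialize with JSON, which only knows values, not references."""
    return json.loads(json.dumps(data))


shared_address = {"city": "Berlin"}
people = [
    {"name": "Ann", "address": shared_address},
    {"name": "Bob", "address": shared_address},  # same object, not a copy
]

via_pickle = roundtrip_pickle(people)
via_json = roundtrip_json(people)

# True: both entries still point at one shared address object.
pickle_keeps_reference = via_pickle[0]["address"] is via_pickle[1]["address"]

# False: each entry now owns an independent copy of the address.
json_keeps_reference = via_json[0]["address"] is via_json[1]["address"]
```

After the JSON round trip, changing Ann's city no longer changes Bob's. Whether that matters
depends on your application, which is why it pays to verify what your framework does.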
Serialization: According to Wikipedia “serialization is the process of converting a data structure or object
state into a format that can be stored (for example, in a file or memory buffer, or transmitted across a
network connection link) and "resurrected" later in the same or another computer environment.”
(Source: http://en.wikipedia.org/wiki/Serialization)
Most commonly, serialization refers to what you have to do when you want to persist an object
instance in your application onto a storage medium other than memory. The motivation for
serialization is that, in contrast to random access memory, the other storage types (like disk,
cloud or SSD) do NOT provide a fast mechanism to access any byte, anywhere, at any time. Performance
suffers heavily when trying to read, e.g., from a hard drive using random access. To make up for
some of the natural performance inferiority of, e.g., hard drive storage, it is better to “stream”
the data to and from the storage device in a sequential access fashion. At the end of a
serialization you usually receive a structured document, which is suitable for “streaming” to its
storage destination, the document storage.
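As a rough illustration of sequential "streaming", the sketch below writes records one after the
other as length-prefixed JSON and reads them back strictly front to back. The framing (a 4-byte
length followed by the payload) and the function names are assumptions chosen for this example, not
a standard; an in-memory buffer stands in for the file or network connection.

```python
import io
import json
import struct


def write_records(stream, records):
    """Append each record as a 4-byte big-endian length followed by UTF-8 JSON."""
    for record in records:
        payload = json.dumps(record).encode("utf-8")
        stream.write(struct.pack(">I", len(payload)))
        stream.write(payload)


def read_records(stream):
    """Yield records in the order written, never seeking backwards."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return  # end of stream
        (length,) = struct.unpack(">I", header)
        yield json.loads(stream.read(length).decode("utf-8"))


buffer = io.BytesIO()  # stands in for a file on disk or a socket
write_records(buffer, [{"id": 1}, {"id": 2, "tags": ["a", "b"]}])
buffer.seek(0)         # rewind once, then read sequentially
restored = list(read_records(buffer))
```

The reader never needs to jump around in the stream, which is the access pattern that disks and
network connections handle best.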
Serialization obviously comes at a price. Depending on which document storage format you choose, you
will observe a different impact on CPU utilization and transmission bandwidth. Serialization tends
to “bloat” the amount of data you have to transmit and store.
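The "bloat" is easy to measure. The sketch below encodes one small, made-up sensor reading three
ways using only the Python standard library: a fixed binary layout, JSON and XML. The record and
its field names are hypothetical, and the exact numbers will differ for your own data; the point is
only the relative order of the sizes.

```python
import json
import struct
import xml.etree.ElementTree as ET

# Hypothetical record used only for this comparison.
reading = {"sensor_id": 42, "temperature": 21.5, "ok": True}

# Fixed binary layout: unsigned 32-bit int, 64-bit float, 1-byte bool
# (little-endian, no padding), 13 bytes in total.
as_binary = struct.pack(
    "<Id?", reading["sensor_id"], reading["temperature"], reading["ok"]
)

# JSON repeats every field name as text in every record.
as_json = json.dumps(reading).encode("utf-8")

# XML repeats every field name twice (opening and closing tag).
root = ET.Element("reading")
for key, value in reading.items():
    ET.SubElement(root, key).text = str(value)
as_xml = ET.tostring(root, encoding="utf-8")

sizes = {"binary": len(as_binary), "json": len(as_json), "xml": len(as_xml)}
```

For this record the binary form is the smallest and XML the largest. The text formats pay for being
self-describing: the binary bytes are meaningless unless the reader already knows the layout.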
Sometimes memory mapped disk storage can be an alternative to serialization but, I believe, it is
not widely used. As SSDs become cheaper and faster there may be a day when memory mapping becomes a
viable alternative to serialization (I have pitched this idea to FusionIO but they didn’t seem
impressed).
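For readers who have not used memory mapping, here is a small sketch with Python's standard mmap
module. The fixed-size record layout and the demo function are assumptions made for the example. A
file is mapped into memory and one record is updated in place by offset, without reading or
rewriting the rest of the file.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-size record: unsigned 32-bit id plus 64-bit float (12 bytes).
RECORD = struct.Struct("<Id")


def demo():
    path = os.path.join(tempfile.mkdtemp(), "records.bin")

    # Reserve room for three zero-filled records.
    with open(path, "wb") as f:
        f.write(b"\x00" * RECORD.size * 3)

    # Map the file and update the third record in place,
    # addressing it by offset as if the file were an array in memory.
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as view:
            RECORD.pack_into(view, 2 * RECORD.size, 7, 3.25)
            view.flush()

    # Read the record back through ordinary file I/O to show it reached the file.
    with open(path, "rb") as f:
        f.seek(2 * RECORD.size)
        return RECORD.unpack(f.read(RECORD.size))


result = demo()
```

The values are still packed into a byte layout here, so this does not remove encoding altogether;
what it avoids is streaming the whole document in and out for every change.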
Common Document Storage Formats: As mentioned before, there are many different document formats for
structured document storage. The most famous are probably XML and Binary. With MongoDB becoming a
“household document store”, JSON and BSON are also becoming more widely known (yes, I know, the Java
programmers out there will disagree with me that it took MongoDB to make JSON famous). A more exotic
one is Protocol Buffers, a very compact, binary, structured document storage format introduced and
used by Google.
Preview of next week’s Episode 2: In my next blog episode I will look closer at the XML and Binary
document formats. Also I will start to look at different aspects of storage formats in general which have
an impact on performance and scalability and provide reasons as to why one format might be more
suitable for a certain use case than another.
Resources and references: If you are interested in more information, please visit the following
links:
http://www.json.org/
http://bsonspec.org/
http://code.google.com/p/protobuf/
http://www.w3.org/XML/
http://msdn.microsoft.com/en-us/library/72hyey7b(v=vs.71).aspx
http://en.wikipedia.org/wiki/Serialization
http://www.cs.cornell.edu/info/people/chichao/ccc-ch5.pdf
http://www.teamjohnston.net/blogs/jesse/post/2007/04/08/Serialization-Problems-and-Solutions.aspx
http://www.boost.org/doc/libs/1_35_0/libs/serialization/doc/special.html
http://java.dzone.com/articles/object-serialization-evil
http://www.versant.com/pdf/wp_vsnt_serialization.pdf (“consider the source” on this one!)