An Overview of NoSQL
Cheng Lei
Department of Electrical and Computer Engineering
University of Victoria, Canada
rexlei86@uvic.ca
I. Introduction
Relational databases dominated the database market for years and were the default choice for data storage and querying until NoSQL databases arose. The relational database management system (RDBMS) has distinct advantages that have made it ubiquitous among corporations and developers. Because such systems have run for decades on many operating systems, the technology for maintaining, querying, and optimizing them is mature. In addition, many developers have mastered this technology, and there are plentiful resources for newcomers to learn it.
However, the bottlenecks of relational database management systems became apparent as the Internet boomed. The number of Internet users has grown exponentially, and the data access workload has become a burden on servers. Users are distributed worldwide, which requires servers to be deployed worldwide as well so that network delay has little impact. Distributing servers globally in this way improves efficiency and provides a better user experience. Although this architecture has its advantages, it also exposes the limits of relational database management systems, which were not specifically designed for distributed systems; as a result, it becomes difficult to guarantee data integrity. Meanwhile, JSON has come to be used in almost all kinds of Internet communication, and JSON structures have become increasingly complicated, so that more storage space and more tables are required to represent them relationally.
Therefore, a new, well-designed kind of database is called for to meet these requirements, and NoSQL is that kind of database. Because NoSQL databases are designed for this purpose, they store data flexibly: it is unnecessary to create schemas in advance to define the data structures. Moreover, the design for distributed systems makes it easier to deploy the databases globally while maintaining data integrity and consistency. This design also scales to large datasets. Many NoSQL databases are already in use, for instance MongoDB and Redis.
II. Data Models
NoSQL databases use aggregate data models to model their datasets. The aggregate data models consist of the key-value data model, the document data model, and the column-family store model. An aggregate is defined as a collection of related objects treated as a unit; in particular, it is the unit for data manipulation and for the management of consistency. Typically, aggregates are updated with atomic operations, and the application communicates with the data store in terms of aggregates.
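
As a rough illustration of an aggregate, the sketch below models a customer order together with its line items as one unit that is read and written whole. The `Order` and `LineItem` classes, the `save` and `add_item` helpers, and the dictionary standing in for the data store are all assumptions made for this example; they do not correspond to any particular NoSQL product.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class LineItem:
    product: str
    quantity: int


@dataclass(frozen=True)
class Order:
    order_id: str
    customer: str
    items: tuple = field(default_factory=tuple)


# Hypothetical in-memory store: one entry per aggregate.
store = {}


def save(order):
    # The whole aggregate is swapped in with a single assignment,
    # so a reader never sees a half-updated order.
    store[order.order_id] = order


def add_item(order_id, item):
    # Build the new version of the aggregate, then write it back as a unit.
    current = store[order_id]
    save(replace(current, items=current.items + (item,)))


save(Order("o-1", "alice"))
add_item("o-1", LineItem("keyboard", 1))
add_item("o-1", LineItem("mouse", 2))
```

The order and its items are never stored or updated separately here, which is the sense in which the aggregate is the unit of manipulation.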
The key-value data model is strongly aggregate-oriented: a database of this kind is constructed primarily from aggregates, each with a key that is used to get at the data (the value). The aggregate is opaque to the database, which treats it as a big blob of mostly meaningless bits. The model provides access to aggregates through lookup by key.
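
A minimal sketch of this idea follows, assuming a toy in-memory store. The `KeyValueStore` class and the `user:42` key are hypothetical names chosen for illustration. The store accepts only bytes and never inspects them; encoding and decoding the aggregate is left entirely to the application.

```python
import json


class KeyValueStore:
    """Toy key-value store: values are opaque bytes looked up by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        if not isinstance(value, bytes):
            raise TypeError("the store only accepts opaque bytes")
        self._data[key] = value

    def get(self, key):
        return self._data[key]


kv = KeyValueStore()

# The application, not the store, decides how the aggregate is encoded.
profile = {"name": "Alice", "city": "Victoria"}
kv.put("user:42", json.dumps(profile).encode("utf-8"))

# Only the application can interpret the bytes it gets back.
restored = json.loads(kv.get("user:42").decode("utf-8"))
```

Because the store sees only bytes, the only query it can answer is "give me the value for this key".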
The document data model is similar to the key-value data model, but it treats the aggregate as a structured document, and each document has a document ID. The document ID serves as the key used to look up the aggregate. The difference from the key-value data model is that a document database can see the structure of the aggregate, whereas in a key-value database the aggregate is opaque.
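
To make the contrast concrete, the following sketch shows a toy document store that supports lookup by document ID and, because it can see inside each document, also a query on a field. The `DocumentStore` class and its `find` method are assumptions for this example and are not the API of any real document database.

```python
class DocumentStore:
    """Toy document store: documents are dictionaries keyed by document ID."""

    def __init__(self):
        self._docs = {}

    def insert(self, doc_id, document):
        self._docs[doc_id] = dict(document)

    def get(self, doc_id):
        # Lookup by document ID, as in a key-value store.
        return self._docs[doc_id]

    def find(self, field, value):
        # Possible only because the store can see inside each document.
        return sorted(
            doc_id for doc_id, doc in self._docs.items() if doc.get(field) == value
        )


docs = DocumentStore()
docs.insert("u1", {"name": "Alice", "city": "Victoria"})
docs.insert("u2", {"name": "Bob", "city": "Vancouver"})
docs.insert("u3", {"name": "Carol", "city": "Victoria"})

victoria_ids = docs.find("city", "Victoria")
```

A key-value store could not offer `find`, since it has no view of the fields inside the stored value.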
The column-family store concept comes from the structure of Google’s BigTable. This model defines a table as a sparse matrix or, equivalently, a two-level map. Each row is a record with a row key and several column keys. The row key (the first-level key) is often described as the row identifier, while the second-level keys are the column keys, each of which maps to a column value. Different columns can hold values of different types.
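
The two-level map can be sketched as a dictionary of dictionaries. The `ColumnFamilyTable` class and the sample row and column keys below are hypothetical and only illustrate the shape of the data; real column-family stores add column families, timestamps, and on-disk layout that this sketch leaves out.

```python
from collections import defaultdict


class ColumnFamilyTable:
    """Toy two-level map: row key -> column key -> value."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column_key, value):
        self._rows[row_key][column_key] = value

    def get(self, row_key, column_key, default=None):
        # Missing cells are simply absent, which is what makes the matrix sparse.
        return self._rows.get(row_key, {}).get(column_key, default)

    def columns(self, row_key):
        return sorted(self._rows.get(row_key, {}))


table = ColumnFamilyTable()
table.put("user:1", "name", "Alice")
table.put("user:1", "age", 30)
table.put("user:2", "name", "Bob")
table.put("user:2", "last_login", "2014-03-01")
```

Note that the two rows have different sets of columns and that the values have different types, neither of which requires a schema change.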
III. Distribution Models
As mentioned in the Introduction, NoSQL databases work well with distributed systems. Several distribution models have been designed for this purpose: the single-server model, sharding, master-slave replication, peer-to-peer replication, and the combination of sharding with replication.
The single-server model, as the name suggests, deploys only one server to store the datasets, and all data operations happen on that server. This policy is simple and ensures data integrity and consistency, and it avoids conflicts between operations. However, the full load of all data operations falls on the one server, so it cannot keep request delay low when there are hundreds of thousands of requests from all over the world, nor can it fail over. Once the only server goes down, it can no longer provide service. Worse, the data is lost forever if the server’s hard drive fails.
Another distribution model is sharding. As the name suggests, it splits the dataset into parts (shards) and stores the different subsets on different servers. The sharding strategy can be based on location, on data category, or on other rules. Each shard handles reads and writes for its own data. Because data throughput can differ between shards, some shards may be overloaded while others are underused. This policy therefore has its own pros and cons.
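
One possible sharding rule is sketched below. It routes each key to a shard by hashing the key, which is an assumption of this example (one of the "other rules" mentioned above) and not a strategy the text prescribes. The `ShardedStore` class and the shard names are hypothetical.

```python
import hashlib


class ShardedStore:
    """Toy sharded store: each key is owned by exactly one shard."""

    def __init__(self, shard_names):
        self.shard_names = list(shard_names)
        self.shards = {name: {} for name in self.shard_names}

    def shard_for(self, key):
        # A stable hash keeps the key-to-shard mapping the same across runs.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.shard_names)
        return self.shard_names[index]

    def put(self, key, value):
        self.shards[self.shard_for(key)][key] = value

    def get(self, key):
        return self.shards[self.shard_for(key)][key]


cluster = ShardedStore(["shard-a", "shard-b", "shard-c"])
for i in range(100):
    cluster.put(f"user:{i}", {"id": i})

# Number of keys held by each shard; uneven counts show the imbalance.
load = {name: len(data) for name, data in cluster.shards.items()}
```

Inspecting `load` shows how many keys each shard holds. A location-based or category-based rule would replace only the body of `shard_for`.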
Master-slave replication selects one server as the master node and treats the others as slave nodes. Initially, every node holds the full dataset. The master node has authority over the slave nodes: all write requests go to the master, which then propagates the changes to all the slaves. To balance the load across nodes, read requests can be served by either the master or the slaves. Unlike the single-server model, this model can continue to provide service when the master goes down, because a slave node can be promoted to master if the master fails.
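
The write path, read balancing, and failover described above can be sketched as follows. This is a simplified model under stated assumptions: replication is synchronous, failure is simulated with an `alive` flag, and promotion simply picks the first live slave. The `Node` and `MasterSlaveCluster` classes are hypothetical names for this example.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True


class MasterSlaveCluster:
    def __init__(self, names):
        self.nodes = [Node(n) for n in names]
        self.master = self.nodes[0]
        self._next_read = 0

    def write(self, key, value):
        if not self.master.alive:
            raise RuntimeError("master is down; promote a slave first")
        self.master.data[key] = value
        # Propagate synchronously to every live slave (a simplification).
        for node in self.nodes:
            if node is not self.master and node.alive:
                node.data[key] = value

    def read(self, key):
        # Round-robin over live nodes spreads the read load.
        live = [n for n in self.nodes if n.alive]
        node = live[self._next_read % len(live)]
        self._next_read += 1
        return node.data[key]

    def fail_over(self):
        # Promote the first live slave to be the new master.
        self.master = next(
            n for n in self.nodes if n.alive and n is not self.master
        )


cluster = MasterSlaveCluster(["n1", "n2", "n3"])
cluster.write("k", "v1")
cluster.nodes[0].alive = False   # the master fails
cluster.fail_over()
cluster.write("k", "v2")
```

After the failover, writes go to the promoted node and reads are served only by the nodes that are still alive.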
Peer-to-peer replication has a similar structure but no master node. Every node in this model holds the same dataset and has the same authority: each handles both read and write requests and communicates the writes it accepts to the other nodes to keep the dataset up to date on every node.
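
A minimal sketch of this arrangement is given below, assuming that writes are forwarded to the other peers immediately and that no two peers write the same key at the same time. The `Peer` class and the `connect` helper are hypothetical names for this example.

```python
class Peer:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.peers = []

    def write(self, key, value):
        # Apply locally, then forward the write to every other peer.
        self.data[key] = value
        for peer in self.peers:
            peer.apply_replicated(key, value)

    def apply_replicated(self, key, value):
        # Replicated writes are applied without being forwarded again.
        self.data[key] = value

    def read(self, key):
        return self.data[key]


def connect(peers):
    # Give every peer a list of all the others.
    for p in peers:
        p.peers = [q for q in peers if q is not p]


a, b, c = Peer("a"), Peer("b"), Peer("c")
connect([a, b, c])

a.write("x", 1)   # accepted by peer a
c.write("y", 2)   # accepted by peer c
```

Any peer can accept a write and any peer can answer a read. The sketch does not handle two peers writing conflicting values for the same key concurrently, which a real system would have to resolve.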
Combining sharding with master-slave replication draws on the advantages of both. This model uses multiple masters, but each data item has only a single master, so the read and write load is balanced across the nodes and the system can recover when some nodes fail. Combining peer-to-peer replication with sharding is another option: each shard is replicated on three nodes, so that even if one node fails, the data it held is still available and can continue to be served.
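
The combination of sharding with peer-to-peer replication can be sketched as below, using a replication factor of three. The placement rule (hash the key, then take three consecutive nodes) and the `ReplicatedShardedStore` class are assumptions of this example, as is the `down` set used to simulate a failed node.

```python
import hashlib


class ReplicatedShardedStore:
    """Toy store: each key is placed on `replication_factor` nodes."""

    def __init__(self, node_names, replication_factor=3):
        if replication_factor > len(node_names):
            raise ValueError("need at least as many nodes as replicas")
        self.node_names = list(node_names)
        self.rf = replication_factor
        self.nodes = {name: {} for name in self.node_names}
        self.down = set()   # names of nodes that are currently failed

    def replicas_for(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        start = int.from_bytes(digest[:8], "big") % len(self.node_names)
        # The key lives on `rf` consecutive nodes from its hashed position.
        count = len(self.node_names)
        return [self.node_names[(start + i) % count] for i in range(self.rf)]

    def put(self, key, value):
        for name in self.replicas_for(key):
            if name not in self.down:
                self.nodes[name][key] = value

    def get(self, key):
        # Try each replica in turn, skipping failed nodes.
        for name in self.replicas_for(key):
            if name not in self.down and key in self.nodes[name]:
                return self.nodes[name][key]
        raise KeyError(key)


store = ReplicatedShardedStore(["n1", "n2", "n3", "n4", "n5"])
store.put("order:17", {"total": 99})
first_replica = store.replicas_for("order:17")[0]
store.down.add(first_replica)          # one node fails
value_after_failure = store.get("order:17")
```

With five nodes and three replicas per key, each node holds only part of the dataset, yet the read still succeeds after one replica is marked as failed.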