Coherence, A NoSQL Revolution
Data Distribution and Replication in Coherence
Renu Kanwar
M-Tech, Department of CS and IT
Govt. Engineering College, Ajmer
Rajasthan, India.
renu.khangaroth@gmail.com

Prakriti Trivedi
Asst. Prof., Department of CS and IT
Govt. Engineering College, Ajmer
Rajasthan, India.
niyuvidu@rediffmail.com
Abstract: NoSQL, short for "Not only Structured Query Language", is a response to the present problems of data storage. The traditional relational database (RDBMS) has been used to store data for many years, but as the computing world grows, data storage technologies need to grow as well. Today's applications require databases that store and process big data effectively and that sustain large volumes of concurrent reads and writes, as search engines do. NoSQL databases address the speed and scalability issues of RDBMS and handle unstructured data easily. They do not guarantee the ACID properties; instead they use the weaker BASE properties, which focus on eventual consistency. Oracle Coherence is a product that works on NoSQL principles. It divides the data to be stored into two domains, Facts and Dimensions, where Facts are the larger data sets and Dimensions are the smaller data fragments. Coherence distributes the Facts across machines and replicates the Dimensions on each machine, which makes scaling the database straightforward and greatly increases the speed of data access.

Keywords: NoSQL, Data bottleneck, Coherence, Fact, Dimensions, Distribution, Replication, Coherence cluster, Cluster Nodes

I. INTRODUCTION

Motivation

Big data storage is a central requirement of the computer world today, and databases that hold terabytes of data and handle it with ease draw wide attention. The most frequently quoted example of big data storage is the Internet, which can be viewed as the biggest database in the world. The techniques behind managing such a repository are commendable, and most applications with this kind of data traffic have turned towards NoSQL.

Data Bottleneck

Traditional relational databases have long been used for storing data. One major and common problem with an RDBMS arises under heavy data traffic: when multiple clients want to access the same data concurrently, a bottleneck occurs at the storage site, which increases the waiting time for a client to access any data. Database clustering emerged as a solution to this problem: data is stored at multiple sites to avoid the bottleneck. Clustering, however, leads to another problem, distributed locks: whenever an update is in progress at one DB site, the other sites are locked against any change in order to maintain consistency.

Figure 1. Data Clustering solving problem of Data bottleneck

Limitations of RDBMS

Apart from the data bottleneck, RDBMS has other issues, namely speed and scalability. An RDBMS stores data on disk, and when a large amount of data is stored on the disk of a single machine, congestion occurs at the data store and the access time increases. Scaling an RDBMS is also not easy, because increasing processing power and storage space means buying bigger machines, which is costly.

Certain demerits of the database make data access problematic:
a) Disk-based input/output, which is very slow compared with accessing data from RAM.
b) Joins across many tables.
c) Relational versus object mismatch: a set of conceptual and technical difficulties often encountered when an RDBMS is used with an object-oriented programming language.

Apart from this, a plus point of RDBMS is that, because all data sets are accessed from a single machine, network latency is much reduced. But when we consider terabytes of data and need both speed and scalability with low network latency, RDBMS fails to deliver, and we have to look for a better alternative.
II. NOSQL REVOLUTION

Because RDBMS could not meet the challenges of a rapidly growing computer world, people started thinking of better alternatives. This gave rise to the NoSQL movement, which started in 2009 to solve the problems for which RDBMS was not fit. An RDBMS provides a variety of features and the rich semantics of the ACID properties, which are more than particular applications and use cases need, so NoSQL was chosen as an alternative that avoids the unneeded complexity. The strict semantics of an RDBMS increase the instruction overhead on the system, while NoSQL eliminates these overheads. As Figure 2 shows, only 7% of the instructions perform useful work; the rest perform tasks that are not required by every use case.

Instructions (share of total):
Buffer Manager: 35%
Hand-coded optimization: 16%
Locking: 16%
Latching: 14%
Logging: 12%
Useful work: 7%

Figure 2. Database Instructional overheads

Instead of the strict ACID (atomicity, consistency, isolation and durability) semantics, NoSQL uses the weaker BASE (basically available, soft state and eventual consistency) semantics. Basically available means the system is available all the time, without a bottleneck that leads to waiting. Soft state means the system can switch to the next state without completing the previous one. Eventual consistency means consistency is not enforced at each state, but must be reached by the end of the process.

Oracle Coherence

Oracle Coherence is a product of Oracle that works on NoSQL principles. As shown in Figure 3, its architecture consists of a Coherence cluster in the middle, which acts as a cache, with a persistent data store behind it. When a client requests particular data, the request is directed to the node that contains that data. Unlike a data cluster, a Coherence cluster does not replicate data at multiple sites; instead, the data is distributed among the nodes of the cluster. In layman's terms, if the numbers 1 to 100 are to be stored in the cluster, then the first 10 numbers would be on one node, 11 to 20 on another node, and so on, whereas in a data cluster the numbers 1 to 100 are replicated at all the DB sites, which eats up storage space. Coherence thus uses storage space optimally, without wasting memory.

Figure 3. Coherence Cluster

As shown in Fig. 3, whenever a client accesses data, the data is accessed from the cache and later saved in the persistent data store.
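The distribution of the numbers 1 to 100 across nodes, and the cache-first access with later persistence, can be illustrated with a small Python sketch. It is not Coherence's API: the names `owner_of`, `PartitionedCache`, `pending` and `backing_store` are hypothetical, the cluster size of 10 nodes is an assumption taken from the 1 to 100 example, and a plain dictionary stands in for the persistent data store.

```python
from collections import deque

NODE_COUNT = 10      # assumed cluster size for the 1..100 example
KEYS_PER_NODE = 10   # keys 1-10 on node 0, 11-20 on node 1, and so on


def owner_of(key: int) -> int:
    """Return the index of the single node that owns `key`."""
    return (key - 1) // KEYS_PER_NODE


class PartitionedCache:
    """Toy in-memory cache in which every key lives on exactly one node."""

    def __init__(self) -> None:
        self.nodes = [dict() for _ in range(NODE_COUNT)]
        self.pending = deque()    # writes that have not been persisted yet
        self.backing_store = {}   # stands in for the persistent data store

    def put(self, key: int, value: str) -> None:
        self.nodes[owner_of(key)][key] = value   # stored in RAM on the owner
        self.pending.append((key, value))        # persisted later by flush()

    def get(self, key: int) -> str:
        return self.nodes[owner_of(key)][key]    # read from the owning node

    def flush(self) -> int:
        """Persist all queued writes; return how many were written."""
        written = 0
        while self.pending:
            key, value = self.pending.popleft()
            self.backing_store[key] = value
            written += 1
        return written


cache = PartitionedCache()
for n in range(1, 101):
    cache.put(n, f"value-{n}")

distributed_copies = sum(len(node) for node in cache.nodes)  # one copy per key
replicated_copies = 100 * NODE_COUNT                         # every key on every site
```

With these assumptions the distributed layout holds 100 entries in total, while full replication across 10 sites would hold 1000. Calling `cache.flush()` at a quiet moment plays the role of the deferred save to the persistent store.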
Important Features of Coherence which make it stand out
1. Speed: The middle layer of Fig. 3 is the Coherence cluster, where the data is stored. This data repository lies in memory, i.e. RAM, so data is accessed from RAM, which is much faster than disk access. Because RAM is volatile and data could be lost on a power failure, the data is also stored asynchronously in the persistent data store. This is a slow process and completes in idle hours, when data traffic is not heavy.
2. Scalability: NoSQL systems scale horizontally, so the system can be scaled with ease. Whenever a Coherence system requires more capacity, it can scale out just by adding a new machine as a new node of the Coherence cluster, and the system becomes more robust as new nodes are added.
Figure 4. Scalability Performance

3. Fault Tolerance: Because the data is stored in RAM, it is very important to have an excellent backup and failover system. A key feature of Coherence is its backup structure, in which the backup of each node is saved on another node of the cluster. Say there are 5 nodes numbered 1 to 5: the backup of 1 would be held on 4, 2 on 5, 3 on 1, and so on, which makes the system robust and fault tolerant.

Thus it can be said that, with these features, Coherence has proved to be a solution for various applications and use cases where RDBMS is not an appropriate choice.

III. COMPARISON AND RESULTS

When we compare the two, i.e. RDBMS and NoSQL (Coherence), we come across many important facts. Take the data sets of an investment bank where trading is done: trades involve terabytes of data, and an RDBMS cannot store such big data sets without bigger machines, whereas Coherence performs this task with ease because it scales by using multiple machines as nodes of the Coherence cluster.

While comparing the two, we can draw some useful results and conclude that the structure formed by Coherence is flexible and that data can be accommodated in it easily without further overhead. When we compare the two with regard to the response time spent searching and viewing data, we get the result shown in Figure 7, which plots the response time of the database and of Coherence.

In the Coherence cluster, trade data, which are the larger data sets, are distributed within the cluster, i.e. among the various machines performing different trades. Very small data values, which use very little space, are replicated on each node, so that no cross-node joins are needed and the values used by any trade data are held alongside it, decreasing the response time for any application.
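The placement described in this paper for an investment bank, with large trade records distributed and small values replicated on every node, can be sketched in Python as follows. This is an illustration under stated assumptions and not Coherence's API: the `Node` class, `build_cluster`, the sample trades and the currency table are all hypothetical, and trades are assigned to nodes by a simple modulo on the trade id.

```python
# Hypothetical reference data ("dimensions"): small, so every node holds a full copy.
CURRENCIES = {"USD": "US Dollar", "INR": "Indian Rupee"}

# Hypothetical trade records ("facts"): large in practice, so each lives on one node.
TRADES = [
    {"id": 1, "currency": "USD", "amount": 1000},
    {"id": 2, "currency": "INR", "amount": 2500},
    {"id": 3, "currency": "USD", "amount": 400},
    {"id": 4, "currency": "INR", "amount": 90},
]


class Node:
    def __init__(self, dimensions: dict) -> None:
        self.trades = {}                     # this node's share of the facts
        self.dimensions = dict(dimensions)   # private replica of the dimensions

    def describe(self, trade_id: int) -> str:
        """Resolve a trade's currency name using only data held on this node."""
        trade = self.trades[trade_id]
        return f"{trade['amount']} {self.dimensions[trade['currency']]}"


def build_cluster(node_count: int) -> list:
    """Replicate the dimensions on every node and spread the facts across nodes."""
    nodes = [Node(CURRENCIES) for _ in range(node_count)]
    for trade in TRADES:
        nodes[trade["id"] % node_count].trades[trade["id"]] = trade
    return nodes


cluster = build_cluster(2)
owner = cluster[3 % 2]        # the node that holds trade 3
summary = owner.describe(3)   # answered without contacting any other node
```

Because each node carries its own copy of the small table, the lookup inside `describe` never leaves the node that owns the trade, which is the effect the paper attributes to replicating the small values.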
Figure 7. Response graph of DB and Coherence (y-axis: response time in seconds, 0 to 3; x-axis: 1 to 10; series: RDBMS and Coherence)
Figure 7 shows the response time of the two technologies under discussion when searching for data in the database. Initially, when there is only a handful of data, the response time of the DB is less than that of Coherence, because the infrastructure required for Coherence is quite large and accounts for most of the time spent on an action. As the data grows, however, the time taken by Coherence becomes much less than that of the DB, because Coherence works as a cache and data access from RAM is much faster than access to data stored on disk.
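The crossover described above, where the DB is faster for small data and Coherence is faster as the data grows, can be illustrated with a toy model. The linear form (a fixed overhead plus a cost per record) and every number below are assumptions made for illustration only; they are not measurements from this paper, and the function names are hypothetical.

```python
def response_time_ms(fixed_overhead_ms: float, per_record_ms: float, records: int) -> float:
    """Linear toy model: a fixed setup cost plus a cost per record searched."""
    return fixed_overhead_ms + per_record_ms * records


def crossover_records(db_fixed: float, db_per_record: float,
                      cache_fixed: float, cache_per_record: float) -> float:
    """Data size at which both systems take the same time under the linear model."""
    return (cache_fixed - db_fixed) / (db_per_record - cache_per_record)


# Hypothetical parameters, chosen only to show the shape of the curves.
DB = {"fixed": 100.0, "per_record": 2.0}      # low overhead, slow disk access
CACHE = {"fixed": 500.0, "per_record": 0.5}   # high overhead, fast RAM access

n_star = crossover_records(DB["fixed"], DB["per_record"],
                           CACHE["fixed"], CACHE["per_record"])
```

Below `n_star` records the modelled DB responds sooner; above it the modelled cache does. The actual crossover point depends on the real overheads and access costs of a given deployment.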
Figure 5. RDBMS schema for an Investment bank

An RDBMS schema for such a problem would look like Figure 5, with many joins and relationships, and performance would be hampered as the data grows, while the Coherence cluster would look like Figure 6.

Figure 6. Coherence cluster for an investment bank

Storing large amounts of data means that we need to scale our data storage as the data grows, and we need provisions to store and process that data. Scaling an RDBMS is not easy, because it works on a single machine and purchasing a bigger machine costs a lot of money. For example, a machine with 4 GB of RAM would cost about 50,000 INR and a machine with 40 GB of RAM would cost on the order of crores, whereas joining 10 machines of 4 GB each (known as horizontal scaling) would cost around 5 lakh INR, which is much less. Scaling horizontally is therefore easy and cost effective.
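The cost comparison above can be checked with a few lines of Python arithmetic. The 50,000 INR price and the 10-machine configuration come from the text. The text only says the 40 GB machine costs "in crore", so a lower bound of 1 crore is assumed here and is not a quoted price.

```python
LAKH = 100_000       # Indian numbering: 1 lakh = 10^5
CRORE = 10_000_000   # 1 crore = 10^7 = 100 lakh

SMALL_MACHINE_COST_INR = 50_000   # 4 GB machine, figure quoted in the text
SMALL_MACHINE_RAM_GB = 4
MACHINE_COUNT = 10

horizontal_cost_inr = MACHINE_COUNT * SMALL_MACHINE_COST_INR
horizontal_ram_gb = MACHINE_COUNT * SMALL_MACHINE_RAM_GB

# Assumed lower bound for the single 40 GB machine ("in crore" in the text).
big_machine_cost_lower_bound_inr = 1 * CRORE

print(horizontal_cost_inr / LAKH)                               # 5.0 (lakh INR)
print(horizontal_ram_gb)                                        # 40 (GB)
print(big_machine_cost_lower_bound_inr / horizontal_cost_inr)   # 20.0
```

Under that assumption, ten small machines provide the same 40 GB of RAM for 5 lakh INR, at least 20 times less than the single large machine.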
Figure 8. Infrastructural cost for DB and Coherence (chart title: Cost of Machine; y-axis: cost in $K, 0 to 400; x-axis: 1 to 6; series: RDBMS and Coherence)

Coherence requires some infrastructural cost, which is higher than that of a traditional database. For applications that process small amounts of data, RDBMS therefore wins, because the infrastructural cost of Coherence is much higher when the data is small. As the size of the data grows and we need to increase the storage area and scale up, Coherence takes the advantage because of its flexibility to add any number of machines, which reduces the overall cost of the system.

When we compare the two in terms of the CPU load on the stored data, we also get some important results.

Figure 9. Load over CPU by DB and Coherence (chart title: App Performance-Impact on DB, CPU load on DB; y-axis: CPU %, 0 to 100; x-axis: 1 to 10; series: RDBMS and Coherence; annotation: 40% reduce)

From Figure 9 we can conclude that the load on the CPU is reduced by 40% when working with Coherence, which is commendable. Overall, the traditional RDBMS is good and should be used as long as the data is small and easily handled. As the size of the data grows, however, other factors count against it: traditional systems cannot meet the demands of current users, and it is time for upcoming technologies to do so.

CONCLUSIONS

RDBMS is limiting for many use cases because today's requirements are speed and scalability. Many newer databases are therefore in use: applications are shifting towards in-memory databases, which speed up data access, and people are moving towards databases that fit their use case.

NoSQL is an appropriate solution for those use cases where ACID is not considered that important and people are ready to trade ACID for speed and scalability.

It can thus be concluded that the traditional database is aging as times change. NoSQL is set to play a leading role in the computer world, and big data storage will be the future. Various NoSQL solutions are available in the market today, and the term is popular among Web 2.0 leaders: big names like Facebook, Google, Digg, Amazon, LinkedIn and Twitter all use NoSQL in one form or another for different types of applications and use cases. Apart from these, many multinational companies are in the race, launching NoSQL solutions of their own; Coherence, developed by Oracle, is one example. The most important fact to keep in mind, however, is that SQL and NoSQL will coexist and each will have its place.

ACKNOWLEDGMENT

I extend my thanks to Mrs Prakriti Trivedi for guiding me throughout my research work; her guidance, teaching and suggestions motivated me in my work.