IEEE Transactions on Magnetics

A Comparative Analysis on Location Based and Graph Based Distributed Database Partitioning
Tin Myint Naing¹, Aung Win²
¹Ph.D Student of UT(YCC), PyinOoLwin, Myanmar, utinmyintnaing08@gmail.com
²Principal, UT(YCC), PyinOoLwin, Myanmar

Abstract -- Distributed database partitioning is one of the challenges in the distributed applications area. Research has been carried out in recent years to find an optimal solution. In this paper, we compare two approaches to distributed database partitioning. The first is based on location and the second on graph partitioning. In the location based strategy, the client’s source IP address is the core factor used to divide the distributed database. In graph based partitioning, distributed transactions and their affected nodes are the main players. The experimental results show that the graph based approach is better in some factors, while the location based approach has the advantage in others.
Index Terms -- Partitioning; Distributed database; graph partitioning.
I. INTRODUCTION
Scalability, reliability and availability are important, related issues in implementing distributed database applications. It is important to handle the overhead problem when the system is trying to scale out; otherwise, scaling out only produces a performance penalty for the system. There are various approaches to implementing distributed database applications. Some applications emphasize reliability and some are oriented to availability [6].
The location based strategy is oriented to availability, as the affected nodes are placed in the nearest database server, while the graph based approach is suited to scalability in the same application area. The primary way of increasing database scalability in a distributed environment is horizontal partitioning of the distributed database. By placing partitions on different nodes, it is often possible to achieve nearly linear speedup, especially for analytical queries where each node can scan its partitions in parallel. Besides improving scalability, partitioning can also improve availability within the environment [9]. Workload balancing in a distributed application environment is crucial, and an efficient partitioning scheme can help to accomplish the desired goals. Distributed transactions are expensive, and thus it is necessary to obtain optimal partitions for good performance from distributed OLTP databases [1].
Graph partitioning is an important problem that has extensive applications in many areas, including scientific computing, VLSI design, task scheduling, geographical information systems, and operations research. The problem is to partition the vertices of a graph into p roughly equal parts such that the number of edges connecting vertices in different parts is minimized. For the graph based approach, the multilevel k-way graph partitioning scheme is the key [3]. The k-way partitioning problem is most frequently solved by recursive bisection. That is, we first obtain a 2-way partitioning of the vertex set V, and then we recursively obtain a 2-way partitioning of each resulting partition. After log k phases, graph G is partitioned into k partitions [2]. Distributed transactions can realistically be converted to a graph in which the tuples affected by a distributed transaction become vertices, and an edge connects two vertices if they are in the same transaction. Thus, graph partitioning approaches are among the best existing techniques that can solve the problem of distributed
transactions and so support scalability in a distributed environment [7].
II. DISTRIBUTED DATABASE PARTITIONING
A. LOCATION BASED APPROACH
Location based partitioning for distributed databases is in fact horizontal partitioning, in which some of the rows of a table (or relation) are put into a base relation at one site, and other rows are put into a base relation at another site. More generally, the rows of a relation are distributed to many sites. The basic concept of location based horizontal partitioning is that data are stored close to where they are used and separate from other data used by other users or applications [4]. Moreover, data can be stored to optimize performance for local access. The most important point in this scheme is to detect the pattern of transactions and make the necessary movement of nodes from one database server to another [5]. The overall procedure for the location based partitioning scheme is shown in Figure 1.
Figure 1. Overall Procedure of Location Based Partitioning (blocks: Client’s requests; Query Router; Obtaining and Analysis of Queries; Classification and Clustering Nodes; Migrating Nodes; DB; DB)
The phase of obtaining and analysing queries is the main one in this approach, as it is where the transaction access pattern can be detected and the affected nodes analysed. Firstly, each transaction must be logged in a particular format for detection and analysis. The transactions have to be recorded together with the client’s address, that is, its location. Then the affected nodes are found for each transaction, and the number of accesses for a certain transaction is computed for later clustering. The access pattern for a particular affected node then emerges from analysis of the log file. Some nodes are accessed from a single location and some are accessed from multiple locations.
Clustering a node which is accessed from a single location is quite easy, but a node which is accessed from multiple locations is harder to assign to a certain partition. Although a clustering algorithm could be used for this, frequency based selection is used in this approach for simplicity. After clustering, the affected nodes are moved to their respective partitions, that is, database servers. The necessary performance evaluation can then be conducted.
B. GRAPH BASED APPROACH
The graph based approach uses transactions and their affected nodes as its main source. Tuples stand for graph vertices and transactions become edges in this framework. It is reasonable to represent the transactions as edges in the graph; these edges are then cut to obtain balanced partitions that minimize the weight of the cut edges, which approximately minimizes the number of multi-sited transactions. The overall procedure is shown in Figure 2.
Figure 2. Overall procedure of Graph Based Partitioning (blocks: Client’s requests; Query Router; Obtaining and Analysis of Queries; Graph Representation; Partitioning Graph; Replacement; DB; DB)
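For comparison with the graph model, the frequency based selection of the location based approach (Section II.A) can be sketched in a few lines. This is only a minimal sketch under stated assumptions: the log format, the `site_a`/`site_b` location labels, the function names and the tie-break rule are all hypothetical, since the text does not specify them. Each node is simply placed at the location that accesses it most often.

```python
from collections import Counter, defaultdict

# Hypothetical log records: (client_location, transaction_id, affected_nodes).
# The location stands in for the client's source IP address.
LOG = [
    ("site_a", "t1", [11, 12]),
    ("site_a", "t2", [12, 13]),
    ("site_b", "t3", [13, 14]),
    ("site_b", "t4", [13]),
    ("site_a", "t5", [11]),
]

def access_pattern(log):
    """Count how often each node is accessed from each location."""
    counts = defaultdict(Counter)
    for location, _txn, nodes in log:
        for node in nodes:
            counts[node][location] += 1
    return counts

def assign_by_frequency(counts):
    """Place each node at the location that accesses it most often.

    Ties are broken by location name so the result is deterministic;
    the tie-break rule is an assumption of this sketch.
    """
    placement = {}
    for node, per_location in counts.items():
        placement[node] = min(
            per_location, key=lambda loc: (-per_location[loc], loc)
        )
    return placement

placement = assign_by_frequency(access_pattern(LOG))
```

In this made-up log, node 13 is accessed once from `site_a` and twice from `site_b`, so it is placed at `site_b`, while nodes 11 and 12 stay at `site_a`.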
To use the graph representation model, it is necessary to collect query traces from multiple clients. The collected traces are then analysed to extract the transaction IDs and their affected nodes, which are the main input for the graph representation. Some heuristic procedures have to be carried out in order to reduce the graph size and to support the partitioning process. The first is transaction-level sampling, which limits the size of the workload trace represented in the graph and so reduces the number of edges. The next is tuple-level sampling, which reduces the number of nodes (tuples) in the graph. Lastly, filtering discards occasional statements that scan a large portion of a table, as they would produce many edges carrying little information, and also removes rarely accessed tuples from the graph [7].
Figure 4. A partitioned graph for four transactions
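The graph-reduction heuristics described above can be illustrated with a short sketch. The function name, the parameters (`txn_rate`, `tuple_rate`, `max_scan`, `min_access`) and the order in which the steps are applied are assumptions made for illustration; the text does not give concrete thresholds or an ordering.

```python
import random

def reduce_trace(trace, txn_rate, tuple_rate, max_scan, min_access, seed=0):
    """Apply the heuristics to a list of (txn_id, tuples) pairs.

    All thresholds are hypothetical knobs chosen for this sketch.
    """
    rng = random.Random(seed)
    # Filtering: drop statements that touch a large portion of a table.
    trace = [(t, tup) for t, tup in trace if len(tup) <= max_scan]
    # Transaction-level sampling: keep each transaction with
    # probability txn_rate.
    trace = [(t, tup) for t, tup in trace if rng.random() < txn_rate]
    # Tuple-level sampling: keep each distinct tuple with
    # probability tuple_rate.
    all_tuples = sorted({x for _, tup in trace for x in tup})
    kept = {x for x in all_tuples if rng.random() < tuple_rate}
    # Filtering: drop tuples accessed fewer than min_access times.
    freq = {}
    for _, tup in trace:
        for x in tup:
            freq[x] = freq.get(x, 0) + 1
    kept = {x for x in kept if freq[x] >= min_access}
    # Rewrite the trace with only the kept tuples; drop empty transactions.
    reduced = [(t, [x for x in tup if x in kept]) for t, tup in trace]
    return [(t, tup) for t, tup in reduced if tup]
```

With both sampling rates set to 1.0 the function only filters, which makes its effect easy to check by hand on a small trace.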
In the graph representation, each tuple is represented as a node and an edge connects two tuples that are used in the same transaction. Edge weights account for the number of transactions that co-access a pair of tuples. The weights are used by the graph partitioning algorithm, which tries to obtain balanced partitions. The weight of a partition is computed as the sum of the weights of the nodes assigned to that partition. A simplified version of the stock table of the TPC-C database is used to explain the graph representation. The first transaction affects nodes 13 and 23, while the second covers nodes 14 and 24. The third affects nodes 11, 12 and 22. The fourth covers nodes 22, 23 and 24. The corresponding graph is shown in Figure 3.
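The construction of this graph from the four example transactions can be sketched as follows. The function name and the dictionary layout are assumptions for illustration. The sketch connects every pair of tuples that appear in the same transaction and increments the edge weight once per co-accessing transaction.

```python
from collections import Counter
from itertools import combinations

# The four transactions from the text: transaction id -> affected nodes.
TRANSACTIONS = {
    1: [13, 23],
    2: [14, 24],
    3: [11, 12, 22],
    4: [22, 23, 24],
}

def build_coaccess_graph(transactions):
    """Return {(u, v): weight} with u < v.

    The weight of an edge is the number of transactions that
    access both of its end points.
    """
    edges = Counter()
    for nodes in transactions.values():
        for u, v in combinations(sorted(set(nodes)), 2):
            edges[(u, v)] += 1
    return edges

graph = build_coaccess_graph(TRANSACTIONS)
```

For this workload the graph has seven nodes and eight edges, each of weight 1, because no pair of tuples is shared by more than one transaction.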
After creating the graph representation for the transactions, graph partitioning is carried out using METIS, a widely used graph partitioning package. METIS uses multilevel k-way graph partitioning for large graphs in order to obtain a balanced minimum-cut partitioning of the graph into k partitions. Graph partitioning splits the graph into k non-overlapping partitions such that the overall cost of the cut edges is minimized while the weight of each partition is kept within a constant factor of perfect balance [2]. This graph operation therefore approximately minimizes the number of distributed transactions while balancing the load evenly across nodes. A sample graph partitioning for the four-transaction workload trace described above is shown in Figure 4.
When the graph has been partitioned into several clusters, the affected nodes assigned to a particular cluster are moved to the destination database server. Once migration to the database servers is finished, the evaluation can be conducted to test the performance.
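To make the objective concrete, the sketch below finds a balanced 2-way split of the four-transaction example by exhaustive search and counts the transactions that become distributed. It is not the multilevel algorithm used by METIS: exhaustive search is substituted here only because the example graph is tiny, and the function names are hypothetical. The split it finds is not necessarily the one drawn in Figure 4.

```python
from itertools import combinations

# The four example transactions: transaction id -> affected nodes.
TRANSACTIONS = {1: [13, 23], 2: [14, 24], 3: [11, 12, 22], 4: [22, 23, 24]}

def edge_weights(transactions):
    """Edge weight = number of transactions co-accessing a pair of nodes."""
    weights = {}
    for nodes in transactions.values():
        for pair in combinations(sorted(set(nodes)), 2):
            weights[pair] = weights.get(pair, 0) + 1
    return weights

def cut_weight(weights, part_a):
    """Total weight of edges with exactly one end point in part_a."""
    return sum(w for (u, v), w in weights.items()
               if (u in part_a) != (v in part_a))

def best_bisection(transactions):
    """Exhaustively find a balanced 2-way split with minimum cut weight."""
    weights = edge_weights(transactions)
    nodes = sorted({n for ns in transactions.values() for n in ns})
    half = len(nodes) // 2
    best = None
    for part_a in combinations(nodes, half):
        cost = cut_weight(weights, set(part_a))
        if best is None or cost < best[0]:
            best = (cost, set(part_a))
    cost, part_a = best
    return cost, part_a, set(nodes) - part_a

def distributed_count(transactions, part_a):
    """Number of transactions whose nodes fall on both sides of the split."""
    count = 0
    for nodes in transactions.values():
        sides = {n in part_a for n in nodes}
        if len(sides) == 2:
            count += 1
    return count

cost, part_a, part_b = best_bisection(TRANSACTIONS)
```

For this workload the minimum cut weight is 2, with nodes 11, 12 and 22 on one side and nodes 13, 14, 23 and 24 on the other. Only the fourth transaction then spans both partitions. Applying the same split recursively to each side would give the recursive bisection mentioned in the Introduction.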
Figure 3. Graph representation for four transactions.
III. RESULTS OF EXPERIMENT
During the experiment, various trials were made for both schemes on low-end personal computers with Core i5 processors, 4GB of RAM and a 1GB graphics card. The standard TPC-C benchmark is used to evaluate the results of the two approaches. Firstly, transaction response time and throughput are compared for both approaches. Four clients per machine with three database servers are used to obtain the experimental results. The results for transaction response time and throughput are shown in Figure 5 and Figure 6, respectively. Secondly, the scalability of both schemes is tested by increasing the number of servers from one to three and analysing the effect of adding servers to the existing system. Lastly, the distributed transaction ratio is tested and shown in Table I.
In the test of transaction response time, the location based partitioning scheme is slightly faster on some workload traces, but on other traces the graph based approach is much faster than the location based approach. Likewise, the throughput of the location based approach is not always higher than that of the graph based approach.
Figure 5. Transaction response time for various workload. (Plot: TRT(s), axis 0 to 1600, against number of transactions; series: location based, graph based)
Figure 6. Throughput comparison for various workload. (Plot: Throughput(trns/s), axis 0 to 30, against number of transactions, ticks 7498, 9844, 1E+04, 2E+04, 3E+04, 3E+04; series: location based, graph based)
In the evaluation of distributed transactions, it can be seen that the graph based approach gives a lower distributed transaction ratio than the location based approach for five of the six workloads in Table I. This is because the graph representation used in the graph based approach is built from both normal and distributed transactions. Partitioning the graph means finding clusters of nodes such that the number of affected nodes spread across partitions is minimized. Thus, the graph based approach is better than the location based approach in terms of distributed transactions.
Table I Comparison of Distributed Transactions
No. of         Distributed Transactions (%)
Transactions   Location based   Graph based   Difference
7758           2.294            1.907         0.387
18957          1.442            1.471         0.029
39083          1.015            0.8417        0.173
45621          2.143            1.845         0.283
69119          2.028            1.721         0.307
149910         1.879            1.623         0.256
IV. CONCLUSION
In brief, the graph based partitioning approach is more effective than the location based approach for distributed databases. While the location based approach gains a significant improvement in availability, the graph based approach benefits more in other factors such as scalability and distributed transaction ratio. The response time and throughput do not differ much between the two schemes. In general, each approach is good in its place, depending on the nature of the distributed database application.
V. REFERENCES
[1] C. Curino, E. Jones, Y. Zhang, S. Madden, “Schism: a Workload-Driven Approach to Database Replication and Partitioning”, 36th International Conference on Very Large Data Bases, September 13-17, 2010, Singapore.
[2] G. Karypis and V. Kumar, “Multilevel k-way Partitioning Scheme for Irregular Graphs”, Journal of Parallel and Distributed Computing 48, 96–129 (1998).
[3] G. Karypis and V. Kumar, “A fast and high quality multilevel scheme for partitioning irregular graphs”, SIAM Journal on Scientific Computing, 20(1):359–392, 1999.
[4] M. Ra, “Horizontal partitioning for distributed database design”, in Advances in Database Research, World Scientific Publishing, pp. 101–120, 1993.
[5] A. Y. Seadim, “An Overview of distributed database management”, Distributed Operating System, Term Project, 1998.
[6] T. Myint Naing, A. Win, “Enforcing Scalability: An Efficient Approach for Distributed Database Partial Replication”, ICAET’2014, Singapore.
[7] T. Myint Naing, A. Win, “An Efficient Paradigm for Distributed Database Scalability”, ICSE’2013, Myanmar.