A Comparative Analysis on Location Based and Graph Based Distributed Database Partitioning

Tin Myint Naing (1), Aung Win (2)
(1) Ph.D Student of UT(YCC), Pyin Oo Lwin, Myanmar, utinmyintnaing08@gmail.com
(2) Principal, UT(YCC), Pyin Oo Lwin, Myanmar

Abstract -- Distributed database partitioning is one of the challenges in the area of distributed applications. Research has been carried out in recent years to find an optimal solution. In this paper, we compare two approaches to distributed database partitioning. The first is based on location and the second on graph partitioning. In the location based strategy, the client's source IP address is the core factor used to divide the distributed database. In graph based partitioning, distributed transactions and the nodes they affect are the main inputs. The experimental results show that the graph based approach is better in some factors while the location based approach has the advantage in others.

Index Terms -- Partitioning; distributed database; graph partitioning.

I. INTRODUCTION

Scalability, reliability and availability are related and important issues in implementing distributed database applications. It is important to handle the overhead that arises when the system scales out; otherwise, scaling out only imposes a performance penalty on the system. There are various approaches to implementing distributed database applications. Some emphasize reliability and some are oriented to availability [6].

The location based strategy is oriented to availability, as the affected nodes are placed on the nearest database server, while the graph based approach is suited to scalability in the same application area. The primary way of increasing database scalability in a distributed environment is horizontal partitioning of the distributed database. By placing partitions on different nodes, it is often possible to achieve nearly linear speedup, especially for analytical queries where each node can scan its partitions in parallel. Besides improving scalability, partitioning can also improve availability within the environment [9]. Workload balancing in a distributed application environment is crucial, and an efficient partitioning scheme can help to accomplish the desired goals. Distributed transactions are expensive, and thus it is necessary to obtain optimal partitions for good performance from distributed OLTP databases [1].

Graph partitioning is an important problem that has extensive applications in many areas, including scientific computing, VLSI design, task scheduling, geographical information systems, and operations research. The problem is to partition the vertices of a graph into p roughly equal parts such that the number of edges connecting vertices in different parts is minimized. For the graph based approach, the multilevel k-way graph partitioning scheme is the key [3]. The k-way partitioning problem is most frequently solved by recursive bisection. That is, we first obtain a 2-way partitioning of V, and then we recursively obtain a 2-way partitioning of each resulting partition. After log k phases, graph G is partitioned into k partitions [2]. Distributed transactions can naturally be converted to a graph in which each tuple affected by a distributed transaction becomes a vertex, and an edge connects two vertices if they appear in the same transaction. Thus graph partitioning is one of the best existing techniques for addressing the problem of distributed transactions and improving scalability in a distributed environment [7].

II. DISTRIBUTED DATABASE PARTITIONING

A. LOCATION BASED APPROACH

Location based partitioning for distributed databases is in fact horizontal partitioning, in which some of the rows of a table (or relation) are put into a base relation at one site and other rows are put into a base relation at another site. More generally, the rows of a relation are distributed to many sites. The basic concept of location based horizontal partitioning is that data are stored close to where they are used and separate from other data used by other users or applications [4]. Moreover, data can be stored so as to optimize performance for local access. The most important point in this scheme is to detect the pattern of transactions and to move nodes from one database server to another as necessary [5]. The overall procedure for the location based partitioning scheme is shown in Figure 1.

Figure 1. Overall procedure of location based partitioning (components: client's requests, query router, obtaining and analysis of queries, classification and clustering of nodes, migrating nodes, DB servers).

The phase of obtaining and analyzing queries is the main one in this approach, as it detects the transaction access pattern and allows the necessary analysis of the affected nodes. First, transactions must be logged in a particular format for detection and analysis. Each transaction has to be recorded together with its client's address, that is, its location. Then the affected nodes are found for each transaction, and the number of accesses for each transaction is computed for later clustering. The access pattern of a particular affected node emerges after the log file is analyzed. Some nodes are accessed from a single location and some are accessed from multiple locations.

Clustering a node that is accessed from a single location is quite easy, but a node that is accessed from multiple locations is harder to assign to a partition. Although a clustering algorithm could be used for that, frequency based selection is used in this approach for simplicity. After clustering, the affected nodes are moved to their respective partitions, that is, database servers. The performance can then be evaluated.

B. GRAPH BASED APPROACH

The graph based approach uses transactions and their affected nodes as its main source. Tuples stand for graph vertices and transactions become edges in this framework. It is reasonable to represent the transactions as edges in a graph; these edges are then cut to obtain balanced partitions that minimize the weight of the cut edges, which approximately minimizes the number of multi-site transactions. The overall procedure is shown in Figure 2.

Figure 2. Overall procedure of graph based partitioning (components: client's requests, query router, obtaining and analysis of queries, graph representation, partitioning graph, replacement, DB servers).

In order to use the graph representation model, it is necessary to collect query traces from multiple clients. The traces are then analyzed to extract useful information from the queries and to find the transaction ids and their victim nodes, which serve as the input for the graph representation. Some heuristics are applied to reduce the graph size and to support the partitioning process. The first is transaction-level sampling, which limits the size of the workload trace represented in the graph and so reduces the number of edges. The next is tuple-level sampling, which reduces the number of nodes (tuples) in the graph. Lastly, filtering discards occasional statements that scan a large portion of a table, as they would produce many edges carrying little information, and removes rarely accessed tuples from the graph [7].

In the graph representation, each tuple is represented as a node, and an edge connects two tuples that are used in the same transaction. Edge weights account for the number of transactions that co-access a pair of tuples. The weights are used by the graph partitioning algorithm, which tries to obtain balanced partitions. The weight of a partition is computed as the sum of the weights of the nodes assigned to that partition.

A simplified version of the stock table of the TPC-C database is used to explain the graph representation. The first transaction affects nodes 13 and 23, while the second covers nodes 14 and 24. The third affects nodes 11, 12 and 22. The fourth covers nodes 22, 23 and 24. The corresponding graph is shown in Figure 3.

After the graph representation of the transactions is created, graph partitioning can be carried out with METIS, a widely used graph partitioning package. METIS uses multilevel k-way graph partitioning on large graphs to obtain a balanced minimum-cut partitioning of the graph into k partitions. Graph partitioning splits the graph into k non-overlapping partitions such that the overall cost of the cut edges is minimized while the weight of each partition stays within a constant factor of perfect balance [2]. This operation therefore approximately minimizes the number of distributed transactions while balancing the load evenly across nodes. The graph partitioning for the four-transaction workload trace described above is shown in Figure 4.

Figure 4. A partitioned graph for four transactions.
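
The four-transaction example above can be worked through in a few lines of code. The following Python sketch builds the co-access graph (one node per tuple, edge weight equal to the number of transactions that touch both tuples) and then looks for a balanced two-way split with the smallest cut weight. It rests on several assumptions that are not from the paper: unit node weights, k = 2, and an exhaustive search that is only feasible for a graph this small. The search is an illustrative stand-in, not the multilevel k-way algorithm used by METIS, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Tuples touched by each of the four example transactions in the text.
transactions = [
    {13, 23},
    {14, 24},
    {11, 12, 22},
    {22, 23, 24},
]


def build_coaccess_graph(txns):
    """Return {(u, v): weight}, where weight counts the transactions
    that access both tuple u and tuple v (with u < v)."""
    edges = Counter()
    for txn in txns:
        for u, v in combinations(sorted(txn), 2):
            edges[(u, v)] += 1
    return edges


def cut_weight(edges, part):
    """Total weight of edges with exactly one endpoint inside `part`."""
    return sum(w for (u, v), w in edges.items() if (u in part) != (v in part))


def best_balanced_bisection(edges, nodes):
    """Exhaustive stand-in for a real partitioner (assumption: tiny graph).
    Tries every subset of size len(nodes) // 2 and keeps the minimum cut."""
    half = len(nodes) // 2
    candidates = (frozenset(c) for c in combinations(sorted(nodes), half))
    return min(candidates, key=lambda part: cut_weight(edges, part))


edges = build_coaccess_graph(transactions)
all_nodes = frozenset().union(*transactions)
part_a = best_balanced_bisection(edges, all_nodes)
part_b = all_nodes - part_a

# A transaction is distributed if it touches tuples in both partitions.
distributed = [t for t in transactions if t & part_a and t & part_b]

print("partition A:", sorted(part_a))
print("partition B:", sorted(part_b))
print("cut weight:", cut_weight(edges, part_a))
print("distributed transactions:", len(distributed))
```

On this example the search places tuples 11, 12 and 22 in one partition and 13, 14, 23 and 24 in the other, with a cut weight of 2, so only the fourth transaction spans both partitions. For realistic workloads the exhaustive step would be replaced by a real partitioner such as METIS.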

Figure 3. Graph representation for four transactions.

When the graph has been partitioned into several clusters, the affected nodes assigned to a particular cluster are moved to the destination database server. Once migration to the database servers is finished, the performance can be evaluated.

III. RESULTS OF EXPERIMENT

During the experiment, various trials were made for both schemes on low-end personal computers with Core i5 processors, 4 GB of RAM and a 1 GB graphics card. The standard TPC-C benchmark is used to evaluate the results of the two approaches. First, transaction response time and throughput are compared for both approaches. Four clients per machine with three database servers are used to obtain the experimental results. The results for transaction response time and throughput are shown in Figure 5 and Figure 6 respectively. Second, scalability is tested for both schemes by increasing the number of servers from one to three and analyzing the effect of each added server. Lastly, the distributed transaction ratio is tested and shown in Table I.

Figure 5. Transaction response time for various workloads (vertical axis: TRT in seconds; series: location based, graph based).

In the test of transaction response time, the location based partitioning scheme is slightly faster on some workload traces, but on other traces the graph based approach is much faster than the location based approach. Likewise, the throughput of the location based approach is not always higher than that of the graph based approach.

In the evaluation of distributed transactions, it can be seen that the graph based approach is generally better than the location based approach. This is because the graph representation used in the former approach is based on the workload of normal and distributed transactions. This graph is transformed from the transactions.
Partitioning the graph means finding clusters of nodes while minimizing the number of nodes affected by distributed transactions. Thus, the graph based approach performs better than the location based approach on distributed transactions: in Table I its ratio is lower for five of the six workloads.

Figure 6. Throughput comparison for various workloads (vertical axis: throughput in transactions per second; series: location based, graph based).

Table I. Comparison of Distributed Transactions

No. of          Distributed Transactions (%)
Transactions    Location based    Graph based    Difference
7758            2.294             1.907          0.387
18957           1.442             1.471          0.029
39083           1.015             0.8417         0.173
45621           2.143             1.845          0.283
69119           2.028             1.721          0.307
149910          1.879             1.623          0.256

IV. CONCLUSION

In brief, the graph based partitioning approach is more effective than the location based approach for distributed databases. While the location based approach gains a significant improvement in availability, the graph based approach benefits other factors such as scalability and the distributed transaction ratio. Response time and throughput do not differ much between the two schemes. In general, each approach is good in its place, depending on the nature of the distributed database application.

V. REFERENCES

[1] C. Curino, E. Jones, Y. Zhang, S. Madden, "Schism: a Workload-Driven Approach to Database Replication and Partitioning", 36th International Conference on Very Large Data Bases, September 13-17, 2010, Singapore.
[2] G. Karypis and V. Kumar, "Multilevel k-way Partitioning Scheme for Irregular Graphs", Journal of Parallel and Distributed Computing 48, 96-129 (1998).
[3] G. Karypis and V. Kumar, "A fast and high quality multilevel scheme for partitioning irregular graphs", SIAM Journal on Scientific Computing, 20(1):359-392, 1999.
[4] M. Ra, "Horizontal partitioning for distributed database design", in Advances in Database Research, World Scientific Publishing, pp. 101-120, 1993.
[5] A. Y. Seadim, "An Overview of distributed database management", Distributed Operating System, Term Project, 1998.
[6] T. Myint Naing, A. Win, "Enforcing Scalability: An Efficient Approach for Distributed Database Partial Replication", ICAET'2014, Singapore.
[7] T. Myint Naing, A. Win, "An Efficient Paradigm for Distributed Database Scalability", ICSE'2013, Myanmar.