Dynamic Fragmentation and Heuristic Join Algorithm

Aditya C Awalkar
Dept. of Computer Engg., Vidyalankar Inst. of Tech., Mumbai, India
awalkaradi95@gmail.com

Ahmed Sabeeh Quadri
Dept. of Computer Engg., Vidyalankar Inst. of Tech., Mumbai, India
asabeeh18@hotmail.com

Aayush Khator
Dept. of Computer Engg., Vidyalankar Inst. of Tech., Mumbai, India
aayush.khator@vit.edu.in

ABSTRACT: Distributed database systems improve communication and data processing because their data is distributed across different network sites. Fragmentation of the database, query processing and query optimization are the most important factors contributing to the efficiency of a distributed database system. In this paper we present a model for dynamically fragmenting a distributed database so that it stays optimized as access patterns change. We then discuss an improved join algorithm, which we call the heuristic join algorithm, and compare it with the traditional join and semijoin algorithms. For this we take a variety of cases and plot the results on graphs, and in this way analyze the cases in which the algorithm works better than the others.

Keywords: Distributed Database, Query, Dynamic Fragmentation, Join, Semijoin, Heuristic, Optimization.

I. INTRODUCTION

A database management system is software that manages the storage, retrieval and updating of data in a computer system. When the database is kept at the same location from which it is accessed, it is called a centralized database. A centralized database has major disadvantages; for example, it becomes a bottleneck when demand on the database is high, i.e. when many users access the data at the same time. To overcome this, the distributed database scheme was adopted. In a distributed database the data is divided among several sites which are logically interrelated and can be accessed from any site connected to the network. Distributed database system (DDBS) technology is the combination of two diametrically opposed approaches to data processing: database system and computer network technologies.

Distributed processing better corresponds to the organizational structure of today's widely distributed enterprises, and thus such a system is more reliable as well as more responsive. Also, most of the daily applications of computer technology are distributed. The promises of distributed database management systems are:
1) Transparent management of distributed and replicated data
2) Reliability through distributed transactions
3) Improved performance
4) Easier system expansion
5) Replication

In distributed databases, a table is fragmented and stored at different sites. The fragmentation is done according to minterm predicates, and the resulting fragments should satisfy the lossless join condition. There are 3 types of fragmentation:
1) Horizontal Fragmentation
2) Vertical Fragmentation
3) Hybrid Fragmentation

Since fragmentation determines the data stored at each site, it is an important aspect of distributed database management.

II. DYNAMIC FRAGMENTATION MODEL

Issues with existing Methodology

In a distributed database, tables or fragments are stored at different sites to ensure optimization. A user fires a query to fetch results from the database and is unaware of the internal architecture, i.e. fragmentation and distribution. The query processor sends the query to the site where the relevant data is stored. However, for many different reasons the data access at a particular site may fall while the data access at other sites rises. The former site then becomes nearly redundant, while the increased load on the latter slows down the overall operation. This ultimately reduces the overall efficiency.

Consider an example where there are two sites holding tables tax_1 and tax_2 respectively:
Site 1: tax_1 (tax details for annual income less than Rs 10 lacs)
Site 2: tax_2 (tax details for annual income more than Rs 10 lacs)

Due to an increase in income tax slab rates, the usage of tax details for income less than 10 lacs fell while that for income greater than 10 lacs rose, i.e.

usage(tax_1) << usage(tax_2)

This leads to a higher load on Site 2, while Site 1 becomes almost redundant. This ultimately reduces efficiency: the user gets the result after a longer period of time. To solve this problem, we propose our model.

Model Architecture

Figure: Dynamic Fragmentation Model showing 2 loops (in reality the loop repeats indefinitely).

The figure shows the architecture of the iterative model of fragmentation.

Basic Working

Each computation is done after a specific period of time, since efficiency would suffer if the computation were done for every query or after a short time interval. The database of the computation module, however, keeps storing query statistics and the data accessed for every query fired on the given site. At every computation, the computation module calculates a result for one or more sites from a set of parameters whose records it has stored. The parameters are the fragments accessed, the number of queries fired per week, the average and peak CPU load, the queries fired on a particular site and the cost incurred in executing them. The result obtained is compared with the threshold which was set initially. If the result exceeds the threshold, re-fragmentation takes place.

Re-fragmentation is done by the fragmentation module, which uses the basic fragmentation methodology together with database statistics for optimization. It takes into account the last fragmentation and transfers only those records which are extraneous on the site to be re-fragmented. Since re-fragmentation considers the last fragmentation and transfers only the additional records, it saves much of the communication cost.
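The threshold check performed by the computation module can be sketched as follows. This is a minimal illustration, not the model's actual formula: the paper does not specify how the recorded parameters are combined, so the `load_score` function, the `SiteStats` fields, the sample figures and the threshold value of 1.0 are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class SiteStats:
    """Per-site figures the computation module is assumed to record."""
    queries_per_week: int
    avg_cpu_load: float   # fraction of capacity, 0.0 to 1.0
    peak_cpu_load: float  # fraction of capacity, 0.0 to 1.0


def load_score(site: SiteStats, all_sites: list[SiteStats]) -> float:
    """Hypothetical score: the site's query count relative to an even
    split across sites, scaled by the mean of average and peak CPU load."""
    total = sum(s.queries_per_week for s in all_sites)
    if total == 0:
        return 0.0
    even_share = total / len(all_sites)
    relative_demand = site.queries_per_week / even_share
    cpu_pressure = (site.avg_cpu_load + site.peak_cpu_load) / 2
    return relative_demand * cpu_pressure


def sites_to_refragment(stats: dict[str, SiteStats], threshold: float) -> list[str]:
    """Return the sites whose score exceeds the preset threshold."""
    all_sites = list(stats.values())
    return [name for name, s in stats.items()
            if load_score(s, all_sites) > threshold]


# Made-up weekly statistics mirroring the tax_1 / tax_2 imbalance.
stats = {
    "site1_tax_1": SiteStats(queries_per_week=50, avg_cpu_load=0.05, peak_cpu_load=0.15),
    "site2_tax_2": SiteStats(queries_per_week=950, avg_cpu_load=0.80, peak_cpu_load=1.00),
}
overloaded = sites_to_refragment(stats, threshold=1.0)
print(overloaded)
```

In this sketch only Site 2 crosses the threshold, which is the situation in which the fragmentation module would be asked to move records away from it.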
The model consists of two modules.

Fragmentation Module
It fragments the database and distributes the fragments over the various sites. It takes its input from the computation module, together with the database statistics and the details of the previous fragmentation.

Computation Module
It is the most important module. It stores query statistics and maintains a record of which fragments were accessed over a given period of time. It has access to the database statistics from the time of database creation. In each loop, it performs a computation over the query statistics and produces a result. If the result exceeds a set threshold, the database needs to be re-fragmented.

Advantages
1) Redundancy elimination
2) Faster query processing
3) Improved reliability, i.e. decreased chances of a server crash
4) Distribution of load across all sites

III. JOIN BASED ALGORITHMS

Join and semijoin based algorithms are frequently used. Every database has relations between tables; hence joins are widely used when deriving specific information. In distributed databases, a join adds considerably to the cost of a query. Hence, optimizing the join operation is a crucial step towards faster query processing.

The drawback of the join approach is that the entire operand relations must be transferred between the sites. The benefit of the semijoin is that it reduces the number of tuples needed to form the join, which is of great importance in distributed databases since it reduces the network cost. The semijoin operator has the vital property of shrinking the size of the operand relation. When the primary cost component considered by the query processor is communication, a semijoin proves very useful since it significantly reduces the data sent between the sites. However, the use of semijoins may increase the number of messages and the local processing time. We consider only network cost.
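The way a semijoin shrinks an operand before it is shipped can be illustrated with a few in-memory tuples. This is only a sketch: the rows below are invented for the illustration, and the helper names `semijoin` and `join` are hypothetical rather than taken from any database library.

```python
def semijoin(left, right, left_key, right_key):
    """Return the tuples of `left` that have a join partner in `right`."""
    keys = {row[right_key] for row in right}
    return [row for row in left if row[left_key] in keys]


def join(left, right, left_key, right_key):
    """Equijoin of two lists of dicts on left_key = right_key."""
    index = {}
    for r in right:
        index.setdefault(r[right_key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[left_key], [])]


# Invented sample data; two employees reference departments that do not exist.
employees = [
    {"ENo": 1, "EName": "Asha", "DeptNo": 10},
    {"ENo": 2, "EName": "Ravi", "DeptNo": 20},
    {"ENo": 3, "EName": "Meera", "DeptNo": 98},
    {"ENo": 4, "EName": "Kiran", "DeptNo": 99},
]
departments = [
    {"DNo": 10, "DName": "Sales"},
    {"DNo": 20, "DName": "Audit"},
]

# Only the matching employee tuples would need to cross the network.
reduced = semijoin(employees, departments, "DeptNo", "DNo")
print(len(employees), "tuples before,", len(reduced), "after the semijoin")

# Joining the reduced operand gives the same result as joining the full one.
same = join(reduced, departments, "DeptNo", "DNo") == join(employees, departments, "DeptNo", "DNo")
print(same)
```

The reduction only pays off when some tuples have no join partner; if every tuple matches, the semijoin ships the join attribute without removing anything.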
In certain cases, however, such as those involving relations of contrasting sizes, a different approach proves to be much more efficient than the traditional semijoin approach. Let us see all three approaches.

Example

Consider two relations, Employee and Department, with the following schema.

1. Employee (ENo, EName, DeptNo, ...)
   ENo: 2 bytes
   EName: 32 bytes
   DeptNo: 2 bytes
   Total size of the Employee attributes: 100 bytes
   Number of tuples in the Employee relation: 10,000

2. Department (DNo, DName, ...)
   DNo: 2 bytes
   DName: 16 bytes
   Total size of the Department attributes: 30 bytes
   Number of tuples in the Department relation: 100

The Employee relation is at Site 1 and the Department relation is at Site 2. Both sites are connected to each other over the network, and the query is fired from a third site, Site 3 (the result site), which is also part of the network.

SQL Query:
SELECT Employee.EName, Department.DName FROM Department, Employee WHERE Employee.DeptNo = Department.DNo

We consider the best case for each of the three methods.

Join Method
1) Transfer the Department relation to Site 1.
2) Perform the join operation as follows: R = Π_{DName,EName}(Department ⋈_{Employee.DeptNo = Department.DNo} Employee)
3) Transfer the result R to Site 3.

Cost: transferring the Department relation from Site 2 to Site 1 + transferring the computed result from Site 1 to Site 3
= 100*30 + 10,000*(32+16)
= 3,000 + 480,000
= 483,000 bytes

Semijoin Method
1) Transfer the Department relation to Site 3.
2) Transfer DNo to Site 1 as R1.
3) Perform the join operation as follows: R2 = Π_{DeptNo,EName}(R1 ⋈_{Employee.DeptNo = R1.DNo} Employee)
4) Transfer R2 to Site 3.
5) R = Π_{DName,EName}(R2 ⋈_{R2.DeptNo = Department.DNo} Department)

Cost: transferring the Department relation from Site 2 to Site 3 + transferring Department.DNo (R1) from Site 2 to Site 1 + transferring the computed result (R2) from Site 1 to Site 3
= 100*30 + 100*2 + 10,000*(32+2)
= 3,000 + 200 + 340,000
= 343,200 bytes

Heuristic Method
1) Transfer Department.DNo and Department.DName as R1 to Site 3: R1 = Π_{DName,DNo}(Department)
2) Transfer Employee.DeptNo and Employee.EName as R2 to Site 3: R2 = Π_{EName,DeptNo}(Employee)
3) Perform the join operation as follows: R = Π_{DName,EName}(R1 ⋈_{R1.DNo = R2.DeptNo} R2)

Cost: transferring R1 from Site 2 to Site 3 + transferring R2 from Site 1 to Site 3
= 100*(16+2) + 10,000*(32+2)
= 1,800 + 340,000
= 341,800 bytes

From the example, we can see that the semijoin is cheaper than the join, and that the heuristic method reduces the cost further. Although the difference here is only 1,400 bytes, in real cases where the relations are very large in terms of both tuples and attributes the difference can be much larger. Hence, in this example the heuristic method is better than both the join and the semijoin methods.

Algorithm
1) For each site participating in the query, carry out steps 2 and 3.
2) Select the attributes required by the query, together with the joining attributes.
3) Transfer the resulting relation to a central site, preferably the site at which the result is expected.
4) Perform a natural join between all the relations.

Comparison

We compare the algorithms on various test cases by varying one quantity while keeping the others constant.

Case 1: Changing the size of the join attribute DNo
Figure: cost against size of DNo for Join, Semijoin and Heuristic.
Result: Join cost remains constant. Heuristic performs slightly better than semijoin, and both increase linearly.

Case 2: Changing the number of tuples of Department
Figure: cost against number of tuples of Department for Join, Semijoin and Heuristic.
Result: Heuristic performs better than the other two algorithms.

Case 3: Changing the size of the DName attribute
Figure: cost against DName size for Join, Semijoin and Heuristic.
Result: Join cost increases rapidly, while semijoin cost remains constant. Heuristic cost keeps increasing and at a point crosses semijoin.

Case 4: Changing the total size of the Department attributes
Figure: cost against total size of Department attributes for Join, Semijoin and Heuristic.
Result: Heuristic cost remains constant, while join and semijoin costs keep increasing.

CONCLUSION

This paper proposes a model which dynamically re-fragments the database when the load on a site increases to a large extent. It re-fragments the database in such a way that the load is distributed across all sites, thus increasing efficiency. In the second part of the paper, our analysis showed that the heuristic algorithm works better than the traditional join algorithms in the majority of the cases considered.

Acknowledgment

We would like to thank Prof. Vipul Dalal (Distributed Database Management System Professor, VIT) for guiding us in writing this paper. We would also like to thank Prof. Sachin Deshpande, head of the Computer Engg. Dept., for motivating us.
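As a cross-check on the worked example of Section III, the three network-cost expressions can be collected into a short script. This is a sketch under stated assumptions: the function and parameter names are hypothetical, every Employee tuple is assumed to match a Department tuple (so each result has 10,000 tuples), and DeptNo is assumed to have the same size as DNo. Default values are the sizes given in the example, in bytes.

```python
def join_cost(dept_tuples=100, dept_bytes=30, emp_tuples=10_000,
              ename_bytes=32, dname_bytes=16):
    """Ship Department to Site 1, then ship the (EName, DName) result to Site 3."""
    return dept_tuples * dept_bytes + emp_tuples * (ename_bytes + dname_bytes)


def semijoin_cost(dept_tuples=100, dept_bytes=30, emp_tuples=10_000,
                  ename_bytes=32, dno_bytes=2):
    """Ship Department to Site 3, DNo to Site 1, then (EName, DeptNo) to Site 3."""
    return (dept_tuples * dept_bytes
            + dept_tuples * dno_bytes
            + emp_tuples * (ename_bytes + dno_bytes))


def heuristic_cost(dept_tuples=100, emp_tuples=10_000,
                   ename_bytes=32, dname_bytes=16, dno_bytes=2):
    """Ship only the projected columns of both relations to Site 3."""
    return (dept_tuples * (dname_bytes + dno_bytes)
            + emp_tuples * (ename_bytes + dno_bytes))


# Costs for the example as stated in Section III.
costs = {
    "join": join_cost(),
    "semijoin": semijoin_cost(),
    "heuristic": heuristic_cost(),
}
print(costs)

# Case 2 of the comparison: vary the number of Department tuples.
for n in (500, 1000, 2000):
    print(n, join_cost(dept_tuples=n), semijoin_cost(dept_tuples=n),
          heuristic_cost(dept_tuples=n))
```

With the default sizes the script gives 483,000, 343,200 and 341,800 bytes for the three methods. If the total Department tuple size is held at 30 bytes while DName grows, these formulas give equal heuristic and semijoin costs at a DName size of 30 bytes, which is consistent with the crossover reported in Case 3.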