ANALYZING THE PHASES OF QUERY PROCESSING

Deepa Prakash Kapai
B.E., Visvesvaraya Technological University, India, 2007

PROJECT
Submitted in partial satisfaction of the requirements for the degree of
MASTER OF SCIENCE in COMPUTER SCIENCE
at CALIFORNIA STATE UNIVERSITY, SACRAMENTO
SPRING 2012

A Project by Deepa Prakash Kapai

Approved by:
Mary Jane Lee, Ph.D., Committee Chair
Robert Buckley, M.S., Second Reader

Student: Deepa Prakash Kapai
I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and credit is to be awarded for the Project.
Nikrouz Faroughi, Ph.D., Graduate Coordinator, Department of Computer Science

Abstract of ANALYZING THE PHASES OF QUERY PROCESSING by Deepa Prakash Kapai

Databases are a fundamental part of managing data. Storing multimedia objects such as graphics, images, video and audio in databases is a basic need of everyday users. However, storing these objects in relational databases is not possible because relational databases support only single-valued data. Object-oriented databases were introduced to store these complex objects. Object-oriented databases support all the features of object-oriented programming languages, such as inheritance, encapsulation and polymorphism, along with the other features of databases. This project provides an introduction to object-oriented databases and their features, the design of an object model, and the design of the classes, objects and queries. Another major function of a database is to respond to users, i.e., to process user queries. Efficiency and speed are important in today’s market, so there is a need to process and optimize queries properly.
This project provides the basic idea of the hybrid-hash pointer (HP) based approach and the multiwavefront (MW) approach for processing queries. The focus of this project is to design an optimization function and apply it to Select and Join queries. This project analyzes the performance of the Select and Join queries. The performance factors include query execution time, memory space requirement and response time. This project makes recommendations for novice object-oriented database developers who build applications where data retrieval time (execution time plus response time) is an important factor.

ACKNOWLEDGEMENT

This space gives me the honor of thanking all the people whose support made this project and my master's degree a success. I take this opportunity to convey my sincere thanks to all of them.

Firstly, I would like to thank Dr. Mary Jane Lee. Her help and supportive guidance throughout the highs and lows of this project have been commendable. She took extra effort to review the report and kept giving me advice during the course of the project.

I would also like to thank Prof. Robert Buckley for his extended support. He provided valuable suggestions when they were really needed.

Furthermore, I would like to thank the Department of Computer Science at California State University, Sacramento for giving me the opportunity to pursue my master's degree and for guiding me all the way to becoming a successful student.

Last but not least, I am thankful to my parents, Prakash H Kapai and Bindu P Kapai, for their constant support and belief in me. Their words of wisdom and moral support helped me overcome all the challenges, and through their guidance I was able to complete my project and earn my master's degree.
TABLE OF CONTENTS

Acknowledgement
List of Tables
List of Figures

Chapter
1. INTRODUCTION
   1.1 Query Processing and Optimization
   1.2 Need of Query Processing and Optimization
   1.3 Goal of the Project
2. OBJECT-ORIENTED DATABASES
   2.1 Object-Oriented Concepts
   2.2 Object-Oriented Database Concepts
   2.3 Advantages of Object-Oriented Databases
   2.4 Object-Oriented Databases vs Relational Databases
3. DESIGN OF OBJECT-ORIENTED DATABASES
   3.1 Project Requirements
   3.2 Object Model
   3.3 Installation
   3.4 Database Schema
   3.5 Queries
4. QUERY PROCESSING AND OPTIMIZATION IN OODB
   4.1 Query Processing for OODB
   4.2 Hybrid-Hash Pointer (HP) based approach
   4.3 Multiwavefront (MW) approach
   4.4 Query Optimization
5. RESULTS
   5.1 Select Query Result
   5.2 Join Query Result
6. CONCLUSION AND FUTURE WORK
   6.1 Conclusion
   6.2 Future Work
Bibliography

LIST OF TABLES
1.
Table 2.1: Object state interpreted based on type constructor
2. Table 2.2: Student table
3. Table 2.3: Output from Relational Databases
4. Table 2.4: Output from Object-Oriented Databases

LIST OF FIGURES
1. Figure 2.1: Built-in interface
2. Figure 3.1: Object Model
3. Figure 4.1: Query Tree for Select query in OODB
4. Figure 4.2: Query Tree for Join query in OODB
5. Figure 4.3: Query Graph for Select query in OODB
6. Figure 4.4: Query Graph for Join query in OODB
7. Figure 4.5: Rewriting Phase
8. Figure 4.6: After Optimization phase
9. Figure 5.1: Execution time of Select query
10. Figure 5.2: Memory Usage
11. Figure 5.3: Response Time
12. Figure 5.4: Execution time of Join query
13. Figure 5.5: Memory Usage
14. Figure 5.6: Response Time

Chapter 1
INTRODUCTION

For large organizations across the globe, storing huge amounts of data has become an integral part of infrastructure development. Millions of basic transactions, such as withdrawing money from an ATM or paying a credit card bill online, are conducted daily. Storing data in digital form has made these daily transactions efficient and fast.

Today the storage of data is not the primary problem. Devices such as iPods and flash drives can store gigabytes of data, storing data has become much easier, and the cost of storage has fallen. However, cheaper and larger storage has brought a new challenge - the retrieval of data.
How efficiently can we retrieve particular data from the gigantic stack of data? Earlier, large databases were usually meant for storing 100MB or a few gigabytes of data. Today large databases can store 10^15 bytes of data.

To get the data fast, indexes were introduced in databases. A database index [1] is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records [1]. However, indexes alone are not sufficient for the efficient retrieval of data. Another very important element in efficient retrieval is “Query Processing and Optimization”.

1.1 Query Processing and Optimization:

Query Processing [2] is used to obtain the desired information from a database system in a predictable and reliable manner. Query Optimization [2] is used to obtain the results in a timely manner. Query Processing and Optimization are extremely important aspects of a DBMS. They help to determine how long a particular query takes to retrieve specific data. From this, we can tell whether the query is interactive or batched in nature. An interactive query is one which gives its results immediately. A batched query, by contrast, does not return promptly; its results lag and take time.

1.2 Need of Query Processing and Optimization:

Let’s take an example to see why query processing and optimization are so useful. Suppose there is 1GB of data in the student database, and the user wants to retrieve the names of all students who are on a student visa, belong to the computer science department and have a GPA greater than or equal to 3.5. The user uses the “Select” query to display the student names,

select studname from student where visa = 'international' and department = 'CS' and gpa >= 3.5
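As a rough illustration of why this query is expensive without further help, the Select statement above can be read as a full scan that tests every record against all three predicates. The Python sketch below is only an analogy under stated assumptions: the `StudentRow` type, the sample records in `ROWS` and the `select_studnames` function are hypothetical and are not part of the project.

```python
from dataclasses import dataclass


@dataclass
class StudentRow:
    """Hypothetical in-memory stand-in for one record of the student table."""
    studname: str
    visa: str
    department: str
    gpa: float


def select_studnames(rows):
    """Full scan: each row is examined once and tested against all predicates."""
    return [
        row.studname
        for row in rows
        if row.visa == "international"
        and row.department == "CS"
        and row.gpa >= 3.5
    ]


# Made-up sample data, used only to exercise the function.
ROWS = [
    StudentRow("Asha", "international", "CS", 3.8),
    StudentRow("Ben", "resident", "CS", 3.9),
    StudentRow("Wei", "international", "CS", 3.5),
    StudentRow("Lena", "international", "EE", 3.7),
    StudentRow("Omar", "international", "CS", 3.2),
]

if __name__ == "__main__":
    print(select_studnames(ROWS))
```

Because the function visits every row, its running time grows linearly with the amount of student data, which is why a scan over a large table is slow.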
This Select query requires scanning through the 1GB of data, which can take about 1000 seconds.

Now consider a query over two different tables, each holding 1GB of data. A bad query execution plan would compute the Cartesian product of the two tables before it returns the results. If the Cartesian product of 1GB of data times 1GB of data has to be computed, and each access of a table takes around 1000 seconds, then the user will get back the results in 15 - 20 minutes. This plan is an ineffective way of executing the query and delivering its results.

An efficient query execution would try to rework the given query in a more effective manner. To make the above execution plan effective, we would first determine what the user really wants and then use the join operation, which is better than the Cartesian product. If we are able to do this, the user will get the results within 5 minutes, which is much more efficient than the Cartesian product.

1.3 Goal of the Project:

For any application involving databases, one of the most important features is the retrieval time. Hence, choosing the right processing and optimization technique is an important decision for any database developer.

The goal of this project is to understand the steps involved in query processing, to design Select and Join object-oriented queries, to design an optimization function and to apply this function to the Select and Join queries. In addition, this project analyzes the performance of these queries in terms of execution time, response time and memory usage.

This report is structured as follows: Chapter 2 discusses object-orientation concepts in general and how these concepts are applied in databases. Chapter 3 discusses the design of the object model, database schema and queries.
Chapter 4 discusses the hybrid-hash pointer based and multiwavefront processing approaches for Select and Join queries, the design of the optimization function and the application of this function to the Select and Join queries. Chapter 5 discusses the results of the analysis using the queries designed. Finally, the conclusions and future work are discussed in Chapter 6.

Chapter 2
OBJECT-ORIENTED DATABASES

An object-oriented database [3] is a database management system in which information is represented in the form of objects, as used in object-oriented programming. The data of an object can be accessed only by the methods associated with that object. Object-oriented databases are mainly used in applications such as computer aided design (CAD), multimedia and GUI based applications. These applications are made up of fundamental objects which are the basic building blocks of the application.

2.1 Object-Oriented Concepts:

Below are the concepts that come from object-oriented programming languages,

i. The fundamental building block in an object-oriented system is the “Object”. An object represents an instance. An object belongs to a particular type, and that type is known as a “Class”. Objects can therefore be thought of as variables whose type is the class.
ii. “Abstraction” represents the necessary features of an object without including the background details. It tells which attributes are present and what they can do.
iii. An object can wrap attributes and methods in a single unit using “encapsulation”.
iv. The “Interface” (also known as the signature of the object) exposes the object to the outside world. Any external entity can interact with the object through the interface by calling particular methods.
v. An “attribute” is a member of a class which holds data or information.
vi. “Object State” is the set of values of all the attributes of an object.
vii. “Message Passing” - When an external entity invokes a method of an object, it is said to have passed a message to the object.
The message in turn invokes methods of the object.
viii. “Inheritance” is the process by which objects of one class get the properties of objects of another class.
ix. “Polymorphism” is the ability to take more than one form, i.e., an operation may exhibit different behavior in different instances. This behavior depends upon the types of data used in the operation.

2.2 Object-Oriented Database Concepts:

Object-oriented databases support all the features of object-oriented programming languages. Below are the additional features which are specific to object-oriented databases [4],

i. Persistent Objects [4] - Persistent means present permanently. Persistent objects exist even after the program has finished using them. That means the object exists on some persistent storage such as disk and can be re-read from the disk whenever required.
ii. Object Identifiers (OID) [4], [5] - It is important to uniquely identify each persistent object that is stored in the database. The OID is mandatory and system generated. The user need not be aware of it. The OID does not depend on the values of the attributes. It is similar to the primary key used in relational databases, but there are differences between object identifiers and primary keys [6]:
   a. Object identifiers are created automatically when a new object is added to the system, whether the user specifies one or not. For a primary key, the user needs to specify what forms the primary key in the database.
   b. In relational algebra, each tuple is unique, with a table representing a set of tuples. So in the worst case we can consider the entire tuple as the primary key for the table. An object identifier, however, is a separate attribute; that is, the entire object state cannot uniquely identify a given object. This is because two or more objects belonging to the same type can have the same state and hence be indistinguishable as far as attributes are concerned, yet they still represent two different objects.
iii.
Object Structure [4]: Objects which are stored in databases are directly associated with real world objects. Every instance of an object is characterized by the state/structure of the object. The state/structure of an object is defined as a triple (i, c, v). “i” stands for the object identifier (OID). “c” is the type constructor and specifies what type of value the object will have. There are different types of constructors: atom, tuple, set, list, bag and array. “v” is the object state. The object state “v” is interpreted based on the constructor “c”, as shown in Table 2.1,

Table 2.1: Object state interpreted based on type constructor

   Type ‘c’     Object State ‘v’
   Atom         Value in the domain of basic values
   Set          Set of OIDs {i1, i2, i3, … , in}
   Tuple        <a1 : i1, a2 : i2, … , an : in>
   List         Ordered list [i1, i2, i3, … , in]
   Array        Array of OIDs

iv. Instance Variables [4]: Attributes are defined at the class level. When an object is instantiated, its attributes become instance variables. The value of an instance variable can differ between objects even though the variables represent the same attribute.
v. Signature and Methods of objects: Just as in object-oriented programming languages, objects are defined by their signature and methods, which are the interfaces of the objects.
vi. Referential Integrity: Referential integrity uses object identifiers. When an object A refers to another object B, the reference is captured by storing the OID of object B as an attribute of object A. Referential integrity is enforced by ensuring that, at any point in time, an OID stored as an attribute is a valid OID. It also maintains the dependencies between objects and avoids dangling references [7].
vii. Extents: An extent is the collection of objects of the same type. A type definition together with its collection of instances forms an extent.
viii.
Interface: Object-oriented databases have built-in interfaces, which are shown in Figure 2.1.

[Figure 2.1: Built-in interface. The figure shows the built-in interfaces Object, Date, Time, Timestamp, Interval, Collection, Set, List, Bag, Array and Dictionary.]

2.3 Advantages of Object-Oriented Databases:

The advantages of object-oriented databases are,

i. Object identifiers are generated automatically when a new object is added to the system.
ii. Both the structure of objects and their behavior can be represented [4].
iii. They interact well with object-oriented languages such as Smalltalk, C++ and Java [4][8]. As a result, no extra effort is needed to design a data layer for interaction with object-oriented programming languages. This also gives higher performance.
iv. Less programming effort is needed because of inheritance, re-usability and extensibility of code [8].
v. Object-oriented databases combine object-oriented features with database features.
vi. They provide an integrated storage area for information which can be used by multiple users, applications and so on.
vii. They support complex objects and abstract and multimedia data types.

2.4 Object-Oriented Databases vs Relational Databases:

Below are some of the differences between object-oriented and relational databases,

i. Relational databases are made up of tables which consist of rows and columns. Each column has a name and can store a single data value. In object-oriented databases, data is in the form of objects. Objects comprise structure (i.e. variables) and behavior (i.e. methods).
ii. Relational databases support basic datatypes such as integers, floating-point numbers, characters and strings. Object-oriented databases support basic datatypes as well as large objects such as images, video and audio.
iii. In relational databases, the user needs to specify the primary key. Otherwise, by default, the entire row is considered the primary key. In object-oriented databases, the system automatically generates object identifiers.
iv.
Suppose we have a student table (shown in Table 2.2) which consists of the columns studID, studName, gpa and deptName.

Table 2.2: Student table

   studID    studName    gpa     deptName
   s11       John        3.6     CS
   s12       Megan       3.0     EE
   s13       Amy         3.25    ME
   s14       Kevin       3.9     CS

Suppose we need the names of all students who belong to the ‘CS’ department and have gpa >= 3.5. The query for this is,

select s.studName from student s where s.deptName = 'CS' and s.gpa >= 3.5

In relational databases, the output of the query is shown in Table 2.3,

Table 2.3: Output from Relational Databases

   studName
   John
   Kevin

Whereas in object-oriented databases, the output of the query is shown in Table 2.4,

Table 2.4: Output from Object-Oriented Databases

   String    John
   String    Kevin

v. In the above output (i.e. from iv), the relational database returns a table with rows, whereas the object-oriented database returns a collection of objects.
vi. In the above query (i.e. from iv), ‘s’ in the relational database is an alias for the student table, whereas in the object-oriented database ‘s’ represents persistent objects.
vii. A view is a virtual table which consists of fields based on the result of a query. The fields in the view come from one or more tables. In relational databases, views are created as [9],

   create view view_name as
   select column_name from table_name where conditions

Invoking a view can be done as,

   select * from view_name

Whereas in object-oriented databases, views are created using a method name and parameters. If parameters are used, they appear in the conditions.

   define method_name (parameters) as
   select column_name from table_name where conditions

Invoking a view is done by,

   method_name (parameters)
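The differences in items iii, v and vii can be made concrete with a small sketch. The Python code below is an analogy only; it is not OQL and not the project's implementation. The `Student` class, the `_oid_counter` generator and the `students_in` function are hypothetical names introduced for illustration, and the data is taken from Table 2.2.

```python
import itertools
from dataclasses import dataclass, field

# Hypothetical stand-in for the system-generated object identifier (OID).
_oid_counter = itertools.count(1)


@dataclass
class Student:
    stud_id: str
    name: str
    gpa: float
    dept_name: str
    # The OID is assigned automatically and is excluded from equality checks,
    # so it does not depend on the attribute values (the object state).
    oid: int = field(default_factory=lambda: next(_oid_counter),
                     init=False, compare=False)


# Data from Table 2.2, held as a collection of objects rather than rows.
STUDENTS = [
    Student("s11", "John", 3.6, "CS"),
    Student("s12", "Megan", 3.0, "EE"),
    Student("s13", "Amy", 3.25, "ME"),
    Student("s14", "Kevin", 3.9, "CS"),
]


def students_in(dept_name, min_gpa):
    """Parameterized view: the parameters are used in the conditions."""
    return [s.name for s in STUDENTS
            if s.dept_name == dept_name and s.gpa >= min_gpa]


if __name__ == "__main__":
    # The result is a collection of string objects, not a table of rows.
    print(students_in("CS", 3.5))

    # Two objects with identical state are still two different objects.
    a = Student("s20", "Ann", 3.0, "EE")
    b = Student("s20", "Ann", 3.0, "EE")
    print(a == b, a.oid != b.oid)
```

Calling `students_in("CS", 3.5)` plays the role of `method_name (parameters)`, and the two `Ann` objects show why the object state alone cannot serve as an identifier.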
Chapter 3
DESIGN OF OBJECT-ORIENTED DATABASES

3.1 Project Requirements:

This project requires a working computer with Windows 7 or a higher version of the OS, Microsoft Office 2007 or higher, Microsoft SQL Server 2008 or higher and Microsoft Visual Studio 2008 or higher.

3.2 Object Model:

A data model [8] is a logical grouping of real world objects with constraints on them and relationships among them. A database language is a concrete syntax for a data model [8]. A database system implements a data model. The major purpose of data models is to support the development of databases by providing the description and format of data. In object-oriented databases, a data model is known as an object model.

An OODB supports modeling and creating data as objects. It must have object-oriented features such as inheritance, polymorphism and encapsulation. These features enable the storage and retrieval of complex data objects. An OODB handles complex data such as graphics, video and CAD applications.

The implementation phase starts with the design of the object model for analysis. This project discusses the design of an object model for a student database. Figure 3.1 shows the object model for the student database.

[Figure 3.1: Object Model. Classes shown: Student (attributes stud_id, name, ssn, dob, address, phone; methods gpacal( ), register_course( )); Registration (attribute term; method checkEligibility( )); Courses (attributes course_no, title, faculty_name, course_hours; method enrollment( )); Graduate_Student (attributes undergrad_major, gre_score, unit_per_fee; method cal_tution_fee( )); Undergrad_Student (attributes major, unit_per_fee; method cal_tution_fee( )). Student and Registration are linked 1 to 1, Registration and Courses are linked many to many, and Graduate_Student and Undergrad_Student are derived from Student.]

In Figure 3.1, each rectangle stands for a class. Each rectangle is divided into three parts: the first part holds the class name, the second part holds the attributes of the class and the third part holds the behavior or methods of the class. Here I have defined five classes - Student, Registration, Courses, Graduate_Student and Undergrad_Student. Lines represent the relationships between connected classes.
For example, the Student and Registration classes have a 1 to 1 relationship, which means that a student should first check whether he or she is eligible to register for that term. There is also a many to many relationship between Registration and Courses: once the student is eligible to register for the term, the student can register for courses.

Figure 3.1 also contains a parent and child relationship, which means that the child can use the properties of the parent. Student is the base class, and Graduate_Student and Undergrad_Student are derived classes. Each class has its own attributes and methods defined, but the Graduate_Student and Undergrad_Student classes also inherit the properties of the Student class.

3.3 Installation:

The next step in the implementation requires the installation of SQL Server, Visual Studio and Microsoft Office. The software versions are SQL Server 2008, Visual Studio 2008 and Office 2007. All software was obtained from the MSDN website.

After installation, the next task was to create a database. After the databases were created successfully, data was inserted and different types of Select and Join queries were implemented on them.

3.4 Database Schema:

The next phase is to design the logical schema for the object model (from Figure 3.1). This project uses the C++ language for the creation of classes, attributes and methods. The schema is designed in Visual Studio.

A class is specified using the “class” keyword and consists of attributes, relationships and methods. An attribute is specified using the “attribute” keyword, a type and an attribute name. The type of an attribute can be a basic type or a structure type. Basic types are integer, float, character, string etc. A structure type [14] groups a fixed set of labeled objects, possibly of different types, into a single object. Structure types are defined using the “struct” keyword, the structure name, an opening brace (i.e., { ), the fixed set of objects, a closing brace (i.e., } ) and a semicolon.
Methods are specified by writing the return type (e.g. integer, float) along with the method name and parentheses. Relationships explain how the classes are related to each other. A relationship is specified using the “relationship” keyword and the name of the class it is related to.

The logical schema for the student database (from Figure 3.1) is shown below,

   class Student
   {
      (extent students key stud_id)
      attribute string stud_id;
      attribute Name name;
      attribute integer ssn;
      attribute Date dob;
      attribute Address address;
      attribute integer phone;
      relationship Registration belongs_to Student;
      float gpacal( );
      integer register_course( );
      abstract float cal_tution_fee( );
   };

   struct Name
   {
      string first;
      string middle;
      string last;
   };

   struct Address
   {
      string street;
      string number;
      string city;
      string state;
      string zip;
   };

   class Registration
   {
      attribute integer term;
      relationship Student belongs_to Registration;
      relationship set <Courses> takes inverse Courses::taken;
      integer checkEligibility( );
   };

   class Courses
   {
      attribute string courseno;
      attribute string title;
      attribute Name faculty_name;
      attribute integer course_hours;
      relationship set <Registration> taken inverse Registration::takes;
      string enrollment( );
   };

   class Graduate_Student extends Student
   {
      attribute string undergrad_major;
      attribute integer gre_score;
      attribute integer unit_per_fee;
      float cal_tution_fee( );
   };

   class Undergrad_Student extends Student
   {
      attribute string major;
      attribute integer unit_per_fee;
      float cal_tution_fee( );
   };

3.5 Queries:

To query object-oriented databases, Object Query Language (OQL) was used. OQL is a query language for OODB which is similar to Structured Query Language (SQL). SQL has a Data Definition Language (DDL) and a Data Manipulation Language (DML). Similarly, in OODB there is an Object Definition Language (ODL) and an Object Manipulation Language (OML). ODL is used to specify the logical schema for the object database.
OML is used to manipulate objects, for example by inserting, deleting or updating them in an object database.

Before implementation, one task that needs to be done is to choose the types of queries for analysis. This project focuses on the design of Select and Join queries based on the logical schema designed in Section 3.4.

Select Query: Like SQL, OQL uses the select-from-where structure to write queries. A Select query displays records of a class from the database, optionally restricted by conditions. Using the schema from Section 3.4, below are some examples of Select queries [15] [16],

1. Display the title of the course whose course number is CS201.

   select c.title
   from Courses c
   where c.courseno = "CS201"

2. Display the names of graduate students who live in “Sacramento” city and whose gpa is greater than or equal to 3.5.

   select g.name, g.unit_per_fee
   from Graduate_Student g
   where g.address.city = "Sacramento" and g.gpa >= 3.5

3. Display the course title, course number and faculty name of all courses that have an enrollment of less than 20 students.

   select distinct struct (CourseNo: c.courseno, title: c.title, FacultyName: c.faculty_name,
      (select x from c.offers x where x.enrollment < 20))
   from Courses c

4. Display the names of all undergraduate students whose major is CS and who have registered for the course CS159.

   select u.name.last, u.name.first, u.name.middle
   from (select s from Undergrad_Student s where s.major = "CS") as u
   where u.register_course = "CS159"

5. Display the name of the faculty member who teaches the student named David Lee.

   select distinct (c.last, c.first)
   from Courses c, c.teaches s
   where s in (select x from sections x, x.is_taken_by t
      where t.last = "Lee" and t.first = "David")

Join Query: Like SQL, an OQL query can join classes in the where clause. A Join query displays data from two or more classes based on a relationship. Using the schema from Section 3.4, below are some examples of Join queries [15] [16],

1. Display all course numbers and names which were offered in “Fall 2011”.
   select distinct (c.courseno, c.title)
   from Courses c, Registration r
   where r.belongs_to c and r.term = "Fall 2011"

2. List all courses taken by the student David Lee.

   select c.courseno, c.title
   from students s, s.takes x, x.belongs_to c
   where s.last = "Lee" and s.first = "David"

3. List all graduate students who have taken exactly 3 courses, less than 3 courses and more than 3 courses.

   select s.last, s.first
   from Graduate_Student s, s.takes x, x.belongs_to c
   group by less: count (s.takes) < 3,
      equal: count (s.takes) = 3,
      greater: count (s.takes) > 3

4. Display the number of students enrolled in and eligible for taking course CS 206.

   select c.enrollment
   from Courses c, c.belongs_to r
   where c.courseno = "CS 206" and r.checkEligibility = 'Y'

Chapter 4
QUERY PROCESSING AND OPTIMIZATION IN OODB

Processing and optimization of a query are very important for effective retrieval of data from databases. In today’s world, people carry out transactions online and usually do not like to wait; they need the response as soon as possible. Even with fast databases, the response time may be long because many people visit the same site at the same time. So, to reduce the response time, we need to optimize the query.

4.1 Query Processing for OODB:

Once the query is submitted to the database, the next step is query processing. In this step the given query is translated into an algebraic form. In object-oriented databases, there are two new approaches used for query processing,

i. Hybrid-hash pointer (HP) based approach
ii. Multiwavefront (MW) approach

4.2 Hybrid-Hash Pointer (HP) based approach:

The HP approach [5] [12] uses the horizontal partitioning technique. In horizontal partitioning [12], instances of an object class are divided horizontally into segments which are stored across multiple nodes in the system. They are fetched from memory for processing.

The steps for query processing are [5] [17],

i.
The query written in OQL (Object Query Language) is submitted to the database. The scanner then selects all the required tokens and checks the syntax and semantics of the query. It also checks whether the attribute and table names given in the query match the database.
ii. Once the required checking of the query is completed, the query is represented as a query tree. In the query tree, nodes represent algebraic operators and leaves represent entity/object classes. This tree is processed in leaves-to-root order. Below are examples of query trees using the Select and Join queries described in Chapter 3.

Select Query: Display the name of the faculty member who teaches the student named David Lee.

   select distinct (c.last, c.first)
   from Courses c, c.teaches s
   where s in (select x from sections x, x.is_taken_by t
      where t.last = "Lee" and t.first = "David");

The query tree for the Select query is shown in Figure 4.1,

[Figure 4.1: Query Tree for Select query in OODB. The root selects c.last, c.first from Courses C, Student S; its where branch applies “in” to s and a nested select of x from section x, Courses t, whose where branch joins the conditions t.last = Lee and t.first = David with “and”.]

Join Query: Display the number of students enrolled in and eligible for taking course CS 206.

   select c.enrollment
   from Courses c, c.belongs_to r
   where c.courseno = "CS 206" and r.checkEligibility = 'Y'

The query tree for the Join query is shown in Figure 4.2,

[Figure 4.2: Query Tree for Join query in OODB. The root selects c.enrollment( ) from Courses C, Registration R; its where branch joins Courses and Registration under the conditions c.courseno = CS206 and R.CheckEligibilty( ) = ‘Y’.]

iii. Once the query is represented as a query tree, the processing phase starts. During processing, the second condition in the query is selected and scanned. The selected object instances are then hashed on their object instance identifiers. For each object instance selected, if its instance identifier falls in the required bucket, the instance is placed in memory. Otherwise it is placed in the buffer and later written to the disk.
iv.
Then the first condition is selected and its instances are scanned. Each first-condition instance is passed to the nodes that store the second-condition instances it points to. On receiving a first-condition instance, the processing node hashes it. If the instance hashes to the required bucket of the second condition, the join operation is performed; otherwise the instance is stored in the buffer and later written to disk.
v. During the processing of a query, all the temporary results, which contain instances of object references, are constructed.
vi. The instances of each object class are horizontally partitioned and stored across a large number of processing nodes when the database is established. These partitioned segments can thus be read from memory and processed in parallel by their corresponding processors. Instances of temporary results are transferred among processors to perform a join operation or its equivalent.

Below is the pseudo code which I have designed for the processing steps and horizontal partition:

//Get table definition
tableDef = objectDatabase.Tables(tableName)
//Create a new table based on the table definition
tempName = new tableDef
/*Get the values of the current table and check them against the second condition.
If found, hash them to the required buffer*/
foreach value in tableDef
{
    currentValues = value
    if (currentValues == secondCondition)
    {
        objectTemp = hash(buffer, currentValues)
    }
}
/*Check Constraint for first condition.
If found, hash it with the matching second condition and perform the join operation.*/
if (firstCondition == objectTemp)
{
    newResults = hash(objectTemp.Name, firstCondition)
}
//Store new results in the new table
tempName = newResults
//Partition based on the rows in the new table
lengthTable = len(tempName)
if (k < lengthTable)
    return tempName[k]
i = 0
sum = 0
while ((k - i) >= 0)
{
    j = (i - 1) / 2;
    sum += j * (tempName[k - i])
    i++
}
return partition(k, sum)

4.3 Multiwavefront (MW) approach:
The MW approach [5] [12] [13] uses both horizontal and vertical partitioning for optimization. In vertical partitioning, the attributes of an object class are divided into vertical columns. Attributes are grouped into the same vertical column based on how frequently they are used together. Attributes with complex data types such as video, voice, image, or graph can be partitioned into separate columns, as they occupy a lot of memory. These columns of data are stored separately and thus can be accessed from memory independently.
The steps for query processing are [5] [12] [13] [17]:
i. The high-level query written in OQL (Object Query Language) is submitted to the database. The scanner then extracts all the required tokens and checks the syntax and semantics of the query. It also checks whether the attribute and table names given in the query match those in the database.
ii. The MW approach uses a graph-based processing strategy. Below are example query graphs for the same Select and Join queries used in the HP approach.
Select Query: Display the name of the faculty who teaches the student named David Lee.
select distinct (c.last, c.first)
from Courses c, c.teaches s
where s in (select x from sections x, x.is_taken_by t where t.last = "Lee" and t.first = "David");

The query graph for the Select query is shown in Figure 4.3.
(Figure content, node and edge labels: Student Name; Section x; Courses t; last = Lee and first = David)
Figure 4.3: Query Graph for Select query in OODB

Join Query: Display the number of students enrolled in and eligible to take course CS 206.
select c.enrollment
from Courses c, c.belongs_to r
where c.courseno = "CS 206" and r.checkEligibility = 'Y'

The query graph for the Join query is shown in Figure 4.4.
(Figure content, node and edge labels: Courses; courseno = CS 206; enrollment( ); Join; registration; CheckEligibility( ) = Y)
Figure 4.4: Query Graph for Join query in OODB

iii. Once the query is represented as a query graph, the processing phase starts. The MW approach processes the query in two phases. In the first phase, the selection operation is performed and the instance identifiers that satisfy the selection condition are transmitted among processors. In the second phase, attributes are retrieved only for those objects that satisfied the search in the first phase. This approach avoids the creation of large temporary tables, reduces the amount of data transferred among processors, and reduces the I/O requirement.
iv. In a large database, each class can have a large number of attributes and methods. The instances of each class are first horizontally partitioned and stored in a number of processing nodes (as in the HP approach); then each node is vertically partitioned into columns. This vertical partitioning allows the values of an attribute with a complex data type to be independently accessed and processed in memory.

Below is the pseudo code which I have designed for the processing steps and vertical partition:

//Get table definition
Set tableDef = objectDatabase.Tables(tableName)
//Get the values of the current table and check them against the condition.
//If found, store it in buffer
foreach value in tableDef
{
    currentValues = value
    if (currentValues == (Condition))
    {
        objectStore = currentValues
    }
}
//using the same horizontal partition steps as defined in HP approach
lengthTable = len(tableDef)
if (k < lengthTable)
    return tableDef[k]
i = 0
sum = 0
while ((k - i) >= 0)
{
    j = (i - 1) / 2;
    sum += j * (tableDef[k - i])
    i++
}
return partition(k, sum)
//partition based on the columns after horizontal partition is done
x = sizeof(items)
j = x / items
if ((x % items) != 0)
    j++
a = new array;
i = 0
foreach (array_item in x)
{
    a[i] = array_item;
    i++
    if (i == j)
    {
        divide array based on j items
    }
}

4.4 Query Optimization:
The next phase is query optimization. This project uses two functions, Get and Priority, to optimize and rewrite the query. The Get function scans the tables and objects in memory (i.e., after partitioning is completed in the processing step) that are needed and stores the table and object names. The Priority function then gathers all the inter-related objects and tables together in a scope, prioritizes them, and calls them based on priority. In the examples below, the Course class is not called until the Section and Student conditions are matched.

Below is the pseudo code which I have designed for the Get and Priority functions:

//Get function - takes OODB query as input
get(query)
{
    word = 0;
    //scan each word in query
    foreach word in query
    {
        word++;
        while (word >= "from" and word < "where")
            i = word + 1;
    }
    //once the scanning of query is done, store table names in a container
    ObjectContainer tableNames = Container.openquery(query);
    for (j = 1; j <= i; j++)
    {
        temp[j] = dbquery.delegate(query[i]);
        tableNames = temp[j].
retrievetables();
    }
}

// Priority function
//using pquery.h header file for assigning priority to the tables
#include "pquery.h"
Priority(query, tableNames)
{
    pquery pq;
    n = total_tables(tableNames);
    //scanning each word in query to see which data columns are output
    word = 0;
    //scan each word in query and put the table columns in array
    foreach word in query
    {
        while (word >= "select" and word < "from")
            i = word++;
        array[i] = word;
    }
    //check which table each column belongs to
    foreach tableNames
    {
        x = total_columns(i);
        y = sizeof(tableNames[j])
        for (j = 0; j < y; j++)
        {
            for (i = 0; i < y; i++)
            {
                if (array[i] == tableNames)
                {
                    flag = true;
                    return flag;
                }
            }
            if (flag)
                priority pq is assigned
        }
    }
}

Based on the get( ) and priority( ) functions, the first phase of optimization is shown below. It uses the same Select query described in the HP and MW approaches.

Display the name of the faculty who teaches the student named David Lee.
select distinct (c.last, c.first)
from Courses c, c.teaches s
where s in (select x from sections x, x.is_taken_by t where t.last = "Lee" and t.first = "David");

The rewriting for the Select query is shown in Figure 4.5.
(Figure content, operator labels: Project c.last, c.first; Mat Course c; Mat Student t; Select t.last = lee and t.first = David; Mat Section x; Get Courses c, Section x, Student t)
Figure 4.5: Rewriting phase

This rewriting of the query can be further optimized using operators such as index scan, assembly (used for joining tables) and filter (sorting based on a condition). Figure 4.6 shows the further optimization of Figure 4.5 using the scan operators.
(Figure content, operator labels: Project c.last, c.first; Assembly Courses; Scan Section c; Scan c.teaches s; Assembly Student; Scan s; Filter s.last = lee, s.first = David)
Figure 4.6: After Optimization phase

Chapter 5
RESULTS

This chapter presents the execution time, memory usage and response time results for the Select and Join queries.
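As an illustration of how the three metrics reported in this chapter might be collected, the following is a minimal Python sketch and not the measurement code used in this project. It assumes that execution time can be approximated by CPU time (time.process_time), response time by elapsed wall-clock time (time.perf_counter), and memory usage by the peak allocation reported by tracemalloc. The helper measure, the stand-in query select_by_name and the sample data are hypothetical names introduced only for this example. The units follow the figures in this chapter: seconds, milliseconds and bits.

```python
import time
import tracemalloc


def measure(query_fn, *args):
    """Run query_fn once and report CPU time, elapsed time and peak memory.

    Assumption: CPU time stands in for execution time and wall-clock time
    stands in for response time; a real system may define them differently.
    """
    tracemalloc.start()
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = query_fn(*args)
    cpu_seconds = time.process_time() - cpu_start
    elapsed_seconds = time.perf_counter() - wall_start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    metrics = {
        "execution_s": cpu_seconds,                # seconds of CPU time
        "response_ms": elapsed_seconds * 1000.0,   # elapsed milliseconds
        "memory_bits": peak_bytes * 8,             # peak traced memory in bits
    }
    return result, metrics


# Hypothetical stand-in for a Select query over an in-memory list of students.
def select_by_name(students, last, first):
    return [s for s in students if s["last"] == last and s["first"] == first]


STUDENTS = [
    {"last": "Lee", "first": "David"},
    {"last": "Lee", "first": "Anna"},
    {"last": "Smith", "first": "David"},
]

rows, metrics = measure(select_by_name, STUDENTS, "Lee", "David")
print(rows)
print(sorted(metrics))
```

Running each query several times and averaging the metrics would reduce noise from interrupts and system services, which the execution time also includes.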
5.1 Select Query Result:
Based on the two processing and optimization techniques, Figure 5.1 shows the execution time needed to execute the Select query.

Figure 5.1: Execution time of Select query

In Figure 5.1, the x-axis represents the query number (from Section 3.5 using Select Query) and the y-axis represents time in seconds. Execution time is the time taken by the CPU to execute the query. It also includes the time spent on run-time services, system services and any interrupts that arrive during execution. Figure 5.1 shows that the MW approach is faster than the HP approach. The MW approach is faster because it selects only the data columns that are needed and loads them into memory.

Figure 5.2 shows the memory usage for the Select query.

Figure 5.2: Memory Usage

In Figure 5.2, the x-axis represents the query number (from Section 3.5 using Select Query) and the y-axis represents memory in bits. Figure 5.2 shows that the MW approach requires less memory, as no temporary tables or objects are created. The MW approach is therefore more efficient in terms of memory requirements.

Figure 5.3 shows the response time for the Select query.

Figure 5.3: Response Time

In Figure 5.3, the x-axis represents the query number (from Section 3.5 using Select Query) and the y-axis represents response time in milliseconds. Response time is the time taken by the system to respond to the user with the results of the query. Figure 5.3 shows little difference in response time between the MW and HP approaches.

5.2 Join Query Result:
Based on the two processing and optimization techniques, Figure 5.4 shows the execution time needed to execute the Join query.

Figure 5.4: Execution time of Join query

In Figure 5.4, the x-axis represents the query number (from Section 3.5 using Join Query) and the y-axis represents time in seconds. It can be seen that the MW approach is faster than the HP approach.
The MW approach is faster because it selects only the necessary data columns from the two tables and loads them into memory.

Figure 5.5 shows the memory usage for the Join query.

Figure 5.5: Memory Usage

In Figure 5.5, the x-axis represents the query number (from Section 3.5 using Join Query) and the y-axis represents memory in bits. Figure 5.5 shows that the MW approach occupies less memory, as no temporary tables or objects are created. The MW approach is therefore more efficient in terms of memory requirements.

Figure 5.6 shows the response time for the Join query.

Figure 5.6: Response Time

In Figure 5.6, the x-axis represents the query number (from Section 3.5 using Join Query) and the y-axis represents response time in milliseconds. Figure 5.6 shows that the response time of the MW approach is better than that of the HP approach. It is easier for the system to respond because the MW approach stores only the selected columns that are needed in memory.

Chapter 6
CONCLUSION AND FUTURE WORK

The results in Chapter 5 indicate that the multiwavefront approach performs better than the hybrid-hash pointer based approach. The reasons are as follows:
i. The MW approach takes less time and memory than the HP approach because:
During the processing of queries in the HP approach, temporary results containing instances of object references are constructed, whereas in the MW approach no temporary tables are created.
During the selection of data in the MW approach, only the needed data columns are selected and loaded into memory, whereas in the HP approach all the data columns are loaded since the data are not stored separately. Therefore, the I/O cost of the MW approach is lower than that of the HP approach.
ii. The HP approach supports only horizontal partitioning, whereas the MW approach supports both horizontal and vertical partitioning.
Vertical partitioning is important because it allows columns that hold complex data to be independently accessed and processed in memory. It stores data in small chunks of memory, which makes scanning and searching the data easier.

6.1 Conclusion:
Based on the results presented in Chapter 5 and the reasons discussed above, the MW approach performs better than the HP approach in terms of execution time and memory usage. For the Select query, the response times of the two approaches were similar, but for the Join query the response times showed some difference between the two approaches. For applications in which execution time and memory usage are a concern, the MW approach is a better choice than the HP approach based on this performance analysis.

6.2 Future Work:
Future work may include the following:
i. The HP approach can be remodeled by removing the construction of temporary tables. Its performance may not equal that of the MW approach, but it should be better than its current performance.
ii. The analysis can be extended to other parameters such as security, portability and resource management.
iii. Object-oriented queries for Delete, Update and Insert can be designed and their performance compared.

BIBLIOGRAPHY
[1] The Wikipedia link for database index [Online] http://en.wikipedia.org/wiki/Database_index
[2] Michael L. Rupley, Jr., “Introduction to Query Processing and Optimization”, Technical Report, Indiana University at South Bend, p.2-4, Jan 2008.
[3] The Wikipedia link for object-oriented database management system [Online] http://en.wikipedia.org/wiki/Object_database
[4] Mathieu Metz, Palani Kumaresan, Napa Gavinlertvatana, Kristine Pei Keow Lee, Prabhu Ramachandran, “Object Oriented Databases”, presentation by students of the University of California, Berkeley, 8th December 2004 [Online] http://ieor.berkeley.edu/~goldberg/courses/F04/215/215-OODB.ppt
[5] Stanley Y.W.
Su, Fellow, IEEE, Sanjay Ranka, Member, IEEE, and Xiang He, “Performance Analysis of Parallel Query Processing Algorithms for Object-Oriented Databases”, IEEE Transactions on Knowledge and Data Engineering, Vol. 12, No. 6, p.356-375, November/December 2000.
[6] Forum for differences between relational databases and object-oriented databases [Online] http://forum.world.st/Object-IDs-vs-relational-keys-td1596002.html
[7] Michael Blaha (blaha@computer.org), “Referential Integrity Is Important For Databases”, Modelsoft Consulting Corp, www.modelsoftcorp.com
[8] David Maier, Ming-Ju Lee and Andreas Gruenhagen, “Object-Oriented Database Theory - An Introduction & Indexing in OODBS”, white paper [Online] http://www.csd.uoc.gr/~hy562/Papers/OODBMS.pdf
[9] SQL and Views [Online] http://www.w3schools.com/sql/sql_view.asp
[10] Francis Chu, Joseph Y. Halpern, Praveen Seshadri (Department of Computer Science, Cornell University), “Least Expected Cost Query Optimization: An Exercise in Utility”, the 25th International Conference on Very Large Data Bases, p.411-422, March 2005.
[11] Presentation on Fundamentals of Database Systems by Elmasri and Navathe [Online] http://faculty.kfupm.edu.sa/ICS/mwaslam/ICS324/ENACh15final.ppt
[12] MSDN library for partition [Online] http://msdn.microsoft.com/en-us/library/ms178148.aspx
[13] A.K. Thakore, S.Y.W. Su, and H. Lam, “Algorithms for Asynchronous Parallel Processing of Object-Oriented Databases”, IEEE Trans. Knowledge and Data Eng., March 1995.
[14] The Wikipedia link for struct in programming language [Online] http://en.wikipedia.org/wiki/Struct_(C_programming_language)
[15] Michael Kifer, Won Kim and Yehoshua Sagiv, “Querying Object-Oriented Databases”, ACM SIGMOD Conference on Management of Data, San Diego, CA, June 1992.
[16] Jay Banerjee, Won Kim and Kyung-Chang Kim, “Queries in Object-Oriented Databases”, IEEE Trans. Knowledge and Data Eng., August 6th, 2002.
[17] M. Tamer Ozsu and Jose A. Blakeley, “Query Processing in Object-Oriented Database Systems”, In Proc. ACM SIGMOD Int. Conf. on Management of Data, p.312-321, October 2000.
[18] Hennie J. Steenhagen, Peter M.G. Apers, Henk M. Blanken, Rolf A. de By, “From Nested Queries to Join Queries in OODB”, Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994.