Student: Deepa Prakash Kapai

advertisement
ANALYZING THE PHASES OF QUERY PROCESSING
Deepa Prakash Kapai
B.E, Visvesvaraya Technological University, India, 2007
PROJECT
Submitted in partial satisfaction of
the requirements for the degree of
MASTER OF SCIENCE
in
COMPUTER SCIENCE
at
CALIFORNIA STATE UNIVERSITY, SACRAMENTO
SPRING
2012
ANALYZING THE PHASES OF QUERY PROCESSING
A Project
by
Deepa Prakash Kapai
Approved by:
__________________________________, Committee Chair
Mary Jane Lee, Ph.D.
__________________________________, Second Reader
Robert Buckley, M.S.
____________________________
Date
ii
Student: Deepa Prakash Kapai
I certify that this student has met the requirements for format contained in the University
format manual, and that this project is suitable for shelving in the Library and credit is to
be awarded for the Project.
__________________________, Graduate Coordinator
Nikrouz Faroughi, Ph.D.
Department of Computer Science
iii
___________________
Date
Abstract
of
ANALYZING THE PHASES OF QUERY PROCESSING
by
Deepa Prakash Kapai
Databases are fundamental part of managing data. Storing multimedia-based objects like
graphics image, videos, audio, etc in databases is a basic need of everyday users.
However, storing these objects in the relational databases is not possible because
relational databases support single value data. To store these complex objects, objectoriented databases were introduced. Object-oriented databases support all the features of
object-oriented
programming
languages,
such
as,
inheritance,
encapsulation,
polymorphism, etc, along with other features of the databases. This project provides an
introduction to the object-oriented databases along with its features, design of an object
model and design of the classes, objects and queries.
Another major functionality of a database is to respond to the users, i.e., process user
queries. Efficiency with speed is an important feature in today’s market. So there is a
need to properly process and optimize a query. This project provides the basic idea of
hydrid-hash pointer (HP) based approach and multiwavefront (MW) approach for
processing the queries. The focus of this project is to design an optimization function and
implement this function on Select and Join queries. This project analyzes the
iv
performance of the Select and Join queries. Performance factors will include query
execution time, memory space requirement and response time. This project makes
recommendations for novice object-oriented database developers in developing
applications where data retrieval time (execution time plus response time) is an important
factor.
, Committee Chair
Mary Jane Lee, Ph.D.
_______________________
Date
v
ACKNOWLEDGEMENT
This space provides me a great honor to thank all the people with whose support this
project and my masters have been a success. I would take this opportunity to convey my
sincere thank you to all.
Firstly, I would like to thank Dr. Mary Jane Lee, her help and supportive guide
throughout this project in highs and lows has been commendable. She took extra effort to
review the report and kept on giving her pieces of advice during my course of project
completion.
I would like to also thank Prof. Robert Buckley for his extended support. He provided me
with his valuable suggestions when they are really needed.
Furthermore, I would like to thank the Department of Computer Science at California
State University, Sacramento for extending this opportunity for me to pursue my Masters
degree and guiding me all the way to become a successful student.
Last but not the least; I am thankful to my parents Prakash H Kapai and Bindu P Kapai
for their constant support and belief in me, their words of wisdom and moral support
helped me overcome all the challenges and through their guidance I was able to
successfully complete my project and earn my Masters Degree.
vi
TABLE OF CONTENTS
Page
Acknowledgement ............................................................................................................. vi
List of Tables ..................................................................................................................... ix
List of Figures ..................................................................................................................... x
Chapter
1. INTRODUCTION .......................................................................................................... 1
1.1 Query Processing and Optimization…………………………………………..2
1.2 Need of Query Processing and Optimization…………………………………2
1.3 Goal of the Project………………………………………………………...…..4
2. OBJECT-ORIENTED DATABASES………………………………………………….5
2.1 Object-Oriented Concepts…….……………………………………………….5
2.2 Object-Oriented Database Concepts…..……..……………..…………………6
2.3 Advantages of Object-Oriented Databases……………………………….….10
2.4 Object-Oriented Databases vs Relational Databases………………………...10
3. DESIGN OF OBJECT-ORIENTED DATABASES………………………………….14
3.1 Project Requirements………………………………………………………...14
3.2 Object Model……………..…….……………………………………………14
3.3 Installation……………………………………………………………………16
3.4 Database Schema………..………………………………………………...…17
3.3 Queries…………………………………………….…………………………20
4. QUERY PROCESSING AND OPTIMIZATION IN OODB………………………...24
4.1 Query Processing for OODB………………..……………………….………24
4.2 Hybrid-Hash Pointer (HP) based approach………………………………….24
4.3 Multiwavefront (MW) approach…………………………..………………...29
4.4 Query Optimization……….………………………………………………....34
5. RESULTS……………………………………………………………………………..38
5.1 Select Query Result…….…………………………….………………………38
vii
5.2 Join Query Result………..……………..…………….………………………41
6. CONCLUSION AND FUTURE WORK……………………………………………..44
6.1 Conclusion……….…….…………………………….………………………45
6.2 Future Work……………..……………..…………….………………………45
Bibliography……………………………………………………………………………..46
viii
LIST OF TABLES
Page
1. Table 2.1: Object state interpreted based on type constructor………………………….8
2. Table 2.2: Student table……………………………………………………………….11
3. Table 2.3: Output from Relational Databases…………………………………………12
4. Table 2.4: Output from Object-Oriented Databases…………………………………..12
ix
LIST OF FIGURES
Page
1. Figure 2.1: Built-in interface…………………………………………………………....9
2. Figure 3.1: Object Model……………………………………………………………...15
3. Figure 4.1: Query Tree for Select query in OODB ………………….………………..26
4. Figure 4.2: Query Tree for Join query in OODB…...…………………………………27
5. Figure 4.3: Query Graph for Select query in OODB….………………………………31
6. Figure 4.4: Query Graph for Join query in OODB……………………………………32
7. Figure 4.5: Rewriting Phase…………………………………………………………...36
8. Figure 4.6: After Optimization phase………………………………………………....37
9. Figure 5.1: Execution time of Select query…………………..……………………….38
10. Figure 5.2: Memory Usage…………….…………………………………………….39
11. Figure 5.3: Response Time…………………………………………………………..40
12. Figure 5.4: Execution time of Join query………………………………………….…41
13. Figure 5.5: Memory Usage…………….…………………………………………….42
14. Figure 5.6: Response Time…………………………………………………………..43
x
1
Chapter 1
INTRODUCTION
Storing huge amount of data across the globe by large organizations has become an
integral part of its infrastructural development. Millions of basic transactions, such as,
withdrawal of money from ATM, paying credit card bill online, are conducted daily. By
storing data in digital form, daily transactions have become efficient and fast.
Today storage of data is not a problem. These days devices are available which can store
huge amounts of data in gigabytes, e.g., devices like iPods and flash devices. Therefore
the problem today is not primarily the storage of data. Storage of data has become much
easier and the cost for storing data has also fallen. However, this fall in cost and data
storage has resulted in new challenge - the retrieval of data. How efficiently can we
retrieve particular data from the gigantic stack of data? Earlier large databases were
usually meant for storing data of size 100MB or few giga bytes. But today large
databases can store 1015 bytes of data.
To get the data fast, indexes were introduced in databases. A database index [1] is a data
structure that improves the speed of data retrieval operations on a database table at the
cost of slower writes and increased storage space. Indexes can be created using one or
2
more columns of a database table, providing the basis for both rapid random lookups and
efficient access of ordered records [1]. However, indexes alone were not helpful in the
efficient retrieval of data. Another very important element in efficient retrieval of the data
is the “Query Processing and Optimization”.
1.1 Query Processing and Optimization:
Query Processing [2] is used to obtain the desired and particular information from a
database system in a predictable and reliable manner. Query Optimization [2] is used to
obtain results back in timely manner. Query Processing and Optimization are extremely
important aspects of DBMS. They help to determine how long a particular query takes to
retrieve specific data. From this, we can differentiate whether the query is interactive or
batched in nature. An interactive query is one which gives the results immediately.
However, the batched query retrieval of information isn’t prompt rather it lags and takes
time.
1.2 Need of Query Processing and Optimization:
Let’s take an example to see why query processing and optimization is so useful.
Suppose there is 1GB of data in the student database. And the user wants to retrieve all
the names of the students who are on student visa from computer science department and
have GPA greater than or equal to 3.5. The user uses the “Select” query to display the
student names,
3
select studname from student where visa = ‘international’ and department = ‘CS’ and gpa
>=3.5.
This Select query requires scanning through the 1GB of data which can take about 1000
seconds.
Now consider a query which is on two different tables and each table is of 1GB of data.
A bad query execution plan would compute Cartesian product of two tables before it
returns the results. If the user has to compute the Cartesian product of 1GB of data times
1GB of data where each access of table will take around 1000 seconds, then the user will
get back the results in 15 - 20 minutes. This plan is an ineffective method of executing
this query and deliverance.
An efficient query execution would try to rework a given query in a more effective
manner. To make the above query execution plan effective, we would first see what user
really wants and then use the join operation which is better than Cartesian product. If we
are able to figure this out, then the user will be able to get the results within 5 minutes
which is much more efficient than Cartesian product.
4
1.3 Goal of the Project:
For any application involving databases, one of the most important features is the
retrieval time. Hence, choosing the right processing and optimization technique is an
important decision for any database developer. The goal of this project is to understand
the steps involved in the query processing, design of Select and Join object-oriented
queries, design of an optimization function and apply this function on the Select and Join
queries. In addition this project will analyze the performance of these queries in terms of
execution time, response time and memory usage. This report is structured as follows:
Chapter 2 discusses about the object-orientation concepts in general and how these
concepts are applied in databases. Chapter 3 discusses about the design of object model,
database schema and queries. Chapter 4 discussed about hybrid-hash pointer based and
multiwavefront processing approach on Select and Join queries, design of the
optimization function and apply this optimization function on the Select and Join queries.
Chapter 5 discusses about the results of analysis using query designed. Finally the
conclusions and future works are discussed in Chapter 6.
5
Chapter 2
OBJECT-ORIENTED DATABASES
Object-Oriented databases [3] are a database management system in which information is
represented in the form of objects as used in object-oriented programming. The data of
the object can be accessed only by the methods associated with that object. ObjectOriented databases are mainly used in applications like computer aided design (CAD),
multimedia, GUI based application and so on. These applications are made up of
fundamental objects which are basic building blocks of the application.
2.1 Object-Oriented Concepts:
Below are the concepts that come from Object-Oriented programming language,
i.
Fundamental building blocks in an object-oriented system - “Object”. Object
represents an instance. Object belongs to a particular type and type is known as
“Class”. We can now think objects are variables of the type class.
ii.
“Abstractions” are used in representing the necessary features without including
any detailed explanation about it. It tells about the attributes present and what can
they do?
iii.
An object can wrap attributes and methods in a single unit using “encapsulation”.
6
iv.
“Interface” (also known as Signature of Object) to object allows exposure to the
outside world. Any external entity can interactive with objects through the
interface by calling particular methods.
v.
An “attribute” in a class which hold the data or information.
vi.
“Object State” tells about the values of each attributes.
vii.
“Message Passing” - When an external entity invokes a method of object, it is
said to have pass a message to the object. And the message in turn will invoke
methods of the object.
viii.
“Inheritance” is process by which objects of one class get the properties of objects
of another class
ix.
“Polymorphism” is ability to take more than one form, i.e., an operation may
exhibit different behavior in different instances. This behavior depends upon the
types of data used in the operation.
2.2 Object-Oriented Database Concepts:
Object-oriented database supports all the features of object-oriented programming
language. Below are the additional features which are in terms of object-oriented
databases [4],
i.
Persistent Objects [4] - Persistent means which can be present permanently.
Objects which can exist even after the program has finished using the object. That
7
means the object exists on some persistent storage like disk and can be re-read
back from the disk whenever required.
ii.
Object Identifiers (OID) [4], [5] - It is important to uniquely identify each
persistent object that is stored in the database. It is mandatory and system
generated. User need not to be aware of it. OID does not depend on the value of
attributes. It is similar to primary key which is used in relational databases. But
there are difference between object identifiers and primary key [6]:
a. Object Identifiers are automatically created when a new object is added to
the system whether user specifies it or not. Whereas for primary key, user
needs to specify what forms the primary key in the databases.
b. In relational algebra, each tuple is unique with table representing set of
tuples. So in worst case we can consider entire tuple as the primary key for
the table. However, object identifiers are separate attributes that is entire
object cannot uniquely identify given object. This is because two or more
objects belonging to the same type can have same state and hence be
indistinguishable as far as attributes are concerned. But still they represent
two different objects.
iii.
Object Structure [4]: Objects which are stored in databases are with direct
association to real world objects. Every instance of an object is characterized by
the state/structure of an object. The state/structure of object is defined as triple (i,
c, v). “i” stands for object identifier (OID). “c” is the type constructor and
specifies what type of value object will have. There are different types of
8
constructors like atom, tuple, set, list, bag and array. “v” is the object state. Object
state “v” is usually interpreted based on the constructor “c” which is shown below
in the Table 2.1,
Table 2.1: Object state interpreted based on type constructor
iv.
Type ‘c’
Object State ‘v’
Atom
Value in domain of basic values.
Set
OID = {i1, i2, i3, … , in}
Tuple
<a1 : i1, a2 : i2, … , an : in >
List
Ordered list [i1, i2, i3, … , in]
Array
Array of OID’s.
Instance Variables [4]: Attributes are defined at the class level. When an object is
instantiated, it becomes instance variables. Instance variable of different object
could be different even though they represent same attribute.
v.
Signature and Methods of objects: Just like object oriented programming
language, objects are defined by signature and methods which are the interfaces
of the objects.
vi.
Referential Integrity: It uses object identifiers, suppose when an object A refers to
another object B, these references are been captured by putting the OID’s of
object B as an attribute of object A. Referential integrity is enforced by ensuring
9
that at any point in time OID is represented as an attribute which is a valid OID. It
also maintains the dependencies between objects and avoiding dangling
references [7].
vii.
Extends: It is the collection of objects of same type. Type definition + collection
of instance forms extend.
viii.
Interface: In object-oriented database, we have
built-in interfaces which are
shown below in figure 2.1,
Object
Timestamp
Date
Time
Collection
Set
List
Bag
Interval
Array
Figure 2.1: Built-in interface
Dictionary
10
2.3 Advantages of Object-Oriented Databases:
The advantages of having object-oriented databases are,
i.
Object Identifiers are automatically generated when new object is added in
system.
ii.
We can have the structure of the object and their behavior [4].
iii.
It interacts well with object-oriented languages like Smalltalk, C++, and Java etc
[4][8]. Due to it, there is no extra effort needed to design a data layer for
interaction with the object oriented programming languages. Also it gives higher
performance.
iv.
Less programming effort because of inheritance, re-usability and extensibility of
code [8].
v.
Object oriented databases combine object oriented features with the database
features.
vi.
It provides integrated storage area of information which can be used by multiple
users, applications and so on.
vii.
It supports complex objects, abstract and multimedia data types.
2.4 Object-Oriented Databases vs Relational Databases:
Below are some of the differences between object-oriented and relational databases,
i.
Relational databases are made up of tables which consist of rows and columns.
Each column has name and can store single data value.
11
In object-oriented databases, data is in the form of objects. Objects comprises of
structure (i.e. variables) and behavior (i.e. methods).
ii.
Relational databases supports basic datatypes like integers, floating-point,
characters, strings and so on.
Object-oriented databases support basic datatypes as well as large objects like
images, videos, and audio and so on.
iii.
In relational databases, user needs to specify primary key. Otherwise by default
entire row is considered as the primary key.
Whereas in object-oriented databases, the system automatically generates object
identifiers.
iv.
Suppose we have student table (shown in Table 2.2) which consists of columns
like studID, studName, gpa, deptName.
Table 2.2: Student table
studID
Studname
Gpa
deptName
s11
John
3.6
CS
s12
Megan
3.0
EE
s13
Amy
3.25
ME
s14
Kevin
3.9
CS
12
Suppose we need all students name who belong to ‘CS’ department and have gpa
>=3.5. And we have query for it is,
select s.name from student s where s.deptName = ‘CS’ and s.gpa >= 3.5.
In relational databases, the output of the query is shown in the Table 2.3
Table 2.3: Output from Relational Databases
Studname
John
Kevin
Whereas in object-oriented databases, the output of the query is shown in the
Table 2.4,
Table 2.4: Output from Object-Oriented Databases
v.
String
String
John
Kevin
In above output (i.e. from iv), relational database returns table with rows.
Whereas the object-oriented databases returns a collection of objects.
13
vi.
In above query (i.e. from iv),‘s’ in relational database represents alias name for
the student table. Whereas in object-oriented database, ‘s’ represents persistent
objects.
vii.
View is a virtual table which consists of fields based on the result of the query.
The fields in the view are from one or more tables. In relational databases, views
are created as [9],
create view view_name as select column_name from table_name where
conditions.
Invoking a view can be done as,
Select * from view_name.
Whereas in objected-oriented databases, views are created using method name
and parameters. If parameters are used, then these parameters are used in the
conditions.
define method_name (parameters) as select column_name from table_name where
conditions.
Invoking a view is done by,
method_name (parameters).
14
Chapter 3
DESIGN OF OBJECT-ORIENTED DATABASES
3.1 Project Requirements:
The requirement for this project requires working computers having Windows 7 or higher
version of OS, Microsoft Office 2007 or higher version, Microsoft SQL Server 2008 or
higher version and Microsoft Visual Studio 2008 or higher version.
3.2 Object Model:
A data model [8] is logic group of real world objects with constraints on them and
relationships among them. A database language is a concrete syntax for a data model [8].
A database system implements a data model. The major purpose of data models is to
support the development of databases by providing the description and format of data.
In object oriented databases, a data model is known as object model. OODB supports
modeling and creating of data as objects. It must have object-oriented features like
inheritance, polymorphism, and encapsulation. These features enable the storage and
retrieval of complex data objects. OODB handles complex data like graphics, video,
CAD application.
15
The implementation phase starts with design of the object model for analysis. This
project discusses design of object model for student database. Figure 3.1 shows object
model for student database.
Student
stud_id
name
ssn
dob
address
phone
Registration
1
1
term
Courses
*
checkEligibility()
course_no
title
faculty_name
course_hours
enrollment( )
gpacal( )
register_course
()
Graduate _Student
*
Undergrad_Stude
nt
undergrad_major
gre_score
unit_per_fee
major
unit_per_fee
cal_tution_fee( )
cal_tution_fee( )
Figure 3.1: Object Model
16
From Figure 3.1, each rectangle stands for a class. In each rectangle, it is divided into
three parts - First part stands for class name, Second part stands for attributes for that
class and Third part stands for the behavior or methods of the class. Here I have defined
five classes - Student, Registration, Courses, Graduate_Student and Undergrad_Student.
Lines represent relation between connected classes. For example, Student and
Registration class have 1 to 1 relationship that means Student first should check whether
student is eligible to register for that term. There is also many to many relationship
between Registration and Courses that means once the student is eligible to register for
the term, then student can register for courses.
In Figure 3.1, there exists parent and child relationship that means child can use the
properties of the parent. Student is considered a base class and Graduate and Undergrad
Student are considered derived classes. Each class has its own attributes and methods
defined. But Graduate and Undergrad Student class will also inherit properties of the
Student class.
3.3 Installation:
Next step in implementation requires installation of SQL Server, Visual Studio and
Microsoft Office. The software version of SQL Server is 2008, Visual Studio is 2008 and
Office is 2007. All software was obtained from msdn website.
17
After installation, the next task was to start creating a database. After successfully
creating databases, data insertion was started and different types of Select and Join
queries have been implemented on them.
3.4 Database Schema:
The next phase is to design the logical schema for the object model (from Figure 3.1).
This project uses C++ language for creation of classes, attributes and methods. This
design of schema is done in Visual Studio.
A class is specified using the “class” keyword which consists of attributes, relationships
and methods. Attribute is specified using “attribute” keyword, type and attribute name.
The type of attribute can be basic or structure type. Basic types are integer, float,
character, string etc. Structure types [14] are fixed set of labeled objects, possibly of
different types, into a single object. Structured types are defined using the “struct”
keyword, structure name, opening parentheses (i.e., { ), fixed set of objects, closing
parentheses (i.e.,}) and semicolon. Methods are specified by writing the return type (i.e.
integer, float etc.) along with the method name and parentheses. Relationships explain
how each class is related to each other. Relationship is specified using “relationship”
keyword and class name it is related to.
18
The logical schema for student database (from Figure 3.1) is shown below,
Class Student {
(extend students key stud_id)
attribute string stud_id;
attribute Name name;
attribute integer ssn;
attribute Date dob;
attribute Address address;
attribute integer phone;
relationship Registration belongs_to Student;
float gpacal( );
integer register_course( );
abstract float cal_tution_fee( );
};
struct Name {
string first;
string middle;
string last;
};
19
struct Address {
string street;
string number;
string city;
string state;
string zip;
};
class Registration {
attribute integer term;
relationship Student belongs_to Registration;
relationship set <Courses> takes inverse Courses : : taken;
integer checkEligibility( );
};
class Courses {
attribute string courseno;
attribute string title;
attribute Name faculty_name;
attribute interger course_hours;
relationship set <Registration> taken inverse Registration : : takes;
string enrollment;
20
};
class Graduate_Student extends Student {
attribute string undergrad_ major;
attribute integer gre_score;
attribute integer unit_per_fee;
float cal_tution_fee( );
};
class Undergrad_Student extends Student {
attribute string major;
attribute integer unit_per_fee;
float cal_tution_fee( );
};
3.5 Queries:
To query object-oriented databases, Object Query Language (OQL) was used. OQL is a
query language for OODB which is similar to Structured Query Language (SQL). Like
SQL, there is a Data Definition Language (DDL) and a Data Manipulation Language.
Similarly in OODB, there is an Object Definition Language (ODL) and an Object
Manipulation Language (OML). ODL is used to specify the logical schema for the object
21
database. OML is used to manipulate objects like inserting, deleting or updating in an
object database.
Before Implementation, one task that needs to be done is to choose types of queries for
analysis. This project focuses on the design of Select and Join queries based on logical
schema designed in Section 3.4.
Select Query:
Like SQL, even OQL uses select-from-where structure to write queries. Select query will
display all the records of the table from the database.
Using the schema from Section 3.4, below are some examples of the Select query [15]
[16]
1. Display title of the course whose course number is CS201.
select c.title from Courses c where c.courseno = “CS201”
2. Display the names of graduate student who live in “Sacramento” city and gpa is
greater than 3.5
select g.name,g.unit_per_fee from Graduate_Student g where g.address.city =
“Sacramento” and g.gpa >=3.5
22
3. Display all the course title, course number and faculty name that have enrollment
less than 20 students.
select
distinct
struct
(CourseNo:c.courseno,title:c.title,FacultyName:
c.faculty_name, (select x from c.offers x where x.enrollment < 20 )) from courses
c
4. Display all the names of undergraduate students whose major is CS and who have
registered for the course CS159.
select u.name.last, u.name.first, u.name.middle
from (select s from Undergrad_Student s where s.major = “CS”) as u
where u.register_course = “CS159”
5. Display name of the faculty who teaches the student name = David Lee
select distinct (c.last, c.first) from Courses c, c.teaches s where s in (select x from
sections x, x.is_taken_by t where t.last = “Lee” and c.first = “David”)
Join Query:
Like SQL, OQL query can join classes in the where clause. Join query will display data
from two or more classes based on relationship.
Using the schema from Section 3.4, below are some examples of the Join query [15] [16]
1. Display all course numbers and names which were offered in “Fall 2011”.
select distinct (c.courseno, c.title) from Courses c, Registration r
23
where r.belongs_to c and r.term = “Fall 2011”
2. List all courses taken by the student David Lee.
select c.courseno, c.title from students s
s.takes x, x.belongs_to c where s.last = “ Lee” and s.first = “David”
3. List all graduate students who have taken 3 courses exactly, less than 3 courses
and greater than 3 courses.
select s.last, s.first from Graduate_Student s s.takes x, x.belongs_to c
group by less: count (s.takes) < 3
equal: count (s.takes) = 3
greater: count(s.takes) > 3
4. Display number of students enrolled and eligible of taking course CS 206.
select c.enrollment
from Courses c,
c.belongs_to r
where c.courseno = “CS 206 ”
and r. checkEligibility = ‘Y’
24
Chapter 4
QUERY PROCESSING AND OPTIMIZATION IN OODB
Processing and Optimization of a query is very important for effective retrieval of data
from the databases. In today’s world, people do online transaction and usually they do not
like to wait; they need the response as soon as possible. Even though we have fast
databases, the response time may be longer as there will be many people at same time
visiting that particular site. So to reduce response time, we need to optimize the query.
4.1 Query Processing for OODB:
Once the query is submitted to the database, the next step is query processing. This
allows us to process a given query into algebraic form. In object-oriented databases, there
are two new approaches used for query processing,
i.
hybrid-hash pointer (HP) based approach
ii.
Multiwavefront (MW) approach.
4.2 Hybrid-Hash Pointer (HP) based approach:
In HP approach [5] [12], horizontal partitioning technique is used. In horizontal
partitioning [12], instances of an object class are divided horizontally into segments
25
which are stored across multiple nodes in the system. They are fetched from memory for
processing.
The steps for query processing are [5] [17],
i.
The query written in OQL (Object Query Language) is submitted to databases.
Then the scanner selects all the required tokens, checks for the syntax and
semantics of the query. Also it checks whether attributes and table names which
are given in the query match with the database.
ii.
Once the required checking of the query is completed, then the query is
represented in query tree. In the query tree, nodes represent algebraic operators
and leaves represent entity/object classes. This tree is processed in a leaves-toroot order.
Below is example of query tree using the Select and Join queries described in
Chapter 3.
Select Query: Display name of the faculty who teaches the student name = David
Lee
select distinct (c.last, c.first) from Courses c, c.teaches s where s in (select x from
sections x, x.is_taken_by t where t.last = “Lee” and t.first = “David”);
26
The query tree for the Select query is shown in Figure 4.1,
select
c.last, c.first
from
Courses C, Student S
where
in
s
Select
x
from
section x, Courses t
where
and
=
t.last
=
lee
t.first
David
Figure 4.1: Query Tree for Select query in OODB
Join Query: Display number of students enrolled and eligible of taking course CS
206.
select c.enrollment from Courses c, c.belongs_to r
where c.courseno = “CS 206 ” and r. checkEligibility = ‘Y’
The query tree for the Join query is shown in Figure 4.2,
27
select
c.enrollment( )
from
Courses C, Registration R
where
join
Courses
Registration
=
c.courseno
=
CS206
R.CheckEligibilty ( )
‘Y’
Figure 4.2: Query Tree for Join query in OODB
iii.
Once the query is represented in the query tree, processing phase starts. During
the processing, second condition in the query is selected and scanned. Then it is
hashed to their object instance identifiers. For each object instance selected, if the
instance identifiers are in the required bucket found, then it is placed in the
memory. Otherwise it is placed in the buffer and later written to the disk.
iv.
Then first condition is selected and scanned. Then pass on each of the first
condition instance to the nodes which store the second condition instances pointed
to by the first condition instance. Apparently when receiving a first condition
instance, a processing node is hashed. If the instance is hashed to the required
bucket of second condition, then join operation is performed. Otherwise it will be
stored in buffer and later written to the disk.
28
v.
During the processing of a query, all the temporary results which contain
instances of object references are constructed.
vi.
Then the instances of each object class are horizontally-partitioned and stored in
large number of processing nodes when a database is established. These
partitioned segments can thus be read from memory and be processed in parallel
by their corresponding processors. Instances of temporary results are transferred
among processors to perform a join operation or its equivalent.
Below is the pseudo code which I have designed for processing steps and horizontal
partition,
//Get table Definition
tableDef = objectDatabase.Tables (tableName)
//Create a new table based on table Definition
create tempName = new tableDef
/*Get Values of Current Table and check it with second condition. If found, hash it to the
required buffer*/
Foreach value in tableDef
{
currentValues = tableDef.values
If (currentValues == secondCondition)
{
objectTemp = hash(buffer, currentValues)
}
}
/*Check Constraint for first condition. If found, hash it with matching second condition
and perform join operation.*/
If (firstCondition==ObjectTemp)
29
{
newResults= hash(objectTemp.Name, firstCondition)
}
//Store new results in new table
tempName = newResults
//partition based on the rows in the new table
lengthTable = len(tempName)
if (k < lengthTable)
return tempName [k]
i=0
while ((k - i) >= 0)
{
j = (i-1)/2;
sum += j * ( tempName [k - i])
i++
}
return partition(k,sum)
4.3 Multiwavefront (MW) approach:
In MW approach [5] [12] [13], we use horizontal as well as vertical partition technique
for optimization. In vertical partitioning, attributes of an object class are divided into
vertical columns. Then group attributes in the same vertical column based on their
frequency of being used together. Attributes with complex data types such as video,
voice, image, or graph can be partitioned into separate columns, as they occupy a lot of
memory. These columns of data are stored separately and thus can be accessed from
memory independently.
30
The steps for query processing are [5] [12][13][17],
i.
The high level query which written in OQL (Object Query Language) is
submitted to databases. Then the scanner selects all the required tokens, checks
for the syntax and semantics of the query. Also it checks whether attributes and
table names which are given in the query match with the database.
ii.
MW approach uses a graph-based processing strategy.
Below is example of query graph using same Select and Join queries which are
used in HP approach.
Select Query: Display name of the faculty who teaches the student name = David
Lee
select distinct (c.last, c.first) from Courses c, c.teaches s where s in (select x from
sections x, x.is_taken_by t where t.last = “Lee” and t.first = “David”);
The query graph for the Select query is shown in Figure 4.3,
31
Lee
last
Student
Name
Section
x
Courses
t
and
first
David
Figure 4.3: Query Graph for Select query in OODB
Join Query: Display number of students enrolled and eligible of taking course CS
206.
select c.enrollment from Courses c, c.belongs_to r
where c.courseno = “CS 206 ” and r. checkEligibility = ‘Y’
The query graph for the Join query is shown in Figure 4.4,
32
CS 206
courseno
Course
s
enrollment( )
Join
registration
CheckEligibility( )
Y
Figure 4.4: Query Graph for Join query in OODB
iii.
Once the query is represented in the query graph, processing phase starts. MW
approach has processing steps in two phases. In first phase, selection operation is
performed and instance identifiers which satisfy the selection condition are
transmitted among processors. During the second phase, attributes for only those
objects which satisfy the search during the first phase are retrieved. This approach
avoids the creation of large temporary tables, reduces the amount of data transfer
among processors, and reduces the amount of I/O requirement.
iv.
In a large database, each class can have a large number of attributes and methods.
The instance of each class is first horizontally-partitioned and is stored in a
number of processing nodes (same as the HP approach); then each node is
vertically partitioned into columns. This vertical partitioning allows values of an
attribute having a complex data type to be independently accessed and processed
in the memory.
33
Below is the pseudo code which I have designed for processing steps and vertical
partition,
//Get table Definition
Set tableDef = objectDatabase.Tables(tableName)
//Get Values of Current Table and check it with condition. If found, store it in buffer
Foreach value in tableDef
{
currentValues = tableDef.values
If (currentValues == (Condition))
{
objectStore = currentValues
}
}
//using the same horizontal partition steps as defined in HP approach
lengthTable = len(tableDef)
if (k < lengthTable)
return tableDef [k]
i=0
while ((k - i) >= 0)
{
j = (i-1)/2;
sum += j * ( tableDef [k - i])
i++
}
return partition(k,sum)
//partition based on the columns after horizontal partition is done
x = sizeof(items)
i = x / items
if ((x % items) != 0)
j++
a = new array;
foreach (arrary_item in x)
{
a[i] = x;
i++
34
if (i == j)
{
divide array based on j items
}
}
4.4 Query Optimization:
The next phase is query optimization. This project uses two functions - Get and Priority
to optimize and rewrite the query. Get function is used for scanning tables and objects in
memory (i.e., after partition is completed in the processing step) which are needed and
stores table and object names. Then the Priority function gets all the inter-related objects
and tables together in a scope. Then it priorities objects and tables and calls them based
on priority. In the below examples, Course class is not called until Section and Student
condition are matched.
Below is the pseudo code which I have designed for Get and Priority function,
//Get function - takes OODB query has input
get(query)
{
word =0;
//scan the each word in query
foreach word in query
{
word++;
while(word >= “from” and word < “where”)
i = word +1;
}
//once the scanning of query is done, store table name in a container
ObjectContainer tableNames = Container.openquery(query);
for( j=1;j <= i; j++)
35
{
temp[j] = dbquery.delegate(query[i]);
tableNames = temp[j]. retrievetables();
}
}
// Priority function
//using pquery.h header file for assigning priority to the tables
#include “pquery.h”
Priority (query, tableNames)
{
pquery pq;
n = total_tables(tablesNames);
//scanning each word in query to see which data columns are output
word =0;
//scan the each word in query and put the tables columns in array
foreach word in query
{
while(word >= “select” and word < “from”)
i = word++;
array[i] = word;
}
//check tables columns to which it table it belongs to
foreach tablesNames
{
x = total_columns(i);
y = sizeof(tableNames[j])
for(j=0;j < y;j++)
{
for(i=0;i<y;i++)
{
If(array[i] == tablesNames)
{
flag = true;
return flag;
36
}
}
If(flag)
priority pg is assigned
}
}
Based on the get( ) and priority( ) function, first phase of optimization is shown below. It
uses same Select and Join queries described in HP and MW approach.
Display name of the faculty who teaches the student name = David Lee
select distinct (c.last, c.first) from Courses c, c.teaches s where s in (select x from
sections x, x.is_taken_by t where t.last = “Lee” and t.first = “David”);
The rewriting for the select query is shown in Figure 4.5,
Project
c.last, c.first
Mat
Course c
Mat
Student t
Select
t.last = lee and t.first = David
Mat
Section x
Get
Courses c, Section x, Student t
Figure 4.5: Rewriting phase
37
This rewriting of the query can be further optimized using operators like index scan,
assembly (used for joining table) and filter (sort based on condition). Figure 4.6 shows
further optimization for the Figure 4.5 using the scan operators.
Project
c.last, c.first
Assembly
Courses
Scan
Section
c
Scan
c.teaches s
Assembly
Student
Scan
s
Filter
s.last = lee
s.first = David
Figure 4.6: After Optimization phase
38
Chapter 5
RESULTS
This chapter presents the results of the execution time, memory usage and response time
for Select and Join queries.
5.1 Select Query Result:
Based on the two processing and optimization techniques, Figure 5.1 shows the execution
time needed to execute the Select query.
Figure 5.1: Execution time of Select query
From Figure 5.1, the x-axis represents Query number (from Section 3.5 using Select
Query) and the y-axis represents time in seconds. Execution time is the time taken by
39
CPU to execute the query. Execution time also includes time spent for run-time services,
system services and interrupts (if any comes in during execution). From Figure 5.1, it can
be seen that MW approach is faster than HP approach. MW approach is faster because it
selects data columns which are needed and loads them in memory.
Figure 5.2 shows the memory usage for the Select query.
Figure 5.2: Memory Usage
From Figure 5.2, the x-axis represents Query number (from Section 3.5 using Select
Query) and the y-axis represents memory in bits. From Figure 5.2, it can be seen that
MW approach requires less memory, as there is no creation of temporary tables or
40
objects. It can be stated that MW approach is more efficient in terms of memory
requirements.
Figure 5.3 shows the response time for the Select query.
Figure 5.3: Response Time
From Figure 5.3, the x-axis represents Query number (from Section 3.5 using Select
Query) and the y-axis represents response time in milliseconds. Response time is the time
taken by the system to respond to the user with the results of the query. From Figure 5.3,
there is little difference in response time for the MW and HP approach.
41
5.2 Join Query Result:
Based on the two processing and optimization techniques, Figure 5.4 shows the execution
time needed to execute the Join query.
Figure 5.4: Execution time of Join query
From Figure 5.4, the x-axis represents Query number (from Section 3.5 using Join Query)
and the y-axis represents time in seconds. It can be seen that the MW approach is faster
than HP approach. MW approach is faster because it selects data columns from two
tables which are necessary and loads them in memory.
42
Figure 5.5 shows the memory usage for the Join query.
Figure 5.5: Memory Usage
From Figure 5.5, the x-axis represents Query number (from Section 3.5 using Join Query)
and the y-axis represents memory in bits. From Figure 5.5, it can be seen that the MW
approach occupies less memory, as there is no creation of temporary tables or objects. It
can be stated that MW approach is more efficient in terms of memory requirements.
43
Figure 5.6 shows the response time for the Join query.
Figure 5.6: Response Time
From Figure 5.6, the x-axis represents Query number (from Section 3.5 using Join Query)
and the y-axis represents response time in milliseconds. From Figure 5.6, it obvious that
response time for MW approach is better than HP approach. It is easier for the system to
respond back as MW approach stores the selected columns which are needed in the
memory.
44
Chapter 6
CONCLUSION AND FUTURE WORK
In analyzing the results from Chapter 5, it is obvious that multiwavefront approach is
better than hydrid-hash pointer based approach. Below are the reasons for it i.
MW approach takes less time and memory when compared to HP approach
because,

During the processing of queries in HP approach, the temporary results which
contain instances of object references are constructed. Whereas in MW
approach, there is no creation of the temporary tables.

During the selection of data in MW approach, only needed data columns are
selected and loaded into memory. Whereas in HP approach, all the data
columns are loaded since data are not stored separately. Therefore, the I/O
cost for the MW approach is less than that of the HP approach.
ii.
HP approach supports only horizontal partition. Whereas in MW approach, it
supports horizontal partition as well as vertical partition. Vertical partition is
important as it allows the data values in the columns to have complex data and it
can be independently accessed and processed in the memory. It stores data in
small chunks of memory which will be easier during the scanning and searching
of data.
45
6.1 Conclusion:
Based on the results presented in Chapter 5 and above discussed reasons, the MW
approach is far better than the HP approach in terms of execution time and memory
usage. In the case of response time for Select query, the results were similar for both the
approaches. But the response time for Join query showed some difference in both the
approaches. For the applications in which execution time and memory usage is a concern,
the MW approach is a better choice over the HP approach based on performance analysis.
6.2 Future Work:
Future work may include the following:
i.
HP approach can be remodeled by removing the construction of temporary table.
The performance may not be equal to MW approach. But it will be better than the
current performance.
ii.
Someone can include analysis based on other parameters like security, portability
and resource management.
iii.
Someone can also design object-oriented queries for Delete, Update and Insert
and compare the performance.
46
BIBLIOGRAPHY
[1] The Wikipedia link for database index [Online]
http://en.wikipedia.org/wiki/Database_index
[2] Michael L. Rupley, Jr, “Introduction to Query Processing and Optimization”,
Technical Report in Indiana University at South Bend, p.2-4, Jan 2008.
[3] The Wikipedia link for object-oriented database management system[Online]
http://en.wikipedia.org/wiki/Object_database
[4] A presentation “Object Oriented Databases” by the students of University of
California, Berkley. Mathieu Metz, Palani Kumaresan, Napa Gavinlertvatana, Kristine
Pei Keow Lee, Prabhu Ramachandran, 8th December 2004[Online]
http://ieor.berkeley.edu/~goldberg/courses/F04/215/215-OODB.ppt
[5] Stanley Y.W. Su, Fellow, IEEE, Sanjay Ranka, Member, IEEE, and Xiang He,
“Performance Analysis of Parallel Query Processing Algorithms for Object-Oriented
Databases”, IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,
p.356-375, VOL. 12, NO. 6, NOVEMBER/DECEMBER 2000
[6] Forum
for
differences
between
relational
databases
and
object-oriented
databases[Online]
http://forum.world.st/Object-IDs-vs-relational-keys-td1596002.html
[7] Referential
Integrity
Is
Important
For
Databases
-
Michael
(blaha@computer.org), Modelsoft Consulting Corp, www.modelsoftcorp.com .
Blaha
47
[8] White paper, “Object-Oriented Database Theory - An Introduction & Indexing in
OODBS” by David Maier, Ming-Ju Lee and Andreas Gruenhagen[Online]
http://www.csd.uoc.gr/~hy562/Papers/OODBMS.pdf
[9] SQL and Views[Online]
http://www.w3schools.com/sql/sql_view.asp
[10] Francis Chu,Joseph Y. Halpern,Praveen Seshadri y(Department of Computer
Science,Cornell University), “Least Expected Cost Query Optimization: An Exercise in
Utility” , the 25th International Conference on Very Large Data Bases, p.411-422, March
2005.
[11] Presentation on Fundamentals of Database Systems by Elmasri and Navathe[Online]
http://faculty.kfupm.edu.sa/ICS/mwaslam/ICS324/ENACh15final.ppt
[12] MSDN library for partition[Online]
http://msdn.microsoft.com/en-us/library/ms178148.aspx
[13] A.K. Thakore, S.Y.W. Su, and H. Lam, “Algorithms for Asynchronous Parallel
Processing of Object-Oriented Databases”, IEEE Trans. Knowledge and Data Eng.,
March 1995.
[14] The Wikipedia link for struct in programming language[Online]
http://en.wikipedia.org/wiki/Struct_(C_programming_language)
[15] Michael
Kifer, Wom Kima and Yehoshua Sagiv, “Querying Object-Oriented
Databases”, appeared in ACM SIGMOD Conference on Management of Data, San Diego,
CA, June 1992.
48
[16] Jay Banerjee, Wom Kim and Kyung-Chang kim, “Queries in Object-Oriented
Databases”, IEEE Trans. Knowledge and Data Eng., August 6th, 2002.
[17] M Tamer Ozsu and Jose A.Blakeley, “Query Processing in Object-Oriented
Databases Systems”, In Proc. ACM SIGMOD Int. Conf. on Management of Data, p.312–
321, October 2000.
[18] Hennie J.Steenhagen, Peter M.G. Apers, Henk M.Blanken, Rolf A. de By, “From
Nested Queries to Join Queries in OODB”, Proceedings of the 20th VLDB Conference
Santiago, Chile,1994.
Download