Summarization – CS 257
Chapter – 21
Database Systems: The Complete Book
Submitted by:
Nitin Mathur
Submitted to:
Dr. T.Y. Lin
WHY INFORMATION INTEGRATION?
 Databases are created independently, even if they later need to work together.
 The use of databases evolves, so we cannot design a database to support every possible future use.
 We will illustrate information integration with the example of a university database.

UNIVERSITY DATABASE

Earlier, a university would have different databases for different functions:
 A Registrar database for keeping data about courses and student grades, used to generate transcripts.
 A Bursar database for keeping data about tuition payments by students.
 A Human Resources database for recording employees, including students with teaching-assistantship jobs.


Applications were built using these databases, like generating payroll checks and calculating tax and social-security payments to the government.
But these databases were developed independently: a change in one database was not reflected in the others and had to be propagated manually. For example, we want to make sure that the Registrar does not record grades for a student who has not paid fees at the Bursar's office.
Building a whole new database for all these functions is a very expensive and time-consuming process.
In addition to paying for very expensive software, the university would have to run both the old and the new databases in parallel for a long time to verify that the new system works properly.


A solution is to build a layer of abstraction, called middleware, on top of all the legacy databases, without disturbing the original databases.
We can then query this middleware layer to retrieve or update data.


Often this layer is defined by a collection of classes and queried in an object-oriented language.
New applications can be written to access this layer for data, while the legacy applications continue to run using the legacy databases.
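
To make the idea concrete, here is a minimal Python sketch of such a middleware layer; the class and method names (Middleware, record_grade, has_unpaid_fees) are invented for illustration and are not from the book:

class Registrar:
    # Stand-in for the legacy registrar database.
    def __init__(self):
        self.grades = {}
    def record_grade(self, student_id, course, grade):
        self.grades[(student_id, course)] = grade

class Bursar:
    # Stand-in for the legacy bursar database.
    def __init__(self, unpaid_students):
        self.unpaid = set(unpaid_students)
    def has_unpaid_fees(self, student_id):
        return student_id in self.unpaid

class Middleware:
    # The abstraction layer: new applications talk only to this class,
    # while each legacy database keeps running unchanged underneath.
    def __init__(self, registrar, bursar):
        self.registrar = registrar
        self.bursar = bursar
    def record_grade(self, student_id, course, grade):
        # Enforce the cross-database rule from the text: no grades for
        # students who have not paid fees at the Bursar's office.
        if self.bursar.has_unpaid_fees(student_id):
            raise ValueError("unpaid fees at the Bursar's office")
        self.registrar.record_grade(student_id, course, grade)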
THE HETEROGENEITY PROBLEM

When we try to connect information sources that were developed independently, we invariably find that the sources differ in many ways. Such sources are called heterogeneous, and the problem of integrating them is referred to as the heterogeneity problem. There are different levels of heterogeneity:
1. Communication heterogeneity.
2. Query-language heterogeneity.
3. Schema heterogeneity.
4. Data type differences.
5. Value heterogeneity.
6. Semantic heterogeneity.
COMMUNICATION HETEROGENEITY

Today, it is common to allow access to your information using the HTTP protocol. However, some dealers may not make their databases available on the Web, but instead accept remote access via anonymous FTP.
Suppose there are 1000 dealers of the Aardvark Automobile Co., of which 900 use HTTP while the remaining 100 use FTP; there might then be problems of communication among the dealers' databases.
QUERY-LANGUAGE HETEROGENEITY

The manner in which we query or modify a dealer's database may vary.
For example, dealers may run different kinds of database systems: some might use a relational database while others might not, or some dealers might use SQL while others keep their data in Excel spreadsheets or some other system.
SCHEMA HETEROGENEITY

Even if the dealers all use a relational DBMS supporting SQL as the query language, there can still be heterogeneity at the highest level: the schemas can differ.
For example, one dealer might store cars in a single relation, while another dealer might use a schema in which options are separated out into a second relation.
DATA TYPE DIFFERENCES

Serial numbers might be represented by character strings of varying length at one source and of fixed length at another. The fixed lengths could differ, and some sources might use integers rather than character strings.
VALUE HETEROGENEITY

The same concept might be represented by different constants at different sources. The color black might be represented by an integer code at one source, the string BLACK at another, and the code BL at a third.
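
A small Python sketch of how a mediator might normalize such value and data-type differences; the specific codes (3, "BL") and the 10-character canonical serial format are assumptions for illustration:

CANONICAL_COLOR = {
    3: "BLACK",        # integer code used by one source (assumed)
    "BLACK": "BLACK",  # full string used by a second source
    "BL": "BLACK",     # two-letter code used by a third
}

def normalize_serial(serial):
    # Serial numbers may arrive as integers or as strings of varying
    # length; map them all to one fixed-width string form.
    return str(serial).strip().zfill(10)

def normalize(color, serial):
    return CANONICAL_COLOR.get(color, color), normalize_serial(serial)

print(normalize("BL", 1234))   # ('BLACK', '0000001234')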
SEMANTIC HETEROGENEITY

Terms might be given different interpretations at different sources. One dealer might include trucks in its Cars relation, while another puts only automobile data in Cars. One dealer might distinguish station wagons from minivans, while another does not.
Buffer Management
Introduction

The buffer manager manages main-memory buffers, satisfying each process's request for memory with minimum delay.
(Figure: requests go to the buffer manager, which reads and writes the buffers.)
Buffer Management Architecture

There are two types of architecture:
 The buffer manager controls main memory directly.
 The buffer manager allocates buffers in virtual memory.
In either case, the buffer manager should limit the number of buffers in use so that they fit in the available main memory.


When the buffer manager controls main memory directly, it selects a buffer to empty by returning its contents to disk; if the contents have not changed, the buffer may simply be erased from main memory.
If more buffers are really in use than fit in main memory, very little useful work gets done.
Buffer Management Strategies

 LRU (Least Recently Used): frees the buffer whose block has not been read or written for the longest time.
 FIFO (First In, First Out): frees the buffer that has been occupied the longest, and assigns it to the new request.
 The "Clock" Algorithm: arranges the buffers in a circle, each with a "use" bit; a rotating hand clears use bits that are 1 and reuses the first buffer whose use bit is 0.
(Figure: eight buffers arranged as a clock face, with use bits 0, 0, 1, 1, 0, 0, 1, 1.)
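
A runnable Python sketch of the clock algorithm pictured above (class and method names are mine): each buffer carries a use bit, and the rotating hand gives buffers with bit 1 a second chance while reusing the first buffer whose bit is 0.

class ClockBuffers:
    def __init__(self, num_buffers):
        self.pages = [None] * num_buffers   # block held by each buffer
        self.use = [0] * num_buffers        # one use bit per buffer
        self.hand = 0                       # current position of the hand

    def access(self, page):
        if page in self.pages:
            self.use[self.pages.index(page)] = 1   # hit: set the use bit
            return
        # Miss: sweep until a buffer with use bit 0 is found.
        while self.use[self.hand] == 1:
            self.use[self.hand] = 0                # second chance
            self.hand = (self.hand + 1) % len(self.pages)
        self.pages[self.hand] = page               # evict and reuse
        self.use[self.hand] = 1
        self.hand = (self.hand + 1) % len(self.pages)

buf = ClockBuffers(4)
for block in ["A", "B", "C", "D", "A", "E"]:
    buf.access(block)
print(buf.pages)   # ['E', 'B', 'C', 'D']: a full sweep cleared all use bits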
The Relationship Between Physical Operator Selection and Buffer Management

The query optimizer will eventually select a set of physical operators to be used to execute a given query.
However, the buffer manager may not be able to guarantee that the buffers those operators expect will be available when the query is executed.
DATA CUBES

A data cube is a multidimensional structure: a data abstraction that allows one to view aggregated data from a number of perspectives.
The base cuboid is surrounded by a collection of sub-cubes (cuboids) that represent the aggregation of the base cuboid along one or more dimensions.
CUBE OPERATOR

The cube operator defines an augmented table CUBE(F) that adds an additional value, denoted *, to each dimension.
The * has the intuitive meaning "any", and it represents aggregation along the dimension in which it appears.
CUBE OPERATOR EXAMPLE
Sales(model, color, date, dealer, val, cnt)
A query specifies conditions on certain attributes of the Sales relation and groups by some other attributes.
In the relation CUBE(Sales), we look for those tuples t with the following properties:
1. If the query specifies a value v for attribute a, then tuple t has v in its component for a.
2. If the query groups by an attribute a, then t has any non-* value in its component for a.
3. If the query neither groups by attribute a nor specifies a value for it, then t has * in its component for a.
QUERY:
SELECT color, AVG(price)
FROM Sales
WHERE model = 'Gobi'
GROUP BY color;
The matching tuples of CUBE(Sales) have the form ('Gobi', c, *, *, v, n); for each color c, the average price is v/n.
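
A Python sketch of what CUBE(Sales) contains: it aggregates val and cnt over every combination of dimensions replaced by *. The two sample tuples are invented, and date is omitted for brevity.

from itertools import combinations
from collections import defaultdict

def cube(rows, dims):
    # For each row and each subset of dimensions kept, replace the
    # remaining dimensions by '*' and accumulate SUM(val), SUM(cnt).
    agg = defaultdict(lambda: [0, 0])
    for row in rows:
        for k in range(len(dims) + 1):
            for kept in combinations(dims, k):
                key = tuple(row[d] if d in kept else "*" for d in dims)
                agg[key][0] += row["val"]
                agg[key][1] += row["cnt"]
    return agg

sales = [
    {"model": "Gobi", "color": "red",  "dealer": "d1", "val": 30000, "cnt": 2},
    {"model": "Gobi", "color": "blue", "dealer": "d2", "val": 17000, "cnt": 1},
]
c = cube(sales, ["model", "color", "dealer"])
v, n = c[("Gobi", "red", "*")]
print(v / n)   # average price of red Gobis: 15000.0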
CUBE IMPLEMENTED BY MATERIALIZED VIEWS

A materialized view is a stored object containing the result of a SELECT statement. Materialized views are mostly used in data warehousing to improve the performance of SELECT statements that involve grouping and aggregate functions.
EXAMPLE:
INSERT INTO SalesV1
SELECT model, color, month, city, SUM(val) AS val, SUM(cnt) AS cnt
FROM Sales JOIN Dealers ON dealer = name
GROUP BY model, color, month, city;
Sales(model, color, date, dealer, val, cnt)
Query:
SELECT model, SUM(val)
FROM Sales
GROUP BY model;
This can also be answered from the view:
SELECT model, SUM(val)
FROM SalesV1
GROUP BY model;
By contrast, a query that groups by date, such as
SELECT model, color, date, SUM(val)
FROM Sales
GROUP BY model, color, date;
cannot be answered from SalesV1, because the view has aggregated the dates into months.
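
A runnable sqlite3 sketch (the sample rows are invented) showing the model query being answered from the pre-aggregated view:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE SalesV1 (model TEXT, color TEXT, month TEXT, city TEXT,
                          val INTEGER, cnt INTEGER);
    INSERT INTO SalesV1 VALUES
        ('Gobi', 'red',  '2001-03', 'San Jose', 30000, 2),
        ('Gobi', 'blue', '2001-04', 'San Jose', 17000, 1);
""")
# Answering from SalesV1 is cheaper than re-scanning Sales, because the
# view has already summed val within each (model, color, month, city) group.
for row in con.execute("SELECT model, SUM(val) FROM SalesV1 GROUP BY model"):
    print(row)   # ('Gobi', 47000)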
LATTICE OF VIEWS

In a lattice of views, we can partition the values of a dimension by grouping according to one or more attributes of its dimension table.
What is OLAP?
• On-Line Analytic Processing (OLAP)
• OLAP is used to query very large amounts of data in a company's data warehouse.
• It involves highly complex queries that use one or more aggregations.
• OLAP queries are also called decision-support queries.
What is OLTP?
Common database operations touch only a very small amount of data; they are referred to as OLTP (On-Line Transaction Processing).
OLAP queries are considered long transactions, and long transactions that lock the entire database would shut down ordinary OLTP transactions, so OLAP data is usually stored separately in a data warehouse rather than in the ordinary database.
Example of OLAP and OLTP queries
Consider the data warehouse of an automobile company. The schema can be as follows:
Sales(serialNo, date, dealer, price)
Autos(serialNo, model, color)
Dealers(name, city, state, phone)
A typical OLAP query finds the average sales price by state:
SELECT state, AVG(price)
FROM Sales, Dealers
WHERE Sales.dealer = Dealers.name AND date >= '2001-01-04'
GROUP BY state;
In the same example, a typical OLTP query finds the price at which the auto with serial number 123 was sold, for example:
SELECT price FROM Sales WHERE serialNo = 123;
MULTIDIMENSIONAL VIEW OF OLAP DATA
• In a typical OLAP application we have a central relation called the fact table.
• The fact table represents events or objects of interest.
• It is helpful to think of the objects in the fact table as arranged in a multidimensional space.
• In the earlier automobile example, the fact table is built from Sales, the object of interest, and is viewed as a three-dimensional data cube.
Each point in the cube represents the sale of a single automobile, and the dimensions represent properties of the sale.
STAR SCHEMAS
A star schema consists of the schema for the fact table, which links to several other relations called "dimension tables".
SLICING AND DICING
The raw data cube can be partitioned along each dimension at some level of granularity for analysis; these partitioning operations are known as slicing and dicing.
• In SQL this partitioning is done with the GROUP BY clause.
• Consider the automobile example: suppose the car named Gobi is not selling well and we want to find out exactly which colors are not doing well.
The SQL query is as follows:
SELECT color, SUM(price)
FROM Sales NATURAL JOIN Autos
WHERE model = 'Gobi'
GROUP BY color;
This query dices by color and slices by model, focusing on a particular model, the Gobi, and ignoring other data.
LOCAL-AS-VIEW MEDIATORS

GAV: In global-as-view mediation the global data is like a view; it does not exist physically, but pieces of it are constructed by the mediator by issuing queries to the sources.
LAV: In local-as-view mediation we still define global predicates at the mediator, but we do not define these predicates as views of the source data.
Instead, for each source, expressions are defined in terms of the global predicates that describe the tuples that source is able to produce, and queries are answered at the mediator by discovering all possible ways to construct the query using the views provided by the sources.
MOTIVATION FOR LAV MEDIATORS

LAV mediators help us discover how and when to use each source in a given query.
Example: with a global predicate Par(c,p), a GAV mediator defined directly over Par(c,p) gives information about children and parents but does not give information about grandparents.
With LAV, views defined over Par(c,p) can be combined to yield child-parent information and even grandparent information.
TERMINOLOGY FOR LAV MEDIATION

The views are defined in a form of logic: Datalog, which is used both for mediator queries and for view definitions; such queries are known as conjunctive queries.
The LAV mediator has global predicates, which are used as the subgoals of mediator queries.
Each view is defined by a conjunctive query with a unique view predicate in the head; the body is composed of global predicates and is associated with a particular view.

Example:
Par(c,p) is the global predicate.
1. One source produces a view defined by the conjunctive query: V1(c,p) <- Par(c,p)
2. Another source produces: V2(c,g) <- Par(c,p) AND Par(p,g)
A query at the mediator asks for great-grandparent facts:
Q(w,z) <- Par(w,x) AND Par(x,y) AND Par(y,z)
Possible solutions include:
1. Q(w,z) <- V1(w,x) AND V2(x,z)
2. Q(w,z) <- V2(w,y) AND V1(y,z)
EXPANDING SOLUTIONS

Given a query Q and a solution S, consider a subgoal V(a1,a2,...,an) of S (the ai need not be distinct). Let the view V be defined by V(b1,b2,...,bn) <- B, where B is the entire body and the bi are distinct variables. We can replace V(a1,...,an) in solution S by a version of body B whose subgoals have their variables possibly altered.
Rules for altering the variables of B:
1. Find the local variables of B: those that appear in the body but not in the head. A local variable may be renamed as long as the new name does not appear elsewhere in the conjunctive query.
2. If any local variable of B also appears in S, replace it by a distinct new variable that appears nowhere in the rule for V or in S.
3. In the body B, replace each bi by ai, for i = 1, 2, ..., n.
• Example:
V(a,b,c,d) <- E(a,b,x,y) AND F(x,y,c,d)
Here x and y are local variables of V. Suppose the solution uses the subgoal V(x,y,1,x), whose variables clash with the view's local variables, so we rename the local x and y to e and f:
V(a,b,c,d) <- E(a,b,e,f) AND F(e,f,c,d)
Then substitute the head arguments a -> x, b -> y, c -> 1, d -> x:
the subgoal V(x,y,1,x) expands to the two subgoals E(x,y,e,f) and F(e,f,1,x).
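
A Python sketch of this expansion step (the tuple representation and helper names are mine): head variables are replaced by the subgoal's arguments, and local variables get fresh names.

from itertools import count

fresh = (f"f{i}" for i in count())   # generator of fresh variable names

def expand(head_vars, body, subgoal_args):
    # head_vars: the view's head, e.g. ('a','b','c','d')
    # body: list of (predicate, argument-tuple) subgoals
    # subgoal_args: arguments used in the solution, e.g. ('x','y','1','x')
    head_sub = dict(zip(head_vars, subgoal_args))
    local_sub = {}
    out = []
    for pred, args in body:
        new_args = []
        for a in args:
            if a in head_sub:
                new_args.append(head_sub[a])       # rule 3: b_i -> a_i
            else:
                local_sub.setdefault(a, next(fresh))
                new_args.append(local_sub[a])      # rules 1-2: fresh local
        out.append((pred, tuple(new_args)))
    return out

print(expand(('a', 'b', 'c', 'd'),
             [('E', ('a', 'b', 'x', 'y')), ('F', ('x', 'y', 'c', 'd'))],
             ('x', 'y', '1', 'x')))
# [('E', ('x', 'y', 'f0', 'f1')), ('F', ('f0', 'f1', '1', 'x'))]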
CONTAINMENT OF CONJUNCTIVE QUERIES

Let the conjunctive query S be a solution to the mediator query Q. The expansion E of S produces a subset of the answers that Q produces, so E ⊆ Q.
A containment mapping from Q to E is a function Γ on the variables of Q such that:
1. If x is the i-th argument of the head of Q, then Γ(x) is the i-th argument of the head of E. We add to Γ the rule that Γ(c) = c for any constant c.
2. If P(x1,x2,...,xn) is a subgoal of Q, then P(Γ(x1),Γ(x2),...,Γ(xn)) is a subgoal of E.
Example:
Consider the queries:
P1: H(x,y) <- A(x,z) AND A(z,y)
P2: H(a,b) <- A(a,c) AND A(c,d) AND A(d,b)
A containment mapping from P1 to P2 requires Γ(x) = a and Γ(y) = b.
1. The first subgoal A(x,z) of P1 can only map to A(a,c) of P2, so Γ(z) must be c.
2. But since Γ(y) = b, the subgoal A(z,y) of P1 must become A(d,b) in P2, so Γ(z) must be d.
Γ(z) cannot be both c and d, so no containment mapping from P1 to P2 exists.
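
A brute-force Python sketch of the containment-mapping test (the representation is mine, and constants are not treated specially, which suffices here). It tries every mapping of P1's variables into P2's symbols; its exponential running time is consistent with the NP-completeness result below.

from itertools import product

def containment_mapping(head1, body1, head2, body2):
    # A query is (head argument tuple, list of (predicate, args) subgoals).
    vars1 = sorted({a for a in head1} | {a for _, args in body1 for a in args})
    syms2 = sorted({a for a in head2} | {a for _, args in body2 for a in args})
    goals2 = set(body2)
    for image in product(syms2, repeat=len(vars1)):
        g = dict(zip(vars1, image))
        if tuple(g[a] for a in head1) != tuple(head2):
            continue   # heads must correspond argument by argument
        if all((p, tuple(g[a] for a in args)) in goals2 for p, args in body1):
            return g   # every subgoal of P1 maps to a subgoal of P2
    return None

P1 = (('x', 'y'), [('A', ('x', 'z')), ('A', ('z', 'y'))])
P2 = (('a', 'b'), [('A', ('a', 'c')), ('A', ('c', 'd')), ('A', ('d', 'b'))])
print(containment_mapping(*P1, *P2))   # None: no containment mapping exists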


Complexity of the containment-mapping test: it is NP-complete to decide whether there is a containment mapping from one conjunctive query to another.
The importance of containment mappings is expressed by the following theorem: if Q1 and Q2 are conjunctive queries, then Q2 ⊆ Q1 if and only if there is a containment mapping from Q1 to Q2.
WHY THE CONTAINMENT-MAPPING TEST WORKS
Two questions must be answered:
1. If there is a containment mapping, why must there be a containment of conjunctive queries?
2. If there is containment, why must there be a containment mapping?
FINDING SOLUTIONS TO A MEDIATOR QUERY

Given a query Q, a solution is a conjunctive query S whose expansion E is contained in Q.
"If a query Q has n subgoals, then any answer produced by any solution is also produced by a solution that has at most n subgoals."
This is known as the LMSS Theorem.
Example:
Q1: Q(w,z) <- Par(w,x) AND Par(x,y) AND Par(y,z)
S1: Q(w,z) <- V1(w,x) AND V2(x,z)
S2: Q(w,z) <- V1(w,x) AND V2(x,z) AND V1(t,u) AND V2(u,v)
The expansion of S2 is
E2: Q(w,z) <- Par(w,x) AND Par(x,p) AND Par(p,z) AND Par(t,u) AND Par(u,q) AND Par(q,v)
E2 is contained in E1 (the expansion of S1), using the containment mapping that sends each variable of E1 to the same variable in E2; so, by the LMSS Theorem, the extra subgoals of S2 add no answers.
WHY THE LMSS THEOREM HOLDS

Suppose we have a query Q with n subgoals and a solution S with more than n subgoals. The expansion E of S must be contained in query Q, so there is a containment mapping from Q to E.
The subgoals of E that are targets of this containment mapping come from at most n subgoals of S. Let S' be the solution obtained by removing from S every subgoal whose expansion contributes no target of the containment mapping.
Then E', the expansion of S', still satisfies E' ⊆ Q, and S ⊆ S' by the identity mapping.
Thus there is no need for the solution S among the solutions to query Q.
INTRODUCTION

Determining whether two records or tuples do or do not represent the same person, organization, place, or other entity is called entity resolution.
DECIDING WHETHER RECORDS REPRESENT A COMMON ENTITY

Two records represent the same individual if the two records have similar values for each of the fields associated with those records.
It is not sufficient to require that the values of corresponding fields be identical, because of the following reasons:
1. Misspellings
2. Variant names
3. Misunderstanding of names
4. Evolution of values
5. Abbreviations
Thus, when deciding whether two records represent the same entity, we need to look carefully at the kinds of discrepancies that occur and use a test that measures the similarity of records.
DECIDING WHETHER RECORDS REPRESENT A COMMON ENTITY – EDIT DISTANCE

The first approach to measuring the similarity of records is edit distance.
Values that are strings can be compared by counting the number of insertions and deletions of characters it takes to turn one string into another.
The records represent the same entity if their distance measure is below a given threshold.
DECIDING WHETHER RECORDS REPRESENT A COMMON ENTITY – NORMALIZATION

We can normalize records by replacing certain substrings by others. For instance, we can use a table of abbreviations and replace each abbreviation by what it normally stands for.
Once the records are normalized, we can use edit distance to measure the difference between the normalized values in the fields.
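
A Python sketch combining the two ideas: normalize abbreviations first (the table is an assumed example), then compare with an insert/delete-only edit distance as described above.

ABBREV = {"St.": "Street", "Ave.": "Avenue"}   # assumed abbreviation table

def normalize(s):
    return " ".join(ABBREV.get(word, word) for word in s.split())

def edit_distance(a, b):
    # Counts only insertions and deletions of characters, as in the text.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                     # delete all remaining characters
    for j in range(n + 1):
        d[0][j] = j                     # insert all remaining characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = 1 + min(d[i - 1][j], d[i][j - 1])
    return d[m][n]

print(edit_distance(normalize("123 Oak St."), normalize("123 Oak Street")))  # 0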
MERGING SIMILAR RECORDS

Merging means replacing two records that are similar enough by a single record that contains the information of both.
There are many possible merge rules, for example:
1. Set any field in which the records disagree to the empty string.
2. (i) Merge by taking the union of the values in each field, and (ii) declare two records similar if at least two of the three fields have a nonempty intersection.
Name            Address          Phone
1. Susan        123 Oak St.      818-555-1234
2. Susan        456 Maple St.    818-555-1234
3. Susan        456 Maple St.    213-555-5678

After merging:

Name            Address                        Phone
(1-2-3) Susan   {123 Oak St., 456 Maple St.}   {818-555-1234, 213-555-5678}
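
A Python sketch of merge rule 2 on the Susan records above: fields hold sets of values, similarity requires a nonempty intersection in at least two of the three fields, and merging takes the union of each field.

FIELDS = ("name", "address", "phone")

def similar(r, s):
    return sum(bool(r[f] & s[f]) for f in FIELDS) >= 2

def merge(r, s):
    return {f: r[f] | s[f] for f in FIELDS}

r1 = {"name": {"Susan"}, "address": {"123 Oak St."},   "phone": {"818-555-1234"}}
r2 = {"name": {"Susan"}, "address": {"456 Maple St."}, "phone": {"818-555-1234"}}
r3 = {"name": {"Susan"}, "address": {"456 Maple St."}, "phone": {"213-555-5678"}}

m12 = merge(r1, r2)          # similar: name and phone intersect
print(similar(m12, r3))      # True: name and address now intersect
print(merge(m12, r3))        # the single (1-2-3) record from the table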
USEFUL PROPERTIES OF SIMILARITY AND MERGE FUNCTIONS
The following properties say that the merge operation is a semilattice:
1. Idempotence: the merge of a record with itself should surely be that record.
2. Commutativity: if we merge two records, the order in which we list them should not matter.
3. Associativity: the order in which we group records for a merger should not matter.
There are some other properties that we expect the similarity relationship to have:
• Idempotence for similarity: a record is always similar to itself.
• Commutativity of similarity: in deciding whether two records are similar, it does not matter in which order we list them.
• Representability: if r is similar to some other record s, but s is instead merged with some other record t, then r remains similar to the merger of s and t and can be merged with that record.
R-SWOOSH ALGORITHM FOR ICAR RECORDS
Input: a set of records I, a similarity function, and a merge function.
Output: a set of merged records O.
Method:


O := emptyset;
WHILE I is not empty DO BEGIN
    Let r be any record in I;
    Find, if possible, some record s in O that is similar to r;
    IF no such record s exists THEN
        move r from I to O
    ELSE BEGIN
        delete r from I;
        delete s from O;
        add the merger of r and s to I;
    END;
END;
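
A direct Python rendering of the pseudocode, parameterized by the similarity and merge functions (for instance, the ones sketched after the Susan table):

def r_swoosh(I, similar, merge):
    I = list(I)        # records still to be processed
    O = []             # records for which no similar partner was found
    while I:
        r = I.pop()                                   # any record in I
        s = next((x for x in O if similar(r, x)), None)
        if s is None:
            O.append(r)                               # move r from I to O
        else:
            O.remove(s)                               # delete s from O
            I.append(merge(r, s))                     # merger goes back to I
    return O

print(r_swoosh([r1, r2, r3], similar, merge))   # one merged Susan record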
OTHER APPROACHES TO ENTITY RESOLUTION
The other approaches to entity resolution are:
 Non-ICAR datasets
 Clustering
 Partitioning
OTHER APPROACHES TO ENTITY RESOLUTION – NON-ICAR DATASETS
Non-ICAR datasets: we can define a dominance relation r <= s, meaning that record s contains all the information contained in record r.
If so, then we can eliminate record r from further consideration.
OTHER APPROACHES TO ENTITY RESOLUTION – CLUSTERING
Clustering: sometimes we group the records into clusters such that members of a cluster are in some sense similar to each other and members of different clusters are not similar.
OTHER APPROACHES TO ENTITY RESOLUTION – PARTITIONING
Partitioning: we can group the records, perhaps several times, into groups that are likely to contain similar records, and look only within each group for pairs of similar records.