Summarization – CS 257, Chapter 21, Database Systems: The Complete Book
Submitted by: Nitin Mathur
Submitted to: Dr. T. Y. Lin

WHY INFORMATION INTEGRATION?
Databases are created independently, even if they later need to work together. The use of databases evolves, so we cannot design a database to support every possible future use. We will illustrate information integration with the example of a university database.

UNIVERSITY DATABASE
Earlier, different databases served different functions:
1. The Registrar database keeps data about courses and student grades, for generating transcripts.
2. The Bursar database keeps data about tuition payments by students.
3. The Human Resources database records employees, including students with teaching-assistantship jobs.

Applications were built using these databases, such as generating payroll checks and calculating taxes and Social Security payments to the government. But kept independent, these databases are of limited use: a change in one database is not reflected in the others and must be propagated manually. For example, we want to make sure the Registrar does not record grades for a student who has not paid fees at the Bursar's office.

Building a whole new database for the system is a very expensive and time-consuming process. In addition to paying for very expensive software, the university would have to run both the old and the new databases together for a long time to verify that the new system works properly.

A solution is to build a layer of abstraction, called middleware, on top of all the legacy databases, without disturbing the original databases. We can then query this middleware layer to retrieve or update data. Often this layer is defined by a collection of classes and queried in an object-oriented language.
New applications can be written to access this layer for data, while the legacy applications continue to run against the legacy databases.

THE HETEROGENEITY PROBLEM
When we try to connect information sources that were developed independently, we invariably find that the sources differ in many ways. Such sources are called heterogeneous, and the problem of integrating them is referred to as the heterogeneity problem. There are several levels of heterogeneity:
1. Communication heterogeneity
2. Query-language heterogeneity
3. Schema heterogeneity
4. Data type differences
5. Value heterogeneity
6. Semantic heterogeneity

COMMUNICATION HETEROGENEITY
Today it is common to allow access to information using the HTTP protocol. However, some dealers may not make their databases available on the Web, instead accepting remote access via anonymous FTP. Suppose the Aardvark Automobile Co. has 1000 dealers, of which 900 use HTTP while the remaining 100 use FTP; there may then be problems of communication among the dealers' databases.

QUERY-LANGUAGE HETEROGENEITY
The manner in which we query or modify a dealer's database may vary. For example, some dealers may use a relational DBMS and SQL, while others may not have a relational database at all, keeping their data in Excel spreadsheets or some other system.

SCHEMA HETEROGENEITY
Even if all the dealers use a relational DBMS supporting SQL as the query language, there can still be heterogeneity at the highest level: the schemas can differ. For example, one dealer might store cars in a single relation, while another uses a schema in which options are separated out into a second relation.

DATA TYPE DIFFERENCES
Serial numbers might be represented by character strings of varying length at one source and of fixed length at another.
The fixed lengths could differ, and some sources might use integers rather than character strings.

VALUE HETEROGENEITY
The same concept might be represented by different constants at different sources. The color black might be represented by an integer code at one source, the string BLACK at another, and the code BL at a third.

SEMANTIC HETEROGENEITY
Terms might be given different interpretations at different sources. One dealer might include trucks in its Cars relation, while another puts only automobile data there. One dealer might distinguish station wagons from minivans, while another does not.

BUFFER MANAGEMENT
The buffer manager provides the memory that processes require, with minimum delay: processes send buffer requests to the buffer manager and receive read/write buffers in return.

BUFFER MANAGEMENT ARCHITECTURE
There are two types of architecture:
1. The buffer manager controls main memory directly.
2. The buffer manager allocates buffers in virtual memory.
In either method, the buffer manager must limit the number of buffers in use so that they fit in the available main memory.

When the buffer manager controls main memory directly, it selects a buffer to empty by returning its contents to disk; if the contents have not changed, the buffer may simply be erased from main memory. If all the buffers are really in use, very little useful work gets done.

BUFFER MANAGEMENT STRATEGIES
1. LRU (Least Recently Used): free the buffer holding the block that has not been read or written for the longest time.
2. FIFO (First In, First Out): free the buffer that has been occupied the longest and assign it to the new request.
3. The "Clock" algorithm: the buffers are arranged in a circle, each carrying a flag of 0 or 1. A "hand" sweeps the circle; a buffer whose flag is 0 is chosen for replacement, while a buffer whose flag is 1 has the flag reset to 0 and is passed over.

THE RELATIONSHIP BETWEEN PHYSICAL OPERATOR SELECTION AND BUFFER MANAGEMENT
The query optimizer eventually selects a set of physical operators to execute a given query. The buffer manager, however, may not be able to guarantee the availability of the buffers those operators expect when the query is executed.
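As a minimal sketch (not from the chapter), the clock replacement strategy mentioned above can be illustrated as follows; the class name, pool size, and block names are hypothetical:

```python
class ClockBufferPool:
    """Toy buffer pool using the "clock" replacement strategy:
    each frame carries a use flag; the hand clears flags that are 1
    and evicts the first frame whose flag is 0."""

    def __init__(self, num_frames):
        self.frames = [None] * num_frames   # block held by each frame
        self.flags = [0] * num_frames       # use flag per frame
        self.hand = 0                       # current position of the clock hand

    def access(self, block):
        # A hit just sets the frame's use flag.
        if block in self.frames:
            self.flags[self.frames.index(block)] = 1
            return "hit"
        # Miss: sweep the hand until a frame with flag 0 is found,
        # giving flagged frames a "second chance" along the way.
        while self.flags[self.hand] == 1:
            self.flags[self.hand] = 0
            self.hand = (self.hand + 1) % len(self.frames)
        self.frames[self.hand] = block      # evict and load the new block
        self.flags[self.hand] = 1
        self.hand = (self.hand + 1) % len(self.frames)
        return "miss"

pool = ClockBufferPool(2)
print(pool.access("A"))  # miss
print(pool.access("B"))  # miss
print(pool.access("A"))  # hit
print(pool.access("C"))  # miss: a frame's flag reaches 0 and it is evicted
```

Note that a real buffer manager would also have to write dirty buffers back to disk before reuse, which this sketch omits.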
DATA CUBES
A data cube is a multidimensional structure: a data abstraction that allows one to view aggregated data from a number of perspectives. The base cuboid is surrounded by a collection of subcubes (cuboids) that represent the aggregation of the base cuboid along one or more dimensions.

THE CUBE OPERATOR
The cube operator defines an augmented table CUBE(F) that adds an additional value, denoted *, to each dimension. The * has the intuitive meaning "any," and it represents aggregation along the dimension in which it appears.

CUBE OPERATOR EXAMPLE
Consider the relation Sales(model, color, date, dealer, val, cnt). A query specifies conditions on certain attributes of the Sales relation and groups by some other attributes. In the relation CUBE(Sales), we look for those tuples t with the following properties:
1. If the query specifies a value v for attribute a, then tuple t has v in its component for a.
2. If the query groups by an attribute a, then t has any non-* value in its component for a.
3. If the query neither groups by attribute a nor specifies a value for it, then t has * in its component for a.
QUERY:
  SELECT color, AVG(price)
  FROM Sales
  WHERE model = 'Gobi'
  GROUP BY color;
The matching tuples of CUBE(Sales) have the form ('Gobi', c, *, *, v, n).

CUBES IMPLEMENTED BY MATERIALIZED VIEWS
A materialized view is an object that stores the result of a SELECT statement. Materialized views are mostly used in data warehousing to improve the performance of SELECT statements that involve grouping and aggregate functions.
EXAMPLE:
  INSERT INTO SalesV1
  SELECT model, color, month, city, SUM(val) AS val, SUM(cnt) AS cnt
  FROM Sales JOIN Dealers ON dealer = name
  GROUP BY model, color, month, city;
Given Sales(model, color, date, dealer, val, cnt), the query
  SELECT model, SUM(val)
  FROM Sales
  GROUP BY model;
can also be answered from the materialized view:
  SELECT model, SUM(val)
  FROM SalesV1
  GROUP BY model;
However, a query such as
  SELECT model, color, date, SUM(val)
  FROM Sales
  GROUP BY model, color, date;
cannot be answered from SalesV1, because the view aggregates dates into months and the daily detail is lost.

LATTICE OF VIEWS
In the lattice of views, we can partition the values of a dimension by grouping according to one or more attributes of its dimension table.

WHAT IS OLAP?
1. On-Line Analytic Processing (OLAP) is used to query very large amounts of data in a company's data warehouse.
2. It involves highly complex queries that use one or more aggregations.
3. OLAP queries are also called decision-support queries.

WHAT IS OLTP?
Common database operations touch only a small amount of data; they are referred to as OLTP (On-Line Transaction Processing). OLAP queries are considered long transactions, and long transactions locking the entire database would shut down ordinary OLTP transactions, so OLAP data is stored separately in a data warehouse rather than in the ordinary database.

EXAMPLE OF OLAP AND OLTP QUERIES
Consider the data warehouse of an automobile company with the schema:
  Sales(serialNo, date, dealer, price)
  Autos(serialNo, model, color)
  Dealers(name, city, state, phone)
A typical OLAP query finds the average sales price by state:
  SELECT state, AVG(price)
  FROM Sales, Dealers
  WHERE Sales.dealer = Dealers.name AND date >= '2001-01-04'
  GROUP BY state;
In the same example, a typical OLTP query would find the price at which the auto with serial number 123 was sold.

MULTIDIMENSIONAL VIEW OF OLAP DATA
In a typical OLAP application we have a central relation called the fact table.
1. The fact table represents the events or objects of interest.
2. It is helpful to think of the objects in the fact table as arranged in a multidimensional space.
3. Continuing the automobile example, the fact table can be built for Sales, the object of interest, viewed as a three-dimensional data cube. Each single point in the cube represents the sale of a single automobile, and the dimensions represent properties of the sale.

STAR SCHEMAS
A star schema consists of the schema for the fact table, which links to several other relations called dimension tables.

SLICING AND DICING
The raw data cube can be partitioned along each dimension at some level of granularity for analysis; these partitioning operations are known as slicing and dicing.
1. In SQL, this partitioning is done with the GROUP BY clause.
2. Consider the automobile example: suppose the car named Gobi is not selling well, and we want to find exactly which colors are not doing well. The SQL query is:
  SELECT color, SUM(price)
  FROM Sales NATURAL JOIN Autos
  WHERE model = 'Gobi'
  GROUP BY color;
This query dices by color and slices by model, focusing on the particular model Gobi and ignoring other data.
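As an illustration (not from the chapter), the slice-and-dice query above can be mimicked over in-memory tuples; the sample rows are hypothetical:

```python
# Hypothetical rows for Sales(serialNo, date, dealer, price)
sales = [(1, "2001-05-01", "d1", 20000),
         (2, "2001-05-02", "d1", 21000),
         (3, "2001-05-03", "d2", 19000)]
# Hypothetical rows for Autos(serialNo, model, color), keyed by serialNo
autos = {1: ("Gobi", "red"), 2: ("Gobi", "blue"), 3: ("Yeti", "red")}

# Slice: keep only Gobi sales.  Dice: group the remaining sales by color.
totals = {}
for serial, date, dealer, price in sales:
    model, color = autos[serial]       # the "natural join" on serialNo
    if model == "Gobi":                # slice on the model dimension
        totals[color] = totals.get(color, 0) + price  # dice by color
print(totals)  # {'red': 20000, 'blue': 21000}
```

The WHERE clause plays the role of the slice and the GROUP BY clause the role of the dice, exactly as in the SQL version.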
LOCAL-AS-VIEW MEDIATORS
GAV: A global-as-view mediator is like a view: it does not exist physically, but pieces of it are constructed by the mediator by asking queries of the sources.
LAV: A local-as-view mediator defines global predicates at the mediator, but we do not define these predicates as views of the source data. Instead, for each source, expressions are defined in terms of the global predicates that describe the tuples that source is able to produce, and queries are answered at the mediator by discovering all possible ways to construct the query from the views provided by the sources.

MOTIVATION FOR LAV MEDIATORS
LAV mediators help us discover how and when to use each source in a given query.
Example: with Par(c,p), a GAV mediator gives information about children and parents but no information about grandparents; an LAV mediator over Par(c,p) lets us derive child-parent facts and even grandparent facts.

TERMINOLOGY FOR LAV MEDIATION
LAV mediation is described in a form of logic that serves as the language for defining views. Datalog is used for both mediator queries and source descriptions; the queries involved are conjunctive queries.
1. The global predicates of the LAV mediator serve as the subgoals of mediator queries.
2. Conjunctive queries define the views: each has a unique view predicate as its head, and its body consists of global predicates associated with that particular view.
Example: Par(c,p) is a global predicate. One view is defined by the conjunctive query:
  V1(c,p) <- Par(c,p)
Another source produces:
  V2(c,g) <- Par(c,p) AND Par(p,g)
A query at the mediator asking for great-grandparent facts:
  Q(w,z) <- Par(w,x) AND Par(x,y) AND Par(y,z)
can be solved by:
  Q(w,z) <- V1(w,x) AND V2(x,z)
or
  Q(w,z) <- V2(w,y) AND V1(y,z)

EXPANDING SOLUTIONS
Given a query Q, a solution S containing a subgoal V(a1,...,an) (the ai need not be distinct), and a view definition V(b1,...,bn) <- B with distinct variables, where B is the entire body, we can replace V(a1,...,an) in solution S by a version of the body B whose subgoals have their variables altered as follows.
Rules:
1. Find the local variables of B: those that appear in the body but not in the head. If any local variable of B also appears in S, replace it by a distinct new variable that appears nowhere in the view definition or in S.
2. In the body B, replace each bi by ai, for i = 1, 2, ..., n.
Example: let V(a,b,c,d) <- E(a,b,x,y) AND F(x,y,c,d). Here x and y are local to V. To expand the subgoal V(x,y,1,x), first rename the local variables, say x -> e and y -> f, giving V(a,b,c,d) <- E(a,b,e,f) AND F(e,f,c,d). Then substitute a -> x, b -> y, c -> 1, and d -> x, so V(x,y,1,x) expands to the two subgoals E(x,y,e,f) and F(e,f,1,x).

CONTAINMENT OF CONJUNCTIVE QUERIES
Let conjunctive query S be a solution to the mediator query Q. The expansion E of S produces only answers that Q produces, so we require E to be contained in Q. A containment mapping from Q to E is a function Γ on the variables of Q such that:
1. If x is the ith argument of the head of Q, then Γ(x) is the ith argument of the head of E. We add to Γ the rule that Γ(c) = c for any constant c.
2. If P(x1,x2,...,xn) is a subgoal of Q, then P(Γ(x1),Γ(x2),...,Γ(xn)) is a subgoal of E.
Example: consider the queries
  P1: H(x,y) <- A(x,z) AND A(z,y)
  P2: H(a,b) <- A(a,c) AND A(c,d) AND A(d,b)
The heads force Γ(x) = a and Γ(y) = b.
1. Γ(z) must be c, since the first subgoal A(x,z) of P1 can only map to A(a,c) of P2.
2. Γ(z) must be d, since with Γ(y) = b the subgoal A(z,y) of P1 must become A(d,b) in P2.
These requirements conflict, so no containment mapping from P1 to P2 exists.

COMPLEXITY OF THE CONTAINMENT MAPPING TEST
It is NP-complete to decide whether there is a containment mapping from one conjunctive query to another. The importance of containment mappings is expressed by the theorem: if Q1 and Q2 are conjunctive queries, then Q2 is contained in Q1 if and only if there is a containment mapping from Q1 to Q2.

WHY THE CONTAINMENT MAPPING TEST WORKS
Two questions arise:
1. If there is a containment mapping, why must there be a containment of conjunctive queries?
2. If there is a containment, why must there be a containment mapping?

FINDING SOLUTIONS TO A MEDIATOR QUERY
Given a query Q and a candidate solution S, the expansion E of S must be contained in Q.
"If a query Q has n subgoals, then any answer produced by any solution is also produced by a solution that has at most n subgoals." This is known as the LMSS Theorem.
Example:
  Q1: Q(w,z) <- Par(w,x) AND Par(x,y) AND Par(y,z)
  S1: Q(w,z) <- V1(w,x) AND V2(x,z)
  S2: Q(w,z) <- V1(w,x) AND V2(x,z) AND V1(t,u) AND V2(u,v)
By the LMSS Theorem, the extra subgoals of S2 add no answers. The expansion of S2 is
  E2: Q(w,z) <- Par(w,x) AND Par(x,p) AND Par(p,z) AND Par(t,u) AND Par(u,q) AND Par(q,v)
and E2 is contained in E1, the expansion of S1, via the containment mapping that sends each variable of E1 to the same variable in E2.

WHY THE LMSS THEOREM HOLDS
Suppose a query Q has n subgoals and a solution S has more than n subgoals. The expansion E of S must be contained in Q, so there is a containment mapping from Q to E. Let S' be the solution obtained by removing from S every subgoal that is not a target of that containment mapping; S' has at most n subgoals. The expansion E' of S' is still contained in Q, and S is contained in S' by the identity mapping. Thus solution S is not needed among the solutions to query Q.

ENTITY RESOLUTION: INTRODUCTION
Determining whether two records or tuples do or do not represent the same person, organization, place, or other entity is called entity resolution.

DECIDING WHETHER RECORDS REPRESENT A COMMON ENTITY
Two records represent the same individual if they have similar values for each of the fields associated with them. It is not sufficient to require that the values of corresponding fields be identical, for the following reasons:
1. Misspellings
2. Variant names
3. Misunderstanding of names
4. Evolution of values
5. Abbreviations
Thus, when deciding whether two records represent the same entity, we need to look carefully at the kinds of discrepancies that occur and use a test that measures the similarity of records.

EDIT DISTANCE
A first approach to measuring the similarity of records is edit distance. Values that are strings can be compared by counting the number of insertions and deletions of characters it takes to turn one string into the other.
The records then represent the same entity if their similarity measure is below a given threshold.

NORMALIZATION
We can normalize records by replacing certain substrings by others. For instance, we can use a table of abbreviations and replace each abbreviation by what it normally stands for. Once records are normalized, we can use edit distance to measure the difference between the normalized values in the fields.

MERGING SIMILAR RECORDS
Merging means replacing two records that are similar enough by a single record that contains the information of both. There are many possible merge rules, for example:
1. Set any field in which the records disagree to the empty string.
2. (i) Merge by taking the union of the values in each field, and (ii) declare two records similar if at least two of the three fields have a nonempty intersection.
Example:
     Name   Address         Phone
  1. Susan  123 Oak St.     818-555-1234
  2. Susan  456 Maple St.   818-555-1234
  3. Susan  456 Maple St.   213-555-5678
After merging:
  (1-2-3) Susan  {123 Oak St., 456 Maple St.}  {818-555-1234, 213-555-5678}

USEFUL PROPERTIES OF SIMILARITY AND MERGE FUNCTIONS
The following properties say that the merge operation is a semilattice:
1. Idempotence: the merge of a record with itself should surely be that record.
2. Commutativity: if we merge two records, the order in which we list them should not matter.
3. Associativity: the order in which we group records for a merger should not matter.
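As an illustrative sketch (not from the chapter), the insertion/deletion edit distance described above can be computed by dynamic programming; the sample strings are hypothetical:

```python
def edit_distance(s, t):
    """Minimum number of single-character insertions and deletions
    needed to turn string s into string t (no substitutions)."""
    m, n = len(s), len(t)
    # d[i][j] = distance between the prefixes s[:i] and t[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                    # delete all i characters of s
    for j in range(n + 1):
        d[0][j] = j                    # insert all j characters of t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1]      # characters match: no edit
            else:
                d[i][j] = 1 + min(d[i - 1][j], # delete from s
                                  d[i][j - 1]) # insert into s
    return d[m][n]

print(edit_distance("Smythe", "Smith"))  # 3
```

Two records would then be declared similar when this distance, summed or averaged over their fields, falls below the chosen threshold.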
There are other properties that we expect the similarity relationship to have:
1. Idempotence for similarity: a record is always similar to itself.
2. Commutativity of similarity: in deciding whether two records are similar, it does not matter in which order we list them.
3. Representability: if r is similar to some other record s, but s is instead merged with some other record t, then r remains similar to the merger of s and t and can be merged with that record.

R-SWOOSH ALGORITHM FOR ICAR RECORDS
INPUT: A set of records I, a similarity function, and a merge function.
OUTPUT: A set of merged records O.
METHOD:
  O := emptyset;
  WHILE I is not empty DO BEGIN
    Let r be any record in I;
    Find, if possible, some record s in O that is similar to r;
    IF no such record s exists THEN
      move r from I to O
    ELSE BEGIN
      delete r from I;
      delete s from O;
      add the merger of r and s to I;
    END;
  END;

OTHER APPROACHES TO ENTITY RESOLUTION
The other approaches to entity resolution are:
1. Non-ICAR datasets
2. Clustering
3. Partitioning

Non-ICAR datasets: we can define a dominance relation r <= s meaning that record s contains all the information contained in record r. If so, we can eliminate record r from further consideration.

Clustering: we sometimes group the records into clusters such that members of a cluster are in some sense similar to each other and members of different clusters are not similar.

Partitioning: we can group the records, perhaps several times, into groups that are likely to contain similar records, and look only within each group for pairs of similar records.
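The R-Swoosh pseudocode above can be sketched in Python. The record representation (dictionaries mapping each field to a set of values) is a simplifying assumption; the similarity and merge rules follow the "at least two of three fields intersect" and "union per field" rules from the merging discussion:

```python
def similar(r, s):
    # Declare records similar if at least two fields share a value.
    return sum(1 for f in r if r[f] & s[f]) >= 2

def merge(r, s):
    # Merge by taking the union of the values in each field.
    return {f: r[f] | s[f] for f in r}

def r_swoosh(records):
    """R-Swoosh: repeatedly pick a record r from I; if r is similar to some
    record s already in O, put merge(r, s) back into I; otherwise move r to O."""
    I = list(records)
    O = []
    while I:
        r = I.pop()
        s = next((s for s in O if similar(r, s)), None)
        if s is None:
            O.append(r)              # r matches nothing seen so far
        else:
            O.remove(s)              # replace the pair by their merger
            I.append(merge(r, s))
    return O

records = [
    {"name": {"Susan"}, "addr": {"123 Oak St."},   "phone": {"818-555-1234"}},
    {"name": {"Susan"}, "addr": {"456 Maple St."}, "phone": {"818-555-1234"}},
    {"name": {"Susan"}, "addr": {"456 Maple St."}, "phone": {"213-555-5678"}},
]
result = r_swoosh(records)
print(len(result))  # 1: all three Susan records merge into a single record
```

Because the merge function here is idempotent, commutative, and associative, and the similarity test is representable, the output does not depend on the order in which records are picked from I.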