Summary
Lena Strömbäck
October 2008

Summarizing the course
- Formalities regarding the exam
- Summary of the course
- Some example problems

Student Projects
Degree projects available in:
- Contact Lena Strömbäck, lestr@ida.liu.se
  - XML standardisation
  - XML and XML databases
  - Workflow and Discovery
- Contact Patrick Lambrix, patla@ida.liu.se
  - Ontology alignment (SAMBO, KitAMO, new algorithms, visualization, connection to ontology editor)
- Contact He Tan, hetan@ida.liu.se
  - Text mining

Exam office hours (jour) - if you have questions
- Friday 10/10
  - 9-12: José
  - 13-16: Patrick
- Monday 13/10
  - 9-12: Lena
  - 13-16: He

Exam
- Practical and theoretical part
- English dictionary allowed (not electronic!)
- No other books, no calculator

EER-modelling
- Translate a mini world description into an EER-model
  - Entities, relationships, cardinalities, subclasses, weak entities, identifying relationships, total participation, ternary relationships
- Typical mistakes:
  - Mixing EER-notation with relational notation: no foreign keys as attributes in the EER-diagram, they are represented by relationships!
  - Forgotten cardinalities.

The relational model
- Basic notation: relation, tuple, attribute
- Keys
- Foreign keys
- Basic operations: select, project, cross-product, join, aggregation, ...

Definitions
- Superkey: A set of attributes uniquely identifying a tuple of a relation. (A superkey does not have to be minimal!)
- Candidate key: A set of attributes that uniquely and minimally identifies a tuple of a relation.
- Primary key: One candidate key is chosen to be the primary key.
- Prime attribute: An attribute A that is part of a candidate key X (vs. nonprime attribute)
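The key definitions above can be checked mechanically with attribute closure. The following is a minimal Python sketch and not part of the course material: the function names are made up for illustration, the brute-force search is only practical for small relations, and the functional dependencies are those of the universal relation R(PID, PersonName, Country, Continent, ContinentArea, NoOfVisitsInCountry) from the normalisation example in these slides.

```python
from itertools import combinations

# Attributes and FDs of the universal relation R from the normalisation example.
R = {"PID", "PersonName", "Country", "Continent", "ContinentArea",
     "NoOfVisitsInCountry"}
FDS = [
    ({"PID"}, {"PersonName"}),
    ({"PID", "Country"}, {"NoOfVisitsInCountry"}),
    ({"Country"}, {"Continent"}),
    ({"Continent"}, {"ContinentArea"}),
]


def closure(attrs, fds):
    """Return every attribute functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Apply an FD when its left side is covered and it adds something new.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_superkey(attrs, all_attrs, fds):
    """A superkey determines all attributes (it need not be minimal)."""
    return closure(attrs, fds) == set(all_attrs)


def candidate_keys(all_attrs, fds):
    """Brute force, smallest sets first: keep superkeys with no smaller key inside."""
    keys = []
    for size in range(1, len(all_attrs) + 1):
        for combo in combinations(sorted(all_attrs), size):
            candidate = set(combo)
            if is_superkey(candidate, all_attrs, fds) and not any(
                key <= candidate for key in keys
            ):
                keys.append(candidate)
    return keys


def prime_attributes(all_attrs, fds):
    """Prime attributes are those that occur in at least one candidate key."""
    return set().union(*candidate_keys(all_attrs, fds))
```

For example, `is_superkey({"PID", "Country", "Continent"}, R, FDS)` holds although that set is not minimal, while `candidate_keys(R, FDS)` keeps only the minimal set.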

Translation of EER-notation into relational notation
- Know how to translate: Entities, weak entities, relationships (1:N, N:M, 1:1), subclasses (all four ways and when to use which), n-ary relationships, union types, multi-valued attributes.
- Typical mistakes: forgotten primary/foreign key marking, wrong translation of subclass or N:M-relationship.

SQL
- Select
- Set-functions (union, ...)
- Where-conditions
- Group by
- Joins
  - No outer join syntax, but know the concept.
- PSQL
  - Stored procedures
  - Triggers

Data Structures
- What are indexes? What are they good for?
- What types of indexes do you know? When can they be used?
- How much memory is needed?
- Show that one or another index type performs better or is more suitable.
- Know how to calculate log2 N, logx N.

Normalisation
- Why is normalisation useful?
- Definitions of 1NF, 2NF, 3NF, BCNF
- Recognize the NF of a relation.
- Bring the relation into a higher normal form.
- (Concepts of 4NF, 5NF)

Transactions
- What is a transaction?
  - What operations does it consist of?
- What are important properties of transactions?
  - Atomicity, Consistency, Isolation, Durability
  - How are these properties achieved?
- How does a transaction update the database?
  - Read_item, write_item
- Interleaving transactions
  - Transaction schedule
  - Problems with interleaving
    - Lost update, dirty read, incorrect summary, unrepeatable read.

Transaction schedule - interleaving
- Properties of transaction schedules
  - Serial, serializable
  - (Recoverable, cascadeless and strict)
- Implementation of serialisation
  - Locking, 2PL
- Deadlock
  - What is it?
  - Protocols for detection and prevention of deadlock
- Starvation
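To connect the bullets on serialisable schedules and the lost update problem, here is a minimal Python sketch of the usual precedence-graph test for conflict serialisability. It is an illustration only, not the course's own material: the tuple encoding `(transaction, "r" or "w", item)` and all function names are assumptions made for this example.

```python
def conflicts(op1, op2):
    """Operations conflict if they come from different transactions,
    touch the same data item, and at least one of them is a write."""
    t1, action1, item1 = op1
    t2, action2, item2 = op2
    return t1 != t2 and item1 == item2 and "w" in (action1, action2)


def precedence_graph(schedule):
    """Add an edge Ti -> Tj for each conflicting pair where Ti's operation comes first."""
    edges = set()
    for i, op1 in enumerate(schedule):
        for op2 in schedule[i + 1:]:
            if conflicts(op1, op2):
                edges.add((op1[0], op2[0]))
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in the directed graph given by edges."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # back edge: the graph contains a cycle
        visiting.add(node)
        if any(visit(nxt) for nxt in graph[node]):
            return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(visit(node) for node in graph)


def is_conflict_serialisable(schedule):
    """A schedule is conflict serialisable iff its precedence graph is acyclic."""
    return not has_cycle(precedence_graph(schedule))


# Lost update pattern: both transactions read X before either one writes it.
LOST_UPDATE = [("T1", "r", "X"), ("T2", "r", "X"), ("T1", "w", "X"), ("T2", "w", "X")]
# The same operations executed serially: T1 completely before T2.
SERIAL = [("T1", "r", "X"), ("T1", "w", "X"), ("T2", "r", "X"), ("T2", "w", "X")]
```

In the lost update schedule the graph contains both T1 -> T2 and T2 -> T1, so no equivalent serial order exists; in the serial schedule all edges point from T1 to T2.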

Database recovery
- Main reasons for database failure
- Understand: Backup, Logfile, Checkpoint, Commit, Rollback
- Principles for recovery
  - Main failure: Backup + Logfile
  - Minor failure: Logfile (Undo/Redo)
- Use of cache memory
  - Why and how?
  - Update strategies (deferred and immediate)
  - In-place and shadow paging
  - How does this affect database recovery?

Query optimisation
- Relational algebra
- Costs in query processing
- Heuristic query optimisation
  - What does it optimise?
  - How does it work?
  - Demonstrate by example
  - Estimate efficiency of the optimisation
- Query plans and algorithms

Indexes
Assume an ordered file whose ordering field is a key. The file has 15000 records of size 150 bytes each. The disk block is of size 512 bytes (unspanned allocation). The key field is 10 bytes, block and record pointer sizes are both 40 bytes.
a) How many blocks are needed to store the file?
b) The database designer wants to make an index on the key field. Which kind of index is suitable? Make a sketch of the index and calculate the number of blocks needed.
c) What happens if we want to make the index on another field that is not the key?
d) To further speed up the data access, the database designer wants to organize the index in b) as a B+-tree. What is a suitable order of the tree? How many data accesses will be needed using the B+-tree?

Database recovery
Study the following log file, originating from a database manager using immediate update:
    Start-transaction T1
    Write-item T1, B, 60
    Start-transaction T3
    Write-item T1, A, 50
    Commit T1
    Write-item T3, C, 25
    Checkpoint
    Write-item T3, D, 10
    Commit T3
    Start-transaction T4
    Write-item T4, B, 70
    Start-transaction T5
    Write-item T5, D, 10
    Commit T5
    System crash
- Which variant of immediate update must have been used? Why?
- Describe what happens to the four transactions during the recovery.
  (UNDO, REDO or nothing)
- What is the value of each of the four variables A, B, C, D after the recovery?

Heuristic Optimization
SQL-example query:
    SELECT E.LNAME
    FROM EMPLOYEE E, WORKS_ON W, PROJECT P
    WHERE P.PNAME = 'Aquarius' AND P.PNUMBER = W.PNO AND W.ESSN = E.SSN AND E.BDATE > '1957-12-31'

Definitions
- Superkey: A set of attributes uniquely identifying a tuple of a relation. (A superkey does not have to be minimal!)
- Key: A set of attributes that uniquely and minimally identifies a tuple of a relation.
- Candidate key: If there is more than one key in a relation, the keys are called candidate keys.
- Primary key: One candidate key is chosen to be the primary key.
- Prime attribute: An attribute A that is part of a candidate key X (vs. nonprime attribute)

1NF
- 1NF: The relation should have no non-atomic values.

R_non1NF
ID   Name        LivesIn
100  Pettersson  {Stockholm, Linköping}
101  Andersson   {Linköping}
102  Svensson    {Ystad, Hjo, Berlin}

Normalization gives:

R1_1NF
ID   Name
100  Pettersson
101  Andersson
102  Svensson

R2_1NF
ID   LivesIn
100  Stockholm
100  Linköping
101  Linköping
102  Ystad
102  Hjo
102  Berlin

2NF
- 2NF: No nonprime attribute should be functionally dependent on a part of any candidate key.

R_non2NF
EmpID  Dept     Work%  EmpName
100    Dev      50     Baker
100    Support  50     Baker
200    Dev      80     Miller

Normalization gives:

R1_2NF
EmpID  EmpName
100    Baker
200    Miller

R2_2NF
EmpID  Dept     Work%
100    Dev      50
100    Support  50
200    Dev      80

3NF
- 3NF: 2NF + no nonprime attribute should be functionally dependent on another nonprime attribute (= no transitive dependency)

Boyce-Codd Normal Form
BCNF: Every determinant is a superkey.
At a gym, an activity takes place in a certain room at a certain time. Each activity always takes place in the same room.

R_nonBCNF
Time       Room     Activity
Mon 17.00  Gym      IronWoman
Mon 17.00  Mirrors  Aerobics
Tue 17.00  Gym      Intro
Tue 17.00  Mirrors  Aerobics
Wed 18.00  Gym      IronWoman

3NF example

R_non3NF
ID   Name       Zip    City
100  Andersson  58214  Linköping
101  Björk      10223  Stockholm
102  Carlsson   58214  Linköping

Normalization gives:

R1_3NF
ID   Name       Zip
100  Andersson  58214
101  Björk      10223
102  Carlsson   58214

R2_3NF
Zip    City
58214  Linköping
10223  Stockholm

Normalisation
Given the universal relation
    R(PID, PersonName, Country, Continent, ContinentArea, NoOfVisitsInCountry)
- How does one find the functional dependencies?
- What is a key of R?

Normalisation - 4NF
- 4NF: A relation should not contain two or more independent multi-valued facts about an entity.

Person  Skill    Language
John    Cooking  English
John    Cooking  French
Mary    Nothing  English

Person → {Skill}, {Language} should result in R1(Person, Skill) and R2(Person, Language),
which it does if your relations come from the ER model:
[ER diagram: Person (n) has (m) Skill; Person (n) has (m) Language]

Normalisation - 5NF
- 5NF: A relation is in 5NF when its information content cannot be reconstructed from several smaller relations.
- Usually relevant if the table has the form R(A, B, C, ...) and there are subtle dependencies between the attributes.
- Example: Agents sell products, Companies sell products, and Agents represent Companies.
  Subtle dependency: Agents sell only products from the companies that they represent.

Agent   Company  Product
Smith   GM       Car
Smith   GM       Truck
Baker   Ford     Car
Miller  GM       Car
Miller  Ford     Car
Miller  Ford     Truck

The relation is in BCNF and 4NF but not 5NF. It is in 4NF because it is not enough to split it into two relations. We do not only have that Agent → {Company}, {Product} xor Company → {Product}, {Agent} xor Product → {Company}, {Agent}. We have all of them. All three attributes are dependent on each other (symmetric constraint). But there is still redundancy: Miller repeats Car, Smith repeats GM.
Not allowed because of subtle dependency.

Normalisation - 5NF

Agent   Company  Product
Smith   GM       Car
Smith   GM       Truck
Baker   Ford     Car
Miller  GM       Car
Miller  Ford     Car

Split into three relations. It looks like more data, but check: if the new agent Johnson sells everything from GM and Ford, 2*3 rows have to be added to R1, but only 2+3 rows have to be added to R2+R3.

Company  Product
GM       Car
GM       Truck
GM       Caterpillar
Ford     Car
Ford     Caterpillar

Agent   Company
Smith   GM
Baker   Ford
Miller  GM
Miller  Ford

Agent   Product
Smith   Car
Smith   Truck
Baker   Car
Miller  Car

Functional dependencies of R:
    PID → PersonName
    PID, Country → NoOfVisitsInCountry
    Country → Continent
    Continent → ContinentArea
Given these FDs, what is the key for R? (= Can we find an FD that contains all attributes?)
Use the inference rules and go ahead.

Country → Continent and Continent → ContinentArea lead to
    Country → Continent, ContinentArea                (transitive rule)
    PID, Country → Continent, ContinentArea           (augmentation rule)
    PID, Country → PersonName                         (augmentation rule, from PID → PersonName)
Together with the given PID, Country → NoOfVisitsInCountry, these lead to
    PID, Country → Continent, ContinentArea, PersonName, NoOfVisitsInCountry   (additive rule)
Thus PID, Country is a key of R.
This step was already given in the normalisation lab.

Is R(PID, Country, Continent, ContinentArea, PersonName, NoOfVisitsInCountry) in 2NF?
No, because PersonName is only fully functionally dependent (FFD) on PID, thus
    R1(PID, PersonName)
    R2(PID, Country, Continent, ContinentArea, NoOfVisitsInCountry)
Is R2 in 2NF?
No, because Continent and ContinentArea are only FFD on Country, thus
    R1(PID, PersonName)
    R21(Country, Continent, ContinentArea)
    R22(PID, Country, NoOfVisitsInCountry)
→ R1, R21, R22 are now in 2NF

Are R1, R21, R22 in 3NF?
R22(PID, Country, NoOfVisitsInCountry), R1(PID, PersonName):
    Yes, because there is only one non-key attribute.
R21(Country, Continent, ContinentArea):
    No, because Continent determines ContinentArea, thus
    R211(Country, Continent)
    R212(Continent, ContinentArea)

Are R1, R22, R211, R212 in BCNF?
    R22(PID, Country, NoOfVisitsInCountry), R1(PID, PersonName)
    R211(Country, Continent)
    R212(Continent, ContinentArea)
→ Yes (do not be fooled by candidate keys!)

Normalization - practical

tblInvoice
    InvoiceID     number(10) NOT NULL;   (PK)
    CustomerID
    ...

tblInvoiceRow
    InvoiceRowID  number(10) NOT NULL;   (PK)
    InvoiceID     number(10) NOT NULL;   (FK)
    Item          varchar2(100);
    ItemCost      number(5);

Number of invoice rows: 1.6 million

    SELECT InvoiceID,
           (SELECT SUM(ItemCost) FROM tblInvoiceRow
            WHERE tblInvoiceRow.InvoiceID=tblInvoice.InvoiceID) AS TotalCost
    FROM tblInvoice;

Execution Plan
Plan hash value: 2416057354

| Id  | Operation           | Name          | Rows  | Bytes | Cost (%CPU) | Time     |
|   0 | SELECT STATEMENT    |               |     5 |    65 |     2   (0) | 00:00:01 |
|   1 |  SORT AGGREGATE     |               |     1 |    26 |             |          |
|*  2 |   TABLE ACCESS FULL | TBLINVOICEROW | 13739 |  348K |   519   (4) | 00:00:08 |
|   3 |  TABLE ACCESS FULL  | TBLINVOICE    |     5 |    65 |     2   (0) | 00:00:01 |

Normalization - practical

tblInvoice
    InvoiceID     number(10) NOT NULL;   (PK)
    TotalCost     number(10);
    CustomerID
    ...

tblInvoiceRow
    InvoiceRowID  number(10) NOT NULL;   (PK)
    InvoiceID     number(10) NOT NULL;   (FK)
    Item          varchar2(100);
    ItemCost      number(5);

Trigger that updates TotalCost

Number of invoice rows: 1.6 million

    SELECT InvoiceID, TotalCost FROM tblInvoice;

Execution Plan
Plan hash value: 2165970884

| Id  | Operation          | Name       | Rows | Bytes | Cost (%CPU) | Time     |
|   0 | SELECT STATEMENT   |            |    5 |   130 |     2   (0) | 00:00:01 |
|   1 |  TABLE ACCESS FULL | TBLINVOICE |    5 |   130 |     2   (0) | 00:00:01 |

Heuristic Optimization - Canonical Form

    π LNAME
      σ PNAME='Aquarius' AND PNUMBER=PNO AND ESSN=SSN AND BDATE>'1957-12-31'
        ×
          ×
            EMPLOYEE
            WORKS_ON
          PROJECT

Transaction schedules - properties
- Serial: Transactions are executed one after another.
- Serialisable: Conflict equivalent to a serial schedule.
  - Look for write-read, write-write conflicts!
- Recoverable: Never roll back a committed transaction.
  - Look for the commit points when one transaction reads after another transaction writes.
- Cascadeless: Never a cascading rollback.
  - If any transaction writes, no other transaction is allowed to read until after the commit of the first transaction.
- Strict: As cascadeless, but also look for writes.
- Read/Write: always on the same data item.

Heuristic Optimization - Move Select Down

    π LNAME
      σ PNUMBER=PNO
        ×
          σ ESSN=SSN
            ×
              σ BDATE>'1957-12-31'
                EMPLOYEE
              WORKS_ON
          σ PNAME='Aquarius'
            PROJECT

Heuristic Optimization - Apply Most Restrictive Select First

    π LNAME
      σ ESSN=SSN
        ×
          σ PNUMBER=PNO
            ×
              σ PNAME='Aquarius'
                PROJECT
              WORKS_ON
          σ BDATE>'1957-12-31'
            EMPLOYEE

Heuristic Optimization - Convert Cartesian Product/Select with Join

    π LNAME
      ⋈ ESSN=SSN
        ⋈ PNUMBER=PNO
          σ PNAME='Aquarius'
            PROJECT
          WORKS_ON
        σ BDATE>'1957-12-31'
          EMPLOYEE

Heuristic Optimization - Move Projections Down the Tree

    π LNAME
      ⋈ ESSN=SSN
        π ESSN
          ⋈ PNUMBER=PNO
            π PNUMBER
              σ PNAME='Aquarius'
                PROJECT
            π ESSN,PNO
              WORKS_ON
        π SSN,LNAME
          σ BDATE>'1957-12-31'
            EMPLOYEE
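
The effect of the heuristic steps can be made concrete by evaluating the canonical plan and an optimised plan on a tiny data set and comparing the sizes of the intermediate results. The following Python sketch is only an illustration: the sample rows are invented (only the table and attribute names come from the example query), the helper functions are hypothetical, and each join is simulated as a cross product followed by a selection, counting only the result that is passed on to the next operator.

```python
from itertools import product

# Invented sample data; only the table and attribute names come from the example query.
EMPLOYEE = [
    {"SSN": 1, "LNAME": "Smith", "BDATE": "1965-01-09"},
    {"SSN": 2, "LNAME": "Wong", "BDATE": "1955-12-08"},
    {"SSN": 3, "LNAME": "English", "BDATE": "1972-07-31"},
]
WORKS_ON = [
    {"ESSN": 1, "PNO": 10},
    {"ESSN": 2, "PNO": 10},
    {"ESSN": 3, "PNO": 20},
    {"ESSN": 3, "PNO": 10},
]
PROJECT = [
    {"PNUMBER": 10, "PNAME": "Aquarius"},
    {"PNUMBER": 20, "PNAME": "Pisces"},
]


def cross(r, s):
    """Cartesian product of two lists of rows (attribute names are disjoint here)."""
    return [{**a, **b} for a, b in product(r, s)]


def select(r, predicate):
    """Keep the rows that satisfy the predicate."""
    return [row for row in r if predicate(row)]


def project(r, attrs):
    """Project onto attrs, remove duplicates, and sort for a stable result."""
    return sorted({tuple(row[a] for a in attrs) for row in r})


def canonical_plan():
    """Cross products first, then one big selection, then the projection."""
    step1 = cross(EMPLOYEE, WORKS_ON)
    step2 = cross(step1, PROJECT)
    step3 = select(
        step2,
        lambda t: t["PNAME"] == "Aquarius" and t["PNUMBER"] == t["PNO"]
        and t["ESSN"] == t["SSN"] and t["BDATE"] > "1957-12-31",
    )
    return project(step3, ["LNAME"]), len(step1) + len(step2)


def optimised_plan():
    """Selections pushed down, most restrictive first, products turned into joins."""
    p = select(PROJECT, lambda t: t["PNAME"] == "Aquarius")
    e = select(EMPLOYEE, lambda t: t["BDATE"] > "1957-12-31")
    join1 = select(cross(p, WORKS_ON), lambda t: t["PNUMBER"] == t["PNO"])
    join2 = select(cross(join1, e), lambda t: t["ESSN"] == t["SSN"])
    return project(join2, ["LNAME"]), len(join1) + len(join2)
```

Both plans return the same last names, but on this data the canonical plan passes 12 + 24 intermediate rows between operators while the optimised plan passes 3 + 2. A real query optimiser would use cost estimates and proper join algorithms rather than this simulation.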