Summary Student Projects

advertisement
Student Projects
Degree projects available in
Contact Lena Strömbäck, lestr@ida.liu.se
¾ XML standardisation
¾ XML and XML databases
¾ Workflow and Discovery
Summary
Contact Patrick Lambrix, patla@ida.liu.se
¾ Ontology alignment (SAMBO, KitAMO, new algorithms, visualization, connection
to ontology editor)
Lena Strömbäck
Contact He Tan, hetan@ida.liu.se
¾ Text mining
oktober 2008
Summarizing the course
2
Exam Jour – if you have questions.
¾ Friday 10/10
¾ 9-12 – José
¾ 13-16 – Patrick
¾ Formalities regarding the exam
¾ Summary of the course
¾ Some example problems
¾ Monday 13/10
¾ 9-12 - Lena
¾ 13-16 - He
oktober 2008
3
oktober 2008
4
Exam
EER-modelling
Practical and theoretical part
¾ Translate a mini world description into an
EER-model
¾ Entities, relationships, cardinalities,
subclasses, weak entities, identifying
relationships, total participation, ternary
relationships
English dictionary allowed. (Not electronic!)
No other books, no calculator.
¾ Typical mistakes: Mix EER-notation with
relational notation:
¾ No foreign keys as attributes in the EER-diagram,
they are represented by relationships!
¾ Forgotten cardinalities.
oktober 2008
5
oktober 2008
6
1
oktober 2008
The relational model
Definitions
¾
¾
¾
¾
¾ Superkey: a set of attributes uniquely identifying a tuple of a
relation. (A superkey does not have to be minimal!)
¾ Candidate key:: A set of attributes that uniquely and minimally
identifies a tuple of a relation.
¾ Primary key: One candidate key is chosen to be the primary
key.
¾ Prime attribute: An attribute A that is part of a candidate key X
(vs. nonprime attribute)
Basic notation: relation, tuple, attribute
Keys
Foreign keys
Basic operations: select, project, cross-product, join,
aggregation ….
7
oktober 2008
8
Translation of EER-notation into
relational notation
SQL
¾ Know how to translate: Entities, weak entities, relationships
(1:N, N:M, 1:1), subclasses (all four ways and when to use
which), n-ary relationships, union types, multi-valued attributes.
¾
¾
¾
¾
¾
¾ Typical mistakes: forgotten primary/foreign key marking, wrong
translation of subclass or N:M-relationship.
Select
Set-functions (union, …)
Where-conditions
Group by
Joins
¾ No outer join syntax but the concept of.
¾ PSQL
¾ Stored procedures
¾ Triggers
oktober 2008
oktober 2008
9
oktober 2008
10
Normalisation
Data Structures
¾
¾
¾
¾
¾
¾ What are indexes? What are they good for? What types of
indexes do you know? When can they be used?
¾ How much memory are needed?
¾ Show that one or another index type performs better or is more
suitable.
¾ Know how to calculate log2 N, logx N.
11
Why is normalisation useful?
Definitions of 1NF, 2NF, 3NF, BCNF
Recognize the NF of a relation.
Bring the relation in a higher normal form.
(Concepts of 4NF, 5NF)
oktober 2008
12
2
Transactions
Transaction schedule – interleaving
¾ What is a transaction?
¾ Properties of transaction schedules
¾ What operations does it consist of?
¾ Serial, serializable
¾ (Recoverable, cascadeless and strict)
¾ What are important properties of transactions?
¾ Atomicity, Consistency, Isolation, Durable
¾ How are these properties achieved
¾ Implementation of serialisation
¾ Locking, 2PL
¾ How does a transaction update the database?
¾ Deadlock
¾ Read_item, write_item
¾ Interleaving transactions
¾ What is it
¾ Protocols for detection and prevention of deadlock
¾ Transaction schedule
¾ Problems with interleaving
¾ Starvation
¾ Lost update, dirty read, incorrect summary, unrepeatable read.
oktober 2008
13
oktober 2008
14
Database recovery
Query optimisation
¾ Main reasons for database failure
¾ Understand: Backup, Logfile, Checkpoint, Commit, Rollback
¾ Principles for recovery
¾ Relational algebra
¾ Costs in query processing
¾ Heuristic query optimisation
¾ Main failure: Backup+Logfile
¾ Minor failure: Logfile (Undo/Redo)
¾
¾
¾
¾
¾ Use of cache memory
¾
¾
¾
¾
oktober 2008
Why and how?
Update strategies (deferred and immediate)
In-place and shadow paging
How does this affect database recovery?
15
¾ Query plans and algorithms
oktober 2008
Indexes
Study the following log file, origin from a database manager using immediate
update:
Start-transaction T1
Write-item T1, B, 60
Start-transaction T3
Write-item T1, A, 50
Commit T1
Write-item T3, C, 25
Checkpoint
Write-item T3, D, 10
Commit T3
Start-transaction T4
Write-item T4, B, 70
Start-transaction T5
Write-item T5, D, 10
Commit T5
System crash
¾ How many blocks are needed to store the file?
¾ The database designer wants to make an index on the key field.
Which kind of index is suitable? Make a sketch of the index and
calculate the number of blocks needed.
¾ What happens if we want to make the index on another field that is
not the key?
¾ To further speed up the data access, the database designer want to
organize the index in b) as a B+-tree. What is a suitable order of
the tree? How many data accesses will be needed using the B+tree?
17
16
Database recovery
Assume an ordered file whose ordering field is a key. The file
has 15000 records of size 150 bytes each. The disk block is of
size 512 bytes (unspanned allocation). The key field is 10
bytes, block and record pointer sizes are both 40 bytes.
oktober 2008
What does it optimise?
How does it work?
Demonstrate by example
Estimate efficiency of the optimisation
Which variant of immediate update must have been used? Why?
Describe what happens to the four transactions during the recovery. (UNDO,
REDO or nothing)
What is the value of each of the four variables, A, B, C, D after the recovery?
oktober 2008
18
3
Heuristic Optimization
Definitions
¾ Superkey: a set of attributes uniquely identifying a tuple of a
relation. (A superkey does not have to be minimal!)
¾ Key: A set of attributes that uniquely and minimally identifies a
tuple of a relation.
¾ Candidate key: If there is more than one key in a relation, the
keys are called candidate keys.
¾ Primary key: One candidate key is chosen to be the primary
key.
¾ Prime attribute: An attribute A that is part of a candidate key X
(vs. nonprime attribute)
SQL-example query
SELECT E.LNAME
FROM EMPLOYEE E, WORKS_ON W, PROJECT P
WHERE P.PNAME = ‘Aquarius’
AND P.PNUMBER = W.PNO
AND W.ESSN = E.SSN
AND E.BDATE > ‘1957-12-31’
oktober 2008
19
oktober 2008
1NF
20
2NF
1NF: The relation should have no
non-atomic values.
¾ 2NF: no nonprime attribute should be functionally dependent on
a part of any candidate key.
Rnon1NF
Rnon2NF
ID
Name
LivesIn
100
Pettersson
{Stockholm, Linköping}
101
Andersson
{Linköping}
102
Svensson
{Ystad, Hjo, Berlin}
R1NF2
R1NF1
Normalization
oktober 2008
ID
LivesIn
100
Stockholm
ID
Name
100
Linköping
100
Pettersson
101
Linköping
101
Andersson
102
Ystad
102
Svensson
102
Hjo
102
Berlin
21
EmpID
Dept
Work%
EmpName
100
Dev
50
Baker
100
Support
50
Baker
200
Dev
80
Miller
R22NF
R12NF
Normalization
oktober 2008
Dept
Work%
EmpID
EmpName
100
Dev
50
100
Baker
100
Support
50
200
Miller
200
Dev
80
22
3NF
Boyce-Codd Normal Form
¾ 3NF: 2NF + no nonprime attribute should be functionally
dependent on another nonprime attribute (= no transitive
dependency)
BCNF: Every determinant is a superkey
At a gym, an activity takes places in
a certain room at a certain time.
For each activity it allways take place in the same room.
Rnon3NF
ID
Name
Zip
City
100
Andersson
58214
Linköping
101
Björk
10223
Stockholm
102
Carlsson
58214
Linköping
RnonBCNF
R13NF
ID
Normalization
oktober 2008
EmpID
23
Name
R23NF
Zip
Zip
Time
Room
Activity
Mon 17.00
Gym
IronWoman
Mon 17.00
Mirrors
Aerobics
Tue 17.00
Gym
Intro
Tue 17.00
Mirrors
Aerobics
Wed 18.00
Gym
IronWoman
City
100
Andersson
58214
58214
Linköping
101
Björk
10223
10223
Stockholm
102
Carlsson
58214
oktober 2008
24
4
Normalisation
Given the universal relation
R(PID, PersonName,
Country, Continent, ContinentArea, NoOfVisitsInCountry)
How does one find the functional dependencies?
What is a key of R?
oktober 2008
25
oktober 2008
26
Boyce-Codd Normal Form
Normalisation – 4NF
BCNF: Every determinant is a superkey
¾ 4NF: A relation should not contain two or more independent
multi-valued facts about an entity.
At a gym, an activity takes places in
a certain room at a certain time.
For each activity it allways take place in the same room.
RnonBCNF
Time
Room
Activity
Mon 17.00
Gym
IronWoman
Mon 17.00
Mirrors
Aerobics
Tue 17.00
Gym
Intro
Tue 17.00
Mirrors
Aerobics
Wed 18.00
Gym
IronWoman
Person
Skill
Language
John
Cooking
English
John
Cooking
French
Mary
Nothing
English
Person Æ {Skill}, {Language} should result in
R1(Person, Skill) and R2(Person, Language)
Which it does if your relations come from the
ER model:
Person
Skill
n
has
m
oktober 2008
27
oktober 2008
n
has
m
Language
Language
Skill
oktober 2008
Person
28
Normalisation – 5NF
Normalisation – 5NF
¾ 5NF: A relation is in 5NF when its information content cannot
be reconstructed from several smaller relations.
¾ Usually relevant if the table has the form:
R(A, B, C, …) and there are subtle dependencies between the
attributes.
¾ Example: Agents sell products, Companies sell products, and
Agents represent Companies. Subtle dependency: Agents sell
only products from the companies that they represent.
¾ Example: Agents sell products, Companies sell products, and
Agents represent Companies. Subtle dependency: Agents sell
only products from the companies that they represent.
29
oktober 2008
Agent
Company
Smith
GM
Car
Smith
GM
Truck
Baker
Ford
Car
Miller
GM
Miller
Ford
Car
Miller
Ford
Truck
30
Product
Car
The relation is in BCNF and 4NF but not 5NF.
4NF because it is not enough to split it in two relations.
We do not only have that AgentÆ{Company}, {Product} xor
Company Æ{Product}, {Agent} xor Product Æ {Company},
{Agent}. We have all of them.
All three attributes are dependent on each other
(symmetric constraint).
But there is still redundancy: Miller repeats car, Smith repeats
GM.
Not allowed because of subtle dependency.
5
Normalisation – 5NF
Agent
Company
Product
Smith
GM
Car
Smith
GM
Truck
Baker
Ford
Car
Miller
GM
Car
Miller
Ford
Car
Split in three relations.
Looks more but check that if the
new agent Johnson sells
everything from GM and Ford,
2*3 rows have to be added to R1,
but only 2+3 rows have to be
added in R2+R3.
Company
oktober 2008
PID Æ PersonName
PID, Country Æ NoOfVisitsInCountry
Country Æ Continent
Continent Æ ContinentArea
Given these FDs what is the key for R? (=Can we find a
FD that contains all attributes?)
Use the inference rules and go ahead.
Product
Agent
Company
GM
Car
Agent
Smith
GM
GM
Truck
Smith
Product
Car
Baker
Ford
GM
Caterpillar
Smith
Truck
Miller
GM
Ford
Car
Baker
Car
Miller
Ford
Ford
Caterpillar
Miller
Car
31
oktober 2008
32
Is R(PID, Country, Continent, ContinentArea, PersonName,
NoOfVisitsInCountry) in 2NF?
Country Æ Continent and Continent Æ ContinentArea lead to
Country Æ Continent, ContinentArea (transitive rule)
PID, Country Æ Continent, ContinentArea (augmentation rule)
PID, Country Æ PersonName (augmentation rule)
No, because PersonName is only FFD on PID, thus
R1(PID, PersonName)
R2(PID, Country, Continent, ContinentArea, NoOfVisitsInCountry)
PID, Country Æ NoOfVisitsInCountry
lead to
Is R2 in 2NF?
No, because Continent and ContinentArea are only FFD on Country, thus
R1(PID, PersonName)
R21(Country, Continent, ContinentArea)
R22(PID, Country, NoOfVisitsInCountry)
Æ R1, R21, R22 are now in 2NF
PID, Country Æ Continent, ContinentArea, PersonName, NoOfVisitsInCountry
(additive rule)
Thus Person, Country is a key of R.
This step was already given in the normalisation lab.
oktober 2008
33
oktober 2008
Are R1, R21, R22 in 3NF?
Are R1, R22, R211, R212 in BCNF?
R22(PID, Country, NoOfVisitsInCountry),
R1(PID, PersonName):
Yes, because there is only one non-key attribute.
R22(PID, Country, NoOfVisitsInCountry),
R1(PID, PersonName):
R211(Country, Continent)
R212(Continent, ContinentArea)
R21(Country, Continent, ContinentArea):
No, because Continent determines ContinentArea, thus
R211(Country, Continent)
R212(Continent, ContinentArea)
oktober 2008
34
35
Æ Yes (do not be fooled by candidate keys!)
oktober 2008
36
6
Normalization - practical
tblInvoice
Normalization - practical
tblInvoiceRow
InvoiceID
number(10) NOT NULL;
CustomerID ...
InvoiceRowID
InvoiceID
Item
ItemCost
(PK)
tblInvoice
number(10) NOT NULL;
number(10) NOT NULL;
varchar2(100);
number(5);
(PK)
(FK)
tblInvoiceRow
InvoiceID
number(10) NOT NULL;
TotalCost
number(10);
CustomerID ...
(PK)
InvoiceRowID
InvoiceID
Item
ItemCost
number(10) NOT NULL;
number(10) NOT NULL;
varchar2(100);
number(5);
(PK)
(FK)
TRIGGER SOM UPPDATERAR TotalCost
Antal fakturarader:
1,6 miljoner
SELECT InvoiceID,
(SELECT SUM(ItemCost) FROM tblInvoiceRow
WHERE tblInvoiceRow.InvoiceID=tblInvoice.InvoiceID)
AS TotalCost
FROM tblInvoice;
SELECT InvoiceID, TotalCost FROM tblInvoice;
Antal fakturarader:
1,6 miljoner
Execution Plan
---------------------------------------------------------Plan hash value: 2416057354
Execution Plan
---------------------------------------------------------Plan hash value: 2165970884
-----------------------------------------------------------------------------------| Id | Operation
| Name
| Rows | Bytes | Cost (%CPU)| Time
|
-----------------------------------------------------------------------------------|
0 | SELECT STATEMENT
|
|
5 |
65 |
2
(0)| 00:00:01 |
|
1 | SORT AGGREGATE
|
|
1 |
26 |
|
|
|* 2 |
TABLE ACCESS FULL| TBLINVOICEROW | 13739 |
348K|
519
(4)| 00:00:08 |
|
3 | TABLE ACCESS FULL | TBLINVOICE
|
5 |
65 |
2
(0)| 00:00:01 |
------------------------------------------------------------------------------------
oktober 2008
-------------------------------------------------------------------------------| Id | Operation
| Name
| Rows | Bytes | Cost (%CPU)| Time
|
-------------------------------------------------------------------------------|
0 | SELECT STATEMENT |
|
5 |
130 |
2
(0)| 00:00:01 |
|
1 | TABLE ACCESS FULL| TBLINVOICE |
5 |
130 |
2
(0)| 00:00:01 |
--------------------------------------------------------------------------------
37
oktober 2008
38
Heuristic Optimization –
Canonical Form
π
Transaction schedules – properties
LNAME
¾ Serial: Transactions are executed after each other
¾ Serialisable: Konflict equivalent to a serial schedule
¾ Look for write-read, write-write conflicts!
¾ Recoverable: Never rollback a sommitted transaction
σPNAME=‘Aquarius’ AND PNUMBER=PNO AND ESSN=SSN AND BDATE>’1957-12-31’
¾ Look for the commit points when one transaction reads after
another transaction writes.
¾ Cascadeless: Never cascading rollback
¾ If any transaction writes no other transaction is allowed to read
until after commit of the first transaction.
X
¾ Strict: As cascadeles, but also look for write
¾ Read/Write – always on the same data item.
PROJECT
X
WORKS_ON
EMPLOYEE
oktober 2008
39
oktober 2008
Heuristic Optimization –
Move Select Down
π
40
Heuristic Optimization –
Apply Most Restrictive Select
πLNAME First
LNAME
σPNUMBER=PNO
σESSN=SSN
X
X
σESSN=SSN
σPNAME=‘Aquarius’
σPNUMBER=PNO
X
PROJECT
X
σBDATE>’1957-12-31’
σPNAME=‘Aquarius’
WORKS_ON
41
EMPLOYEE
WORKS_ON
PROJECT
EMPLOYEE
oktober 2008
σBDATE>’1957-12-31’
oktober 2008
42
7
Heuristic Optimization – Convert
Cartesian Product/Select
π with Join
Heuristic Optimization –
Move Projections Downπthe
Tree
LNAME
LNAME
ESSN=SSN
ESSN=SSN
πESSN
PNUMBER=PNO
σBDATE>’1957-12-31’
PNUMBER=PNO
πPNUMBER
EMPLOYEE
σPNAME=‘Aquarius’
WORKS_ON
σPNAME=‘Aquarius’
PROJECT
oktober 2008
43
πSSN,LNAME
πESSN,PNO
σBDATE>’1957-12-31’
EMPLOYEE
WORKS_ON
PROJECT
oktober 2008
44
8
Download