Outlook and Summary
Lena Strömbäck
October 2007

Today's lecture
• Research outlook – What do we do?
• Formalities regarding the exam
• Summary of the course
• Some example problems
Semantic Web, Databases and Information Integration
Patrick Lambrix, Lena Strömbäck, He Tan
Databases and Web Information Systems group
Institutionen för datavetenskap
Linköpings universitet

GET THAT PROTEIN!
Accessing data sources on the web
Which?
Where?
How?
[Figure: DISCOVERY. Labels: Genomics; Clinical trials; Disease information; Chemical structure; Metabolism, toxicology; Disease models; Target structure]

The Semantic Web
W3C: Facilities to put machine-understandable data on the Web are becoming a high priority for many communities. The Web can reach its full potential only if it becomes a place where data can be shared and processed by automated tools as well as by people. For the Web to scale, tomorrow's programs must be able to share and process data even when these programs have been designed totally independently.
The Semantic Web is a vision: the idea of having data on the web defined and linked in a way that it can be used by machines not just for display purposes, but for automation, integration and reuse of data across various applications.
Key mechanisms for the Semantic Web:
Ontologies and standards

Ontologies and data representation standards
Ontologies define the basic terms and relations comprising the vocabulary of a topic area, as well as the rules for combining terms and relations to define extensions to the vocabulary.
(Neches, Fikes, Finin, Gruber, Patil, Senator)
To allow reuse of actual data, standardized representations are often defined for different areas. Such a representation corresponds to the data model of a database.
Overlapping ontologies
GENE ONTOLOGY (GO)
SIGNAL-ONTOLOGY (SigO)
immune response
i- acute-phase response
i- anaphylaxis
i- antigen presentation
i- antigen processing
i- cellular defense response
i- cytokine metabolism
i- cytokine biosynthesis
synonym cytokine production
…
p- regulation of cytokine biosynthesis
…
…
i- B-cell activation
i- B-cell differentiation
i- B-cell proliferation
i- cellular defense response
…
i- T-cell activation
i- activation of natural killer cell activity
…
Immune Response
i- Allergic Response
i- Antigen Processing and Presentation
i- B Cell Activation
i- B Cell Development
i- Complement Signaling
synonym complement activation
i- Cytokine Response
i- Immune Suppression
i- Inflammation
i- Intestinal Immunity
i- Leukotriene Response
i- Leukotriene Metabolism
i- Natural Killer Cell Response
i- T Cell Activation
i- T Cell Development
i- T Cell Selection in Thymus
Legend: equivalent concepts, equivalent relations, is-a relation

Ontologies and ontology alignment
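As a rough illustration of what "equivalent concepts" means for the GO and SigO excerpts above, the following Python sketch pairs terms whose names agree after simple normalization. It is only an assumed, simplistic string matcher written for this example. It is not the algorithm used by SAMBO or KitAMO, and the two term lists are short samples copied from the excerpts.

```python
import re


def normalize(term: str) -> str:
    """Lowercase a term and treat hyphens, underscores and runs of spaces alike."""
    return re.sub(r"[-_\s]+", " ", term).strip().lower()


def name_matches(onto_a, onto_b):
    """Return pairs (a, b) whose normalized names are identical."""
    index = {normalize(b): b for b in onto_b}
    return [(a, index[normalize(a)]) for a in onto_a if normalize(a) in index]


# Sample terms taken from the two excerpts (not the complete ontologies).
go_terms = [
    "immune response",
    "B-cell activation",
    "T-cell activation",
    "antigen processing",
    "cytokine metabolism",
]
sigo_terms = [
    "Immune Response",
    "B Cell Activation",
    "T Cell Activation",
    "Antigen Processing and Presentation",
    "Cytokine Response",
]

matches = name_matches(go_terms, sigo_terms)
for go_term, sigo_term in matches:
    print(f"{go_term}  <->  {sigo_term}")
```

Exact name matching finds only the three spelling variants. Pairs such as "antigen processing" and "Antigen Processing and Presentation" need more than string equality, which is where dedicated alignment strategies come in.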
Merging ontologies
GENE ONTOLOGY (GO) and SIGNAL-ONTOLOGY (SigO), as listed under "Overlapping ontologies", shown next to The New Ontology:

The New Ontology
immune response
i- acute-phase response
i- Allergic Response
i- anaphylaxis
i- Antigen Processing and Presentation
i- antigen presentation
i- antigen processing
i- cellular defense response
i- Complement Signaling
synonym complement activation
i- Cytokine Response
i- cytokine metabolism
i- cytokine biosynthesis
synonym cytokine production
…
p- regulation of cytokine biosynthesis
…
…
i- B-cell activation
i- B-cell differentiation
i- B-cell proliferation
i- cellular defense response
…
i- T-cell activation
i- Natural Killer Cell Response
i- activation of natural killer cell activity
…
Legend: equivalent concepts, equivalent relations, is-a relation

Research at ADIT produced …
• General framework for alignment systems
• Classification of current algorithms and systems
• SAMBO – System for Aligning and Merging Biomedical Ontologies
• KitAMO – Toolkit for aligning and merging ontologies
• Strategy for recommendation of alignment systems
State-of-the-art research with some of the best systems that currently exist.
Representation and management of data

Example: Molecular interactions

SBML:
<model name="Example">
  <listOfSpecies>
    <species name="Succinate" id="S1" />
  </listOfSpecies>
  <listOfReactions>
    <reaction name="Suc Cat" id="R1">
      <listOfReactants>
        <speciesReference species="S1" />
      </listOfReactants>
      <listOfProducts>
      </listOfProducts>
      <listOfModifiers>
      </listOfModifiers>
    </reaction>
  </listOfReactions>
</model>

PSI MI:
<entry>
  <interactorList>
    <proteinInteractor id="S1">
      <names> …. </names>
    </proteinInteractor>
  </interactorList>
  <interactionList>
    <interaction>
      <names>… </names>
      <participantList>
        <proteinParticipant>
          <proteinInteractorRef ref="S1" />
          <role>neutral</role>
        </proteinParticipant>
      </participantList>
    </interaction>
  </interactionList>
</entry>

KEGGML:
<pathway name="Example" ...>
  <entry id="1" name="Succ" …>
    <graphics name="3.1.3.67" ….>
  </entry>
  <reaction substrate="1" product="2" type=…>
  </reaction>
  …
</pathway>

Example TV-domain

TV-Anytime:
<TVAMain>
  <ProgramDescription>
    <ProgramLocationTable>
      <Schedule serviceIDRef='SVT1' start='...' end='...'>
        <ScheduleEvent>
          <Program crid='crid:...'/>
        </ScheduleEvent>
      </Schedule>
    </ProgramLocationTable>
    <ProgramInformationTable>
      <ProgramInformation programId='crid:...'>
        <BasicDescription>
          <Title>SVT News</Title>
          <Genre href=... >
            <Name>News</Name>
          </Genre> ...
        </BasicDescription>
      </ProgramInformation>
    </ProgramInformationTable>
  </ProgramDescription>
</TVAMain>

XMLTV:
<tv>
  <channel id="C1">
    <display-name lang="se">SVT1</display-name>
  </channel>
  <programme start="200006031633" channel="3sat.de">
    <title lang="sv">Nyheterna</title>
    <title lang="en">News</title>
    <desc lang="sv">… </desc>
    <category>News</category>
    <country>SE</country>
  </programme>
</tv>
Research topic
How can we provide users with tools so that they can conveniently work with data independent of format?
Formats: SBML, PSI MI, BioPAX, KEGGML, BINDML
Research at ADIT produced
• Thorough evaluation and comparison of available representation formats.
• A taxonomy for finding equivalent concepts in different standards.
• A method for translation between XML Schema and OWL.

Efficient storage of web data:
XML databases
Expressiveness of the models
• XML's tree structure vs. relational tables
• XML imposes order between the objects

Easiness of understanding the models
• XML: only one structure for the user to understand
• XML: all information about a structure kept in one place

XML:
<listOfStudents>
  <student name="Stina" id="1">
    <listOfCurrentcourses>
      <courseReference course="TDDB38" />
    </listOfCurrentcourses>
    <listOfFuturecourses>
      <courseReference course="TGTU09" />
    </listOfFuturecourses>
    <listOfTakencourses>
      <courseReference course="TDDA12" />
      <courseReference course="TDDB34" />
    </listOfTakencourses>
  </student>
</listOfStudents>

Relational:
Student(ID, name)
Currentcourses(student, course)
Futurecourses(student, course)
Takencourses(student, course)
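The correspondence between the student XML document and the four relations can be sketched in Python. This is a minimal sketch of a manual translation, assuming SQLite as the relational store and assuming every courseReference carries its course code in an attribute named course. The function name load, the column types and the key declarations are illustrative choices, not part of the lecture.

```python
import sqlite3
import xml.etree.ElementTree as ET

XML_DOC = """
<listOfStudents>
  <student name="Stina" id="1">
    <listOfCurrentcourses>
      <courseReference course="TDDB38" />
    </listOfCurrentcourses>
    <listOfFuturecourses>
      <courseReference course="TGTU09" />
    </listOfFuturecourses>
    <listOfTakencourses>
      <courseReference course="TDDA12" />
      <courseReference course="TDDB34" />
    </listOfTakencourses>
  </student>
</listOfStudents>
"""

# Each XML list element is stored in one relation (assumed mapping).
LIST_TO_TABLE = {
    "listOfCurrentcourses": "Currentcourses",
    "listOfFuturecourses": "Futurecourses",
    "listOfTakencourses": "Takencourses",
}


def load(xml_text):
    """Shred the XML tree into the four relations of the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Student (ID INTEGER PRIMARY KEY, name TEXT)")
    for table in LIST_TO_TABLE.values():
        conn.execute(
            f"CREATE TABLE {table} ("
            "student INTEGER REFERENCES Student(ID), "
            "course TEXT, PRIMARY KEY (student, course))"
        )
    root = ET.fromstring(xml_text.strip())
    for student in root.findall("student"):
        sid = int(student.get("id"))
        conn.execute("INSERT INTO Student VALUES (?, ?)", (sid, student.get("name")))
        for list_tag, table in LIST_TO_TABLE.items():
            for ref in student.findall(f"{list_tag}/courseReference"):
                conn.execute(
                    f"INSERT INTO {table} VALUES (?, ?)", (sid, ref.get("course"))
                )
    return conn


conn = load(XML_DOC)
taken = [
    row[0]
    for row in conn.execute(
        "SELECT course FROM Takencourses WHERE student = 1 ORDER BY course"
    )
]
print(taken)
```

The XML keeps everything about one student in a single subtree, while the relational version spreads the same facts over four tables that have to be joined on the student ID.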
XML and storage
• XML can be seen as a data model.
• Three options for storage of XML
  • Manual translation of data model to a relational model.
  • Automatic translation to the relational model.
  • Special native XML databases.

Efficiency of the models
• Relational model needs translation
• Similar loading times for the approaches.
• Sizes of data:
            XML      Rel
  SBML      3.2 M    1.9 M
  PSI MI    31 M     9.1 M

SBML:
<listOfSpecies>
  <species name="Succinate" id="Succ" />
  <species name="Fumarate" id="Fum" />
</listOfSpecies>

PSI MI:
<interactorList>
  <proteinInteractor id="S1">
    <names>
      <shortLabel>Succ</shortLabel>
      <fullName>Succinate</fullName>
    </names>
  </proteinInteractor>
  ….
</interactorList>
Research at ADIT produced
• Evaluations and comparison of the different storage approaches.
• Graph handling extension for native XML databases.
However, many challenges remain:
• Further comparison of storage models. Our aim is to move towards hybrid storage, to be as efficient as possible.
• Evaluation and further improvement of the graph extension.
• Comparison with network databases

Integration of data
• Can people benefit from each other's work?
• Work with
  • Juliana Freire (Univ of Utah)
  • Lena Strömbäck
  • Tommy Ellkvist
Scientific workflows
We use VisTrails for management of workflows (University of Utah)

Reuse of workflows between researchers
Many challenges to aid the user:
• How to make use of ontologies?
• How to go between ontologies?
• Data representation between web services – data conversion?
• Finding relevant web sources.
• Annotations of web sources.
Student Projects
Degree projects available in
Contact Lena Strömbäck, lestr@ida.liu.se
• Representation of pathway data
• Efficient storage for XML
• Workflow and Discovery
Contact Patrick Lambrix, patla@ida.liu.se
• Ontology alignment (SAMBO, KitAMO, new algorithms, visualization, connection to ontology editor)
• Data mining of patient data
• Analysis of blood cell data
Contact Nahid Shahmehri, nahsh@ida.liu.se
• Database construction for traffic safety research

Exam office hours – if you have questions.
• Friday 12/10
  • 8.30-11.30 – Lena
  • 13.00-15.00 – He
  • 15.00-17.00 – José
Exam
TDDC94: Practical and theoretical part
TDDI60: One part. 1/3 theoretical questions
English dictionary allowed. (Not electronic!)
No other books, no calculator.

EER-modelling
• Translate a mini world description into an EER-model
  • Entities, relationships, cardinalities, subclasses, weak entities, identifying relationships, total participation, ternary relationships
• Typical mistakes: Mixing EER-notation with relational notation:
  • No foreign keys as attributes in the EER-diagram, they are represented by relationships!
  • Forgotten cardinalities.
The relational model
• Basic notation: relation, tuple, attribute
• Keys
• Foreign keys
• Basic operations: select, project, cross-product, join, aggregation ….

Definitions
• Superkey: A set of attributes uniquely identifying a tuple of a relation. (A superkey does not have to be minimal!)
• Candidate key: A set of attributes that uniquely and minimally identifies a tuple of a relation.
• Primary key: One candidate key is chosen to be the primary key.
• Prime attribute: An attribute A that is part of a candidate key X (vs. nonprime attribute)
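The key definitions above can be checked mechanically once the uniqueness constraints are written as functional dependencies. The following Python sketch assumes a hypothetical relation R(ID, name, course, grade) with the dependencies ID → name and {ID, course} → grade. The schema, the dependencies and the function names are invented for illustration only.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure: everything determined by attrs under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_superkey(attrs, schema, fds):
    """A superkey determines every attribute of the relation."""
    return closure(attrs, fds) == set(schema)


def candidate_keys(schema, fds):
    """Minimal superkeys, found by trying attribute sets from small to large."""
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            attrs = set(combo)
            # Skip supersets of a key already found: they are not minimal.
            if is_superkey(attrs, schema, fds) and not any(k <= attrs for k in keys):
                keys.append(attrs)
    return keys


# Hypothetical example relation and dependencies.
schema = {"ID", "name", "course", "grade"}
fds = [({"ID"}, {"name"}), ({"ID", "course"}, {"grade"})]

keys = candidate_keys(schema, fds)
prime = set().union(*keys)
print(keys, prime)
```

Here {ID, course, name} is a superkey but not a candidate key, because removing name still leaves a superkey. The prime attributes are ID and course; name and grade are nonprime.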
Translation of EER-notation into relational notation
• Know how to translate: Entities, weak entities, relationships (1:N, N:M, 1:1), subclasses (all four ways and when to use which), n-ary relationships, union types, multi-valued attributes.
• Typical mistakes: forgotten primary/foreign key marking, wrong translation of subclass or N:M-relationship.

SQL
• Select
• Set-functions (union, …)
• Where-conditions
• Group by
• Joins
  • No outer join syntax, but the concept of outer joins.
• PSQL
• Stored procedures
• Triggers
Data Structures
• What are indexes? What are they good for? What types of indexes do you know? When can they be used?
• How much memory is needed?
• Show that one or another index type performs better or is more suitable.
• Know how to calculate log_2 N, log_x N.

Normalisation
• Why is normalisation useful?
• Definitions of 1NF, 2NF, 3NF, BCNF
• Recognize the NF of a relation.
• Bring the relation into a higher normal form.
• Concepts of 4NF, 5NF
Transactions
• What is a transaction?
  • What operations does it consist of?
• What are important properties of transactions?
  • Atomicity, Consistency, Isolation, Durability
  • How are these properties achieved?
• How does a transaction update the database?
  • Read_item, write_item (Not TDDI60)
• Interleaving transactions
  • What is it?
  • Transaction schedule (Not TDDI60)
  • Problems with interleaving (Not TDDI60)
    • Lost update, dirty read, incorrect summary, unrepeatable read.

Transaction schedule – interleaving (Not TDDI60)
• Properties of transaction schedules
  • Serial, serializable
  • Recoverable, cascadeless and strict
• Implementation of serialisation
  • Locking, 2PL
  • Deadlock
    • Protocols for detection and prevention of deadlock
  • Starvation
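One common way to decide whether an interleaved schedule is serializable is the precedence-graph test for conflict serializability. The Python sketch below assumes a schedule is given as a list of (transaction, operation, item) triples with operations "r" and "w". The representation and the function names are choices made for this example, and the test covers conflict serializability only.

```python
def precedence_graph(schedule):
    """Edge Ti -> Tj when an operation of Ti conflicts with a later one of Tj."""
    edges = set()
    for i, (ti, op_i, item_i) in enumerate(schedule):
        for tj, op_j, item_j in schedule[i + 1:]:
            # Two operations conflict if they belong to different transactions,
            # touch the same item, and at least one of them is a write.
            if ti != tj and item_i == item_j and "w" in (op_i, op_j):
                edges.add((ti, tj))
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in the directed graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True
        visiting.add(node)
        if any(visit(nxt) for nxt in graph[node]):
            return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(visit(node) for node in graph)


def is_conflict_serializable(schedule):
    return not has_cycle(precedence_graph(schedule))


# The lost update problem: both transactions read X before either writes it.
lost_update = [("T1", "r", "X"), ("T2", "r", "X"), ("T1", "w", "X"), ("T2", "w", "X")]
# A serial schedule: T1 runs to completion before T2 starts.
serial = [("T1", "r", "X"), ("T1", "w", "X"), ("T2", "r", "X"), ("T2", "w", "X")]

print(is_conflict_serializable(lost_update), is_conflict_serializable(serial))
```

The lost-update schedule produces the edges T1 → T2 and T2 → T1, so the graph has a cycle and no equivalent serial order exists. The serial schedule only yields T1 → T2.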
Database recovery (Not TDDI60)
• Main reasons for database failure
• Understand: Backup, Logfile, Checkpoint, Commit, Rollback
• Principles for recovery
  • Major failure: Backup + Logfile
  • Minor failure: Logfile (Undo/Redo)
• Use of cache memory
  • Why and how?
  • Update strategies (deferred and immediate)
  • In-place and shadow paging
  • How does this affect database recovery?

Query optimisation (Not TDDI60)
• Relational algebra
• Costs in query processing
• Heuristic query optimisation
  • What does it optimise?
  • How does it work?
  • Demonstrate by example
  • Estimate efficiency of the optimisation
• Query plans and algorithms

Architecture - three levels of data
[Figure: several Views, the Conceptual level, the Physical level]
Data independence between the levels

Database recovery
Study the following log file, originating from a database manager using immediate update:

Start-transaction T1
Write-item T1, B, 60
Start-transaction T3
Write-item T1, A, 50
Commit T1
Write-item T3, C, 25
Checkpoint
Write-item T3, D, 10
Commit T3
Start-transaction T4
Write-item T4, B, 70
Start-transaction T5
Write-item T5, D, 10
Commit T5
System crash

• Which variant of immediate update must have been used? Why?
• Describe what happens to the four transactions during the recovery. (UNDO, REDO or nothing)
• What is the value of each of the four variables, A, B, C, D after the recovery?
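The classification of transactions in the log above can be traced with a short Python sketch. It rests on several assumptions that the exercise itself leaves open: an UNDO/REDO style recovery with checkpoints as described in common textbooks, the logged value being the new value written, and an undo restoring the last value written by a committed transaction. The tuple encoding of the log and the function names are invented for this example, and the sketch does not settle the first question of the exercise.

```python
# The log from the exercise, encoded as tuples (encoding is an assumption).
LOG = [
    ("start", "T1"),
    ("write", "T1", "B", 60),
    ("start", "T3"),
    ("write", "T1", "A", 50),
    ("commit", "T1"),
    ("write", "T3", "C", 25),
    ("checkpoint",),
    ("write", "T3", "D", 10),
    ("commit", "T3"),
    ("start", "T4"),
    ("write", "T4", "B", 70),
    ("start", "T5"),
    ("write", "T5", "D", 10),
    ("commit", "T5"),
]


def classify(log):
    """Map each transaction to 'NOTHING', 'REDO' or 'UNDO' at the crash."""
    checkpoint_pos = max(
        (i for i, rec in enumerate(log) if rec[0] == "checkpoint"), default=-1
    )
    actions = {}
    for i, rec in enumerate(log):
        if rec[0] == "start":
            actions[rec[1]] = "UNDO"  # stays UNDO unless a commit is found
        elif rec[0] == "commit":
            # Committed before the last checkpoint: already safely on disk.
            actions[rec[1]] = "NOTHING" if i < checkpoint_pos else "REDO"
    return actions


def values_after_recovery(log):
    """Replay, in log order, only the writes of committed transactions."""
    actions = classify(log)
    values = {}
    for rec in log:
        if rec[0] == "write" and actions[rec[1]] != "UNDO":
            values[rec[2]] = rec[3]
    return values


actions = classify(LOG)
values = values_after_recovery(LOG)
print(actions)
print(values)
```

Under these assumptions T1 needs no action, T3 and T5 are redone, and T4 is undone, so B falls back to the value written by T1.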
Heuristic Optimization
SQL-example query
SELECT E.LNAME
FROM EMPLOYEE E, WORKS_ON W, PROJECT P
WHERE P.PNAME = 'Aquarius'
AND P.PNUMBER = W.PNO
AND W.ESSN = E.SSN
AND E.BDATE > '1957-12-31'

Indexes
Assume an ordered file whose ordering field is a key. The file has 15000 records of size 150 bytes each. The disk block is of size 512 bytes (unspanned allocation). The key field is 10 bytes, block and record pointer sizes are both 40 bytes.
a) How many blocks are needed to store the file?
b) The database designer wants to make an index on the key field. Which kind of index is suitable? Make a sketch of the index and calculate the number of blocks needed.
c) What happens if we want to make the index on another field that is not the key?
d) To further speed up the data access, the database designer wants to organize the index in b) as a B+-tree. What is a suitable order of the tree? How many data accesses will be needed using the B+-tree?
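The arithmetic behind the Indexes exercise (15000 records of 150 bytes, 512-byte blocks, 10-byte key, 40-byte pointers) can be checked with a few lines of Python. This is a sketch under stated assumptions: the index on the ordering key is taken to be a primary index with one entry per data block, the index on a non-ordering field is taken to be dense with one entry per record, and the B+-tree order is computed for internal nodes only. The variable names are illustrative.

```python
import math

RECORDS = 15000
RECORD_SIZE = 150   # bytes
BLOCK_SIZE = 512    # bytes, unspanned allocation
KEY_SIZE = 10       # bytes
POINTER_SIZE = 40   # bytes, block and record pointers alike

# Unspanned: a record never crosses a block boundary, so round down.
bfr = BLOCK_SIZE // RECORD_SIZE
file_blocks = math.ceil(RECORDS / bfr)

# Primary index (assumed): one <key, block pointer> entry per data block.
entry_size = KEY_SIZE + POINTER_SIZE
index_bfr = BLOCK_SIZE // entry_size
primary_index_blocks = math.ceil(file_blocks / index_bfr)

# Binary search over the index blocks plus one access for the data block.
primary_lookup = math.ceil(math.log2(primary_index_blocks)) + 1

# Dense index on a non-ordering field (assumed): one entry per record.
secondary_index_blocks = math.ceil(RECORDS / index_bfr)

# B+-tree internal node: p pointers and p - 1 keys must fit in one block,
# p * POINTER_SIZE + (p - 1) * KEY_SIZE <= BLOCK_SIZE.
order = (BLOCK_SIZE + KEY_SIZE) // (POINTER_SIZE + KEY_SIZE)

print(bfr, file_blocks, index_bfr, primary_index_blocks)
print(primary_lookup, secondary_index_blocks, order)
```

The number of accesses through the B+-tree additionally depends on how full the nodes are assumed to be, so it is left out of the sketch.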
Heuristic Optimization – Canonical Form
π LNAME
  σ PNAME='Aquarius' AND PNUMBER=PNO AND ESSN=SSN AND BDATE>'1957-12-31'
    X
      X
        EMPLOYEE
        WORKS_ON
      PROJECT

Heuristic Optimization – Move Select Down
π LNAME
  σ PNUMBER=PNO
    X
      σ ESSN=SSN
        X
          σ BDATE>'1957-12-31'
            EMPLOYEE
          WORKS_ON
      σ PNAME='Aquarius'
        PROJECT

Heuristic Optimization – Apply Most Restrictive Select First
π LNAME
  σ ESSN=SSN
    X
      σ PNUMBER=PNO
        X
          σ PNAME='Aquarius'
            PROJECT
          WORKS_ON
      σ BDATE>'1957-12-31'
        EMPLOYEE

Heuristic Optimization – Convert Cartesian Product/Select with Join
π LNAME
  join ESSN=SSN
    join PNUMBER=PNO
      σ PNAME='Aquarius'
        PROJECT
      WORKS_ON
    σ BDATE>'1957-12-31'
      EMPLOYEE

Heuristic Optimization – Move Projections Down the Tree
π LNAME
  join ESSN=SSN
    π ESSN
      join PNUMBER=PNO
        π PNUMBER
          σ PNAME='Aquarius'
            PROJECT
        π ESSN,PNO
          WORKS_ON
    π SSN,LNAME
      σ BDATE>'1957-12-31'
        EMPLOYEE
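The effect of the heuristic rewriting on the example query over EMPLOYEE, WORKS_ON and PROJECT can be demonstrated on toy data. The Python sketch below evaluates the canonical plan (cross product, then select) and a rewritten plan (select first, then join) and compares the largest intermediate result. The three small relations and the function names are made up for this illustration and do not come from the lecture.

```python
from itertools import product

# Made-up sample data. Dates are ISO strings, so string comparison is valid.
EMPLOYEE = [  # (SSN, LNAME, BDATE)
    ("1", "Smith", "1965-01-09"),
    ("2", "Wong", "1955-12-08"),
    ("3", "English", "1972-07-31"),
]
WORKS_ON = [("1", 10), ("2", 10), ("3", 20)]      # (ESSN, PNO)
PROJECT = [(10, "Aquarius"), (20, "ProductX")]    # (PNUMBER, PNAME)


def canonical():
    """Cross product of all three relations, then one big selection."""
    rows = list(product(EMPLOYEE, WORKS_ON, PROJECT))
    names = {
        e[1]
        for e, w, p in rows
        if p[1] == "Aquarius" and p[0] == w[1] and w[0] == e[0] and e[2] > "1957-12-31"
    }
    return names, len(rows)


def optimized():
    """Selections pushed down, most restrictive first, products turned into joins."""
    projects = [p for p in PROJECT if p[1] == "Aquarius"]
    step1 = [(p, w) for p in projects for w in WORKS_ON if p[0] == w[1]]
    employees = [e for e in EMPLOYEE if e[2] > "1957-12-31"]
    step2 = [(w, e) for _, w in step1 for e in employees if w[0] == e[0]]
    names = {e[1] for _, e in step2}
    return names, max(len(step1), len(step2))


print(canonical())
print(optimized())
```

Both plans return the same answer, but the canonical plan materializes every combination of the three relations, while the rewritten plan never holds more than the tuples that survive the selection on PNAME.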