Today's lecture: Outlook and Summary
  - Research outlook: What do we do?
  - Formalities regarding the exam
  - Summary of the course
  - Some example problems

Lena Strömbäck, October 2007

Semantic Web, Databases and Information Integration
GET THAT PROTEIN!
  Patrick Lambrix, Lena Strömbäck, He Tan
  Databases and Web Information Systems group
  Institutionen för datavetenskap, Linköpings universitet

Accessing data sources on the web
  Which? Where? How?
  [Figure: data sources feeding DISCOVERY: Genomics, Clinical trials, Disease information, Chemical structure, Metabolism and toxicology, Disease models, Target structure]

The Semantic Web
  W3C: Facilities to put machine-understandable data on the Web are becoming a high priority for many communities. The Web can reach its full potential only if it becomes a place where data can be shared and processed by automated tools as well as by people. For the Web to scale, tomorrow's programs must be able to share and process data even when these programs have been designed totally independently.

  The Semantic Web is a vision: the idea of having data on the web defined and linked in a way that it can be used by machines not just for display purposes, but for automation, integration and reuse of data across various applications.

Key mechanisms for the Semantic Web: Ontologies and standards

Ontologies and data representation standards
  Ontologies define the basic terms and relations comprising the vocabulary of a topic area, as well as the rules for combining terms and relations to define extensions to the vocabulary. (Neches, Fikes, Finin, Gruber, Patil, Senator)

  To allow reuse of actual data, standardized representations are often defined for different areas. This corresponds to the data model of a database.
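As a minimal sketch of what "terms and relations" can look like in a program, the following Python fragment stores terms with synonyms and is-a links and follows the links upward. The class names and the three toy terms (molecule, protein, enzyme) are assumptions made for illustration; they are not taken from any ontology discussed in the lecture.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One ontology term with its synonyms and direct is-a parents."""
    name: str
    synonyms: set = field(default_factory=set)
    is_a: set = field(default_factory=set)  # names of direct parent terms


class Ontology:
    """A toy vocabulary: terms plus a single relation type (is-a)."""

    def __init__(self):
        self.terms = {}

    def add(self, name, is_a=(), synonyms=()):
        self.terms[name] = Term(name, set(synonyms), set(is_a))

    def ancestors(self, name):
        """Return all terms reachable by following is-a links upward."""
        seen = set()
        stack = list(self.terms[name].is_a)
        while stack:
            parent = stack.pop()
            if parent not in seen:
                seen.add(parent)
                stack.extend(self.terms[parent].is_a)
        return seen


# Hypothetical toy vocabulary, for illustration only.
onto = Ontology()
onto.add("molecule")
onto.add("protein", is_a=["molecule"])
onto.add("enzyme", is_a=["protein"], synonyms=["biocatalyst"])

print(sorted(onto.ancestors("enzyme")))
```

Real ontologies add more relation types (for example part-of) and rules for defining new terms; this sketch only covers the is-a backbone.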
Ontologies and ontology alignment

Overlapping ontologies
  (i- marks an is-a relation, p- a part-of relation)

  GENE ONTOLOGY (GO)
    immune response
    i- acute-phase response
    i- anaphylaxis
    i- antigen presentation
    i- antigen processing
    i- cellular defense response
    i- cytokine metabolism
    i- cytokine biosynthesis
       synonym cytokine production
    …
    p- regulation of cytokine biosynthesis
    …
    …
    i- B-cell activation
    i- B-cell differentiation
    i- B-cell proliferation
    i- cellular defense response
    …
    i- T-cell activation
    i- activation of natural killer cell activity
    …

  SIGNAL-ONTOLOGY (SigO)
    Immune Response
    i- Allergic Response
    i- Antigen Processing and Presentation
    i- B Cell Activation
    i- B Cell Development
    i- Complement Signaling
       synonym complement activation
    i- Cytokine Response
    i- Immune Suppression
    i- Inflammation
    i- Intestinal Immunity
    i- Leukotriene Response
    i- Leukotriene Metabolism
    i- Natural Killer Cell Response
    i- T Cell Activation
    i- T Cell Development
    i- T Cell Selection in Thymus

  Figure legend: equivalent concepts, equivalent relations, is-a relation

Merging ontologies
  GENE ONTOLOGY (GO) and SIGNAL-ONTOLOGY (SigO), as listed above, are merged into:

  The New Ontology
    immune response
    i- acute-phase response
    i- Allergic Response
    i- anaphylaxis
    i- Antigen Processing and Presentation
    i- antigen presentation
    i- antigen processing
    i- cellular defense response
    i- Complement Signaling
       synonym complement activation
    i- Cytokine Response
    i- cytokine metabolism
    i- cytokine biosynthesis
       synonym cytokine production
    …
    p- regulation of cytokine biosynthesis
    …
    …
    i- B-cell activation
    i- B-cell differentiation
    i- B-cell proliferation
    i- cellular defense response
    …
    i- T-cell activation
    i- Natural Killer Cell Response
    i- activation of natural killer cell activity
    …

  Figure legend: equivalent concepts, equivalent relations, is-a relation

Research at ADIT produced …
  - General framework for alignment systems
  - Classification of current algorithms and systems
  - SAMBO: System for Aligning and Merging Biomedical Ontologies
  - KitAMO: Toolkit for aligning and merging ontologies
  - Strategy for recommendation of alignment systems
  State-of-the-art research with some of the best systems that currently exist.

Representation and management of data

Example TV-domain

  TV-Anytime
    <TVAMain>
      <ProgramDescription>
        <ProgramLocationTable>
          <Schedule serviceIDRef='SVT1' start='...' end='...'>
            <ScheduleEvent>
              <Program crid='crid:...'/>
            </ScheduleEvent>
          </Schedule>
        </ProgramLocationTable>
        <ProgramInformationTable>
          <ProgramInformation programId='crid:...'>
            <BasicDescription>
              <Title>SVT News</Title>
              <Genre href=... >
                <Name>News</Name>
              </Genre>
              ...
            </BasicDescription>
          </ProgramInformation>
        </ProgramInformationTable>
      </ProgramDescription>
    </TVAMain>

  XMLTV
    <tv>
      <channel id="C1">
        <display-name lang="se">SVT1</display-name>
      </channel>
      <programme start="200006031633" channel="3sat.de">
        <title lang="sv">Nyheterna</title>
        <title lang="en">News</title>
        <desc lang="sv">… </desc>
        <category>News</category>
        <country>SE</country>
      </programme>
    </tv>

Example: Molecular interactions

  SBML
    <model name="Example">
      <listOfSpecies>
        <species name="Succinate" id="S1" />
      </listOfSpecies>
      <listOfReactions>
        <reaction name="Suc Cat" id="R1">
          <listOfReactants>
            <speciesReference species="S1" />
          </listOfReactants>
          <listOfProducts>
          </listOfProducts>
          <listOfModifiers>
          </listOfModifiers>
        </reaction>
      </listOfReactions>
    </model>

  PSI MI
    <entry>
      <interactorList>
        <proteinInteractor id="S1">
          <names> …. </names>
        </proteinInteractor>
      </interactorList>
      <interactionList>
        <interaction>
          <names>… </names>
          <participantList>
            <proteinParticipant>
              <proteinInteractorRef ref="S1" />
              <role>neutral</role>
            </proteinParticipant>
          </participantList>
        </interaction>
      </interactionList>
    </entry>

  KEGGML
    <pathway name="Example" ...>
      <entry id="1" name="Succ" …>
        <graphics name="3.1.3.67" …. />
      </entry>
      <reaction substrate="1" product="2" type=…>
      </reaction>
      …
    </pathway>

Research topic
  How can we provide users with tools so that they can conveniently work with data independently of its format?
  Formats: SBML, PSI MI, BioPAX, KEGGML, BINDML

Research at ADIT produced
  - Thorough evaluation and comparison of available representation formats.
  - A taxonomy for finding equivalent concepts in different standards.
  - A method for translation between XML Schema and OWL.
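To make the idea of "working with data independent of format" concrete, here is a small sketch that reads the same substance from an SBML-style and a PSI MI-style snippet and maps both to one common record. The snippets are simplified in the style of the slides (no XML namespaces, which real SBML and PSI MI files use), and the function names and the common record layout are assumptions chosen for illustration, not part of either standard.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free snippets in the style of the lecture examples.
SBML = """<model name="Example">
  <listOfSpecies>
    <species name="Succinate" id="Succ" />
    <species name="Fumarate" id="Fum" />
  </listOfSpecies>
</model>"""

PSI_MI = """<entry>
  <interactorList>
    <proteinInteractor id="S1">
      <names>
        <shortLabel>Succ</shortLabel>
        <fullName>Succinate</fullName>
      </names>
    </proteinInteractor>
  </interactorList>
</entry>"""


def sbml_substances(xml_text):
    """SBML keeps the name as an attribute of <species>."""
    root = ET.fromstring(xml_text)
    return [{"id": s.get("id"), "name": s.get("name")}
            for s in root.iter("species")]


def psimi_substances(xml_text):
    """PSI MI keeps the name in a nested <names>/<fullName> element."""
    root = ET.fromstring(xml_text)
    return [{"id": p.get("id"), "name": p.findtext("names/fullName")}
            for p in root.iter("proteinInteractor")]


sbml_names = {s["name"] for s in sbml_substances(SBML)}
psimi_names = {s["name"] for s in psimi_substances(PSI_MI)}
print(sbml_names & psimi_names)  # substances described in both formats
```

The two extractor functions play the role of a hand-written mapping between equivalent concepts; a taxonomy of equivalent concepts would let such mappings be derived rather than coded per format.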
Efficient storage of web data: XML-databases

Expressiveness of the models
  - XML's tree structure vs relational tables
  - XML imposes an order between the objects

Easiness of understanding the models
  - XML: only one structure for the user to understand
  - XML: all information about a structure is kept in one place

  Example (the same data in XML and as relational tables):

    <listOfStudents>
      <student name="Stina" id="1">
        <listOfCurrentcourses>
          <courseReference course="TDDB38" />
        </listOfCurrentcourses>
        <listOfFuturecourses>
          <courseReference course="TGTU09" />
        </listOfFuturecourses>
        <listOfTakencourses>
          <courseReference course="TDDA12" />
          <courseReference course="TDDB34" />
        </listOfTakencourses>
      </student>
    </listOfStudents>

    Student(ID, name)
    Currentcourses(student, course)
    Futurecourses(student, course)
    Takencourses(student, course)

XML and storage
  - XML can be seen as a data model.
  - Three options for storage of XML:
    - Manual translation of the data model to a relational model.
    - Automatic translation to the relational model.
    - Special native XML databases.

Efficiency of the models
  - Relational model needs translation
  - Similar loading times for the approaches.
  - Sizes of data:

             XML      Rel
    SBML     3.2 M    1.9 M
    PSI MI   31 M     9.1 M

  SBML:
    <listOfSpecies>
      <species name="Succinate" id="Succ" />
      <species name="Fumarate" id="Fum" />
    </listOfSpecies>

  PSI MI:
    <interactorList>
      <proteinInteractor id="S1">
        <names>
          <shortLabel>Succ</shortLabel>
          <fullName>Succinate</fullName>
        </names>
      </proteinInteractor>
      ….
    </interactorList>

Research at ADIT produced
  - Evaluations and comparison of the different storage approaches.
  - Graph handling extension for native XML databases.
  However, many challenges remain:
  - Further comparison of storage models. Our aim is to move towards hybrid storage, to be as efficient as possible.
  - Evaluation and further improvement of the graph extension.
  - Comparison with network databases

Integration of data
  - Can people benefit from each other's work?
  - Work with
    - Juliana Freire (Univ of Utah)
    - Lena Strömbäck
    - Tommy Ellkvist

Reuse of workflows between researchers
  We use Vistrails (University of Utah) for management of workflows.

Scientific workflows
  Many challenges to aid the user:
  - How to make use of ontologies?
  - How to go between ontologies?
  - Data representation between web services: data conversion?
  - Finding relevant web sources.
  - Annotations of web sources.

Student Projects
  Degree projects available in:
  Contact Lena Strömbäck, lestr@ida.liu.se
  - Representation of pathway data
  - Efficient storage for XML
  - Workflow and Discovery
  Contact Patrick Lambrix, patla@ida.liu.se
  - Ontology alignment (SAMBO, KitAMO, new algorithms, visualization, connection to ontology editor)
  - Data mining of patient data
  - Analysis of blood cell data
  Contact Nahid Shahmehri, nahsh@ida.liu.se
  - Database construction for traffic safety research

Exam help hours (jour), if you have questions
  - Friday 12/10
    - 8.30-11.30: Lena
    - 13.00-15.00: He
    - 15.00-17.00: José

Exam
  TDDC94: Practical and theoretical part
  TDDI60: One part. 1/3 theoretical questions
  English dictionary allowed. (Not electronic!)
  No other books, no calculator.

EER-modelling
  - Translate a mini world description into an EER-model
    - Entities, relationships, cardinalities, subclasses, weak entities, identifying relationships, total participation, ternary relationships
  - Typical mistakes:
    - Mixing EER-notation with relational notation: no foreign keys as attributes in the EER-diagram, they are represented by relationships!
    - Forgotten cardinalities.

The relational model
  - Basic notation: relation, tuple, attribute
  - Keys
  - Foreign keys
  - Basic operations: select, project, cross-product, join, aggregation ….

Definitions
  - Superkey: A set of attributes that uniquely identifies a tuple of a relation. (A superkey does not have to be minimal!)
  - Candidate key: A set of attributes that uniquely and minimally identifies a tuple of a relation.
  - Primary key: One candidate key is chosen to be the primary key.
  - Prime attribute: An attribute A that is part of a candidate key X (vs. nonprime attribute)

Translation of EER-notation into relational notation
  - Know how to translate: entities, weak entities, relationships (1:N, N:M, 1:1), subclasses (all four ways and when to use which), n-ary relationships, union types, multi-valued attributes.
  - Typical mistakes: forgotten primary/foreign key marking, wrong translation of subclass or N:M-relationship.

SQL
  - Select
  - Set operations (union, …)
  - Where-conditions
  - Group by
  - Joins
    - No outer join syntax, but the concept of outer joins.
  - PSQL
    - Stored procedures
    - Triggers

Data Structures
  - What are indexes? What are they good for?
  - What types of indexes do you know? When can they be used?
    - How much memory is needed?
  - Show that one or another index type performs better or is more suitable.
  - Know how to calculate log2 N, logx N.

Normalisation
  - Why is normalisation useful?
  - Definitions of 1NF, 2NF, 3NF, BCNF
  - Recognize the NF of a relation.
  - Bring the relation into a higher normal form.
  - Concepts of 4NF, 5NF

Transactions
  - What is a transaction?
    - What operations does it consist of?
  - What are important properties of transactions?
    - Atomicity, Consistency, Isolation, Durability
    - How are these properties achieved?
  - How does a transaction update the database?
    - Read_item, write_item (Not TDDI60)
  - Interleaving transactions
    - Transaction schedule (Not TDDI60)
    - Problems with interleaving (Not TDDI60)
      - Lost update, dirty read, incorrect summary, unrepeatable read.

Transaction schedule: interleaving (Not TDDI60)
  - Properties of transaction schedules
    - Serial, serializable
    - Recoverable, cascadeless and strict
  - Implementation of serialisation
    - Locking, 2PL
  - Deadlock
    - What is it?
    - Protocols for detection and prevention of deadlock
  - Starvation

Database recovery (Not TDDI60)
  - Main reasons for database failure
  - Understand: Backup, Logfile, Checkpoint, Commit, Rollback
  - Principles for recovery
    - Major failure: Backup + Logfile
    - Minor failure: Logfile (Undo/Redo)
  - Use of cache memory
    - Why and how?
    - Update strategies (deferred and immediate)
    - In-place and shadow paging
    - How does this affect database recovery?

Query optimisation (Not TDDI60)
  - Relational algebra
  - Costs in query processing
  - Heuristic query optimisation
    - What does it optimise?
    - How does it work?
    - Demonstrate by example
    - Estimate efficiency of the optimisation
  - Query plans and algorithms

Architecture: three levels of data
  [Figure: View, View, View / Conceptual level / Physical level]
  Data independence between the levels

Database recovery
  Study the following log file, originating from a database manager using immediate update:

    Start-transaction T1
    Write-item T1, B, 60
    Start-transaction T3
    Write-item T1, A, 50
    Commit T1
    Write-item T3, C, 25
    Checkpoint
    Write-item T3, D, 10
    Commit T3
    Start-transaction T4
    Write-item T4, B, 70
    Start-transaction T5
    Write-item T5, D, 10
    Commit T5
    System crash

  - Which variant of immediate update must have been used? Why?
  - Describe what happens to the four transactions during the recovery. (UNDO, REDO or nothing)
  - What is the value of each of the four variables A, B, C, D after the recovery?

Heuristic Optimization: SQL example query

    SELECT E.LNAME
    FROM EMPLOYEE E, WORKS_ON W, PROJECT P
    WHERE P.PNAME = 'Aquarius' AND P.PNUMBER = W.PNO
      AND W.ESSN = E.SSN AND E.BDATE > '1957-12-31'

Indexes
  Assume an ordered file whose ordering field is a key. The file has 15000 records of size 150 bytes each. The disk block is of size 512 bytes (unspanned allocation). The key field is 10 bytes; block and record pointer sizes are both 40 bytes.
  a) How many blocks are needed to store the file?
  b) The database designer wants to make an index on the key field. Which kind of index is suitable? Make a sketch of the index and calculate the number of blocks needed.
  c) What happens if we want to make the index on another field that is not the key?
  d) To further speed up the data access, the database designer wants to organize the index in b) as a B+-tree. What is a suitable order of the tree? How many data accesses will be needed using the B+-tree?
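The block arithmetic in the index exercise above can be checked with a few lines of Python. This is only a sketch under common textbook conventions, which are assumptions here: unspanned records, a sparse primary index with one entry (key plus block pointer) per data block, a dense index with one entry per record for a non-key field, and B+-tree orders derived from how many keys and pointers fit in one block. The variable and function names are chosen for illustration.

```python
import math

BLOCK_SIZE = 512      # bytes per disk block
RECORD_SIZE = 150     # bytes per record
NUM_RECORDS = 15000
KEY_SIZE = 10         # bytes per key value
POINTER_SIZE = 40     # bytes per block or record pointer


def blocks_needed(num_entries, entry_size, block_size=BLOCK_SIZE):
    """Unspanned allocation: whole entries per block, rounded-up block count."""
    per_block = block_size // entry_size
    return math.ceil(num_entries / per_block)


# a) data file: 3 records fit in a block
data_blocks = blocks_needed(NUM_RECORDS, RECORD_SIZE)

# b) sparse primary index: one (key, block pointer) entry per data block
index_entry = KEY_SIZE + POINTER_SIZE
primary_index_blocks = blocks_needed(data_blocks, index_entry)
# binary search over the index blocks plus one access to the data block
primary_accesses = math.ceil(math.log2(primary_index_blocks)) + 1

# c) index on a non-key field: assumed dense, one entry per record
dense_index_blocks = blocks_needed(NUM_RECORDS, index_entry)

# d) B+-tree orders: an internal node holds p pointers and p - 1 keys,
#    a leaf holds p_leaf (key, record pointer) pairs plus one next-leaf pointer
order_internal = (BLOCK_SIZE + KEY_SIZE) // (KEY_SIZE + POINTER_SIZE)
order_leaf = (BLOCK_SIZE - POINTER_SIZE) // (KEY_SIZE + POINTER_SIZE)

print(data_blocks, primary_index_blocks, primary_accesses)
print(dense_index_blocks, order_internal, order_leaf)
```

With these assumptions the script gives 5000 data blocks, 500 blocks for the sparse primary index and 10 block accesses via binary search, 1500 blocks for a dense index, and B+-tree orders of 10 (internal) and 9 (leaf). A different convention, for example partially filled nodes, changes the numbers.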
Heuristic Optimization: Canonical Form

    π LNAME
      σ PNAME='Aquarius' AND PNUMBER=PNO AND ESSN=SSN AND BDATE>'1957-12-31'
        X
          X
            EMPLOYEE
            WORKS_ON
          PROJECT

Heuristic Optimization: Move Select Down

    π LNAME
      σ PNUMBER=PNO
        X
          σ ESSN=SSN
            X
              σ BDATE>'1957-12-31'
                EMPLOYEE
              WORKS_ON
          σ PNAME='Aquarius'
            PROJECT

Heuristic Optimization: Apply Most Restrictive Select First

    π LNAME
      σ ESSN=SSN
        X
          σ PNUMBER=PNO
            X
              σ PNAME='Aquarius'
                PROJECT
              WORKS_ON
          σ BDATE>'1957-12-31'
            EMPLOYEE

Heuristic Optimization: Convert Cartesian Product/Select with Join

    π LNAME
      ⋈ ESSN=SSN
        ⋈ PNUMBER=PNO
          σ PNAME='Aquarius'
            PROJECT
          WORKS_ON
        σ BDATE>'1957-12-31'
          EMPLOYEE

Heuristic Optimization: Move Projections Down the Tree

    π LNAME
      ⋈ ESSN=SSN
        π ESSN
          ⋈ PNUMBER=PNO
            π PNUMBER
              σ PNAME='Aquarius'
                PROJECT
            π ESSN,PNO
              WORKS_ON
        π SSN,LNAME
          σ BDATE>'1957-12-31'
            EMPLOYEE
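The effect of these rewrites can be illustrated with a small Python sketch that evaluates the example query twice: once in canonical form (cross product of everything, then one selection) and once with selections pushed down and cross products replaced by joins. The relation contents are made up for illustration and only the attributes used by the query are kept; the size of the largest intermediate result is used as a rough, assumed stand-in for cost.

```python
from itertools import product

# Made-up toy instances; only attributes used by the query are included.
EMPLOYEE = [  # (SSN, LNAME, BDATE)
    ("1", "Smith", "1965-01-09"),
    ("2", "Wong", "1955-12-08"),
    ("3", "Zelaya", "1968-07-19"),
]
WORKS_ON = [("1", 10), ("2", 10), ("3", 20), ("3", 10)]  # (ESSN, PNO)
PROJECT = [("Aquarius", 10), ("Orion", 20)]              # (PNAME, PNUMBER)


def canonical_plan():
    """Cross product of all three relations, then one big selection."""
    rows = list(product(EMPLOYEE, WORKS_ON, PROJECT))
    result = [e[1] for e, w, p in rows
              if p[0] == "Aquarius" and p[1] == w[1]
              and w[0] == e[0] and e[2] > "1957-12-31"]
    return sorted(result), len(rows)


def optimized_plan():
    """Selections first, then joins that keep only matching pairs."""
    projects = [p for p in PROJECT if p[0] == "Aquarius"]
    # join WORKS_ON with the selected projects on PNO = PNUMBER
    step1 = [w for w, p in product(WORKS_ON, projects) if w[1] == p[1]]
    employees = [e for e in EMPLOYEE if e[2] > "1957-12-31"]
    # join the result with the selected employees on ESSN = SSN
    step2 = [e for w, e in product(step1, employees) if w[0] == e[0]]
    largest = max(len(projects), len(step1), len(employees), len(step2))
    return sorted(e[1] for e in step2), largest


canonical_result, canonical_size = canonical_plan()
optimized_result, optimized_size = optimized_plan()
print(canonical_result, canonical_size)
print(optimized_result, optimized_size)
```

Both plans return the same last names, but the canonical plan materializes 24 combined tuples while the rewritten plan never holds more than 3, even on this tiny input. ISO-formatted date strings are compared as text, which works because that format sorts chronologically.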