Accessing multiple data sources Genomics Which? Where? How? Disease information Integration of databanks Clinical trials Metabolism DISCOVERY toxicology Chemical structure Target structure 1 Access to multiple databanksProblems Disease models 2 Queries over multiple databanks Find PubMed publications on diseases of certain insulin sequences • Users need good knowledge on where the required information is stored and how it can be accessed • Representation of an entity in different databanks can be different. Same name in different databanks can refer to different entities. get relevant diseases SWISS-PROT get IDs of publications • • • • Find Divide & order Execute Combine get publications OMIM PubMed 3 4 Access to multiple databanks steps • Decide which databanks should be used • Divide query into sub-queries to the databanks • Decide in which order to send sub-queries to the databanks • Send sub-queries to the databanks - use the terminology of the databanks • Merge results from the databanks to an answer for the original query Æ mistake in any step can lead to inefficient processing of the query or failure to get a result 5 query Answer1 Answer2 Answer3 Sub-query1 Sub-query1 Sub-query1 Answer1 Answer2 Answer3 Answer1 Answer2 Answer3 6 query Answer1 Answer2 Answer3 query Answer1.1 Answer1.2 Answer1 Answer2 Answer3 Sub-query2(answer1) Sub-query2(answer2) Answer1.1 Sub-query2(answer1) Answer1.2 Answer2.1 Sub-query2(answer2) Answer2.2 Sub-query2(answer1) Answer1.1 Answer1.2 Sub-query2(answer2) Answer2.1 Answer2.2 Answer1.1 Answer1.2 Answer2.1 Answer2.2 7 8 result query Answer1 Answer2 Answer3 query Answer1.1 Answer1.2 Answer1 Answer2 Answer3 Answer2.1 Answer2.2 Answer1.1 Answer1.2 Answer2.1 Answer2.2 Answer3.1 Answer3.1 Answer.a Answer.b Answer.c Answer.d Answer.e Answer.f Subquery3(Answer1.1,Answer1.2, Answer.a Answer2.1,Answer2.2,Answer3.1) Answer.b Sub-query2(answer3) Answer.c Answer.d Subquery3(Answer1.1,Answer1.2, Answer.e Answer2.1,Answer2.2,Answer3.1) Answer.f Answer3.1 Sub-query2(answer3) Sub-query2(answer3) Answer3.1 9 Problem formulation Methods for integration • Databank properties – – – – Protein(name, date, organism) ProteinStructure(name, structure) Data source 1 Protein(name, authors, date, organism) Article(authors, title, year) date>1995 Autonomous databanks Different data models Differences in terminology Overlapping, redundant data • Integration aims to provide transparent access to multiple heterogenous databanks Data source 2 Structure(name, structure, organism) date>2000 10 – uniform query language – uniform representation of results 11 – Link driven federations • Explicit links between databanks. – Warehousing • Data is downloaded, filtered, integrated and stored in a warehouse. Answers to queries are taken from the warehouse. – Mediation or View integration • A global schema is defined over all databanks. 12 Link driven federations Link driven federations • Creates explicit links between databanks • query: get interesting results and use web links to reach related data in other databanks User Interface Answer Assembler Query interpreter Retrieval Engine wrapper wrapper • systems: SRS, Entrez databank A Index A->B Index B->A databank B 13 Link driven federations 14 View integration • Advantages - complex queries - fast • Disadvantages - require good knowledge - syntax based - terminology problem not solved • Define a global schema over the databanks • high level query language • systems: BioKleisli, K2, TAMBIS, BioTRIFU 15 View integration View integration User Interface Query Interpreter and Expander Answer Assembler Retrieval Engine wrapper Ontology Base Databank Knowledge Base databank 16 wrapper • Advantages - complex queries - requires less knowledge - solution for terminology problem - semantics based databank 17 18 View integration Mediation Query Application • Disadvantages - more computation - view maintenance • Query problem How to answer queries expressed using the global schema. Mediator Global schema • Modeling problem Wrapper Local schema Source Wrapper Local schema How to model the global schema, databanks och mappings. Source 19 20 Queries Mediator • Queries use the global schema • Conjunctive queries Query – select-project-join queries if head Mediator body/subgoals Query Reformulation p(X,Z) :- a(X,Y), b(Y,Z) Query Execution Engine • Mediator reformulates queries in terms of a set of queries that use the local schema. Equivalence and containment of queries needs to be preserved. – global schema (domain model/ontology) – local schema (databank model) • Information for integration • Capabilities Mappings Wrapper Local schema Wrapper Local schema Source Source – Access only relevant databanks 22 • Relation between domain and databank content Global schema: Protein(name, date, organism) ProteinStructure(name, structure) – mapping Mediator Global schema Wrapper Mapping • Description of databank content Application Wrapper Issues: – Semantically correct reformulation 21 Knowledge Query – reformulation of queries decide query plan – query optimization – execution of query plan, assemble results into final answer Query Optimization q(name, structure) :- Protein(name, 2001, ‘human’), ProteinStructure(name, structure) - Q1 is contained in Q2 if the result of Q1 is a subset of the result of Q2. • Mediator is responsible for query processing – attributes and constraints – processing capabilities – completeness – cost of query answering – reliability Data source local schema: DS1(name, authors, date, organism) DS2(name, structure, organism) – Global as view The global schema is defined in terms of source terminology • Used for Source Source – selection of relevant databanks – query plan formulation – query plan optimization Protein(name, date, organism) :- DS1(name, authors, date, organism) ProteinStructure(name, structure) :- DS2(name, structure, organism) 23 24 Mapping Query processing in GAV • Relation between domain and databank content Global schema: Protein(name, date, organism) ProteinStructure(name, structure) Query: give name and structure for human proteins with date ‘2001’. q(name, structure) :- Protein(name, 2001, ‘human’), ProteinStructure(name, structure) Data source local schema: DS1(name, authors, date, organism) DS2(name, structure, organism) – Local as view The sources are defined in terms of the global schema. DS1(name, authors, date, organism) -: Protein(name, date, organism), date >1995 DS2(name, structure, organism) -: Protein(name, date, organism), ProteinStructure(name, structure), date >2000 25 GAV: Protein(name, date, organism) :- DS1(name, authors, date, organism) ProteinStructure(name, structure) :- DS2(name, structure, organism) • No explicit representation of databank content • Mapping gives direct information about which data satisfies the global schema. • Query is processed by expanding the query atoms according to their definitions. New query: q(name, structure) :DS1(name, authors, 2001, ’human’), DS2(name, structure, organism) 26 Query processing in LAV Query processing in LAV Query: give name and structure for human proteins with date ‘2001’. q(name, structure) :- Protein(name, 2001, ‘human’), ProteinStructure(name, structure) Query: give name and structure for human proteins with date ‘2001’. q(name, structure) :- Protein(name, 2001, ‘human’), ProteinStructure(name, structure) LAV: LAV: DS1(name, authors, date, organism) -: Protein(name, date, organism), date >1995 DS2(name, structure, organism) -: Protein(name, date, organism), ProteinStructure(name, structure), date >2000 • Mapping does not give direct information about which data satisfies the global schema. • To answer the query it needs to be inferred how the mappings should be used. 27 Comparison GAV - LAV • Bucket algoritm (Information Manifold) – For each sub-goal in query create bucket of relevant views. – Define rewritings of query. Each rewriting consists of one conjunct from every bucket. Check whether the resulting conjunction is contained in the query. – The result is the union of the rewritings. New query: q(name, structure) :DS1(name, authors, 2001, ’human’), DS2(name, structure, organism) 28 Capabilities • Most common capabilities describe attributes • Global as view – Clear how databanks interact – When a databank is added, the global schema can change – Query processing is easy • Local as view – – – – DS1(name, authors, date, organism) -: Protein(name, date, organism), date >1995 DS2(name, structure, organism) -: Protein(name, date, organism), ProteinStructure(name, structure), date >2000 Each databank is specified in isolation Easy to add databanks Easier to specify constraints on the contents of sources Query processing requires reasoning 29 – f - free, attribute can be specified or not – b - bound, a value must be specified for the attribute, all values are permitted – u - unspecified, not permitted to specify a value for the attribute – c[S] - value should be one of the values in finite set S – o[S] - value is not specified or one of the values in finite set S DS1: (name, authors, date, organism) f f b c[human mouse] 30 View integration View integration User Interface Research issues: - knowledge representation Answer Assembler Query Interpreter and Expander - creation and maintenance of global schema Retrieval Engine User Interface Research issues: - knowledge representation Query Interpreter and Expander Answer Assembler - creation of ontologies - merging of ontologies Retrieval Engine wrapper Ontology Base Databank Knowledge Base databank wrapper Databank Knowledge Base databank 31 View integration databank 32 User Interface Answer Assembler Research issue: Query Interpreter and Expander generationAnswer Assembler - semi-automatic of wrappers Retrieval Engine Ontology Base databank View integration User Interface Query Interpreter and Expander wrapper wrapper Ontology Base Retrieval Engine wrapper Research issues: wrapper - unified query language - knowledge representation Databank Biological - strategies for query expansion databank Knowledge - query optimizationdatabank Base 33 wrapper wrapper Ontology Base Databank Knowledge Base databank databank 34 Integration - exercises 35 36 Integration systems Discuss MedMaker, Information Manifold, SIMS and InfoSleuth • • • • • • • What does the system focus on? Describe the query language. Describe the knowledge representation. Describe the query processing approach. Describe the system architecture. What are the advantages of the system? Interesting features? 37