: Outline • • • • CUGS Core - Databases Patrick Lambrix Linköpings universitet Introduction: storing and accessing data Semi-structured data Information integration Object-oriented and object-relational databases 1 Work method 2 Requirements For each topic: • introductory presentation by topic responsible • in smaller groups: reading papers, discussion guided by predefined questions, summary • each smaller group presents their summary, final discussion moderated by topic responsible 3 Databanks/Databases • Responsible for a topic (presentation + questions) (ca 60 hours) • Participation in smaller discussion groups • Take-home exam (ca 40 hours) 4 Databank • One of many ways to store data in electronic form • used in every-day life: bank, reservation of hotel or travel, library search, bar codes • new applications : multimedia databases, geografic information systems, real-time databases • DataBank Management System (DBMS): a collection of programs that allows a user to create and maintain a databank • databank system = physical databank + DBMS 5 6 1 : Issues Databanks Real life information Databank Model Databank Management System Queries/ updates • What information is stored? • How is the information stored? (high and low level) • How is the information accessed? (user level, system level) • How is a databank recovered after a crash? Answers Processing of queries and updates Access to stored data Physical databank 7 8 Issues Persons • How to keep track of changes of the data over time? • • • • • How can several users access and update information in a databank at the same time? • How can a user access information in several databanks at the same time? databank administrator databank designer ’end user’ application programmer • DBMS designer • developer of tools • operator, maintenance 9 DEFINITION ACCESSION SOURCE ORGANISM REFERENCE AUTHORS TITLE REFERENCE AUTHORS TITLE Homo sapiens adrenergic, beta-1-, receptor NM_000684 human 1 Frielle, Collins, Daniel, Caron, Lefkowitz, Kobilka Cloning of the cDNA for the human beta 1-adrenergic receptor 2 Frielle, Kobilka, Lefkowitz, Caron Human beta 1- and beta 2-adrenergic receptors: structurally and functionally related receptors derived from distinct genes 11 10 What information is stored? • Model of reality - Entity-Relationship model (ER) - Unified Modeling Language (UML) 12 2 : Entity-relationship Entity-Relationship • • • • • protein-id source PROTEIN entities and attributes entity types key attributes relations cardinality constraints accession m definition Reference n article-id title ARTICLE author 13 How is the information stored? (high level) How is the information accessed? (user level) • • • • structure Text (IR) Semi-structured data Data models (DB) Rules + Facts (KB) precision Text - Information Retrieval • Search based on words • conceptual models: boolean, vector, probabilistic, … • file models: flat files, inverted files, ... 15 IR – File model: inverted file inverted file postings file WORD HITS LINK … … … adrenergic 32 … … cloning … receptor 22 … … … … DOCUMENTS 16 Vector model (simplified) Doc1 (1,1,0) Doc2 (0,1,0) cloning Doc1 Q (1,1,1) 1 … 53 … DOC# LINK document file … … 14 5 … … Doc2 adrenergic 1 2 … 5 … … receptor 17 sim(d,q) = d . q |d| x |q| 18 3 : Relational databases PROTEIN Databases REFERENCE PROTEIN-ID 1 • Relational databases: - model: tables + relational algebra - query language (SQL) • Object-oriented databases: - model: persistent objects, messages, encapsulation, inheritance - query language (t.ex. OQL) ACCESSION DEFINITION SOURCE PROTEIN-ID ARTICLE-ID NM_000684 Homo sapiens adrenergic, beta-1-, receptor human 1 1 1 2 ARTICLE ARTICLE-ID 1 1 1 1 1 1 2 2 2 2 AUTHOR Frielle Collins Daniel Caron Lefkowitz Kobilka Frielle Kobilka Lefkowitz Caron TITLE Cloning of the cDNA for the human …. Cloning of the cDNA for the human …. Cloning of the cDNA for the human …. Cloning of the cDNA for the human …. Cloning of the cDNA for the human …. Cloning of the cDNA for the human …. Human beta 1- and beta 2-adrenergic receptors Human beta 1- and beta 2-adrenergic receptors Human beta 1- and beta 2-adrenergic receptors Human beta 1- and beta 2-adrenergic receptors 19 20 Relational databases PROTEIN REFERENCE PROTEIN-ID 1 ACCESSION DEFINITION SOURCE PROTEIN-ID ARTICLE-ID NM_000684 Homo sapiens adrenergic, beta-1-, receptor human 1 1 1 2 ARTICLE-AUTHOR ARTICLE-ID 1 1 1 1 1 1 2 2 2 2 SQL select source from protein where accession = NM_000684; ARTICLE-TITLE AUTHOR ARTICLE-ID Frielle Collins Daniel Caron Lefkowitz Kobilka Frielle Kobilka Lefkowitz Caron TITLE 1 Cloning of the cDNA for the human beta 1-adrenergic receptor 2 Human beta 1- and beta 2adrenergic receptors: structurally and functionally related receptors derived from distinct genes PROTEIN PROTEIN-ID 1 ACCESSION DEFINITION SOURCE NM_000684 Homo sapiens adrenergic, beta-1-, receptor human 21 22 SQL select title from protein, article-title, reference where protein.accession = NM_000684 and protein.protein-id = reference.protein-id and reference.article-id = article-title.article-id; PROTEIN PROTEIN-ID 1 From relational to object model REFERENCE PROTEIN-ID ARTICLE-ID 1 1 1 2 • • • • CASE CAD office automation multimedia applications ARTICLE-TITLE ACCESSION DEFINITION SOURCE NM_000684 Homo sapiens adrenergic, beta-1-, receptor human ARTICLE-ID TITLE 1 Cloning of the … 2 Human beta 1- … 23 24 4 : Object-Oriented Databases (OODB) Object • World is modeled using objects. • An object has a state (value) and a behavior (operations). • Persistent objects - permanent storage (sometimes transient objects are allowed) • An object has an object identifier (OID) that is not visible to the user. • OID cannot be changed. • object versus value (a value has no OID) • object structure can be arbitrarily complex (atom, tuple, set, list, bag, array) 25 Example - object state Example - object state • o3(id3, tuple, <title: `Cloning of …’, author: o5 >) • o4(id4, tuple, <title: `Human beta-1 …’, author: o6 >) • o5(id5, list, [Frielle, Collins, Daniel, Caron, Lefkowitz, Kobilka]) • o6(id6, list, [Frielle, Kobilka , Lefkowitz, Caron]) • o1(id1, tuple, <accession: NM_000684, source : human, definition: ’Homo sapiens adrenergic …’, reference: o2>) • o2(id2, set, {o3,o4}) Remark: These examples do not use a standard syntax human NM_000684 27 ”Homo sapiens adrenergic, beta-1-, receptor” DEFINITION REFERENCE AUTHOR set TITLE list ”Cloning of …” Frielle AUTHOR TITLE list Collins Frielle Daniel Kobilka Caron Lefkowitz Lefkowitz Kobilka 28 Classes SOURCE ACCESSION 26 ”Human beta-1 …” define class protein type tuple ( accession: string; source : string; definition: string; reference: set(article); ); operations create-protein(string,string,string,set(article)): protein; get-accession: string; get-source: string; get-definition: string; get-references: set(article); add-reference(article): void; end protein; Caron 29 30 5 : Classes Example program program variables: article1, article2, protein1; begin article1 := create-article(’Cloning….’, list(Frielle, Collins, Daniel, Caron, Lefkowitz, Kobilka)); protein1 := create-protein(NM_000684, human,’Homo sapiens adrenergic …’, set(article1)); article2 := create-article(’ Human beta-1….’, list(Frielle, Kobilka , Lefkowitz, Caron]); protein1.add-reference(article2); end; define class article type tuple ( title: string; author: list(string); ); operations create-article(string, list(string)): article; get-title string; get-authors: list(string); print-article-info string; end article; 31 Operations 32 Inheritance • encapsulation: operation = interface + body - interface: how is the operation called? What is the result of the operation? > visible to user, used in programs - body: how is the operation implemented? > invisible for user • program is based on message passing • journal-article subtype-of article: journal-name journal-volume page-numbers journal-article inherits all attributes and operations from article and has in addition also journal-name, journalvolume and page-numbers as attributes • human-protein subtype-of protein (source = ’human’) 33 Operator overloading 34 Query language OQL • The same operator name can be used for different implementations • select … from … where select distinct … from … where • iterator variables • path expressions • struct • example: print-article-info for article prints information on title and author. print-article-info for journal-article prints information on title, author and also on the journal’s name, volume and page number.. 35 36 6 : Queries select o.source from o in protein where o.accession = NM_000684; Queries NM_000684 ACCESSION select struct (accession: o.accession, source: o.source) from o in protein where Frielle in (select a.author from a in o.reference); human SOURCE NM_000684 human ACCESSION SOURCE REFERENCE set AUTHOR Frielle AUTHOR Frielle 37 Query language OQL 38 Third-Generation DB Manifesto OQL also allows: • views • aggregation • special operations for list and array (first, last, nth) • order-by • group-by • Objects and Rules - rich type system - inheritance - methods and encapsulation - unique identifiers - rules (triggers, constraints) 39 Third-Generation DB Manifesto 40 Third-Generation DB Manifesto • DBMS functionality - access through non-procedural high-level language - specify collections intensionally and extensionally - updatable views - no performance indicators in the model • Open systems - accessible via several high-level languages - persistency - SQL-like language - queries and answers are the lowest level of communication between client and server 41 42 7 : OODBS Manifesto OODBS Manifesto Thou shalt ... • complex objects • object identity • encapsulation • types and classes • inheritance • overriding, overloading, late binding • • • • • • • computational completeness extensibility persistence secondary storage management concurrency recovery query facility 43 OODBS Manifesto 44 Semi-structured data Optional • multiple inheritance • distribution • long and nested transactions • versions • data that is not just text, but that is not as well structured as data in databases • often seen in web databanks and when integrating databanks Thou shalt question the golden rules. 45 Semi-structured data - properties • • • • • irregular structure implicit structure partial structure ’data guide’ vs schema large data guides 46 Semi-structured data - model • Network of nodes • object model (oid) • query: path search in the network 47 48 8 : ACCESSION ”Homo sapiens adrenergic, beta-1-, receptor” human NM_000684 Queries SOURCE DEFINITION human NM_000684 REFERENCE select source where accession = NM_000684; REFERENCE TITLE AUTHOR AUTHOR ”Cloning of …” ACCESSION SOURCE TITLE AUTHOR AUTHOR Frielle Collins AUTHOR ”Human beta-1 …” AUTHOR AUTHOR Daniel AUTHOR Caron AUTHOR AUTHOR Lefkowitz Kobilka 49 50 Queries select reference.title where accession = NM_000684; Knowledge bases NM_000684 ACCESSION REFERENCE select #p.title where accession = NM_000684; TITLE ”Cloning of …” REFERENCE TITLE ”Human beta-1 …” • Often based on a logic • Query answering based on inference mechanism • Knowledge bases often fit in main memory • Useful for ontologies 51 52 How is the information stored? (low level) Knowledge bases (F) source(NM_000684, Human) (R) source(P?,Human) => source(P?,Mammal) (R) source(P?,Mammal) => source(P?,Vertebrate) Q: ?- source(NM_000684, Vertebrate) A: yes Q: ?- source(x?, Mammal) A: x? = NM_000684 53 Real life information Databank Model Databank management system Queries Answers Processing of queries and updates Access to stored data Physical databank 54 9 : How is the information accessed? (system level) Real life information Model Databank Answers Processing of queries and updates Databank management system Access to stored data Physical databank 55 How is a databank recovered after a crash? Queries 56 How to keep track of changes of the data over time? Real life information Recovery when • system crash • system error • concurrency error • disk error • catastrophy Databank Model Databank Management System Queries Answers Processing of queries and updates Access to stored data 57 How to keep track of changes of the data over time? • data evolution (versioning) 58 How can several users access and update information in a databank at the same time? Real life information • schema evolution Databank 59 Model Processing of Databank queries and updates management system Access to stored data Physical databank 60 10 : Transactions Administrator 1 TIME Authorization Administrator 2 Read(Number-of-proteins) Number-of-proteins = Number-of-proteins + 30 Read(Number-of-proteins) Number-of-proteins = Number-of-proteins + 25 Write(Number-of-proteins) • Authorization mechanisms (implicit/explicit, strong/weak, positive/negative) • Extensions for the OO model (class, inheritance, composite objects, versions) Write(Number-of-proteins) 61 62 How can a user access information in several databanks at the same time? Access to multiple databanks problems • User needs to know where to find information and how to retrieve it (UI, QL). • Terminology Different databanks may store different information about an entity. Same names in different databanks may refer to different entities. • Query planning is difficult. query 63 64 Object-oriented and object-relational databases Integration methods • Federations • Data warehouses • Mediators • • • • • • • 65 Data models and query languages Query processing and optimization Versioning and schema evolution Authorization Storage management and indexing ----------------------------------------------------(XML and databases) 66 11