Outline CUGS Core - Databases

advertisement
:
Outline
•
•
•
•
CUGS Core - Databases
Patrick Lambrix
Linköpings universitet
Introduction: storing and accessing data
Semi-structured data
Information integration
Object-oriented and object-relational
databases
1
Work method
2
Requirements
For each topic:
• introductory presentation by topic
responsible
• in smaller groups: reading papers,
discussion guided by predefined questions,
summary
• each smaller group presents their summary,
final discussion moderated by topic
responsible
3
Databanks/Databases
• Responsible for a topic (presentation +
questions) (ca 60 hours)
• Participation in smaller discussion groups
• Take-home exam (ca 40 hours)
4
Databank
• One of many ways to store data in
electronic form
• used in every-day life: bank, reservation of
hotel or travel, library search, bar codes
• new applications : multimedia databases,
geografic information systems, real-time
databases
• DataBank Management System (DBMS): a
collection of programs that allows a user to
create and maintain a databank
• databank system = physical databank +
DBMS
5
6
1
:
Issues
Databanks
Real life
information
Databank
Model
Databank
Management
System
Queries/
updates
• What information is stored?
• How is the information stored?
(high and low level)
• How is the information accessed?
(user level, system level)
• How is a databank recovered after a crash?
Answers
Processing of
queries and updates
Access to stored data
Physical
databank
7
8
Issues
Persons
• How to keep track of changes of the data
over time?
•
•
•
•
• How can several users access and update
information in a databank at the same time?
• How can a user access information in several
databanks at the same time?
databank administrator
databank designer
’end user’
application programmer
• DBMS designer
• developer of tools
• operator, maintenance
9
DEFINITION
ACCESSION
SOURCE ORGANISM
REFERENCE
AUTHORS
TITLE
REFERENCE
AUTHORS
TITLE
Homo sapiens adrenergic, beta-1-, receptor
NM_000684
human
1
Frielle, Collins, Daniel, Caron, Lefkowitz,
Kobilka
Cloning of the cDNA for the human
beta 1-adrenergic receptor
2
Frielle, Kobilka, Lefkowitz, Caron
Human beta 1- and beta 2-adrenergic
receptors: structurally and functionally
related receptors derived from distinct
genes
11
10
What information is stored?
• Model of reality
- Entity-Relationship model (ER)
- Unified Modeling Language (UML)
12
2
:
Entity-relationship
Entity-Relationship
•
•
•
•
•
protein-id
source
PROTEIN
entities and attributes
entity types
key attributes
relations
cardinality constraints
accession
m
definition
Reference
n
article-id
title
ARTICLE
author
13
How is the information stored?
(high level)
How is the information accessed?
(user level)
•
•
•
•
structure
Text (IR)
Semi-structured data
Data models (DB)
Rules + Facts (KB)
precision
Text - Information Retrieval
• Search based on words
• conceptual models:
boolean, vector, probabilistic, …
• file models:
flat files, inverted files, ...
15
IR – File model: inverted file
inverted file
postings file
WORD
HITS
LINK
…
…
…
adrenergic
32
…
…
cloning
…
receptor
22
…
…
…
…
DOCUMENTS
16
Vector model (simplified)
Doc1 (1,1,0)
Doc2 (0,1,0)
cloning
Doc1
Q (1,1,1)
1
…
53
…
DOC# LINK
document file
…
…
14
5
…
…
Doc2
adrenergic
1
2
…
5
…
…
receptor
17
sim(d,q) = d . q
|d| x |q|
18
3
:
Relational databases
PROTEIN
Databases
REFERENCE
PROTEIN-ID
1
• Relational databases:
- model: tables + relational algebra
- query language (SQL)
• Object-oriented databases:
- model: persistent objects,
messages, encapsulation, inheritance
- query language (t.ex. OQL)
ACCESSION
DEFINITION
SOURCE
PROTEIN-ID
ARTICLE-ID
NM_000684
Homo sapiens
adrenergic,
beta-1-, receptor
human
1
1
1
2
ARTICLE
ARTICLE-ID
1
1
1
1
1
1
2
2
2
2
AUTHOR
Frielle
Collins
Daniel
Caron
Lefkowitz
Kobilka
Frielle
Kobilka
Lefkowitz
Caron
TITLE
Cloning of the cDNA for the human ….
Cloning of the cDNA for the human ….
Cloning of the cDNA for the human ….
Cloning of the cDNA for the human ….
Cloning of the cDNA for the human ….
Cloning of the cDNA for the human ….
Human beta 1- and beta 2-adrenergic receptors
Human beta 1- and beta 2-adrenergic receptors
Human beta 1- and beta 2-adrenergic receptors
Human beta 1- and beta 2-adrenergic receptors
19
20
Relational databases
PROTEIN
REFERENCE
PROTEIN-ID
1
ACCESSION
DEFINITION
SOURCE
PROTEIN-ID
ARTICLE-ID
NM_000684
Homo sapiens
adrenergic,
beta-1-, receptor
human
1
1
1
2
ARTICLE-AUTHOR
ARTICLE-ID
1
1
1
1
1
1
2
2
2
2
SQL
select source
from protein
where accession = NM_000684;
ARTICLE-TITLE
AUTHOR
ARTICLE-ID
Frielle
Collins
Daniel
Caron
Lefkowitz
Kobilka
Frielle
Kobilka
Lefkowitz
Caron
TITLE
1
Cloning of the cDNA for the human
beta 1-adrenergic receptor
2
Human beta 1- and beta 2adrenergic receptors: structurally
and functionally related
receptors derived from distinct
genes
PROTEIN
PROTEIN-ID
1
ACCESSION
DEFINITION
SOURCE
NM_000684
Homo sapiens
adrenergic,
beta-1-, receptor
human
21
22
SQL
select title
from protein, article-title, reference
where protein.accession = NM_000684
and protein.protein-id
= reference.protein-id
and reference.article-id
= article-title.article-id;
PROTEIN
PROTEIN-ID
1
From relational to object model
REFERENCE
PROTEIN-ID
ARTICLE-ID
1
1
1
2
•
•
•
•
CASE
CAD
office automation
multimedia applications
ARTICLE-TITLE
ACCESSION
DEFINITION
SOURCE
NM_000684
Homo sapiens
adrenergic,
beta-1-, receptor
human
ARTICLE-ID
TITLE
1
Cloning of the …
2
Human beta 1- …
23
24
4
:
Object-Oriented Databases
(OODB)
Object
• World is modeled using objects.
• An object has a state (value) and a behavior
(operations).
• Persistent objects - permanent storage
(sometimes transient objects are allowed)
• An object has an object identifier (OID) that
is not visible to the user.
• OID cannot be changed.
• object versus value
(a value has no OID)
• object structure can be arbitrarily complex
(atom, tuple, set, list, bag, array)
25
Example - object state
Example - object state
• o3(id3, tuple,
<title: `Cloning of …’, author: o5 >)
• o4(id4, tuple,
<title: `Human beta-1 …’, author: o6 >)
• o5(id5, list, [Frielle, Collins, Daniel, Caron,
Lefkowitz, Kobilka])
• o6(id6, list, [Frielle, Kobilka , Lefkowitz,
Caron])
• o1(id1, tuple,
<accession: NM_000684,
source : human,
definition: ’Homo sapiens adrenergic …’,
reference: o2>)
• o2(id2, set, {o3,o4})
Remark: These examples do not use a standard syntax
human
NM_000684
27
”Homo sapiens adrenergic,
beta-1-, receptor”
DEFINITION
REFERENCE
AUTHOR
set
TITLE
list
”Cloning of …”
Frielle
AUTHOR
TITLE
list
Collins
Frielle
Daniel
Kobilka
Caron
Lefkowitz
Lefkowitz
Kobilka
28
Classes
SOURCE
ACCESSION
26
”Human beta-1 …”
define class protein
type tuple (
accession: string;
source : string;
definition: string;
reference: set(article); );
operations
create-protein(string,string,string,set(article)): protein;
get-accession: string;
get-source: string;
get-definition: string;
get-references: set(article);
add-reference(article): void;
end protein;
Caron
29
30
5
:
Classes
Example program
program
variables: article1, article2, protein1;
begin
article1 := create-article(’Cloning….’, list(Frielle, Collins,
Daniel, Caron, Lefkowitz, Kobilka));
protein1 := create-protein(NM_000684, human,’Homo
sapiens adrenergic …’, set(article1));
article2 := create-article(’ Human beta-1….’, list(Frielle,
Kobilka , Lefkowitz, Caron]);
protein1.add-reference(article2);
end;
define class article
type tuple (
title: string;
author: list(string); );
operations
create-article(string, list(string)): article;
get-title string;
get-authors: list(string);
print-article-info string;
end article;
31
Operations
32
Inheritance
• encapsulation: operation = interface + body
- interface: how is the operation called?
What is the result of the operation?
> visible to user, used in programs
- body: how is the operation implemented?
> invisible for user
• program is based on message passing
• journal-article subtype-of article:
journal-name journal-volume page-numbers
journal-article inherits all attributes and operations from
article and has in addition also journal-name, journalvolume and page-numbers as attributes
• human-protein subtype-of protein (source = ’human’)
33
Operator overloading
34
Query language OQL
• The same operator name can be used for
different implementations
• select … from … where
select distinct … from … where
• iterator variables
• path expressions
• struct
• example:
print-article-info for article prints information on title and
author.
print-article-info for journal-article prints information on
title, author and also on the journal’s name, volume and
page number..
35
36
6
:
Queries
select o.source
from o in protein
where o.accession = NM_000684;
Queries
NM_000684
ACCESSION
select struct (accession: o.accession, source: o.source)
from o in protein
where Frielle in
(select a.author
from a in o.reference);
human
SOURCE
NM_000684
human
ACCESSION
SOURCE
REFERENCE
set
AUTHOR
Frielle
AUTHOR
Frielle
37
Query language OQL
38
Third-Generation DB Manifesto
OQL also allows:
• views
• aggregation
• special operations for list and array
(first, last, nth)
• order-by
• group-by
• Objects and Rules
- rich type system
- inheritance
- methods and encapsulation
- unique identifiers
- rules (triggers, constraints)
39
Third-Generation DB Manifesto
40
Third-Generation DB Manifesto
• DBMS functionality
- access through non-procedural high-level
language
- specify collections intensionally and
extensionally
- updatable views
- no performance indicators in the model
• Open systems
- accessible via several high-level languages
- persistency
- SQL-like language
- queries and answers are the lowest level of
communication between client and server
41
42
7
:
OODBS Manifesto
OODBS Manifesto
Thou shalt ...
• complex objects
• object identity
• encapsulation
• types and classes
• inheritance
• overriding, overloading, late binding
•
•
•
•
•
•
•
computational completeness
extensibility
persistence
secondary storage management
concurrency
recovery
query facility
43
OODBS Manifesto
44
Semi-structured data
Optional
• multiple inheritance
• distribution
• long and nested transactions
• versions
• data that is not just text, but that is not as
well structured as data in databases
• often seen in web databanks and when
integrating databanks
Thou shalt question the golden rules.
45
Semi-structured data - properties
•
•
•
•
•
irregular structure
implicit structure
partial structure
’data guide’ vs schema
large data guides
46
Semi-structured data - model
• Network of nodes
• object model (oid)
• query: path search in the network
47
48
8
:
ACCESSION
”Homo sapiens adrenergic,
beta-1-, receptor”
human
NM_000684
Queries
SOURCE
DEFINITION
human
NM_000684
REFERENCE
select source
where accession = NM_000684;
REFERENCE
TITLE
AUTHOR
AUTHOR
”Cloning of …”
ACCESSION
SOURCE
TITLE
AUTHOR
AUTHOR
Frielle
Collins
AUTHOR
”Human beta-1 …”
AUTHOR
AUTHOR
Daniel
AUTHOR
Caron
AUTHOR
AUTHOR
Lefkowitz
Kobilka
49
50
Queries
select reference.title
where accession = NM_000684;
Knowledge bases
NM_000684
ACCESSION
REFERENCE
select #p.title
where accession = NM_000684;
TITLE
”Cloning of …”
REFERENCE
TITLE
”Human
beta-1 …”
• Often based on a logic
• Query answering based on inference
mechanism
• Knowledge bases often fit in main memory
• Useful for ontologies
51
52
How is the information stored?
(low level)
Knowledge bases
(F) source(NM_000684, Human)
(R) source(P?,Human) => source(P?,Mammal)
(R) source(P?,Mammal) => source(P?,Vertebrate)
Q: ?- source(NM_000684, Vertebrate)
A: yes
Q: ?- source(x?, Mammal)
A: x? = NM_000684
53
Real life
information
Databank
Model
Databank
management
system
Queries
Answers
Processing of queries and
updates
Access to stored data
Physical
databank
54
9
:
How is the information accessed?
(system level)
Real life
information
Model
Databank
Answers
Processing of queries and
updates
Databank
management
system
Access to stored data
Physical
databank
55
How is a databank recovered
after a crash?
Queries
56
How to keep track of changes of
the data over time?
Real life
information
Recovery when
• system crash
• system error
• concurrency error
• disk error
• catastrophy
Databank
Model
Databank
Management
System
Queries Answers
Processing of
queries and updates
Access to stored data
57
How to keep track of changes of
the data over time?
• data evolution (versioning)
58
How can several users access
and update information in a
databank at the same time?
Real life
information
• schema evolution
Databank
59
Model
Processing of
Databank
queries and updates
management
system
Access to stored data
Physical
databank
60
10
:
Transactions
Administrator 1
TIME
Authorization
Administrator 2
Read(Number-of-proteins)
Number-of-proteins =
Number-of-proteins + 30
Read(Number-of-proteins)
Number-of-proteins =
Number-of-proteins + 25
Write(Number-of-proteins)
• Authorization mechanisms
(implicit/explicit, strong/weak,
positive/negative)
• Extensions for the OO model
(class, inheritance, composite objects,
versions)
Write(Number-of-proteins)
61
62
How can a user access
information in several databanks
at the same time?
Access to multiple databanks problems
• User needs to know where to find
information and how to retrieve it (UI, QL).
• Terminology
Different databanks may store different
information about an entity.
Same names in different databanks may refer
to different entities.
• Query planning is difficult.
query
63
64
Object-oriented and
object-relational databases
Integration methods
• Federations
• Data warehouses
• Mediators
•
•
•
•
•
•
•
65
Data models and query languages
Query processing and optimization
Versioning and schema evolution
Authorization
Storage management and indexing
----------------------------------------------------(XML and databases)
66
11
Download