L1 - Intro

Lecture 1: Introduction
Faculty of Computer Science
Technion – Israel Institute of Technology
Spring 2015
Assumed Background
• Databases
– Relational model, database querying, SQL, relational
algebra, schema, integrity constraints (e.g., functional
dependencies)
• Algorithms and complexity
– Asymptotic running time, ptime, NP, completeness,
reduction
• Basic probability theory
– Probability space, event, random variable, conditional
probability
Attendance Requirement
• 4 mandatory assignments, no exam
– Theoretical (20%), programmatic (30%), theoretical
(20%), programmatic (30%)
• To get a grade, students must submit all
assignments and attend lectures
– <=2 misses is fine, >5 misses is unacceptable
– Exception: students who miss 3-5 lectures can get a
grade by attending an easy exam on the course
material
• Must pass, 10% of the grade (other grades normalized
accordingly)
UNCERTAINTY IN DATABASES
Some Modern Database Content
[Figure: Knowledge Bases, Business DBs, and Sensing Data; Integration, Signal / Image Processing, and Text Analytics / NLP; Web Pages, Social Media, Financial Reports, OCR / Image, Gov Reports, Med Reports]
Knowledge Bases
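[Figure: a knowledge-base graph with nodes labeled Concept, Instance, Attribute, and Value, and edges labeled Relationship; in the example, Israel is linked to country, location, and Person, with the probabilities 0.4, 0.35, and 0.2 shown next to them]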
• Microsoft Probase
• MPI YAGO
• Google Knowledge Graph
• CMU NELL
• Google Knowledge Vault
• Freebase
• Stanford DeepDive
• ...
Relating to Big Data
• Missing information
• Conflicting information
• Probabilistic information
Popular Topics in DB Research
• VLDB 2014 Ten Year Best Paper
– Nilesh Dalvi and Dan Suciu: Efficient Query Evaluation on
Probabilistic Databases
• PODS 2014 Keynote
– Leonid Libkin: Incomplete data: what went wrong, and how to
fix it
• SIGMOD/PODS 2014 Workshop on Big
Uncertain Data
– Kimelfeld (DB) and Kersting (AI)
• ICDT 2013 Test-of-Time Award
– Ronald Fagin, Phokion Kolaitis, Renee Miller, and Lucian
Popa: Data Exchange: Semantics and Query Answering
What’s in the Course?
• Principled, application-independent paradigms
for managing uncertainty in data
– Incomplete / inconsistent / probabilistic databases
• Two key aspects for every paradigm:
– Representation
• How do we represent what we know, what is missing, and
what is our confidence?
– Query evaluation
• What is the meaning of query answering in the presence of
uncertainty? What is the involved computational complexity?
INCOMPLETE DATABASES
Missing Information
• Problem: pieces of data missing, but we need to
keep whatever partial knowledge we have
Registrations
student  course
Ahuva    PL

Courses
course  lecturer
PL      Eran
• A source tells us that Alon is a student of Keren
– How can we represent it in our DB?
Registrations
student  course
Ahuva    PL
Alon     ⊥

Courses
course  lecturer
PL      Eran
⊥       Keren

(⊥ = NULL)
SQL’s NULL
• NULL is SQL’s special “missing value”
• Same queries as complete tables, but SQL
assigns a special behavior to logic over NULL
– “Three-valued logic”: true, false, unknown
• Alas, there are some issues...
Try It Yourself (psql)
CREATE TABLE Registrations(
student varchar(40),
course varchar(40));
CREATE TABLE Courses(
course varchar(40),
lecturer varchar(40));
INSERT INTO Registrations VALUES
('Ahuva','PL'), ('Alon',NULL);
INSERT INTO Courses VALUES
('PL','Eran'), (NULL,'Keren');
Registrations
student  course
Ahuva    PL
Alon     ⊥

Courses
course  lecturer
PL      Eran
⊥       Keren

SELECT student, lecturer
FROM Registrations R, Courses C
WHERE R.course = C.course;

student  lecturer
Ahuva    Eran

Of course, we've lost our initial association (Alon, Keren): the comparison NULL = NULL does not evaluate to true, so the join drops it.
Try More Yourself (psql)
Registrations
student  course
Ahuva    PL
Alon     ⊥

Courses
course  lecturer
PL      Eran
⊥       Keren

SELECT student
FROM Registrations;

student
Alon
Ahuva

SELECT student
FROM Registrations
WHERE course='PL';

student
Ahuva

SELECT student
FROM Registrations
WHERE course!='PL';

student
(no rows)

SELECT student
FROM Registrations
WHERE course='PL' OR course!='PL';

student
Ahuva

Alon?? The condition holds for every possible course value, yet Alon is not returned.
Inconsistent logic... real problem!
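The four queries above can be replayed without a PostgreSQL server. The following is a minimal sketch that assumes SQLite (through Python's standard `sqlite3` module) as a stand-in for psql; the helper name `students` is hypothetical. SQLite applies the same three-valued logic to NULL in these comparisons, so the tautology-like condition still misses Alon.

```python
import sqlite3

# In-memory database; SQLite is an assumed stand-in for the psql session.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Registrations(student TEXT, course TEXT);
    INSERT INTO Registrations VALUES ('Ahuva', 'PL'), ('Alon', NULL);
""")

def students(where=None):
    """Return the sorted student names that satisfy an optional WHERE clause."""
    sql = "SELECT student FROM Registrations"
    if where:
        sql += " WHERE " + where
    sql += " ORDER BY student"
    return [row[0] for row in conn.execute(sql)]

all_students = students()                              # no condition
eq = students("course = 'PL'")                         # true only for Ahuva
neq = students("course != 'PL'")                       # unknown for Alon, false for Ahuva
either = students("course = 'PL' OR course != 'PL'")   # unknown OR unknown = unknown
is_null = students("course IS NULL")                   # the explicit NULL test
```

Only `IS NULL` finds Alon's row; every comparison against NULL evaluates to unknown, and WHERE keeps a row only when the condition is true.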
Labeled Nulls in “Naive” Tables
• Just like nulls, but each null has a name
– We do not know what the value is, but we do know
that two nulls with the same name are the same
Registrations
student  course
Ahuva    PL
Alon     ⊥1
Ahuva    ⊥2

Courses
course  lecturer
PL      Eran
⊥1      Keren
⊥2      Shaul

Registrations ⨝ Courses =
student  course  lecturer
Ahuva    PL      Eran
Alon     ⊥1      Keren
Ahuva    ⊥2      Shaul
?        ?       ?
?        ?       ?
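One way to read the join above is as "naive" evaluation, where a labeled null is treated as an ordinary value that equals only itself. The sketch below illustrates that reading; the class `Null` and the function `naive_join` are hypothetical names introduced here, not part of the lecture.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Null:
    """A labeled null: equal only to a null carrying the same label."""
    label: int

    def __repr__(self):
        return f"\u22a5{self.label}"

registrations = [("Ahuva", "PL"), ("Alon", Null(1)), ("Ahuva", Null(2))]
courses = [("PL", "Eran"), (Null(1), "Keren"), (Null(2), "Shaul")]

def naive_join(regs, crs):
    """Join on course, comparing labeled nulls like ordinary constants."""
    return [(student, course, lecturer)
            for (student, course) in regs
            for (course2, lecturer) in crs
            if course == course2]

joined = naive_join(registrations, courses)
```

This yields exactly the three known rows. The "?" rows stay open: a tuple such as (Alon, PL, Eran) would appear only in worlds where ⊥1 happens to be PL, which naive evaluation does not assume.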
Possible Worlds
Incomplete table:

Registrations
student  course
Ahuva    PL
Alon     ⊥1
Ahuva    ⊥2

Closed-World Assumption (example possible worlds):

Registrations
student  course
Ahuva    PL
Alon     PL
Ahuva    DB

Registrations
student  course
Ahuva    PL
Alon     DB
Ahuva    DB

...

Open-World Assumption (example possible worlds):

Registrations
student  course
Ahuva    PL
Alon     PL
Ahuva    DB
Anna     AI

Registrations
student  course
Ahuva    PL
Alon     DB
Ahuva    DB
Ahuva    AI
Avi      ML

...
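Under the closed-world assumption, the possible worlds can be enumerated by substituting a value for every labeled null. The following sketch assumes a finite course domain {PL, DB} purely for illustration (the real domain may be larger or infinite) and encodes ⊥1 and ⊥2 as the placeholder strings "n1" and "n2"; `closed_worlds` is a hypothetical helper name.

```python
from itertools import product

# "n1" and "n2" are stand-ins for the labeled nulls ⊥1 and ⊥2.
NULLS = ("n1", "n2")
incomplete = [("Ahuva", "PL"), ("Alon", "n1"), ("Ahuva", "n2")]
domain = ["PL", "DB"]  # assumed finite domain of course names

def closed_worlds(table, nulls, domain):
    """Return the set of complete tables obtained by valuating all nulls."""
    worlds = set()
    for values in product(domain, repeat=len(nulls)):
        valuation = dict(zip(nulls, values))
        # Replace each null by its value; constants are left unchanged.
        world = frozenset(tuple(valuation.get(v, v) for v in row)
                          for row in table)
        worlds.add(world)
    return worlds

worlds = closed_worlds(incomplete, NULLS, domain)
```

With this domain there are four worlds, including the two closed-world examples shown above. Under the open-world assumption, every superset of such a world would also count, so the set of worlds can no longer be enumerated this way.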
Semantics of Query Answering
[Figure: an incomplete DB denotes a set of possible worlds, and the query is evaluated in each of them. The result is either the certain answers (“weak”) or is represented as an incomplete relation (“strong”).]
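The "weak" semantics can be sketched directly: evaluate the query in every possible world and keep the answers common to all of them. The example below reuses the Registrations table with labeled nulls, again under the assumptions of a closed world and a finite course domain {PL, DB}; the placeholder strings "n1" and "n2" stand for ⊥1 and ⊥2, and the function names are hypothetical.

```python
from itertools import product

incomplete = [("Ahuva", "PL"), ("Alon", "n1"), ("Ahuva", "n2")]
nulls = ("n1", "n2")    # stand-ins for the labeled nulls
domain = ["PL", "DB"]   # assumed finite domain of course names

def possible_worlds(table, nulls, domain):
    """Yield every complete table obtained by valuating the nulls."""
    for values in product(domain, repeat=len(nulls)):
        valuation = dict(zip(nulls, values))
        yield {tuple(valuation.get(v, v) for v in row) for row in table}

def query(world):
    """Students registered to the course PL."""
    return {student for (student, course) in world if course == "PL"}

answers = [query(w) for w in possible_worlds(incomplete, nulls, domain)]
certain = set.intersection(*answers)   # answers in every world
possible = set.union(*answers)         # answers in at least one world
```

Ahuva is a certain answer, while Alon is only a possible one. Enumerating worlds is exponential in the number of nulls, so this is a definition made executable rather than a practical algorithm.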
Application: Data Exchange
[Figure: the FQL table schema (tables such as status, link, group, note, comment, friend_request, friend, connection, cookies, user, standard_user_info, profile, page, and page_admin, each listed with its PK and attributes) connected by a mapping to a global schema with the tables Messages, Users, and Associations]
The Clio Project
IBM + U. Toronto – tool for data exchange
Commercialized in IBM DB2
Formalism [Fagin et al. 05]
A schema mapping is defined by a source schema S, a target schema
T, and a set Σ of logical assertions stating how S relates to T.

S: StudLecturer(student, lecturer)
T: Registrations(student, course), Courses(course, lecturer)
Σ: StudLecturer(x,y) → ∃z Registrations(x,z) ⋀ Courses(z,y)

Source instance (StudLecturer):
student  lecturer
Ahuva    Shaul
Alon     Keren

?? We don’t have a value for z! So there are 2 options:
1) Abort
2) Do our best to maximize usability
Formalism [Fagin et al. 05]
Σ: StudLecturer(x,y) → ∃z Registrations(x,z) ⋀ Courses(z,y)

Source instance (StudLecturer):
student  lecturer
Ahuva    Shaul
Alon     Keren

Solution (target instance with labeled nulls):

Registrations
student  course
Ahuva    ⊥1
Alon     ⊥2

Courses
course  lecturer
⊥1      Shaul
⊥2      Keren
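The solution above can be produced mechanically: for each source tuple, invent a fresh labeled null for the unknown course z and emit the two target tuples that Σ requires. The sketch below hard-codes this single assertion; the function name `materialize` and the string encoding of nulls are assumptions made for illustration, not the lecture's algorithm.

```python
from itertools import count

def materialize(stud_lecturer):
    """Build a target instance satisfying
    StudLecturer(x,y) -> exists z: Registrations(x,z) and Courses(z,y),
    using one fresh labeled null per source tuple."""
    fresh = count(1)
    registrations, courses = [], []
    for student, lecturer in stud_lecturer:
        z = f"\u22a5{next(fresh)}"          # fresh labeled null for the course
        registrations.append((student, z))
        courses.append((z, lecturer))
    return registrations, courses

def satisfies_mapping(stud_lecturer, registrations, courses):
    """Check that every source tuple has a witnessing course z in the target."""
    return all(
        any((z, lecturer) in courses
            for (s, z) in registrations if s == student)
        for (student, lecturer) in stud_lecturer
    )

source = [("Ahuva", "Shaul"), ("Alon", "Keren")]
registrations, courses = materialize(source)
ok = satisfies_mapping(source, registrations, courses)
```

Using a distinct null per source tuple avoids claiming that Ahuva and Alon share a course, which the source does not say.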
Problems Studied in Data Exchange
• Materialization
– Many solutions exist; what makes one solution
“better” than another? Is there a “best” solution? How
can we find it?
• Target query answering
– Given a source instance and a query over the target,
evaluate the query (semantics / complexity)
• Manipulating schema mappings
– Composition and inversion of mappings
24
INCONSISTENT DATABASES
Inconsistency
• An inconsistent database contains inconsistent
(or impossible) information
– Two students have the same ID
– A student gets credit for the same course twice
– A student takes a course that is not listed in the
course database
– A student has a grade for this course but a grade is
missing for an assignment 
• Modeling: (D,Σ) where D is a database and Σ is
a set of required logical integrity constraints over
DBs; alas, D violates Σ
Query Answering
Database D:

Grades
student  course  grade
Ahuva    PL      90
Alon     PL      86
Alon     PL      81

Courses
course  lecturer
PL      Eran
DC      Keren

Integrity constraints Σ (functional dependency):
student, course → grade

SELECT student
FROM Grades G, Courses C
WHERE
G.grade >= 85 AND
G.course = C.course AND
C.lecturer='Eran'

Ahuva
Alon (supported by the grade 86, but not by the conflicting grade 81)
Query Answering
Same database D and integrity constraints Σ (student, course → grade); the threshold is now 87:

SELECT student
FROM Grades G, Courses C
WHERE
G.grade >= 87 AND
G.course = C.course AND
C.lecturer='Eran'

Ahuva
Alon: not an answer (neither 86 nor 81 reaches 87)
Query Answering
Same database D and integrity constraints Σ (student, course → grade); the threshold is now 80:

SELECT student
FROM Grades G, Courses C
WHERE
G.grade >= 80 AND
G.course = C.course AND
C.lecturer='Eran'

Ahuva
Alon: an answer (both 86 and 81 reach 80)
Minimal Repairs
[Arenas, Bertossi, Chomicki 99]:
DEFINITION: Let (D,Σ) be an inconsistent DB. A
repair is a DB D', such that:
1. DB D' is consistent (with respect to Σ)
2. DB D' differs from D in a “minimal way”
Inconsistent database D:

Grades
student  course  grade
Ahuva    PL      90
Alon     PL      86
Alon     PL      81

Repair D'1:

Grades
student  course  grade
Ahuva    PL      90
Alon     PL      86

Repair D'2:

Grades
student  course  grade
Ahuva    PL      90
Alon     PL      81
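For a key-like constraint such as student, course → grade, the two repairs above can be enumerated by grouping tuples on the key and keeping exactly one tuple per group. This sketch assumes that "minimal way" means a maximal consistent subset of D (tuple deletions only) and that the left-hand side of the dependency determines the whole tuple; `repairs` is a hypothetical helper name.

```python
from collections import defaultdict
from itertools import product

grades = [("Ahuva", "PL", 90), ("Alon", "PL", 86), ("Alon", "PL", 81)]

def repairs(rows, key_len=2):
    """Enumerate subset repairs for a key made of the first key_len columns:
    choose one tuple from every group of tuples that agree on the key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[:key_len]].append(row)
    return [set(choice) for choice in product(*groups.values())]

all_repairs = repairs(grades)
```

The number of repairs is the product of the group sizes, so it can grow exponentially with the number of conflicting keys.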
Semantics of Query Answering
[Figure: an inconsistent DB denotes its set of repairs (consistent DBs), and the query is evaluated on each repair. The consistent answers are the answers obtained on every repair.]
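The consistent answers for the Grades example can be computed by brute force: evaluate the query on every repair and intersect the results. The sketch below assumes subset repairs for the key student, course (one tuple kept per key) and parameterizes the grade threshold; the function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

grades = [("Ahuva", "PL", 90), ("Alon", "PL", 86), ("Alon", "PL", 81)]
courses = [("PL", "Eran"), ("DC", "Keren")]

def repairs(rows, key_len=2):
    """Subset repairs for a key: keep one tuple per (student, course) group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[:key_len]].append(row)
    return [set(choice) for choice in product(*groups.values())]

def query(grade_rows, min_grade):
    """Students with grade >= min_grade in a course taught by Eran."""
    return {student
            for (student, course, grade) in grade_rows
            for (course2, lecturer) in courses
            if grade >= min_grade and course == course2 and lecturer == "Eran"}

def consistent_answers(min_grade):
    """Answers returned on every repair of Grades."""
    return set.intersection(*(query(r, min_grade) for r in repairs(grades)))

at_85 = consistent_answers(85)
at_87 = consistent_answers(87)
at_80 = consistent_answers(80)
```

Alon is a consistent answer only for the threshold 80, where both of his conflicting tuples qualify; for 85 he qualifies in one repair only, and for 87 in none.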
Algorithms / Complexity
Very recent result by Koutris & Wijsen: for consistent query
answering with key constraints, Select-Project-Join (SPJ)
queries without repeated relations can be classified into three
categories:
1. Rewriting: the query is rewritten so that it can be evaluated directly on the inconsistent DB, ignoring the inconsistency
2. Graph algorithm: the query is answered by a graph algorithm over the inconsistent DB
3. coNP-complete (exponential time under standard complexity assumptions)
Incorporating Preferences
Functional dependencies:
course → lecturer
lecturer → course

Courses
tuple  course  lecturer
1      DB      Keren
2      DC      Keren
3      DC      Eran

What if we trust tuple 2 more than tuple 1?
Staworko, Chomicki, Marcinkowski: Prioritized repairing and
consistent query answering in relational databases. Ann. Math. Artif.
Intell. 64(2-3): 209-246 (2012)
PROBABILISTIC DATABASES
How to accommodate the probabilistic nature
of data at the database & query level?
Student  University
Ahuva    Technion
Alon     Technion or HaifaU

Employee  Employer  Role
Ahuva     Intel     Eng, PM, or VP
Alon      Yahoo!    Eng
Alon      Google    Eng
Alon      Intel     PM
• Find the students that are employed as engineers
• How many students work at Intel?
• Is any PM a Technion student?
How to accommodate the probabilistic nature
of data at the database & query level?
Student  University  Pr
Ahuva    Technion    1.0
Alon     Technion    0.7
Alon     HaifaU      0.3

Employee  Employer  Role  Pr
Ahuva     Intel     Eng   0.7
Ahuva     Intel     PM    0.2
Ahuva     Intel     VP    0.1
Alon      Yahoo!    Eng   0.4
Alon      Google    Eng   0.4
Alon      Intel     PM    0.1
• Find the students that are employed as engineers
- Ahuva (0.7), Alon (0.8)
• How many students work at Intel?
- Expectation = 1 + 0.1
• Is any PM a Technion student?
- Yes w/ prob 1-((1-0.2)*(1-0.7*0.1))
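The three numbers above can be checked by enumerating all possible worlds. The sketch below rests on modeling assumptions that the slide does not state explicitly: the rows of one person in one table are mutually exclusive alternatives, different blocks are independent, and Alon's missing 0.1 of employment probability means "no employment tuple". All names in the code are hypothetical.

```python
from itertools import product

people = ["Ahuva", "Alon"]
# Each block lists mutually exclusive alternatives with their probabilities.
student = {
    "Ahuva": [("Technion", 1.0)],
    "Alon": [("Technion", 0.7), ("HaifaU", 0.3)],
}
employee = {
    "Ahuva": [(("Intel", "Eng"), 0.7), (("Intel", "PM"), 0.2), (("Intel", "VP"), 0.1)],
    # The remaining 0.1 is assumed to mean that Alon has no job tuple.
    "Alon": [(("Yahoo!", "Eng"), 0.4), (("Google", "Eng"), 0.4),
             (("Intel", "PM"), 0.1), (None, 0.1)],
}

def worlds():
    """Yield (probability, universities, jobs) for every possible world."""
    blocks = [student[p] for p in people] + [employee[p] for p in people]
    for choice in product(*blocks):
        prob = 1.0
        for _, p in choice:
            prob *= p
        unis = {person: choice[i][0] for i, person in enumerate(people)}
        jobs = {person: choice[len(people) + i][0] for i, person in enumerate(people)}
        yield prob, unis, jobs

# Probability that each student is employed as an engineer.
engineer = {p: sum(pr for pr, _, jobs in worlds()
                   if jobs[p] and jobs[p][1] == "Eng") for p in people}

# Expected number of students working at Intel.
expected_intel = sum(pr * sum(1 for p in people if jobs[p] and jobs[p][0] == "Intel")
                     for pr, _, jobs in worlds())

# Probability that some PM is a Technion student.
pm_technion = sum(pr for pr, unis, jobs in worlds()
                  if any(jobs[p] and jobs[p][1] == "PM" and unis[p] == "Technion"
                         for p in people))
```

Under these assumptions the enumeration agrees with the closed forms on the slide: 0.7 and 0.8 for the engineers, 1 + 0.1 for the expectation, and 1-((1-0.2)*(1-0.7*0.1)) = 0.256 for the last query.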
Semantics
[Figure: a probabilistic DB denotes a probability space over ordinary DBs, which have the probabilities p1, p2, p3, p4, ..., pn]
Semantics of Query Answering
[Figure: the query is evaluated in every ordinary DB of the space, and each result carries the probability (p1, ..., pn) of its DB. The output is either a representation of the resulting probability space or a mapping tuple → marginal probability.]
Algorithms for Query Answering
• Dalvi & Suciu dichotomy: SPJ queries can be
fully classified into:
– Queries that can be solved in polynomial time
• By repeated decomposition into simpler queries
– Queries for which answering is #P-hard
• Hence, cannot be computed in polynomial time under
standard complexity assumptions
• Heuristic via BDDs [Olteanu+]
• Guaranteed approximation via sampling
– Additive approx. p±𝜀 is simple
– Multiplicative approx. (1±𝜀)p requires more work
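The additive approximation can be sketched with plain Monte Carlo sampling: draw random worlds, evaluate the query in each, and report the fraction of worlds where it holds. The example below estimates the probability of "some PM is a Technion student" from the earlier Student/Employee example, assuming independent events with probabilities 0.2 (Ahuva is a PM), 0.7 (Alon studies at the Technion) and 0.1 (Alon is a PM). The sample size uses the standard Hoeffding bound; the function names and the fixed seed are choices made for illustration.

```python
import math
import random

def sample_size(eps, delta):
    """Samples sufficient for |estimate - p| <= eps with probability >= 1 - delta
    (Hoeffding bound for the mean of independent 0/1 samples)."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

def query_holds(rng):
    """Sample one world and evaluate: is some PM a Technion student?"""
    ahuva_pm = rng.random() < 0.2        # Ahuva is a Technion student for sure
    alon_technion = rng.random() < 0.7
    alon_pm = rng.random() < 0.1
    return ahuva_pm or (alon_technion and alon_pm)

def estimate(eps, delta, seed=0):
    """Return the sampled estimate and the number of samples used."""
    rng = random.Random(seed)
    n = sample_size(eps, delta)
    hits = sum(query_holds(rng) for _ in range(n))
    return hits / n, n

p_hat, n = estimate(eps=0.02, delta=0.01)
exact = 1 - (1 - 0.2) * (1 - 0.7 * 0.1)
```

The sample size depends only on ε and δ, not on p, which is why the additive guarantee is simple. A multiplicative guarantee (1±ε)p needs more samples as p gets smaller, so naive sampling no longer suffices.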
Probabilistic XML
[Figure: a probabilistic XML tree rooted at university, with department, name (Paul, Nicole), position (chair, f. prof, a. prof), ph.d. studs, and member nodes (names David, Amy, Emily); edges are annotated with probabilities such as 0.8, 0.9, 0.5, 0.6, and 0.7]
[Abiteboul, Kimelfeld, Sagiv, Senellart]:
Representation systems and XPath evaluation
PLANNED SCHEDULE
Date           Topic                   Notes
24/03          Intro
31/03          DB Essentials
07/04          Passover
12/04* (comp)  Incompleteness
14/04          Data Exchange
21/04          Inconsistent DBs        Assignment 1 due
28/04          Consistent Q Answering
05/05          Consistent Q Answering
12/05          Pref. Repairs + Misc    Assignment 2 due
19/05          Probabilistic DB
26/05          Query Inference
02/06          No Lecture              Assignment 3 due
09/06          Query Inference
16/06          Guest Lecture
23/06          Extras                  Assignment 4 due