Lecture 1: Introduction
Faculty of Computer Science
Technion – Israel Institute of Technology
Spring 2015

Assumed Background
• Databases
  – Relational model, database querying, SQL, relational algebra, schema, integrity constraints (e.g., functional dependencies)
• Algorithms and complexity
  – Asymptotic running time, PTIME, NP, completeness, reduction
• Basic probability theory
  – Probability space, event, random variable, conditional probability

Attendance Requirement
• 4 mandatory assignments, no exam
  – Theoretical (20%), programmatic (30%), theoretical (20%), programmatic (30%)
• To get a grade, students must submit all assignments and attend lectures
  – <=2 misses is fine, >5 misses is unacceptable
  – Exception: students who miss 3-5 lectures can get a grade by attending an easy exam on the course material
    • Must pass, 10% of the grade (other grades normalized accordingly)

UNCERTAINTY IN DATABASES

Some Modern Database Content
[Figure labels: Knowledge Bases; Business DBs; Sensing Data; Integration; Signal / Image Processing; Text Analytics / NLP; Web Pages; Social Media; Financial Reports; OCR / Image; Gov Reports; Med Reports]

Knowledge Bases
[Figure labels: Instance, Concept, Attribute, Value, Relationship, Probability; example: Israel with country (0.4), location (0.35), Person (0.2)]
• Microsoft Probase
• MPI YAGO
• Google Knowledge Graph
• CMU NELL
• Google Knowledge Vault
• Freebase
• Stanford DeepDive
• ...

Relating to Big Data
• Missing information
• Conflicting information
• Probabilistic information

Popular Topics in DB Research
• VLDB 2014 Ten Year Best Paper
  – Nilesh Dalvi and Dan Suciu: Efficient Query Evaluation on Probabilistic Databases
• PODS 2014 Keynote
  – Leonid Libkin: Incomplete data: what went wrong, and how to fix it
• SIGMOD/PODS 2014 Workshop on Big Uncertain Data
  – Kimelfeld (DB) and Kersting (AI)
• ICDT 2013 Test-of-Time Award
  – Ronald Fagin, Phokion Kolaitis, Renee Miller, and Lucian Popa: Data Exchange: Semantics and Query Answering

What’s in the Course?
• Principled, application-independent paradigms to managing uncertainty in data
  – Incomplete / inconsistent / probabilistic databases
• Two key aspects for every paradigm:
  – Representation
    • How do we represent what we know, what is missing, and what is our confidence?
  – Query evaluation
    • What is the meaning of query answering in the presence of uncertainty? What is the involved computational complexity?

INCOMPLETE DATABASES

Missing Information
• Problem: pieces of data missing, but we need to keep whatever partial knowledge we have

  Registrations            Courses
  student   course         course   lecturer
  Ahuva     PL             PL       Eran

• A source tells us that Alon is a student of Keren
  – How can we represent it in our DB?

  Registrations            Courses                 (⊥ = NULL)
  student   course         course   lecturer
  Ahuva     PL             PL       Eran
  Alon      ⊥              ⊥        Keren

SQL’s NULL
• NULL is SQL’s special “missing value”
• Same queries as complete tables, but SQL assigns a special behavior to logic over NULL
  – “Three-valued logic”: true, false, unknown
• Alas, there are some issues...
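
To make the three-valued logic concrete, here is a minimal Python sketch, not PostgreSQL's actual implementation. It assumes Kleene-style connectives, uses `None` to stand in for both NULL and "unknown", and the helper names (`eq`, `not3`, `or3`, `where`) are hypothetical. It shows why a row with a NULL course fails even a condition that looks like a tautology.

```python
from typing import Optional

# None plays two roles here (an assumption of this sketch):
# a NULL value in the data, and the truth value "unknown".
Truth = Optional[bool]

def eq(a, b) -> Truth:
    """SQL-style equality: unknown as soon as either side is NULL."""
    if a is None or b is None:
        return None
    return a == b

def not3(x: Truth) -> Truth:
    """Three-valued NOT: unknown stays unknown."""
    return None if x is None else (not x)

def or3(x: Truth, y: Truth) -> Truth:
    """Three-valued OR: true wins, otherwise unknown is contagious."""
    if x is True or y is True:
        return True
    if x is None or y is None:
        return None
    return False

def where(rows, predicate):
    """Keep only rows whose predicate is exactly true, like a WHERE clause."""
    return [row for row in rows if predicate(row) is True]

# The Registrations table from the slide, with Alon's course unknown.
registrations = [("Ahuva", "PL"), ("Alon", None)]

# course = 'PL' OR course != 'PL'
tautology = lambda row: or3(eq(row[1], "PL"), not3(eq(row[1], "PL")))

print(where(registrations, tautology))  # Alon's row is filtered out
```

For Alon, both disjuncts evaluate to unknown, so the whole condition is unknown and the row is dropped, even though the condition holds for every concrete course value.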

Try It Yourself (psql)

    CREATE TABLE Registrations(
        student varchar(40),
        course varchar(40));
    CREATE TABLE Courses(
        course varchar(40),
        lecturer varchar(40));
    INSERT INTO Registrations VALUES ('Ahuva','PL'), ('Alon',NULL);
    INSERT INTO Courses VALUES ('PL','Eran'), (NULL,'Keren');

  Registrations            Courses
  student   course         course   lecturer
  Ahuva     PL             PL       Eran
  Alon      ⊥              ⊥        Keren

    SELECT student, lecturer
    FROM Registrations R, Courses C
    WHERE R.course = C.course;

  student   lecturer
  Ahuva     Eran

Of course, we've lost our initial association (join): NULL = NULL evaluates to unknown, so Alon's row and Keren's row do not join.

Try More Yourself (psql)

  Registrations            Courses
  student   course         course   lecturer
  Ahuva     PL             PL       Eran
  Alon      ⊥              ⊥        Keren

    SELECT student FROM Registrations;

  student
  Ahuva
  Alon

    SELECT student FROM Registrations WHERE course='PL';

  student
  Ahuva

    SELECT student FROM Registrations WHERE course!='PL';

  student
  (no rows)

    SELECT student FROM Registrations WHERE course='PL' OR course!='PL';

  student
  Ahuva

Alon?? Inconsistent logic... real problem! The last condition holds for every concrete course value, yet Alon is not returned.

Labeled Nulls in “Naive” Tables
• Just like nulls, but each null has a name
  – We do not know what the value is, but we do know that two nulls with the same name are the same

  Registrations            Courses
  student   course         course   lecturer
  Ahuva     PL             PL       Eran
  Alon      ⊥1             ⊥1       Keren
  Ahuva     ⊥2             ⊥2       Shaul

  Registrations ⨝ Courses =
  student   course   lecturer
  Ahuva     PL       Eran
  Alon      ⊥1       Keren
  Ahuva     ⊥2       Shaul
  ?         ?        ?
  ?         ?        ?

Possible Worlds

  Registrations (incomplete)
  student   course
  Ahuva     PL
  Alon      ⊥1
  Ahuva     ⊥2

Closed-World Assumption: each world contains exactly the tuples of the table, with the nulls replaced by values.

  Registrations            Registrations
  student   course         student   course
  Ahuva     PL             Ahuva     PL
  Alon      PL             Alon      DB
  Ahuva     DB             Ahuva     DB
  ...

Open-World Assumption: a world may also contain additional tuples.

  Registrations            Registrations
  student   course         student   course
  Ahuva     PL             Ahuva     PL
  Alon      PL             Alon      DB
  Ahuva     DB             Ahuva     DB
  Anna      AI             Ahuva     AI
                           Avi       ML
  ...
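
The closed-world reading of a naive table can be sketched in a few lines of Python. This is an illustration under stated assumptions, not part of the lecture material: the `Null` class, the function names, and above all the small finite `domain` of course names are hypothetical (in general the domain of values is not restricted to three courses). Each assignment of domain values to the labeled nulls yields one possible world, and an answer is kept only if it is returned in every world.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Null:
    """A labeled null: two nulls with the same label denote the same unknown value."""
    label: int

def closed_worlds(table, domain):
    """Yield every world obtained by replacing each labeled null with a domain value."""
    nulls = sorted({v for row in table for v in row if isinstance(v, Null)},
                   key=lambda n: n.label)
    for values in product(domain, repeat=len(nulls)):
        valuation = dict(zip(nulls, values))
        # Constants map to themselves; nulls map to their assigned value.
        yield frozenset(tuple(valuation.get(v, v) for v in row) for row in table)

def certain_answers(table, domain, query):
    """Answers returned by the query in every closed-world possible world."""
    answer_sets = [query(world) for world in closed_worlds(table, domain)]
    return set.intersection(*answer_sets)

# The incomplete Registrations table from the slide.
registrations = [("Ahuva", "PL"), ("Alon", Null(1)), ("Ahuva", Null(2))]
domain = ["PL", "DB", "AI"]  # assumed finite domain, for illustration only

def takes_pl(world):
    return {student for (student, course) in world if course == "PL"}

def all_students(world):
    return {student for (student, _course) in world}

print(certain_answers(registrations, domain, takes_pl))
print(certain_answers(registrations, domain, all_students))
```

Ahuva takes PL in every world, while Alon does so only in the worlds where ⊥1 is PL, so only Ahuva survives the intersection for the first query. Enumerating worlds is exponential in the number of nulls; the sketch is meant to show the semantics, not an efficient evaluation strategy.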

Semantics of Query Answering
[Figure: an incomplete DB represents a set of possible worlds; the query is evaluated over the possible worlds]
Two options for the result:
• Certain answers (“weak”): the answers that hold in every possible world
• Represent as an incomplete relation (“strong”)

Application: Data Exchange
[Figure: the FQL Table Schema (tables such as status, link, group, note, comment, friend_request, friend, user, page, page_admin, profile, standard_user_info, each with its list of columns) and a Mapping to a Global Schema with Messages, Users, Associations]

The Clio Project
• IBM + U. Toronto – tool for data exchange
• Commercialized in IBM DB2

Formalism [Fagin et al. 05]
A schema mapping is defined by a source schema S, a target schema T, and a set Σ of logical assertions stating how S relates to T

  S: StudLecturer(student, lecturer)
  T: Registrations(student, course), Courses(course, lecturer)
  Σ: StudLecturer(x,y) → ∃z Registrations(x,z) ∧ Courses(z,y)

  StudLecturer (source instance)
  student   lecturer
  Ahuva     Shaul
  Alon      Keren

What is the target instance?? We don’t have z! So 2 options:
1) Abort
2) Do our best to maximize usability

A solution uses a labeled null for each unknown z:

  Registrations            Courses
  student   course         course   lecturer
  Ahuva     ⊥1             ⊥1       Shaul
  Alon      ⊥2             ⊥2       Keren

Problems Studied in Data Exchange
• Materialization
  – Many solutions exist; what makes one solution “better” than another? Is there a “best” solution? How can we find it?
• Target query answering
  – Given a source instance and a query over the target, evaluate the query (semantics / complexity)
• Manipulating schema mappings
  – Composition and inversion of mappings

INCONSISTENT DATABASES

Inconsistency
• An inconsistent database contains inconsistent (or impossible) information
  – Two students have the same ID
  – A student gets credit for the same course twice
  – A student takes a course that is not listed in the course database
  – A student has a grade for a course, but a grade is missing for one of its assignments
• Modeling: (D,Σ) where D is a database and Σ is a set of required logical integrity constraints over DBs; alas, D violates Σ

Query Answering

  Database D
  Grades                         Courses
  student   course   grade       course   lecturer
  Ahuva     PL       90          PL       Eran
  Alon      PL       86          DC       Keren
  Alon      PL       81

  Integrity Constraints Σ
  Functional Dependency: student, course → grade

    SELECT student
    FROM Grades G, Courses C
    WHERE G.grade >= 85 AND G.course = C.course AND C.lecturer='Eran'

  Candidate answers: Ahuva, Alon

The answer for Alon depends on the threshold and on which of his two conflicting PL grades is kept:
• G.grade >= 85: Ahuva qualifies; Alon qualifies only if the grade 86 is kept
• G.grade >= 87: Ahuva qualifies; Alon qualifies with neither grade
• G.grade >= 80: Ahuva qualifies; Alon qualifies with either grade

Minimal Repairs [Arenas, Bertossi, Chomicki 99]
DEFINITION: Let (D,Σ) be an inconsistent DB. A repair is a DB D', such that:
1. DB D' is consistent (with respect to Σ)
2. DB D' differs from D in a “minimal way”

  Inconsistent database D
  Grades
  student   course   grade
  Ahuva     PL       90
  Alon      PL       86
  Alon      PL       81

  Repair D'1                     Repair D'2
  Grades                         Grades
  student   course   grade       student   course   grade
  Ahuva     PL       90          Ahuva     PL       90
  Alon      PL       86          Alon      PL       81

Semantics of Query Answering
[Figure: an inconsistent DB gives rise to a set of repairs (consistent DBs); the query is evaluated over the repairs]
• Consistent answers: the answers obtained in every repair

Algorithms / Complexity
Very recent result by Koutris & Wijsen: for consistent query answering with key constraints, Select-Project-Join (SPJ) queries without repeated relations can be classified into three categories:
1. Solvable by rewriting the query
2. Solvable by a graph algorithm
3. coNP-complete (exponential time under standard complexity assumptions)
[Figure labels: Inconsistent DB; ignore inconsistency; Rewriting; Graph algorithm]

Incorporating Preferences
Functional dependencies: course → lecturer, lecturer → course

  Courses
  course   lecturer
  DB       Keren
  DC       Keren
  DC       Eran

What if we trust tuple 2 more than tuple 1?
Staworko, Chomicki, Marcinkowski: Prioritized repairing and consistent query answering in relational databases. Ann. Math. Artif. Intell. 64(2-3): 209-246 (2012)

PROBABILISTIC DATABASES

How to accommodate the probabilistic nature of data at the database & query level?

  Student   University           Employee   Employer   Role
  Ahuva     Technion             Ahuva      Intel      Eng
  Alon      Technion                                   PM
            HaifaU                                     VP
                                 Alon       Yahoo!     Eng
                                            Google     Eng
                                            Intel      PM

• Find the students that are employed as engineers
• How many students work at Intel?
• Is any PM a Technion student?

How to accommodate the probabilistic nature of data at the database & query level? (with probabilities)

  Student   University   Pr      Employee   Employer   Role   Pr
  Ahuva     Technion     1.0     Ahuva      Intel      Eng    0.7
  Alon      Technion     0.7                           PM     0.2
            HaifaU       0.3                           VP     0.1
                                 Alon       Yahoo!     Eng    0.4
                                            Google     Eng    0.4
                                            Intel      PM     0.1

• Find the students that are employed as engineers
  – Ahuva (0.7), Alon (0.8)
• How many students work at Intel?
  – Expectation = 1 + 0.1
• Is any PM a Technion student?
  – Yes with probability 1-((1-0.2)*(1-0.7*0.1)) = 0.256

Semantics
[Figure: a probabilistic DB defines a probability space over ordinary DBs, with probabilities p1, p2, p3, p4, ..., pn]

Semantics of Query Answering
[Figure: the query is applied to each ordinary DB in the space, carrying over the probabilities p1, p2, p3, p4, ..., pn]
Two options for the result:
• Rep of the probability space
• Mapping: tuple ↦ marginal probability

Algorithms for Query Answering
• Dalvi & Suciu dichotomy: SPJ queries can be fully classified into:
  – Queries that can be solved in polynomial time
    • By repeated decomposition into simpler queries
  – Queries for which answering is #P-hard
    • Hence, cannot be computed in polynomial time under standard complexity assumptions
• Heuristic via BDDs [Olteanu+]
• Guaranteed approximation via sampling
  – Additive approx. p±𝜀 is simple
  – Multiplicative approx. (1±𝜀)p requires more work

Probabilistic XML
[Figure: a probabilistic XML document: a university / department tree with name, position (chair, f. prof, a. prof), ph.d. studs, and member nodes, the names Paul, Nicole, David, Amy, Emily, and probabilities on the edges]
[Abiteboul, Kimelfeld, Sagiv, Senellart]: Representation systems and XPath evaluation

PLANNED SCHEDULE
(lectures numbered 1 to 12 on the slide)

  24/03            Intro
  31/03            DB Essentials
  07/04            Passover
  12/04* (comp)    Incompleteness
  14/04            Data Exchange
  21/04            Inconsistent DBs
                   Assignment 1 due
  28/04            Consistent Q Answering
  05/05            Consistent Q Answering
  12/05            Pref. Repairs + Misc
                   Assignment 2 due
  19/05            Probabilistic DB
  26/05            Query Inference
  02/06            No Lecture
                   Assignment 3 due
  09/06            Query Inference
  16/06            Guest Lecture
  23/06            Extras
                   Assignment 4 due