Data Science Summary 2018/2019

Disclaimer
This summary is provided to you by fellow students. The education committee simply cannot check every summary for its quality; therefore, be aware that this summary cannot guarantee your success in the exam.
Kind regards, Education Committee 2018/2019

MADE BY Jolijn Martens

JBI050 Data Management

Week 1a
Why do we study information systems?
Information needs to be stored, used and manipulated
→ Administrative information systems (retail, schedules, banking etc.)

Problems when not using database systems:
• Data redundancy and inconsistency
• Difficulty in accessing data
• Data isolation
• Integrity problems → Hard to add new constraints and manage existing constraints (e.g. account balances must be greater than zero)

Data manipulation: entering, updating, deleting and retrieving information from a database
Data Manipulation Languages (DML):
• Procedural: user specifies what data is required and how to get those data
• Declarative: user specifies what data is required without specifying how to get those data

Database design: how do we design a database and implement it?
Object system: the "real world"
Information system: a representation (= approximation) of the real world in a computer system

Modelling of information systems → key idea: data independence → levels of abstraction
• Physical level: how a record is stored
• Logical level: what data is stored in the database and the relationships among the data
• View level: application programs hide details of data types (and can hide information)

Schema: logical structure of a database
Instance: the actual content of the database at a given moment
Physical data independence: the ability to modify the physical schema without changing the logical schema
Database model: collection of tools for describing data, relationships, semantics and constraints (e.g. ER model)

Entity-Relationship model
→ An attribute can also be a property of a relationship set (e.g. date)
→ Most relationships are binary

Attribute types:
• Simple and composite attributes (e.g. a name composed of first name and last name)
• Single-valued and multi-valued attributes (e.g. phone numbers)
• Derived attributes (e.g. age, given date of birth)

Keys:
• Super key: a set of one or more attributes whose values uniquely determine each entity (e.g. {ID, Name})
• Candidate key: minimal super key (e.g. {ID})
Although several candidate keys may exist, one of the candidate keys is selected to be the primary key

Database design
• Conceptual: with E-R models
• Logical: in the relational model

Goodness of design:
• Conceptual: accurately reflects the semantics of use in the modelled domain
• Logical: accurately reflects the conceptual model and disallows redundancies and update anomalies
• Physical: accurately reflects the logical model and efficiently and reliably supports use of data

Week 1b
Relational Database Model
→ Database consists of a set of relations (tables)
→ Each table has a name and a set of attributes
→ Each attribute has a name and a domain
→ A relation instance is a set of tuples (rows)

Unnamed perspective: a relation instance is a finite subset of D1 × D2 × ... × Dn
→ The components of each tuple must follow a fixed order!
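The unnamed perspective above can be illustrated with a small sketch. The domains D1 and D2, the relation r and the helper is_relation_instance are hypothetical and chosen only for illustration; they do not come from the course material.

```python
from itertools import product

# Hypothetical domains, chosen only for illustration.
D1 = {1, 2, 3}        # domain of the first component (an ID)
D2 = {"Ann", "Bob"}   # domain of the second component (a name)

# The Cartesian product D1 x D2 contains every possible tuple.
all_tuples = set(product(D1, D2))

# A relation instance is a finite subset of that product.
r = {(1, "Ann"), (2, "Bob")}


def is_relation_instance(instance, domains):
    """Check that every tuple has one component per domain, in the fixed order."""
    return all(
        len(t) == len(domains) and all(v in d for v, d in zip(t, domains))
        for t in instance
    )


print(r <= all_tuples)                               # r is a subset of D1 x D2
print(is_relation_instance(r, [D1, D2]))             # components in the right order
print(is_relation_instance({("Ann", 1)}, [D1, D2]))  # components swapped
```

Because the tuples carry no attribute names, swapping the two components produces a tuple that no longer belongs to D1 × D2. This is why the component order has to be fixed in this perspective.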
Named perspective: relational schema R with the attributes A1, A2, ..., An
→ No ordering on attributes

Foreign key: a set of attributes in one table that refers to the primary key of another table

Relational Algebra
• Set-oriented: processing sets of tuples
• Associative: tuples are identified via their properties
• High-level: working at logical/conceptual levels

Operators
• Select σ: selects rows
• Project ∏: selects columns. Duplicate rows are removed from the result
• Union ∪: duplicate rows are removed from the result
  For r ∪ s to be valid:
  → r and s have the same set of attributes
  → the attribute domains must be compatible
• Set difference −: same conditions as for union
• Cartesian product ×: the concatenation of tuples to produce a single tuple
  If the attributes of r(R) and s(S) are not disjoint, renaming must be used
• Rename ρ: ρx(E) returns the result of expression E under the name x

Week 2a
Cardinality and participation
Cardinality is most useful in describing binary relationship sets
• One to one
• One to many
• Many to one
• Many to many

Participation
• Total: every entity participates in at least one relationship → Indicated by a double line
• Partial: some entities may not participate in any relationship

Ternary relationships → to avoid ambiguous interpretations, we allow at most one arrow

Design issues
• Entity sets vs. attributes
• Entity sets vs. relationship sets
• Placement of relationship attributes
• Binary vs. non-binary relationships

Extended E-R features
Weak entity sets: the existence of a weak entity set depends on the existence of an identifying entity set
→ Has no primary key of its own
→ If you can make a weak entity, you have to make it
Specialization: top-down design process
Generalization: bottom-up design process
Overlapping: 2 arrows, non-overlapping: 1 arrow
Aggregation: treats a relationship as an entity → allows relationships between relationships

Week 2b
Advanced Relational Algebra
Additional operations: do not add expressive power to queries
• Assignment ←: assign part of the query to a variable
• Set intersection ∩: r ∩ s = r − (r − s). Same conditions as for the union operator
• Natural join ⋈: joins on the attributes that r and s have in common and keeps all attributes of r and s (each common attribute once)
• Division ÷: suited for queries that include the phrase "for all"
  Example: for r(A, B, C, D, E) and s(D, E) = {(a, 1), (b, 1)}, r ÷ s returns the (A, B, C) combinations that occur in r with both D=a ∧ E=1 and D=b ∧ E=1

Week 3a
Translation E-R model to relational model
• Strong entity set
• Weak entity set
• Many to many relationship set
• Many to one and one to many relationship sets
• Partial on the 'many' side
• One to one relationship set
• Aggregation

Representing specialization via schemas
Dilemma: whether to include all attributes of the person table in the lower tables (e.g. employee and student)
If included: + no natural joins needed, − anomalies
If not included: + avoids anomalies, − requires natural joins → computationally expensive

Week 3b
SQL
→ Declarative, set-oriented, associative, high-level language
→ A query result may contain duplicate tuples
→ The result of an SQL query is a relation
→ SQL names are case insensitive (upper- and lowercase are treated the same)

• Select
  → allows duplicates
  → distinct: removes duplicates
  → all: keeps duplicates (the default)
  → * (asterisk): selects all attributes
  → arithmetic expressions (+, -, *, /) are allowed
• Where
  → and, or, not, between and parentheses
• From
  → to select the desired table(s)
• Rename
  → old-name as new-name (as may be omitted)

String operations
• Percent %: matches any substring
• Underscore _: matches any single character
• Concatenation
• Converting from upper to lower case
• Finding string length, substrings etc.

Set operations
• Union
• Intersect
• Except
Default is to eliminate duplicates; to retain duplicates use union all, intersect all and except all

Subquery: a select-from-where expression that is nested within another query
unique: tests whether a subquery has no duplicate tuples in its result (evaluates to true on an empty set)

Week 4a
Functional Dependencies
First normal form: a relation is in first normal form if the domains of all attributes are atomic → we assume all relations are in first normal form
Good logical design accurately reflects the conceptual design and disallows redundancy as much as possible
In the case that a relation R is not in "good" form, decompose it into a set of relations such that:
• Each relation is in good form
• The decomposition is a lossless-join decomposition
• Preferably, the decomposition should be dependency preserving

Functional dependency: a generalization of the notion of a key → expresses that a certain set of attributes uniquely determines the value of another set of attributes
α → β holds if, whenever two tuples agree on α, they also agree on β:
  t1[α] = t2[α] ⇒ t1[β] = t2[β]
Note: t1 and t2 can point to the same tuple
K is a super key for schema R if K → R
K is a candidate key if K → R and for no α ⊂ K, α → R

We use functional dependencies to:
• Test relations to see if they are legal under a given set of FDs
• Specify constraints on the set of legal relations

Lossy decomposition: occurs when the attributes shared by the two parts are not a super key of either part
Closure of F (F+): the set of all FDs that are logically implied by F
We can find all of F+ by applying Armstrong's rules:
• a1 reflexivity: if β ⊆ α, then α → β
• a2 augmentation: if α → β, then γα → γβ
• a3 transitivity: if α → β and β → γ, then α → γ
• a4 union: if α → β and α → γ hold, then α → βγ holds
• a5 decomposition: if α → βγ holds, then α → β and α → γ hold
• a6 pseudotransitivity: if α → β and γβ → δ hold, then αγ → δ holds
Rules a4 to a6 can be derived from a1 to a3. Rules a1 to a3 are:
• Sound: they generate only FDs that actually hold
• Complete: they generate all FDs that hold
• Non-redundant: if we take out one of the rules, we can no longer generate all FDs that hold
Learn the soundness proofs for Armstrong's rules

Week 4b
Advanced SQL
Aggregation
• avg: average value
• min: minimum value
• max: maximum value
• sum: sum of values
• count: number of values
• group by
• having
• order by: asc (default) or desc. limit can be used to output only the first few results
• is null: to check for null values
Attributes in the select clause that are not inside an aggregate function must appear in the group by list
Three-valued logic: True, False, Unknown
The result of a where clause predicate is treated as false if it evaluates to unknown
SQL assumes all values are known.
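The three-valued logic and the group by / having clauses described above can be tried out with a minimal sketch. It uses Python's built-in sqlite3 module; the account table and its rows are hypothetical, and SQLite is only one SQL implementation, so other systems may differ in details.

```python
import sqlite3

# Hypothetical table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("create table account (id integer primary key, branch text, balance integer)")
conn.executemany(
    "insert into account values (?, ?, ?)",
    [(1, "north", 100), (2, "north", None), (3, "south", 50), (4, "south", 70)],
)

# balance > 60 evaluates to unknown for the null row, so that row is filtered out.
high = conn.execute("select id from account where balance > 60 order by id").fetchall()

# not (unknown) is still unknown, so account 2 is missing here as well.
low = conn.execute("select id from account where not (balance > 60) order by id").fetchall()

# Null values have to be tested with is null.
nulls = conn.execute("select id from account where balance is null").fetchall()

# branch appears outside an aggregate in the select clause, so it is in the group by list.
per_branch = conn.execute(
    "select branch, count(*) from account "
    "group by branch having count(*) >= 2 order by branch asc"
).fetchall()

print(high, low, nulls, per_branch)
```

Account 2 appears in neither the first nor the second result, which shows that a predicate and its negation do not partition the table once null values are involved.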
All aggregate functions except count(*) ignore null values

Week 5a
Canonical cover
Canonical cover: a minimal set of FDs equivalent to F, without any redundancies

Week 5b
Derived relations: the use of a subquery in the from clause
With clause: a way of defining a temporary view
View: provides a mechanism to hide certain data from the view of certain users
Updates of a view: insert into view values ('new value', 'new value')
Most SQL implementations allow updates only on simple views (without aggregates) defined on a single relation
Trigger: a statement that the DBMS executes automatically as a side effect of a modification of the database
ECA rules: Event-condition-action rules

Week 6a
BCNF & 3NF
A relational schema R is in BCNF if for each FD α → β in F+, at least one of the following holds:
• α → β is trivial
• α is a super key for R
BCNF decompositions are not always dependency preserving → constraints across relations require joins in order to verify them
Dependency preservation: holds when there are no FDs in F+ which need to be checked using natural joins in order to verify them

[Figure: BCNF Decomposition Algorithm]

A relational schema R is in 3NF if for each FD α → β in F+, at least one of the following holds:
• α → β is trivial
• α is a super key for R
• Each attribute B in β − α is contained in a candidate key for R

[Figure: 3NF Decomposition Algorithm]

Summary BCNF & 3NF

Normal Form                                          BCNF   3NF
No Redundancy (by definition)                        Yes    No
Dependency Preserving (by decomposition algorithm)   No     Yes
Losslessness (by decomposition algorithm)            Yes    Yes
Supporting mechanism                                 F+     Fc

Goal for relational database design: BCNF, losslessness and dependency preservation.
If we cannot achieve this, we accept either a lack of dependency preservation or the redundancy that comes with using 3NF

Week 6b
Datalog
Declarative, introduces recursion and is widely influential
Safety issues: an unsafe rule can generate an infinite number of answers
• Use the names of the variables in the same positions as the attributes (positional notation)
• All variables in a negated predicate also have to appear in a non-negated predicate
• Every variable that appears in the head should also appear in a non-arithmetic positive literal in the body
Ground instantiation: the result of replacing each variable in the rule by some constant

Relational operations: recursion
Suppose we are given a relation manager(X, Y) with X the direct employee of Y. Find all direct and indirect employees of Jones:
  empl(X, Y) :- manager(X, Y)
  empl(X, Y) :- manager(X, Z), empl(Z, Y)
  ? empl(X, "Jones")
Transitive closure queries:
  TC(X, Y) :- edge(X, Y)
  TC(X, Y) :- edge(X, Z), TC(Z, Y)
  ? TC(X, Y)

Recursion in SQL:
Recursion is monotonic. Recursion with negation may be non-monotonic
Extensional DB views: stored in the database and appear only in rule bodies
Intensional DB views: appear in the head of rules and may also appear in rule bodies
We only permit stratified Datalog programs

Week 7a
Expressive Power of Query Languages
The query languages RA, basic SQL and non-recursive Datalog are equivalent in expressive power to first-order logic
Proof: translation from RA to basic SQL to basic Datalog
Inexpressible queries: middle element of a chain, transitive closure, cycles, find the middle house
Locality rank: how far the query can look
Gaifman local: a query is Gaifman local if there exists a locality rank r such that the query cannot distinguish between r-equivalent neighbourhoods
SQL without recursion (with aggregation and arithmetic), Datalog without recursion and Relational Algebra are Gaifman local

Week 7b
Probable exam questions
Relational Algebra and ER diagram:
• Division assumption: A and B have to be independent of each other. Example of dependence: list all customers who made a purchase in all stores in the cities the customers live in (the set of stores depends on the customer table)
• Translate an English query to an ER diagram and then to a relational schema

Functional dependencies:
• Prove soundness, completeness and non-redundancy of Armstrong's rules
• Find candidate keys given a list of FDs
• A relation with only two attributes is automatically in BCNF (Note: first look at the violating dependencies in F, not F+, and use these dependencies to decompose)
• Check whether a schema is in BCNF and 3NF
• Compute the closure of F
• Compute the canonical cover of F
• Check for losslessness
• Check for dependency preservation
• 3NF decomposition and BCNF decomposition algorithm
• 3NF decomposition: if no candidate key is contained in one of the relations, you need to add an extra relational schema which includes a candidate key

Queries:
• You get some queries in English and you may use each query language only once
• Translate queries to English
• Translate queries to a given language
• Datalog: non-linear recursion

Gaifman locality:
Strategy: if the query is Gaifman local, answer it by formulating a first-order logic query. If it is not, prove that Q is not Gaifman local and answer the query with recursive SQL/Datalog
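
Several of the exam tasks listed above (finding candidate keys, spotting violating dependencies in F) reduce to computing the closure of a set of attributes. The sketch below uses the standard textbook closure procedure, which is not necessarily the presentation used in the course. The schema R(A, B, C, D), the FD set F and all function names are hypothetical and serve only as an illustration.

```python
from itertools import combinations


def closure(attrs, fds):
    """Return all attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Apply lhs -> rhs whenever lhs is already covered.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)


def candidate_keys(schema, fds):
    """Brute-force search for minimal attribute sets whose closure is the schema."""
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            k = frozenset(combo)
            if any(key <= k for key in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(k, fds) == schema:
                keys.append(k)
    return keys


def bcnf_violations(schema, fds):
    """Return the FDs in fds that are non-trivial and whose left side is not a super key."""
    return [(l, r) for l, r in fds if not r <= l and closure(l, fds) != schema]


# Hypothetical example: R(A, B, C, D) with F = {A -> B, B -> C}.
R = frozenset("ABCD")
F = [(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("C"))]

print(sorted(closure(frozenset("A"), F)))
print([sorted(k) for k in candidate_keys(R, F)])
print(len(bcnf_violations(R, F)))
```

The candidate key search tries every subset and is therefore exponential in the number of attributes, which is acceptable for exam-sized schemas but not for large ones.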