
JBI050 Pattern summary

Data Science Summary
2018/2019
Disclaimer
This summary is provided to you by fellow students. The education committee simply cannot
check every summary for its quality. Be aware, therefore, that this summary cannot guarantee
your success in the exam.
Kind regards,
Education Committee 2018/2019
MADE BY Jolijn Martens
JBI050 Data Management
Week 1a
Why do we study information systems?
Information needs to be stored, used and manipulated → Administrative information systems
(retail, schedules, banking etc.)
Problems when not using database systems:
• Data redundancy and inconsistency
• Difficulty in accessing data
• Data isolation
• Integrity problems → Hard to add new constraints and manage existing constraints (e.g.
account balances must be greater than zero)
Data manipulation: entering, updating, deleting and retrieving information from a database
Data Manipulation Languages (DML):
• Procedural: user specifies what data is required and how to get those data
• Declarative: user specifies what data is required without specifying how to get those data
Database design: How do we design a database and implement it
Object system: the ”real world”
Information system: a representation (= approximation) of the real world in a computer system
Modelling of information systems → key idea: data independence
→ levels of abstraction
• physical level: how a record is stored
• logical level: what data is stored in the database and the relationships among the data
• view level: application programs to hide details of data types (and can hide information)
Schema: logical structure of a database
Instance: the actual content
Physical data independence: The ability to modify the physical schema without changing the
logical schema
Database model: collection of tools for describing data, relationships, semantics and constraints (e.g. ER model)
Entity-Relationship model
→ An attribute can also be a property of a relationship set (e.g. date)
→ Most relationships are binary
Attribute types:
• Simple and composite (e.g. first name - last name) attributes
• Single-valued and multi-valued (phone numbers) attributes
• derived attributes (age (given date of birth))
Keys:
• super key: a set of one or more attributes whose values uniquely determine each
entity (e.g. {ID, Name})
• candidate key: minimal super key (e.g. {ID})
Although several candidate keys may exist, one of the candidate keys is selected to be the primary key
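The key definitions above can be tried out on a small instance. This is a minimal Python sketch, assuming a toy relation stored as a list of dicts; the rows and the helper names `is_superkey` and `candidate_keys` are hypothetical. Note that an instance can only refute a key: whether a key really holds is a property of the schema, not of one instance.

```python
from itertools import combinations

# Hypothetical instance of an instructor relation (illustrative data only).
rows = [
    {"ID": 1, "Name": "Kim", "Dept": "Math"},
    {"ID": 2, "Name": "Kim", "Dept": "Math"},
    {"ID": 3, "Name": "Lee", "Dept": "Physics"},
]

def is_superkey(attrs, rows):
    """True if no two rows agree on all attributes in attrs."""
    seen = {tuple(row[a] for a in attrs) for row in rows}
    return len(seen) == len(rows)

def candidate_keys(rows):
    """Minimal superkeys of this instance, smallest first."""
    names = sorted(rows[0])
    keys = []
    for size in range(1, len(names) + 1):
        for attrs in combinations(names, size):
            # Supersets of an already found key are superkeys but not minimal.
            if any(set(k) <= set(attrs) for k in keys):
                continue
            if is_superkey(attrs, rows):
                keys.append(attrs)
    return keys
```

Here `("ID", "Name")` is a super key, but only `("ID",)` is minimal, so it is the single candidate key of this instance.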
Database design
• Conceptual with E-R models
• Logical in relational model
Goodness of design:
• Conceptual: accurately reflects the semantics of use in the modelled domain
• Logical: accurately reflects the conceptual model and disallows redundancies and update anomalies
• physical: accurately reflects logical model and efficiently and reliably supports use of data
Week 1b
Relational Database Model
→ Database consists of a set of relations (tables)
→ Each table has a name and a set of attributes
→ Each attribute has a name and a domain
→ A relation instance is a set of tuples (rows)
Unnamed perspective: a relation instance is a finite subset of D1 × D2 × ... × Dn
→ We need to observe a fixed order of the components within the tuples!
Named perspective: Relational schema R with the attributes A1 , A2 , .., An
→ no ordering on attributes
Foreign key: attribute(s) of one table that refer to the primary key of another table
Relational Algebra
• Set-oriented: processing sets of tuples
• Associative: tuples are identified via their properties
• High-level: working at logical/conceptual levels
Operators
• Select σ
To select rows
• Project ∏
To select columns; duplicate rows are removed from the result
• Union ∪
Duplicate rows are removed from the result
For r ∪ s to be valid:
→ r,s have the same set of attributes
→ the attribute domains must be compatible
• Set difference −
Same conditions as for union
• Cartesian product ×
The concatenation of tuples to produce a single tuple
If attributes of r(R) and s(S) are not disjoint, renaming must be used
• Rename ρX(E) returns the expression E under the name X
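The operators above can be sketched in a few lines of Python. This is only an illustration under the named perspective, assuming a relation is a set of tuples and each tuple is a frozenset of (attribute, value) pairs; the helper names and the sample data are hypothetical. Union and set difference are Python's built-in `|` and `-` on sets, and `rename` here renames attributes rather than the relation itself.

```python
def rel(*rows):
    """Build a relation: a set of tuples, each a frozenset of (attribute, value) pairs."""
    return {frozenset(r.items()) for r in rows}

def select(r, pred):
    """σ: keep the tuples for which pred(tuple_as_dict) is true."""
    return {t for t in r if pred(dict(t))}

def project(r, attrs):
    """∏: keep only the attributes in attrs; the set removes duplicate rows."""
    return {frozenset((a, v) for a, v in t if a in attrs) for t in r}

def rename(r, mapping):
    """ρ: rename attributes according to mapping (old name -> new name)."""
    return {frozenset((mapping.get(a, a), v) for a, v in t) for t in r}

def product(r, s):
    """×: concatenate every tuple of r with every tuple of s (disjoint attribute names)."""
    return {t | u for t in r for u in s}

people = rel(
    {"name": "Ann", "city": "Delft"},
    {"name": "Bob", "city": "Breda"},
    {"name": "Cas", "city": "Delft"},
)
cities = project(people, {"city"})               # duplicates collapse to 2 tuples
in_delft = select(people, lambda t: t["city"] == "Delft")
towns = rename(cities, {"city": "town"})         # rename before taking the product
pairs = product(people, towns)
```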
Week 2a
Cardinality and participation
Most useful in describing binary relationship sets
• One to one
• one to many
• Many to one
• Many to many
Participation
• Total: every entity participates in at least one relationship → indicated by a double line
• Partial: some entities may not participate in any relationship
Ternary relationships → to avoid ambiguity, we allow at most one arrow
Although several candidate keys may exist, one of the keys is selected to be the primary
key
Design issues
• Entity sets vs. attributes
• Entity sets vs. relationship sets
• Placement of relationship attributes
• Binary vs. non-binary relationships
Extended E-R features
Weak Entity sets: The existence of a weak entity set depends on the existence of an identifying
entity set → Has no primary key of its own
→ If you can make a weak entity, you have to make it
Specialization: Top-down design process
Generalization: Bottom-up design process
Overlapping: 2 arrows, non-overlapping: 1 arrow
Aggregation: Treats a relationship as an entity → allows relationships between relationships
Week 2b
Advanced Relational Algebra
Additional operations: do not add expressive power, but simplify common queries
• Assignment ←
assign part of the query to a variable
• Set intersection ∩
r ∩ s = r - (r - s)
Same conditions as union operator
• Natural join ⋈
Join on the attributes that have the same name in r and s; the result contains all attributes of r and s, with the shared ones appearing once
• Division ÷
Suited for queries that include the phrase ”for all”
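Example: dividing r(A,B,C,D,E) by a relation s(D,E) that contains (a,1) and (b,1) projects A,B,C and keeps the tuples that occur in r with both D=a ∧ E=1 and D=b ∧ E=1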
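A minimal Python sketch of natural join and division, using the same assumed representation as before (a relation is a set of frozensets of (attribute, value) pairs). The function names and the purchase data are hypothetical; the division example answers a typical "for all" query.

```python
def rel(*rows):
    """Build a relation: a set of tuples, each a frozenset of (attribute, value) pairs."""
    return {frozenset(r.items()) for r in rows}

def natural_join(r, s):
    """Natural join: combine tuples that agree on all attributes they share."""
    out = set()
    for t in r:
        for u in s:
            dt, du = dict(t), dict(u)
            if all(dt[a] == du[a] for a in dt.keys() & du.keys()):
                out.add(t | u)  # shared attributes appear once in the union
    return out

def divide(r, s):
    """Division: tuples over R - S that occur in r combined with every tuple of s."""
    s_attrs = {a for u in s for a, _ in u}
    candidates = {frozenset((a, v) for a, v in t if a not in s_attrs) for t in r}
    return {c for c in candidates if all((c | u) in r for u in s)}

purchases = rel(
    {"cust": "Ann", "store": "S1"},
    {"cust": "Ann", "store": "S2"},
    {"cust": "Bob", "store": "S1"},
)
stores = rel({"store": "S1"}, {"store": "S2"})
store_city = rel({"store": "S1", "city": "Delft"}, {"store": "S2", "city": "Breda"})

bought_everywhere = divide(purchases, stores)   # customers who bought in all stores
with_city = natural_join(purchases, store_city)
```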
Week 3a
Translation E-R model to relational model
• Strong entity set
• Weak entity set
• Many to many relationship set
• Many to one and one to many relationship sets
• Partial on the ’many’ side
• one to one relationship set
• aggregation
Representing specialization via schemas
Dilemma: whether to include all attributes of the person table in lower tables (e.g. employee
and student)
If included: + no natural joins needed, - redundancy, which can cause anomalies
If not included: + avoids anomalies, - natural joins needed → computationally expensive
Week 3b
SQL
→ Declarative, set-oriented, associative, high-level language
→ A query result may contain duplicate tuples
→ The result of an SQL query is a relation
→ SQL names are case insensitive (upper- and lowercase)
• Select
→ allows duplicates in the result
→ distinct: removes duplicates
→ all: keeps duplicates (default)
→ * (asterisk): selects all attributes
→ may contain arithmetic expressions (+, -, *, /)
• Where
→ and, or, not, between and parentheses
• From
→ to select the desired table(s)
• Rename
→ old-name as new-name (as may be omitted)
String operations
• percent %: matches any substring
• underscore _: matches any single character
• concatenation
• converting from upper to lower case
• finding string length, substrings etc.
Set operations
• Union
• Intersect
• Except
Default is to eliminate duplicates; to retain duplicates use union all, intersect all and except all
Subquery: a select-from-where expression that is nested within another query
unique: tests whether the result of a subquery is free of duplicate tuples (so it also evaluates
to true on an empty set)
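The string patterns, set operations and subqueries above can be tried with Python's built-in sqlite3 module. This is a sketch with hypothetical tables and data. SQLite is used only for convenience: it supports union all, but to my knowledge not intersect all, except all or the unique predicate, so those are left out.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table student(name text, city text);
    create table employee(name text, city text);
    insert into student values ('Anna', 'Delft'), ('Bram', 'Breda');
    insert into employee values ('Anna', 'Delft'), ('Cleo', 'Delft');
""")

def q(sql):
    """Run a query and return all result rows."""
    return con.execute(sql).fetchall()

# % matches any substring, _ matches exactly one character.
like_rows = q("select name from student where name like 'A%'")
underscore_rows = q("select name from student where name like '_ram'")

# union removes duplicates, union all keeps them.
union_rows = q("select name from student union select name from employee order by name")
union_all_rows = q("select name from student union all select name from employee order by name")

# intersect and except also eliminate duplicates by default.
both = q("select name from student intersect select name from employee")
only_students = q("select name from student except select name from employee")

# Subquery nested in the where clause.
in_rows = q("select name from employee where city in (select city from student) order by name")
```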
Week 4a
Functional Dependencies
First normal form: if the domains of all attributes are atomic → we assume all relations are in
first normal form
Good logical design accurately reflects the conceptual design and disallows redundancy as best
as possible
In the case that a relation R is not in ”good” form, decompose it into a set of relations such that:
• Each relation is in good form
• The decomposition is a lossless-join decomposition
• Preferably, the decomposition should be dependency preserving
Functional dependency: a generalization of the notion of a key → makes sure that a certain set
of attributes uniquely determines the value for another set of attributes
α → β holds if, whenever two tuples agree on α, they also agree on β
t1[α] = t2[α] ⇒ t1[β] = t2[β]
Note: t1 and t2 can point to the same tuple
K is a superkey for schema R if K → R
K is a candidate key if K → R and for no α ⊂ K, α → R
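The definition of α → β translates directly into a check on a relation instance. A minimal Python sketch, assuming rows are stored as dicts; the function name `fd_holds` and the sample rows are hypothetical. A passing check only shows that the instance is legal under the FD, not that the FD holds for the schema.

```python
def fd_holds(rows, lhs, rhs):
    """Check lhs -> rhs on an instance: tuples that agree on lhs must agree on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[b] for b in rhs)
        # setdefault stores val the first time key is seen and returns the stored value.
        if seen.setdefault(key, val) != val:
            return False
    return True

# Hypothetical instance (illustrative data only).
rows = [
    {"ID": 1, "dept": "Math", "building": "MF"},
    {"ID": 2, "dept": "Math", "building": "MF"},
    {"ID": 3, "dept": "CS", "building": "MF"},
]
```

On this instance `dept → building` holds, while `building → dept` is violated by the rows with ID 1 and ID 3.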
We use functional dependencies to:
• Test relations to see if they are legal under a given set of FD’s
• Specify constraints on the set of legal relations
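Lossy decomposition: can occur if the attributes shared by the two resulting relations are not a super key of at least one of them (the natural join then produces tuples that were not in the original relation)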
Closure of F (F+): the set of all FD’s that are logically implied by F
We can compute F+ by repeatedly applying Armstrong’s Rules:
• a1 reflexivity if β ⊆ α, then α → β
• a2 augmentation if α → β, then γα → γβ
• a3 transitivity if α → β and β → γ, then α → γ
• a4 union if α → β holds and α → γ holds, then α → βγ holds
• a5 decomposition if α → βγ holds, then α → β and α → γ holds.
• a6 pseudotransitivity if α → β and γβ → δ holds, then αγ → δ holds
Rules a1-a3 are Armstrong’s axioms; a4-a6 are derived from them for convenience. The axioms are:
• Sound: they generate only FD’s that actually hold
• Complete: they generate all FD’s that hold
• Non redundant: if we take out one of the rules, we cannot generate all FD’s that hold
anymore
Learn the soundness proofs for Armstrong’s Rules
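In practice, membership of an FD in F+ is usually tested with the attribute closure α+ instead of enumerating F+. This is a related technique, not the rule-by-rule derivation described above. A minimal Python sketch, assuming FDs are given as (lhs, rhs) pairs of attribute strings; the function names and the sample set F are illustrative.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) attribute collections."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the left side is already derivable, the right side is too.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def implies(fds, lhs, rhs):
    """lhs -> rhs is in F+ exactly when rhs is contained in the closure of lhs."""
    return set(rhs) <= closure(lhs, fds)

def is_superkey(attrs, schema, fds):
    """attrs is a super key of schema when its closure covers the whole schema."""
    return set(schema) <= closure(attrs, fds)

# Illustrative FD set over R(A, B, C, G, H, I); each character is one attribute.
F = [("A", "B"), ("A", "C"), ("CG", "H"), ("CG", "I"), ("B", "H")]
```

For example, `implies(F, "A", "H")` is true by transitivity (A → B, B → H), and AG is a super key of R while A alone is not.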
Week 4b
Advanced SQL
Aggregation
• avg: average value
• min: minimum value
• max: maximum value
• sum: sum of values
• count: number of values
• group by
• having
• order by asc (default) or desc. limit can be used to output the first .. results
• is null: to check for null values
Attributes in the select clause that appear outside aggregate functions must appear in the group by list
Three Valued Logic: True, False, Unknown
The result of a where clause predicate is treated as false if it evaluates to unknown
All aggregate functions except count(*) ignore null values, so they are computed over the known values only
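The effect of null values on aggregates and on the where clause can be checked with Python's built-in sqlite3 module. A small sketch with a hypothetical instructor table; the names and salaries are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table instructor(name text, dept text, salary integer);
    insert into instructor values
        ('Ann', 'Math', 100), ('Bob', 'Math', null),
        ('Cas', 'CS', 60), ('Dee', 'CS', 80);
""")

# count(*) counts rows; count(salary) and avg(salary) ignore the null salary.
count_all, count_salary, avg_salary = con.execute(
    "select count(*), count(salary), avg(salary) from instructor").fetchone()

# For Bob both comparisons evaluate to unknown, so the row is filtered out.
known_rows = con.execute(
    "select name from instructor where salary > 70 or salary <= 70 order by name").fetchall()

# is null is the way to test for null values.
null_rows = con.execute("select name from instructor where salary is null").fetchall()

# group by, having, order by and limit combined.
top_dept = con.execute("""
    select dept, avg(salary) from instructor
    group by dept having count(*) >= 2
    order by dept desc limit 1
""").fetchall()
```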
Week 5a
Canonical cover
Canonical cover (Fc): a minimal set of FD’s equivalent to F, without any redundancies: no FD contains an extraneous attribute and each left side occurs only once
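One way to compute a canonical cover is to alternate the union rule with the removal of extraneous attributes until nothing changes. The following Python sketch assumes FDs are (lhs, rhs) pairs of attribute strings; the function names are hypothetical. A canonical cover is not unique in general, so another order of removals may give a different but equivalent result.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) attribute sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def reduce_once(fc):
    """Remove one extraneous attribute; return the new FD list, or None if there is none."""
    for i, (lhs, rhs) in enumerate(fc):
        # a in lhs is extraneous if (lhs - a) -> rhs already follows from fc.
        for a in sorted(lhs):
            if len(lhs) > 1 and rhs <= closure(lhs - {a}, fc):
                return fc[:i] + [(lhs - {a}, rhs)] + fc[i + 1:]
        # a in rhs is extraneous if lhs -> a still follows after removing a from this FD.
        for a in sorted(rhs):
            rest = fc[:i] + [(lhs, rhs - {a})] + fc[i + 1:]
            if a in closure(lhs, rest):
                return rest
    return None

def canonical_cover(fds):
    fc = [(frozenset(l), frozenset(r)) for l, r in fds]
    while True:
        # Union rule: merge FDs with the same left side, drop empty right sides.
        merged = {}
        for lhs, rhs in fc:
            merged[lhs] = merged.get(lhs, frozenset()) | rhs
        fc = [(l, r) for l, r in merged.items() if r]
        reduced = reduce_once(fc)
        if reduced is None:
            return set(fc)
        fc = reduced

# Illustrative FD set over R(A, B, C).
Fc = canonical_cover([("A", "BC"), ("B", "C"), ("A", "B"), ("AB", "C")])
```

For this input the result is {A → B, B → C}.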
Week 5b
Derived relation: a subquery used in the from clause
With clause: a way of defining a temporary view that is available only to the query in which it occurs
View: provides a mechanism to hide certain data from the view of certain users
Updates of a view: insert into view values (’new value’, ’new value’)
Most SQL implementations allow updates only on simple views (without aggregates) defined
on a single relation
Trigger: a statement that the DBMS executes automatically as a side effect of a modification
of the database
ECA rules: Event-condition-action rules
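Views, the with clause and triggers can be illustrated with Python's built-in sqlite3 module. This is a sketch with hypothetical table, view and trigger names. SQLite is used only for convenience; as far as I know its views are read-only unless an instead-of trigger is defined, so the view update shown above is not demonstrated here.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table account(id integer primary key, owner text, balance integer);
    create table audit(account_id integer, old_balance integer, new_balance integer);

    -- View that hides the balance column from its users.
    create view account_public as select id, owner from account;

    -- ECA rule: event = update of balance, condition = the balance decreased,
    -- action = record the change in the audit table.
    create trigger log_withdrawal after update of balance on account
    when new.balance < old.balance
    begin
        insert into audit values (old.id, old.balance, new.balance);
    end;

    insert into account values (1, 'Ann', 100);
    update account set balance = 150 where id = 1;
    update account set balance = 120 where id = 1;
""")

public_rows = con.execute("select * from account_public").fetchall()
audit_rows = con.execute("select * from audit").fetchall()

# with clause: a temporary view that exists only for this one query.
rich = con.execute("""
    with rich(owner) as (select owner from account where balance > 100)
    select owner from rich
""").fetchall()
```

Only the second update decreases the balance, so the trigger fires once.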
Week 6a
BCNF & 3NF
A relational schema R is in BCNF if for each FD α → β in F+, at least one of the following holds:
• α → β is trivial
• α is a super key for R
Decompositions into BCNF are not always dependency preserving → verifying a constraint that spans several relations then requires joins
Dependency preservation: a decomposition is dependency preserving when every FD in F+ can be
checked on a single relation, so that no natural joins are needed to verify it
BCNF Decomposition Algorithm
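The algorithm itself is not reproduced in this summary. The following Python sketch shows the usual textbook idea under stated assumptions: while some schema Ri has a non-trivial α → β with α not a super key of Ri, replace Ri by α ∪ β and Ri − (β − α). The function names are hypothetical, and the violation search tries all attribute subsets, so it is only suitable for small schemas.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) attribute collections."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def find_violation(schema, fds):
    """Return (alpha, alpha+ within schema) for a BCNF violation on schema, or None."""
    attrs = sorted(schema)
    for size in range(1, len(attrs)):
        for combo in combinations(attrs, size):
            alpha = frozenset(combo)
            reach = frozenset(closure(alpha, fds)) & schema
            # alpha determines something new, but not the whole schema.
            if alpha < reach < schema:
                return alpha, reach
    return None

def bcnf_decompose(schema, fds):
    todo, done = [frozenset(schema)], []
    while todo:
        r = todo.pop()
        violation = find_violation(r, fds)
        if violation is None:
            done.append(r)
        else:
            alpha, reach = violation
            todo.append(reach)                # alpha together with beta
            todo.append(r - (reach - alpha))  # R minus (beta minus alpha)
    return set(done)

# Illustrative schema R(J, K, L) with F = {JK -> L, L -> K}.
result = bcnf_decompose("JKL", [("JK", "L"), ("L", "K")])
```

The result is {K, L} and {J, L}. The FD JK → L can no longer be checked on a single relation, which illustrates the loss of dependency preservation.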
A relational schema R is in 3NF if for each FD α → β in F+, at least one of the following holds:
• α → β is trivial
• α is a super key for R
• Each attribute B in β − α is contained in a candidate key for R
3NF Decomposition Algorithm
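The algorithm itself is not reproduced in this summary. A minimal Python sketch of the usual synthesis approach, assuming the input `fc` is already a canonical cover: create one schema αβ per FD, add a schema with a candidate key if no schema is a super key of R, and drop schemas contained in another schema. The function names are hypothetical.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) attribute collections."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def candidate_key(schema, fds):
    """Shrink the full attribute set to one candidate key (there may be others)."""
    key = set(schema)
    for a in sorted(schema):
        if set(schema) <= closure(key - {a}, fds):
            key.discard(a)
    return frozenset(key)

def synthesize_3nf(schema, fc):
    """3NF synthesis; fc is assumed to be a canonical cover of the FDs on schema."""
    parts = [frozenset(lhs) | frozenset(rhs) for lhs, rhs in fc]
    # A schema contains a candidate key exactly when it is a super key of R.
    if not any(set(schema) <= closure(p, fc) for p in parts):
        parts.append(candidate_key(schema, fc))
    # Drop schemas that are strictly contained in another schema.
    return {p for p in parts if not any(p < q for q in parts)}

# Illustrative schema R(A, B, C, D) with Fc = {A -> B, C -> D}.
result = synthesize_3nf("ABCD", [("A", "B"), ("C", "D")])
```

Here neither {A, B} nor {C, D} contains a candidate key, so the extra schema {A, C} is added.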
Summary BCNF & 3NF
Property                                             BCNF   3NF
No redundancy (by definition)                        Yes    No
Dependency preserving (by decomposition algorithm)   No     Yes
Losslessness (by decomposition algorithm)            Yes    Yes
Supporting mechanism                                 F+     Fc
Goal for relational database design: BCNF and Lossless and Dependency Preservation. If we
cannot achieve this, we accept either a lack of Dependency Preservation or Redundancy due to
use of 3NF
Week 6b
Datalog
Declarative, introduces recursion and is widely influential
Safety issues: unsafe rules can generate an infinite number of answers
• Use the names of the variables in the same positions of the attributes (positional notation)
• all variables in the negated predicate have to also appear in the non-negated predicate
• Every variable that appears in the head should also appear in a non-arithmetic positive
literal in the body
Ground instantiation: the result of replacing each variable in the rule by some constant
Relational Operations:
Recursion
Suppose we are given a relation manager(X, Y ) with X the direct employee of Y.
Find all direct and indirect employees of Jones:
empl(X, Y) :- manager(X, Y)
empl(X, Y) :- manager(X, Z), empl(Z, Y)
? empl(X, "Jones")
Transitive closure queries:
TC(X, Y) :- edge(X, Y)
TC(X, Y) :- edge(X, Z), TC(Z, Y)
? TC(X, Y)
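The two transitive closure rules can be evaluated bottom-up: start from the edge facts and keep applying the recursive rule until no new facts are derived. A minimal Python sketch of this naive fixpoint evaluation; the function name and the sample facts (other than Jones) are hypothetical.

```python
def transitive_closure(edges):
    """Naive bottom-up evaluation of the two TC rules until a fixpoint is reached."""
    tc = set(edges)  # TC(X, Y) :- edge(X, Y)
    while True:
        # TC(X, Y) :- edge(X, Z), TC(Z, Y)
        derived = {(x, y) for (x, z) in edges for (z2, y) in tc if z == z2}
        if derived <= tc:
            return tc
        tc |= derived

edges = {("a", "b"), ("b", "c"), ("c", "d")}
tc = transitive_closure(edges)

# The empl program has the same shape, with manager(X, Y): X is a direct employee of Y.
manager = {("Smith", "Jones"), ("Lee", "Smith")}
under_jones = {x for (x, y) in transitive_closure(manager) if y == "Jones"}
```

Because the rules contain no negation, every round only adds facts, so the loop terminates on a finite database.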
Recursion in SQL:
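The SQL example is not reproduced in this summary. As a sketch, the empl program can be written as a recursive with clause and run with Python's built-in sqlite3 module; the column names and all rows except the name Jones are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table manager(employee text, boss text);
    insert into manager values
        ('Smith', 'Jones'), ('Lee', 'Smith'), ('Park', 'Lee'), ('Kim', 'Moore');
""")

rows = con.execute("""
    with recursive empl(employee, boss) as (
        -- base case: empl(X, Y) :- manager(X, Y)
        select employee, boss from manager
        union
        -- recursive case: empl(X, Y) :- manager(X, Z), empl(Z, Y)
        select m.employee, e.boss
        from manager m join empl e on m.boss = e.employee
    )
    select employee from empl where boss = 'Jones' order by employee
""").fetchall()
```

Using union instead of union all removes duplicate rows, which also guarantees termination if the manager relation contains a cycle.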
Recursion without negation is monotonic: adding facts never removes answers. Recursion with negation may be non-monotonic
Extensional DB (EDB) relations: stored in the database and appear only in rule bodies
Intensional DB (IDB) relations: defined by rules; appear in the head of rules and may also appear in rule bodies.
We only permit stratified Datalog Programs
Week 7a
Expressive Power of Query Languages
The query languages RA, basic SQL and non-recursive Datalog are equivalent in expressive power
to first-order logic
Proof: translation from RA to basic SQL to basic (non-recursive) Datalog.
Inexpressible queries: Middle element of a chain, transitive closure, cycles, find the middle
house
Locality Rank: How far you can look
Gaifman Local: if there exists a locality rank r such that the query cannot distinguish between
r-equivalent neighbourhoods
SQL without recursion, with aggregation and arithmetic, Datalog without recursion and Relational
Algebra are Gaifman Local
Week 7b
Probable exam questions
Relational Algebra and ER diagram:
• Division assumption: A and B have to be independent of each other. Example of dependence: list
all customers who made a purchase in all stores in the cities the customers live in (the divisor
depends on the customer table)
• Translate English query to ER diagram and then to a relational schema
Functional dependencies:
• Prove soundness, completeness and non-redundancy of Armstrong’s rules
• Find candidate keys given a list of FD’s
• A schema with only two attributes is automatically in BCNF (Note: first look at the
violating dependencies in F, not F+, and use these dependencies to decompose)
• Check whether a schema is in BCNF and 3NF
• Compute closure of F
• Compute Canonical cover of F
• Check for losslessness
• Check for Dependency Preservation
• 3NF decomposition and BCNF decomposition algorithm
• 3NF decomposition: if none of the relations contains a candidate key, you need to add an
extra relational schema which includes a candidate key
Queries:
• You get some queries in English and you can choose each language once.
• Translate queries to English
• Translate queries to a given language
• Datalog: non-linear recursion
Gaifman Locality: Strategy: If yes, answer it by formulating an FO logic query.
If no, prove that Q is not GL and answer the query with recursive SQL/Datalog