
BIG DATA MANAGEMENT

Chapter 1: information systems and databases
Database
Logically organised collection of electronically stored data that can be directly searched and viewed.
They involve administration:
• technical: computers, database system, storage.
• administrative: setup, maintenance.
and data:
• gathering, maintaining and utilising.
• nontechnical operators: warehouse employees, bank tellers and top managers.
There is a trade-off between data redundancy and consistency conditions.
Information systems and databases VS. Material goods
Handling:
• Representation
• Processing
• Combination
• Age
• Original
• Vagueness
• Medium
Economic and legal evaluation:
• Loss/gain in value with usage.
• Production costs.
• Property rights and ownership of goods.
Data management as a task
It is necessary to see data management as a task for the executive level. Once data is viewed as a factor of production, it has to be planned, governed, monitored, and controlled.
Information system
Enables users to store and connect information interactively, to ask questions and to obtain answers. Any information system of a
certain size uses database technologies to avoid the necessity to redevelop data management and analysis every time it is used.
DATABASE MANAGEMENT SYSTEMS
Software for application-independently describing, storing and querying data. They contain:
• Storage component: data + metadata
• Management component: query and data manipulation language.
RELATIONAL MODEL
- SQL (Structured Query Language) databases
- Proprietary data
- Stability
- Rigidity
NON-RELATIONAL MODEL
- NoSQL databases
- Real-time Web-based services
- Heterogeneous datasets
- Flexibility
Relational model
Relational models collect and present data in a table, a set of tuples presented in tabular form and with the following requirements:
• Table name: A table has a unique table name.
• Attribute name: All attribute names are unique within a table and label one specific column with the required property. The column headers are attribute names, and attributes assign a specific data value as a property to each entry in the table. Attributes can be natural or artificial.
• No column order: The number of attributes is not set, and the order of the columns within the table does not matter.
• No row order: The number of tuples is not set, and the order of the rows within the table does not matter. A row is a tuple or
record that contains a manifestation (instance) of the table.
• Identification key or primary key: an attribute or combination of attributes that is unique and minimal. Application-independent and without semantics.
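These requirements can be sketched with SQL (here via Python's sqlite3); the EMPLOYEE table and its data are illustrative assumptions:

```python
import sqlite3

# Hypothetical EMPLOYEE table illustrating the requirements above:
# unique table name, unique attribute names, and a primary key.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE EMPLOYEE (
        EmpNo  TEXT PRIMARY KEY,   -- application-independent identification key
        Name   TEXT,
        City   TEXT
    )
""")
conn.execute("INSERT INTO EMPLOYEE VALUES ('E19', 'Stewart', 'Stow')")
conn.execute("INSERT INTO EMPLOYEE VALUES ('E4',  'Bell',    'Kent')")

# No row order: a table is a set of tuples, so we compare as a set.
rows = set(conn.execute("SELECT EmpNo, Name, City FROM EMPLOYEE"))
print(rows)
```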
Descriptive vs. Programming language
Descriptive language
Users get the desired results by merely setting the requested
properties in the SELECT statement.
The database management system
• takes computing tasks,
• processes the query (or manipulation) with its own search and
access methods, and
• generates the results table.
SQL requires only the specification of the desired selection
conditions in the WHERE clause.
Programming language
Programmers must provide the procedure for computing the
required records for the users.
The methods for retrieving the requested information must be
implemented by the programmer.
In this case, each query yields only one record, not a set of
tuples.
Procedural languages require the user to specify an algorithm
for finding the individual records.
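The contrast can be illustrated in Python with sqlite3: the same query expressed once declaratively (the DBMS finds the records) and once procedurally (the programmer iterates record by record). Table and data are assumptions for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (EmpNo TEXT PRIMARY KEY, Name TEXT, City TEXT)")
conn.executemany("INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
                 [("E19", "Stewart", "Stow"), ("E4", "Bell", "Kent"),
                  ("E1", "Murphy", "Kent")])

# Descriptive: state WHAT is wanted; the DBMS chooses the access path.
declarative = conn.execute(
    "SELECT Name FROM EMPLOYEE WHERE City = 'Kent'").fetchall()

# Procedural: spell out HOW to find each record, one at a time.
procedural = []
for emp_no, name, city in conn.execute("SELECT * FROM EMPLOYEE"):
    if city == "Kent":
        procedural.append((name,))

print(sorted(declarative) == sorted(procedural))
```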
Relational database management system RDBMS
Have the following properties:
• Model: all data and data relations are represented in tables. Dependencies between attribute values of tuples or multiple instances
of data can be discovered (normal forms).
• Schema: The definitions of tables and attributes are stored in the relational database schema. The schema further contains the
definition of the identification keys and rules for integrity assurance.
• Language: The database system includes SQL for data definition, selection, and manipulation. The language component is
descriptive and facilitates analyses and programming tasks for users.
• Data security and data protection: The database management system provides mechanisms to protect data from destruction,
loss, or unauthorized access.
• Architecture: The system ensures extensive data independence, i.e., data and applications are mostly segregated. This
independence is reached by separating the actual storage component from the user side using the management component. Ideally,
physical changes to relational databases are possible without the need to adjust related applications.
• Multi-user operation: The system supports multi-user operation, i.e., several users can query or manipulate the same database at
the same time. The RDBMS ensures that parallel transactions in one database do not interfere with each other or, worse, with the
correctness of data.
• Consistency assurance: The database management system provides tools for ensuring data integrity, i.e., the correct and
uncompromised storage of data.
NoSQL systems meet these criteria only partially —> use relational database technology augmented with NoSQL technology.
Big data
Large volumes of data, usually unstructured (text, graphics, images, audio, video).
Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
The main characteristics are the three Vs:
• Volume (extensive amounts of data): from megabytes to zettabytes.
• Variety (multiple formats: structured, semi-structured, and unstructured data).
• Velocity (high-speed and real-time processing).
Two further Vs are often added:
• Value: big data is meant to increase enterprise value.
• Veracity: data may be inaccurate or vague; specific algorithms are used to evaluate its validity and quality.
NoSQL database systems
Web-based storage systems are considered NoSQL DB systems if they meet the following requirements:
• Model: The underlying database model is not relational.
• At least three Vs: The database system includes a large amount of data (volume), flexible data structures (variety), and real-time
processing (velocity).
• Schema: The database management system is not bound by a fixed database schema. When the schema is free, the structures of
individual records (or documents) can vary.
• Architecture: The database architecture supports massively distributed web applications and horizontal scaling.
• Replication: The database management system supports data replication.
• Consistency assurance: Consistency may be ensured with a delay to prioritize high availability and partition tolerance.
Data distribution
A distributed database can be configured to store the same data in multiple nodes across a network of locations. If a single node fails, the data is still available. You don’t have to wait for the database to be restored. A geo-distributed database maintains concurrent
nodes across geographical regions for resilience in case of a regional power or communications outage. The ability to store a single
database across multiple computers requires an algorithm for replicating data that is transparent to the users.
DISTRIBUTED DATABASES: A logical interrelated collection of shared data (and metadata), physically distributed over a computer
network
DISTRIBUTED DBMS: Software system that permits the management of the distributed database and makes the distribution
transparent to users.
Consistency
Data consistency is the process of keeping information uniform as it moves across a network and between various applications on a
computer. Data consistency helps ensure that information on a crashing computer can be restored to its pre-crash state.
Strong consistency means that the database management system ensures full consistency at all times.
Systems with weak consistency tolerate that changes will be copied to replicated nodes with a delay, resulting in temporary
inconsistencies.
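As a rough illustration of weak consistency, this toy sketch (all names are assumptions, not a real replication protocol) shows a replica that lags behind the primary until pending changes are applied:

```python
# Toy sketch of weak (eventual) consistency: a write reaches the replica
# only when the pending change log is applied, so reads from the replica
# can be temporarily stale.

class ReplicatedStore:
    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = []            # changes not yet copied to the replica

    def write(self, key, value):
        self.primary[key] = value    # primary is updated immediately
        self.pending.append((key, value))

    def sync(self):                  # replication happens with a delay
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()

store = ReplicatedStore()
store.write("balance", 100)
stale = store.replica.get("balance")    # None: replica not yet synced
store.sync()
fresh = store.replica.get("balance")    # 100: consistent after the delay
print(stale, fresh)
```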
NoSQL databases
NoSQL databases are geared toward managing large sets of varied and frequently updated data, often in distributed systems or the
cloud. They avoid the rigid schemas associated with relational databases. But the architectures themselves vary and are separated
into four primary classifications, although types are blending over time.
KEY VALUE DATABASES
Simple data model that pairs key attribute and its associated value in storing data elements.
WIDE-COLUMN DATABASES
Table-style databases. They store data across tables that can have a very large number of columns. Ex. Internet search.
DOCUMENT DATABASES
Use the JSON format for document-like structures.
GRAPH DATABASES
Store related nodes in graphs to accelerate querying.
Property graph model: nodes, directed edges, labels for nodes and edges and properties.
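A minimal in-memory sketch of the property graph model (labels, directed edges, and properties; all data is an illustrative assumption):

```python
# Nodes and directed edges, each carrying a label and a dict of properties.
nodes = {
    "e7": {"label": "EMPLOYEE", "props": {"Name": "Stewart"}},
    "p1": {"label": "PROJECT",  "props": {"Title": "WebShop"}},
}

# Directed, attributed edge: (source, target, label, properties).
edges = [
    ("e7", "p1", "IS_INVOLVED", {"Percentage": 50}),
]

# Querying means traversing nodes and edges rather than joining tables.
involved = [nodes[src]["props"]["Name"]
            for src, dst, label, props in edges
            if label == "IS_INVOLVED" and dst == "p1"]
print(involved)
```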
Chapter 2: Data modelling
Data models provide a structure and a formal description of the data and its relationships, without considering the kind of database
management system to be used for entering, storing and maintaining data.
Database setup
1. DATA ANALYSIS
To find the data required for the information system and their relationships to each other.
It contains at least a verbal task description with objectives and a list of relevant pieces of information.
The written description of data connections can be complemented by graphical illustrations or a summarising example.
2. DESIGNING A CONCEPTUAL DATA MODEL (E/R MODEL)
Identification of entity and relationship sets is not a simple task.
This design step requires experience and practice from the data architect.
E/R model contains both the entity sets (rectangles) and relationship sets (rhombi).
3. CONVERTING IT INTO A RELATIONAL OR NON-RELATIONAL DATABASE SCHEMA
Is the formal description of the database objects using either tables (relational db —> entity and relationship sets must be
expressed in tables, 1 table/relationship set) or nodes and edges (graph-oriented schema —> entity sets = nodes, relationship sets =
edges).
Entity-relationship model (E/R)
ENTITY
An entity is a specific object in the real world or our imagination that is distinct from all others (characterised by its attributes).
Entities of the same type are combined into entity sets and are further characterised by attributes. The attributes are properties
characterising an entity (set). In each entity set there is an identification key.
RELATIONSHIP
The relationships between entities are of interest and can form sets of their own. Like entity sets, relationship sets can be
characterised by attributes. A copy of another table entity key is named foreign key. Relationships can be understood as
associations in two directions.
ASSOCIATION
Each direction of a relationship is an association. We distinguish 4 types of association:
• Type 1: exactly one —>[1:1]: each entity from the entity set ES1 is assigned exactly one entity from the entity set ES2. Each
employee is subordinate to exactly one department.
• Type c (or 0): none or one —> [0:1]: each entity from the entity set ES1 can be assigned at most one entity from set ES2. Not
every employee can have the role of a department head.
• Type m: one or multiple —> [1:m]: each entity from the entity set ES1 is assigned one or more entities from ES2.
• Type mc (or m0): none, one, or multiple —> [0:m]: each entity from the entity set ES1 can be assigned any number (none, one, or multiple) of entities from ES2.
Mapping from E/R model to Relational database schema
The relational model is based on a set of formal rules collected in normal forms that are used to discover and study dependencies
within tables in order to avoid redundant information and resulting anomalies.
Achieved by dividing database in tables and defining relationships.
ATTRIBUTE REDUNDANCY
An attribute in a table is redundant if individual values of this attribute can be omitted without a loss of information.
For every employee of department D6 the department name Accounting is repeated
• This repetition occurs for ALL employees of ALL departments
• We can say that the attribute DepartmentName is redundant, since the same value is listed in the table multiple times
• It would be preferable to store the name going with each department number in a separate table for future reference instead of
redundantly carrying it along for each employee.
DATABASE ANOMALIES BY REDUNDANCIES
Normalization is the process of efficiently organizing data in a database.
Taking all data in a single table we have:
1. Insertion anomaly. There is no way of adding a new department without employees in it. No new table rows can be inserted
without a unique employee number.
2. Deletion anomaly. If we delete all employees from the table, we also lose the department numbers and names.
3. Update anomalies. Changing the name of department D3 from IT to Data Processing, each employee of that department would
have to be edited individually, meaning that although only one detail is changed, the entire table must be adjusted in multiple places.
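The split that removes these anomalies can be sketched with sqlite3 (table and attribute names are assumptions based on the example above):

```python
import sqlite3

# Removing the redundancy described above: DepartmentName moves to its
# own DEPARTMENT table; EMPLOYEE keeps only the department number.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DEPARTMENT (DeptNo TEXT PRIMARY KEY, DeptName TEXT);
    CREATE TABLE EMPLOYEE  (EmpNo TEXT PRIMARY KEY, Name TEXT, DeptNo TEXT);
    INSERT INTO DEPARTMENT VALUES ('D3', 'IT'), ('D6', 'Accounting');
    INSERT INTO EMPLOYEE  VALUES ('E19', 'Stewart', 'D6'), ('E4', 'Bell', 'D6');
""")

# Update anomaly gone: renaming department D3 touches exactly one row.
conn.execute("UPDATE DEPARTMENT SET DeptName = 'Data Processing' WHERE DeptNo = 'D3'")

# Insertion anomaly gone: a department without employees can exist.
conn.execute("INSERT INTO DEPARTMENT VALUES ('D7', 'Marketing')")

count = conn.execute("SELECT COUNT(*) FROM DEPARTMENT").fetchone()[0]
print(count)
```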
AVOID REDUNDANCIES AND ANOMALIES
INTRODUCTORY CONCEPTS
• Determinant: is an attribute that determines the values assigned to other attributes in the same row. By this definition, any
primary key or candidate key is a determinant. However, there may be determinants that aren't primary or candidate keys.
• Key:
- Keys: attributes that are used to access tuples from a table.
- Super key: attribute or set of attributes which uniquely identify the tuples in table
- Candidate key: is a minimal super key.
- Concatenated (candidate) key: candidate key composed of more than one attribute. Also named composite candidate key.
- Overlapping concatenated keys: composite keys with at least one attribute in common.
Primary key: the (chosen) candidate key, usually with the smallest number of attributes. There is one and only one primary key in any relation (table). No attribute of the primary key can contain a NULL value. A primary key is a candidate key.
Foreign key: an attribute that can only take the values which are present as values of some other attribute.
• Functional dependency: B is functionally dependent on A if for each value of A there is exactly one value of B. A functional dependency of B on A therefore requires that each value of A uniquely identifies one value of B; A is a determinant of B. For an identification (or primary) key K and an attribute B in one table, there is a functional dependency K -> B (K uniquely identifies B).
• Full functional dependency: B is fully functionally dependent on a concatenated key (K1, K2) if B is functionally dependent on the entire key, but not on its parts (K1 alone or K2 alone cannot uniquely identify B).
• Transitive dependency: C is transitively dependent on A if B is functionally dependent on A and C is functionally dependent on B.
• Multivalued dependency: C is multivaluedly dependent on A if any combination of a specific value of A with an arbitrary value of B
results in an identical set of values of C.
• Lossless join dependency: a table R with attributes A, B, C has join dependency if the projected subtables R1(A,B) and R2(B,C)
when joined via the shared attribute B result in the original table R. Splitting a table can be done with a project operator and table
reconstruction can be done with a join operator.
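The lossless join dependency can be checked directly in Python: project R(A, B, C) onto R1(A, B) and R2(B, C), join back on B, and compare with the original (the sample tuples are assumptions):

```python
# Project R(A, B, C) onto R1(A, B) and R2(B, C), then join back on B.
R = {("a1", "b1", "c1"), ("a2", "b2", "c2")}

R1 = {(a, b) for a, b, c in R}          # projection on (A, B)
R2 = {(b, c) for a, b, c in R}          # projection on (B, C)

# Natural join of R1 and R2 via the shared attribute B.
joined = {(a, b, c) for a, b in R1 for b2, c in R2 if b == b2}

print(joined == R)  # lossless here, because B determines C
```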
NORMAL FORMS
Understanding the normal forms helps to make sense of the mapping rules from an entity-relationship model to a relational model.
Mapping onto a relational database schema with a properly defined entity-relationship model and a consistent application of the
relevant mapping rules, means that the normal forms will always be met.
Creating an entity-relationship model and using mapping rules we can mostly skip checking the normal forms for each individual
design step. Not all normal forms are equally relevant. Usually only the first three normal forms are used.
• 1NF: All attribute values are atomic. Basis for all the other normal forms. Using the first normal form leaves us with a table full
of redundancies.
• 2NF: Non-key attribute are fully dependent on the key.
If a table with a concatenated key is not in 2NF, it has to be split into subtables:
1. the attributes that are dependent on a part of the key are transferred to a separate table along with that key part,
2. the concatenated key and potential other relationship attributes remain in the original table.
• 3NF: No transitive dependencies. The transitive dependency can be removed by splitting off the redundant attribute and putting
it in a separate table with “its key” attribute (D#). The dependent attribute (D#) also stays in the remaining table as a foreign key
• BCNF: (Boyce-Codd) Only dependencies on key are permitted. Used when there are multiple overlapping candidate keys in one
table. Such tables, even when they are in 3NF, may conflict with BCNF. In this case, the table has to be split due to the candidate
keys.
• 4NF: No multivalued dependencies. One attribute is the determinant of another one which can have many values. Tables
containing only multivalued dependencies attributes don’t have anomalies. Tables containing multivalued dependencies together with
other attributes can cause update anomalies.
• 5NF: Only trivial join dependency. A table is in the 5NF if it can be arbitrarily split by project operators and then reconstructed
into the original table with join operators (lossless join dependency). —> use foreign key
A table or an entire database schema in a given normal form must also meet all requirements of the previous normal forms.
MAPPING FROM E/R MODEL TO RELATIONAL DATABASE SCHEMA
We have seen how to normalise a relational database. Now we see how entity and relationship sets can be represented in tables.
Relational database schema: contains definitions of the tables, attributes and primary keys.
Integrity constraints: set limits for the domains and the dependencies between tables.
There are two rules of major importance in mapping an entity-relationship model onto a relational database schema:
• Rule 1 (entity sets): each entity set has to be defined as a separate table with a unique primary key. The primary key can be
either the key of the respective entity set or one selected candidate key. The remaining attributes of the entity set are
converted into corresponding attributes within the table. By definition, a table requires a unique primary key. It is possible that
there are multiple candidate keys in a table, all of which meet the requirement of uniqueness and minimality. In such cases, it is up
to the data architects which candidate key they would like to use as the primary key.
• Rule 2 (relationship sets): each relationship set can be defined as a separate table. The identification keys of the corresponding
entity sets must be included in this table as foreign keys. The primary key of the relationship set table can be a concatenated key
made from the foreign keys or another candidate key, e.g., an artificial key. Other attributes of the relationship set are listed in
the table as further attributes.
Foreign key for a relationship: Foreign key is an attribute within a table that is used as an identification key in at least one other
table (possibly also within this one). Identification keys (or primary keys) can be reused in other tables to create the desired
relationships between tables. In other words: a relationship set between two tables is translated in a relational database by a
foreign key in a table “linked” to the primary key of the other table
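Rules R1 and R2 can be sketched in SQL via sqlite3; the EMPLOYEE, PROJECT, and INVOLVED tables and their attributes are illustrative assumptions:

```python
import sqlite3

# Entity sets become tables (R1); the m-m relationship set INVOLVED
# becomes its own table with the two foreign keys forming a
# concatenated primary key (R2).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE EMPLOYEE (EmpNo TEXT PRIMARY KEY, Name TEXT);
    CREATE TABLE PROJECT  (ProjNo TEXT PRIMARY KEY, Title TEXT);
    CREATE TABLE INVOLVED (
        EmpNo  TEXT REFERENCES EMPLOYEE,
        ProjNo TEXT REFERENCES PROJECT,
        Percentage INTEGER,
        PRIMARY KEY (EmpNo, ProjNo)
    );
    INSERT INTO EMPLOYEE VALUES ('E19', 'Stewart');
    INSERT INTO PROJECT  VALUES ('P1', 'WebShop');
    INSERT INTO INVOLVED VALUES ('E19', 'P1', 50);
""")

# The foreign keys "link" the relationship table back to the entity tables.
row = conn.execute("""
    SELECT e.Name, p.Title FROM INVOLVED i
    JOIN EMPLOYEE e ON e.EmpNo = i.EmpNo
    JOIN PROJECT  p ON p.ProjNo = i.ProjNo
""").fetchone()
print(row)
```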
Other rules for relationship sets: the use of rules R1 and R2 alone does not necessarily result in an ideal relational database
schema as this approach may lead to a high number of individual tables.
Mapping rules for relationship sets are based on the cardinality of relationships to avoid an unnecessarily large number of tables
and expressly limits which relationship sets always and in any case require separate tables.
• Rule R3 (network-like relationship sets): every complex-complex (many-to-many or m-m) relationship set must be defined as a separate table which contains at least the identification keys of the associated entity sets as foreign keys. The primary key of a relationship set table is either a concatenated key from the foreign keys or another candidate key. Any further characteristics of the relationship set become attributes in the table.
• Rule R4 (hierarchical relationship sets): unique-complex (one-to-many or 1-m) relationship sets can be represented (without a separate relationship set table) directly by the tables of the associated entity sets —> 2 associations = 1 relationship. The unique association (type 1 or c) allows for the primary key of the referenced table to simply be included in the referencing table as a foreign key with an appropriate role name.
• Rule R5 (unique-unique relationship sets): unique-unique relationship sets can be represented without a separate table by the tables of the two associated entity sets (directly). Again, an identification key from the referenced table can be included in the referencing table along with a role name (foreign key).
Mapping rules for relationship hierarchy: based on the relationship hierarchy to consider some common characteristics in a
superordinate table (generalisation) to take in account some peculiarities (specialisation).
• Rule R6 (relationship of generalisation): generalisation is the combination of entities into a superordinate entity set. Once the
entities are defined, some of them can be “generalised” in a “super-entity”: [entity subsets] ---generalisation---> [new entity set].
• Rule R7 (relationship of aggregation): Since the relational model does not directly support the relationship structure of a
generalisation, the characteristics of such a relationship hierarchy have to be modelled indirectly. Each entity set of a
generalisation hierarchy requires a separate table. The primary key of the superordinate table becomes the primary key of all
subordinate tables as well. In other words, the identification keys of the specialisation must always match those of the
superordinate table.
Aggregation is the combination of entities into a superordinate relationship set. Once the entities are defined, some of them can be
“aggregated” defining a new relationship set: [ entity sets ] ---aggregation---> [ new relationship set ]
We distinguish:
- network-like aggregation: CORPORATION EXAMPLE - Each company may have multiple superordinate and/or subordinate
companies.
- hierarchical aggregation: ITEMS (PRODUCTS) EXAMPLE - Each item (product) may consist of multiple sub-items (components).
Each sub-item is dependent on exactly one superordinate item.
If the cardinality of a relationship in an aggregation is:
• Complex-complex: separate tables must be defined for both the entity set and the relationship set. The relationship set table
contains the identification key of the associated entity set table twice with corresponding role names to form a concatenated key.
• Unique-unique: (hierarchical structure), the entity set and the relationship set can be combined in a single table.
Integrity
Integrity (or consistency) of data means that stored data does not contradict itself.
STRUCTURAL INTEGRITY CONSTRAINTS
Rules to ensure integrity that can be represented within the database schema itself. For relational databases, they include the
following:
• Uniqueness constraint: Each table has an identification key (attribute or combination of attributes) that uniquely identifies each
tuple within the table. A consistent EMPLOYEE table requires that the names of employees, streets, and cities really exist and are
correctly assigned. If there are multiple candidate keys within one table, one of them has to be declared the primary key to fulfil
the uniqueness constraint. The uniqueness of the primary keys themselves is checked by the DBMS.
• Domain constraint: The attributes in a table can only take on values from a predefined domain. Defining a domain is not enough
when it comes to verifying city or street names; for instance, a “CHARACTER (20)” limit does not have any bearing on meaningful
street or city names. Often the domain constraint comes in the form of enumerated types: Profession ≡ {Programmer, Analyst,
Organizer}; YearOfBirth ≡ {1916...2021}
• Referential integrity constraint: Each value of a foreign key must actually exist as a key value in the referenced table.
Insertion integrity: cannot insert ( E20, Mahoney, Market Ave S, Canton, D7 ) into EMPLOYEE because D7 doesn’t exist in
DEPARTMENT
Restricted delete: deletion of tuple ( D6, Accounting ) from DEPARTMENT is denied
Cascade delete: deleting ( D6, Accounting ) tuple from DEPARTMENT, two EMPLOYEE tuples (E19 and E4) would be removed
Unknown setting: deleting ( D6, Accounting ) tuple from DEPARTMENT, the foreign key of tuples E19 and E4 will be assigned
to NULL
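A sketch of insertion integrity and the “unknown setting” delete option, using SQLite’s foreign key actions (ON DELETE SET NULL here; RESTRICT and CASCADE are declared analogously):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE DEPARTMENT (DeptNo TEXT PRIMARY KEY, DeptName TEXT);
    CREATE TABLE EMPLOYEE (
        EmpNo TEXT PRIMARY KEY, Name TEXT,
        DeptNo TEXT REFERENCES DEPARTMENT ON DELETE SET NULL
    );
    INSERT INTO DEPARTMENT VALUES ('D6', 'Accounting');
    INSERT INTO EMPLOYEE VALUES ('E19', 'Stewart', 'D6'), ('E4', 'Bell', 'D6');
""")

# Insertion integrity: D7 does not exist in DEPARTMENT, so this fails.
try:
    conn.execute("INSERT INTO EMPLOYEE VALUES ('E20', 'Mahoney', 'D7')")
except sqlite3.IntegrityError:
    pass  # rejected, as required by referential integrity

# "Unknown setting": deleting D6 sets the employees' foreign keys to NULL.
conn.execute("DELETE FROM DEPARTMENT WHERE DeptNo = 'D6'")
keys = [r[0] for r in conn.execute("SELECT DeptNo FROM EMPLOYEE")]
print(keys)
```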
Graph model
GRAPH THEORY
A graph is defined by the sets of its nodes (or vertices) and edges.
Properties of network structures:
• Undirected graph: G=(V,E) consists of a vertex set V and an edge set E, with each edge being assigned two potentially identical
vertices.
• Connected graph: A graph is connected if there are paths between any two vertices.
• Loop: is an edge that connects a vertex to itself. A simple graph contains no loops.
• Degree of a vertex: the number of edges originating from it. A graph G is Eulerian if it is connected and each vertex has an even degree. “A path traversing each edge of a graph exactly once can only exist if each vertex has an even degree” (Euler, 1736). If all vertices are of an even degree, there is at least one Eulerian cycle.
Fleury’s algorithm: How to find an Eulerian cycle?
1) Choose any node as the starting vertex.
2) Choose any (nonmarked) incidental edge and mark it (e.g., with sequential numbers or letters).
3) Take the end node as the new starting vertex.
4) Repeat from step (2).
There is, of course, more than one possible solution, and the path does not necessarily have to be a cycle.
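Because the simple marking procedure can run into dead ends without Fleury’s bridge check, this sketch uses Hierholzer’s algorithm, a standard alternative for finding an Eulerian cycle in a connected graph whose vertices all have even degree:

```python
from collections import defaultdict

def eulerian_cycle(edges):
    # Build an adjacency list for the undirected graph.
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    stack, cycle = [edges[0][0]], []
    while stack:
        v = stack[-1]
        if adj[v]:                       # unused edge left: walk it
            w = adj[v].pop()
            adj[w].remove(v)             # mark edge (v, w) as used
            stack.append(w)
        else:                            # dead end: add vertex to cycle
            cycle.append(stack.pop())
    return cycle

# A plain square: every vertex has degree 2, so the graph is Eulerian.
cycle = eulerian_cycle([(0, 1), (1, 2), (2, 3), (3, 0)])
print(cycle)
```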
Weighted graph
Are graphs whose vertices or edges have properties assigned to them. The weight of a graph is the sum of all weights within the
graph, i.e., all node or edge weights. This definition also applies to partial graphs, trees, or paths as subsets of a weighted graph.
• The shortest path: search for partial graphs with maximum or minimum weight, e.g., find the shortest path (the smallest total weight) between the stops v0 and v7. Given an undirected graph G = (V, E) with positive edge weights and an initial vertex vi, we look at the vertices vj neighboring vi and calculate the set Sk(v). We select the neighboring vertex vj closest to vi and add it to the set Sk(v).
• Dijkstra’s algorithm:
1. Calculate the sum of the respective edge weights for each neighboring vertex of the current node.
2. Select the neighboring vertex with the smallest sum.
3. If the sum of the edge weights for that node is smaller than the distance value (dist) stored for it, set the current node as its previous vertex (prev) and enter the new distance in Sk.
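A sketch of Dijkstra’s algorithm following the steps above, using a priority queue; the example graph is an illustrative assumption:

```python
import heapq

def dijkstra(graph, start):
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue                      # stale queue entry, skip
        for w, weight in graph.get(v, []):
            nd = d + weight               # sum of edge weights via v
            if nd < dist.get(w, float("inf")):
                dist[w] = nd              # shorter path found: update dist
                heapq.heappush(heap, (nd, w))
    return dist

graph = {
    "v0": [("v1", 2), ("v2", 5)],
    "v1": [("v2", 1), ("v3", 4)],
    "v2": [("v3", 1)],
}
print(dijkstra(graph, "v0"))
```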
Mapping from E/R model to Graph database schema
Graph databases are often founded on the model of directed weighted graphs. The objective is to convert entity and relationship
sets into nodes and edges of a graph. A graph database schema contains nodes and edges.
Compared to the relational model, the graph model allows for a broader variety of options for representing entity and relationship
sets: undirected graph, directed graph, relationship sets as edges, relationship sets as nodes, etc.
• Rule G1 (entity sets): Each entity set has to be defined as an individual vertex in the graph database. The attributes of each
entity set are made into properties of the respective vertex.
• Rule G2 (relationship sets): Each relationship set can be defined as an undirected edge within the graph database. The attributes
of each relationship set are assigned to the respective edge (attributed edges).
Relationship sets can also be represented as directed edges. The directed edge constellations are used to highlight one specific
association of a relationship using the direction of the corresponding edge.
• Rule G3 (network-like relationships): one employee can work on multiple projects (IS_INVOLVED) and each project must involve
multiple employees (INVOLVES). Alternatively, a double arrow could be drawn between the E and P vertices, with the name
INVOLVED and the attribute Percentage.
• Rule G4 (hierarchical relationships): a unique-complex relationship set can be defined as a directed edge between vertices in the direction from the root node to the leaf node and with the multiple association type (m or mc) noted at the arrowhead. In a one-to-many relationship, one directed edge is enough to represent the relationship set.
• Rule G5 (unique-unique relationships): Every unique-unique relationship set can be represented as a directed edge between the
respective vertices. The direction of the edge should be chosen so that the association type at the arrowhead is unique, if
possible. Each department must have one department head, and each employee can be one department head. It would also be possible to use the reverse direction from employees to departments as an alternative, where the edge would be IS_DEPARTMENT_HEAD and the association type ‘c’ would be noted at the arrowhead.
• Rule G6 (generalisation): the superordinate entity set of a generalisation becomes a super node; the entity subsets become normal vertices. The generalisation hierarchy is then complemented by specialisation edges (from the general to the particular).
• Rule G7 (aggregation): For network-like or hierarchical aggregation structures, entity sets are represented by nodes, and
relationship sets are represented by edges with the association type mc noted at the arrowhead. Entity set attributes are
attached to the nodes; relationship set properties are attached to the edges.
Structural integrity constraints
Structural integrity constraints are secured by the database management system. For graph databases, they include the following:
• Uniqueness constraint: Each vertex and each edge can be uniquely identified within the graph. Path expressions can be used to
navigate to individual edges or nodes.
• Domain constraint: The attributes of both vertices and edges belong to the specified data types, i.e., they come from well-defined
domains.
• Connectivity: A graph is connected if there is a path between any two vertices within the graph. The graph database ensures
connectivity for graphs and partial graphs.
• Tree structures: Special graphs, such as trees, can be managed by the graph database. It ensures that the tree structure is kept
intact in case of changes to nodes or edges.
• Duality: For a planar graph G, the dual of G denoted G∗ is a graph obtained by replacing all faces in G with vertices, and connecting
vertices with edges if those faces share an edge.
Planar graphs are graphs that can be drawn in the Euclidean plane without intersecting edges.
Duality: Given a planar graph G = (V, E), its dual graph G* = (V*, E*) is constructed by placing a vertex in the dual graph G* for each
area of the graph G, then connecting those vertices V* to get the edges E*. Two nodes in G* are connected by the same number of
edges, as the corresponding areas in G have shared edges. A graph database can translate a planar graph G into its dual graph G* and
vice versa.
Moving problems into dual space can make it easier to find solutions. If properties can be derived in dual space, they are also valid in the original space.
Chapter 3: database languages
Defining a data architecture, we have to follow:
• What data is to be gathered by the company itself and what will be obtained from external data suppliers?
• Who is in charge of distributed data?
• What are the relevant obligations regarding data protection and data security in international contexts?
• How can the stored data be classified and structured, taking the relevant national and international conditions for maintenance and servicing into account?
• Which rights and duties apply to data exchange and disclosure?
DATABASE LANGUAGE USERS
We can distinguish:
• The database administrator uses a database language to manage the data descriptions. He/she also sets permissions.
• The system administrator ensures that data descriptions are updated consistently and in adherence with the data architecture.
• The DB specialist defines, installs, and monitors databases. Some accesses are limited to diagnostic periods only.
• The application programmer uses database languages to run analyses on or apply changes to databases.
• The data analyst or data scientist is the end user of database languages for everyday analysis. He/she is an expert in the targeted interpretation of database contents on specific issues, usually with limited IT skills.
Relational algebra
Relational algebra is the formal framework for database languages working on tables, i.e., relational databases. It defines a number of algebraic operators that apply to relations. Most of these operators are not used directly in the languages, but to qualify as a "relationally complete language", a language has to be capable of replicating these operations.
Relational operators apply to one or more relations (tables) and always return another relation. This consistency allows multiple operators to be combined.
We distinguish:
SET OPERATORS
• union: Two relations are union-compatible if they meet both of the following criteria: both relations have the same number of attributes, and the data formats (domains) of the corresponding attributes are identical. Relations R and S are combined by a set union R ∪ S when all entries from R and all entries from S are entered into the resulting table. Identical records are automatically unified, since a distinction between tuples with identical attribute values is not possible in the resulting set R ∪ S.
• intersection: Relations R and S combined by a set intersection R ∩ S hold only those entries found in both R and S.
• difference: Relations R and S are combined by a set difference R \ S by removing from R all entries that also exist in S. The intersection operator can be expressed through the difference operator: R ∩ S = R \ (R \ S)
• cartesian product: The two relations do not need to be union-compatible. The Cartesian product R × S of two relations R and S is the set of all possible combinations of tuples from R with tuples from S. For two arbitrary relations R and S with m and n entries, respectively, the Cartesian product R × S has m times n tuples. Many of the resulting tuples can be meaningless.
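The set operators above can be sketched directly with Python sets of tuples (an illustrative sketch; the relations R and S and their contents are invented):

```python
# Relations as sets of tuples; R and S are union-compatible
# (same number of attributes, same domains).
R = {("E1", "Meier"), ("E2", "Huber")}
S = {("E2", "Huber"), ("E3", "Schmid")}

union = R | S          # all entries from R and S, duplicates unified
intersection = R & S   # entries found in both R and S
difference = R - S     # entries of R that do not exist in S

# Intersection expressed via the difference operator: R ∩ S = R \ (R \ S)
assert intersection == R - (R - S)

# Cartesian product: every combination of a tuple from R with a tuple from S.
product = {r + s for r in R for s in S}
print(len(product))  # m * n tuples: 2 * 2 = 4
```

Note how identical records are unified automatically: the shared tuple ("E2", "Huber") appears only once in the union, exactly as the set semantics of the relational model demand.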
RELATIONAL OPERATORS
• Projection: A projection πa(R) with the project operator π forms a subrelation of the relation R based on the attribute names
defined by a. Given a relation R(A,B,C,D), the expression πA,C(R) reduces R to the attributes A and C: the order of attribute
names in a projection matters! For example, R′ := πC,A(R) means a projection of R = (A,B,C,D) onto R′ = (C,A). Then πC,A(R) ≠
πA,C(R).
• Selection: The selection σP(R) with the select operator σ is an expression that extracts a selection of tuples from the relation R based on the property P. Property P is described by a condition (with operators such as <, >, =, AND, OR, NOT, etc.)
• Join: The join operator merges two relations into a single one. The join σP(R × S) of the two relations R and S by the property (predicate) P is a combination of all tuples from R with all tuples from S where each combined tuple meets the join predicate P. A join is thus a selection (σ) over a Cartesian product (×). If the join predicate P uses the comparison operator =, the result is called an equi-join. A join with a wrong or missing predicate may lead to wrong or unwanted results; in fact, a missing predicate returns all rows of the Cartesian product.
• Division: A division of the relation R by the relation S is only possible if S is contained within R as a subrelation. The division R ÷ S (also written R / S) returns all values of the remaining attributes of R that occur in combination with ALL tuples of S, where the attributes of S are the common attributes. For instance: R := which employees work on which projects; S := projects P2 and P4; then R′ := R ÷ S = the employees E1 and E4, who are involved in both projects.
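The relational operators, including the division example above, can be sketched in Python (illustrative only; the relation contents are the invented employee/project pairs from the example):

```python
# R: which employees work on which projects; S: projects P2 and P4.
R = {("E1", "P2"), ("E1", "P4"), ("E2", "P1"), ("E4", "P2"), ("E4", "P4")}
S = {("P2",), ("P4",)}

# Projection π_empID(R): reduce R to the employee attribute.
emp_ids = {(e,) for (e, p) in R}

# Selection σ_{project = 'P2'}(R): extract tuples meeting the predicate.
on_p2 = {(e, p) for (e, p) in R if p == "P2"}

# Equi-join of R and S over the project attribute: a selection of the
# Cartesian product where the join predicate (equality) holds.
joined = {(e, p) for (e, p) in R for (p2,) in S if p == p2}

# Division R ÷ S: employees that appear together with ALL projects in S.
projects = {p for (p,) in S}
quotient = {(e,) for (e,) in emp_ids
            if projects <= {p for (e2, p) in R if e2 == e}}
print(sorted(quotient))  # [('E1',), ('E4',)]
```

The quotient contains exactly E1 and E4, matching the worked example: only those two employees are involved in both P2 and P4.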
Relationally complete languages
COMPLETENESS CRITERION
Languages are relationally complete if they support all operations of relational algebra.
A database language is considered relationally complete if it enables at least the set operators set union, set difference, and
Cartesian product, as well as the relation operators projection and selection.
RELATIONALLY COMPLETE LANGUAGES
The following functions are required:
• It must be possible to define tables and attributes.
• Insert, change, and delete operations (manipulation) must be possible.
• Aggregate functions such as addition, maximum and minimum determination, or average calculation should be included.
• Arithmetic expressions and calculations should preferably be supported.
• Formatting and printing tables by various criteria must be possible, e.g., including sorting orders and control breaks for table
visualization.
• Languages for relational databases must include elements for assigning user permissions and for protecting the databases.
• Multi-user access should be supported and commands for data security should be
included.
• QUERIES!
Since most database users are not experts in algebraic operators, relational algebra is implemented in relational database languages that are as user-friendly as possible.
Two common examples:
• SQL (Structured Query Language), is considered a direct implementation of relational algebra.
• QBE (Query by Example) is a language in which the actual queries and manipulations are executed via sample graphics.
SQL
The concept behind SEQUEL (the predecessor of SQL) was to create a relationally complete query language based on English words such as select, from, where, count, group by, rather than mathematical symbols.
Output —> select selected attributes
Input —> from tables to be searched
Processing —> where selection condition
• Projection: select column/s from table
• Selection: select * from table where column=‘value’ # * disables projection
• Join: select columns from table1, table2 where table1.column = table2.column
• Calculate a scalar value: select count(column) from .. where .. # sum, AVG, max, min.
• Create a table: create table tablename
• Delete a table: drop table tablename
• Insert a new tuple: insert into tablename values ()
• Change tuple contents: update tablename set column=‘new value’ where column=‘old value’
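The query and manipulation commands above can be run end to end against an in-memory SQLite database via Python's sqlite3 module (a minimal sketch; the table and column names are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# DDL + DML: create a table and insert tuples.
cur.execute("create table employee (empID text, name text, city text)")
cur.execute("insert into employee values ('E1', 'Meier', 'Zurich')")
cur.execute("insert into employee values ('E2', 'Huber', 'Bern')")

# Selection + projection: where clause filters rows, select list picks columns.
rows = cur.execute("select name from employee where city = 'Bern'").fetchall()
print(rows)  # [('Huber',)]

# Manipulation and aggregation: update a tuple, then count all tuples.
cur.execute("update employee set city = 'Basel' where city = 'Bern'")
count = cur.execute("select count(*) from employee").fetchone()[0]
print(count)  # 2

cur.execute("drop table employee")
con.close()
```

Note that `count(*)` still returns 2 after the update: update changes attribute values but does not add or remove tuples.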
QBE
The language Query by Example (QBE) is a database language that allows users to create and execute their analysis needs directly
within the table using interactive examples.
Query by Example, unlike SQL, cannot be used to define tables or assign and manage user permissions (DDL). QBE is limited to the query and manipulation part (DML). QBE is relationally complete (QL), just like SQL, but with some limits on recursion (queries-of-queries). Translating back and forth between the two can deepen the knowledge of both languages. Analysis specialists prefer SQL for their sophisticated queries.
Graph based languages
Graph databases store data in graph structures. Like relational languages, graph-based languages provide options for:
• data manipulation on a graph transformation level (DML)
• query language (QL)
• programming language for graph structures (nodes and edges)
• set-based (set of vertices and set of edges)
• filtering by predicate (filtering returns a subset of nodes and edges, i.e. a partial graph)
• features for aggregating sets of nodes in the graph into scalar values, e.g., counts, sums, or minimums
Additional features (not available for relational languages):
• path analysis
• indirect (more than one edge) recursive relationship
Recursive relationship in SQL:
with recursive
rpath (partID, hasPartID, length) as (      -- CTE definition
    select partID, hasPartID, 1             -- initialization
    from part
    union all
    select r.partID, p.hasPartID, r.length + 1
    from part p
    join rpath r                            -- recursive join of CTE
    on (r.hasPartID = p.partID)
)
select distinct rpath.partID, rpath.hasPartID, rpath.length
from rpath                                  -- selection via recursively defined CTE
Parts (products) can potentially have multiple subparts (components) and at the same time also potentially be a subpart (component)
to another, superordinate part (product).
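The recursive part-of query can be executed as written on SQLite, which supports WITH RECURSIVE (the part hierarchy below is invented: P1 contains P2, and P2 contains P3):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("create table part (partID text, hasPartID text)")
cur.executemany("insert into part values (?, ?)",
                [("P1", "P2"), ("P2", "P3")])

rows = cur.execute("""
    with recursive
    rpath (partID, hasPartID, length) as (
        select partID, hasPartID, 1 from part
        union all
        select r.partID, p.hasPartID, r.length + 1
        from part p join rpath r on (r.hasPartID = p.partID)
    )
    select distinct partID, hasPartID, length from rpath
    order by partID, hasPartID
""").fetchall()
print(rows)  # [('P1', 'P2', 1), ('P1', 'P3', 2), ('P2', 'P3', 1)]
con.close()
```

The indirect relationship ('P1', 'P3', 2) is exactly what plain (non-recursive) SQL cannot derive in a single query of fixed length.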
Recursive relationship in Cypher:
MATCH path = (p:Part) <-[:HAS*]- (has:Part)
RETURN p.partID, has.partID, LENGTH(path)
HAS* defines the set of all possible concatenations of connections with the edge type HAS.
CYPHER
Cypher is based on a pattern matching mechanism. It offers language commands for data queries (QL) and data manipulation (data manipulation language, DML). Node and edge types are implicit: they are defined by inserting instances. The data definition language (DDL) part of Cypher can therefore only describe indexes, unique constraints, and statistics. Cypher does not include any direct linguistic elements for security mechanisms.
• Selection and projection (relational algebra): MATCH (m:Movie) WHERE m.released > 2008 RETURN m.released ORDER BY m.released
• Cartesian product: MATCH (m:Movie), (p:Person) RETURN m.title, p.name
• Join: MATCH (m:Movie) <-[:ACTED_IN]- (p:Person) RETURN m.title, p.name
• Left join: MATCH (m:Movie) OPTIONAL MATCH (m) <-[:REVIEWED]- (p:Person) RETURN m.title, p.name
• Aggregate functions: MATCH (m:Movie) <-[:REVIEWED]- (p:Person) RETURN m.title, count(p.name) # others: avg, max, min, stdev, sum, collect
Chapter 4: consistency assurance
Data consistency
Consistency and integrity of a database describe a state in which the stored data does not contradict itself.
Integrity constraints are applied to ensure data consistency during all insert and update operations.
Striving for full consistency is not always desirable. — CAP theorem
MULTI USER OPERATIONS
• Simultaneous accesses —> conflicts —> consistency rules could be violated.
• Transaction: set of rules that are executed to achieve a target. It can contain a single or multiple independent instructions for
accessing or modifying data. They are bound by integrity rules, which update database states while maintaining consistency. Once
the transaction is complete the consistency is maintained.
The transaction has to be atomic, consistent, isolated, and durable:
ACID CONCEPT OF TRANSACTIONS — only RELATIONAL DATABASE SYSTEMS
• Atomicity: transactions either succeed completely or fail completely —> incomplete transactions must be completed or rolled back before other transactions can see their effects.
• Consistency: consistent-DB→transaction (with temporary inconsistencies)→consistent-DB.
• Isolation: multi-user transactions behave as if executed SEQUENTIALLY (concurrent transactions may need to wait).
• Durability: the effects of a correctly completed transaction are retained permanently, even across system failures.
DBMS’s guarantee that all users can only make changes that lead from one consistent database state to another (consistent
database).
SERIALIZABILITY OF CONCURRENT TRANSACTIONS: Concurrent transactions require access to the same object at the same time; they must be serialized so that database users can work independently of each other.
pessimistic method: uses locks. Only one transaction can access the object; all other transactions must wait until the object is unlocked. Ex: 2PL (two-phase locking protocol) —> splits the transaction into an expanding (locking) phase and a shrinking (unlocking) phase.
dichotomy: read-locks are shared (read-only access) + write-locks are exclusive (permit write access to the object)
time stamps: allow for strictly ordered object access according to the age of the transactions.
optimistic method assumption: “Conflicts between concurrent transactions will be rare occurrences.”
transactions are split into three phases: read (transactions run simultaneously), validate (check for conflicts), and write.
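The two-phase locking idea can be sketched in Python (an illustrative simulation with threading locks, not a DBMS implementation; the account names and amounts are invented). Each transfer acquires all locks it needs in the expanding phase and releases them only afterwards in the shrinking phase:

```python
import threading

accounts = {"A": 100, "B": 100}
locks = {name: threading.Lock() for name in accounts}

def transfer(src, dst, amount):
    # Expanding phase: acquire every needed lock before the first write.
    # Locking in a fixed global order additionally avoids deadlocks.
    for name in sorted((src, dst)):
        locks[name].acquire()
    try:
        accounts[src] -= amount
        accounts[dst] += amount
    finally:
        # Shrinking phase: release all locks; no new locks after this point.
        for name in sorted((src, dst)):
            locks[name].release()

threads = [threading.Thread(target=transfer, args=("A", "B", 10))
           for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(accounts["A"], accounts["B"])  # 50 150 — the total is preserved
```

Because no transaction ever observes the accounts between the two writes, every interleaving is equivalent to some serial execution: exactly the serializability guarantee 2PL provides.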
In Web-based applications, availability and partition tolerance take priority over strict consistency.
BASE MODEL - NOSQL DATABASE SYSTEMS
To enable flexibility and scalability while partially sacrificing consistency, BASE is adopted:
• Basically Available: The system is guaranteed to be available for querying (read access) by all users (without isolating queries).
Highly distributed approach: instead of maintaining a single large data store and focusing on the fault tolerance of that store,
BASE databases spread data across many storage systems with a high degree of replication. In the unlikely event that a failure
disrupts access to a segment of data, this does not necessarily result in a complete database outage.
• Soft State: The values stored in the system may change because of the “eventual consistency” model. In the BASE model data
consistency is the developer’s problem and should not be handled by the database.
• Eventually Consistent: As data is added to the system, the system’s state is gradually replicated across all nodes. Again, during the
short period of time before all updated blocks are replicated, the state of the file system isn’t consistent. It will not always
happen but users should know that it might.
CAP theorem: any distributed database can guarantee at most 2 out of these 3 properties:
• Consistency: when a transaction changes data in a distributed database with replicated nodes, all reading transactions receive the current state, no matter from which node they access the database.
• Availability: running applications operate continuously and have acceptable response times.
• Partition tolerance: failures of individual nodes or of connections between nodes in a replicated computer network do not impact the entire system, and nodes can be added or removed at any time without having to stop operation (warm configuration). Partition tolerance allows replicated computer nodes to temporarily hold diverging data versions and to be updated with a delay.
Chapter 5: system architecture
OPTIMISATION OF RELATIONAL QUERIES
When expressions combined in a different order generate the same result, they are called equivalent expressions.
• Equivalent expressions allow query optimisation without affecting the result.
• An optimised equivalent expression reduces the computational expense.
Ex: structure of a query tree: root node —> branches —> leaf nodes.
ALGEBRAIC OPTIMISATION PRINCIPLES
Multiple selections on one table can be merged into one so the selection predicate only has to be validated once.
Selections should be made as early as possible to keep intermediate results (tables) small. Then, the selection operators should be
placed as close to the leaves (source tables) as possible.
Projections should also be run as early as possible, but never before selections. Projection operations reduce the number of columns
and often also the tuples.
Join operators should be calculated near the root node of the query tree, since they require a great deal of computational expense.
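The "selection as early as possible" principle can be demonstrated with plain Python list comprehensions standing in for the algebra (an illustrative sketch; the employee/department data and the department names are invented):

```python
# 1000 employees spread over 10 departments; only one department is "IT".
employees = [("E%d" % i, "D%d" % (i % 10)) for i in range(1000)]  # (empID, deptID)
departments = [("D%d" % i, "IT" if i == 3 else "Other") for i in range(10)]

# Late selection: join first, filter afterwards -> large intermediate result.
joined = [(e, d, name) for (e, d) in employees
          for (d2, name) in departments if d == d2]
late = [row for row in joined if row[2] == "IT"]

# Early selection: push the filter below the join -> small intermediate result.
it_depts = [(d, name) for (d, name) in departments if name == "IT"]
early = [(e, d, name) for (e, d) in employees
         for (d2, name) in it_depts if d == d2]

print(len(joined), len(early))        # 1000 intermediate rows vs. 100
assert sorted(late) == sorted(early)  # equivalent expressions, same result
```

Both plans return the identical 100 rows, but the early-selection plan never materialises the 1000-row join: the same saving a query optimiser achieves by moving σ operators toward the leaves of the query tree.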
Chapter 6: post-relational databases
Examples: NoSQL, graph, distributed, and temporal database systems, etc.
• Distributed databases: data is stored across different physical locations (multiple computers or a network of interconnected computers).
- Centralised vs. federated: data is stored, maintained, and processed in different places.
- Standalone vs. distributed: data is stored on different computers.
- Replication: copying data onto different computers.
- Fragmentation: splitting data into smaller parts across several computers.
The user is unaware of the physical fragments; the database system itself performs the database operations locally or, if necessary, splits them between several computers.
Vertical fragmentation: groups several columns together with the identification key. One example is the EMPLOYEE table, where confidential columns such as salary, qualifications, development potential, etc., would be kept in a vertical fragment restricted to the HR department, while the remaining information is made accessible to the individual departments in another fragment.
Horizontal fragmentation: an important task for administrators. It keeps the original structure of the table and selects full rows based on some condition; the fragments do not overlap.
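Both fragmentation styles can be sketched in Python (illustrative only; the EMPLOYEE attributes and the split criteria are invented):

```python
employees = [
    {"empID": "E1", "name": "Meier", "dept": "IT", "salary": 90000},
    {"empID": "E2", "name": "Huber", "dept": "HR", "salary": 80000},
    {"empID": "E3", "name": "Schmid", "dept": "IT", "salary": 95000},
]

# Vertical fragmentation: confidential columns go into an HR-only fragment;
# both fragments keep the identification key empID for reconstruction.
hr_fragment = [{"empID": e["empID"], "salary": e["salary"]} for e in employees]
public_fragment = [{"empID": e["empID"], "name": e["name"], "dept": e["dept"]}
                   for e in employees]

# Horizontal fragmentation: full rows, split by a non-overlapping condition.
it_rows = [e for e in employees if e["dept"] == "IT"]
other_rows = [e for e in employees if e["dept"] != "IT"]

# Reconstruction: horizontal fragments are re-united; vertical fragments
# would be re-joined over the key empID.
assert sorted(r["empID"] for r in it_rows + other_rows) == ["E1", "E2", "E3"]
```

The non-overlap condition matters: because every row satisfies exactly one of the two predicates, the union of the horizontal fragments reproduces the original table without duplicates.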
Distributed databases use remote access, allowing distributed transactions.
Temporal databases: date, time, durations (employee’s age, months of product storage). Temporal databases are designed to relate
data values, individual tuples, or whole tables to the time axis.
- instant: date/time data types
- period: integer/float data types
- transaction time: instant of entering/changing data
New attributes: VALID_FROM, VALID_TO, VALID_AT.
Temporal database system (TDBMS)
Supports the time axis as valid time by ordering attribute values or tuples by time and contains temporal language elements for
queries into future, present, and past.
The SQL standard supports temporal databases with VALID_FROM, VALID_TO and VALID_AT attributes.
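A temporal table with VALID_FROM / VALID_TO attributes can be queried "as of" a given instant with plain SQL, as sketched here on SQLite (the salary history and dates are invented; a full TDBMS, or SQL:2011 period support, would offer dedicated temporal syntax instead of these plain attributes):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("""create table emp_salary
               (empID text, salary int, valid_from text, valid_to text)""")
cur.executemany("insert into emp_salary values (?, ?, ?, ?)", [
    ("E1", 80000, "2020-01-01", "2021-12-31"),
    ("E1", 90000, "2022-01-01", "9999-12-31"),  # current version, open-ended
])

# VALID_AT query: which salary was valid on 2021-06-30?
row = cur.execute("""select salary from emp_salary
                     where empID = 'E1'
                       and valid_from <= '2021-06-30'
                       and valid_to   >= '2021-06-30'""").fetchone()
print(row[0])  # 80000
con.close()
```

Because old tuples are closed (their VALID_TO set) rather than overwritten, past states remain queryable: exactly the historicisation a temporal database provides.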
Multidimensional databases: all decision-relevant information is stored according to various analysis dimensions (data cube) and indicators (or facts).
The indicator is placed at the center of a star schema, surrounded by the dimension tables.
Online Transaction Processing (OLTP): transactions aim to provide data for business handling as quickly and precisely as possible.
• Databases are designed primarily for day-to-day business
• Historically, operative data are overwritten daily, losing important information for decision-making.
Online Analytical Processing (OLAP): more recently, new specialized (multidimensional) databases have been developed for data analysis and decision support. Ex: sales analysed in a multidimensional database by time, region, or product.
Dimensions can be structured further: they also describe aggregation levels.
drill-down: go into more detail for deeper analysis
roll-up: analyse at higher aggregation levels
Standard SQL is of little use here. A multidimensional database management system supports a data cube with the following functionality:
• Possibilities to define several dimension tables with arbitrary aggregation levels, especially for the time dimension
• The analysis language offers functions for drill-down and roll-up.
• Slicing, dicing, and rotation of data cubes are supported.
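A roll-up over a small data cube can be sketched in Python (illustrative only; the indicator "sales" and all dimension values are invented):

```python
from collections import defaultdict

# Data cube: sales indicator keyed by (time, region, product).
cube = {
    ("Q1", "North", "A"): 10, ("Q1", "South", "A"): 20,
    ("Q2", "North", "A"): 30, ("Q2", "North", "B"): 40,
}

def roll_up(cube, keep):
    """Aggregate the cube onto the dimension positions listed in `keep`."""
    out = defaultdict(int)
    for dims, sales in cube.items():
        out[tuple(dims[i] for i in keep)] += sales
    return dict(out)

# Roll-up to the time dimension only; drill-down is the inverse direction,
# going back to the finer-grained cells.
print(roll_up(cube, keep=[0]))       # {('Q1',): 30, ('Q2',): 70}
print(roll_up(cube, keep=[0, 1]))    # slice by time and region
```

Slicing corresponds to fixing one dimension value before aggregating; rotation merely reorders the `keep` positions.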
BUSINESS INTELLIGENCE
Comprises the strategies and technologies used by enterprises for data analysis and the management of business information. It provides facts that can be gathered from the analysis of the available data through:
• Integration of heterogeneous data —> federated database systems
• Historicisation of current and volatile data —> temporal databases
• Complete availability of data on subject areas —> multidimensional databases
DATA WAREHOUSE (DWH)
Central repositories of integrated data from one or more disparate sources used for reporting and data analysis in business
intelligence.
Data warehousing implements:
• federated database systems (FDBMS),
• temporal database systems (TDBMS), and
• multidimensional database systems (MDBMS)
using specific simulation of these DBMS in a distributed relational database technology with these properties:
• Integration of data from various sources (internal or external) and applications in a uniform schema
• Read-only data. The data warehouse is not changed once it is written.
• Historicization of data adding a timeline (validity attributes) in the central storage.
• Analysis-oriented, so that all data on different areas (dimensions) is fully available in one place (data cube).
• Decision support, because the information in data cubes serves as a basis for management decisions (business purposes).
KNOWLEDGE DATABASES
They can manage facts (TRUE/FALSE values) and rules (relationships that allow new contents to be deduced from the stored facts).
Expert system: information system that provides specialist knowledge and conclusions for a certain limited field of application.
The fields of databases, programming languages, and artificial intelligence will increasingly influence each other and in the future
provide efficient problem-solving processes for practical application.