
BIG DATA MANAGEMENT

Chapter 1: information systems and databases
Database
Logically organised collection of electronically stored data that can be directly searched and viewed.
They involve administration:
• technical: computers, database system, storage.
• administrative: setup, maintenance.
and data:
• gathering, maintaining and utilising.
• nontechnical operators: warehouse employees, bank tellers and top managers.
There is a trade-off between data redundancy and consistency conditions.
Information systems and databases VS. Material goods
Handling:
• Representation
• Processing
• Combination
• Age
• Original
• Vagueness
• Medium
Economic and legal evaluation:
• Loss/gain in value with usage.
• Production costs.
• Property rights and ownership of goods.
Data management as a task
It is necessary to see data management as a task for the executive level. Once data is viewed as a factor of production, it has to be planned, governed, monitored, and controlled.
Information system
Enables users to store and connect information interactively, to ask questions and to obtain answers. Any information system of a
certain size uses database technologies to avoid the necessity to redevelop data management and analysis every time it is used.
DATABASE MANAGEMENT SYSTEMS
Software for application-independently describing, storing and querying data. They contain:
• Storage component: data + metadata
• Management component: query and data manipulation language.
RELATIONAL MODEL
- SQL (Structured Query Language) databases
- Proprietary data
- Stability
- Rigidity
NON-RELATIONAL MODEL
- NoSQL databases
- Real-time Web-based services
- Heterogeneous datasets
- Flexibility
Relational model
Relational models collect and present data in a table, a set of tuples presented in tabular form and with the following requirements:
• Table name: A table has a unique table name.
• Attribute name: All attribute names are unique within a table and label one specific column with the required property. The column headers are attribute names, and attributes assign a specific data value as a property to each entry in the table. Attributes can be natural or artificial.
• No column order: The number of attributes is not set, and the order of the columns within the table does not matter.
• No row order: The number of tuples is not set, and the order of the rows within the table does not matter. A row is a tuple or
record that contains a manifestation (instance) of the table.
• Identification key or primary key: an attribute or combination of attributes that is unique and minimal. Application-independent and without semantics.
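These requirements can be sketched with SQL (here via Python's sqlite3); the EMPLOYEE table and its data are illustrative assumptions:

```python
import sqlite3

# Hypothetical EMPLOYEE table illustrating the requirements above:
# unique table name, unique attribute names, and a primary key.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE EMPLOYEE (
        EmpNo  TEXT PRIMARY KEY,   -- application-independent identification key
        Name   TEXT,
        City   TEXT
    )
""")
conn.execute("INSERT INTO EMPLOYEE VALUES ('E19', 'Stewart', 'Stow')")
conn.execute("INSERT INTO EMPLOYEE VALUES ('E4',  'Bell',    'Kent')")

# No row order: a table is a set of tuples, so we compare as a set.
rows = set(conn.execute("SELECT EmpNo, Name, City FROM EMPLOYEE"))
print(rows)
```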
Descriptive vs. Programming language
Descriptive language
Users get the desired results by merely setting the requested
properties in the SELECT statement.
The database management system
• takes computing tasks,
• processes the query (or manipulation) with its own search and
access methods, and
• generates the results table.
SQL requires only the specification of the desired selection
conditions in the WHERE clause.
Programming language
Programmers must provide the procedure for computing the
required records for the users.
The methods for retrieving the requested information must be
implemented by the programmer.
In this case, each query yields only one record, not a set of
tuples.
Procedural languages require the user to specify an algorithm
for finding the individual records.
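The contrast can be illustrated in Python with sqlite3: the same query expressed once declaratively (the DBMS finds the records) and once procedurally (the programmer iterates record by record). Table and data are assumptions for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (EmpNo TEXT PRIMARY KEY, Name TEXT, City TEXT)")
conn.executemany("INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
                 [("E19", "Stewart", "Stow"), ("E4", "Bell", "Kent"),
                  ("E1", "Murphy", "Kent")])

# Descriptive: state WHAT is wanted; the DBMS chooses the access path.
declarative = conn.execute(
    "SELECT Name FROM EMPLOYEE WHERE City = 'Kent'").fetchall()

# Procedural: spell out HOW to find each record, one at a time.
procedural = []
for emp_no, name, city in conn.execute("SELECT * FROM EMPLOYEE"):
    if city == "Kent":
        procedural.append((name,))

print(sorted(declarative) == sorted(procedural))
```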
Relational database management system RDBMS
Have the following properties:
• Model: all data and data relations are represented in tables. Dependencies between attribute values of tuples or multiple instances
of data can be discovered (normal forms).
• Schema: The definitions of tables and attributes are stored in the relational database schema. The schema further contains the
definition of the identification keys and rules for integrity assurance.
• Language: The database system includes SQL for data definition, selection, and manipulation. The language component is
descriptive and facilitates analyses and programming tasks for users.
• Data security and data protection: The database management system provides mechanisms to protect data from destruction,
loss, or unauthorized access.
• Architecture: The system ensures extensive data independence, i.e., data and applications are mostly segregated. This
independence is reached by separating the actual storage component from the user side using the management component. Ideally,
physical changes to relational databases are possible without the need to adjust related applications.
• Multi-user operation: The system supports multi-user operation, i.e., several users can query or manipulate the same database at
the same time. The RDBMS ensures that parallel transactions in one database do not interfere with each other or, worse, with the
correctness of data.
• Consistency assurance: The database management system provides tools for ensuring data integrity, i.e., the correct and
uncompromised storage of data.
NoSQL systems meet these criteria only partially —> use relational database technology augmented with NoSQL technology.
Big data
Large volumes of data, usually unstructured (text, graphics, images, audio, video).
Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
The main characteristics are the three Vs:
• Volume (extensive amounts of data): from megabytes to zettabytes.
• Variety (multiple formats: structured, semi-structured, and unstructured data).
• Velocity (high-speed and real-time processing).
Two further Vs are often added:
• Value: big data is meant to increase enterprise value.
• Veracity: data may be inaccurate or vague; specific algorithms are used to evaluate its validity and quality.
NoSQL database systems
Web-based storage systems are considered NoSQL DB systems if they meet the following requirements:
• Model: The underlying database model is not relational.
• At least three Vs: The database system includes a large amount of data (volume), flexible data structures (variety), and real-time
processing (velocity).
• Schema: The database management system is not bound by a fixed database schema. When the schema is free, the structures of
individual records (or documents) can vary.
• Architecture: The database architecture supports massively distributed web applications and horizontal scaling.
• Replication: The database management system supports data replication.
• Consistency assurance: Consistency may be ensured with a delay to prioritize high availability and partition tolerance.
Data distribution
A distributed database can be configured to store the same data in multiple nodes across a network of locations. If a single node fails, the data is still available. You don’t have to wait for the database to be restored. A geo-distributed database maintains concurrent
nodes across geographical regions for resilience in case of a regional power or communications outage. The ability to store a single
database across multiple computers requires an algorithm for replicating data that is transparent to the users.
DISTRIBUTED DATABASES: A logical interrelated collection of shared data (and metadata), physically distributed over a computer
network
DISTRIBUTED DBMS: Software system that permits the management of the distributed database and makes the distribution
transparent to users.
Consistency
Data consistency is the process of keeping information uniform as it moves across a network and between various applications on a
computer. Data consistency helps ensure that information on a crashing computer can be restored to its pre-crash state.
Strong consistency means that the database management system ensures full consistency at all times.
Systems with weak consistency tolerate that changes will be copied to replicated nodes with a delay, resulting in temporary
inconsistencies.
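As a rough illustration of weak consistency, this toy sketch (all names are assumptions, not a real replication protocol) shows a replica that lags behind the primary until pending changes are applied:

```python
# Toy sketch of weak (eventual) consistency: a write reaches the replica
# only when the pending change log is applied, so reads from the replica
# can be temporarily stale.

class ReplicatedStore:
    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = []            # changes not yet copied to the replica

    def write(self, key, value):
        self.primary[key] = value    # primary is updated immediately
        self.pending.append((key, value))

    def sync(self):                  # replication happens with a delay
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()

store = ReplicatedStore()
store.write("balance", 100)
stale = store.replica.get("balance")    # None: replica not yet synced
store.sync()
fresh = store.replica.get("balance")    # 100: consistent after the delay
print(stale, fresh)
```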
NoSQL databases
NoSQL databases are geared toward managing large sets of varied and frequently updated data, often in distributed systems or the
cloud. They avoid the rigid schemas associated with relational databases. But the architectures themselves vary and are separated
into four primary classifications, although types are blending over time.
KEY VALUE DATABASES
Simple data model that pairs key attribute and its associated value in storing data elements.
WIDE-COLUMN DATABASES
Table-style databases. They store data across tables that can have a very large number of columns. Ex. Internet search.
DOCUMENT DATABASES
Use the JSON format for document-like structures.
GRAPH DATABASES
Store related nodes in graphs to accelerate querying.
Property graph model: nodes, directed edges, labels for nodes and edges and properties.
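A minimal in-memory sketch of the property graph model (labels, directed edges, and properties; all data is an illustrative assumption):

```python
# Nodes and directed edges, each carrying a label and a dict of properties.
nodes = {
    "e7": {"label": "EMPLOYEE", "props": {"Name": "Stewart"}},
    "p1": {"label": "PROJECT",  "props": {"Title": "WebShop"}},
}

# Directed, attributed edge: (source, target, label, properties).
edges = [
    ("e7", "p1", "IS_INVOLVED", {"Percentage": 50}),
]

# Querying means traversing nodes and edges rather than joining tables.
involved = [nodes[src]["props"]["Name"]
            for src, dst, label, props in edges
            if label == "IS_INVOLVED" and dst == "p1"]
print(involved)
```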
Chapter 2: Data modelling
Data models provide a structure and a formal description of the data and its relationships, without considering the kind of database
management system to be used for entering, storing and maintaining data.
Database setup
1. DATA ANALYSIS
To find the data required for the information system and their relationships to each other.
It contains at least a verbal task description with objectives and a list of relevant pieces of information.
The written description of data connections can be complemented by graphical illustrations or a summarising example.
2. DESIGNING A CONCEPTUAL DATA MODEL (E/R MODEL)
Identification of entity and relationship sets is not a simple task.
This design step requires experience and practice from the data architect.
E/R model contains both the entity sets (rectangles) and relationship sets (rhombi).
3. CONVERTING IT INTO A RELATIONAL OR NON-RELATIONAL DATABASE SCHEMA
Is the formal description of the database objects using either tables (relational db —> entity and relationship sets must be
expressed in tables, 1 table/relationship set) or nodes and edges (graph-oriented schema —> entity sets = nodes, relationship sets =
edges).
Entity-relationship model (E/R)
ENTITY
An entity is a specific object in the real world or our imagination that is distinct from all others (characterised by its attributes).
Entities of the same type are combined into entity sets and are further characterised by attributes. The attributes are properties
characterising an entity (set). In each entity set there is an identification key.
RELATIONSHIP
The relationships between entities are of interest and can form sets of their own. Like entity sets, relationship sets can be
characterised by attributes. A copy of another table entity key is named foreign key. Relationships can be understood as
associations in two directions.
ASSOCIATION
Each direction of a relationship is an association. We distinguish 4 types of association:
• Type 1: exactly one —>[1:1]: each entity from the entity set ES1 is assigned exactly one entity from the entity set ES2. Each
employee is subordinate to exactly one department.
• Type c (or 0): none or one —> [0:1]: each entity from the entity set ES1 can be assigned at most one entity from set ES2. Not
every employee can have the role of a department head.
• Type m: one or multiple —> [1:m]: each entity from the entity set ES1 is assigned one or more entities from ES2.
• Type mc (or m0): none, one, or multiple —> [0:m]: each entity from the entity set ES1 can be assigned any number (none, one, or multiple) of entities from ES2.
Mapping from E/R model to Relational database schema
The relational model is based on a set of formal rules collected in normal forms that are used to discover and study dependencies
within tables in order to avoid redundant information and resulting anomalies.
Achieved by dividing database in tables and defining relationships.
ATTRIBUTE REDUNDANCY
An attribute in a table is redundant if individual values of this attribute can be omitted without a loss of information.
For every employee of department D6 the department name Accounting is repeated
• This repetition occurs for ALL employees of ALL departments
• We can say that the attribute DepartmentName is redundant, since the same value is listed in the table multiple times
• It would be preferable to store the name going with each department number in a separate table for future reference instead of
redundantly carrying it along for each employee.
DATABASE ANOMALIES BY REDUNDANCIES
Normalization is the process of efficiently organizing data in a database.
Taking all data in a single table we have:
1. Insertion anomaly. There is no way of adding a new department without employees in it. No new table rows can be inserted
without a unique employee number.
2. Deletion anomaly. If we delete all employees from the table, we also lose the department numbers and names.
3. Update anomalies. Changing the name of department D3 from IT to Data Processing, each employee of that department would
have to be edited individually, meaning that although only one detail is changed, the entire table must be adjusted in multiple places.
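The split that removes these anomalies can be sketched with sqlite3 (table and attribute names are assumptions based on the example above):

```python
import sqlite3

# Removing the redundancy described above: DepartmentName moves to its
# own DEPARTMENT table; EMPLOYEE keeps only the department number.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DEPARTMENT (DeptNo TEXT PRIMARY KEY, DeptName TEXT);
    CREATE TABLE EMPLOYEE  (EmpNo TEXT PRIMARY KEY, Name TEXT, DeptNo TEXT);
    INSERT INTO DEPARTMENT VALUES ('D3', 'IT'), ('D6', 'Accounting');
    INSERT INTO EMPLOYEE  VALUES ('E19', 'Stewart', 'D6'), ('E4', 'Bell', 'D6');
""")

# Update anomaly gone: renaming department D3 touches exactly one row.
conn.execute("UPDATE DEPARTMENT SET DeptName = 'Data Processing' WHERE DeptNo = 'D3'")

# Insertion anomaly gone: a department without employees can exist.
conn.execute("INSERT INTO DEPARTMENT VALUES ('D7', 'Marketing')")

count = conn.execute("SELECT COUNT(*) FROM DEPARTMENT").fetchone()[0]
print(count)
```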
AVOID REDUNDANCIES AND ANOMALIES
INTRODUCTORY CONCEPTS
• Determinant: is an attribute that determines the values assigned to other attributes in the same row. By this definition, any
primary key or candidate key is a determinant. However, there may be determinants that aren't primary or candidate keys.
• Key:
- Keys: attributes that are used to access tuples from a table.
- Super key: attribute or set of attributes which uniquely identify the tuples in table
- Candidate key: is a minimal super key.
- Concatenated (candidate) key: candidate key composed of more than one attribute. Also named composite candidate key.
- Overlapping concatenated keys: composite keys with at least one attribute in common.
Primary key: the (chosen) candidate key, usually with the smallest number of attributes. There is one and only one primary key in any relation (table). No attribute of the primary key can contain a NULL value. A primary key is a candidate key.
Foreign key: an attribute that can only take the values which are present as values of some other attribute.
• Functional dependency: B is functionally dependent on A if for each value of A there is exactly one value of B. A functional dependency of B on A therefore requires that each value of A uniquely identifies one value of B; A is a determinant of B. For an identification (or primary) key K and an attribute B in one table, there is a functional dependency K -> B (K uniquely identifies B).
• Full functional dependency: B is fully functionally dependent on a concatenated key (K1, K2) if B is functionally dependent on the entire key, but not on its parts (K1 alone or K2 alone cannot uniquely identify B).
• Transitive dependency: C is transitively dependent on A if B is functionally dependent on A and C is functionally dependent on B.
• Multivalued dependency: C is multivaluedly dependent on A if any combination of a specific value of A with an arbitrary value of B
results in an identical set of values of C.
• Lossless join dependency: a table R with attributes A, B, C has join dependency if the projected subtables R1(A,B) and R2(B,C)
when joined via the shared attribute B result in the original table R. Splitting a table can be done with a project operator and table
reconstruction can be done with a join operator.
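The lossless join dependency can be checked directly in Python: project R(A, B, C) onto R1(A, B) and R2(B, C), join back on B, and compare with the original (the sample tuples are assumptions):

```python
# Project R(A, B, C) onto R1(A, B) and R2(B, C), then join back on B.
R = {("a1", "b1", "c1"), ("a2", "b2", "c2")}

R1 = {(a, b) for a, b, c in R}          # projection on (A, B)
R2 = {(b, c) for a, b, c in R}          # projection on (B, C)

# Natural join of R1 and R2 via the shared attribute B.
joined = {(a, b, c) for a, b in R1 for b2, c in R2 if b == b2}

print(joined == R)  # lossless here, because B determines C
```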
NORMAL FORMS
Understanding the normal forms helps to make sense of the mapping rules from an entity-relationship model to a relational model.
Mapping onto a relational database schema with a properly defined entity-relationship model and a consistent application of the
relevant mapping rules, means that the normal forms will always be met.
Creating an entity-relationship model and using mapping rules we can mostly skip checking the normal forms for each individual
design step. Not all normal forms are equally relevant. Usually only the first three normal forms are used.
• 1NF: All attribute values are atomic. Basis for all the other normal forms. Using the first normal form leaves us with a table full
of redundancies.
• 2NF: Non-key attribute are fully dependent on the key.
If a table with a concatenated key is not in 2NF, it has to be split into subtables:
1. the attributes that are dependent on a part of the key are transferred to a separate table along with that key part,
2. the concatenated key and potential other relationship attributes remain in the original table.
• 3NF: No transitive dependencies. The transitive dependency can be removed by splitting off the redundant attribute and putting
it in a separate table with “its key” attribute (D#). The dependent attribute (D#) also stays in the remaining table as a foreign key
• BCNF: (Boyce-Codd) Only dependencies on key are permitted. Used when there are multiple overlapping candidate keys in one
table. Such tables, even when they are in 3NF, may conflict with BCNF. In this case, the table has to be split due to the candidate
keys.
• 4NF: No multivalued dependencies. One attribute is the determinant of another one which can have many values. Tables
containing only multivalued dependencies attributes don’t have anomalies. Tables containing multivalued dependencies together with
other attributes can cause update anomalies.
• 5NF: Only trivial join dependency. A table is in the 5NF if it can be arbitrarily split by project operators and then reconstructed
into the original table with join operators (lossless join dependency). —> use foreign key
A table or an entire database schema in a given normal form must also meet all requirements of the previous normal forms.
MAPPING FROM E/R MODEL TO RELATIONAL DATABASE SCHEMA
We have seen how to normalise a relational database. Now we see how entity and relationship sets can be represented in tables.
Relational database schema: contains definitions of the tables, attributes and primary keys.
Integrity constraints: set limits for the domains and the dependencies between tables.
There are two rules of major importance in mapping an entity-relationship model onto a relational database schema:
• Rule 1 (entity sets): each entity set has to be defined as a separate table with a unique primary key. The primary key can be
either the key of the respective entity set or one selected candidate key. The remaining attributes of the entity set are
converted into corresponding attributes within the table. By definition, a table requires a unique primary key. It is possible that
there are multiple candidate keys in a table, all of which meet the requirement of uniqueness and minimality. In such cases, it is up
to the data architects which candidate key they would like to use as the primary key.
• Rule 2 (relationship sets): each relationship set can be defined as a separate table. The identification keys of the corresponding
entity sets must be included in this table as foreign keys. The primary key of the relationship set table can be a concatenated key
made from the foreign keys or another candidate key, e.g., an artificial key. Other attributes of the relationship set are listed in
the table as further attributes.
Foreign key for a relationship: Foreign key is an attribute within a table that is used as an identification key in at least one other
table (possibly also within this one). Identification keys (or primary keys) can be reused in other tables to create the desired
relationships between tables. In other words: a relationship set between two tables is translated in a relational database by a
foreign key in a table “linked” to the primary key of the other table
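Rules R1 and R2 can be sketched in SQL via sqlite3; the EMPLOYEE, PROJECT, and INVOLVED tables and their attributes are illustrative assumptions:

```python
import sqlite3

# Entity sets become tables (R1); the m-m relationship set INVOLVED
# becomes its own table with the two foreign keys forming a
# concatenated primary key (R2).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE EMPLOYEE (EmpNo TEXT PRIMARY KEY, Name TEXT);
    CREATE TABLE PROJECT  (ProjNo TEXT PRIMARY KEY, Title TEXT);
    CREATE TABLE INVOLVED (
        EmpNo  TEXT REFERENCES EMPLOYEE,
        ProjNo TEXT REFERENCES PROJECT,
        Percentage INTEGER,
        PRIMARY KEY (EmpNo, ProjNo)
    );
    INSERT INTO EMPLOYEE VALUES ('E19', 'Stewart');
    INSERT INTO PROJECT  VALUES ('P1', 'WebShop');
    INSERT INTO INVOLVED VALUES ('E19', 'P1', 50);
""")

# The foreign keys "link" the relationship table back to the entity tables.
row = conn.execute("""
    SELECT e.Name, p.Title FROM INVOLVED i
    JOIN EMPLOYEE e ON e.EmpNo = i.EmpNo
    JOIN PROJECT  p ON p.ProjNo = i.ProjNo
""").fetchone()
print(row)
```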
Other rules for relationship sets: the use of rules R1 and R2 alone does not necessarily result in an ideal relational database
schema as this approach may lead to a high number of individual tables.
Mapping rules for relationship sets are based on the cardinality of relationships to avoid an unnecessarily large number of tables
and expressly limits which relationship sets always and in any case require separate tables.
• Rule R3 (network-like relationship sets): every complex-complex (many-to-many or m-m) relationship set must be defined as a separate table which contains at least the identification keys of the associated entity sets as foreign keys. The primary key of a relationship set table is either a concatenated key from the foreign keys or another candidate key. Any further characteristics of the relationship set become attributes in the table.
• Rule R4 (hierarchical relationship sets): unique-complex (one-to-many or 1-m) relationship sets can be represented (without a separate relationship set table) directly by the tables of the associated entity sets —> 2 associations = 1 relationship. The unique association (type 1 or c) allows for the primary key of the referenced table to simply be included in the referencing table as a foreign key with an appropriate role name.
• Rule R5 (unique-unique relationship sets): unique-unique relationship sets can be represented without a separate table by the tables of the two associated entity sets (directly). Again, an identification key from the referenced table can be included in the referencing table along with a role name (foreign key).
Mapping rules for relationship hierarchy: based on the relationship hierarchy to consider some common characteristics in a
superordinate table (generalisation) to take in account some peculiarities (specialisation).
• Rule R6 (relationship of generalisation): generalisation is the combination of entities into a superordinate entity set. Once the
entities are defined, some of them can be “generalised” in a “super-entity”: [entity subsets] ---generalisation---> [new entity set].
• Rule R7 (relationship of aggregation): Since the relational model does not directly support the relationship structure of a
generalisation, the characteristics of such a relationship hierarchy have to be modelled indirectly. Each entity set of a
generalisation hierarchy requires a separate table. The primary key of the superordinate table becomes the primary key of all
subordinate tables as well. In other words, the identification keys of the specialisation must always match those of the
superordinate table.
Aggregation is the combination of entities into a superordinate relationship set. Once the entities are defined, some of them can be
“aggregated” defining a new relationship set: [ entity sets ] ---aggregation---> [ new relationship set ]
We distinguish:
- network-like aggregation: CORPORATION EXAMPLE - Each company may have multiple superordinate and/or subordinate
companies.
- hierarchical aggregation: ITEMS (PRODUCTS) EXAMPLE - Each item (product) may consist of multiple sub-items (components).
Each sub-item is dependent on exactly one superordinate item.
If the cardinality of a relationship in an aggregation is:
• Complex-complex: separate tables must be defined for both the entity set and the relationship set. The relationship set table
contains the identification key of the associated entity set table twice with corresponding role names to form a concatenated key.
• Unique-unique: (hierarchical structure), the entity set and the relationship set can be combined in a single table.
Integrity
Integrity (or consistency) of data means that stored data does not contradict itself.
STRUCTURAL INTEGRITY CONSTRAINTS
Rules to ensure integrity that can be represented within the database schema itself. For relational databases, they include the
following:
• Uniqueness constraint: Each table has an identification key (attribute or combination of attributes) that uniquely identifies each
tuple within the table. A consistent EMPLOYEE table requires that the names of employees, streets, and cities really exist and are
correctly assigned. If there are multiple candidate keys within one table, one of them has to be declared the primary key to fulfil
the uniqueness constraint. The uniqueness of the primary keys themselves is checked by the DBMS.
• Domain constraint: The attributes in a table can only take on values from a predefined domain. Defining a domain is not enough
when it comes to verifying city or street names; for instance, a “CHARACTER (20)” limit does not have any bearing on meaningful
street or city names. Often the domain constraint comes in the form of enumerated types: Profession ≡ {Programmer, Analyst,
Organizer}; YearOfBirth ≡ {1916...2021}
• Referential integrity constraint: Each value of a foreign key must actually exist as a key value in the referenced table.
Insertion integrity: cannot insert ( E20, Mahoney, Market Ave S, Canton, D7 ) into EMPLOYEE because D7 doesn’t exist in
DEPARTMENT
Restricted delete: deletion of tuple ( D6, Accounting ) from DEPARTMENT is denied
Cascade delete: deleting ( D6, Accounting ) tuple from DEPARTMENT, two EMPLOYEE tuples (E19 and E4) would be removed
Unknown setting: deleting ( D6, Accounting ) tuple from DEPARTMENT, the foreign key of tuples E19 and E4 will be assigned
to NULL
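A sketch of insertion integrity and the “unknown setting” delete option, using SQLite’s foreign key actions (ON DELETE SET NULL here; RESTRICT and CASCADE are declared analogously):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE DEPARTMENT (DeptNo TEXT PRIMARY KEY, DeptName TEXT);
    CREATE TABLE EMPLOYEE (
        EmpNo TEXT PRIMARY KEY, Name TEXT,
        DeptNo TEXT REFERENCES DEPARTMENT ON DELETE SET NULL
    );
    INSERT INTO DEPARTMENT VALUES ('D6', 'Accounting');
    INSERT INTO EMPLOYEE VALUES ('E19', 'Stewart', 'D6'), ('E4', 'Bell', 'D6');
""")

# Insertion integrity: D7 does not exist in DEPARTMENT, so this fails.
try:
    conn.execute("INSERT INTO EMPLOYEE VALUES ('E20', 'Mahoney', 'D7')")
except sqlite3.IntegrityError:
    pass  # rejected, as required by referential integrity

# "Unknown setting": deleting D6 sets the employees' foreign keys to NULL.
conn.execute("DELETE FROM DEPARTMENT WHERE DeptNo = 'D6'")
keys = [r[0] for r in conn.execute("SELECT DeptNo FROM EMPLOYEE")]
print(keys)
```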
Graph model
GRAPH THEORY
A graph is defined by the sets of its nodes (or vertices) and edges.
Properties of network structures:
• Undirected graph: G=(V,E) consists of a vertex set V and an edge set E, with each edge being assigned two potentially identical
vertices.
• Connected graph: A graph is connected if there are paths between any two vertices.
• Loop: is an edge that connects a vertex to itself. A simple graph contains no loops.
• Degree of a vertex: the number of edges originating from it. A graph G is Eulerian if it is connected and each vertex has an even degree. “A path traversing each edge of a graph exactly once can only exist if each vertex has an even degree” (Euler, 1736). If all vertices are of an even degree, there is at least one Eulerian cycle.
Fleury’s algorithm: How to find an Eulerian cycle?
1) Choose any node as the starting vertex.
2) Choose any (nonmarked) incidental edge and mark it (e.g., with sequential numbers or letters).
3) Take the end node as the new starting vertex.
4) Repeat from step (2).
There is, of course, more than one possible solution, and the path does not necessarily have to be a cycle.
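Because the simple marking procedure can run into dead ends without Fleury’s bridge check, this sketch uses Hierholzer’s algorithm, a standard alternative for finding an Eulerian cycle in a connected graph whose vertices all have even degree:

```python
from collections import defaultdict

def eulerian_cycle(edges):
    # Build an adjacency list for the undirected graph.
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    stack, cycle = [edges[0][0]], []
    while stack:
        v = stack[-1]
        if adj[v]:                       # unused edge left: walk it
            w = adj[v].pop()
            adj[w].remove(v)             # mark edge (v, w) as used
            stack.append(w)
        else:                            # dead end: add vertex to cycle
            cycle.append(stack.pop())
    return cycle

# A plain square: every vertex has degree 2, so the graph is Eulerian.
cycle = eulerian_cycle([(0, 1), (1, 2), (2, 3), (3, 0)])
print(cycle)
```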
Weighted graph
Are graphs whose vertices or edges have properties assigned to them. The weight of a graph is the sum of all weights within the
graph, i.e., all node or edge weights. This definition also applies to partial graphs, trees, or paths as subsets of a weighted graph.
• The shortest path: search for partial graphs with maximum or minimum weight, e.g., find the shortest path (the smallest total weight) between the stops v0 and v7. Given an undirected graph G = (V, E) with positive edge weights and an initial vertex vi, we look at the vertices vj neighboring vi and calculate the set Sk(v). We select the neighboring vertex vj closest to vi and add it to the set Sk(v).
• Dijkstra’s algorithm:
1. Calculate the sum of the respective edge weights for each neighboring vertex of the current node.
2. Select the neighboring vertex with the smallest sum.
3. If the sum of the edge weights for that node is smaller than the distance value (dist) stored for it, set the current node as its previous vertex (prev) and enter the new distance in Sk.
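A sketch of Dijkstra’s algorithm following the steps above, using a priority queue; the example graph is an illustrative assumption:

```python
import heapq

def dijkstra(graph, start):
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue                      # stale queue entry, skip
        for w, weight in graph.get(v, []):
            nd = d + weight               # sum of edge weights via v
            if nd < dist.get(w, float("inf")):
                dist[w] = nd              # shorter path found: update dist
                heapq.heappush(heap, (nd, w))
    return dist

graph = {
    "v0": [("v1", 2), ("v2", 5)],
    "v1": [("v2", 1), ("v3", 4)],
    "v2": [("v3", 1)],
}
print(dijkstra(graph, "v0"))
```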
Mapping from E/R model to Graph database schema
Graph databases are often founded on the model of directed weighted graphs. The objective is to convert entity and relationship
sets into nodes and edges of a graph. A graph database schema contains nodes and edges.
Compared to the relational model, the graph model allows for a broader variety of options for representing entity and relationship
sets: undirected graph, directed graph, relationship sets as edges, relationship sets as nodes, etc.
• Rule G1 (entity sets): Each entity set has to be defined as an individual vertex in the graph database. The attributes of each
entity set are made into properties of the respective vertex.
• Rule G2 (relationship sets): Each relationship set can be defined as an undirected edge within the graph database. The attributes
of each relationship set are assigned to the respective edge (attributed edges).
Relationship sets can also be represented as directed edges. The directed edge constellations are used to highlight one specific
association of a relationship using the direction of the corresponding edge.
• Rule G3 (network-like relationships): one employee can work on multiple projects (IS_INVOLVED) and each project must involve
multiple employees (INVOLVES). Alternatively, a double arrow could be drawn between the E and P vertices, with the name
INVOLVED and the attribute Percentage.
• Rule G4 (hierarchical relationships): a unique-complex relationship set can be defined as a directed edge between vertices in the direction from the root node to the leaf node and with the multiple association type (m or mc) noted at the arrowhead. In a one-to-many relationship, one directed edge is enough to represent the relationship set.
• Rule G5 (unique-unique relationships): Every unique-unique relationship set can be represented as a directed edge between the
respective vertices. The direction of the edge should be chosen so that the association type at the arrowhead is unique, if
possible. Each department must have one department head, and each employee can be one department head. It would also be possible to use the reverse direction from employees to departments as an alternative, where the edge would be IS_DEPARTMENT_HEAD and the association type ‘c’ would be noted at the arrowhead.
• Rule G6 (generalisation): the superordinate entity set of a generalisation becomes a super node; the entity subsets become normal vertices. The generalisation hierarchy is then complemented by specialisation edges (from the general to the particular).
• Rule G7 (aggregation): For network-like or hierarchical aggregation structures, entity sets are represented by nodes, and
relationship sets are represented by edges with the association type mc noted at the arrowhead. Entity set attributes are
attached to the nodes; relationship set properties are attached to the edges.
Structural integrity constraints
Structural integrity constraints are secured by the database management system. For graph databases, they include the following:
• Uniqueness constraint: Each vertex and each edge can be uniquely identified within the graph. Path expressions can be used to
navigate to individual edges or nodes.
• Domain constraint: The attributes of both vertices and edges belong to the specified data types, i.e., they come from well-defined
domains.
• Connectivity: A graph is connected if there is a path between any two vertices within the graph. The graph database ensures
connectivity for graphs and partial graphs.
• Tree structures: Special graphs, such as trees, can be managed by the graph database. It ensures that the tree structure is kept
intact in case of changes to nodes or edges.
• Duality: For a planar graph G, the dual of G denoted G∗ is a graph obtained by replacing all faces in G with vertices, and connecting
vertices with edges if those faces share an edge.
Planar graphs are graphs that can be drawn in the Euclidean plane without intersecting edges.
Duality: Given a planar graph G = (V, E), its dual graph G* = (V*, E*) is constructed by placing a vertex in the dual graph G* for each
area of the graph G, then connecting those vertices V* to get the edges E*. Two nodes in G* are connected by the same number of
edges, as the corresponding areas in G have shared edges. A graph database can translate a planar graph G into its dual graph G* and
vice versa.
Moving problems into dual space can make it easier to find solutions. If properties can be derived in dual space, they are also valid in the original space.
Chapter 3: database languages
Defining a data architecture, we have to follow:
• What data is to be gathered by the company itself and what will be obtained from external data suppliers?
• Who is in charge of distributed data?
• What are the relevant obligations regarding data protection and data security in international contexts?
• How can the stored data be classified and structured, taking the relevant national and international conditions for maintenance and servicing into account?
• Which rights and duties apply to data exchange and disclosure?
DATABASE LANGUAGE USERS
We can distinguish:
• The database administrator uses a database language to manage the data descriptions. He/she also sets permissions.
• The system administrator ensures that data descriptions are updated consistently and in adherence with the data architecture.
• The DB specialist defines, installs, and monitors databases. Some accesses are limited to diagnostic periods only.
• The application programmer uses database languages to run analyses on or apply changes to databases.
• The data analyst or data scientist is the end user of database languages for everyday analysis. He/she is an expert in the targeted interpretation of database contents on specific issues, usually with limited IT skills.
Relational algebra
Relational algebra is the formal framework for database languages working on tables, i.e., relational databases. It defines a number of algebraic operators that apply to relations. Most of these operators are not used directly in the languages, but to qualify as a "relationally complete language", a language has to be capable of replicating these operations.
Relational operators apply to one or more relations (tables) and always return another relation. This consistency allows multiple operators to be combined.
We distinguish:
SET OPERATORS
• union: Two relations are union-compatible if they meet both of the following criteria: both relations have the same number of attributes, and the data formats (domains) of the corresponding attributes are identical. Relations R and S are combined by a set union R ∪ S when all entries from R and all entries from S are entered into the resulting table. Identical records are automatically unified, since a distinction between tuples with identical attribute values is not possible in the resulting set R ∪ S.
• intersection: Relations R and S combined by a set intersection R ∩ S hold only those entries found in both R and S.
• difference: Relations R and S are combined by a set difference R \ S by removing from R all entries that also exist in S. The intersection operator can be expressed through the difference operator: R ∩ S = R \ (R \ S)
• cartesian product: The two relations do not need to be union-compatible. The Cartesian product R × S of two relations R and S is the set of all possible combinations of tuples from R with tuples from S. For two arbitrary relations R and S with m and n entries, respectively, the Cartesian product R × S has m times n tuples. Many of the resulting tuples can be meaningless.
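The set operators above can be sketched directly with Python sets of tuples (an illustrative sketch; the relations R and S and their contents are invented):

```python
# Relations as sets of tuples; R and S are union-compatible
# (same number of attributes, same domains).
R = {("E1", "Meier"), ("E2", "Huber")}
S = {("E2", "Huber"), ("E3", "Schmid")}

union = R | S          # all entries from R and S, duplicates unified
intersection = R & S   # entries found in both R and S
difference = R - S     # entries of R that do not exist in S

# Intersection expressed via the difference operator: R ∩ S = R \ (R \ S)
assert intersection == R - (R - S)

# Cartesian product: every combination of a tuple from R with a tuple from S.
product = {r + s for r in R for s in S}
print(len(product))  # m * n tuples: 2 * 2 = 4
```

Note how identical records are unified automatically: the shared tuple ("E2", "Huber") appears only once in the union, exactly as the set semantics of the relational model demand.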
RELATIONAL OPERATORS
• Projection: A projection πa(R) with the project operator π forms a subrelation of the relation R based on the attribute names
defined by a. Given a relation R(A,B,C,D), the expression πA,C(R) reduces R to the attributes A and C: the order of attribute
names in a projection matters! For example, R′ := πC,A(R) means a projection of R = (A,B,C,D) onto R′ = (C,A). Then πC,A(R) ≠
πA,C(R).
• Selection: The selection σP(R) with the select operator σ is an expression that extracts a selection of tuples from the relation R based on the property P. Property P is described by a condition (with operators such as <, >, =, AND, OR, NOT, etc.)
• Join: The join operator merges two relations into a single one. The join σP(R × S) of the two relations R and S by the property (predicate) P is a combination of all tuples from R with all tuples from S where each combined tuple meets the join predicate P. A join is thus a selection (σ) over a Cartesian product (×). If the join predicate P uses the comparison operator =, the result is called an equi-join. A join with a wrong or missing predicate may lead to wrong or unwanted results; in fact, a missing predicate returns all rows of the Cartesian product.
• Division: A division of the relation R by the relation S is only possible if S is contained within R as a subrelation. The division R ÷ S (also written R / S) returns all values of the remaining attributes of R that occur in combination with ALL tuples of S, where the attributes of S are the common attributes. For instance: R := which employees work on which projects; S := projects P2 and P4; then R′ := R ÷ S = the employees E1 and E4, who are involved in both projects.
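The relational operators, including the division example above, can be sketched in Python (illustrative only; the relation contents are the invented employee/project pairs from the example):

```python
# R: which employees work on which projects; S: projects P2 and P4.
R = {("E1", "P2"), ("E1", "P4"), ("E2", "P1"), ("E4", "P2"), ("E4", "P4")}
S = {("P2",), ("P4",)}

# Projection π_empID(R): reduce R to the employee attribute.
emp_ids = {(e,) for (e, p) in R}

# Selection σ_{project = 'P2'}(R): extract tuples meeting the predicate.
on_p2 = {(e, p) for (e, p) in R if p == "P2"}

# Equi-join of R and S over the project attribute: a selection of the
# Cartesian product where the join predicate (equality) holds.
joined = {(e, p) for (e, p) in R for (p2,) in S if p == p2}

# Division R ÷ S: employees that appear together with ALL projects in S.
projects = {p for (p,) in S}
quotient = {(e,) for (e,) in emp_ids
            if projects <= {p for (e2, p) in R if e2 == e}}
print(sorted(quotient))  # [('E1',), ('E4',)]
```

The quotient contains exactly E1 and E4, matching the worked example: only those two employees are involved in both P2 and P4.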
Relationally complete languages
COMPLETENESS CRITERION
Languages are relationally complete if they support all operations of relational algebra.
A database language is considered relationally complete if it enables at least the set operators set union, set difference, and
Cartesian product, as well as the relation operators projection and selection.
RELATIONALLY COMPLETE LANGUAGES
The following functions are required:
• It must be possible to define tables and attributes.
• Insert, change, and delete operations (manipulation) must be possible.
• Aggregate functions such as addition, maximum and minimum determination, or average calculation should be included.
• Arithmetic expressions and calculations should preferably be supported.
• Formatting and printing tables by various criteria must be possible, e.g., including sorting orders and control breaks for table
visualization.
• Languages for relational databases must include elements for assigning user permissions and for protecting the databases.
• Multi-user access should be supported and commands for data security should be
included.
• QUERIES!
Since most database users are not experts in algebraic operators, relational algebra is implemented in relational database languages that are as user-friendly as possible.
Two common examples:
• SQL (Structured Query Language), is considered a direct implementation of relational algebra.
• QBE (Query by Example) is a language in which the actual queries and manipulations are executed via sample graphics.
SQL
The concept behind SEQUEL (the predecessor of SQL) was to create a relationally complete query language based on English words such as select, from, where, count, group by, rather than mathematical symbols.
Output —> select selected attributes
Input —> from tables to be searched
Processing —> where selection condition
• Projection: select column/s from table
• Selection: select * from table where column=‘value’ # * disables projection
• Join: select columns from table1, table2 where table1.column = table2.column
• Calculate a scalar value: select count(column) from .. where .. # sum, AVG, max, min.
• Create a table: create table tablename
• Delete a table: drop table tablename
• Insert a new tuple: insert into tablename values ()
• Change tuple contents: update tablename set column=‘new value’ where column=‘old value’
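The query and manipulation commands above can be run end to end against an in-memory SQLite database via Python's sqlite3 module (a minimal sketch; the table and column names are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# DDL + DML: create a table and insert tuples.
cur.execute("create table employee (empID text, name text, city text)")
cur.execute("insert into employee values ('E1', 'Meier', 'Zurich')")
cur.execute("insert into employee values ('E2', 'Huber', 'Bern')")

# Selection + projection: where clause filters rows, select list picks columns.
rows = cur.execute("select name from employee where city = 'Bern'").fetchall()
print(rows)  # [('Huber',)]

# Manipulation and aggregation: update a tuple, then count all tuples.
cur.execute("update employee set city = 'Basel' where city = 'Bern'")
count = cur.execute("select count(*) from employee").fetchone()[0]
print(count)  # 2

cur.execute("drop table employee")
con.close()
```

Note that `count(*)` still returns 2 after the update: update changes attribute values but does not add or remove tuples.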
QBE
The language Query by Example (QBE) is a database language that allows users to create and execute their analysis needs directly
within the table using interactive examples.
Query by Example, unlike SQL, cannot be used to define tables or assign and manage user permissions (DDL). QBE is limited to the query and manipulation part (DML). QBE is relationally complete (QL), just like SQL, but with some limits on recursion (queries-of-queries). Translating back and forth between the two can deepen the knowledge of both languages. Analysis specialists prefer SQL for their sophisticated queries.
Graph based languages
Graph databases store data in graph structures. Like relational languages, graph-based languages provide options for:
• data manipulation on a graph transformation level (DML)
• query language (QL)
• programming language for graph structures (nodes and edges)
• set-based (set of vertices and set of edges)
• filtering by predicate (filtering returns a subset of nodes and edges, i.e. a partial graph)
• features for aggregating sets of nodes in the graph into scalar values, e.g., counts, sums, or minimums
Additional features (not available for relational languages):
• path analysis
• indirect (more than one edge) recursive relationship
Recursive relationship in SQL:
with recursive
rpath (partID, hasPartID, length) as (      -- CTE definition
    select partID, hasPartID, 1             -- initialization
    from part
    union all
    select r.partID, p.hasPartID, r.length + 1
    from part p
    join rpath r                            -- recursive join of CTE
    on (r.hasPartID = p.partID)
)
select distinct rpath.partID, rpath.hasPartID, rpath.length
from rpath                                  -- selection via recursively defined CTE
Parts (products) can potentially have multiple subparts (components) and at the same time also potentially be a subpart (component)
to another, superordinate part (product).
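The recursive part-of query can be executed as written on SQLite, which supports WITH RECURSIVE (the part hierarchy below is invented: P1 contains P2, and P2 contains P3):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("create table part (partID text, hasPartID text)")
cur.executemany("insert into part values (?, ?)",
                [("P1", "P2"), ("P2", "P3")])

rows = cur.execute("""
    with recursive
    rpath (partID, hasPartID, length) as (
        select partID, hasPartID, 1 from part
        union all
        select r.partID, p.hasPartID, r.length + 1
        from part p join rpath r on (r.hasPartID = p.partID)
    )
    select distinct partID, hasPartID, length from rpath
    order by partID, hasPartID
""").fetchall()
print(rows)  # [('P1', 'P2', 1), ('P1', 'P3', 2), ('P2', 'P3', 1)]
con.close()
```

The indirect relationship ('P1', 'P3', 2) is exactly what plain (non-recursive) SQL cannot derive in a single query of fixed length.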
Recursive relationship in Cypher:
MATCH path = (p:Part) <-[:HAS*]- (has:Part)
RETURN p.partID, has.partID, LENGTH(path)
HAS* defines the set of all possible concatenations of connections with the edge type HAS.
CYPHER
Cypher is based on a pattern matching mechanism. It offers language commands for data queries (QL) and data manipulation (data manipulation language, DML). Node and edge types are implicit: they are defined by inserting instances. The data definition language (DDL) part of Cypher can therefore only describe indexes, unique constraints, and statistics. Cypher does not include any direct linguistic elements for security mechanisms.
• Selection and projection (relational algebra): MATCH (m:Movie) WHERE m.released > 2008 RETURN m.released ORDER BY m.released
• Cartesian product: MATCH (m:Movie), (p:Person) RETURN m.title, p.name
• Join: MATCH (m:Movie) <-[:ACTED_IN]- (p:Person) RETURN m.title, p.name
• Left join: MATCH (m:Movie) OPTIONAL MATCH (m) <-[:REVIEWED]- (p:Person) RETURN m.title, p.name
• Aggregate functions: MATCH (m:Movie) <-[:REVIEWED]- (p:Person) RETURN m.title, count(p.name) # others: avg, max, min, stdev, sum, collect
Chapter 4: consistency assurance
Data consistency
Consistency and integrity of a database describe a state in which the stored data does not contradict itself.
Integrity constraints are applied to ensure data consistency during all insert and update operations.
Striving for full consistency is not always desirable. — CAP theorem
MULTI USER OPERATIONS
• Simultaneous accesses —> conflicts —> consistency rules could be violated.
• Transaction: set of rules that are executed to achieve a target. It can contain a single or multiple independent instructions for
accessing or modifying data. They are bound by integrity rules, which update database states while maintaining consistency. Once
the transaction is complete the consistency is maintained.
The transaction has to be atomic, consistent, isolated, and durable:
ACID CONCEPT OF TRANSACTIONS — only RELATIONAL DATABASE SYSTEMS
• Atomicity: transactions either succeed completely or fail completely —> incomplete transactions must be completed or rolled back before other transactions can see their effects.
• Consistency: consistent-DB→transaction (with temporary inconsistencies)→consistent-DB.
• Isolation: multi-user transactions behave as if executed SEQUENTIALLY (concurrent transactions may need to wait).
• Durability: the effects of a correctly completed transaction are retained permanently, even across system failures.
DBMS’s guarantee that all users can only make changes that lead from one consistent database state to another (consistent
database).
SERIALIZABILITY OF CONCURRENT TRANSACTIONS: Concurrent transactions require access to the same object at the same time; they must be serialized so that database users can work independently of each other.
pessimistic method: uses locks. Only one transaction can access the object; all other transactions must wait until the object is unlocked. Ex: 2PL (two-phase locking protocol) —> splits the transaction into an expanding (locking) phase and a shrinking (unlocking) phase.
dichotomy: read-locks are shared (read-only access) + write-locks are exclusive (permit write access to the object)
time stamps: allow for strictly ordered object access according to the age of the transactions.
optimistic method assumption: “Conflicts between concurrent transactions will be rare occurrences.”
transactions are split into three phases: read (transactions run simultaneously), validate (check for conflicts), and write.
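The two-phase locking idea can be sketched in Python (an illustrative simulation with threading locks, not a DBMS implementation; the account names and amounts are invented). Each transfer acquires all locks it needs in the expanding phase and releases them only afterwards in the shrinking phase:

```python
import threading

accounts = {"A": 100, "B": 100}
locks = {name: threading.Lock() for name in accounts}

def transfer(src, dst, amount):
    # Expanding phase: acquire every needed lock before the first write.
    # Locking in a fixed global order additionally avoids deadlocks.
    for name in sorted((src, dst)):
        locks[name].acquire()
    try:
        accounts[src] -= amount
        accounts[dst] += amount
    finally:
        # Shrinking phase: release all locks; no new locks after this point.
        for name in sorted((src, dst)):
            locks[name].release()

threads = [threading.Thread(target=transfer, args=("A", "B", 10))
           for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(accounts["A"], accounts["B"])  # 50 150 — the total is preserved
```

Because no transaction ever observes the accounts between the two writes, every interleaving is equivalent to some serial execution: exactly the serializability guarantee 2PL provides.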
In Web-based applications, availability and partition tolerance take priority over strict consistency.
BASE MODEL - NOSQL DATABASE SYSTEMS
To enable flexibility and scalability while partially sacrificing consistency, BASE is adopted:
• Basically Available: The system is guaranteed to be available for querying (read access) by all users (without isolating queries).
Highly distributed approach: instead of maintaining a single large data store and focusing on the fault tolerance of that store,
BASE databases spread data across many storage systems with a high degree of replication. In the unlikely event that a failure
disrupts access to a segment of data, this does not necessarily result in a complete database outage.
• Soft State: The values stored in the system may change because of the “eventual consistency” model. In the BASE model data
consistency is the developer’s problem and should not be handled by the database.
• Eventually Consistent: As data is added to the system, the system’s state is gradually replicated across all nodes. Again, during the
short period of time before all updated blocks are replicated, the state of the file system isn’t consistent. It will not always
happen but users should know that it might.
CAP theorem: any distributed database can guarantee at most 2 out of these 3 properties:
• Consistency: when a transaction changes data in a distributed database with replicated nodes, all reading transactions receive the current state, no matter from which node they access the database.
• Availability: running applications operate continuously and have acceptable response times.
• Partition tolerance: failures of individual nodes or of connections between nodes in a replicated computer network do not impact the entire system, and nodes can be added or removed at any time without having to stop operation (warm configuration). Partition tolerance allows replicated computer nodes to temporarily hold diverging data versions and to be updated with a delay.
Chapter 5: system architecture
OPTIMISATION OF RELATIONAL QUERIES
When expressions combined in a different order generate the same result, they are called equivalent expressions.
• Equivalent expressions allow query optimisation without affecting the result.
• An optimised equivalent expression reduces the computational expense.
Ex: structure of a query tree: root node —> branches —> leaf nodes.
ALGEBRAIC OPTIMISATION PRINCIPLES
Multiple selections on one table can be merged into one so the selection predicate only has to be validated once.
Selections should be made as early as possible to keep intermediate results (tables) small. Then, the selection operators should be
placed as close to the leaves (source tables) as possible.
Projections should also be run as early as possible, but never before selections. Projection operations reduce the number of columns
and often also the tuples.
Join operators should be calculated near the root node of the query tree, since they require a great deal of computational expense.
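The "selection as early as possible" principle can be demonstrated with plain Python list comprehensions standing in for the algebra (an illustrative sketch; the employee/department data and the department names are invented):

```python
# 1000 employees spread over 10 departments; only one department is "IT".
employees = [("E%d" % i, "D%d" % (i % 10)) for i in range(1000)]  # (empID, deptID)
departments = [("D%d" % i, "IT" if i == 3 else "Other") for i in range(10)]

# Late selection: join first, filter afterwards -> large intermediate result.
joined = [(e, d, name) for (e, d) in employees
          for (d2, name) in departments if d == d2]
late = [row for row in joined if row[2] == "IT"]

# Early selection: push the filter below the join -> small intermediate result.
it_depts = [(d, name) for (d, name) in departments if name == "IT"]
early = [(e, d, name) for (e, d) in employees
         for (d2, name) in it_depts if d == d2]

print(len(joined), len(early))        # 1000 intermediate rows vs. 100
assert sorted(late) == sorted(early)  # equivalent expressions, same result
```

Both plans return the identical 100 rows, but the early-selection plan never materialises the 1000-row join: the same saving a query optimiser achieves by moving σ operators toward the leaves of the query tree.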
Chapter 6: post-relational databases
Examples: NoSQL, graph, distributed, and temporal database systems, etc.
• Distributed databases: data is stored across different physical locations (multiple computers or a network of interconnected computers).
- Centralised vs. federated: data is stored, maintained, and processed in different places.
- Standalone vs. distributed: data is stored on different computers.
- Replication: copying data onto different computers.
- Fragmentation: splitting data into smaller parts across several computers.
The user is unaware of the physical fragments; the database system itself performs the database operations locally or, if necessary, splits them between several computers.
Vertical fragmentation: groups several columns together with the identification key. One example is the EMPLOYEE table, where confidential columns such as salary, qualifications, development potential, etc., would be kept in a vertical fragment restricted to the HR department, while the remaining information is made accessible to the individual departments in another fragment.
Horizontal fragmentation: an important task for administrators. It keeps the original structure of the table and selects full rows based on some condition; the fragments do not overlap.
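Both fragmentation styles can be sketched in Python (illustrative only; the EMPLOYEE attributes and the split criteria are invented):

```python
employees = [
    {"empID": "E1", "name": "Meier", "dept": "IT", "salary": 90000},
    {"empID": "E2", "name": "Huber", "dept": "HR", "salary": 80000},
    {"empID": "E3", "name": "Schmid", "dept": "IT", "salary": 95000},
]

# Vertical fragmentation: confidential columns go into an HR-only fragment;
# both fragments keep the identification key empID for reconstruction.
hr_fragment = [{"empID": e["empID"], "salary": e["salary"]} for e in employees]
public_fragment = [{"empID": e["empID"], "name": e["name"], "dept": e["dept"]}
                   for e in employees]

# Horizontal fragmentation: full rows, split by a non-overlapping condition.
it_rows = [e for e in employees if e["dept"] == "IT"]
other_rows = [e for e in employees if e["dept"] != "IT"]

# Reconstruction: horizontal fragments are re-united; vertical fragments
# would be re-joined over the key empID.
assert sorted(r["empID"] for r in it_rows + other_rows) == ["E1", "E2", "E3"]
```

The non-overlap condition matters: because every row satisfies exactly one of the two predicates, the union of the horizontal fragments reproduces the original table without duplicates.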
Distributed databases use remote access, allowing distributed transactions.
Temporal databases: date, time, durations (employee’s age, months of product storage). Temporal databases are designed to relate
data values, individual tuples, or whole tables to the time axis.
- instant: date/time data types
- period: integer/float data types
- transaction time: instant of entering/changing data
New attributes: VALID_FROM, VALID_TO, VALID_AT.
Temporal database system (TDBMS)
Supports the time axis as valid time by ordering attribute values or tuples by time and contains temporal language elements for
queries into future, present, and past.
The SQL standard supports temporal databases with VALID_FROM, VALID_TO and VALID_AT attributes.
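A temporal table with VALID_FROM / VALID_TO attributes can be queried "as of" a given instant with plain SQL, as sketched here on SQLite (the salary history and dates are invented; a full TDBMS, or SQL:2011 period support, would offer dedicated temporal syntax instead of these plain attributes):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("""create table emp_salary
               (empID text, salary int, valid_from text, valid_to text)""")
cur.executemany("insert into emp_salary values (?, ?, ?, ?)", [
    ("E1", 80000, "2020-01-01", "2021-12-31"),
    ("E1", 90000, "2022-01-01", "9999-12-31"),  # current version, open-ended
])

# VALID_AT query: which salary was valid on 2021-06-30?
row = cur.execute("""select salary from emp_salary
                     where empID = 'E1'
                       and valid_from <= '2021-06-30'
                       and valid_to   >= '2021-06-30'""").fetchone()
print(row[0])  # 80000
con.close()
```

Because old tuples are closed (their VALID_TO set) rather than overwritten, past states remain queryable: exactly the historicisation a temporal database provides.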
Multidimensional databases: all decision-relevant information is stored according to various analysis dimensions (data cube) and indicators (or facts).
The indicator is placed at the center of a star schema, surrounded by the dimension tables.
Online Transaction Processing (OLTP): transactions aim to provide data for business handling as quickly and precisely as possible.
• Databases are designed primarily for day-to-day business
• Historically, operative data are overwritten daily, losing important information for decision-making.
Online Analytical Processing (OLAP): more recently, new specialized (multidimensional) databases have been developed for data analysis and decision support. Ex: sales analysed in a multidimensional database by time, region, or product.
Dimensions can be structured further: they also describe aggregation levels.
drill-down: go into more detail for deeper analysis
roll-up: analyse at higher aggregation levels
Standard SQL is of little use here. A multidimensional database management system supports a data cube with the following functionality:
• Possibilities to define several dimension tables with arbitrary aggregation levels, especially for the time dimension
• The analysis language offers functions for drill-down and roll-up.
• Slicing, dicing, and rotation of data cubes are supported.
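A roll-up over a small data cube can be sketched in Python (illustrative only; the indicator "sales" and all dimension values are invented):

```python
from collections import defaultdict

# Data cube: sales indicator keyed by (time, region, product).
cube = {
    ("Q1", "North", "A"): 10, ("Q1", "South", "A"): 20,
    ("Q2", "North", "A"): 30, ("Q2", "North", "B"): 40,
}

def roll_up(cube, keep):
    """Aggregate the cube onto the dimension positions listed in `keep`."""
    out = defaultdict(int)
    for dims, sales in cube.items():
        out[tuple(dims[i] for i in keep)] += sales
    return dict(out)

# Roll-up to the time dimension only; drill-down is the inverse direction,
# going back to the finer-grained cells.
print(roll_up(cube, keep=[0]))       # {('Q1',): 30, ('Q2',): 70}
print(roll_up(cube, keep=[0, 1]))    # slice by time and region
```

Slicing corresponds to fixing one dimension value before aggregating; rotation merely reorders the `keep` positions.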
BUSINESS INTELLIGENCE
Comprises the strategies and technologies used by enterprises for data analysis and the management of business information. It provides facts that can be gathered from the analysis of the available data through:
• Integration of heterogeneous data —> federated database systems
• Historicisation of current and volatile data —> temporal databases
• Complete availability of data on subject areas —> multidimensional databases
DATA WAREHOUSE (DWH)
Central repositories of integrated data from one or more disparate sources used for reporting and data analysis in business
intelligence.
Data warehousing implements:
• federated database systems (FDBMS),
• temporal database systems (TDBMS), and
• multidimensional database systems (MDBMS)
using specific simulation of these DBMS in a distributed relational database technology with these properties:
• Integration of data from various sources (internal or external) and applications in a uniform schema
• Read-only data. The data warehouse is not changed once it is written.
• Historicization of data adding a timeline (validity attributes) in the central storage.
• Analysis-oriented, so that all data on different areas (dimensions) is fully available in one place (data cube).
• Decision support, because the information in data cubes serves as a basis for management decisions (business purposes).
KNOWLEDGE DATABASES
They can manage facts (TRUE/FALSE values) and rules (relationships that allow new contents to be deduced from the stored facts).
Expert system: information system that provides specialist knowledge and conclusions for a certain limited field of application.
The fields of databases, programming languages, and artificial intelligence will increasingly influence each other and in the future
provide efficient problem-solving processes for practical application.