PMIT-6102 Advanced Database Systems
By Jesmin Akhter, Assistant Professor, IIT, Jahangirnagar University

Lecture 02
Introduction to DDBMS
Overview of Relational DBMS

Introduction
Distributed DBMS Promises
Problem Areas
Architectural Models for Distributed DBMSs

Slide 3
In centralized database systems, the only available resource that needs to be shielded from the user is the data. In a distributed database environment there is a second resource that needs to be managed in much the same manner: the network. The user should be protected from the operational details of the network, possibly even from knowing that the network exists. There would then be no difference between database applications that run on a centralized database and those that run on a distributed database. This type of transparency is referred to as network transparency or distribution transparency.

Slide 4
From a DBMS perspective, distribution transparency requires that users do not have to specify where data are located. Sometimes two types of distribution transparency are identified: location transparency and naming transparency.

Slide 5
Location transparency refers to the fact that the command used to perform a task is independent of both the location of the data and the system on which an operation is carried out.
Naming transparency means that a unique name is provided for each object in the database. In the absence of naming transparency, users are required to embed the location name as part of the object name.

Slide 6
Data can be distributed in a replicated fashion across the machines on a network. If one of the machines fails, a copy of the data is still available on another machine on the network. Replication increases the reliability and availability of data. It also increases the locality of reference.

Slide 7
When data are replicated, the transparency issue is this: users should not be aware of the existence of copies, and the system should handle the management of copies.
Users should not have to handle copies themselves, or to specify that a certain action can or should be taken on multiple copies.

Slide 8
Fragmentation can increase performance, availability and reliability. It can also reduce the negative effects of replication: each replica is not the full relation but only a subset of it, so less space is required and fewer data items need to be managed.

Slide 9
Horizontal fragmentation: a relation is partitioned into a set of sub-relations, each of which has a subset of the tuples (rows) of the original relation.
Vertical fragmentation: each sub-relation is defined on a subset of the attributes (columns) of the original relation.

Slide 10
Distributed DBMSs are intended to improve reliability, since they have replicated components and thereby eliminate single points of failure. The failure of a single site, or the failure of a communication link that makes one or more sites unreachable, is not sufficient to bring down the entire system.

Slide 11
A distributed DBMS can store data in close proximity to its points of use (also called data localization). This requires some support for fragmentation and replication. It has two potential advantages:
1. Since each site handles only a portion of the database, contention for CPU and I/O services is not as severe as for centralized databases.
2. Localization reduces the remote access delays that are usually involved in wide area networks.

Slide 12
The issue is database scaling. One aspect of easier system expansion is economics: it normally costs much less to put together a system of "smaller" computers with the equivalent power of a single big machine.

Slide 13
First, data may be replicated in a distributed environment. A distributed database can be designed so that the entire database, or portions of it, reside at different sites of a computer network.
Second, if some sites fail (e.g., through hardware or software malfunction), or if some communication links fail (making some of the sites unreachable) while an update is being executed, the effects will not be reflected on the data residing at the failing or unreachable sites.
Third, since each site cannot have instantaneous information on the actions currently being carried out at the other sites, the synchronization of transactions on multiple sites is considerably harder than for a centralized system.

Slide 14
The possible ways in which a distributed DBMS may be architected are classified along three dimensions: (1) the autonomy of local systems, (2) their distribution, and (3) their heterogeneity.

Slide 15
Autonomy
Autonomy refers to the distribution (or decentralization) of control, not of data. It indicates the degree to which individual DBMSs can operate independently. Autonomy is a function of a number of factors, such as whether the component systems (i.e., individual DBMSs) exchange information, whether they can independently execute transactions, and whether one is allowed to modify them.

Slide 16
Dimensions of Autonomy
Design autonomy: individual DBMSs are free to use the data models and transaction management techniques that they prefer.
Communication autonomy: each of the individual DBMSs is free to make its own decision as to what type of information it wants to provide to the other DBMSs or to the software that controls their global execution.
Execution autonomy: each DBMS can execute the transactions that are submitted to it in any way that it wants to.

Slide 17
Distribution
The distribution dimension of the taxonomy deals with data: the physical distribution of data over multiple sites, while the user sees the data as one logical pool. DBMSs have been distributed in a number of ways, which fall into two classes: client/server distribution and peer-to-peer distribution (or full distribution).
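To make the "one logical pool" idea of Slide 17 concrete, here is a minimal Python sketch of the horizontal and vertical fragmentation defined on Slide 9. It is only an illustration under stated assumptions: the account rows and the function names are made up for this example and do not come from the lecture.

```python
# Hypothetical account relation: (account-number, branch-name, balance).
# The rows are made-up illustration values, not data from the lecture.
ACCOUNT = {
    ("A-1", "Downtown", 500),
    ("A-2", "Perryridge", 400),
    ("A-3", "Downtown", 900),
}


def horizontal_fragments(relation, branch_index=1):
    """Partition the rows by branch-name: each fragment holds a subset of the tuples."""
    fragments = {}
    for row in relation:
        fragments.setdefault(row[branch_index], set()).add(row)
    return fragments


def vertical_fragments(relation):
    """Split the columns into two fragments; the key is repeated so rows can be rejoined."""
    left = {(acc, branch) for acc, branch, _ in relation}
    right = {(acc, balance) for acc, _, balance in relation}
    return left, right


def rebuild_from_horizontal(fragments):
    """The union of the horizontal fragments gives back the logical relation."""
    result = set()
    for rows in fragments.values():
        result |= rows
    return result


def rebuild_from_vertical(left, right):
    """Join the vertical fragments on the shared key (account-number)."""
    balances = dict(right)
    return {(acc, branch, balances[acc]) for acc, branch in left}


by_branch = horizontal_fragments(ACCOUNT)      # e.g. one fragment per site
left, right = vertical_fragments(ACCOUNT)
```

In this sketch each horizontal fragment could be stored at the site of its branch, while a user query would still see the rebuilt relation. A real distributed DBMS would do this reconstruction inside the system rather than in application code.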
Slide 18
Client/server distribution
The client/server distribution concentrates data management duties at servers, while the clients focus on providing the application environment, including the user interface. The communication duties are shared between the client machines and servers.

Slide 19
Peer-to-peer distribution (or full distribution)
In peer-to-peer systems, there is no distinction between client machines and servers. Each machine has full DBMS functionality and can communicate with other machines to execute queries and transactions.

Slide 20
Heterogeneity
Heterogeneity ranges from hardware heterogeneity and differences in networking protocols to variations in data managers. Heterogeneity in query languages involves not only the use of completely different data access paradigms in different data models, but also differences in languages even when the individual systems use the same data model.

Slide 21
Overview of Relational DBMS
Structure of Relational Databases
Relational Algebra

Slide 22
Most distributed database technology has been developed using the relational model. It is a very simple model, and often a good match for the way we think about our data.
Example of a relation: account (account-number, branch-name, balance)

Slide 23
Simplest approach (not always best): convert each entity set to a relation and each relationship to a relation. Entity set attributes become relational attributes.
For example, the entity set account with attributes account-number, branch-name and balance becomes:
account (account-number, branch-name, balance)

Slide 24
Table = relation. Column headers = attributes. Row = tuple.
Relation schema = name(attributes) + other structure information, e.g., keys and other constraints.
Example: account (account-number, branch-name, balance)
The order of attributes is arbitrary, but in practice we need to assume the order given in the relation schema.
A relation instance is the current set of rows for a relation schema.
Database schema = collection of relation schemas.
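The vocabulary of Slide 24 (relation schema, relation instance, database schema) can be mirrored in a few lines of Python. This is a rough sketch, not part of the lecture: the class name, the underscore spelling of the attribute names, and the two rows are assumptions made for illustration.

```python
from typing import NamedTuple


class Account(NamedTuple):
    """Relation schema account(account-number, branch-name, balance)."""
    account_number: str
    branch_name: str
    balance: int


# Relation instance: the current set of rows for the schema.
# The rows are made-up illustration values.
account = {
    Account("A-1", "Downtown", 500),
    Account("A-2", "Perryridge", 400),
}

# A relation is a set, so adding an identical tuple changes nothing.
account.add(Account("A-1", "Downtown", 500))

# Database schema: a collection of relation schemas, keyed by relation name.
database_schema = {"account": Account._fields}

# Components are reached by attribute name, so code does not depend on column order.
balances = {row.account_number: row.balance for row in account}
```

Using a set for the instance reflects the point that a relation holds no duplicate tuples, and the named fields reflect the fixed attribute order given by the schema.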
Slide 25
Relation as table
Rows = tuples. Columns = components. Names of columns = attributes. Set of attribute names = schema.

REL (A1, A2, ..., An)

A1   A2   A3   ...  An
a1   a2   a3   ...  an
b1   b2   a3   ...  cn
a1   c3   b3   ...  bn
...
x1   v2   d3   ...  wn

[Figure labels: the column headers are the attributes, a row is a tuple, a cell is a component, the number of columns is the arity, and the number of rows is the cardinality.]

Set theoretic
Domain: a set of values, like a data type.
n-tuples $(V_1, V_2, \ldots, V_n)$ such that $V_1 \in D_1, V_2 \in D_2, \ldots, V_n \in D_n$.
Tuples = members of a relation instance.
Arity = number of domains.
Components = values in a tuple.
Domains correspond with attributes.
Cardinality = number of tuples.

Slide 26
Each attribute of a relation has a name.
The set of allowed values for each attribute is called the domain of the attribute.
Attribute values are (normally) required to be atomic, that is, indivisible.
E.g. multivalued attribute values are not atomic.
E.g. composite attribute values are not atomic.
The special value null is a member of every domain.

Slide 27
$A_1, A_2, \ldots, A_n$ are attributes.
$R = (A_1, A_2, \ldots, A_n)$ is a relation schema.
E.g. Customer-schema = (customer-name, customer-street, customer-city)
r(R) is a relation on the relation schema R.
E.g. customer (Customer-schema)

Slide 28
The current values (relation instance) of a relation are specified by a table. An element t of r is a tuple, represented by a row in a table.
The customer relation, with attributes as columns and tuples as rows:

customer-name  customer-street  customer-city
Jones          Main             Harrison
Smith          North            Rye
Curry          North            Rye
Lindsay        Park             Pittsfield

Slide 29
A database consists of multiple relations. Information about an enterprise is broken up into parts, with each relation storing one part of the information. E.g.:
account: stores information about accounts
depositor: stores information about which customer owns which account
customer: stores information about customers
Storing all information as a single relation such as bank(account-number, balance, customer-name, ..) results in
repetition of information (e.g. a customer who owns two accounts)
the need for null values (e.g. to represent a customer without an account)
Normalization theory deals with how to design relational schemas.

Slide 30
[Figures: the customer, depositor, branch, account and borrower relations]

The loan relation:

loan-number  branch-name  amount
L-11         Round Hill   900
L-14         Downtown     1500
L-15         Perryridge   1500
L-16         Perryridge   1300
L-17         Downtown     1000
L-23         Redwood      2000
L-93         Mianus       500

Slide 32
A superkey is a set of attributes within a table whose values can be used to uniquely identify a tuple. A candidate key is a minimal set of attributes necessary to identify a tuple; it is also called a minimal superkey.
For example, given an employee schema consisting of the attributes employeeID, name, job, and departmentID, we could use employeeID in combination with any or all other attributes of this table to uniquely identify a tuple in the table. Examples of superkeys in this schema would be {employeeID}, {employeeID, name}, {employeeID, name, job}, and {employeeID, name, job, departmentID}. The last example is known as the trivial superkey, because it uses all attributes of the table to identify the tuple.
In a real database we do not need values for all of those attributes to identify a tuple. We only need, per our example, the set {employeeID}. This is a minimal superkey, that is, a minimal set of attributes that can be used to identify a single tuple. So employeeID is a candidate key. Although several candidate keys may exist, one of the candidate keys is selected to be the primary key.

Slide 33
Strong entity set. The primary key of the entity set becomes the primary key of the relation.
Weak entity set. The primary key of the relation consists of the union of the primary key of the strong entity set and the discriminator of the weak entity set.
Relationship set. The union of the primary keys of the related entity sets becomes a superkey of the relation.
For binary many-to-one relationship sets, the primary key of the "many" entity set becomes the relation's primary key.
For one-to-one relationship sets, the relation's primary key can be that of either entity set.
For many-to-many relationship sets, the union of the primary keys becomes the relation's primary key.

Slide 34
A query language is a language in which a user requests information from the database.
Categories of languages:
Procedural: the user instructs the system to perform a sequence of operations on the database to compute the desired result.
Non-procedural: the user describes the desired information without giving a specific procedure for obtaining that information.
"Pure" languages: relational algebra, tuple relational calculus, domain relational calculus.

Slide 35
Relational algebra is a procedural language with six basic operators: select, project, union, set difference, Cartesian product, and rename.
The operators take one or two relations as inputs and give a new relation as a result.

Slide 36
Select Operation: Example
[Figure: relation r and the result of $\sigma_{A=B \land D>5}(r)$]

Slide 37
Notation: $\sigma_p(r)$
p is called the selection predicate.
Defined as: $\sigma_p(r) = \{t \mid t \in r \text{ and } p(t)\}$
where p is a formula in propositional calculus consisting of terms connected by $\land$ (and), $\lor$ (or), $\lnot$ (not).
Each term is one of: <attribute> op <attribute> or <constant>, where op is one of $=, \ne, >, \ge, <, \le$.

Slide 38
Example of selection: $\sigma_{\text{branch-name} = \text{"Perryridge"}}(\text{loan})$

Slide 39
[Figure: relation r and the result of $\Pi_{A,C}(r)$, with duplicate rows removed]

Slide 40
Notation: $\Pi_{A_1, A_2, \ldots, A_k}(r)$, where $A_1, A_2, \ldots, A_k$ are attribute names and r is a relation name.
The result is defined as the relation of k columns obtained by erasing the columns that are not listed.
Duplicate rows are removed from the result, since relations are sets.
E.g. to eliminate the branch-name attribute of account: $\Pi_{\text{account-number, balance}}(\text{account})$

Slide 41
[Figure: relations r and s, and the result of $r \cup s$]

Slide 42
Notation: $r \cup s$
Defined as: $r \cup s = \{t \mid t \in r \text{ or } t \in s\}$
For $r \cup s$ to be valid:
1. r and s must have the same arity (same number of attributes).
2. The attribute domains must be compatible (e.g., the 2nd column of r deals with the same type of values as does the 2nd column of s).
E.g. to find all customers with either an account or a loan: $\Pi_{\text{customer-name}}(\text{depositor}) \cup \Pi_{\text{customer-name}}(\text{borrower})$

Slide 43
Union Operation
Names of all customers who have either a loan or an account:
$\Pi_{\text{customer-name}}(\text{depositor}) \cup \Pi_{\text{customer-name}}(\text{borrower})$

Slide 44
Notation: $r - s$
Defined as: $r - s = \{t \mid t \in r \text{ and } t \notin s\}$
Set differences must be taken between compatible relations:
r and s must have the same arity
the attribute domains of r and s must be compatible

Slide 45
[Figure: relations r and s, and the result of $r - s$]

Slide 46
Notation: $r \times s$
Defined as: $r \times s = \{t\,q \mid t \in r \text{ and } q \in s\}$
Assume that the attributes of r(R) and s(S) are disjoint (that is, $R \cap S = \emptyset$). If the attributes of r(R) and s(S) are not disjoint, then renaming must be used.

Slide 47
Cartesian-Product Operation: Example
[Figure: relations r and s, and the result of $r \times s$]

Slide 48
Expressions can be built using multiple operations.
Example: $\sigma_{A=C}(r \times s)$
[Figure: the result of $r \times s$ and the result of $\sigma_{A=C}(r \times s)$]

Slide 49
The rename operation allows us to refer to a relation by more than one name.
Example: $\rho_x(E)$ returns the expression E under the name x.
If a relational-algebra expression E has arity n, then $\rho_{x(A_1, A_2, \ldots, A_n)}(E)$ returns the result of expression E under the name x, with the attributes renamed to $A_1, A_2, \ldots, A_n$.
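The six basic operators of Slides 35 to 49 can be sketched in Python by treating a relation as a list of attribute names plus a set of tuples. This is a minimal illustration under assumptions, not a DBMS implementation: the `Relation` type, the function names and the sample relations r and s are invented for this example, and domain compatibility is not checked (only arity is).

```python
from typing import NamedTuple


class Relation(NamedTuple):
    attrs: tuple      # attribute names, in schema order
    rows: frozenset   # set of tuples, so duplicates cannot occur


def make_relation(attrs, rows):
    return Relation(tuple(attrs), frozenset(tuple(row) for row in rows))


def select(r, predicate):
    """sigma_p(r): keep the tuples for which predicate(tuple as dict) holds."""
    keep = (t for t in r.rows if predicate(dict(zip(r.attrs, t))))
    return Relation(r.attrs, frozenset(keep))


def project(r, attrs):
    """Pi_attrs(r): keep the listed columns; the set removes duplicate rows."""
    positions = [r.attrs.index(a) for a in attrs]
    rows = frozenset(tuple(t[i] for i in positions) for t in r.rows)
    return Relation(tuple(attrs), rows)


def _check_arity(r, s):
    if len(r.attrs) != len(s.attrs):
        raise ValueError("relations must have the same arity")


def union(r, s):
    _check_arity(r, s)
    return Relation(r.attrs, r.rows | s.rows)


def difference(r, s):
    _check_arity(r, s)
    return Relation(r.attrs, r.rows - s.rows)


def product(r, s):
    """r x s: concatenate every tuple of r with every tuple of s."""
    if set(r.attrs) & set(s.attrs):
        raise ValueError("attribute names overlap; rename first")
    rows = frozenset(t + q for t in r.rows for q in s.rows)
    return Relation(r.attrs + s.attrs, rows)


def rename(r, attrs):
    """rho: same tuples, new attribute names."""
    if len(attrs) != len(r.attrs):
        raise ValueError("need one new name per attribute")
    return Relation(tuple(attrs), r.rows)


# Made-up sample relations for the composed expression sigma_{A=C}(r x s).
r = make_relation(("A", "B"), [(1, 2), (3, 4)])
s = make_relation(("C", "D"), [(1, "x"), (5, "y")])
joined = select(product(r, s), lambda t: t["A"] == t["C"])
```

Each function takes one or two relations and returns a new relation, which is what lets the operators be nested as on Slide 48. Representing rows as a frozenset is a design choice that makes duplicate elimination in project automatic.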
Slide 50
branch (branch-name, branch-city, assets)
customer (customer-name, customer-street, customer-city)
account (account-number, branch-name, balance)
loan (loan-number, branch-name, amount)
depositor (customer-name, account-number)
borrower (customer-name, loan-number)

Slide 51
Find all loans of over $1200:
$\sigma_{\text{amount} > 1200}(\text{loan})$

loan-number  branch-name  amount
L-14         Downtown     1500
L-15         Perryridge   1500
L-16         Perryridge   1300
L-23         Redwood      2000

Find the loan number for each loan of an amount greater than $1200:
$\Pi_{\text{loan-number}}(\sigma_{\text{amount} > 1200}(\text{loan}))$

loan-number
L-14
L-15
L-16
L-23

Slide 52
Find the names of all customers who have a loan, an account, or both, from the bank:
$\Pi_{\text{customer-name}}(\text{borrower}) \cup \Pi_{\text{customer-name}}(\text{depositor})$
Find the names of all customers who have a loan and an account at the bank:
$\Pi_{\text{customer-name}}(\text{borrower}) \cap \Pi_{\text{customer-name}}(\text{depositor})$

Slide 53
Find the names of all customers who have a loan at the Perryridge branch:
$\Pi_{\text{customer-name}}(\sigma_{\text{branch-name} = \text{"Perryridge"}}(\sigma_{\text{borrower.loan-number} = \text{loan.loan-number}}(\text{borrower} \times \text{loan})))$
Find the names of all customers who have a loan at the Perryridge branch but do not have an account at any branch of the bank:
$\Pi_{\text{customer-name}}(\sigma_{\text{branch-name} = \text{"Perryridge"}}(\sigma_{\text{borrower.loan-number} = \text{loan.loan-number}}(\text{borrower} \times \text{loan}))) - \Pi_{\text{customer-name}}(\text{depositor})$

Slide 54
[Figure: result of $\text{borrower} \times \text{loan}$]

Slide 55
[Figure: result of $\sigma_{\text{branch-name} = \text{"Perryridge"}}(\text{borrower} \times \text{loan})$]
$\Pi_{\text{customer-name}}(\sigma_{\text{branch-name} = \text{"Perryridge"}}(\sigma_{\text{borrower.loan-number} = \text{loan.loan-number}}(\text{borrower} \times \text{loan})))$
$\Pi_{\text{customer-name}}(\sigma_{\text{branch-name} = \text{"Perryridge"}}(\sigma_{\text{borrower.loan-number} = \text{loan.loan-number}}(\text{borrower} \times \text{loan}))) - \Pi_{\text{customer-name}}(\text{depositor})$

customer-name
Adams

Slide 56
Customers with an account but no loan:
$\Pi_{\text{customer-name}}(\text{depositor}) - \Pi_{\text{customer-name}}(\text{borrower})$
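The example queries of Slides 51 to 55 can be traced with plain Python set comprehensions. The loan rows below are the ones listed in the lecture's loan relation. The borrower and depositor rows are assumptions made only for this sketch, because those tables did not survive in the slides; they were picked so that the final result agrees with the "Adams" answer shown on Slide 55.

```python
# loan(loan-number, branch-name, amount): rows as listed in the lecture.
LOAN = {
    ("L-11", "Round Hill", 900),
    ("L-14", "Downtown", 1500),
    ("L-15", "Perryridge", 1500),
    ("L-16", "Perryridge", 1300),
    ("L-17", "Downtown", 1000),
    ("L-23", "Redwood", 2000),
    ("L-93", "Mianus", 500),
}

# borrower(customer-name, loan-number): ASSUMED rows, for illustration only.
BORROWER = {("Adams", "L-16"), ("Hayes", "L-15"), ("Jones", "L-17")}

# depositor(customer-name, account-number): ASSUMED rows, for illustration only.
DEPOSITOR = {("Hayes", "A-102"), ("Jones", "A-217")}

# sigma_{amount > 1200}(loan)
big_loans = {t for t in LOAN if t[2] > 1200}

# Pi_{loan-number}(sigma_{amount > 1200}(loan))
big_loan_numbers = {t[0] for t in big_loans}

# borrower x loan: tuples of the form
# (customer-name, borrower.loan-number, loan.loan-number, branch-name, amount)
borrower_x_loan = {b + l for b in BORROWER for l in LOAN}

# sigma_{borrower.loan-number = loan.loan-number}(borrower x loan)
matched = {t for t in borrower_x_loan if t[1] == t[2]}

# sigma_{branch-name = "Perryridge"}(...), then Pi_{customer-name}(...)
perryridge_customers = {t[0] for t in matched if t[3] == "Perryridge"}

# ... - Pi_{customer-name}(depositor)
no_account = perryridge_customers - {d[0] for d in DEPOSITOR}

print(sorted(big_loan_numbers))
print(sorted(no_account))
```

Building the full Cartesian product first and filtering afterwards follows the algebra expression literally. It is deliberately naive; a real system would combine the two steps into a join.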