pmit-6102-15-lec2-Intro+RelationalAlgebra

PMIT-6102
Advanced Database Systems
By Jesmin Akhter
Assistant Professor, IIT, Jahangirnagar University
Lecture 02
Introduction to DDBMS
Overview of Relational
DBMS
 Introduction
Distributed DBMS Promises
Problem Areas
Architectural Models for Distributed
DBMSs
Slide 3


In centralized database systems, the only available
resource that needs to be shielded from the user is the
data.
In a distributed database environment, there is
a second resource that needs to be managed in much the
same manner: the network.



The user should be protected from the operational details
of the network, possibly even from the existence of
the network itself.
Then there would be no difference between database
applications that would run on a centralized
database and those that would run on a distributed
database.
This type of transparency is referred to as network
transparency or distribution transparency.
Slide 4


From a DBMS perspective, distribution transparency
requires that users do not have to specify where data
are located.
Sometimes two types of distribution transparency
are identified:
location transparency
naming transparency
Slide 5

Location transparency refers to the fact that the
command used to perform a task is independent of
both the location of the data and the system on which an
operation is carried out.

Naming transparency means that a unique
name is provided for each object in the database.
In the absence of naming transparency, users are required
to embed the location name as part of the object name.
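The difference between the two situations can be sketched in a few lines of Python. This is only an illustration: the catalog, the site names and the stored rows are hypothetical and do not come from any particular DBMS.

```python
# Hypothetical global catalog: logical relation name -> site that stores it.
CATALOG = {"account": "site_a", "loan": "site_b"}

# Hypothetical per-site storage.
SITES = {
    "site_a": {"account": [("A-101", 500)]},
    "site_b": {"loan": [("L-11", 900)]},
}

def read_without_transparency(qualified_name):
    """The user embeds the location in the object name, e.g. 'site_b.loan'."""
    site, relation = qualified_name.split(".", 1)
    return SITES[site][relation]

def read_with_transparency(relation):
    """The user names only the relation; the system looks up where it lives."""
    return SITES[CATALOG[relation]][relation]

rows_a = read_without_transparency("site_b.loan")
rows_b = read_with_transparency("loan")
```

Both calls return the same rows; only the second keeps working unchanged if loan is moved to another site and the catalog is updated.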
Slide 6
Data can be distributed in a replicated fashion
across the machines on a network.
If one of the machines fails, a copy of the
data is still available on another
machine on the network.

Replication increases the reliability and availability of data.
It also increases the locality of reference.
Slide 7

When data are replicated, the transparency issue is:
Users should not be aware of the existence
of copies; the system should handle the
management of copies.
Users should not be involved with handling
copies, nor should they have to specify that a
certain action can and/or should be taken on
multiple copies.
Slide 8


Fragmentation can increase performance, availability and reliability.
Fragmentation can also reduce the negative effects of
replication.
Each replica is not the full relation but only a
subset of it;
thus less space is required and fewer data items
need to be managed.
Slide 9


Horizontal fragmentation: A relation is
partitioned into a set of sub-relations, each of which
has a subset of the tuples (rows) of the original
relation.
Vertical fragmentation: Each sub-relation is defined on a subset of the attributes
(columns) of the original relation.
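A minimal Python sketch of the two kinds of fragmentation follows. The account rows and the helper names are made up for illustration; a real distributed DBMS would define fragments in its own schema language.

```python
# Hypothetical sample relation: each row is a dict keyed by attribute name.
account = [
    {"account-number": "A-101", "branch-name": "Downtown", "balance": 500},
    {"account-number": "A-215", "branch-name": "Mianus", "balance": 700},
    {"account-number": "A-102", "branch-name": "Perryridge", "balance": 400},
]

def horizontal_fragment(relation, predicate):
    """Keep whole rows that satisfy the predicate (a subset of the tuples)."""
    return [row for row in relation if predicate(row)]

def vertical_fragment(relation, attributes):
    """Keep only the listed columns of every row (a subset of the attributes)."""
    return [{a: row[a] for a in attributes} for row in relation]

downtown = horizontal_fragment(account, lambda r: r["branch-name"] == "Downtown")
others = horizontal_fragment(account, lambda r: r["branch-name"] != "Downtown")

# Each vertical fragment repeats the key so the original rows can be rebuilt.
balances = vertical_fragment(account, ["account-number", "balance"])
branches = vertical_fragment(account, ["account-number", "branch-name"])

# Reconstruction: union for horizontal fragments, join on the key for vertical ones.
rebuilt_h = downtown + others
branch_of = {r["account-number"]: r["branch-name"] for r in branches}
rebuilt_v = [{**r, "branch-name": branch_of[r["account-number"]]} for r in balances]
```

Repeating account-number in both vertical fragments is a design choice of this sketch: without a shared key the columns could not be matched up again.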
Slide 10

Distributed DBMSs improve reliability since they have replicated
components and thereby eliminate single points of
failure.
The failure of a single site, or the failure of a
communication link which makes one or more sites
unreachable, is not sufficient to bring down the entire
system.
Slide 11



Data are stored in close proximity to their points of use (also called data
localization).
This requires some support for fragmentation and
replication.
This has two potential advantages:
Since each site handles only a portion of the database,
contention for CPU and I/O services is not as severe as for
centralized databases.
Localization reduces remote access delays that are usually
involved in wide area networks.
Slide 12


The issue is database scaling.
One aspect of easier system expansion is
economics.
It normally costs much less to put together a system
of “smaller” computers with the equivalent power of a
single big machine.
Slide 13

First, data may be replicated in a distributed
environment.
A distributed database can be designed so that the entire
database, or portions of it, reside at different sites of a
computer network.

Second, if some sites fail (e.g., by either hardware
or software malfunction), or if some communication
links fail (making some of the sites unreachable)
while an update is being executed, the effects will not
be reflected on the data residing at the failing or
unreachable sites.

The third point is that since each site cannot have
instantaneous information on the actions currently
being carried out at the other sites,
the synchronization of transactions on multiple sites is
considerably harder than for a centralized system.
Slide 14
Possible ways in which a distributed DBMS may be architected:
(1) Autonomy of local systems,
(2) Their distribution, and
(3) Their heterogeneity.
Slide 15


Autonomy
Autonomy refers to the distribution (or
decentralization) of control, not of data.
It indicates the degree to which individual DBMSs
can operate independently.

Autonomy is a function of a number of factors
such as
 whether the component systems (i.e., individual
DBMSs) exchange information,
whether they can independently execute
transactions, and whether one is allowed to modify
them.
Slide 16

Dimensions of Autonomy
Design autonomy
Individual DBMSs are free to use the data models and
transaction management techniques that they prefer.
Communication autonomy
Each of the individual DBMSs is free to make its own
decision as to what type of information it wants to
provide to the other DBMSs or to the software that
controls their global execution.
Execution autonomy
Each DBMS can execute the transactions that are
submitted to it in any way that it wants to.
Slide 17



Distribution
The distribution dimension of the taxonomy deals
with data.
Physical distribution of data over multiple sites;
The user sees the data as one logical pool.

There are a number of ways DBMSs have been
distributed. Two classes:
client/server distribution
 peer-to-peer distribution (or full distribution).
Slide 18


Client/server distribution
The client/server distribution concentrates
data management duties at servers
while the clients focus on providing the
application environment including the user
interface.
 The communication duties are shared between
the client machines and servers.
Slide 19



Peer-to-peer distribution (or full
distribution).
In peer-to-peer systems, there is no distinction
of client machines versus servers.
Each machine has full DBMS functionality and
can communicate with other machines to
execute queries and transactions.
Slide 20




Heterogeneity
Heterogeneity ranges from hardware heterogeneity and
differences in networking protocols to variations in
data managers.
Heterogeneity in query languages
not only involves the use of completely different data
access paradigms in different data models,
but also covers differences in languages even when
the individual systems use the same data model.
Slide 21
Overview of Relational DBMS
Structure of Relational Databases
Relational Algebra
Slide 22

Most of the distributed database technology has
been developed using the relational model
Very simple model.
Often a good match for the way we think
about our data.

Example of a Relation: account (account-number, branch-name, balance)
Slide 23
Simplest approach (not always best): convert each entity
set to a relation and each relationship to a relation.
Entity set → relation
Entity set attributes become relational attributes.
Example: the entity set account with attributes account-number, branch-name and balance
becomes: account (account-number, branch-name,
balance)
Slide 24




Table = relation.
Column headers = attributes.
Row = tuple.
Relation schema = name(attributes) + other structure info,
e.g., keys, other constraints. Example: Account (account-number, branch-name, balance)
Order of attributes is arbitrary, but in practice we need to assume the
order given in the relation schema.
Relation instance is the current set of rows for a relation schema.
Database schema = collection of relation schemas.
Slide 25
Relation as table
Rows = tuples
Columns = components
Names of columns = attributes
Set of attribute names = schema
REL (A1, A2, ..., An)

A1   A2   A3   ...  An
a1   a2   a3   ...  an
b1   b2   a3   ...  cn
a1   c3   b3   ...  bn
.    .    .         .
x1   v2   d3   ...  wn

(Figure labels: the column names are the attributes, a row is a tuple, a single value is a component, the number of columns is the arity, the number of rows is the cardinality.)

Set theoretic
Domain: set of values, like a data type
n-tuples (V1, V2, ..., Vn)
such that V1 ∈ D1, V2 ∈ D2, ..., Vn ∈ Dn
Tuples = members of a relation instance
Arity = number of domains
Components = values in a tuple
Domains correspond with attributes
Cardinality = number of tuples
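The set-theoretic view maps directly onto Python's built-in tuples and sets. The sketch below uses made-up account rows; only the terms arity and cardinality come from the slide.

```python
# Schema: the attribute names, in the order assumed by the relation schema.
schema = ("account-number", "branch-name", "balance")

# A relation instance is a set of n-tuples, so duplicates collapse automatically.
instance = {
    ("A-101", "Downtown", 500),
    ("A-215", "Mianus", 700),
    ("A-101", "Downtown", 500),  # duplicate of the first tuple
}

arity = len(schema)          # number of attributes (domains)
cardinality = len(instance)  # number of distinct tuples

# Every tuple has exactly one component per attribute.
well_formed = all(len(t) == arity for t in instance)
```

Because the instance is a set, the repeated tuple is stored once, which is why the cardinality here is 2 and not 3.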
Slide 26



Each attribute of a relation has a name
The set of allowed values for each attribute is
called the domain of the attribute
Attribute values are (normally) required to be
atomic, that is, indivisible
 E.g. multivalued attribute values are not atomic
 E.g. composite attribute values are not atomic

The special value null is a member of every
domain
Slide 27



A1, A2, …, An are attributes
R = (A1, A2, …, An ) is a relation schema
E.g. Customer-schema =
(customer-name, customer-street, customer-city)
r(R) is a relation on the relation schema R
E.g. customer (Customer-schema)
Slide 28


The current values (relation instance) of a relation
are specified by a table
An element t of r is a tuple, represented by a row
in a table
customer
customer-name   customer-street   customer-city
Jones           Main              Harrison
Smith           North             Rye
Curry           North             Rye
Lindsay         Park              Pittsfield
The columns are the attributes; the rows are the tuples.
Slide 29



A database consists of multiple relations
Information about an enterprise is broken up into parts,
with each relation storing one part of the information
E.g.: account: stores information about accounts
depositor: stores information about which
customer owns which account
customer: stores information about customers
Storing all information as a single relation such as
bank(account-number, balance, customer-name, ..)
results in
repetition of information (e.g. a customer who owns two accounts)
the need for null values (e.g. to represent a customer without an account)

Normalization theory deals with how to design relational
schemas
Slide 30
The customer Relation
The depositor Relation
The branch Relation
The account Relation
The borrower Relation
(The tables for these five relations appeared as figures.)
The loan Relation
loan-number   branch-name   amount
L-11          Round Hill    900
L-14          Downtown      1500
L-15          Perryridge    1500
L-16          Perryridge    1300
L-17          Downtown      1000
L-23          Redwood       2000
L-93          Mianus        500
Slide 32



A superkey is a set of attributes within a table whose values can be
used to uniquely identify a tuple.
A candidate key is a minimal set of attributes necessary to
identify a tuple; this is also called a minimal superkey.
For example, given an employee schema consisting of the
attributes employeeID, name, job, and departmentID, we
could use the employeeID in combination with any or all other
attributes of this table to uniquely identify a tuple in the table.
Examples of superkeys in this schema would be {employeeID},
{employeeID, name}, {employeeID, name, job}, and
{employeeID, name, job, departmentID}. The last example is
known as the trivial superkey, because it uses all attributes of this table
to identify the tuple.


In a real database we do not need values for all of those attributes
to identify a tuple. We only need, per our example, the set
{employeeID}. This is a minimal superkey – that is, a minimal
set of attributes that can be used to identify a single tuple. So,
employeeID is a candidate key.
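The superkey and candidate-key definitions can be checked mechanically against a sample instance. The sketch below is only an illustration: the employee rows are invented, and a key is really a constraint on every legal instance, so a single instance can refute a key but never prove one.

```python
from itertools import combinations

attrs = ("employeeID", "name", "job", "departmentID")
employees = [  # hypothetical rows
    (1, "Rahim", "clerk", 10),
    (2, "Karim", "clerk", 10),
    (3, "Rahim", "analyst", 20),
    (4, "Rahim", "clerk", 10),
]

def is_superkey(candidate, rows=employees):
    """True if no two rows agree on every attribute in `candidate`."""
    idx = [attrs.index(a) for a in candidate]
    seen = {tuple(row[i] for i in idx) for row in rows}
    return len(seen) == len(rows)

def candidate_keys(rows=employees):
    """Superkeys that have no proper subset which is also a superkey."""
    keys = []
    # Visiting smaller sets first means every minimal key is found before its supersets.
    for size in range(1, len(attrs) + 1):
        for combo in combinations(attrs, size):
            if is_superkey(combo, rows) and not any(set(k) <= set(combo) for k in keys):
                keys.append(combo)
    return keys

keys = candidate_keys()
```

With these rows, {employeeID, name} is a superkey but not minimal, and employeeID alone is the only candidate key found.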
Although several candidate keys may exist, one of the candidate
keys is selected to be the primary key.
Slide 33



Strong entity set. The primary key of the entity
set becomes the primary key of the relation.
Weak entity set. The primary key of the relation
consists of the union of the primary key of the
strong entity set and the discriminator of the weak
entity set.
Relationship set. The union of the primary keys
of the related entity sets becomes a super key of
the relation.
 For binary many-to-one relationship sets, the primary key of
the “many” entity set becomes the relation’s primary key.
 For one-to-one relationship sets, the relation’s primary key can
be that of either entity set.
 For many-to-many relationship sets, the union of the primary
keys becomes the relation’s primary key
Slide 34


Query language: the language in which a user requests information from
the database.
Categories of languages
Procedural
User instructs the system to perform a sequence of operations on the
database to compute the desired result.
Non-procedural
User describes the desired information without giving a specific
procedure for obtaining that information.
“Pure” languages:
 Relational Algebra
 Tuple Relational Calculus
 Domain Relational Calculus
Slide 35


Procedural language
Six basic operators
 select
 project
 union
 set difference
 Cartesian product
 rename

The operators take one or two relations as
inputs and give a new relation as a result.
Slide 36
Select Operation – Example
• Relation r
• σ_{A=B ∧ D>5}(r)
Slide 37



Notation: σ_p(r)
p is called the selection predicate
Defined as:
σ_p(r) = {t | t ∈ r and p(t)}
where p is a formula in propositional calculus
consisting of terms connected by: ∧ (and), ∨
(or), ¬ (not)
Each term is one of:
<attribute> op <attribute> or
<attribute> op <constant>
where op is one of: =, ≠, >, ≥, <, ≤
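The definition of selection translates almost word for word into a filter. In this sketch a relation is a list of dicts and the predicate p is an ordinary Python function; both representation choices are assumptions made for illustration. The rows are taken from the loan relation.

```python
loan = [
    {"loan-number": "L-11", "branch-name": "Round Hill", "amount": 900},
    {"loan-number": "L-14", "branch-name": "Downtown", "amount": 1500},
    {"loan-number": "L-15", "branch-name": "Perryridge", "amount": 1500},
    {"loan-number": "L-16", "branch-name": "Perryridge", "amount": 1300},
]

def select(predicate, relation):
    """sigma_p(r): keep exactly the tuples t of r for which p(t) is true."""
    return [t for t in relation if predicate(t)]

# A single term: <attribute> = <constant>
perryridge = select(lambda t: t["branch-name"] == "Perryridge", loan)

# Two terms connected by "and"
big_perryridge = select(
    lambda t: t["branch-name"] == "Perryridge" and t["amount"] > 1400, loan
)
```

The result always has the same attributes as the input; selection only removes rows.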
Slide 38
Example of selection: σ_{branch-name = “Perryridge”}(loan)
Slide 39

Project Operation – Example
Relation r:
π_{A,C}(r) (duplicate rows removed)
Slide 40




Notation:
π_{A1, A2, …, Ak}(r)
where A1, A2, …, Ak are attribute names and r is a
relation name.
The result is defined as the relation of k
columns obtained by erasing the columns that
are not listed.
Duplicate rows are removed from the result, since
relations are sets.
E.g. to eliminate the branch-name attribute of
account:
π_{account-number, balance}(account)
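Projection can be sketched the same way, with the duplicate removal made explicit. The list-of-dicts representation and the function name are assumptions for illustration; the rows are taken from the loan relation.

```python
loan = [
    {"loan-number": "L-14", "branch-name": "Downtown", "amount": 1500},
    {"loan-number": "L-15", "branch-name": "Perryridge", "amount": 1500},
    {"loan-number": "L-16", "branch-name": "Perryridge", "amount": 1300},
]

def project(attributes, relation):
    """pi_{A1..Ak}(r): keep only the listed columns and drop duplicate rows."""
    result = []
    for t in relation:
        row = {a: t[a] for a in attributes}
        if row not in result:  # relations are sets, so duplicates are removed
            result.append(row)
    return result

branches = project(["branch-name"], loan)
numbers_and_amounts = project(["loan-number", "amount"], loan)
```

Projecting onto branch-name yields two rows from three input rows, because the two Perryridge loans become identical once the other columns are erased.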
Slide 41

Relations r, s:
r ∪ s:
Slide 42




Notation: r ∪ s
Defined as:
r ∪ s = {t | t ∈ r or t ∈ s}
For r ∪ s to be valid:
1. r, s must have the same arity (same number of attributes)
2. The attribute domains must be compatible (e.g., the 2nd
column of r deals with the same type of values as does the 2nd
column of s)
E.g. to find all customers with either an account or a loan:
π_{customer-name}(depositor) ∪ π_{customer-name}(borrower)
Slide 43
Union Operation
Names of All Customers Who Have Either a Loan or an Account
π_{customer-name}(depositor) ∪ π_{customer-name}(borrower)
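A small sketch of union with the arity check follows. Here a relation is a pair (attribute names, set of tuples); that representation and the customer names are assumptions for illustration, and domain compatibility is not checked.

```python
def union(r, s):
    """r ∪ s for relations given as (attributes, set_of_tuples) pairs."""
    r_attrs, r_rows = r
    s_attrs, s_rows = s
    if len(r_attrs) != len(s_attrs):
        raise ValueError("union requires relations of the same arity")
    # Set union removes tuples that appear in both inputs.
    return (r_attrs, r_rows | s_rows)

# Hypothetical results of projecting depositor and borrower onto customer-name.
depositor_names = (("customer-name",), {("Hayes",), ("Johnson",), ("Smith",)})
borrower_names = (("customer-name",), {("Adams",), ("Hayes",), ("Smith",)})

all_names = union(depositor_names, borrower_names)
```

Hayes and Smith appear in both inputs but only once in the result, so the union has four tuples.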
Slide 44



Notation: r – s
Defined as:
r – s = {t | t ∈ r and t ∉ s}
Set differences must be taken between
compatible relations.
r and s must have the same arity
attribute domains of r and s must be compatible
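Set difference follows the same pattern as union. The relations below are made-up two-column examples stored as (attribute names, set of tuples); only the arity condition is checked.

```python
def difference(r, s):
    """r - s for relations given as (attributes, set_of_tuples) pairs."""
    (r_attrs, r_rows), (s_attrs, s_rows) = r, s
    if len(r_attrs) != len(s_attrs):
        raise ValueError("set difference requires relations of the same arity")
    # Keep the tuples of r that do not occur in s.
    return (r_attrs, r_rows - s_rows)

r = (("A", "B"), {("a", 1), ("a", 2), ("b", 1)})
s = (("A", "B"), {("a", 2), ("b", 3)})

r_minus_s = difference(r, s)
s_minus_r = difference(s, r)
```

Unlike union, difference is not symmetric: r – s and s – r are different relations here.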
Slide 45

Relations r, s:
r – s:
Slide 46




Notation: r × s
Defined as:
r × s = {t q | t ∈ r and q ∈ s}
Assume that attributes of r(R) and s(S) are
disjoint. (That is, R ∩ S = ∅.)
If attributes of r(R) and s(S) are not disjoint,
then renaming must be used.
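The sketch below builds a Cartesian product and handles overlapping attribute names by prefixing every attribute with its relation name. The prefixing scheme, the function name and the borrower row are assumptions for illustration; the loan rows are taken from the loan relation.

```python
def product(r_name, r, s_name, s):
    """r × s on lists of dicts; attributes are prefixed so the schemas stay disjoint."""
    return [
        {**{f"{r_name}.{a}": v for a, v in t.items()},
         **{f"{s_name}.{a}": v for a, v in q.items()}}
        for t in r
        for q in s
    ]

borrower = [{"customer-name": "Adams", "loan-number": "L-16"}]  # assumed row
loan = [
    {"loan-number": "L-15", "branch-name": "Perryridge", "amount": 1500},
    {"loan-number": "L-16", "branch-name": "Perryridge", "amount": 1300},
]

pairs = product("borrower", borrower, "loan", loan)
```

Both inputs have a loan-number attribute, so without the prefixes one value would overwrite the other in the combined tuple.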
Slide 47
Cartesian-Product Operation – Example
Relations r, s:
r × s:
Slide 48

Can build expressions using multiple
operations
Example: σ_{A=C}(r × s)
(Figure: the table for r × s, followed by the table for σ_{A=C}(r × s).)


Slide 49

Allows us to refer to a relation by more than one
name.
Example: ρ_x(E)
returns the expression E under the name x.
If a relational-algebra expression E has arity n,
then
ρ_{x(A1, A2, …, An)}(E)
returns the result of expression E under the
name x, and with the attributes renamed to
A1, A2, …, An.
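Both forms of rename can be sketched with one function. The dict layout with 'name', 'attrs' and 'rows' and the new attribute names are assumptions for illustration; the rows are taken from the loan relation.

```python
def rename(new_name, relation, new_attributes=None):
    """rho_x(E), or rho_{x(A1..An)}(E) when new attribute names are given."""
    attrs = tuple(new_attributes) if new_attributes is not None else relation["attrs"]
    if len(attrs) != len(relation["attrs"]):
        raise ValueError("need exactly one new attribute name per column")
    # The tuples themselves are unchanged; only the names differ.
    return {"name": new_name, "attrs": attrs, "rows": set(relation["rows"])}

loan = {
    "name": "loan",
    "attrs": ("loan-number", "branch-name", "amount"),
    "rows": {("L-11", "Round Hill", 900), ("L-14", "Downtown", 1500)},
}

d = rename("d", loan)                                   # rho_d(loan)
big = rename("big", loan, ("number", "branch", "amt"))  # rho_{big(number, branch, amt)}(loan)
```

A second name for the same relation is what makes it possible to take the Cartesian product of a relation with itself.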
Slide 50
branch (branch-name, branch-city, assets)
customer (customer-name, customer-street, customer-city)
account (account-number, branch-name, balance)
loan (loan-number, branch-name, amount)
depositor (customer-name, account-number)
borrower (customer-name, loan-number)
Slide 51
Find all loans of over $1200
σ_{amount > 1200}(loan)
loan-number   branch-name   amount
L-14          Downtown      1500
L-15          Perryridge    1500
L-16          Perryridge    1300
L-23          Redwood       2000
Find the loan number for each loan of an amount greater than $1200
π_{loan-number}(σ_{amount > 1200}(loan))
loan-number
L-14
L-15
L-16
L-23
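These two queries can be reproduced on the loan relation in a few lines of Python. The tuple-based representation is an assumption made for illustration; the rows are those of the loan relation.

```python
# Columns: loan-number, branch-name, amount
loan = [
    ("L-11", "Round Hill", 900),
    ("L-14", "Downtown", 1500),
    ("L-15", "Perryridge", 1500),
    ("L-16", "Perryridge", 1300),
    ("L-17", "Downtown", 1000),
    ("L-23", "Redwood", 2000),
    ("L-93", "Mianus", 500),
]

# sigma_{amount > 1200}(loan): keep whole tuples whose amount exceeds 1200.
over_1200 = [t for t in loan if t[2] > 1200]

# pi_{loan-number}(sigma_{amount > 1200}(loan)): keep only the first column.
loan_numbers = sorted({t[0] for t in over_1200})
```

The selection is applied first and the projection second, in the same inside-out order in which the algebra expression is read.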
Slide 52

Find the names of all customers who have a loan,
an account, or both, from the bank
π_{customer-name}(borrower) ∪ π_{customer-name}(depositor)
Find the names of all customers who have a loan and an
account at the bank.
π_{customer-name}(borrower) ∩ π_{customer-name}(depositor)
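Intersection is not one of the six basic operators, but it can be written with set difference alone, since r ∩ s = r – (r – s). The sketch below checks that identity on hypothetical sets of customer names.

```python
# Hypothetical results of projecting borrower and depositor onto customer-name.
borrower_names = {"Adams", "Hayes", "Smith"}
depositor_names = {"Hayes", "Johnson", "Smith"}

# Loan, account, or both: union.
either = borrower_names | depositor_names

# Loan and account: intersection expressed with set difference only.
only_loan = borrower_names - depositor_names  # r - s
both = borrower_names - only_loan             # r - (r - s)
```

Removing from the borrowers everyone who has only a loan leaves exactly the customers who have both a loan and an account.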
Slide 53

Find the names of all customers who have a loan at the
Perryridge branch.
π_{customer-name}(σ_{branch-name = “Perryridge”}
(σ_{borrower.loan-number = loan.loan-number}(borrower × loan)))
Find the names of all customers who have a loan at the
Perryridge branch but do not have an account at any branch of
the bank.
π_{customer-name}(σ_{branch-name = “Perryridge”}
(σ_{borrower.loan-number = loan.loan-number}(borrower × loan)))
– π_{customer-name}(depositor)
Slide 54
Result of borrower × loan
Slide 55
Result of σ_{branch-name = “Perryridge”}(borrower × loan)
π_{customer-name}(σ_{branch-name = “Perryridge”}
(σ_{borrower.loan-number = loan.loan-number}(borrower × loan)))
π_{customer-name}(σ_{branch-name = “Perryridge”}
(σ_{borrower.loan-number = loan.loan-number}(borrower × loan))) – π_{customer-name}(depositor)
customer-name
Adams
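The whole expression can be traced step by step in Python. The loan rows below are taken from the loan relation; the borrower and depositor rows are assumed for illustration, because those tables are not reproduced in the text.

```python
borrower = [("Adams", "L-16"), ("Hayes", "L-15"), ("Smith", "L-11")]  # assumed rows
depositor = [("Hayes", "A-102"), ("Smith", "A-215")]                  # assumed rows
loan = [
    ("L-11", "Round Hill", 900),
    ("L-15", "Perryridge", 1500),
    ("L-16", "Perryridge", 1300),
]

# borrower × loan: every borrower tuple concatenated with every loan tuple.
# Positions: 0 customer-name, 1 borrower.loan-number,
#            2 loan.loan-number, 3 branch-name, 4 amount
pairs = [b + l for b in borrower for l in loan]

# sigma_{borrower.loan-number = loan.loan-number}: keep pairs describing the same loan.
matched = [t for t in pairs if t[1] == t[2]]

# sigma_{branch-name = "Perryridge"}
perryridge = [t for t in matched if t[3] == "Perryridge"]

# pi_{customer-name}(...) on both sides, then set difference.
loan_customers = {t[0] for t in perryridge}
account_customers = {name for name, _ in depositor}
result = loan_customers - account_customers
```

With these assumed rows the product has nine tuples, three survive the loan-number match, two of those are at Perryridge, and the difference leaves only Adams.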
Slide 56
Customers With An Account But No Loan
π_{customer-name}(depositor) – π_{customer-name}(borrower)
Slide 57