Session 1 Basic Concepts of Distributed Database Systems

SCC762 Distributed Database Systems
Session 1
Basic Concepts of Distributed Database Systems
1.1 Study points
1.2 What is a distributed database system?
1.3 Advantages and disadvantages of DDBSs
1.4 Research and development issues
1.5 Overview of relational DBMSs
1.6 Overview of computer networks
1.1 Study points
On completion of this session you should be able to:
1.1 understand the concepts of distributed database systems (DDBSs)
1.2 understand the issues involved in the research and development of DDBSs
1.3 identify the various levels of transparency provided by a distributed database management system (DDBMS)
1.4 explain the basics of the relational data model and relational databases
1.5 use the relational algebra to express data manipulation processes
1.6 use SQL to express data definition and data manipulation processes
1.7 explain the basics of computer networks and their impact on distributed database systems
References: [VO]: Chapters 1, 2, and 3; [BG]: Chapters 1, 2, and 3.
1.2 What is a distributed database system?
Distributed database systems can be defined as follows:
Distributed database system (DDBS) = Databases + Computers + Computer Network +
Distributed database management system (DDBMS)
The database system manages data; the computer network makes the communication possible; and the DDBMS (distributed database management system) is the software that realises the mechanisms and policies for manipulating distributed data.
A distributed database system can be simply defined as a collection of multiple logically
interrelated databases distributed over a computer network and managed by a distributed
database management system.
A distributed database management system (DDBMS) is the software system that permits
the management of the DDBS and makes the distribution transparent to users.
From the architectural point of view, a distributed database system consists of a collection of sites, connected together via a communication network, in which
➢ Each site is a database system site in its own right, but
➢ The sites have agreed to work together (if necessary), so that a user at any site can access data anywhere in the network exactly as if the data were all stored at the user’s own site.
Figure 1 depicts the architecture of a distributed database.
[Diagram: Sites 1 to N, each hosting a client, a server, or both, attached to a shared communication network through a communication system (CS).]
Figure 1: Architecture of a DDBS
➢ Client: the frontend of a DDBS, where access requests are issued.
➢ Server: the backend of a DDBS, where the database is stored.
➢ Communication system (CS): enables communication between clients and servers.
Figure 2 depicts the evolution of distributed databases.
[Diagram in three panels.
Traditional file processing: decentralised; unintegrated. Each of Program 1 to Program N embeds its own data specification and works on its own file (File 1 to File N).
Database processing: centralised; integrated. Program 1 to Program N share one database and its data specifications.
Distributed database processing: distributed; integrated. Sites 1 to N are linked by a communication network.]
Figure 2. Evolution
The first generation of data processing is decentralised and unintegrated: data are stored in individual files, and data specifications are embedded in the programs that manipulate the data. Files are therefore not shared, and any change in a file structure affects the data specifications in the programs. The second generation is centralised and integrated: data are stored in a centralised database, and data specifications are stored in a centralised location, normally the same location as the database. The advantage of this model is that changes in the database may affect only the data specifications, not the programs. The third generation is distributed and integrated: data and their local specifications are distributed over a network, and there also exists a global view of all the data stored in the network.
1.3 Advantages and disadvantages of DDBSs
Three factors have led to distributed database systems:
➢ Distributed systems (hardware/software). Nowadays, most computer systems are connected by networks and are therefore distributed. We already work in such a distributed environment, so a DDBS capable of coordinating the work of distributed resources is needed.
➢ Distributed applications. Many applications are distributed in nature, for example banking systems, point-of-sale systems, and order and inventory management systems. A DDBS that can support such applications is needed.
➢ Distributed data. Most data are distributed in nature. Many organisations, such as Deakin University, have multiple locations, so their data are already distributed. A DDBS capable of managing the distributed data is needed.
Advantages of a DDBS can be briefly summarised as transparent management of
distributed and replicated data, reliability through distributed transactions, improved
performance, and easier system expansion. They are outlined as follows:
➢ Transparency: One of the most important issues in distributed databases is the maintenance of transparency in a DDBMS. Transparency refers to the separation of the higher-level semantics of a system from lower-level implementation issues. It mainly includes data independence, network transparency, replication transparency, and fragmentation transparency.
• Data independence: a fundamental form of transparency within a single DBMS. It includes both logical data independence and physical data independence. Logical data independence is the ability to modify the conceptual database schema without having to change external schemas or application programs. Possible changes to the conceptual schema are adding a new record type or a new field to an existing record type, or deleting a record type from the database or a field from a record type. In the latter case, external schemas that refer only to the unchanged record types should be able to remain unchanged. Physical data independence is the ability to modify the physical database schema without causing application programs to be rewritten. Modifications at the physical schema level include changes to the length of fields and records, changes to indexes on data files, changes to the organisation of records in files, and so forth.
• Network (distribution) transparency: from the user’s point of view, there is no way to tell whether a single database or a distributed database is being accessed. It is classified into location transparency and naming transparency. Location transparency means that data objects can be accessed without knowledge of their location. Naming transparency means that a unique name is provided for each object in the database.
• Replication transparency: multiple instances of data objects can be used to increase reliability and performance without users or application programs knowing about the replicas.
• Fragmentation transparency: logically related data may be fragmented and stored in multiple locations to increase reliability and performance. This fragmentation should be hidden from users and application programs.
However, who should provide these transparencies? The data access layer of the network? The OS? Or the DBMS? Each approach has its pros and cons, and the DBMS should take responsibility for providing the transparencies related to data and its access.
➢ Improved performance: as a rule of thumb, about 80% of the information generated locally is consumed locally, and only about 20% of locally generated information is consumed by (accessed by or sent to) external entities. Storing data locally therefore speeds up most local activities.
➢ Improved reliability / availability: reliability is the probability that if a system S is working at time t0, then S will continue to work throughout the time interval (t0, t). Availability is the probability that S is operational at a given time t. When data are distributed, a fault in one part of the system will not cause the whole system to stop working.
➢ Economics: a parallel supercomputer costs millions of dollars, while the same amount of money can buy a distributed computer system (servers, PCs and LAN connections) with much higher computing power.
➢ Expandability: it is easier to expand a distributed environment, simply by adding more workstations and servers.
➢ Sharability: users in a distributed system can easily share their resources.
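The availability argument can be made concrete with a small calculation. The sketch below assumes that every site is operational independently with the same probability a; this independence assumption and the function names are introduced here for illustration only and are not part of the session material. It contrasts data replicated at n sites (any one copy is enough) with an operation that needs all n sites at once.

```python
def availability_any_replica(a: float, n: int) -> float:
    """Probability that at least one of n independent replicas is operational."""
    if not 0.0 <= a <= 1.0 or n < 1:
        raise ValueError("need 0 <= a <= 1 and n >= 1")
    # All n replicas are down with probability (1 - a)^n.
    return 1.0 - (1.0 - a) ** n


def availability_all_sites(a: float, n: int) -> float:
    """Probability that all n independent sites are operational together."""
    if not 0.0 <= a <= 1.0 or n < 1:
        raise ValueError("need 0 <= a <= 1 and n >= 1")
    return a ** n
```

Under these assumptions replication raises the availability of reads, while an operation that must reach every site becomes less available as n grows, which is one reason replication has to be managed by the DDBMS rather than by users.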
There are also some disadvantages in using a distributed database system. Some of them are
listed below:
➢ Lack of experience and standards: many people are used to centralised control, and there are not many standards for distributed databases to follow.
➢ Complexity: a well-known rule of thumb from psychology says that human beings can keep track of only about 7 ± 2 things at the same time. A distributed database system, however, normally has hundreds or thousands of events happening simultaneously. It is very hard for human beings to fully understand such a complicated system.
➢ Cost: there is an initial cost in shifting from a centralised system to a distributed system, and in setting up and maintaining the computer network.
➢ Distribution of control: in business, data equals power and money. Some people are reluctant to lose control of their data.
➢ Security: it is more difficult to maintain security in a distributed environment than in a centralised environment.
➢ Difficulty of change: it is hard to persuade people to change from a familiar environment to a new one, even if the new one provides more benefits.
1.4 Research and development issues
We can list the following issues on the research and development of distributed database
systems:
➢ Distributed database design: The two fundamental design issues are fragmentation, the separation of the database into partitions called fragments, and distribution, the optimum distribution of fragments.
➢ Distributed query processing: Query processing deals with designing algorithms that analyse queries and convert them into a series of data manipulation operations. The problem is how to decide on a strategy for executing each query over the network in the most cost-effective way. The factors to be considered are the distribution of data, communication costs, and the lack of sufficient locally available information. The objective is to find a strategy that uses the inherent parallelism to improve the performance of executing the transaction.
➢ Distributed directory management: A directory contains information about data items in the database. It may be global to the entire DDBS or local to each site; it can be centralised at one site or distributed over several sites; there can be a single copy or multiple copies.
➢ Distributed concurrency control: It involves the synchronisation of accesses to the distributed database, such that the integrity of the database is maintained. Two fundamental primitives are locking, which is based on the mutual exclusion of accesses to data items, and timestamping, where the transactions are executed in some order.
➢ Distributed deadlock management: The deadlock problem in DDBSs is similar in nature to that encountered in operating systems. The competition among users for access to a set of resources can result in a deadlock if the synchronisation mechanism is based on locking. The well-known alternatives of prevention, avoidance, and detection/recovery also apply to DDBSs.
➢ Reliability of DDBMSs: It is important that mechanisms be provided to ensure the consistency of the database as well as to detect failures and recover from them.
➢ Operating system support: In distributed environments there is the problem of having to deal with multiple layers of network software. The task is to provide both adequate support for distributed database operations and general operating system support for other applications.
➢ Heterogeneous databases: When there is no homogeneity among the databases at various sites, it becomes necessary to provide a translation mechanism between database systems. This translation mechanism usually involves a canonical form to facilitate data translation, as well as program templates for translating data manipulation instructions.
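The fragmentation and distribution issue listed above can be illustrated with a minimal sketch of horizontal fragmentation: the rows of one global relation are split into fragments according to the site where they are mostly used, and the global relation is recovered as the union of the fragments. The ACCOUNT relation, its rows and the function names are hypothetical, chosen only for this illustration.

```python
from collections import defaultdict


def fragment_horizontally(rows, site_of):
    """Split rows into one fragment per site, using a site-assignment function."""
    fragments = defaultdict(list)
    for row in rows:
        fragments[site_of(row)].append(row)
    return dict(fragments)


def reconstruct(fragments):
    """Rebuild the global relation as the union of all fragments."""
    return {row for fragment in fragments.values() for row in fragment}


# Hypothetical global relation ACCOUNT(acc_no, branch, balance).
ACCOUNTS = [
    (101, "Geelong", 500),
    (102, "Melbourne", 1200),
    (103, "Geelong", 75),
    (104, "Warrnambool", 300),
]

# Each row is stored at the site of its branch (attribute position 1).
fragments = fragment_horizontally(ACCOUNTS, site_of=lambda row: row[1])
```

Because every row lands in exactly one fragment, the union of the fragments returns the original relation; a real design would also have to decide whether fragments are replicated and where each copy is placed.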
Figure 3 shows the relationships among these issues.
[Diagram relating the problem areas: directory management, query processing, distributed database design, reliability, concurrency control, and deadlock management.]
Figure 3: Relationship among problem areas
1.5 Overview of relational DBMSs
A DDBS consists of many DBSs and most of them are relational DBSs. Therefore most
DDBSs are relational as well.
The important issues to understand in relational databases are:
➢ The relational data model
➢ Normalisation
➢ Relational algebra
➢ The entity/relationship model
➢ The ANSI/SPARC 3-level architecture
➢ SQL
1.5.1 The relational data model
We give the following term definitions:
A domain D is a set of atomic values and is associated with a type. For example, INTEGER,
STRING, DATE, are domains. A domain has a name, a type, and a format. For example,
BIRTHDATE could be a domain's name, its type could be DATE, and its format could be
dd/mm/yy.
A relation schema R, denoted by R(A1, A2, ..., An), is made up of a relation name R and a list of attributes A1, A2, ..., An. The domain of attribute Ai is denoted as dom(Ai), i = 1, ..., n. A relation schema is used to describe a relation. R is called the name of the relation and n the degree of the relation.
A relation r of the relation schema R(A1, A2, ..., An), denoted by r(R), is a set of n-tuples r = {t1, t2, ..., tm}. Each n-tuple t is an ordered list of n values t = <v1, v2, ..., vn>, where each value vi, 1 <= i <= n, is an element of dom(Ai) or is a special null value.
A relation schema is an abstract kind of object, and a relation (table) is a concrete picture (instance) of such an abstract object. The primary key of a relation is a unique identifier that can uniquely identify a tuple of that relation.
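The definitions above can be sketched in a few lines of code. In this illustration a domain is approximated by a Python type, None plays the role of the null value, and a relation is a set of tuples checked against its schema and primary key. The schema, the attribute names and the helper functions are hypothetical, introduced only for this example.

```python
from datetime import date

# Hypothetical relation schema: attribute name -> domain (a Python type here).
SCHEMA = {"tid": int, "name": str, "birthdate": date}
PRIMARY_KEY = ("tid",)


def make_tuple(schema, values):
    """Check that each value vi is in dom(Ai) or is null (None)."""
    if len(values) != len(schema):
        raise ValueError("number of values does not match the schema degree")
    for (attr, dom), value in zip(schema.items(), values):
        if value is not None and not isinstance(value, dom):
            raise TypeError(f"{attr}: {value!r} is not in domain {dom.__name__}")
    return tuple(values)


def make_relation(schema, key, rows):
    """Build r(R) as a set of tuples and enforce primary-key uniqueness."""
    attrs = list(schema)
    key_idx = [attrs.index(a) for a in key]
    relation, seen_keys = set(), set()
    for values in rows:
        t = make_tuple(schema, values)
        if t in relation:
            continue  # a relation is a set: exact duplicates collapse
        k = tuple(t[i] for i in key_idx)
        if k in seen_keys:
            raise ValueError(f"duplicate primary key {k}")
        seen_keys.add(k)
        relation.add(t)
    return relation


r = make_relation(SCHEMA, PRIMARY_KEY, [
    (1, "Ann", date(1980, 5, 17)),
    (2, "Bob", None),                 # null birthdate
    (1, "Ann", date(1980, 5, 17)),    # exact duplicate, stored once
])
degree = len(SCHEMA)   # number of attributes
cardinality = len(r)   # number of tuples
```

Using a Python set mirrors the property that a relation has no duplicate tuples and no tuple ordering.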
Table 1 lists some formal terms and their informal equivalents.
Table 1: Formal relational terms and their informal equivalents
Formal term      Informal equivalent
relation         table
tuple            row or record
cardinality      number of rows
attribute        column or field
degree           number of columns
primary key      unique identifier
domain           pool of legal values
We can list the properties of a relation as follows:
➢ There are no duplicate tuples, so there is always a primary key.
➢ Tuples are unordered (top to bottom).
➢ Attributes are unordered (left to right).
➢ All (simple) attribute values are atomic. Equivalently, relations do not contain repeating groups. A relation satisfying this condition is said to be normalised.
A relational database contains tables (relations), nothing but tables.
1.5.2 Normalisation
Normalisation provides a method of representing data and their relationships precisely in a
tabular format that makes the database easy to understand and operationally efficient. It is
basically a formalisation of avoiding redundancy; of trying to achieve “one fact one place.”
There are five normalisation levels (forms). Each form's rules place additional conditions
on the database design. By satisfying the first set of rules, the data is said to be in the first
normal form or 1NF. Moving on to the second set results in the second normal form (2NF),
and so on up to the fifth normal form (5NF). Figure 4 depicts these normal form levels.
[Diagram: nested sets. The universe of relations (normalised and unnormalised) contains the 1NF relations, which contain the 2NF relations, which in turn contain the 3NF, 4NF and 5NF relations.]
Figure 4. Levels of normal forms
One additional normal form, called Boyce-Codd Normal Form (BCNF), can be defined
between 3NF and 4NF.
We summarise the conditions addressed by each normal form as follows:
➢ 1NF: Repeating groups.
➢ 2NF: A column’s dependency on only part of a composite key (partial dependency).
➢ 3NF: A nonkey column’s representing a fact about another nonkey column (transitive dependency).
➢ 4NF: Two or more independent, multivalued facts occurring for an entity.
➢ 5NF: Interdependent columns (symmetric constraint).
A designer does not have to work through the 1NF and 2NF to get the 3NF. The rules
leading to and including the 3NF can be summed up in a single statement:
Each attribute must be a fact about the key, the whole key, and nothing but the key.
Designers can use this statement to test whether their designs are in 3NF.
For the majority of database designers, the 3NF is the extent of normalisation needed or desired. Two additional normal forms can, however, improve the design in certain situations. The 4NF and 5NF apply only when the database includes one-to-many and many-to-many relationships, and then only in some special situations.
An example of normalisation
Unnormalised relation:
TRADESMAN(Tradesman ID, Tradesman Name, Company ID, Company Name, Company Location, Dependent Name, Dependent Date of Birth, Skill ID, Skill Name, Skill Where Attained, Skill Level)
1NF: remove the repeating groups, as shown in Figure 5.
TRADESMAN (primary key: Tradesman ID): Tradesman ID, Tradesman Name, Company ID, Company Name, Company Location
DEPENDENT (primary key: Tradesman ID, Dependent Name): Tradesman ID, Dependent Name, Dependent Date of Birth
SKILL (primary key: Tradesman ID, Skill ID): Tradesman ID, Skill ID, Skill Name, Where Skill Attained, Skill Level
Figure 5. First normal form
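2NF: remove partial dependency, as shown in Figure 6.
SKILL is in 1NF but not in 2NF because it contains a partial dependency: Skill Name depends on Skill ID alone, which is only part of the composite key.
SKILL (primary key: Tradesman ID, Skill ID): Tradesman ID, Skill ID, Skill Name, Where Skill Attained, Skill Level
After the split, TSKILL and SKILL are in 2NF. TSKILL (primary key: Tradesman ID, Skill ID) keeps the attributes that depend on the whole key; SKILL (primary key: Skill ID) holds Skill Name and any other attribute that depends on Skill ID alone.
Figure 6. Second normal form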
3NF: remove transitive dependency, as shown in Figure 7.
TRADESMAN is in 2NF but not in 3NF because it contains a transitive dependency: Company Name and Company Location depend on Company ID, which is a nonkey column.
TRADESMAN (primary key: Tradesman ID): Tradesman ID, Tradesman Name, Company ID, Company Name, Company Location
After the split, TRADESMAN and DEPARTMENT are in 3NF:
TRADESMAN (primary key: Tradesman ID): Tradesman ID, Tradesman Name, Company ID
DEPARTMENT (primary key: Company ID): Company ID, Company Name, Company Location
Figure 7. Third normal form
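The 3NF step can be checked mechanically. The sketch below takes a few hypothetical rows of the 2NF TRADESMAN relation, splits them into TRADESMAN and DEPARTMENT, and joins the two back on Company ID to confirm that no information is lost. The sample rows and function names are assumptions made for this illustration.

```python
# Hypothetical rows of the 2NF relation
# TRADESMAN(tradesman_id, tradesman_name, company_id, company_name, company_location).
TRADESMAN_2NF = [
    (1, "Lee", 10, "Acme", "Geelong"),
    (2, "Kim", 10, "Acme", "Geelong"),   # company facts repeated for each tradesman
    (3, "Roy", 20, "Bolt", "Burwood"),
]


def decompose(rows):
    """Split into TRADESMAN(tid, name, cid) and DEPARTMENT(cid, cname, location)."""
    tradesman = {(tid, name, cid) for tid, name, cid, _, _ in rows}
    department = {(cid, cname, loc) for _, _, cid, cname, loc in rows}
    return tradesman, department


def rejoin(tradesman, department):
    """Join the two relations on company_id to rebuild the original rows."""
    by_cid = {cid: (cname, loc) for cid, cname, loc in department}
    return {(tid, name, cid) + by_cid[cid] for tid, name, cid in tradesman}


tradesman, department = decompose(TRADESMAN_2NF)
```

After the split each company is described once in DEPARTMENT ("one fact one place"), and the join reproduces the original rows, so the decomposition is lossless for this data.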
1.5.3 Relational algebra
Relational algebra consists of a collection of high-level operators that operate on relations.
Each such operator takes one or two relations as its input and produces a new relation as its
output. The original eight operators defined by Codd (the pioneer of the relational model)
are:
➢ The traditional (mathematical) set operations: union, intersection, difference, and Cartesian product.
➢ The special relational operations: restrict, project, join, and divide.
Figure 8 depicts how these operators work on tables.
[Diagram with eight panels, one per operator, each showing small example tables: Restrict, Project, Product, Union, Intersection, Difference, (Natural) Join, and Divide.]
Figure 8: The original eight operators
The RESTRICT operation is called SELECT in relational languages.
We use the Drinker database of Figure 9 to review some of the operations in relational
databases.
FREQUENT (the Drinker frequently goes to the Bar)
Drinker   Bar
Ullman    Manuel’s
Ullman    Orchard Night
Ullman    Faculty Club
Ullman    Dynasty
Sam       Manuel’s
Sam       Orchard Night
Smith     Dynasty

SERVES (the Bar serves the Beer)
Bar             Beer
Manuel’s        Miller Lite
Manuel’s        Tiger
Orchard Night   VB
Manuel’s        Qindao
Faculty Club    Tiger
Faculty Club    Miller Lite
Dynasty         Anchor

LIKES (the Drinker likes the Beer)
Drinker   Beer
Ullman    Miller Lite
Ullman    Tiger
Ullman    Anchor
Jones     Anchor
Sam       Anchor

Figure 9: Sample data of the Drinker database
Restrict and Project: List the names of bars that Ullman frequently goes to.
π bar ( σ drinker = ′Ullman′ ( Frequent ) )
Union: List the names of bars that Sam or Smith frequently goes to.
π bar ( σ drinker = ′Sam′ ( Frequent ) ∪ σ drinker = ′Smith′ ( Frequent ) )
Or:
π bar ( σ drinker = ′Sam′ ∨ drinker = ′Smith′ ( Frequent ) )
Intersection: List drinkers who like both “Miller Lite” and “VB”.
π drinker ( σ beer = ′Miller Lite′ ( Likes ) ) ∩ π drinker ( σ beer = ′VB′ ( Likes ) )
But the following is wrong (no beer is named both Miller Lite and VB):
π drinker ( σ beer = ′Miller Lite′ ∧ beer = ′VB′ ( Likes ) )
Difference: List bars that do not serve “Miller Lite”.
➢ Step 1: all bars that serve “Miller Lite”
B2 = π bar ( σ beer = ′Miller Lite′ ( Serves ) )
➢ Step 2: all bars
B1 = π bar ( Frequent ) ∪ π bar ( Serves )
➢ Step 3: the difference result: B1 – B2.
Product: Frequent × Serves
Join: List beers that Ullman drinks (that is, the beers served at the bars Ullman frequently goes to).
➢ Step 1: all Frequent tuples regarding Ullman
σ drinker = ′Ullman′ ( Frequent )
➢ Step 2: link with Serves on bar (natural join)
σ drinker = ′Ullman′ ( Frequent ) ⋈ Serves
➢ Step 3: project onto beer
π beer ( σ drinker = ′Ullman′ ( Frequent ) ⋈ Serves )
Other types of joins: semi-join, θ-joins.
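The queries above can be tried out with a small in-memory version of the operators. The sketch below is one possible way to model them, not a standard library: a relation is a tuple of attribute names plus a set of rows, and the function names (restrict, project, natural_join and so on) are chosen here to match the text. The data are those of the Drinker database in Figure 9.

```python
from typing import Callable, Iterable, NamedTuple


class Relation(NamedTuple):
    attrs: tuple      # attribute names, e.g. ("drinker", "bar")
    rows: frozenset   # one value tuple per row; a set, so no duplicates


def relation(attrs: Iterable[str], rows: Iterable[tuple]) -> Relation:
    return Relation(tuple(attrs), frozenset(tuple(r) for r in rows))


def restrict(r: Relation, pred: Callable[[dict], bool]) -> Relation:
    """Restrict (sigma): keep the rows for which pred(row as dict) holds."""
    keep = (t for t in r.rows if pred(dict(zip(r.attrs, t))))
    return Relation(r.attrs, frozenset(keep))


def project(r: Relation, attrs: Iterable[str]) -> Relation:
    """Project (pi): keep only the named attributes, dropping duplicates."""
    attrs = tuple(attrs)
    idx = [r.attrs.index(a) for a in attrs]
    return Relation(attrs, frozenset(tuple(t[i] for i in idx) for t in r.rows))


def union(r: Relation, s: Relation) -> Relation:
    assert r.attrs == s.attrs, "operands must have the same attributes"
    return Relation(r.attrs, r.rows | s.rows)


def intersection(r: Relation, s: Relation) -> Relation:
    assert r.attrs == s.attrs, "operands must have the same attributes"
    return Relation(r.attrs, r.rows & s.rows)


def difference(r: Relation, s: Relation) -> Relation:
    assert r.attrs == s.attrs, "operands must have the same attributes"
    return Relation(r.attrs, r.rows - s.rows)


def natural_join(r: Relation, s: Relation) -> Relation:
    """Combine rows of r and s that agree on all attributes they share."""
    common = [a for a in r.attrs if a in s.attrs]
    extra = [a for a in s.attrs if a not in common]
    out = set()
    for t in r.rows:
        td = dict(zip(r.attrs, t))
        for u in s.rows:
            ud = dict(zip(s.attrs, u))
            if all(td[a] == ud[a] for a in common):
                out.add(t + tuple(ud[a] for a in extra))
    return Relation(r.attrs + tuple(extra), frozenset(out))


frequent = relation(("drinker", "bar"), [
    ("Ullman", "Manuel's"), ("Ullman", "Orchard Night"), ("Ullman", "Faculty Club"),
    ("Ullman", "Dynasty"), ("Sam", "Manuel's"), ("Sam", "Orchard Night"),
    ("Smith", "Dynasty")])
serves = relation(("bar", "beer"), [
    ("Manuel's", "Miller Lite"), ("Manuel's", "Tiger"), ("Orchard Night", "VB"),
    ("Manuel's", "Qindao"), ("Faculty Club", "Tiger"),
    ("Faculty Club", "Miller Lite"), ("Dynasty", "Anchor")])
likes = relation(("drinker", "beer"), [
    ("Ullman", "Miller Lite"), ("Ullman", "Tiger"), ("Ullman", "Anchor"),
    ("Jones", "Anchor"), ("Sam", "Anchor")])


def drinker_is(name):
    return lambda t: t["drinker"] == name


def beer_is(name):
    return lambda t: t["beer"] == name


ullman_bars = project(restrict(frequent, drinker_is("Ullman")), ["bar"])
sam_or_smith = union(project(restrict(frequent, drinker_is("Sam")), ["bar"]),
                     project(restrict(frequent, drinker_is("Smith")), ["bar"]))
both_beers = intersection(project(restrict(likes, beer_is("Miller Lite")), ["drinker"]),
                          project(restrict(likes, beer_is("VB")), ["drinker"]))
all_bars = union(project(frequent, ["bar"]), project(serves, ["bar"]))
no_miller = difference(all_bars, project(restrict(serves, beer_is("Miller Lite")), ["bar"]))
ullman_beers = project(natural_join(restrict(frequent, drinker_is("Ullman")), serves), ["beer"])
```

With the sample data nobody likes “VB”, so the intersection query returns an empty relation, and the difference query leaves Orchard Night and Dynasty.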
1.5.4 The entity/relationship model
The Entity/Relationship (E/R) data model is most often used as a tool for communication
between database designers and end users during the analysis phase of database
development. The E/R model is used to construct a conceptual data model, which is a
representation of the structure of a database that is independent of the software (such as the
DBMS) that will be used to implement the database.
An E/R model is usually expressed as an E/R diagram, which is a graphical representation
of the E/R model.
➢ Entity: a thing which can be distinctly identified. A weak entity is an entity that is existence-dependent on some other entity. A regular entity is an entity that is not weak.
➢ Property: entities have properties (or attributes). All entities of a given type have certain properties in common. Each property draws its values from a corresponding value set. Properties can be simple or composite, key, single- or multi-valued, missing, base or derived.
➢ Relationship: an association among entities. The entities involved in a given relationship are said to be the participants in that relationship. The number of participants in a given relationship is called the degree of that relationship. The participation can be total or partial. An E/R relationship can be one-to-one, one-to-many, or many-to-many.
➢ Supertype and Subtype: A supertype is a generic entity that is subdivided into subtypes. For example, a TRADESMAN can be divided into ELECTRICIAN and PLUMBER subtypes. A subtype is then a subset of a supertype that shares common attributes or relationships distinct from other subsets. Entity subtypes behave in exactly the same way as any entity type.
➢ Generalisation and categorisation: Generalisation is the concept that some things (entities) are subtypes of other, more general things (entities). Categorisation is the opposite concept, that an entity comes with various subtypes.
Entity/Relationship diagrams constitute a technique for representing the logical structure of
a database in a pictorial manner.
➢ Entities: each entity type is shown as a rectangle, labeled with the name of the entity type in question. For a weak entity, the rectangle is shown with double lines.
➢ Properties: are shown as ellipses, labeled with the name of the property in question and attached to the relevant entity by means of a straight line. The ellipse is dotted if the property is derived and double lined if the property is multivalued. If the property is composite, its component properties are shown as further ellipses, connected to the ellipse for the composite property in question by means of a straight line. Key properties are underlined.
➢ Relationships: each relationship is shown as a diamond, labeled with the name of the relationship in question. The participants in each relationship are connected by means of straight lines; each such line is associated with one of the symbols..
➢ Subtypes and supertypes: Let Y be a subtype of X. Then we draw a straight line from Y to X, marked with a hook to represent the mathematical “subset of” operator.
An entity/relationship diagram is an abstract database design. Such a design can be easily
mapped into a relational database definition. Figure 10 shows an E/R model for a tradesman
database design.
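[E/R diagram: entity COMPANY with properties CID, Name and Location; entity TRADESMAN with properties TID, Name and Address; entity SKILL with properties SID, Name and Level; relationship BELONGTO between TRADESMAN and COMPANY; relationship HAS, with property GrantedBy, between TRADESMAN and SKILL.]
Figure 10. Example E/R model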
The E/R model is mapped into the following relations:
COMPANY(CID, Name, Location)
TRADESMAN(TID, Name, Address, CID)
SKILL(SID, Name, Level)
HAS(TID, SID, GrantedBy)
1.5.5 The ANSI/SPARC 3-level architecture
Figure 11 shows the three levels of the DBMS architecture proposed by ANSI/SPARC
Study Group on Data Base Management Systems.
[Diagram: at the external level, External View 1 to External View n; an external/conceptual mapping links them to the conceptual schema at the conceptual level; a conceptual/internal mapping links the conceptual schema to the internal schema at the internal level, which describes the stored database.]
Figure 11. The three levels of the ANSI/SPARC architecture
Broadly speaking,
➢ The internal level is the lowest level of abstraction, at which one describes how the data are physically stored. It has an internal schema.
➢ The conceptual level is the next higher level of abstraction, at which one describes what data are actually stored in the database and the relationships that exist among the data. It has a conceptual schema.
➢ The external level is the level closest to the users, at which one describes the part of the database that is of interest to individual users. It has a number of external schemas or external views, one for each group of users.
There are two levels of mapping:
➢ Conceptual/internal mapping: defines the correspondence between the conceptual view and the stored database; it specifies how conceptual records and fields are represented at the internal level.
➢ External/conceptual mapping: defines the correspondence between a particular external view and the conceptual view.
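The role of the two mappings can be tried out on a small scale with Python's built-in sqlite3 module. In the sketch below a base table stands in for the conceptual schema, an SQL view stands in for one external view, and an index stands in for a change at the internal level. The table, view and column names are hypothetical, and SQLite is used only because it ships with Python; the session's own examples use Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Conceptual level: a base table (hypothetical schema).
conn.execute(
    "CREATE TABLE tradesman (tid INTEGER PRIMARY KEY, name TEXT, address TEXT)"
)
conn.execute("INSERT INTO tradesman VALUES (1, 'Lee', '1 High St')")

# External level: a view exposing only what one group of users needs.
conn.execute("CREATE VIEW tradesman_names AS SELECT tid, name FROM tradesman")


def external_rows():
    """What an application written against the external view sees."""
    return conn.execute(
        "SELECT tid, name FROM tradesman_names ORDER BY tid"
    ).fetchall()


before = external_rows()

# Change to the conceptual schema: add a new field.
conn.execute("ALTER TABLE tradesman ADD COLUMN phone TEXT")
# Change at the internal level: add an index on a stored file.
conn.execute("CREATE INDEX idx_tradesman_name ON tradesman(name)")

after = external_rows()
```

The application query against the view returns the same rows before and after both changes, which is the behaviour that logical and physical data independence are meant to guarantee.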
Figure 12 shows a greatly simplified view of a database system.
[Diagram: end users and application programs access the database through the DBMS.]
Figure 12. Simplified picture of a database system
1.5.6 SQL
Data definition
A relation (table) can be defined by using the following statement:
create table <relation_name> (
<attribute_name> <type> [<attribute_constraint>]
{, <attribute_name> <type> [<attribute_constraint>]}
{, <relation_constraint>} ) ;
The <relation_name> has to be unique for an individual user only. The <attribute_name>
has to be unique within the relation. The <type> of an attribute is one of the following (in
Oracle):
Table 2: Common Oracle types
Type            Description
number          a numeric field of maximum 40 digits
number(m, n)    decimal number of m digits, n to the right of the decimal point
number(m)       a numeric field of maximum m digits
char(n)         a character string of fixed length n
varchar(n)      a variable length character string of maximum length n
varchar2(n)     as varchar(n)
date            valid dates
Example:
create table dept (deptid number(4) primary key,
dname varchar(20) not null,
location varchar2(20));
Data manipulation
The data manipulation statements are SELECT, UPDATE, DELETE, and INSERT.
The general form of the SELECT statement:
SELECT [DISTINCT] item(s)
FROM table(s)
[WHERE condition]
[GROUP BY field(s)]
[HAVING condition]
[ORDER BY field(s)] ;
Here are some examples of using SQL.
Query: List the names of bars that Ullman frequently goes to.
SQL:
SELECT Bar
FROM Frequent
WHERE Drinker = 'Ullman';
Query: List the names of bars that Sam or Smith frequently goes to.
SQL:
SELECT Bar
FROM Frequent
WHERE Drinker = 'Sam' or
Drinker = 'Smith';
Query: List drinkers who like both “Miller Lite” and “VB”.
The table is written Likes, as in the relational algebra examples, because LIKE is an SQL keyword.
SQL:
SELECT Drinker
FROM Likes
WHERE Beer = 'Miller Lite'
INTERSECT
SELECT Drinker
FROM Likes
WHERE Beer = 'VB';
But the following is wrong (no beer is named both Miller Lite and VB):
SELECT Drinker
FROM Likes
WHERE Beer = 'Miller Lite'
and Beer = 'VB';
Query: List bars that do not serve “Miller Lite”.
SQL:
SELECT Bar
FROM Frequent
UNION
SELECT Bar
FROM Serves
MINUS
SELECT Bar
FROM Serves
WHERE Beer = 'Miller Lite';
Query: List beers that Ullman drinks (the beers served at the bars Ullman frequently goes to).
SQL:
SELECT DISTINCT Beer
FROM Frequent, Serves
WHERE Drinker = 'Ullman'
and Frequent.Bar = Serves.Bar;
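The queries in this section can be run against the sample data of Figure 9 with Python's built-in sqlite3 module. This is a sketch under a few assumptions that differ from the Oracle setting of the session: SQLite is used as the engine, the third table is named Likes (LIKE is an SQL keyword), and EXCEPT replaces Oracle's MINUS, which SQLite does not provide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Frequent (Drinker TEXT, Bar TEXT);
    CREATE TABLE Serves (Bar TEXT, Beer TEXT);
    CREATE TABLE Likes (Drinker TEXT, Beer TEXT);
""")
conn.executemany("INSERT INTO Frequent VALUES (?, ?)", [
    ("Ullman", "Manuel's"), ("Ullman", "Orchard Night"), ("Ullman", "Faculty Club"),
    ("Ullman", "Dynasty"), ("Sam", "Manuel's"), ("Sam", "Orchard Night"),
    ("Smith", "Dynasty")])
conn.executemany("INSERT INTO Serves VALUES (?, ?)", [
    ("Manuel's", "Miller Lite"), ("Manuel's", "Tiger"), ("Orchard Night", "VB"),
    ("Manuel's", "Qindao"), ("Faculty Club", "Tiger"),
    ("Faculty Club", "Miller Lite"), ("Dynasty", "Anchor")])
conn.executemany("INSERT INTO Likes VALUES (?, ?)", [
    ("Ullman", "Miller Lite"), ("Ullman", "Tiger"), ("Ullman", "Anchor"),
    ("Jones", "Anchor"), ("Sam", "Anchor")])


def column(sql):
    """Run a one-column query and return its values as a set."""
    return {row[0] for row in conn.execute(sql)}


ullman_bars = column("SELECT Bar FROM Frequent WHERE Drinker = 'Ullman'")

sam_or_smith = column(
    "SELECT Bar FROM Frequent WHERE Drinker = 'Sam' OR Drinker = 'Smith'")

both_beers = column("""
    SELECT Drinker FROM Likes WHERE Beer = 'Miller Lite'
    INTERSECT
    SELECT Drinker FROM Likes WHERE Beer = 'VB'""")

# EXCEPT is SQLite's spelling of the set difference written MINUS in Oracle.
no_miller = column("""
    SELECT Bar FROM Frequent
    UNION
    SELECT Bar FROM Serves
    EXCEPT
    SELECT Bar FROM Serves WHERE Beer = 'Miller Lite'""")

# The join condition on Bar is what links Frequent to Serves.
ullman_beers = column("""
    SELECT DISTINCT Beer
    FROM Frequent, Serves
    WHERE Drinker = 'Ullman' AND Frequent.Bar = Serves.Bar""")
```

Without the condition Frequent.Bar = Serves.Bar the last query would range over the Cartesian product of the two tables and return every beer served anywhere, whichever bars Ullman frequents.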
1.6 Overview of computer networks
1.6.1 A brief history of computer networks
A computer network is a collection of computers interconnected through a data network, which enables the computers to exchange information. Computer networks form the basis of a distributed database system. Over the years, public and private data networks have evolved from packet-switching networks operating at tens and hundreds of kilobits per second (kbps), to frame relay networks operating at up to 2 megabits per second (Mbps), and now Asynchronous Transfer Mode (ATM) operating at 155 Mbps or more. Ethernet has evolved from 10 Mbps to 100 Mbps (Fast Ethernet) and now Gigabit Ethernet. Table 3 shows a brief networking history.
Table 3: Milestones of data network development
Year   Milestone
1966   ARPA packet-switching experimentation
1969   First Arpanet nodes operational
1972   Distributed e-mail invented
1973   First non-U.S. computer linked to Arpanet
1975   Arpanet transitioned to Defence Communications Agency
1980   New host added every 20 days
1981   TCP/IP experimentation began
1983   TCP/IP switchover completed
1986   NSFnet backbone created
1990   Arpanet retired
1991   Gopher introduced
1991   WWW invented
1992   Mosaic introduced
1995   Internet backbone privatized
1.6.2 The ISO/OSI reference architecture
The OSI (Open Systems Interconnection) Reference Model was developed by ISO (the International Organization for Standardization) as a model for implementing data communication between cooperating systems. It has seven layers.
➢ Application: provides end-user applications.
➢ Presentation: translates the information to be exchanged into terms that are understood by the end systems.
➢ Session: responsible for the establishment, maintenance, and termination of connections.
➢ Transport: responsible for the reliable transfer of data between end systems.
➢ Network: responsible for routing the data to its destination and for network addressing.
➢ Data link: prepares data in frames for transfer over the link, and performs error detection and correction on frames.
➢ Physical: defines the electrical and mechanical standards and signaling requirements for establishing, maintaining, and terminating connections.
Figure 13 depicts the OSI process and peer-to-peer communication, where SDU represents a service data unit, Hi (i = 2, ..., 7) is the header of layer i, and T2 is the trailer of layer 2.
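[Diagram: at the source node, user data passes down the seven layers. Each layer from 7 to 2 adds its header (H7 to H2) to the service data unit (SDU) received from the layer above, the data link layer also adds the trailer T2, and the physical layer transmits the result as bits. The intermediate node contains only the network, data link and physical layers. The destination node contains all seven layers, and each layer communicates with its peer layer (peer-to-peer communication).]
Figure 13. OSI process and peer-to-peer communication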
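The encapsulation shown in Figure 13 can be mimicked with byte strings. In the sketch below each header Hi is represented by the literal tag "Hi|" and the trailer by "|T2"; these tags and the function names are purely illustrative stand-ins, since real protocol headers are structured binary fields.

```python
LAYERS = [7, 6, 5, 4, 3, 2]  # application layer down to data link layer


def encapsulate(user_data: bytes) -> bytes:
    """Wrap user data on its way down the stack at the source node."""
    unit = user_data
    for layer in LAYERS:
        # The unit handed down from above is this layer's SDU;
        # the layer puts its own header in front of it.
        unit = f"H{layer}|".encode() + unit
        if layer == 2:
            unit += b"|T2"  # the data link layer also appends a trailer
    return unit


def decapsulate(frame: bytes) -> bytes:
    """Strip trailer and headers on the way up the stack at the destination."""
    unit = frame
    for layer in reversed(LAYERS):  # layer 2 first, layer 7 last
        if layer == 2:
            if not unit.endswith(b"|T2"):
                raise ValueError("missing layer 2 trailer")
            unit = unit[: -len(b"|T2")]
        header = f"H{layer}|".encode()
        if not unit.startswith(header):
            raise ValueError(f"missing header of layer {layer}")
        unit = unit[len(header):]
    return unit


frame = encapsulate(b"Data")
```

Each layer at the destination removes exactly what its peer at the source added, which is the sense in which the layers communicate peer-to-peer.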
1.6.3 Internet architecture
The Internet is the largest data network in the world. It is an interconnection of several packet-switched networks and has a layered architecture. Figure 14 shows the comparison of Internet and OSI architectures.
OSI                                   Internet
Application, Presentation, Session    Application
Transport                             Transport
Network                               Internet
Data Link                             Network Access
Physical                              Physical
Figure 14. Comparison of Internet and OSI architectures
The Internet has the following layers:
➢ Network access layer: It relies on the data link and physical layer protocols of the appropriate network, and no specific protocols are defined.
➢ Internet layer: The Internet Protocol (IP) defined for this layer is a simple connectionless datagram protocol. It offers no error recovery, and any packets in error are simply discarded.
➢ Transport layer: Two protocols are defined: the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP). TCP is a connection-oriented protocol that permits the reliable transfer of data between the source and destination end users. UDP is a connectionless protocol that offers neither error recovery nor flow control.
➢ User process layer (application): It describes the applications and technologies that are used to provide end-user services.
Large computer networks such as the Internet are formed with the help of bridges and routers. Bridges and routers are intermediate systems that provide a communication path and perform the necessary relaying and routing functions, so that data can be exchanged between devices attached to different subnetworks in the internet.
➢ Bridge: operates at layer 2 of the OSI architecture and acts as a relay of frames between like networks.
➢ Router: operates at layer 3 of the OSI architecture and routes packets between potentially different networks.
Both bridges and routers assume that the same upper-layer protocols are in use.