31Q5: Lecture 3 Lecture 3: Today Course Review:

advertisement
31Q5: Lecture 3
Lecture 3: Today
Monday, 20 Sep, 12:00, A1
• Website link for my part of the Course:
• http://www.cs.stir.ac.uk/~ahu/31q5/
• ANSI-SPARC Architecture
• Course Review: So far ... (covered by Richard Bland in first 2
• Database Languages
Lectures last week)
– DDL, DML
• Manual File-based Systems
• 4th Generation Languages
• Electronic File-based (De-centralized) Information Systems
• Problems with both of above
• Data Models
• Alternative approach: Database - Database & DBMS
• ANSI-SPARC Architecture: 3-levels, Database Schemas, Mapping,
Data Independence – completed today…
• Relational Data Model - intro.
Slide 44
Database Schemas
Slide 45
Mapping Between Schemas
• The overall description of a database is called the database schema.
• There are three different types of schema corresponding to the three
levels in the ANSI-SPARC architecture.
• The external schemas describe the different external views of the data.
– There may be many external schemas for a given database.
• The conceptual schema describes all the data items and relationships
between them, together with integrity constraints (later).
– There is only one conceptual schema per database.
• The DBMS is responsible for mapping between the three types of
schema (i.e. how they actually correspond with each other).
• It must also check the schemas for consistency.
– Each external schema must be derivable from the conceptual schema.
• Each external schema is related to the conceptual schema by the
external/conceptual mapping.
• This enables the DBMS to map data in the user’s view onto the
relevant part of the conceptual schema.
• A conceptual/internal mapping relates the conceptual schema to the
internal schema.
• At the lowest level, the internal schema contains definitions of the
stored records, the methods of representation, the data fields, and
indexes.
• This enables the DBMS to find the actual record or combination of
records in physical storage that constitute a logical record in the
conceptual schema.
– There is only one internal schema per database.
Slide 46
Slide 47
Example of the Different Levels
External View
External View
Sno
StaffNo
Fname Lname Age Salary
• The two external views are based on the conceptual view.
Lname Bno
– The Age field is derived from the DOB (Date of Birth) field.
External/Conceptual
mapping
view 1
Conceptual Level
Notes on the Example
StaffNo
Fname Lname DOB Salary
– The Sno field is mapped onto the StaffNo field of the conceptual record.
• The conceptual level is mapped onto the internal level.
BranchNo
• The internal level contains a physical description of the structure for
the conceptual record expressed in a high-level language.
Conceptual/Internal
mapping
Internal Level
Struct STAFF {
int StaffNo;
int BranchNo;
char Fname[15];
char Lname[15];
struct date DateOfBirth;
float Salary;
struct STAFF *next; // pointer to next record
};
• Note that the order of the fields in the physical structure is different
from that of the conceptual record.
• The physical structure contains a “pointer”, next. This will be simply
the memory address at which the next record is stored. Thus the set of
staff records may be physically linked together to form a chain.
Slide 48
Slide 49
Data Independence
Data Independence
• A major objective of the ANSI-SPARC architecture is to provide data
independence meaning that upper levels are isolated from changes to
lower levels.
• There are two kinds of data independence:
• Logical data independence refers to the immunity of external schemas
to changes in the conceptual schema.
– Changes to the conceptual schema (adding/removing entities, attributes, or
relationships) should be possible without having to change existing
external schemas or rewrite application programs.
• Physical data independence refers to the immunity of the conceptual
schema to changes in the internal schema.
– Changes to the internal schema (using different storage structures or file
organisations) should be possible without having to change the conceptual
or external schemas.
Slide 50
External
Schema
External
Schema
External/Conceptual
mapping
External
Schema
Logical data
independence
Conceptual
Schema
Conceptual/Internal
mapping
Physical data
independence
Internal
Schema
Slide 51
Database Languages
The Data Definition Language
• A DBMS typically provides a data sub-language with which the
database and its various schemas can be manipulated.
• The DDL is a descriptive language that allows the user to describe and
name the entities required and the relationships that may exist between
the different entities.
• Data sub-languages consist of two parts:
– Data Definition Language (DDL)
• The database schema is specified by a set of definitions expressed in
the DDL.
– Data Manipulation Language (DML)
• The DDL is used to specify the database schema and the DML is used
to both update the database and extract information from it.
• They are called sub-languages because they do not include all of the
facilities that one might expect of a high-level language.
– i.e. There are no loops or conditional statements, etc.
• Thus, many systems allow the sub-language to be embedded in a highlevel language like COBOL, Fortran, Pascal, Ada, or C.
• Most sub-languages also provide interactive commands that can be
entered directly at a terminal and do not require embedding.
• The DDL may be used only to define a schema or modify an existing
one. It cannot be used to manipulate data.
• Theoretically, there could be a different DDL for each type of database
schema (external, conceptual, internal).
• However, in practice, the DBMS typically provides a single
comprehensive DDL that allows specification of at least the external
and conceptual schemas.
Slide 52
The Data Manipulation Language
Slide 53
Fourth Generation Languages (4GLs)
• The DML is a language that provides a set of operations supporting the
manipulation of the data held in the database.
• Data manipulation operations usually include:
– Inserting new data or Deleting old data.
• There is no consensus about what constitutes a 4GL.
• It is essentially a shorthand programming language.
• For our purposes, 4GLs are generally non-procedural in nature.
• That is, they allow the user to specify what must be done without
saying how it should be done.
– Modifying or Retrieving existing data.
• There are two types of DML which are distinguished by their
underlying data retrieval constructs:
• With procedural DMLs, the programmer specifies what data is
required and how to obtain it.
– This is much different to using a conventional language like Java (a 3GL)
in which you have to specify how to do things.
– With a 4GL, the how part is determined by the system.
• Thus, 4GL tools are usually generators or wizards of some sort.
– In this case, information retrieval is rather like writing a program.
• With non-procedural DMLs, the user specifies what data is to be
retrieved in a single statement without specifying how the data should
be obtained.
– In this case, the DBMS is responsible for translating the DML statement
into an optimal series of data manipulation operations.
Slide 54
– It is a bit like having your program written automatically.
• Examples include:
– Form/Report generators
– Application generators
Slide 55
Data Models
Data Models
• A database schema is usually expressed using the DDL of a particular
DBMS.
• However, this type of language is too low-level to describe the data
requirements of an organisation in a readily understandable manner.
• A data model comprises three components:
– A structural part, consisting of a set of rules according to which databases
can be constructed.
– A manipulative part, defining the types of operations that are allowed on
the data.
– Possibly a set of integrity rules, which ensure that the stored data is
accurate.
• People require a higher-level description of the schema that is
organised using the concepts of a particular data model.
• Many data models have been proposed over the years.
• A data model is an integrated collection of concepts for describing
data, relationships between data, and constraints on the data in an
organisation.
– Some are used to describe data at the external and conceptual levels, while
others describe data at the internal level.
– Some have greater success than others in hiding from end-users the
underlying details of the physical storage of data.
Slide 56
Data Models
Slide 57
The Relational Data Model
• Examples:
• Textbook (Connolly & Begg): Chapter 3
– Network Model
• The relational data model was first proposed by Edward Codd in a
paper written in 1970.
– Hierarchical Model
– Relational Model
• The relational model has a sound theoretical foundation, which is
lacking in some of the other models we will study.
• (Entity-Relationship Model)
– Object-Oriented Model
• The objectives of the relational model are:
– Object-Relational Model
• The first two are older than the others, and by far the majority of
database systems these days are based on the relational model.
– To allow a high degree of data independence. Users’ interactions with the
database must not be affected by changes to the internal view of data,
particularly record orderings and access paths.
• The last two are currently of interest, and are increasingly being used.
As their names suggest, they incorporate the object-oriented approach
to data representation.
– To provide substantial grounds for dealing with the problems of data
semantics, consistency, and redundancy. In particular, Codd’s paper
introduces the concept of normalised relations.
• We concentrate on the relational model.
– To enable the expansion of set-oriented data manipulation languages.
Slide 58
Slide 59
The History of the Relational Data Model
Relations
• One of the most significant implementations of the relational model
was System R which was developed by IBM during the late 1970’s.
• The relational model is based on the mathematical concept of a
relation.
• System R was intended as a “proof of concept” to show that relational
database systems could really be built and work efficiently.
– Since Codd was a mathematician, he used terminology from that field, in
particular set theory and logic.
• It gave rise to two major developments:
– We will not deal with these concepts as such, but we need to understand
the terminology as it relates to the relational data model.
– A structured query language called SQL which has since become an ISO
standard and de facto standard relational language.
– The production of various commercial relational DBMS products during
the 1980’s such as DB2, SQL/DS, and ORACLE.
• There are now several hundred commercial relational database systems
for both mainframes and microcomputers.
• A relation is represented as a two-dimensional table containing rows
and columns (much like a spreadsheet).
• Relations are used to hold information about the entities to be
represented in the database.
– The rows correspond to individual records.
– The columns correspond to attributes or fields.
• In the labs we will use either Oracle or Microsoft Access.
– The order of the attributes is unimportant. They can appear in any order
and the relation will remain the same.
Slide 60
Basic Terminology
Basic Terminology
• Here is an example of the Branch entity represented as a relation.
Primary
Key
Record
Bno
B5
B7
B3
B4
B2
Street
22 Deer Rd
16 Argyll St
163 Main St
32 Manse Rd
56 Clover Dr
• An attribute is a named column of a relation.
– Correspondingly, a relation comprises one or more named columns
representing the attributes of a particular entity type.
Relation name
Attribute
Area
Sidcup
Dyce
Partick
Leigh
BRANCH
City
London
Aberdeen
Glasgow
Bristol
London
Postcode
SW1 4EH
AB2 3SU
G11 9QX
BS99 1NZ
NW10 6EU
Tel_No
0171-886-1212
01224-67125
0141-339-2178
0117-916-1170
0181-963-1030
Slide 61
• Each attribute has an associated domain, i.e. the set of allowable
values.
Cardinality
(number of
rows)
– A domain is similar to a data type.
– For example, the domain of the Street attribute is the set of all street names
in Britain.
• The rows of a relation are called records, or more formally, tuples.
Degree (number of columns)
• As you can see, each column contains the values of a single attribute.
• For example, the Bno column (the primary key) contains only the
numbers of existing branches.
– Each record/tuple represents one instance of a particular entity type.
– In the BRANCH relation, each record contains six values, one for each
attribute, representing the information about a particular branch.
– The order of the rows of a relation is not important. The rows can appear
in a different order and the relation remains the same.
Slide 62
Slide 63
Basic Terminology
Properties of Relations
• The degree of a relation is the number of attributes it contains.
– For example, the BRANCH relation has six attributes so it has degree 6.
– The minimum degree for a valid relation is 1.
– so we can have two attributes called Name in separate relations, but not in
the same relation.
• The cardinality of a relation is the number of records it contains.
– For example, the BRANCH relation has five rows so it has cardinality 5.
– A valid relation may have a cardinality of 0.
• A relational database is a collection of normalised relations.
• The order of attributes within a relation has no significance.
– We will cover the topic of normalisation later…
– i.e. if we re-order the columns of a relation it does not become a different
relation.
• Summary of corresponding terms:
– Tuple
⌦
– Attribute ⌦
⌦
File
Row
⌦
Record
Column ⌦
• The values of an attribute are all from the same domain.
– i.e. we should not allow a postcode to appear in a salary column for
example.
– i.e. we may have a relation that has no records.
Table
– i.e. No two relations may have the same name.
• The name of an attribute is unique only within its relation.
– i.e. all relations must have at least one attribute.
– Relation ⌦
• The name of a relation is unique.
• The order of rows within a relation has no significance.
– i.e. if we re-arrange the rows of a relation it does not become a different
relation.
Field
Slide 64
Slide 65
Properties of Relations
Keys
• Each cell of a relation may contain at most one value.
• We need to be able to uniquely identify each row in a relation by the
values of its attributes.
– For example, we cannot store two phone numbers in the same cell.
– Relations that satisfy this property are said to be normalised.
– This enables a particular row to be retrieved or related to other records.
– In particular, they are in first normal form.
– We use relational keys for this purpose.
• Firstly, a superkey is an attribute or set of attributes that uniquely
identifies a particular row within a relation.
• Each record within a relation is distinct.
– i.e. if we examine the values in each row, no two rows have exactly the
same values (no duplicates).
– Therefore, two rows in a relation must differ in the value of at least one
attribute.
• Using this definition and the BRANCH relation as an example, the
following would qualify as superkeys:
– (Bno)
– (Bno, Street, Area)
– Note that database systems do not generally enforce this property.
– (Bno, Postcode, Tel_No)
– In particular, Microsoft Access allows relations to contain duplicate
records.
• In other words, any combination of attributes that contains Bno would
be a superkey.
Slide 66
Slide 67
Problem with Superkeys
Candidate Keys
• The problem with superkeys is that they may contain attributes that are
not strictly required for unique identification.
• For example:
– (Bno, Street, Area)
Street and Area are not required.
– (Bno, Postcode, Tel_No)
Postcode and Tel_No are not required.
• We define a candidate key to be a minimal superkey.
– A superkey is minimal if removing any attributes prevents it from
providing unique identification.
• For example:
– (Bno) is a candidate key for the BRANCH relation but (Bno, Postcode)
is not.
• Ideally, we are interested in the superkeys that contain only the
attributes necessary for unique identification.
• Note that a candidate key may involve more than one attribute and if
so is referred to as a composite key.
Slide 68
Properties of Candidate Keys
Slide 69
Finding Candidate Keys
• A candidate key, K, for a relation R has the following properties:
• An instance of a relation cannot be used to prove that an attribute or
combination of attributes is a candidate key.
• Uniqueness:
• In other words, we cannot examine a relation and decide, for example,
that Postcode is a candidate key simply because no two rows share the
same post code.
– In each row of R, the values of K uniquely identify that row.
– Therefore, no two rows of R can have the same values for K.
• Irreducibility:
• The fact that there are no duplicates at a particular moment in time
does not guarantee that duplicates are not possible.
– No proper subset of K has the uniqueness property.
– Therefore, K cannot consist of fewer attributes.
• This definition is simply a formal way of stating what we know about
candidate keys.
• Some relations may have several candidate keys.
• However, the presence of duplicates may be used to show that a certain
attribute or combination of attributes is not a candidate key.
• Therefore, to correctly identify a candidate key, we need to be aware of
the meaning of attributes in the real world and think about whether
duplicates could arise for a given choice of key.
– e.g. A Staff relation may have StaffNo and NIN as candidate keys.
Slide 70
Slide 71
Primary Keys
Foreign Keys
• For a given relation, the primary key is one of the candidate keys. It is
chosen from the candidate keys in order to uniquely identify records
within the relation.
• Since a relation cannot have duplicate rows (by definition), it is always
possible to uniquely identify each row.
– This means that every relation has at least one candidate key.
– Hence, a primary key can always be found.
• In the worst case, the entire set of attributes could serve as the primary
key, but usually some smaller subset is sufficient.
• However, since many database systems allow relations to contain
duplicates, this theoretical property does not necessarily apply in
practice.
• We are often best served by introducing an artificial key attribute if a
natural primary key cannot be found. Perhaps simply a record number.
Slide 72
• When an attribute appears in more than one relation, its appearance
usually represents a relationship between records of the relations.
• For example, consider this simple example involving the relationship
between Staff and Departments within the University:
Sno
SG86
SP52
SJ12
SQ63
ST AFF
Name
DeptNo
David Hulse
31
Paul Kingston
49
Michael Smith
55
Alan Dearle
31
DEPART M ENT
DeptNo
Name
NumRooms
31
Computing Science
18
49
Management
15
55
Basket-W eaving
3
• The rows in these relations are linked via the DeptNo field.
• When the primary key of one relation appears as an attribute in another
relation it is called a foreign key.
Contd. in next lecture
Slide 73
Download