
CMPSC 431W - Notes

M 3.1: Database Design Flow
When examining the pieces which go into database design and implementation, it is useful to keep in mind the overall flow of the process.
• Requirements collection and analysis
  o This important step is often downplayed in courses due to the necessity of focusing on the main concepts of the course. This should in no way minimize the importance of this step.
• Conceptual Design
  o The data requirements from the requirements collection are used in the conceptual design phase to produce a conceptual schema for the database.
• Functional Analysis
  o In parallel with the conceptual design, functional requirements are used to determine the operations which will need to be applied to the database. This process will not be discussed here as it is covered in other courses.
• Logical Design
  o Starts implementation; the conceptual schema is transformed from a high-level data model to the implementation data model; produces a database schema.
• Physical Design
  o Specifies internal storage, file organizations, indexes, etc. Produces the internal schema. This type of design is important for larger databases, but is not needed when working with relatively small databases. Physical design will not be covered in this course.
This module presents basic ER model concepts for conceptual schema design.
M 3.2: The Building Blocks of the ER
Model
In the Entity-Relationship (ER) Model, data is described as entities, relationships,
and attributes. We will start by looking at entities and attributes.
An entity is the basic concept that is represented in an ER model. It is a thing or object in the
real world. An entity may be something with a physical existence, such as a student or a
building. An entity may also be something with a conceptual existence, such as a bank account
or a course at a university. The concept of an entity is similar to the concept of an object in C++
or Java.
Each entity has attributes, which are properties which describe the entity.
Attributes
There are many choices for attributes for the student entity. Some you might have considered
include name, age, height, weight, gender, hair color, eye color, photo, social security number,
student id, home address, home telephone, cell phone number, campus (or local) address,
campus telephone, major, class rank, spouse name, how many children (and names), whether
working on campus, car make and model, car license, etc.
There are, of course, many additional attributes which are not listed above. You might have
listed some of these additional attributes. Since a database is only representing some aspect of
the real world (sometimes referred to as a miniworld in the text), not all attributes we can
think of will be included in the database, but only those required for the miniworld the
database is representing. Which attributes to include is determined during this design step and is directed by the requirements.
There are several types of attributes that can be used in an ER model. The types can be classified as follows:
• Composite vs. Simple (Atomic) Attributes: A composite attribute can be divided into parts which represent more basic attributes. An example of this is an address, which can be subdivided into street, city, state, and zip (of course there are additional ways to subdivide an address). If an attribute cannot be further subdivided, it is known as a simple or atomic attribute. If there is no need to look at the components individually (such as the city part of an address), then the attribute (address) can be treated as a simple attribute.
• Single-Valued vs. Multivalued Attributes: Most attributes can take only one value. These are called single-valued attributes. An example of a single-valued attribute would be the social security number of a student. An example of a multivalued attribute would be the names of dependents.
• Stored vs. Derived Attributes: When two or more attribute values are related, it is sometimes possible to derive the value of one given the value of the other. For example, if age and birth date are both attributes of an entity (such as student), age can be stored directly in the database, in which case it is a stored attribute. Rather than being stored directly in the database, age can also be calculated from the birth date and the current date. If this option is chosen, age is calculated from (or derived from) the birth date and is known as a derived attribute.
• NULL Value: In some cases, an attribute does not have an applicable value. When this is true, a special value, called the NULL value, is used. This can be used in a few cases. The first is when the value does not exist, for example the SSN of a spouse when the person is not married. The second is when it is known that the value exists, but is missing, such as a person's unknown birth date. Finally, NULL is used when it is not known whether the value exists, for example the license plate number of a person's car (since the person may not own a car).
• Complex Attributes: Composite and multivalued attributes can be nested, e.g., Address(Street_Address(Number, Street, Apartment), City, State, Zip). These are called complex attributes.
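Although the ER model itself has no code form, it can help to preview how these attribute types tend to surface once the design reaches a relational schema (covered in later modules). The following SQL sketch is illustrative only; the table and column names (student, birth_date, license_plate, and so on) are assumptions made for this example, and the date arithmetic shown is PostgreSQL-style.

    -- Illustrative sketch only; names and types are assumptions, not from the text.
    CREATE TABLE student (
        student_id     CHAR(5) PRIMARY KEY,
        -- a composite attribute (Address) stored as its simple component parts
        street         VARCHAR(60),
        city           VARCHAR(40),
        state          CHAR(2),
        zip            CHAR(5),
        -- a stored attribute; Age would be derived from it when needed
        birth_date     DATE,
        -- an attribute that may legitimately be NULL (the student may not own a car)
        license_plate  VARCHAR(10)
    );

    -- A multivalued attribute (names of dependents) cannot sit in a single column;
    -- it is usually pulled out into its own table.
    CREATE TABLE student_dependent (
        student_id      CHAR(5) REFERENCES student(student_id),
        dependent_name  VARCHAR(60),
        PRIMARY KEY (student_id, dependent_name)
    );

    -- Age as a derived attribute, computed from birth_date at query time
    -- (PostgreSQL-style date functions; other systems differ).
    SELECT student_id,
           EXTRACT(YEAR FROM AGE(CURRENT_DATE, birth_date)) AS age
      FROM student;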
Entity Types and Entity Sets
An entity is defined by its attributes. This is the type of data we want to collect about the entity.
As a very simple example, for a student entity, we may want to collect name, student id, and
address for each student. This would define the entity type, which can be thought of as a
template for the student entity. Each individual would, of course, have his/her own values for
the three attributes. The collection of all individual entities (i.e. the collection of all
entity instances) of a particular entity type in the database at any point in time is known as
an entity set or entity collection.
Keys and Domains
Keys
One important constraint that must be met by an entity type is called
the key constraint or uniqueness constraint. An entity type will have one attribute (or a
combination of attributes) whose values are unique for each individual entity in the entity set.
Such an attribute is called a key attribute. For example, in a student entity, a given student id
can be used as a key since the id will be assigned to one and only one student. The values of a
key attribute can be used to uniquely identify an entity instance. This means that no two entity
instances may have the same value for the key attribute. We will discuss keys further in this and
later modules.
Value Sets (Domains) of Attributes
Each simple attribute of an entity type is associated with a value set (or domain of values). This
indicates the set of allowable values which may be assigned to that attribute for each entity
instance. Value sets are normally not displayed on basic ER diagrams. For example, if we want
to include the weight of a student in the database, we may indicate that the weight will be in
pounds and we will store the weight rounded to the nearest integer. The integer values may
range from 50 to 500 pounds. Most basic domains are similar to the data types such as integer,
float, and string contained in most programming languages.
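As a hedged preview of how a key constraint and a value set might eventually be written down, here is a small SQL sketch. The table and column names and the weight range simply reuse the student/weight example above; they are not part of the formal ER notation.

    -- Illustrative sketch: key attribute and value set (domain) in SQL terms.
    CREATE TABLE student (
        student_id  CHAR(5) PRIMARY KEY,    -- key attribute: no two rows may share a value
        name        VARCHAR(60),
        weight_lbs  INTEGER CHECK (weight_lbs BETWEEN 50 AND 500)  -- value set: integers from 50 to 500
    );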
M 3.3: Initial Design of Database
As indicated earlier, the database design starts with a set of requirements. We will work
through a simple example where we build an ER diagram from a set of requirements.
A simple set of requirements for the example is:
This database will contain data about certain aspects of a university.
1. The university needs to keep information about its students. It wants to store the
university assigned ID of the student. This is recorded as a five-digit number. The
university also wishes to keep the student's name. At this time, the name can be
stored as a full name, in first name-last name order. There is no need to break the
name down any further. The major program of the student needs to be recorded
using the five letter code assigned by the university. The student's class rank also
needs to be kept as a two-letter code (FR for freshman, SO for sophomore, etc.).
Finally, the university wants to keep track of the academic advisor assigned to the
student. The advisor will be a faculty member in the student's department and will
be identified by the faculty ID assigned to the advisor.
2. The university also needs to keep information about its faculty. It wants to store the
university assigned ID of the faculty member. This is recorded as a four-digit
number. The faculty member's name must also be kept and like the name of a
student it will be stored as a full name and will not need to be broken down any
further. Finally, each faculty member is assigned to a department and this will be
stored as a five letter code, which also represents the major program of a student.
3. Information about each department will also be kept. This includes the assigned
code which represents the department, the full department name, and a code for
the building in which the department office is located.
4. Course information must also be kept. This includes the course number which
consists of the code for the department offering the course and the number of the
course within the department. This is stored as a single item, such as CMPSC431.
The course name will also be kept. This name will be as shown in the university catalog
and may include abbreviations, such as Database Mgmt Sys. The credit hours earned
for the course will be kept, as will the code for the department offering the course. The code will most often be the first part of the course number, but the department code will be repeated as a separate item.
5. Finally, information must be maintained about each section of a course. A section ID
will be assigned to each section offered. This is a six digit number representing a
unique section - a specific offering of a specific course. For each section, the course
number, the semester offered, and the year offered must also be stored.
These requirements are admittedly not as comprehensive as they would be had a full
requirements study been done, but they will suffice for our simple example.
The next step is to use the requirements to determine the entities which should be included in
the design. For each entity, the attributes describing the entity should then be listed.
After this, the first part of the ER diagram will be drawn. Take a few minutes to develop the list
of entities and associated attributes on your own.
Then watch the video where we will work through this together and introduce the first part of
the ER diagram.
I encourage you to get in the habit of writing a "first draft" of your ER Diagram. Then take a
look at the draft and determine how the positioning and spacing of the diagram looks. Then
note modifications and adjustments you want to make to the visual presentation of the
diagram. Then draw a "clean" copy of the diagram, possibly using a tool such as Visio.
Sometimes you can draw a first draft followed by the final diagram. Other times you may need
to work through several drafts before producing the final copy.
Below is a final copy of the diagram for the draft copy shown on the white board in the above
video. It also includes a few additional points concerning ER diagrams.
ER Diagram and Additional Notes
Figure 3.1: ER Diagram - Part 1
There are a few things to note here. As I suggested, I took my initial draft from the white board
and adjusted it when I created the diagram. Also, I cheated a bit since I knew where the
diagram was heading in the next step and "prearranged" the placement here. Normally this
step and the next are taken together when drawing the draft diagram, so the initial draft would
include more than this draft did.
Note that in this diagram attribute names are reused for different entities (ID is an example of
this). This works fine from a "syntax" standpoint, but is not encouraged unless the ID values are
the same. In this case they are not, and it is better in most cases to give them unique names
such as Student_id, Faculty_id, etc.
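To make the unique-naming suggestion concrete, here is a hedged preview of what the five entities might look like once mapped to tables, using distinct key names and the formats stated in the requirements (five-digit student ID, four-digit faculty ID, five-letter department code, six-digit section ID). Column names and types are assumptions for illustration; the actual mapping is covered in a later module.

    -- Illustrative preview only; relationships are added in the next step.
    CREATE TABLE department (
        dept_code  CHAR(5) PRIMARY KEY,      -- five-letter code
        dept_name  VARCHAR(60),
        building   VARCHAR(10)               -- building code; length is an assumption
    );

    CREATE TABLE student (
        student_id  CHAR(5) PRIMARY KEY,     -- five-digit number
        name        VARCHAR(60),             -- full name, first name-last name order
        major       CHAR(5),                 -- department code
        class_rank  CHAR(2),                 -- FR, SO, ...
        advisor_id  CHAR(4)                  -- faculty ID of the advisor
    );

    CREATE TABLE faculty (
        faculty_id  CHAR(4) PRIMARY KEY,     -- four-digit number
        name        VARCHAR(60),
        dept_code   CHAR(5)
    );

    CREATE TABLE course (
        course_number  VARCHAR(10) PRIMARY KEY,   -- e.g. CMPSC431
        course_name    VARCHAR(60),
        credit_hours   INTEGER,
        dept_code      CHAR(5)
    );

    CREATE TABLE section (
        section_id     CHAR(6) PRIMARY KEY,       -- six-digit number
        course_number  VARCHAR(10),
        semester       VARCHAR(6),
        year           INTEGER
    );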
As a reminder, the following conventions are used:
• Entities are placed in a rectangular box. The entity name is included inside the box in all capitals.
• Attributes are placed in an oval. The attribute name is included inside the oval with only the first letter capitalized. If multiple words are used for clarity, they are separated by underscores. Words after the first are not capitalized. Attribute names are underlined for attributes which are the key (or part of the key) for the entity.
• Attributes are connected to the entity they represent by a line.
M 3.4: Relationship Types,
Relationship Sets, Degree, and
Cardinality
When looking at the preliminary design of entity types, it can be seen that there are some
implicit relationships among some of the entity types. In general, when an attribute of one
entity type refers to another entity type, some relationship exists between the two entities. In
the example database, a faculty member is assigned to a department. The department attribute
of the faculty entity refers to the department to which the faculty member is assigned. In this
module, we will work through how to categorize these relationships and then how to represent
them in an ER diagram.
In a pattern similar to entity types and entity instances, the term relationship type refers to a
relationship between entity types. The term relationship set refers to the set of all relationship
instances among the entity instances.
The degree of a relationship type is the number of entities which participate in the relationship.
The most common is a relationship between two entities. This is a relationship of degree two
and is called a binary relationship. Also fairly common are relationships between three entities
(a relationship of degree three), which is called a ternary relationship. This and other higher-degree relationships tend to be more complex and will be further discussed in the next module.
We will also discuss in the next module the possibility of a relationship between an entity and
itself. This is often called a unary relationship, and is referred to in the text as
a recursive or self-referencing relationship.
For the rest of this module, we will focus only on binary relationships. Relationship types normally
have constraints which limit the possible combinations of entities that may participate in the
relationship set. For example, the university may have a rule that each faculty member is placed
administratively in exactly one department. This constraint should be captured in the schema.
The two main types of constraints on binary relationships are cardinality and participation.
The cardinality ratio for a binary relationship is the maximum number of entity instances which
can participate in a particular relationship. In the above example of faculty and departments,
we can define a BELONGS_TO relationship. This is a binary relationship since it is between two
entities, FACULTY and DEPARTMENT. Each faculty member belongs to exactly one department.
Each department can have several faculty members. For the DEPARTMENT:FACULTY
relationship, we say that the cardinality ratio is 1:N. This means that each department has
several faculty (the "several" is represented by N, indicating no maximum number), while each
faculty belongs to at most one department. The cardinality ratios for binary relationships are
1:1, 1:N, and M:N. Note that the text also shows N:1, but this is usually converted to a 1:N by
reversing the order of the entities.
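As a hedged preview of where a 1:N cardinality ratio eventually leads, the BELONGS_TO relationship is typically realized by placing a foreign key on the "N" side. Table and column names here are assumptions for illustration; the formal mapping rules come in a later module.

    -- Illustrative sketch of the 1:N DEPARTMENT:FACULTY relationship (BELONGS_TO).
    CREATE TABLE department (
        dept_code  CHAR(5) PRIMARY KEY,
        dept_name  VARCHAR(60)
    );

    CREATE TABLE faculty (
        faculty_id  CHAR(4) PRIMARY KEY,
        name        VARCHAR(60),
        dept_code   CHAR(5) REFERENCES department(dept_code)  -- each faculty row points to one department
    );
    -- Many faculty rows may carry the same dept_code (the "N" side),
    -- but each faculty row carries at most one dept_code (the "1" side).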
Note that so far we have considered maximum number of entity instances which can
participate in a relation. We will wait until the next module to discuss whether it makes sense
to talk about a minimum number of instances which can participate.
An example of 1:1 binary relationship would be if we wished to keep the department chair in
the database. To simplify, we will specify that the chair is a faculty member of the department,
which is true in the vast majority of the cases. In this case, the faculty would be the chair of at
most one department, and the department would have at most one chair. 1:1 binary
relationships are explored in more depth in the next sub-module.
If the relationship between two entities is such that an instance of entity one may be related to several instances of entity two and also that an instance of entity two may be related to several instances of entity one, we say that this is an M:N binary relationship. The use of M and N, rather
than using N twice, indicates that the number of related instances need not be the same in
both directions. We will look more closely at M:N relationships in the next module.
The next step is to examine the entities and their attributes to determine any relationships
which exist among the entities. For each relationship, the degree of the relationship should be
determined, as should the cardinality ratio of the relationship.
After this, the next part of the ER diagram will be drawn. Take a few minutes to develop the list
of relationships on your own. Be sure to determine both the degree and cardinality ratio. To
keep this simple at first, we will look at only 1:1 and 1:N binary relationships.
Then watch the video where we will work through this together and add the next part of the ER
diagram.
Hopefully this video reinforces the suggestion to develop the habit of writing a "first draft" of
your ER Diagram. The positioning and spacing are less of an issue when beginning work with only
the entities and attributes as we did in the earlier video. When you begin to include
relationships, the position of the entities becomes important to permit the relationships to be
added while keeping a clean look to the diagram. Note that the overall positioning of the
entities was not too bad in the video, but it would benefit from minor adjustments. The spacing
between entities in the draft developed in the video was not sufficient in a few cases. The
relationship diamond was too cramped between entity boxes. Making the lines longer would
make the cardinality labels (the 1 and the N) much easier to read.
The choice of names is not always straightforward. We will discuss some guidelines to consider
and naming suggestions in the next module.
Below is a final copy of the diagram for the draft copy shown on the white board in the above
video. The attributes are also included in the diagram below.
ER Diagram and Additional Notes
Figure 3.2: ER Diagram - Part 2
There are a few things to note with this diagram also. The first item is that on the board I used
lower case with an initial capital when I put the names in the relationship boxes. This does not
follow the convention used in the text where relationship names are entered in all caps. I
followed the text convention when I drew the final copy of the ER diagram.
Next, binary relationship names are normally chosen so that the ER diagram is read from left to
right and top to bottom. To make the diagram consistent with this guideline, I changed the
name of the relationship between student and department from MAJORS to MAJORS_IN.
Notice that a binary relationship can be "read" in either direction with a slight modification of
the name. Since the SECTION entity was below the COURSE entity in the diagram on the board,
I used the name of OFFERING for the relationship. Since I moved SECTION to the left of COURSE
in the final diagram, I changed the relationship name to OFFERING_OF since a section is a
specific offering of a course. Finally, I was going to change the name of the relationship
between DEPARTMENT and COURSE to OFFERS to make it read from top to bottom. I then
realized that I did not like the use of a form of "offer" twice, so I changed the name of the
relationship to OWNS.
Also note that the diagram on the board was somewhat cramped. The cardinality ratios were
appropriately placed on the participating edge, but it did not clearly show that it is customary
to place the cardinality ratio closer to the relationship diamond than to the entity rectangle.
As a reminder, the new conventions are listed first below. They are followed by conventions we have used before.
• Relationships are placed in a diamond. The relationship name is included inside the diamond in all capitals.
• For a binary relationship, the cardinality ratio value is placed on the edge between the entity rectangle and the relationship diamond. It is placed closer to the relationship diamond.
• Entities are placed in a rectangular box. The entity name is included inside the box in all capitals.
• Attributes are placed in an oval. The attribute name is included inside the oval with only the first letter capitalized. If multiple words are used for clarity, they are separated by underscores. Words after the first are not capitalized. Attribute names are underlined for attributes which are the key (or part of the key) for the entity.
• Attributes are connected to the entity they represent by a line.
M 3.5: An Example of a 1:1 Binary
Relationship
In the previous sub-module we proposed an example of 1:1 binary relationship. The example
considered a desire to keep the department chair in the database. We added the simplification
that the chair is a faculty member of the department, which is true in the vast majority of the
cases. In this case, the faculty would be the chair of at most one department, and the
department would have at most one chair. We can call the relationship CHAIR_OF. Another
simplification is that each department has a chair (it is possible that the spot is vacant, but we
will ignore that for now). Of course, there would be many faculty who are not chair, but this
does not invalidate the 1:1 relationship.
This relationship will not be added to the ER diagram from the last sub-module. The ER diagram
below shows how this part would be included. It includes the new CHAIR_OF relationship with
the BELONGS_TO relationship included for context.
No new symbols are needed for this relationship. The cardinality ratio is shown as before, but in
this case both values are 1.
Figure 3.3 ER Diagram Showing 1:1 Binary Relationship
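For comparison with the 1:N case, here is a hedged sketch of one common way the 1:1 CHAIR_OF relationship could be realized: a foreign key on DEPARTMENT declared UNIQUE. The names are assumptions for illustration, not part of the ER notation.

    -- Illustrative sketch of the 1:1 CHAIR_OF relationship.
    CREATE TABLE faculty (
        faculty_id  CHAR(4) PRIMARY KEY,
        name        VARCHAR(60)
    );

    CREATE TABLE department (
        dept_code  CHAR(5) PRIMARY KEY,
        dept_name  VARCHAR(60),
        chair_id   CHAR(4) UNIQUE REFERENCES faculty(faculty_id)
        -- UNIQUE: a faculty member chairs at most one department;
        -- a single column value: a department has at most one chair.
        -- NOT NULL could be added to require that every department has a chair.
    );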
M 4.1: Continued Design of a
Database
For our example in the last module, we worked from a set of requirements which produced a
conceptual design which contained several entities and their associated attributes. It also
contained several 1:N relationships. Here, we expand the example. This starts by expanding the
requirements and then by adding the necessary parts to the design and ER diagram.
An expanded set of requirements for the example is given below. Note that only point #6 has been added; the rest of the requirements remain the same.
This database will contain data about certain aspects of a university.
1. The university needs to keep information about its students. It wants to store the
university assigned ID of the student. This is recorded as a five-digit number. The
university also wishes to keep the student's name. At this time, the name can be
stored as a full name, in first name-last name order. There is no need to break the
name down any further. The major program of the student needs to be recorded
using the five letter code assigned by the university. The student's class rank also
needs to be kept as a two-letter code (FR for freshman, SO for sophomore, etc.).
Finally, the university wants to keep track of the academic advisor assigned to the
student. The advisor will be a faculty member in the student's department and will
be identified by the faculty ID assigned to the advisor.
2. The university also needs to keep information about its faculty. It wants to store the
university assigned ID of the faculty member. This is recorded as a four-digit
number. The faculty member's name must also be kept and like the name of a
student it will be stored as a full name and will not need to be broken down any
further. Finally, each faculty member is assigned to a department and this will be
stored as a five letter code, which also represents the major program of a student.
3. Information about each department will also be kept. This includes the assigned
code which represents the department, the full department name, and a code for
the building in which the department office is located.
4. Course information must also be kept. This includes the course number which
consists of the code for the department offering the course and the number of the
course within the department. This is stored as a single item, such as CMPSC431.
The course name will also be kept. This name will be as shown in the university catalog and may include abbreviations, such as Database Mgmt Sys. The credit hours earned for the course will be kept, as will the code for the department offering the course. The code will most often be the first part of the course number, but the department code will be repeated as a separate item.
5. Information must be maintained about each section of a course. A section ID will be
assigned to each section offered. This is a six digit number representing a unique
section - a specific offering of a specific course. For each section, the course number,
the semester offered, and the year offered must also be stored.
6. Finally, a transcript must be kept for each student. This must contain data which includes the ID of the student, the ID of the section, and the grade earned by the student in that section. Of course, there will be several instances for each student - one for each section that was taken by the student.
Although these requirements are slightly expanded, they are still not comprehensive. Again,
they will suffice for our expanded, but still relatively simple example.
Since the earlier requirements were not modified, and the only change was an additional
requirement (requirement #6), the next step is to use the modified requirements to determine
if any of the original entities and their attributes need to be modified. If so, the modifications
should be noted. Then any new entities which should be included in the design should be
noted. For each new entity, the attributes describing the entity should then be listed.
After this, an expanded ER diagram will be drawn. Take a few minutes to develop the list of
modifications as well as new entities and associated attributes on your own.
Then keep this list handy as you move to the next sub-module.
M 4.2: The M:N Relationship
Hopefully, as you looked at the new set of requirements, you realized that with the first five
requirements remaining unchanged, the entities and attributes from the last module will not
need to be modified. The relationships identified in the last module will also not need to be
modified. That leads to the sixth (the new) requirement. Will it dictate changes to any of the
earlier work with entities or attributes? Will new entities, attributes, and/or relationships need
to be created?
At first it seems that a new entity, TRANSCRIPT, should be created. It will have attributes
STUDENT_ID, SECTION_ID, and GRADE. After noting that a grade is associated with a specific student in a specific section, we realize that a student will be in many sections while in school and each section will consist of many students.
This leads to the realization that TRANSCRIPT is different. If STUDENT ID is chosen as the key
(remember that a key must be unique), SECTION_ID and GRADE would need to become
multivalued attributes since each student would have many sections and grades associated
with the STUDENT_ID. If SECTION_ID is chosen as the key, STUDENT_ID and GRADE would need
to become multivalued attributes since each section would have many students and grades
associated with the SECTION_ID. We will see in the next module that the relational model does
not allow attributes to be multivalued. Since that is the case, neither STUDENT_ID nor
SECTION_ID can be the key for a TRANSCRIPT entity. This will be further discussed in the
module on normalization later in the course.
Since a given student will take many sections over his/her time at the university, and since a
given section will contain many students, the relationship between student and section is
many-to-many. A many-to-many relationship is called an M:N relationship. Based on this, rather
than being an entity like we saw in the last module, the TRANSCRIPT "entity" is actually formed
from the relationship between the STUDENT entity and the SECTION entity. It will be shown as
a relationship (diamond) in the ER diagram.
What about the GRADE attribute? It will not be an attribute of the STUDENT entity since a given
student will have many grades. This would lead to GRADE being a multivalued attribute.
Similarly GRADE will not be an attribute of the SECTION entity since each section having many
students would again lead to GRADE being a multivalued attribute. It turns out that GRADE is an
attribute of the relationship. As such, the GRADE attribute will be shown in the ER diagram in an
oval which is connected to the diamond containing the TRANSCRIPT relationship. Note that
TRANSCRIPT is not a good name for a relationship. Even though I used that name in the video, a
more appropriate name for a relationship is used in the ER diagram presented after the
video. We will see in later modules how this relationship will actually become an entity in the
schema design.
What about keys for this relationship? Since it is represented in the diagram as a relationship with a
single attribute, GRADE, there is no actual key that is identified for the relationship (note that
GRADE does not qualify). As such, the GRADE is not underlined in the ER diagram. Later, as we
develop a schema design, we will develop this as an entity with the joint key of STUDENT_ID
and SECTION_ID. This is not stated at the ER diagram level, but will come out during schema
design.
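As a hedged preview of that schema-design step, the M:N relationship between STUDENT and SECTION can be developed into a table of its own whose key is the combination STUDENT_ID + SECTION_ID, with GRADE carried as an attribute of the relationship. The table name used here is a placeholder (a better relationship name is chosen below), and the column types are assumptions.

    -- Illustrative preview; minimal parent tables included so the sketch is self-contained.
    CREATE TABLE student (student_id CHAR(5) PRIMARY KEY);
    CREATE TABLE section (section_id CHAR(6) PRIMARY KEY);

    CREATE TABLE transcript_entry (
        student_id  CHAR(5) REFERENCES student(student_id),
        section_id  CHAR(6) REFERENCES section(section_id),
        grade       CHAR(2),                    -- attribute of the relationship itself
        PRIMARY KEY (student_id, section_id)    -- the joint key; neither column alone is unique
    );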
At this point, take a few minutes to sketch how you think this would look when added to the ER
diagram.
Then watch the video, "ER Diagram - Part 3," where we will work through this together and
update the ER diagram.
ER Diagram - Part 3
ER Diagram and Additional Notes
Figure 4.1: ER Diagram Showing Inclusion of M:N Relationship
There are a few things to note with this diagram. First, since grade is actually an attribute of the
relationship rather than an attribute of one of the entities, the grade attribute is attached to
the relationship diamond.
On the board I used the name TRANSCRIPT for the relationship. Not following my sample
diagram closely enough during the production of the video, I looked at the new requirement
instead. Since the requirement discusses a transcript, I used that name on the board. Note that
this is an entity-type name since "transcript" looks like an entity on first reading. Since this is
actually a relationship, I changed the name to the more relationship appropriate name
ENROLLED_IN in the final diagram. This indicates that a student is (or was, depending on the
time we are looking at it) enrolled in a particular section.
Also on the board, I used N as the cardinality value on both sides of the transcript diamond and
noted that the N can represent a different value on the two sides. It just means "many" in this
context. This is common with many, but not all, authors even though the many-many
relationship is indicated by M:N. The authors of the text use M on one side and N on the other
as a reminder that the values can be different. I used this in the final diagram. Note as you
review the chapter that they do switch to using N on both sides when they introduce the (min,
max) cardinality notation later in the chapter.
We need no new symbols to represent the M:N relationship.
ER Diagram Naming Guidelines
The choice of names for entity types, attributes, and relationship types is not always
straightforward. We will discuss some guidelines to consider and naming suggestions. The
choice of names for roles will be discussed in the Recursive Relationships sub-module. As much
as possible, choose names that convey the meaning of the construct being named. The text
chooses to use singular (rather than plural) names for entity types since the name applies to
each individual entity instance in the entity type. Entity type and relationship type names are in
all caps, while attribute names and role names have only the first letter capitalized.
When looking at requirements for the database, some nouns lead to entity type names. Other
nouns describe these nouns and lead to attribute names. Verbs lead to the names of
relationship types.
When choosing names for binary relationships, attempt to choose names that make the ER
diagram readable from left to right and from top to bottom. Note that when it seems a
relationship type name should read from bottom to top, adjusting the name can usually allow
the reading to go from top to bottom.
M 4.3: Relationship Types of Degree
Higher than Two
In Module 3 we defined the degree of a relationship type as the number of entities which
participate in a relationship. So far, we have only looked at relationships of degree two, which
are called binary relationships. It is worth noting again that degree and cardinality should not
be confused. Binary relationships (degree two) have cardinality of 1:1, 1:N, and M:N. Here we
examine relationships of degree three (ternary relationships). We will also look at the
differences between binary and higher-degree relationships. We will consider when to choose
each, and we will conclude by showing how to specify constraints on higher-degree
relationships.
Consider a store where the entities CUSTOMER, SALESPERSON, and PRODUCT have been
identified. Assume the store, such as an appliance store or a furniture store, sells higher priced
items where unlike other stores such as a grocery store, most customers buy only one or a few
items at a time from a salesperson who is on commission. The customer is identified by a
customer number, the product is identified by an item number, and the salesperson is
identified by a salesperson number. It is desirable to keep track of which items a customer
purchased on a particular day from which salesperson. Although not common, a customer may
buy more than one of a particular item on a given day. An example would be to buy two
matching chairs. Since that is a possibility, we want to record the quantity of a particular item
which is purchased as well as the date purchased. We can then define a ternary relationship,
we'll call it PURCHASE, between CUSTOMER, SALESPERSON, and PRODUCT. Note that other
names for the relationship will work, and may be preferable, depending on the actual
application. Also note that the three entities will have other attributes. To avoid clutter which
may detract from the main point being demonstrated here, only the key attribute is shown for
the three entities. How this is represented in ER diagram notation is shown in the figure below.
ER Diagram Representation of a Ternary Relationship
The ER diagram above shows how the three entities are related. Since the date of sale and the
quantity sold are both part of an individual purchase of the customer, salesperson, and product
(there may be several of these purchases over time) these two attributes are attached to the
PURCHASE relationship rather than to any of the individual entities.
It is possible to define binary relationships between the three entities. We can define an M:N
relationship between CUSTOMER and SALESPERSON. Let's call it BOUGHT_FROM. We can also
define an M:N relationship between CUSTOMER and PRODUCT and name the relationship
BOUGHT. Finally, we can define an M:N relationship between SALESPERSON and PRODUCT and
call it SOLD. A representation of this in ER diagram notation is shown below.
ER Diagram Representation of Three Binary Relationships
among the Three Entities
Are the three binary relationships, taken together, equivalent to the ternary relationship? At
first glance, the answer appears to be "yes." However, closer examination shows that the two
methods of representation are not exactly equal. If customer Fred buys two matching chairs
from salesperson Barney on November 15, 2020, that information can be captured in the
ternary relationship (PURCHASE in this case). In the three binary relationships we can capture
much, but not all, of the information. The fact that Fred bought items from Barney can be captured in the BOUGHT_FROM relationship. The fact that Fred bought at least one chair can be captured in the BOUGHT relationship. The fact that Barney sold chairs can be captured in the SOLD relationship.
What is not clearly captured is that Barney sold the chairs to Fred. It is possible that Barney sold something, say a table, to Fred. Barney sold chairs to someone, say Wilma. Fred bought the chairs, but not from Barney. Maybe he bought them from Betty. If you examine this closely, the set
of facts will generate the same information in the binary relationships as the sale in the last
paragraph did. Also, it would not be possible to capture the date of sale or quantity sold in the
same manner in three binary relationships. This should become even clearer in the next module
which discusses relational database design.
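A hedged relational sketch makes the difference concrete: the ternary PURCHASE relationship keeps all three participants, the date, and the quantity together in one row, which the three binary tables cannot do. Names, types, and the choice of primary key below are assumptions for illustration.

    -- Illustrative sketch; minimal parent tables included for self-containment.
    CREATE TABLE customer    (customer_no    INTEGER PRIMARY KEY);
    CREATE TABLE salesperson (salesperson_no INTEGER PRIMARY KEY);
    CREATE TABLE product     (item_no        INTEGER PRIMARY KEY);

    CREATE TABLE purchase (
        customer_no     INTEGER REFERENCES customer(customer_no),
        salesperson_no  INTEGER REFERENCES salesperson(salesperson_no),
        item_no         INTEGER REFERENCES product(item_no),
        sale_date       DATE,
        quantity        INTEGER,
        -- key choice is an assumption: one row per customer/salesperson/item/day
        PRIMARY KEY (customer_no, salesperson_no, item_no, sale_date)
    );
    -- "Fred bought two chairs from Barney on 2020-11-15" is one row here.
    -- Three separate binary tables would each record only a pair of identifiers,
    -- so the three-way connection, the date, and the quantity would be lost.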
M 4.4: Recursive Relationships
In some cases, a single entity participates more than once in a relationship type. This happens
when the entity takes on two different roles in the relationship. The text refers to this as
a recursive relationship or self-referencing relationship. Looking at this further, this is a
relationship of degree one since there is only one entity involved in the relationship. A
relationship of degree one is known as a unary relationship. A cardinality can be assigned to
such a relationship depending on the number of entity instances that can participate in one role
and the number of entity instances that can participate in the other role.
As an example, consider the FACULTY entity. Assume that all department chairs are also faculty.
As in the past this is mostly true, but not always. We will make the assumption for simplicity. If
dictated by the requirements, we can consider a SUPERVISES relationship where the chair
supervises the other faculty in the department. Since only one entity instance represents the
"chair side" of the relationship, but many entity instances represent the "other faculty in the
department" side of the relationship, the cardinality of the relationship is 1:N. This, then,
represents a 1:N unary relationship.
In a unary relationship, role names become important since we are trying to capture the role
played by the participating entity instances. We can define one role as chair, and the other
as regular faculty. There are probably better names for the roles (especially the second one).
Can you think of any?
This relationship will again not be added to the larger ER diagram. The ER diagram below shows
how this relationship would be captured. Note how the roles are depicted.
ER Diagram Representation of a Recursive Relationship
Figure 4.4: ER Diagram of Recursive Relationship
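One common way a 1:N recursive relationship like SUPERVISES ends up being realized is a self-referencing foreign key, sketched below. The column names are assumptions for illustration; the two roles appear as the row itself (the supervised faculty member) and the row it references (the chair).

    -- Illustrative sketch of a recursive (unary) 1:N relationship.
    CREATE TABLE faculty (
        faculty_id     CHAR(4) PRIMARY KEY,
        name           VARCHAR(60),
        supervisor_id  CHAR(4) REFERENCES faculty(faculty_id)  -- NULL for the chair
    );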
M 4.5 Participation Constraints and
Existence Dependencies
In the last module, we looked at cardinality ratios (1:1, 1:N, and M:N). These
indicate the maximum number of entity instances which can participate in a particular
relationship. Is there a concern about the minimum number of instances which can (in this case
it would be must) participate?
The text discusses this concept by discussing participation constraints. The participation
constraint indicates whether in order to exist, an entity instance must be related to an instance
in another entity involved in the relationship. Since this constraint specifies
the minimum number of entities that can participate in the relationship, it is also called
the minimum cardinality constraint.
Participation constraints are split into two types - total and partial. Going back to the ER
diagram for the example database, FACULTY and DEPARTMENT were related by the
BELONGS_TO relation. If there is a requirement that all faculty members must be assigned to a
department, then a faculty instance cannot exist unless it participates in at least one
relationship instance. This means that a faculty instance cannot be included unless it is related
to a department instance in the BELONGS_TO relationship instance.
This type of participation is called total participation, which means that every faculty
record must be related to a department record. Total participation is also called existence
dependency.
As a different example, consider that STUDENT and DEPARTMENT are related by the
MAJORS_IN relationship. If the requirement for students and majors specifies that a student is
not required to have a declared major until junior year, that means that not every student is
required to participate in the MAJORS_IN relationship at all times. This is an example of partial participation, meaning that only some or part of all students participate in the MAJORS_IN relationship.
The cardinality ratio and participation constraints taken together are known as the structural
constraints of a relationship. This is also called the (min, max) cardinality. This is further
discussed further at the end of this sub-module.
For ER diagrams, the text uses cardinality ratio/single-line/double-line notation. To keep the
example fairly simple, we will use Figure 4.1 as an example. This uses the cardinality ratio as we
did in Figure 4.1. In that figure, we also used a line (single-line) to connect entities and
attributes and to connect all entities and relationships. We will modify this to connect an entity
to a relationship in which it totally participates by using a double-line rather than a single line. If
the participation is partial, then we will leave the connection as a single line.
Figure 4.1: ER Diagram Showing Inclusion of M:N Relationship
This forces us to obtain additional information about relationships as the requirements are
gathered. Below is shown the ER diagram of Figure 4.1 modified to include double-line
connectors as required by the following assumptions.
• As in the example above, for the BELONGS_TO relationship we will assume that all faculty members must be assigned to a department. We will further assume that every department must have at least one faculty member. This implies that there is total participation in both directions. This means that there will be double-lines connecting both entities to the relationship.
• As in the second example above, we will assume that a student is not required to declare a major until junior year. Since this indicates partial participation, there will be only a single line connecting student to the MAJORS_IN relationship. We will further assume that some departments may be "support departments" which do not offer majors, so this is partial participation also. This means that there will be single lines connecting both entities to the relationship.
• To continue with the relationships, assume that each student must have an advisor, but not all faculty advise students. So student fully participates in the ADVISED_BY relationship and should be connected by a double line. Faculty, on the other hand, only partially participates so will be connected by a single line.
• For the OWNS relationship, we will assume that each department must own at least one course and every course must be owned by a department. This again means that there is total participation in both directions and there will be double-lines connecting both entities to the relationship.
• For the OFFERING_OF relationship, each section must be an offering of a particular course, so there is total participation and a double-line in that direction. To allow the possibility that a new course can be added without immediately being scheduled, we will not require that every course will have a section associated with it (although we assume that it will in the near future). This is then partial participation and the connection is a single-line.
• Finally, for the ENROLLED_IN relationship, we will allow for a student (probably a new student) to be entered into the STUDENT entity if the student is not yet enrolled in courses. This implies a partial participation in the relationship. Similarly, we will allow a new section to be posted to the SECTION entity before any students are enrolled. Again, this implies a partial participation in the relationship. Since both sides have partial participation, the single lines will remain in place.
The revised ER diagram below shows participation constraints.
Figure 4.5: ER Diagram Showing Participation Constraints
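As a hedged preview of how these participation constraints might eventually be enforced, total participation on the single-valued side often becomes a NOT NULL foreign key, while partial participation leaves the foreign key nullable. Names and types below are assumptions for illustration.

    -- Illustrative sketch; DEPARTMENT included so the references resolve.
    CREATE TABLE department (
        dept_code  CHAR(5) PRIMARY KEY,
        dept_name  VARCHAR(60)
    );

    CREATE TABLE faculty (
        faculty_id  CHAR(4) PRIMARY KEY,
        name        VARCHAR(60),
        dept_code   CHAR(5) NOT NULL REFERENCES department(dept_code)  -- total: every faculty member belongs to a department
    );

    CREATE TABLE student (
        student_id  CHAR(5) PRIMARY KEY,
        name        VARCHAR(60),
        major       CHAR(5) REFERENCES department(dept_code),          -- partial: may be NULL until a major is declared
        advisor_id  CHAR(4) NOT NULL REFERENCES faculty(faculty_id)    -- total: every student has an advisor
    );
    -- Total participation on the "many" side (e.g. every department must have at
    -- least one faculty member) is harder to state declaratively and is usually
    -- enforced outside the table definitions.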
An Example of (min, max) notation
We will start this example by again considering the BELONGS_TO relationship discussed above.
In the example, we assumed that all faculty members must be assigned to a department. We
will clarify that the faculty member can be assigned to only one department. We further
assumed that every department must have at least one faculty member. We will clarify that
most departments have several faculty members. Based on the assumptions, this gives
DEPARTMENT:FACULTY a cardinality ratio of 1:N. Since every faculty member must be assigned to a
department, and every department must have at least one faculty member, this implies that
there is total participation in both directions. Total participation means that the minimum
cardinality must be at least one. Based on this, the (min, max) notation for the department side
of the relationship would be (1, N). The department must have at least one faculty member and
may have several faculty members. The (min, max) for the faculty side of the relationship is (1,
1). This indicates that the faculty member must be in at least one department and may be in at
most one department. This was represented in Figure 4.5 by using double lines to connect both
the FACULTY and DEPARTMENT entities to the BELONGS_TO relationship, indicating that the
minimum cardinality on both sides is one. In figure 4.5, the 1 on the DEPARTMENT side
indicates that the maximum cardinality for FACULTY is one, while the N on the faculty side
indicates that the maximum cardinality for DEPARTMENT is N.
Using the (min, max) notation in an ER diagram is one of the many alternative notations
mentioned in the text. Using this notation for the BELONGS_TO relationship and the two
entities would look like the following. Note that the attributes are omitted to allow focus on the
notation.
Figure 4.6: ER Diagram Showing use of (min, max) Notation
The notation used in the text for total participation (the double line) indicates that the
minimum cardinality is one. What if the minimum cardinality is greater than one? This cannot
be directly represented by the text notation. If there is a requirement that each department
have at least three faculty assigned, that fact cannot be directly represented by the notation
used in the text (nor can it be represented in most notations). It can only represent a full or
partial participation - a minimum of zero or one. The (min, max) notation allows representation
of this. The (1, N) on the DEPARTMENT side would be replaced by (3, N), thereby indicating that
there must be at least three faculty in the department.
A note is in order here about which side of the relationship should be used to place the various
values. This is not consistent across various notations. Back in the 1990s, the terms Look
Across and Look Here were introduced to indicate where the cardinality and participation
constraints were placed in the various notations. Looking again at the BELONGS_TO
relationship in Figure 4.5, the notation followed in the text uses Look Across for the cardinality
constraints. A faculty member can work for only one department. A department can have
many faculty. So the 1 for faculty is placed across the relationship diamond, on the
DEPARTMENT side, while the N for the department is placed across the relationship, on the
FACULTY side. Different notations use Look Here and would reverse the placement.
The text then switches to Look Here for the participation constraints. BELONGS_TO is not a good relationship to demonstrate this since the minimum cardinality on both sides of the
relationship is one, so both entities are connected by double lines to the relationship. Looking
at the ADVISED_BY relationship, we see that participation is partial on one side and total on the
other side. Since a student must have an advisor, the STUDENT entity is connected by a double
line to the ADVISED_BY relationship. Since the double line is on the STUDENT side, that is a
Look Here notation. Similarly, some faculty do not have advisees, so that is partial participation
(minimum cardinality of zero). This means that the FACULTY entity is connected to the
relationship by a single line. Since this line is placed on the FACULTY side, it again shows the
Look Here guidelines of the notation.
Returning to Figure 4.6, the text uses Look Here when illustrating the (min, max) notation. A
faculty belongs to at least one and at most one department, so Look Here places the (1, 1) on
the FACULTY side. A department has at least one, but possibly many faculty, so Look Here
places the (1, N) on the DEPARTMENT side. As the text indicates, other conventions use Look
Across when using (min, max) notation.
To avoid confusion, we will stick with the primary notation used by the text. This is used in
Figure 4.5. Just be aware that if you are involved with creating or reading ER diagrams in the
future, be sure to check to see what notation is being used since it might be different from
what we are using here.
M 4.6: Design Choices, Topics Not
Covered, and Additional Notes
In section 3.7.3, the text covers design choices for ER conceptual design. It is sometimes
difficult to decide how to capture a particular concept in the miniworld. Should it be modeled
as an entity type, an attribute, as relationship type, or something else? Some general guidelines
are given in the text. It is important to again note that conceptual schema design overall, and
the production of an ER diagram should be thought of as an iterative process. An initial design
should be created and then refined until the most appropriate design is obtained. The text
suggests the following refinements.
• In many cases, a concept is first modeled as an attribute and then refined into a relationship when it is determined that the attribute is a reference to another entity type. An example of this is looking at the attribute "Advisor" for the STUDENT entity in sub-module 3.3. This attribute is a reference to the FACULTY entity type and is captured as the ADVISED_BY relationship type in sub-module 3.4. The text then indicates that in the notation used in the text, once an attribute is replaced by a relationship, the attribute itself should be removed from the entity type to avoid duplication and redundancy. This removal was done in the "complete" ER diagram shown in Figure 3.2 in the text. However, this type of attribute was included in later figures (e.g. Figure 3.8) which show the development of an ER diagram and would be produced earlier in the design process. In the actual development process, Figure 3.2 would be produced by the end of the iterative process. The text shows it at the beginning of the chapter so it can refer back to it during the chapter. We have not removed those attributes in the diagrams produced so far. Many authors will not remove those attributes, depending on the ER diagram notation being used. They are removed in the diagram in the next module when the diagram is being used as an example for illustrating the steps of an algorithm for mapping an ER diagram to a relational model.
• Sometimes an attribute is captured in several entity types. This occurred in the example where Department Number was captured as an attribute in the FACULTY, COURSE, and STUDENT (as Major) entity types. This is an attribute which should be considered for "promotion" to its own entity type. Note that in the example, DEPARTMENT had already been captured as its own entity type, so this type of consideration did not apply to the example. If it had not already been captured as an entity type, such a promotion would be considered based on this guideline.
• The reverse refinement can be considered in other cases. For example, suppose DEPARTMENT was identified in the initial design and the only attribute identified was Department_code. Further, assume that this was only identified as needed as an attribute on one other entity type, say FACULTY. It should then be considered to "demote" DEPARTMENT to an attribute of FACULTY.
In order to avoid additional complexity, a few topics that could be covered in this module are
intentionally not discussed in this course. Some are important in designing certain aspects of
very large databases. Others can be included in the design at the conceptual (ER model) level,
but will be "modified out" when the design is used to create a relational model. The intent here
is to develop a good understanding of the basics. If you are working on a design, possibly even
something you work on for a portion of your final project, and want to represent something
that just doesn't seem to fit into what we have covered here, please come back to the text and
explore the items we omitted.
Some specifics are:
• Multivalued and composite attributes are not allowed in a relational schema. It is usually better to modify the conceptual design to not include them.
• Weak entity types (Section 3.5 in the text) can be handled by key selection in the relational schema. We will handle them the same way as we handled other entity types and relationships.
• Whether or not an attribute will be derived is a design choice better made when looking at the DBMS being used. This will be specified like other attributes in the ER diagram.
It should again be pointed out that there are several alternate notations for drawing ER
diagrams. Some of the alternate notations are shown in Appendix A of the text. Most have both
good and not-so-good features. Often, it is what you become accustomed to use or what your
company has decided to use. It is not important in this course that you learn them all. You just
need to be aware that alternatives do exist.
We are not going to discuss (or use) UML Class Diagrams in this course. They are presented in
Section 3.8 in the text. You are likely already familiar with them and have seen a more detailed
presentation in earlier courses.
M 5.1: The Relational Database Model
and Relational Database Constraints
The relational data model was first introduced in 1970 in a (now) classic paper by Ted Codd of IBM. The idea was appealing due to its simplicity and its underlying mathematical foundation. It is based on the concept of a mathematical relation which, in simple terms, is a table of values. The computers at that time did not have sufficient processing power to make a DBMS built on this concept commercially viable due to the slow response time. Two research
relational DBMSs were built in the mid-1970s and became well known at the time: System R at
IBM and Ingres at the University of California Berkeley. During the 1980s, computer processing
power had improved to the point where relational DBMSs became feasible for many
commercial database applications and several commercial products were released, most
notably DB2 from IBM (based on System R), a commercial version of Ingres (which was faulted
for its poor user interface), and Oracle (from a new company named Oracle). These products
were well received, but at their initial release still could not be used for very large databases
which still required more processing power to effectively use a relational database. As
computer processing power kept improving, by the 1990s, most commercial database
applications were built on a relational DBMS.
Today, major RDBMSs include DB2 and Oracle as well as Sybase (from SAP) and SQL Server and
Microsoft Access (both from Microsoft). Open source systems are also available such as MySQL
and PostgreSQL.
The mathematical foundations for the relational model include the relational algebra which is
the basis for many implementations of query processing and optimization. The relational
calculus is the basis for SQL. Also included in the foundations are aspects of set theory as well
as the concepts of AND, OR, NOT and other aspects of predicate logic. This topic is covered in
much greater detail in Chapter 8 of the text. Please read this chapter if you are interested in
this topic, but it will not be covered in this introductory course.
M 5.1.1: Concepts and Terminology
The relational model represents the database as a collection of relations. A relation is similar to
a table of values, or a flat file of records. It is called a flat file since each record has a simple flat
structure. A relation and a flat file are similar, but there are some key differences which will be
discussed shortly. Note that a relational database will look similar to the example database
from Module 1.
When looking at a relation as a table of values, each row in the table represents a set of related
values. This is called a tuple in relational terminology, but in common practice, the formal term
tuple and the informal term row are used interchangeably.
The table name and the column names should be chosen to provide, or at least suggest, the
meaning of the values in each row. All values in a given column have the same data type.
Continuing with the more formal relational terminology, a column is called an attribute and
a table is called a relation. Here also, in common practice the formal and informal terms are
used interchangeably.
The data type which indicates the types of values which are allowed in each column is called
the domain of possible values.
More details and additional concepts are discussed below.
Domains
A domain defines the set of possible values for an attribute. A domain is a set of atomic
values. Atomic means that each value in the domain cannot be further divided as far as the
relational model is concerned. Part of this definition does rely on design issues. For example, a
domain of names which are to be represented as full names can be considered to be atomic in
one design because there is no need to further subdivide the name for this design. Another
design might require the first name and last name to be stored as separate items. In this design,
the first name would be atomic and the last name would be atomic, but the full name would
not be atomic since it is further divided in this design.
Each domain should be given a (domain) name, a data type, and a format. One example would
be phone_number as a domain name. The data type can be specified as a character string. The
format can be specified as ddd-ddd-dddd, where "d" represents a decimal digit. Note that there
are some further restrictions (based on phone company standards) which are placed on both
the first three digits (known as the area code) and the second set of three digits (known as the
exchange). These can be completely specified to reduce the chance of a typo-like error, but this
is usually not done.
Another example would be item_weight as a domain name. The data type could be an integer,
say from 0 to 500. It could also be a float from 0.0 to 500.0 if fractions are important, especially
at lower weights. An additional format specification would not be needed in this case.
However, in this case a unit of weight (pounds or kilograms) should be specified.
It is possible for several attributes in a relation to have the same domain. For example, many
database applications require a phone number to be stored. This can be assigned
the phone_number domain given above. In addition, it is becoming common for databases to
store home phone, work phone, and cell phone. In this case there would be three attributes. All
would use the same phone_number domain.
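Although SQL data definition is not covered until Module 6, the two domains above could be sketched roughly as follows. This is only an illustration using PostgreSQL-style CREATE DOMAIN syntax (not every DBMS supports this statement), and the domain names are simply the ones used above.

CREATE DOMAIN phone_number AS CHAR(12)
    CHECK (VALUE SIMILAR TO '[0-9]{3}-[0-9]{3}-[0-9]{4}');   -- format ddd-ddd-dddd

CREATE DOMAIN item_weight AS INT
    CHECK (VALUE BETWEEN 0 AND 500);                         -- weight in pounds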
Relation Schema
If we skip the mathematical nomenclature, a relation schema consists of a relation name and a
list of attributes. Each attribute names a role played by some domain in the relation schema.
The degree (also known as arity) of a relation is the number of attributes in the relation. If a
relation is made up of five attributes, the relation is said to be a relation of degree five.
Relation State
The relation state (sometimes just relation) of a relation schema, is defined by the set of rows
currently in a relation. If the degree of the relation is n, each row consists of an n-tuple of
values. Each value in the n-tuple is a member of the domain of the attribute which matches the
position of the value in the tuple. The exception is that a tuple value may contain the value
NULL, which is a special value. Sometimes, the schema is known as the relation intension and
state is known as the relation extension.
The value NULL is assigned to a value in a tuple in three different cases. First, it is used when a
value does not apply. For example, many databases include information about a
customer's company in their sales record. Assume the database keeps track of both corporate
and individual customers. If the sale is personal (to an individual) and not on behalf of the
company, all information related to the customer's company would be given the value of NULL
in that record. A second case where NULL is used is when a value does not exist. An example is a
customer home phone. While the value does apply in general, many people no longer have land
line phones at home. If a customer does not have a home phone, that value would be stored as
NULL in the record. Finally, NULL is also used when it is unknown whether a value exists.
Consider again the phone example above. In the above example we knew that the customer did
not have a home phone. There is also a case where we don't know whether or not the
customer has a home phone. Even if the customer does have one, we do not know the number.
We would need to record a NULL value in this case also.
M 5.1.2: Characteristics of Relations
A relation can be viewed as similar to a file or a table. There are similarities, but there are
differences as well. Some of the characteristics which make a relation different are discussed
below.
Ordering of Tuples in a Relation
A relation is defined as a set of tuples. In mathematics, there is no order to elements of a set.
Thus, tuples in a relation are not ordered. This means that in a relation, the tuples may be in
any order - one ordering is as good as any other. There is no preference for any particular
order.
In a file, on the other hand, records are stored physically (on a disk for example), so there is
going to be an ordering of the records which indicates first, second, and so on until the last
record. Similarly, when we display a relation in a table, there is an order in which the rows are
displayed: first, second, etc. This is just one particular display of the relation and the next
display of the same relation may show the table with the tuples in a different order.
However, there is a restriction on relations that no two tuples may be identical. This is
different from a file, where duplicate records may exist. In many file uses, there are checks to
make sure that duplicate records do not exist, but the nature of a file does not place
restrictions on duplicate records. In a relational DBMS, duplicate tuples will not be allowed. No
additional software checks will need to be performed.
Ordering of Values within a Tuple
The ordering of values in a tuple is important. The first value in a tuple indicates the value for
the first attribute, the second value in a tuple indicates the value for the second attribute, etc.
There is an alternative definition where each tuple is viewed as a set of ordered pairs, where
each pair is represented as (<attribute>, <value>). This provides self-describing data, since the
description of each value is included in the tuple. We will not use the alternative definition in
this course.
Values in the Tuples
Each value in a tuple must be an atomic value. This means that it cannot be subdivided within
the current relational design. Therefore, composite and multivalued attributes are not allowed.
This is sometimes called the flat relational model. This is one of the assumptions behind the
relational model. We will see this concept in more detail when we discuss normalization in a
later module. Multivalued attributes will be represented by using separate relations. In the
strictly relational models, composite attributes are represented by their simple component
attributes. Some extensions of the strictly relational model, such as the object-relational model,
allow complex-structured attributes. We will discuss this in later modules. For the next several
modules, we will assume that all attributes must be atomic.
Interpretation (Meaning) of a Relation
A relation schema can be viewed as a type of assertion. As an example, consider the example
database presented in Module 1. The COURSE relation asserts that a course entity has Number,
Name, Credit_hours, and Department. Each tuple can be viewed as a fact, or an instance of the
assertion. For example, there is a COURSE with Number CMPSC121, Name "Introduction to Programming", Credit_hours 3, and Department "CMPSC".
Some relations, such as COURSE represent facts about entities. Other relations represent facts
about relationships, such as the TRANSCRIPT relation in the example database. The relational
model represents facts about both entities and relationships as relations. In the next major
section of this module, we will examine how different constructs in an ER diagram are
converted into relations.
M 5.1.3: Relational Model Constraints
There are usually many restrictions or constraints that should be put on the values stored in a
database. The constraints are directed by the requirements representing the miniworld that the
database is to represent. This sub-module discusses the restrictions that can be specified for a
relational database. The constraints can be divided into three main categories.
1. Constraints from the data model, called inherent constraints. These are model-based.
2. Constraints which are schema-based or explicit constraints. These are expressed in
the schemas of the data model, usually by specifying them in the DDL.
3. Constraints which cannot be expressed in either #1 or #2. They must be expressed in
a different way, often by the application programs. These are called semantic
constraints or business rules.
Constraints of type #1 are driven by the relational model itself and will not be further discussed
here. An example of a constraint of this type would be the fact that no relation can have two
identical tuples. Constraints of type #3 are often difficult to express and enforce within the data
model. They relate to the meaning and behavior of the attributes. These constraints are often
enforced by application programs that update the database. In some cases, this type of
constraint can be handled by assertions in SQL. We will discuss these in a future module.
An additional category of constraints is called data dependencies. These include functional
dependencies and multivalued dependencies. This category focuses on the quality of the
relational database design. Database normalization uses these. We will discuss normalization in
a later module.
Below, the type #2 constraints, the schema-based constraints are listed and discussed.
Domain Constraints
Domain constraints indicate the requirement for the value of a particular attribute in each tuple
of the relation by specifying the domain for the attribute. In an earlier sub-module, we
discussed how domains are specified. Some data types associated with domains are standard data types for integers and real numbers, characters, Booleans, and both fixed- and variable-length strings. Other special data types are available, including date, time, timestamp, and others. This is further discussed in the next module.
It is also possible to further restrict data types, for example by taking an int data type and
restricting the range of integers which are allowed. It is also possible to provide a list of values
which are allowed in the attribute.
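As a rough sketch of what such restrictions can look like in SQL (the CHECK clause is introduced in Module 6; the attribute declarations below are illustrative only):

    Credit_hours  INT      CHECK (Credit_hours BETWEEN 1 AND 6),
    Class_rank    CHAR(2)  CHECK (Class_rank IN ('FR', 'SO', 'JR', 'SR'))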
Key Constraints and Constraints on NULL Values
We indicated above that in the definition of a relation, no two tuples may have the exact same
values for all their elements. It is often the case that many subsets of attributes will also have
the property that no two tuples in the relation will have the exact same values for the subset of
attributes. Any such set of attributes is called a superkey of the relation. A superkey specifies a
uniqueness constraint in that no two tuples in the relation will have the exact same values for
the superkey. Note that at least one superkey must exist; it is the subset which consists of all
attributes. This subset must be a superkey by the definition of a relation.
A superkey may have redundant attributes - attributes which are not needed to ensure the uniqueness of the tuple. A key of a relation is a superkey where the removal of any attribute from the set will result in a set of attributes which is no longer a superkey. More specifically, a
key will have the properties:
1. Two different tuples cannot have identical values for all attributes in the key. This is
the uniqueness property.
2. A key must be a minimal superkey. That means that it must be a superkey which
cannot have any attribute removed and still have the uniqueness property hold.
Note that this implies that a superkey may or may not be a key, but a key must be a superkey.
Also note that a key must satisfy the property that it is guaranteed to be unique as new tuples
are added to the relation.
In many cases, a relation schema may have more than one key. Each of the keys is called
a candidate key. One of the candidate keys is then selected to be the primary key of the
relation. The primary key will be underlined in the relation schema. If there are multiple
candidate keys, the choice of the primary key is somewhat arbitrary. Normally, a candidate key
with a single attribute, or only a small number of attributes is chosen to be the primary key. The
other candidate keys are often called either unique keys or alternate keys.
One additional constraint that is applied to attributes specifies whether or not NULL values are
permitted for the attribute. Considering the example database from Module 1, if every student
must have a declared major, then the MAJOR attribute is constrained to be NOT NULL. If it is
acceptable for a student to at times not have a declared major, then the attribute is not so
constrained.
Entity Integrity, Referential Integrity, and Foreign Keys
The entity integrity constraint requires that no tuple may have a NULL value in any attribute
which makes up the primary key. This follows from the fact that the primary key is used to
uniquely identify tuples in a relation.
Both entity integrity constraints and key constraints deal with a single relation. On the other
hand, referential integrity constraints are used between two relations. They are used to
maintain consistency between the tuples of the two relations. A referential integrity constraint
indicates that a tuple in one of the relations refers to a tuple in a second relation. Further, it
must refer to an existing tuple in the second relation. For example, in the example database,
the MAJOR attribute in the STUDENT relation refers to and must match the value in some tuple
in the CODE attribute of the DEPARTMENT relation.
More specifically, we need to define the concept of a foreign key. A set of attributes in Relation
One is a foreign key that references Relation Two if it satisfies:
1. The attributes of the foreign key in Relation One have the same domains as the
primary key attributes in Relation Two. The foreign key is said to reference Relation
Two.
2. The value of the foreign key in a tuple of Relation One must either occur as a value
of a primary key in Relation Two or must be NULL.
Note that it is possible for a foreign key to refer to the primary key in its own relation. This can
occur in a unary or recursive relationship.
Integrity constraints should be indicated on the relational database schema. Many of these
constraints can be specified in the DDL and automatically enforced by the DBMS.
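As an illustration, the MAJOR/CODE constraint mentioned above could be declared in the DDL roughly as shown below. The CREATE TABLE syntax is covered in Module 6; the attribute names and data types follow the example database, and this sketch assumes the DEPARTMENT table (with primary key CODE) has already been created.

CREATE TABLE STUDENT
(
    STU_ID   CHAR(5)      NOT NULL,
    SNAME    VARCHAR(25)  NOT NULL,
    MAJOR    VARCHAR(5),
    PRIMARY KEY (STU_ID),
    FOREIGN KEY (MAJOR) REFERENCES DEPARTMENT (CODE)
);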
M 5.1.4: Update Operations and
Dealing with Constraint Violations
Operations applied to an RDBMS can be split into retrievals and updates. Queries are used to
retrieve information from a database. Retrievals will be discussed later. Retrieving information
does not violate any integrity constraints. There are three types of update operation:
1. Insert (sometimes referred to as create) is used to add a new tuple to the database.
2. Update (sometimes referred to as modify) is used to change the value of one or
more attributes in an existing tuple.
3. Delete is used to remove a tuple from the database.
Each of the three update operations can potentially violate integrity constraints as indicated
below. How the potential violations can be handled is also discussed.
Insert Operation
The insert operation provides a list of attribute values for a new tuple to be added to a relation.
This operation can violate:
1. Domain constraints if a value is supplied for an attribute which is not of the correct
data type or does not adhere to valid values in the domain.
2. Key constraints if the key value for the new tuple already exists in another tuple in
the relation.
3. Entity integrity if any part of the primary key of the new tuple contains the NULL
value.
4. Referential integrity if the value of any foreign key in the new tuple refers to a tuple
which does not exist in the referenced relation.
In most cases, when an insert operation violates any of the above constraints, the insert is
rejected. There are other options, but they are not commonly used.
Delete Operation
The delete operation indicates which tuple should be deleted. This operation can only violate referential
integrity.
This violation happens when the tuple being deleted is referenced by foreign keys in other tuples
in the database. When deleting a tuple would result in a referential integrity violation, there are
three main options available:
1. Restrict - reject the deletion.
2. Cascade - try to propagate the deletion by deleting not only the tuple specified in
the delete operation, but also by deleting all tuples in referencing relations that
reference the tuple being deleted.
3. Set to NULL - set all referencing attribute values to NULL, then delete the tuple
specified in the delete operation. Note that this option will not work if one of the
attributes in a referencing tuple is part of the primary key in the referencing
relation. This would cause a different violation since no part of a primary key can be
set to NULL.
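In SQL, the chosen option is attached to the foreign key declaration (the exact syntax is covered in Module 6). A rough sketch of the three alternatives, using the student-advisor foreign key that appears later in the example database (Advisor_id in STUDENT referencing Faculty_id in FACULTY):

    FOREIGN KEY (Advisor_id) REFERENCES FACULTY (Faculty_id) ON DELETE RESTRICT   -- reject the deletion
    FOREIGN KEY (Advisor_id) REFERENCES FACULTY (Faculty_id) ON DELETE CASCADE    -- also delete referencing tuples
    FOREIGN KEY (Advisor_id) REFERENCES FACULTY (Faculty_id) ON DELETE SET NULL   -- NULL out the referencing values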
Update Operation
The update operation is used to change the values of one or more attributes in a tuple. This can
lead to several possibilities:
1. Updating an attribute which is not part of a primary key or a foreign key is usually acceptable. However, this type of update can violate domain constraints if the new
value supplied for an attribute is not of the correct data type or does not adhere to
valid values in the domain.
2. Modifying the value of an attribute which is part of a primary key causes, in effect, a
deletion of the original tuple and an insertion of a new tuple with the new value(s)
forming the key of the new tuple. This can have any of the impacts for delete and
insert discussed above.
3. Modifying the value of a foreign key attribute involves verifying that the new foreign
key references an existing tuple in the referenced relation. The options for dealing
with any resulting violation are similar to those discussed in the delete operation.
M 5.2: Relational Database Design
Using ER-to-Relational Mapping
In previous modules we examined taking a set of requirements and creating a conceptual
schema in the form of an ER diagram. The next step in the process is creating a logical database
design by creating a relational database schema based on the conceptual schema. This is known
as data model mapping. This part of the module presents a procedure for creating a database
schema from an ER diagram.
The text presents a seven-step algorithm for converting basic ER model concepts into relations.
The algorithm presented here modifies the algorithm in the text to account for (intentional)
omissions and modifications to some of the material presented in the text. Specifically, step 2
from the text has been eliminated since we are treating weak entity types as regular
relationships. Also, step 6 from the text has been eliminated since we are not allowing
multivalued attributes. They are eliminated during the design when the ER diagram is
constructed. This leaves the five-step algorithm presented here.
Consider the following ER diagram as an example as we work through the steps of the
algorithm. Note that this is similar to the ER diagram in the second ER diagram module, but it
has been extended to illustrate additional points. Note that following the design choices from
the text which are highlighted in sub-module 4.6, the attributes which are refined into a
binary relationship have been removed from the entity type. Also note that participation
constraints are not needed in the mapping algorithm, so single lines are used for all connections
between entities and relationships regardless of the nature of participation.
Example ER Diagram
Figure 5.1: ER Diagram to Use as Example for Algorithm Steps
The mapping algorithm follows.
Step 1: Mapping of Regular Entity Types
For each regular entity type in the ER schema, create a relation that includes all the attributes.
Note that unlike the text, we did not allow composite attributes, so all attributes will be simple
and will be included in the relation.
Next consider all the candidate keys identified for an entity. Choose one of the candidate keys
as primary key. Repeat this for each relation created in this step. Note that foreign keys and
relationship attributes (if there are any) are not addressed in this step. They will be addressed
in steps below.
The relations created during this step are sometimes called entity relations because each tuple
in these relations represents an entity instance.
In our example ER diagram, the regular entity types STUDENT, FACULTY, DEPARTMENT,
COURSE, and SECTION will lead to the creation of relations STUDENT, FACULTY, DEPARTMENT,
COURSE, and SECTION. All attributes shown with the entities in the diagram will be included in
the corresponding relation.
For the STUDENT relation, (student) ID will be the primary key. It might be possible to declare
(student) Name as a candidate key, but since there is at least a slight possibility that duplicate
names (e.g. John R. Smith) will be present at some point during the life of the database, this
cannot be a candidate/primary key unless we have a mechanism in the name attribute for
keeping the names unique. An admittedly awkward way to do this might be "John R. Smith - 1" and "John R. Smith - 2" as values in the name field. Similarly, we would use (faculty) ID as the
primary key for the FACULTY relation. We would use (department) Code as the primary key for
the DEPARTMENT relation, Number as primary key for the COURSE relation, and (section) ID as
the primary key for the SECTION relation.
Note that we made these key choices when originally drawing the ER diagram. If we do this, the
key determinations at this step have already been made. This just reviews some of the
decisions made at that earlier point.
The result after this step is shown here.
Relations after Step 1
STUDENT (Student_id, Name, Class_rank)
FACULTY (Faculty_id, Name)
DEPARTMENT (Code, Name)
COURSE (Number, Name)
SECTION (Section_id, Semester)
Figure 5.2 Entity Relations after Step 1
Step 2: Mapping of Binary 1:1 Relationship Types
For each binary 1:1 relationship type in the ER schema, identify the two relations that
correspond to the entity types participating in the relationship. Call the two relations R1
and R2. There are three ways to handle this. By far the most common is the first approach. The
other two are useful in special cases.
1. Foreign Key Approach: Choose one of the relations - we'll choose R1 - and include as
a foreign key in R1, the primary key of R2. If there are any attributes assigned to
the relationship (and not to either relation directly), those attributes will be
included as attributes of R1.
2. Merged Relation Approach: This approach involves merging the two entity types
into a single relation. If this approach is chosen, it is preferable for both participations to be total, which implies that the two entity types have the same number of instances.
3. Cross-reference or Relationship Relation Approach: This approach involves setting
up a third relation, which we'll call RX, which exists for cross-referencing the primary
keys of relations R1 and R2. We will see in a later step, that this is the approach used
for a binary M:N relationship. The relation RX is known as a relationship
relation because each tuple in RX indicates a relationship between a tuple in R1 and
a tuple in R2.
The pros and cons of the three approaches can be further illustrated with an
example.
Example Illustrating the Three Approaches
In this simple example, suppose we are modeling a company. Two of the entities we have
captured are SALESPERSON and OFFICE. Keeping the example simple, the primary key of the
SALESPERSON entity is a unique Salesperson Number assigned to each salesperson. Also of
interest are the attributes: Salesperson Name, Salary, and Year of Hire. The primary key of the
OFFICE entity is Office Number. Assume that all offices are in the same building, so the Office
Number is unique. We are also interested in the size of the office and whether or not the office
has a window.
Also assume that this is a sales force that does most of its work from the office, so a
salesperson will be in his/her office most of the time. Since a salesperson will be on the phone
much of the time, only one salesperson will be assigned to a given office to minimize
distractions to phone calls with customers. Note that in other cases where salespeople share
offices, such as when they are usually on the road and only use an office occasionally, this
would become a 1:N relationship and would be covered in the next step.
Examining the foreign key approach first, we'll choose SALESPERSON as R1 and OFFICE as R2.
We create the SALESPERSON relation with all its attributes and designate Salesperson Number
as the primary key. We also include as an attribute, the foreign key Office Number. We then
create the OFFICE relation with primary key Office Number and attributes Size and Window. We
could also have used the design where OFFICE is chosen as R1 and SALESPERSON is chosen as
R2. With this design, SALESPERSON would not have a foreign key in its table, but OFFICE would
now include the foreign key Salesperson Number.
Since either design meets the outline of the approach, which design works better? It depends
on the actual situation in the miniworld being modeled. If there is total participation by one
relation in the relationship, that relation should be chosen as R1. If our miniworld indicates that
every salesperson has an office, but there is the possibility that one or more offices are empty
(do not have a salesperson currently assigned), then there is total participation by
SALESPERSON in the relationship (which might be called WORKS_IN), but not total participation
by OFFICE. Therefore, we should choose SALESPERSON to be R1. Since every salesperson has an
office, the foreign key attribute (which can also be called Office Number) will always have a
value in each tuple. There may be one or more offices in the OFFICE relation which are not
occupied. That would not change tuples in the OFFICE relation. It just may be that some of the
tuples in the relation are not currently related to a SALESPERSON tuple. With the same
assumptions, if we choose OFFICE to be R1, some of the OFFICE tuples will have NULL values in
the foreign key field (which might be called Salesperson Number). In this scenario, the first
choice of SALESPERSON as R1 is preferred.
If the actual situation in the miniworld is modified, where all offices are occupied, but not all
salespersons have offices, the choice needs to be reexamined. This might be the case if some of
the salespersons are "home based," but others travel. The mostly home based salespersons are
assigned an office, but those who mostly travel are not. In this case, all offices are occupied, but
not all salespersons have an office. This means that there is total participation by OFFICE in the
relationship (which this time we might call OCCUPIED_BY). Therefore OFFICE should be chosen
as R1. Since every office is occupied, the foreign key attribute (which can also be called
Salesperson Number) will always have a value in each tuple. There may be one or more
salespersons in the SALESPERSON relation who do not have offices. That would not change
tuples in the SALESPERSON relation. It just may be that some of the tuples in the relation are
not currently related to an OFFICE tuple. With the same assumptions, if we choose
SALESPERSON to be R1, some of the SALESPERSON tuples will have NULL values in the foreign
key field. In this scenario, the second design choice of OFFICE as R1 is preferred.
Next we'll examine the merged relation approach. One reason that this approach is not favored
is that the two entities are thought of as separate and this is reinforced by the fact that they
were drawn separately in the ER diagram. Despite this, the approach can work when both
participations are total. In the example, this would mean that every salesperson works in an
office and every office is occupied by a salesperson. Then either office number or salesperson
number can be chosen as the primary key.
If both participations are not total, then either there can be empty offices or some salespersons
may not have offices, or both. If there are empty offices, then office number must be chosen as
the primary key and in some tuples the information related to salespersons will be NULL. If
some salespersons do not have offices, then salesperson number must be chosen as the
primary key and in some tuples the information related to offices will be NULL. If neither
participation is total, some of the tuples will have non-NULL for all attributes, some will have
NULL values for the salesperson information, and others will have NULL information for the
office information. This leads to the question of which attribute should be the primary key. Closer
examination will show that whatever attribute is picked, it is possible that the primary key value
will be NULL for some tuples. This is not allowed, so this approach cannot be used when neither
participation is total.
Finally, there is the Cross-reference or Relationship Relation Approach. This involves keeping
the R1 and R2 relations with only their immediate attributes. We then create a new relation RX
to handle the cross-reference. In the example, we might call this relation WORK_LOCATION.
This relation will have tuples with only two attributes: Salesperson Number and Office Number.
The two are used together to form the primary key of this relation. Each will also be a foreign
key, one in the SALESPERSON table, and one in the OFFICE table. This approach has the
drawback that it requires an extra relation, and this will require additional processing for certain queries. As mentioned, this approach is required for M:N relationships, but it is not common for 1:1 relationships.
In our example from Figure 5.1, the CHAIR_OF relationship is a 1:1 binary relationship. We will
choose the foreign key approach for mapping this relationship. This requires that we include
the primary key of one of the relations as a foreign key in the other relation. If we include the
primary key of DEPARTMENT (Code) as a foreign key in FACULTY, there will be many NULL values
in the attribute set up as a foreign key since only a relatively small number of faculty are chairs.
However, if we use the primary key of FACULTY (Faculty_id) as a foreign key in DEPARTMENT,
there will be no NULL values in this attribute since every department must have a chair. So, we
will use the second choice and include an attribute, which we'll call Chair_id, in the
DEPARTMENT relation. Note that this will be a foreign key referencing the FACULTY relation.
The modified DEPARTMENT relation is shown below.
Department Relation after Step 2
DEPARTMENT (Code, Name, Chair_id)
Figure 5.3 Modified DEPARTMENT Relation after Step 2
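Looking ahead to the SQL of Module 6, the mapped DEPARTMENT relation could be declared roughly as follows. The data types are assumptions made for illustration; Chair_id is declared NOT NULL because, as noted above, every department must have a chair.

CREATE TABLE DEPARTMENT
(
    Code      VARCHAR(5)   NOT NULL,
    Name      VARCHAR(25)  NOT NULL,
    Chair_id  INT          NOT NULL,
    PRIMARY KEY (Code),
    FOREIGN KEY (Chair_id) REFERENCES FACULTY (Faculty_id)
);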
Step 3: Mapping of Binary 1:N Relationship Types
For each binary 1:N relationship type in the ER schema, there are two possible approaches. By
far the most common is the first approach since it reduces the number of tables.
1. Foreign Key Approach: Find the relation for the N-side of the relationship. We'll call this R1. Include as a foreign key in R1 the primary key of the other relation, which can be called R2.
2. The Relationship Relation Approach: As with the third approach in the previous step, this approach involves setting up a third relation, which we'll call RX. This relation exists for cross-referencing the primary keys of relations R1 and R2. The attributes of RX are the primary keys of R1 and R2. This approach can be used if there are only a few tuples of R1 which participate in the relationship. Using the first approach would cause many NULL values in the foreign key field of R1.
In our example from Figure 5.1, there are five 1:N binary relationships. We will choose the
foreign key approach for mapping these relationships. To do this, we will select the relation on
the N-side of the relationship and include in this relation as a foreign key, the primary key of the
relation on the 1-side. Specifically, in our example:
- For the ADVISED_BY relationship, we will include in the STUDENT relation (the N-side) a foreign key which contains the primary key of the FACULTY relation (Faculty_id). We will call this new attribute Advisor_id. Note that this name is appropriate in the context of the STUDENT relation, but realize that it is the Faculty_id from the FACULTY relation.
- Using similar logic, for the BELONGS_TO relationship, we will include in the FACULTY relation (the N-side) a foreign key which contains the primary key of the DEPARTMENT relation. We will call this new attribute Department.
- For the OWNS relationship, we will include in the COURSE relation (the N-side) a foreign key which contains the primary key of the DEPARTMENT relation. We will call this new attribute Department.
- For the MAJORS_IN relationship, we will include in the STUDENT relation (the N-side) a foreign key which contains the primary key of the DEPARTMENT relation. We will call this new attribute Major.
- For the OFFERING_OF relationship, we will include in the SECTION relation (the N-side) a foreign key which contains the primary key of the COURSE relation. We will call this new attribute Course_number.
The modified relations are shown below.
Relations after Step 3
STUDENT (Student_id, Name, Class_rank, Advisor_id, Major)
FACULTY (Faculty_id, Name, Department)
DEPARTMENT (Code, Name, Chair_id)
COURSE (Number, Name, Department)
SECTION (Section_id, Semester, Course_number)
Figure 5.4 Entity Relations after Step 3
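As an illustration of the step 3 result, the STUDENT relation with its two new foreign keys might be declared roughly as follows (the data types are assumptions made for illustration):

CREATE TABLE STUDENT
(
    Student_id  CHAR(5)      NOT NULL,
    Name        VARCHAR(25)  NOT NULL,
    Class_rank  CHAR(2),
    Advisor_id  INT,
    Major       VARCHAR(5),
    PRIMARY KEY (Student_id),
    FOREIGN KEY (Advisor_id) REFERENCES FACULTY (Faculty_id),
    FOREIGN KEY (Major) REFERENCES DEPARTMENT (Code)
);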
Step 4: Mapping of Binary M:N Relationship Types
Since we are following the traditional relational model and not allowing multivalued attributes,
the only option for the M:N relationships is to use the relationship relation (cross-reference)
approach.
For each binary M:N relationship type create a new relation RX to represent the relationship.
Include in RX, as foreign keys, the primary keys of the relations that represent the entities
involved in the relationship. The combination of the foreign keys will form the primary key of
RX. Add to the relation RX any attributes which are associated with the relationship as shown in
the ER diagram.
In our example from Figure 5.1, there is one M:N binary relationship, ENROLLED_IN. We will
create a new relation to represent the relationship. We could name the relation
ENROLLED_IN. However, the requirement which led to identifying this relationship is based on
a desire for transcript information, so we will name the new relation TRANSCRIPT. Following
the algorithm, we include the primary keys of the two relations STUDENT and SECTION as
foreign keys in the TRANSCRIPT relation. The two keys together will form the primary key of the
TRANSCRIPT relation. We will also include any attributes associated with the relationship. In
this case, there is only one: Grade.
The new TRANSCRIPT relation is shown below.
TRANSCRIPT Relation after Step 4
TRANSCRIPT (Student_id, Section_id, Grade)
Figure 5.5 TRANSCRIPT Relation after Step 4
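A rough SQL sketch of this relationship relation (data types assumed for illustration) shows the composite primary key built from the two foreign keys:

CREATE TABLE TRANSCRIPT
(
    Student_id  CHAR(5)  NOT NULL,
    Section_id  INT      NOT NULL,
    Grade       CHAR(2),
    PRIMARY KEY (Student_id, Section_id),
    FOREIGN KEY (Student_id) REFERENCES STUDENT (Student_id),
    FOREIGN KEY (Section_id) REFERENCES SECTION (Section_id)
);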
Step 5: Mapping of N-ary Relationship Types
For this mapping, we again use the relationship relation approach.
For each N-ary relationship type (where N > 2) create a new relation RX to represent
the relationship. Include in RX, as foreign keys, the primary keys of the relations that represent
all entities involved in the relationship. The combination of the foreign keys will form the
primary key of RX. Add to the relation RX any attributes which are associated with
the relationship as shown in the ER diagram.
Note that this step does not apply to the ER diagram in Figure 5.1 since there are no N-ary
relationships in the diagram. A separate example of how to apply this step is given below.
Example
Consider the example for n-ary relationships from Module 4.3. We looked at a store where the
entities CUSTOMER, SALESPERSON, and PRODUCT have been identified. Assume the store sells
higher priced items, such as an appliance store or a furniture store, where unlike other stores
like a grocery store, most customers buy only one or a few items at a time from a salesperson
who is on commission. The customer is identified by a customer number, the product is
identified by an item number, and the salesperson is identified by a salesperson number. It is
desirable to keep track of which items a customer purchased on a particular day from which
salesperson. We want to record the quantity of a particular item which is purchased as well as
the date purchased. We defined a ternary relationship, called PURCHASE, between CUSTOMER,
SALESPERSON, and PRODUCT. Quantity and Date Purchased become attributes of the
PURCHASE relationship. In this example, we would define relations for CUSTOMER,
SALESPERSON, and PRODUCT in step 1. In this step, we define a relation called PURCHASE. This
relation contains as foreign key attributes, the primary keys of CUSTOMER, SALESPERSON, and
PRODUCT. These attributes are combined to form the primary key of PURCHASE. Also included
as attributes of PURCHASE are Quantity and Date Purchased since these are attributes of
the relationship and not independent attributes of any of the three entities which participate in
the relationship.
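A sketch of the PURCHASE relationship relation in SQL follows; the attribute names and data types are assumptions made for illustration, and the three foreign keys together form the primary key exactly as described above.

CREATE TABLE PURCHASE
(
    Customer_number     INT  NOT NULL,
    Salesperson_number  INT  NOT NULL,
    Item_number         INT  NOT NULL,
    Quantity            INT,
    Date_purchased      DATE,
    PRIMARY KEY (Customer_number, Salesperson_number, Item_number),
    FOREIGN KEY (Customer_number) REFERENCES CUSTOMER (Customer_number),
    FOREIGN KEY (Salesperson_number) REFERENCES SALESPERSON (Salesperson_number),
    FOREIGN KEY (Item_number) REFERENCES PRODUCT (Item_number)
);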
If you want to refresh your memory, the ER diagram from Module 4 is reproduced below as Figure 4.2.
Figure 4.2: ER Diagram Representation of Ternary Relationship
M 6.1: SQL Background
Relational DBMSs are still the most widely used DBMSs in industry today. Almost all, if not all, RDBMSs support the SQL standard.
The standard had its origins with the SEQUEL language developed by IBM in the 1970s for use
with its experimental RDBMS, SYSTEM R. As relational systems began to move from prototype
systems to production systems, each vendor had its own proprietary RDBMS and its own
proprietary DML/DDL language. Many were similar to SEQUEL (later renamed SQL: Structured
Query Language). The differences caused problems for users who wanted to switch RDBMS
vendors. The cost of conversion was often quite high since switching systems often required
table modification to run on the new system, and many programs which accessed the DBMS
often required extensive modification to work with the new system. Customers who wanted to
run two or more systems from different vendors faced similar problems since interface
programs written for one system would not run directly on the other system.
Under pressure from users, vendors moved to help alleviate this problem. Under the leadership
of the American National Standards Institute (ANSI) a standard was developed for SQL (called
SQL1 or SQL-86). This first standard was quite similar to IBM's version of SQL. Suggested
improvements to the IBM version were delayed to SQL2 to allow the users of the IBM RDBMS
(named DB2 when SYSTEM R was released as a production version) to delay likely necessary
changes to their system. This made sense at the time since the majority of early RDBMS users
ran DB2 on IBM computers.
If vendors adhered to the standard, users could switch vendors or run systems from multiple
vendors with much less rework required. Although most vendors continued to add features
beyond those specified in the standard, users could choose to ignore the vendor-specific
features to minimize rework when moving to a different system.
The revised and greatly expanded standard known as SQL2 (officially SQL-92) was released in
1992. Creating an official standard takes several years, with the final approval process usually
taking well over a year since there are opportunities for user comments. The committee
discusses the comments and decides which to include in the official standard. Some vendors
often have features in their DBMS which later become part of the standard. Also, most vendors
have not implemented all the features which are included in the standard by the time the
standard is released. It often takes vendors a few years before they include all the features
specified in a new standard.
The standards have been updated over time. Major revisions have been:
- SQL:1999 (also known as SQL3);
- SQL:2003, which began support for XML;
- SQL:2006, which added additional XML features;
- SQL:2008, which added more object database features;
- SQL:2011;
- SQL:2016;
- SQL:2019, which is currently in progress.
SQL is comprehensive, including statements for data definitions, queries, and updates. This
means that it is both a DDL and a DML. It includes many additional facilities including rules for
embedding SQL statements into programming languages such as Java and C++.
Beginning in 1999, the standards were split into a core specification plus specialized extensions.
The core is to be implemented by all SQL compliant vendors. The extensions can be
implemented as optional modules if the vendor chooses.
M 6.2: SQL Data Definition and Data
Types
SQL uses the (informal) terms table, row, and column for the formal relational
terms relation, tuple, and attribute. These are most often used interchangeably when
discussing SQL.
Below we use all upper case for SQL keywords such as CREATE TABLE or PRIMARY KEY.
However, SQL is case insensitive when examining keywords. Specifying CREATE TABLE, create
table, or Create Table will all be equivalent in SQL. The same is not true for character string data
stored in rows (tuples) of the database. The database is case sensitive when processing data.
The concept of a schema was introduced in SQL2. This allowed a user to group tables and
related constructs that belonged to the same schema (often just called a database). This
feature allows an RDBMS to host many different databases. An SQL schema is given a schema
name and an authorization identifier which indicates which user owns the schema. It also
includes descriptors for each element in the schema. These schema elements include such
items as tables, types, constraints, views, domains, and others. A schema is created using the
CREATE SCHEMA statement. This must include the schema name. It can include the definitions
for no, several, or all schema elements. The elements not defined in the statement can be
defined later.
An example for a university database would be:
CREATE SCHEMA UNIVERSITY;
Note that SQL statements end with a semicolon. The semicolon is optional in some DBMSs, and SQL statements will run fine without it; it is required in other DBMSs.
The next step in creating a database is to create the tables (relations). This is done using the
CREATE TABLE command. This command is used to name the table and give its attributes and
initial constraints. Often the schema name is inferred from the environment and need not be
specified explicitly, but it can be specified explicitly if desired or if needed. An example without
and then with the schema would be:
CREATE TABLE STUDENT (
rest of specification
);
CREATE TABLE UNIVERSITY.STUDENT (
rest of specification
);
Inside the parentheses, attributes are declared first. This includes the name given to the
attribute, a data type to provide the domain, and optional attribute constraints such as NOT
NULL. An example for two of the tables in the example database from Figure 1.2 follows.
CREATE TABLE STUDENT
(
    STU_ID    CHAR(5)      NOT NULL,
    SNAME     VARCHAR(25)  NOT NULL,
    RANK      CHAR(2),
    MAJOR     VARCHAR(5),
    ADVISOR   INT,
    PRIMARY KEY (STU_ID)
);
CREATE TABLE DEPARTMENT
(
    CODE      VARCHAR(5)   NOT NULL,
    DNAME     VARCHAR(25)  NOT NULL,
    BUILDING  VARCHAR(35)  NOT NULL,
    PRIMARY KEY (CODE)
);
Tables (relations) created using CREATE TABLE are called base tables. This means that the
tables and their rows are stored as actual files by the DBMS. A later module will discuss the
CREATE VIEW statement. This statement creates virtual tables, which may or may not be
represented by a physical file. In SQL, attributes specified in a base table are considered to be in
the order as specified in the CREATE TABLE statement. Remember, however, that row ordering
is not relevant so rows are not considered to be ordered in a table.
Keys were discussed in the last module. The primary key is specified as shown above. Foreign
keys can be specified in the CREATE TABLE statement. However, doing this might lead to errors
if the foreign key references a table which has not yet been created. The foreign keys can be
added later using the ALTER TABLE statement. We will discuss this statement in a later module.
Foreign keys have been left off of the CREATE TABLE example above.
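For reference, adding a foreign key after the tables exist looks roughly like the statement below. ALTER TABLE is discussed in a later module; the constraint name is arbitrary, and the sketch assumes a FACULTY table with primary key FACULTY_ID has already been created.

ALTER TABLE STUDENT
    ADD CONSTRAINT FK_STUDENT_ADVISOR
    FOREIGN KEY (ADVISOR) REFERENCES FACULTY (FACULTY_ID);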
There are several basic data types in SQL. They are listed below with brief descriptions. Additional details can be found in the text. We will cover additional details as they are needed.
- Numeric data types include integer numbers of different sizes (INT and SMALLINT). They also include floating-point numbers of different precisions (FLOAT, REAL, and DOUBLE PRECISION). SQL also includes formatted numbers using DEC(i,j), where i specifies the total number of digits and j specifies the number of digits after the decimal point.
- Character-string data types can be either fixed length, specified by CHAR(n) where n is the length of the string, or varying length, specified by VARCHAR(n) where n is the maximum length of the string.
- Bit-string data types are also either fixed length of n bits (BIT(n)) or variable length of maximum length n (BIT VARYING(n)).
- The Boolean data type has the traditional values of TRUE and FALSE. However, to allow for the possibility of NULL values, SQL also allows the value UNKNOWN. This three-valued logic is not quite the same as the traditional two-valued Boolean logic. It is discussed in more detail in a later module.
- The DATE data type has YEAR, MONTH, and DAY components; the TIME data type has HOUR, MINUTE, and SECOND components.
Additional, more specialized, data types exist. Some are not part of the standard, but have been
implemented by individual vendors. They will be discussed if they are needed.
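For illustration, several of these types might appear in attribute declarations as sketched below; the attribute names are made up for the example.

    Price        DEC(7,2),        -- formatted number: up to 99999.99
    Quantity     SMALLINT,
    In_stock     BOOLEAN,
    Date_added   DATE,
    Description  VARCHAR(200)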
M 6.3: Specifying Constraints in SQL
There are several basic constraints which can be specified in SQL as part of table creation.
These include key and referential integrity constraints, restrictions on attribute domains and
use of NULLs, and constraints on individual tuples by using the CHECK clause. An additional type
of constraint, called an assertion, is discussed in a future module.
Specifying Attribute Constraints and Attribute Defaults
By default, SQL allows NULL to be used as a value for an attribute. A NOT NULL constraint can
be specified for any attribute where a value of NULL is not permitted. Examples of this were
shown in the last sub-module. Remember from the last module that the entity integrity
constraint requires that no tuple may have a NULL value in any attribute which makes up the
primary key. Therefore, any attribute specified as a primary key will not allow nulls by default,
but any other attributes where NULL should not be permitted will require the NOT NULL
constraint to be specified.
Unless the NOT NULL constraint is specified, the default value assigned to an attribute in a tuple
is NULL. If a different default value is desired, it can be defined by appending the "DEFAULT
<value>" clause to the attribute definition. An example of this would be to assign the default
value of "3" to the credit hours attribute in the COURSE table. This would be done by specifying
the attribute as:
Credit_hours  INT  NOT NULL  DEFAULT 3;
An additional constraint can be added to restrict the domain values of an attribute. This can be
done using the CHECK clause. If there is a university policy that no course may be offered for more than six credit hours, this can be specified as:
Credit_hours  INT  NOT NULL  CHECK (Credit_hours > 0 AND Credit_hours < 7);
Specifying Key and Referential Integrity Constraints
The PRIMARY KEY clause specifies the primary key for the table. Examples of specifying the
primary key were shown in the last sub-module. If multiple attributes are required to make up
the primary key, all key attributes are listed in the parentheses. For example, in the
TRANSCRIPT table, the primary key is a combination of student ID and section ID. In the
TRANSCRIPT table, this would be specified as:
PRIMARY KEY (Stu_id, Section_id);
We have already discussed alternate or candidate keys. These are also unique, but were not
chosen as the primary key for the relation. This can be specified in a UNIQUE clause. An
example of this would be if we assume department name is unique in the DEPARTMENT
relation, we can specify:
UNIQUE (Dname);
For both primary and alternate keys, if the key is made up of only one attribute the
corresponding clause can be specified directly. Examples for the DEPARTMENT table would be:
Code   VARCHAR(5)   PRIMARY KEY;
Dname  VARCHAR(25)  UNIQUE;
Referential integrity constraints are specified by using the FOREIGN KEY clause. Remember
from an earlier module that it is possible to violate a referential integrity constraint when a
tuple is added, deleted, or modified (the update operations). Specifically, the modify would be
to a primary key or foreign key value. Also remember that there are three main options for
handling an integrity violation: restrict, cascade, and set to NULL.
The default for SQL is RESTRICT which causes a rejection of the update operation that would
cause the violation. However, the schema designer can specify an alternative action. For an
insertion, RESTRICT is the only option which makes sense, so alternative actions are not
specified for insert operations. The alternatives for modify and delete are SET NULL, CASCADE,
and SET DEFAULT. The option must be qualified by ON DELETE or ON UPDATE (this is a
reference to a modify operation, not to all three operations which update the database).
Example Tables Referenced
STUDENT

Student ID    Name                  Class Rank
17352         John Smith            SR
19407         Prashant Kumar        JR
22458         Rene Lopez            SO
24356         Rachel Adams          FR
27783         Julia Zhao            FR

FACULTY

Faculty ID    Name
1186          Huei Chuang
5368          Christopher Johnson
6721          Sanjay Gupta
7497          Martha Simons
Note that the values in the tables have been modified to better demonstrate the concepts
below.
For example, we might specify a foreign key in the STUDENT table of our example database by
the clause:
FOREIGN KEY (Advisor_id) REFERENCES FACULTY (Faculty_id);
Since the default is RESTRICT, we will not be able to add a tuple to the STUDENT table unless
the value in the Advisor_id attribute matches a value of Faculty_id of some tuple in the
FACULTY table. Similarly, we cannot modify the value of the Faculty_id attribute if that faculty
member has been assigned an advisee using the old Faculty_id. Neither can we modify the value of the Advisor_id attribute unless we modify it to the value of Faculty_id in some tuple of the FACULTY table. Finally, we cannot delete a row from the FACULTY table if that faculty member is the
advisor to at least one student. All these actions are prohibited by the RESTRICT option.
Consider the tables above. For the sake of simplicity, assume this is our "entire" database for
now. The RESTRICT option (default) would not allow the addition of a new student with an
Advisor value of 4953 since there is no tuple in the FACULTY relation with the faculty id of 4953.
We could not modify the faculty id of Huei Chuang since there are two students who show their
advisor as 1186. Changing that value in the FACULTY tuple would create a referential integrity infraction. This would not be allowed with RESTRICT. In a similar manner, we cannot delete
faculty Martha Simons since that would again remove a faculty id which is used as the advisor
value for two students who show their advisor as 7497. This deletion would again create a
referential integrity infraction.
Suppose we modify the FOREIGN KEY clause as follows:
FOREIGN KEY (Advisor_id) REFERENCES FACULTY (Faculty_id) ON DELETE SET NULL ON
UPDATE CASCADE;
The add operation would work as above since the RESTRICT option would still be in effect for
add. When the row for a faculty is deleted and that faculty has advisees, the delete operation is
now allowed, but for all students who had the (soon to be) deleted faculty as advisor, the
Advisor_id attribute would be changed to NULL. If the Faculty_id is modified, again that
operation is now permitted and with CASCADE, the Advisor_id attribute for all students who
have the faculty as advisor would have the value changed to the new Faculty_id given to the
faculty member.
Again, consider the tables above. Since RESTRICT would still be in effect for add, the DBMS would still not allow the addition of a new student with an Advisor value of 4953 since there is
no tuple in the FACULTY relation with the faculty id of 4953. Since the option for DELETE is now
SET NULL, we can delete faculty Martha Simons. This would be allowed and cause the advisor
value for two students who show their advisor as 7497 (John Smith and Rene Lopez) to have
that value changed from 7497 to NULL in their student tuples. Since the option for UPDATE is
now CASCADE, we could modify the faculty id of Huei Chuang. If we change his ID to 1215, not
only is the faculty id value in his faculty record changed to 1215, but the two students (Prashant
Kumar and Rachel Adams) who have their advisor value as 1186 would automatically have the
values modified to 1215. Note that other checks might restrict us from modifying Huei Chuang's
id since it is the primary key of the relation, but the change would no longer be prevented by
the referential integrity constraint.
Using CHECK to Specify Constraints on Tuples
In addition to being used to further restrict the domain of an attribute, the CHECK clause can be
used to specify row based constraints which are checked whenever a row is inserted or
modified. This type of constraint is specified by adding the CHECK clause at the end of other
specifications in a CREATE TABLE statement. As an example of how this can be used, suppose
our database contains a PRODUCT table. Among other attributes, we store the regular price at
which the item is sold in the attribute Regular_price. However, we often put products on sale.
We store the current sale price of the item in the attribute Sale_price. If an item is not currently
on sale, we set the value of Sale_price to the value of Regular_price in the tuple. We do want to
make sure, however, that we do not accidentally set the sale price above the regular price. We
can make sure this does not happen by adding the following CHECK clause at the end of the
CREATE TABLE statement for the PRODUCT table.
CREATE TABLE PRODUCT(
rest of specification
CHECK (Sale_price <= Regular_price)
);
If the CHECK condition is violated, the insertion or modification of the offending tuple would
not be allowed to be stored in the database.
M 6.4: INSERT, DELETE, and UPDATE
Statements in SQL
The three commands used to change the database are INSERT, DELETE, and UPDATE. They will
be considered separately.
The INSERT Command
The basic form of the INSERT command is used to add a single row to a table. The table name
and a list of values for the row must be specified. The values must be listed in the same order
that the corresponding attributes were listed in the CREATE TABLE command. For example, in
the STUDENT table created in sub-module 6.2, we can specify:
INSERT INTO STUDENT
VALUES ('17352', 'John Smith', 'SR', 'SWENG', 1186);
Note that the first value (the student id) is specified in single quotes since it was specified as a
CHAR(5) value and character values must be enclosed in single quotes. The last value (advisor)
is not specified in quotes since it was specified as an INT and numbers are not enclosed in
quotes.
The DELETE Command
The DELETE command removes tuples from a table. It includes a WHERE clause similar to the
WHERE clause in the SELECT statement (which will be discussed in sub-module 6.6). The WHERE
clause is used to specify which tuples should be deleted. The WHERE clause often specifies an
equal condition on the primary key of the table whose tuple should be removed. However, it
can be used to specify other conditions which may delete multiple tuples. One example of using
the DELETE command is:
DELETE FROM STUDENT
WHERE Student_id = '24356';
If a tuple is found with the specified student id, that tuple is removed from the table. If no tuple
matches the student id, no action is taken.
A second example is:
DELETE FROM STUDENT
WHERE Major = 'CMPSC';
This removes any student tuple where the student is a CMPSC major. (I'm not sure why such a
drastic action would be taken, but that is what the command would do.)
The UPDATE Command
The UPDATE command allows modification of attribute values of one or more tuples. Like
DELETE, the UPDATE command includes a WHERE clause which selects the tuples to be
modified from the indicated table. The UPDATE command also contains a SET clause to specify which
attribute(s) should be modified and what the new values should be.
For example, to change the major of student 19497 to SWENG, we would issue the following:
UPDATE STUDENT
SET Major = 'SWENG'
WHERE Student_id = '19497';
This would update the value for one tuple. For an example using multiple tuples, consider that
the faculty member with id 3468 is retiring and this faculty has several advisees. All of the
advisees are now being assigned to faculty with id 5793 as their new advisor. This would be
accomplished with the following.
UPDATE STUDENT
SET Advisor = 5793
WHERE Advisor = 3468;
It is also possible to specify NULL as the new attribute value.
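For example, to record that the student with ID 22458 currently has no assigned advisor (a
hypothetical situation, used only for illustration), we could write:

UPDATE STUDENT
SET Advisor = NULL
WHERE Student_id = '22458';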
In these examples, we only specified one table in the command. It is not possible to specify
multiple tables in an UPDATE command. To update multiple tables, an individual UPDATE
command must be issued for each table.
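For instance, if the code for the Computer Science department were changed (a purely
hypothetical scenario, with 'CS' as the assumed new code), the change would require one UPDATE
per table:

UPDATE DEPARTMENT
SET Code = 'CS'
WHERE Code = 'CMPSC';

UPDATE STUDENT
SET Major = 'CS'
WHERE Major = 'CMPSC';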
M 6.5: Creating and Populating a
Database Using SQL
In Assignments 1 & 2, you built and populated a database using the screens provided by
NetBeans. In this module, we will start to build and populate a database using SQL commands.
You will finish the build and population of the database in Assignment 6. You will then perform
a few basic queries. We will discuss queries in upcoming sub-modules.
In this module, we are going to use SQL to create a database named ASSIGNMENT6. We will
then create a table and add some rows. The steps are listed below. Please follow along in your
own version of NetBeans. At the end of the module you can click to download or view a Word
document that contains more detail and screenshots which might be helpful.
Steps required:
• Open NetBeans. If the Services tab is not visible, click on "Window" and then click the "Services" selection.
• If there is a + to the left of Java DB, click to expand. If there is no +, see the Word document.
• Right click on the “jdbc:derby://localhost…” and then click “Connect”.
• Right click on the “jdbc:derby://localhost…” again and this time click on “Execute Command…”
• Type in “CREATE SCHEMA ASSIGNMENT6;” (don’t forget the semi-colon).
• Click on the “Run SQL” button (first button to the right of where you entered the command).
• The output window which is shown in the bottom part of the screen should indicate “Executed successfully…”
• You will probably not see ASSIGNMENT6 in the Services window. Click the + to expand "Other schemas".
• You should see ASSIGNMENT6. Right click on it and select “Set as Default Schema”. This should move it to the top and the name will be in bold.
• Expand, then right click on TABLES, select “Execute Command…” In this window, you need to type in the table definition. We will create the STUDENT table using:
CREATE TABLE STUDENT
(
STU_ID     CHAR(5)        NOT NULL,
SNAME      VARCHAR(25)    NOT NULL,
RANK       CHAR(2),
MAJOR      VARCHAR(5),
ADVISOR    INT,
PRIMARY KEY(STU_ID)
);
Note that you can type in the entire definition in the window. An alternate, and probably better
way is to type the definition into a simple text file (Notepad on Windows works well) and
then copy and paste the entire definition from the text document to the window.
Also note that unlike the guidelines we have been following from the text, I changed all
attribute names to all upper case. This is because the SQL interpreter in Java DB treats attribute
names the same way that it treats keywords – as case insensitive. If you want to keep the
attribute names in mixed case as we have done so far, you will need to put the attribute
names in double quotes whenever you reference them in queries. This can be a pain, so it is
easier to put them into the database as all caps.
• Again click the “Run SQL” button. The output window should again indicate “Executed successfully…”
• Now that the table has been created, the next step is to add tuples (rows) to the table.
• In the Services tab, right click on the STUDENT table, select “Execute Command…”
• In this window, we will type in one INSERT command for each tuple. We will populate the STUDENT table by using:
INSERT INTO STUDENT
VALUES
('17352', 'John Smith', 'SR', 'SWENG', 1186);
INSERT INTO STUDENT
VALUES
('19407', 'Prashant Kumar', 'JR', 'CMPSC', 3572);
INSERT INTO STUDENT
VALUES
('22458', 'Rene Lopez', 'SO', 'SWENG', 2842);
INSERT INTO STUDENT
VALUES
('24356', 'Rachel Adams', 'FR', 'CMPSC', 4235);
INSERT INTO STUDENT
VALUES
('27783', 'Julia Zhao', 'FR', 'SWENG', 3983);
• Click the “Run SQL” button. The output window should again indicate “Executed successfully…” five times, one for each of the INSERT commands.
• Finally, right click on STUDENT and select “View Data…” This should show the tuples you just entered.
Now that we have created a database, added a table, and populated the table with a few tuples
(rows), we will look at basic queries in the next sub-module.
M 6.6: Basic Retrieval Queries in SQL
There is one basic statement in SQL for retrieving information from (or querying) a database:
the SELECT statement. This is a somewhat unfortunate choice of name for the keyword, since
there is a SELECT operation in the relational algebra which forms some of the background for
SQL. If you look into relational algebra, please note that the relational algebra operation and the
SQL statement are NOT the same.
The SELECT statement in SQL has numerous options and features. We will introduce them
gradually over the next several modules.
Note that the results of many operations will form new tables in SQL. When some of these
tables are created, they will contain tuples which are exact duplicates. This is OK in
practical SQL, but it violates the definition of a relation in the formal model. When
duplicate tuples are not desired, we will see methods for eliminating the duplicates. Since each
base table will have a unique primary key, there will not be duplicate tuples in base tables.
The SELECT-FROM-WHERE Structure
SELECT <attribute list>
FROM <table list>
WHERE <condition>;
where
• <attribute list> is a list of attribute names whose values are retrieved by the query.
• <table list> is a list of the names of the tables required to process the query.
• <condition> is a conditional expression that indicates the tuples which should be retrieved by the query.
The basic operators for the condition are the comparison operators: =, <, >, <=, >=, and <>. You
should already be familiar with these from C++ or Java. The exception might be the "<>"
notation for "not equal".
Click to see Figure 1.2 with the attributes named as in Figure 5.6
An example from the database in Figure 1.2 with the attributes named as in Figure 5.6 would
be to find class rank and major for Julia Zhao. This would be specified as:
SELECT Class_rank, Major
FROM STUDENT
WHERE Name = 'Julia Zhao';
Since this query only requires the STUDENT table, it is the only table name specified in the
FROM clause. The query selects (relational algebra terminology) the tuple(s) from the table
which meet the condition specified in the WHERE clause. Since the Name attribute is specified
as one of the CHAR types, the value specified in the WHERE clause to be matched with values
in the database must be enclosed in single quotes. In this case, there is only one tuple where
the student's name is Julia Zhao. The result is projected (again relational algebra terminology)
on the attributes listed in the SELECT clause. This means that the result has only those
attributes listed. The result of the above query would look like:
Class_rank    Major
FR            SWENG
You can consider this query working by "looping through" all tuples in the STUDENT table and
applying the condition from the WHERE clause to each. If the tuple meets the condition, it is
placed in what can be considered a "result set," where the term set is used somewhat loosely. If
the tuple does not meet the condition, it is ignored. Then the attribute values which are listed
in the SELECT clause are displayed for all tuples in the result set.
Another example would be to find all Juniors and Seniors and list their student ID, name, and
class rank. This would be specified as:
SELECT Student_id, Name, Class_rank
FROM STUDENT
WHERE Class_rank = 'JR' OR Class_rank = 'SR';
Since this query again only requires the STUDENT table, it is the only table name specified in the
FROM clause. The query selects (relational algebra terminology) the tuple(s) from the table
which meet the condition specified in the WHERE clause. In this case, it selects from the table
all tuples where the class rank is either JR or SR. The result is projected on the three attributes
listed in the SELECT clause. This means that the result has only those attributes listed. The
result of the above query would look like:
Student_id    Name              Class_rank
17352         John Smith        SR
19407         Prashant Kumar    JR
Note that we used the Boolean operator "OR" as part of the WHERE clause. Both AND and OR
are valid keywords and perform their normal Boolean function.
These are examples of simple SELECT statements using a single table. In the next module we
will consider queries that require more than one table to answer the question.
Unspecified WHERE Clause and Use of the Asterisk
It is permissible to specify a query without a WHERE clause. When this is done, all tuples in the
table specified in the FROM clause meet the condition and are selected. For example consider:
SELECT Name
FROM STUDENT;
This would display the names of all the students. Note that by default, the student names can
be displayed in any order. This order is usually based on the order that the result can be
retrieved and displayed most quickly. We will see shortly how we can specify an ordering.
In order to retrieve all attribute values of the selected tuples, it is not necessary to list all the
attribute names. We can simply specify the asterisk (*) in the SELECT clause. Consider the
following example query.
SELECT *
FROM STUDENT
WHERE Major = 'SWENG';
This will list all attributes for the students who are majoring in software engineering.
Ordering of Query Results
As mentioned above, the default allows query results to be displayed in any order. The ordering
can be specified by using the ORDER BY clause. Consider the following query.
SELECT Name, Advisor_id
FROM STUDENT
ORDER BY Name;
This query would select all student tuples (no WHERE clause) and retrieve the Name and
Advisor_id of each. Rather than the default random order of the tuples, the tuples would be
ordered by Name. They would be displayed in ascending order by default. Note that using the
full name has kept things simple, but it will cause problems with the ordering, since the sort will
be based on first name. This means that "John Smith" will be ordered before "Prashant Kumar"
since the "J" sorts before "P" even though Smith sorts after Kumar. First name and last name
would need to be separate attributes to perform the sort in what would be considered
"normal" order.
If a sort in descending order is desired, the keyword DESC is used. For example:
SELECT Name, Advisor_id
FROM STUDENT
ORDER BY Name DESC;
This would perform the same retrieval as the last example, but this time the results would be
displayed in descending order. Again, the issue of full name would cause the results to be
correct for a sort on the full name, but this would not be what is normally expected.
Two or more attributes can be specified in the ORDER BY clause. The primary sort would be on
the first attribute listed. A secondary sort would be performed based on the second attribute
listed, etc.
SELECT Name, Advisor_id
FROM STUDENT
ORDER BY Advisor_id, Name;
Again all STUDENT tuples are selected. Name and Advisor_id are displayed for all students. The
main ordering would be by the value in the Advisor_id attribute. Since there would often be
several students with the same advisor, for those tuples where the value of Advisor_id is the
same, the tuples would be ordered by student name within that group. Without adding Name
as a secondary sort value, all students would be put in order by Advisor_id value, but when the
Advisor_id values are identical, those tuples will appear in random order within the group of
those with the same advisor.
M 6.7: Querying a Database Using SQL
In sub-module 6.5 we built and populated a database using SQL commands. In this sub-module,
we will run a few simple queries of the ASSIGNMENT6 database using the SQL SELECT
command. You will write some additional queries as part of Assignment 6.
To a large extent, this repeats the queries in sub-module 6.6. This module demonstrates
running the queries in NetBeans. Please follow along in your own version of NetBeans. At the
end of the module you can click to download or view a Word document that contains more
detail and screenshots which might be helpful.
The steps for querying a database in NetBeans are very similar to the steps followed earlier in
sub-module 6.5. To query an existing database using SQL in NetBeans, perform the following
steps.
• Open NetBeans.
• Click on the Services tab.
• Expand the Databases selection by clicking on the + box next to “Databases”.
• Right click on the “jdbc:derby://localhost…” and then click “Connect”.
• Right click on the “jdbc:derby://localhost…” again and this time click on “Execute Command…”
• Enter the query directly into the screen or put the query in a simple text document and copy and paste into the screen.
• Click the “Run SQL” button.
• View the results.
We will demonstrate this with the queries below.
To start with a simple query, let’s show the value of all attributes for all tuples in the table.
Note, of course, that this is not practical once the number of tuples gets into the hundreds and
thousands of tuples.
This can be done with the statement:
SELECT *
FROM STUDENT;
Remember that the * indicates to display all attributes. All tuples are selected by not including a
WHERE clause.
Enter the above (again remember you can type the statement into a simple text file and then
copy and paste).
Then click the “Run SQL” button to view the results.
Next, let’s run the query to find class rank and major for Julia Zhao. We did this with the
following:
SELECT RANK, MAJOR
FROM STUDENT
WHERE SNAME = 'Julia Zhao';
Note that this query uses the upper-case attribute names used in sub-module 6.5 when the
database was actually created. Since the SNAME attribute has one of the char data types, there
are single quotes around the name we are looking for.
This query can be typed in or pasted over the prior query.
Then click the “Run SQL” button to view the results.
Now, let’s run the query to list all attributes for the students who are majoring in software
engineering. We did this with the following:
SELECT *
FROM STUDENT
WHERE MAJOR = 'SWENG';
Note that this query again uses the upper-case attribute name used in sub-module 6.5 when
the database was actually created. Since the MAJOR attribute has one of the char data types,
there are single quotes around the name.
Enter the query, then click the “Run SQL” button to view the results.
Finally, we will run two queries to order the output, first in ascending order by student name,
then in descending order by student name. To list the name and advisor for each student where
the output is ordered by name, we use the ORDER BY clause in a query as follows:
SELECT SNAME, ADVISOR
FROM STUDENT
ORDER BY SNAME;
Again, note that using the full name has kept things simple, but it will cause problems with the
ordering, since the sort will be based on first name. "John Smith" will be ordered before
"Prashant Kumar" since the "J" sorts before "P" even though Smith sorts after Kumar. First
name and last name would need to be separate attributes to perform the sort in what would be
considered "normal" order.
Enter the query, then click the “Run SQL” button to view the results.
To sort in descending order, the keyword DESC is used. For example:
SELECT SNAME, ADVISOR
FROM STUDENT
ORDER BY SNAME DESC;
This would perform the same retrieval as the last example, but this time the results would be
displayed in descending order.
Enter the query, then click the “Run SQL” button to view the results.
We will practice and extend what we did in sub-modules 6.5 and 6.7 in Assignment 6.
M 7.1: Introduction to Joins
The basic use of the SELECT statement was presented in the last module. You might have
realized at some point during the module that all questions in that module could be answered
by using a SELECT statement that required only one table from the database. This meant that
only one table needed to be specified in the FROM clause. However, there are some questions
which cannot be answered with information from only one table. Information from two or
more tables is required. This type of query is accomplished by joining the tables.
Figure 1.2 with the attributes named as in Figure 5.6
STUDENT
Student_id   Name              Class_rank   Major    Advisor
17352        John Smith        SR           SWENG    1186
19407        Prashant Kumar    JR           CMPSC    3572
22458        Rene Lopez        SO           SWENG    2842
24356        Rachel Adams      FR           CMPSC    4235
27783        Julia Zhao        FR           SWENG    3983

FACULTY
Faculty_id   Name
2469         Huei Chuang
5368         Christopher Johnson
6721         Sanjay Gupta
7497         Martha Simons

DEPARTMENT
Code     Name                   Building
CMPSC    Computer Science       Hopper Center
MATH     Mathematics            Nittany Hall
SWENG    Software Engineering   Hopper Center

COURSE
Number      Name                                    Credit_hours
CMPSC121    Introduction to Programming             3
CMPSC122    Intermediate Programming                3
MATH140     Calculus with Analytic Geometry         4
SWENG311    O-O Software Design and Construction    3

SECTION
Section_id   Course_number   Semester
044592       CMPSC121        Fall
046879       MATH140         Fall
059834       CMPSC122        Spring
061340       MATH140         Spring
063256       CMPSC121        Fall
063593       SWENG311        Fall

TRANSCRIPT
Student_id   Section_id   Grade
22458        044592       A
22458        046879       B
22458        059834       A
22458        063256       C
24356        061340       A
24356        063256       B
Figure 1.2 An example database that stores information about a university
An example of this, again using the example database from Figure 1.2, would be to list the IDs
of all students majoring in software engineering. Since students have their major listed by code,
not department name, we need to use the DEPARTMENT table to find the department code for
the department named software engineering. That code (which turns out to be SWENG) is then
used in the STUDENT table to find all students with the major of SWENG. So, in the query, we
need information from both the STUDENT and DEPARTMENT tables. To answer the question
asked, the query would be:
SELECT Student_id
FROM STUDENT, DEPARTMENT
WHERE DEPARTMENT.Name = 'Software Engineering' AND Code = Major;
The result would look like:
Student_id
17352
22458
27783
There are several things to note here. Since both the STUDENT and DEPARTMENT tables are
needed, both are listed in the FROM clause. Since we need all students whose department
name is "Software Engineering" we have the first condition in the WHERE clause. Since "Name"
is the name of an attribute in both tables, if WHERE Name = 'Software Engineering' AND Code =
Major had been specified in the WHERE clause, the system would not know whether to use the
Name attribute in the STUDENT table or the Name attribute in the DEPARTMENT table. In this
database then, the attribute name "Name" is ambiguous. Using the same attribute name in
different tables is allowed in SQL. However, doing so then requires that in a SELECT statement
which includes both tables, the attribute name must be qualified with the name of the table we
are referencing. The attribute name is qualified by prefixing the table name to the attribute
name and separating the two by a period. Since we need to look up "Software Engineering" in
the DEPARTMENT table (and not in the STUDENT table), we must use DEPARTMENT.Name in
the WHERE clause.
This will select the tuple from the DEPARTMENT table with the name "Software Engineering".
This part of the WHERE clause is known as the selection condition. We then must match the
Code value from the selected tuple in the DEPARTMENT table with the Major value from tuples
in the STUDENT table. When we combine tuples from two tables it is known as a join. How the
tuples are combined is specified in the above query by the second part of the WHERE
clause, Code = Major. This is known as the join condition and will combine the two tuples
where the student's major matches the department code associated with the name Software
Engineering. Note that since Code and Major are unique attribute names across the two tables,
the qualification by table name is not required. It can be added for clarity, but as shown here, it
is not required. The attribute specified in the SELECT clause, in this case Student_id, then has its
values displayed for all joined tuples which satisfy the conditions in the WHERE clause.
As a reminder, we used the Boolean operator "AND" as part of the WHERE clause. Both AND
and OR are valid keywords and perform their normal Boolean function.
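For comparison, the same query with every attribute fully qualified (purely a readability choice;
the partially qualified form above works the same way) would be:

SELECT STUDENT.Student_id
FROM STUDENT, DEPARTMENT
WHERE DEPARTMENT.Name = 'Software Engineering' AND DEPARTMENT.Code = STUDENT.Major;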
Another example would be to write a query to list the name of John Smith's advisor. Since the
advisor's name is not available in the STUDENT table, to answer this question we need to look
up the ID of John Smith's advisor in the STUDENT table. Once we get that ID, we need to go to
the FACULTY table and find the name of the faculty member with the ID found in the STUDENT
table. That would be accomplished with a join. The query would be:
SELECT FACULTY.Name
FROM STUDENT, FACULTY
WHERE STUDENT.Name = 'John Smith' AND Advisor = Faculty_id;
In looking at the sample data in the tables, it can be seen that the ID of the advisor for John
Smith is 1186. There is no faculty in the sample data with ID 1186. If we assume that in the full
database, there is a faculty record where there is a faculty ID of 1186 and the name of the
faculty with this ID is Mary Hart, the result of the above query would look like:
FACULTY.Name
Mary Hart
In this query, the selection condition is STUDENT.Name = 'John Smith'. All tuples with the
matching student name (in this case only one) will be selected. The join condition is Advisor =
Faculty_id. This will combine the two tuples where the faculty id in the FACULTY table matches
the advisor value from the STUDENT table associated with the name John Smith. From the
result of the join, the name of the faculty is displayed. Note again that the attribute "Name" is
ambiguous since it is used in both tables. It must be qualified by table name wherever it is used
in the SELECT statement. In the first query, an ambiguous name was used only in the WHERE
clause. In this query it is needed both in the WHERE clause and in the SELECT clause since both
contain an ambiguous attribute name.
Any number of selection and join conditions may be specified in the WHERE clause of an SQL
query. We will look at this in the next sub-module.
M 7.2: Joins - Continued
To continue the discussion of joins, consider the following question. List the names of all
courses taken by the student with ID 22458 during Fall 2017 semester. In looking at the
database, course names are only available in the COURSE table. We need to look in the
TRANSCRIPT table to see what courses were taken by the student with ID 22458. Unfortunately
that table lists only the unique section ID of the courses taken. To find out which courses the
sections belong to, we need to look up the course number which corresponds to the section ID in
the SECTION table.
So to answer the question, the query would be:
SELECT Name
FROM COURSE, TRANSCRIPT, SECTION
WHERE TRANSCRIPT.Section_id = SECTION.Section_id AND
      Course_number = Number AND
      Student_id = '22458' AND
      Semester = 'Fall' AND
      Year = 2017;
The result would look like:
Name
Introduction to Programming
Calculus with Analytic Geometry
We can see that this query requires information from three database tables. When three tables
are needed, two join conditions must be specified in the WHERE clause. The first join
condition, TRANSCRIPT.Section_id = SECTION.Section_id, combines the tuples from the
TRANSCRIPT and SECTION tables where the Section_id values in the tuples match. The second
join condition, Course_number = Number, joins the SECTION and COURSE tables where the
course numbers in the two tables match. Note that the second join condition does not require
the attribute names to be qualified by the table names since the attribute names are unique in
the two tables. If it seems to clarify the join condition, it can be specified with the qualified
names as SECTION.Course_number = COURSE.Number.
You can think of the joins as creating one large tuple containing all the information of the three
separate tuples which have been joined together. From these joined tuples, the selection
conditions will select all tuples containing student id 22458. From these, it will further select
only those tuples from fall semester, 2017. This will be the final result set where the Name (in
this case the course name since that is the only attribute in the three tables with that name)
will be printed for each tuple in the result.
Note that in the last sub-module, the selection condition was stated first in the WHERE clause
and this was followed by the join condition. Here, the join conditions were listed first and these
were followed by the selection conditions. Both work the same as far as the system is
concerned. Many prefer to list the join conditions first, but the ordering of the conditions is
based on user preference as to which seems clearer. Before performing the actual retrieval, the
system will examine all conditions and determine in which order the conditions should be
applied in order to produce the results most efficiently. The details of how this is done are a very
interesting topic, but will not be covered in this basic course. If you are interested in query
processing and query optimization, it is discussed in Chapters 18 and 19 in the text.
Aliasing and Recursive Relationships
In sub-module 4.4 we discussed recursive relationships. The example in that sub-module
considered the FACULTY entity. We made the assumption that all department chairs are also
faculty. We then considered a SUPERVISES relationship where the chair supervises the other
faculty in the department. This relationship involved only the FACULTY entity, but it was on
both sides of the relationship. We saw that only one entity instance represented the "chair
side" of the relationship, but many entity instances represented the "other faculty in the
department" side of the relationship. This makes the cardinality of the relationship 1:N,
representing a 1:N unary relationship. To capture this relationship, we would add an attribute,
call it Chair_id, to the FACULTY table. This would be considered a foreign key referencing the
Faculty_id of the chair in the same table. This requires us to join the FACULTY table to itself.
This can be thought of as two different copies of the table with one copy representing the role
of chair and the other copy representing the role of "regular faculty." To demonstrate how this
is used in a SELECT statement, consider a query to list the names of all faculty and the names of
the chair supervising the faculty.
So to answer the question, the query would be:
SELECT F.Name, C.Name
FROM FACULTY AS F, FACULTY AS C
WHERE F.Chair_id = C.Faculty_id;
Since this doesn't actually exist in the example database, an actual result cannot be shown. The
join condition will cause all tuples on the "regular faculty" side to be joined with all
corresponding tuples on the chair side. However, since there is only a join condition and no
selection condition, all joined tuples will be included in the result. From the joined tuples, the
name of the faculty and the name of his/her chair will be displayed.
The result table would look like the following:
F.Name               C.Name
first faculty name   corresponding chair name
etc.                 etc.
Since the FACULTY relation is used twice, we need to be able to distinguish which copy of the
relation we are referencing. Alternative relation names must be used for this. The alternative
names are called aliases or tuple variables. The query above shows that an alias can be
specified following the keyword AS. Note that it is also possible to rename the relation
attributes within the query by giving them aliases. For example, we can write
FACULTY AS F(Fid, Nm, De, Cid)
in the FROM clause. This provides an alias for each of the attribute names: Fid for Faculty_id,
etc.
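Using those attribute aliases, the chair query above could be rewritten as follows (a sketch only,
since the Chair_id attribute and the alias names are hypothetical):

SELECT F.Nm, C.Name
FROM FACULTY AS F(Fid, Nm, De, Cid), FACULTY AS C
WHERE F.Cid = C.Faculty_id;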
M 7.3: Types of Joins
We have looked at joins in the last two sub-modules. What we have done so far is the basic
join, more specifically the equijoin. There are additional types of joins available and they will be
discussed below. These types come from relational algebra and because they are useful in
many situations, they have been incorporated into SQL. We will not cover relational algebra in
this course, but if you would like additional details, see sections 8.3.2 and 8.4.4 in the text. In
the sections below, a description of each join will be followed by an example. The examples are
based on the following database tables. Note that the tables have been modified slightly from
the earlier examples.
Also note that both the queries and the results shown might be somewhat different depending
on the DBMS you are using. They are shown to illustrate the various points.
STUDENT
Student_id   Name              Class_rank   Major    Advisor
17352        John Smith        SR           SWENG    1186
19407        Prashant Kumar    JR           CMPSC    3572
22458        Rene Lopez        SO           SWENG    2842
24356        Rachel Adams      FR           <null>   4235
27783        Julia Zhao        FR           SWENG    3983
DEPARTMENT
Code     Name                   Building
CMPSC    Computer Science       Hopper Center
MATH     Mathematics            Nittany Hall
SWENG    Software Engineering   Hopper Center
Assume there is a request to list student name and ID along with code of their major
department and the name of the department. Based on what we have done so far, we would
write a query like this:
SELECT Student_id, STUDENT.Name, Major, DEPARTMENT.Name
FROM STUDENT, DEPARTMENT
WHERE Major = Code;
This should produce the following result:
Student_id   STUDENT.Name      Major    DEPARTMENT.Name
17352        John Smith        SWENG    Software Engineering
19407        Prashant Kumar    CMPSC    Computer Science
22458        Rene Lopez        SWENG    Software Engineering
27783        Julia Zhao        SWENG    Software Engineering
To look at the full joined table, we could execute:
SELECT *
FROM STUDENT, DEPARTMENT
WHERE Major = Code;
and we would see:
Student_id   STUDENT.Name      Class_rank   Major    Advisor   Code     DEPARTMENT.Name        Building
17352        John Smith        SR           SWENG    1186      SWENG    Software Engineering   Hopper Center
19407        Prashant Kumar    JR           CMPSC    3572      CMPSC    Computer Science       Hopper Center
22458        Rene Lopez        SO           SWENG    2842      SWENG    Software Engineering   Hopper Center
27783        Julia Zhao        FR           SWENG    3983      SWENG    Software Engineering   Hopper Center
So far we have been specifying both the join condition(s) and the selection condition(s) in the
WHERE clause. This is often convenient for simple queries. It is also allowable to specify the join
condition(s) in the FROM clause. This format may make complex queries more understandable.
An example of using the format would be:
SELECT *
FROM STUDENT JOIN DEPARTMENT ON Major = Code;
This query would produce the same result shown above.
Note that in this resultant joined table, both the Major and Code attributes are shown. Based
on the join condition the values in the columns will be identical. To remove one of the duplicate
columns, you can specify a natural join. In a natural join, the join is performed on identically
named columns. The resultant table has the identically named column(s) repeated only once.
The natural join is based on column names, so you do not specify the ON clause.
With the above tables, there are no identically named columns. To perform a natural join, we
need to use the AS construct to rename one of the relations and its attributes so one of the
attribute names matches an attribute name in the other table. We can perform a natural join
with a query like the following:
SELECT *
FROM (STUDENT AS STUDENT (Student_id, Name, Class_rank, Code, Advisor)
      NATURAL JOIN DEPARTMENT);
The resulting table would look like:
Code     Student_id   STUDENT.Name      Class_rank   Advisor   DEPARTMENT.Name        Building
SWENG    17352        John Smith        SR           1186      Software Engineering   Hopper Center
CMPSC    19407        Prashant Kumar    JR           3572      Computer Science       Hopper Center
SWENG    22458        Rene Lopez        SO           2842      Software Engineering   Hopper Center
SWENG    27783        Julia Zhao        FR           3983      Software Engineering   Hopper Center
The Major column (aliased to Code) and the DEPARTMENT Code column now appear only once in the
joined table, as a single Code column. This is the distinguishing feature of the natural join. Many
systems will show the common column(s) used in the join as the first column(s) in the result.
The default type of join we have used, and the type used in the earlier examples is called
an EQUIJOIN. Both the equijoin and natural join are forms of inner joins. An inner join is a join
where a tuple is included in the result only if a matching tuple exists in the other relation. In the
above example, a student such as Rachel Adams who does not have a declared major will not
appear in the result. Similarly a department with no declared majors, such as Math, will also not
appear in the result. Sometimes this is desirable, other times it is not.
In situations where it is desirable to include tuples without a matching tuple in the other
relation, a different type of join, called an OUTER JOIN must be performed. There are three
different variations, and each is described below.
Using the tables above, if we want to show all student tuples whether or not there is a
matching tuple in the department table, we use a LEFT OUTER JOIN. The select statement
would be:
SELECT *
FROM (STUDENT LEFT OUTER JOIN DEPARTMENT ON Major = Code);
The resulting table would be:
Student_id   STUDENT.Name      Class_rank   Major    Advisor   Code     DEPARTMENT.Name        Building
17352        John Smith        SR           SWENG    1186      SWENG    Software Engineering   Hopper Center
19407        Prashant Kumar    JR           CMPSC    3572      CMPSC    Computer Science       Hopper Center
22458        Rene Lopez        SO           SWENG    2842      SWENG    Software Engineering   Hopper Center
24356        Rachel Adams      FR           <null>   4235      <null>   <null>                 <null>
27783        Julia Zhao        FR           SWENG    3983      SWENG    Software Engineering   Hopper Center
In this case Rachel Adams does not have a declared major. As such, there is a NULL value in her
major field. Since NULL does not match any department tuple, with the left outer join Rachel's
full information is shown and NULL values are shown for the department part of the tuple. I
have shown NULL values the way they are shown in Java DB. They might be shown in a different
manner in other DBMSs.
Note that if Rachel had a value for Major that does not match a department code, the value for
Major would be displayed in the joined tuple, but the other values above (those associated with
the DEPARTMENT table) would still be NULL.
Similarly, if we wished to display all department tuples whether or not there is a student tuple
with that major, we would use a RIGHT OUTER JOIN. The select statement would be:
SELECT *
FROM (STUDENT RIGHT OUTER JOIN DEPARTMENT ON Major = Code);
The resulting table would be (showing ordering from Java DB):
Student_id   STUDENT.Name      Class_rank   Major    Advisor   Code     DEPARTMENT.Name        Building
19407        Prashant Kumar    JR           CMPSC    3572      CMPSC    Computer Science       Hopper Center
<null>       <null>            <null>       <null>   <null>    MATH     Mathematics            Nittany Hall
17352        John Smith        SR           SWENG    1186      SWENG    Software Engineering   Hopper Center
22458        Rene Lopez        SO           SWENG    2842      SWENG    Software Engineering   Hopper Center
27783        Julia Zhao        FR           SWENG    3983      SWENG    Software Engineering   Hopper Center
Again note that the Code for the Mathematics department is not NULL. However, there is no
matching Major value in the student tuples. As such, the Mathematics department is
represented, but the values for the student side of the tuple are shown as NULL. Unlike the last
example, the value for Code is present here (not NULL) so the Code value is shown in the joined
tuple.
The final type of outer join is the FULL OUTER JOIN. The concept is that all tuples from both
tables are included in the result. When there is a match, information from both matching tuples
is included in the joined tuple. If there is no match for a tuple in the left table, the values for
the right table appear as NULL in the joined tuple. If there is not a match for a tuple in the right
table, the values for the left table appear as NULL in the joined tuple. The SQL statement would
be:
SELECT *
FROM (STUDENT FULL OUTER JOIN DEPARTMENT ON Major = Code);
The result would be a "combination" of the above two result tables. The full outer join is now in
the SQL standard, but many DBMSs have still not implemented it. It has not yet been
implemented in Java DB.
A final type of "join" is the full Cartesian product of two tables. This pairs every tuple in the left
table with every tuple in the right table. There is no matching of attribute values in this type of
join. This may produce many, many joined tuples and is useful in very few cases. Since there is a
complete pairing, the resulting set has m * n tuples where m is the number of tuples in the left
table and n is the number of tuples in the right table. Since the Cartesian product is also known
as a Cross product, it is known as a CROSS JOIN in SQL syntax. This is implemented in Java DB.
The syntax is:
SELECT *
FROM (STUDENT CROSS JOIN DEPARTMENT);
No join conditions are specified since all tuples are included. In our small example tables, we
have five student tuples and three department tuples. This produces a result set with 5 * 3 or
15 tuples. Again, caution is called for when using the cross join since the result set can be very
large.
As mentioned at the beginning of this sub-module, if you are interested, additional details can
be found in the relational algebra chapter in the text (Chapter 8), particularly Sections 8.3.2 and
8.4.4.
M 7.4: Aggregate Functions
Aggregate functions are provided to summarize (or aggregate) information from multiple tuples
into a single tuple. There are several aggregate functions which are built into SQL. These
include COUNT, SUM, MAX, MIN, and AVG.
The COUNT function returns the number of tuples retrieved by a query. The other four
functions are applied to a set of numeric values and return the sum, maximum, minimum, and
average of the values in the set. These functions can be used in a SELECT clause, as will be
demonstrated below. They can also be used in a HAVING clause as discussed in the next sub-module. MAX and MIN can also be used with attributes which are not numeric as long as the
domain of the attribute provides total ordering of possible values. This means that for any two
values in the domain, it can be determined which of the two comes first in the defined order.
An example of this would be alphabetic strings, which can be ordered alphabetically. Additional
examples are the DATE and TIME domains, which we have not yet discussed.
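As a quick illustration of this non-numeric use, the alphabetically first student name in the
STUDENT table from our example database could be retrieved with:

SELECT MIN(Name)
FROM STUDENT;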
Although there are some numeric values in the example database we have been using, the
attributes in that database do not provide good examples of the aggregate functions. For
example, it makes no sense, in general, to ask for the maximum student id or for the sum of the
student ids. As a better example, assume there is a small database table, called SCORES, which
contains the scores of students for a particular test. The attributes will be only Student_id and
Score, where score is a number between 0 and 100. It then does make sense to ask for the
maximum score, the minimum score, and the average score on the test. The maximum score
can be found by using the following query.
SELECT MAX (Score)
FROM SCORES;
This will return a single-row containing the highest score on the test.
Several aggregate functions can be included as part of the SELECT clause. For example
SELECT MAX (Score), MIN(Score), AVG(Score)
FROM SCORES;
will return a single-row containing the highest score, the lowest score, and the average (mean)
score on the test.
The COUNT function can be used to display the number of tuples. Continuing with the above
example,
SELECT COUNT(*)
FROM SCORES;
will display the number of students (and thus the number of tuples) in the table. The use of (*)
with the COUNT function is common. However, using the * will provide a count of all tuples,
including those which contain NULL values in certain attributes. Often, this is the desired result,
but not always. The use of an attribute name instead of * is discussed below.
Up to this point, we have applied the aggregate functions to the entire table since the WHERE
clause was not included. We can, however, include the WHERE clause to choose only some of
the tuples to be used. To expand the above example, assume the SCORES table also included an
attribute named Major which contains the code for the student's major. Further assume that
the course is open to all students. It is primarily taken by CMPSC and SWENG students, but a
few students from various other majors also take the course. If we want the test summary for
only the SWENG students, the following query can be used.
SELECT MAX (Score), MIN(Score), AVG(Score)
FROM SCORES
WHERE Major = 'SWENG';
To determine how many SWENG students took the test, we can use:
SELECT COUNT(*)
FROM SCORES
WHERE Major = 'SWENG';
Here, all tuples are retrieved where the value in the Major attribute is SWENG. The number of
tuples retrieved by this condition in the WHERE clause are then counted and the number of
tuples is returned. The * in the above COUNT examples causes the number of tuples to be
counted.
It is also possible to specify an attribute rather than the *. We could have specified:
SELECT COUNT(Score)
FROM SCORES;
This will often result in the same count as when the * is specified. The difference is that
specifying an attribute name will count tuples with non-NULL values for the attribute rather
than all tuples. If all tuples have a value in the specified attribute (column), the count will be the
same. However, if any of the tuples have a NULL for the value in Score, those tuples will not be
counted. In general, when an attribute is specified in an aggregate function, any NULL values
are discarded before the function is applied to the values.
If we desire to only count the number of unique scores on the exam, we can include DISTINCT
as in:
SELECT COUNT(DISTINCT Score)
FROM SCORES;
This will eliminate duplicate scores and provide a count of the number of unique scores on the
test.
M 7.5: GROUP BY and HAVING Clauses
In the last sub-module, we applied aggregate functions to either the tuples in the entire table or
to all tuples which satisfied the selection condition in the WHERE clause. It is sometimes
desirable to apply an aggregate function to subgroups of tuples in a table.
Consider the small SCORES table from the last sub-module where the attributes are
Student_id, Score, and Major. Score is a number between 0 and 100.
Suppose we want to find the average score on the test for each major. To do this, we need to
partition the table into non-overlapping subsets of tuples. These subsets are referred to
as groups in SQL. Each group will consist of tuples that have the same value for an attribute or
attributes. This is called the grouping attribute(s). The aggregate function is then applied to
each subgroup independently. This produces the summary information about each group.
The GROUP BY clause is used in SQL to accomplish the grouping. The grouping attribute(s)
should be listed as part of the SELECT clause. This provides a "label" for the value(s) shown in
the aggregate function.
This would be accomplished by the query:
SELECT Major, AVG(Score)
FROM SCORES
GROUP BY Major;
The results would look something like the following:
Major    AVG(Score)
CMPSC    88
SWENG    90
MATH     86
PSYCH    85
The query will run if you do not follow the above suggestion and you leave the grouping
attribute off of the SELECT clause. For example the query:
SELECT AVG(Score)
FROM SCORES
GROUP BY Major;
will run and the result will look something like:
AVG(Score)
88
90
86
85
A result display such as this is usually somewhat meaningless. However, this can be the desired
outcome in some instances.
Note that the rows in the above queries are not ordered. ORDER BY can be used to specify a
row order in this type of query also.
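For example, to list the per-major averages in alphabetical order by major, the first query above
can be extended with an ORDER BY clause:

SELECT Major, AVG(Score)
FROM SCORES
GROUP BY Major
ORDER BY Major;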
Further note that when GROUP BY is used, non-aggregated attributes (such as Major above)
listed in the SELECT clause must be included in the GROUP BY clause. For example, in the first
query above, it would cause an error to start the SELECT clause "SELECT Major, Student_id,
AVG(Score)". Because Major is specified in the GROUP BY clause, the same value of Major
belongs to everyone in the group, so a single value applies to everyone in the group and this
value can be listed in the result. However, many different Student_id values are contained
within a given group, so a single value cannot be listed. Since the DBMS does not have an
appropriate value to list, an error condition is generated.
The number of students in each group can be included with the following:
SELECT Major, COUNT(*), AVG(Score)
FROM SCORES
GROUP BY Major;
This would provide a result similar to:
Major    COUNT(*)    AVG(Score)
CMPSC    14          88
SWENG    12          90
MATH     3           86
PSYCH    1           85
Conditions can also be put on groups. The selection conditions put in the WHERE clause are
used to limit the tuples to which the functions are applied. For example, if we only wish to
include tuples where the major is either SWENG or CMPSC, we would use:
SELECT Major, COUNT(*), AVG(Score)
FROM SCORES
WHERE Major = 'SWENG' or Major = 'CMPSC'
GROUP BY Major;
This would result in a table that looks like the following since only tuples with one of the two
majors are included in the set to which the aggregate functions are applied.
Major    COUNT(*)    AVG(Score)
CMPSC    14          88
SWENG    12          90
The HAVING clause is used to provide for selection based on groups that satisfy certain
conditions, rather than applying conditions to tuples prior to the grouping. For example, to list
results for only those majors which have at least two students in the class, the following query
would be used:
SELECT Major, COUNT(*), AVG(Score)
FROM SCORES
GROUP BY Major
HAVING COUNT(*) >= 2;
That would produce the following:
Major    COUNT(*)    AVG(Score)
CMPSC    14          88
SWENG    12          90
MATH     3           86
M 7.6: Comparisons Involving NULL
and Three-Valued Logic
As discussed in sub-module 5.1.1, NULL is used to represent a missing value, but it can be used
in three different cases.
1. The value is unknown. An example would be if a person's birth date is not known.
The value would be represented as NULL in the database.
2. The value is unavailable or withheld. An example would be that a person has a
home phone, but does not wish to provide the value for the database. Again, the
home phone would be represented as a NULL in the database.
3. The value is not applicable. For example, if the database splits the name into first
name, middle name, last name, and suffix. The suffix would be a value such as Jr. or
III. In many cases this would not apply to a particular individual, so the suffix would
be represented as a NULL in the database.
Sometimes it can be determined which of the three cases is involved, but other times it cannot.
If you consider a NULL in a home phone attribute, it could be NULL for any of the three reasons.
Therefore, SQL does not try to distinguish among the various meanings of NULL. SQL considers
each NULL value to be different from every other NULL value in the tuples. When a tuple with a
NULL value for an attribute is involved in a comparison operation, the result is deemed to be
UNKNOWN, meaning it could be TRUE or it could be FALSE. Because of this, the standard two-valued (Boolean) AND, OR, and NOT logic does not apply. It must be modified to use a three-valued logic.
The three-valued logic is shown in the following tables, which can be read as follows. In the first
two tables, the operator is shown in the upper left-hand corner of the table. The values in
the first column represent the first operand, and the values in the first row represent the second
operand. The value at the intersection of a row and column is the result. For example, the
result of (TRUE AND UNKNOWN) is UNKNOWN. The value of (TRUE OR UNKNOWN) is TRUE. The
third table represents the unary NOT operator. The first column represents the operand, while
the second column shows the result in the corresponding row.
AND        TRUE       FALSE     UNKNOWN
TRUE       TRUE       FALSE     UNKNOWN
FALSE      FALSE      FALSE     FALSE
UNKNOWN    UNKNOWN    FALSE     UNKNOWN

OR         TRUE       FALSE     UNKNOWN
TRUE       TRUE       TRUE      TRUE
FALSE      TRUE       FALSE     UNKNOWN
UNKNOWN    TRUE       UNKNOWN   UNKNOWN

NOT
TRUE       FALSE
FALSE      TRUE
UNKNOWN    UNKNOWN
In general, when used in a WHERE clause, only those combinations of tuple values that result in
a value of TRUE are selected. When the result is either FALSE or UNKNOWN, the tuple is not
selected. It is possible to check whether an attribute value is NULL. Instead of using = or <> to
compare an attribute value to NULL, the comparison operators IS and IS NOT are used. Since all
NULL values are considered distinct from all other NULL values, equality comparison will not
work for NULL values. In our example database, if a student is allowed to not have a declared
major, the value of NULL is stored in the Major column for that student. The following query
would be used to list the students who currently do not have a declared major.
SELECT Student_id, Name
FROM STUDENT
WHERE Major IS NULL;
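Conversely, IS NOT NULL selects the tuples that do have a value for the attribute. To list the
students who do have a declared major, we would use:

SELECT Student_id, Name
FROM STUDENT
WHERE Major IS NOT NULL;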
M 8.1: Substring Pattern Matching
and Arithmetic Operators
Before moving on to subqueries, which are the main topic of this module, this sub-module covers
two smaller topics not yet addressed: substring pattern matching and arithmetic operators.
Substring Pattern Matching
So far, when comparing character string values, we have used the equality comparison operator. This
requires an exact match to the value in the character field, and this is often what we want. For
example:
SELECT *
FROM STUDENT
WHERE MAJOR = 'SWENG';
will retrieve all tuples where the value in the MAJOR attribute is exactly 'SWENG'. That is the
desired action in this case.
Similarly, the query
SELECT RANK, MAJOR
FROM STUDENT
WHERE SNAME = 'Julia Zhao';
will retrieve all tuples where the value in the SNAME attribute is exactly 'Julia Zhao'. This is the
desired action in this case as long as we know (remember) Julia's first and last name and the
correct spelling of each. It also requires that the name be entered correctly in the database. For
example, if the person entering Julia's name inadvertently put two spaces between her first and
last name, the above query would not retrieve the tuple. For these and other reasons, SQL
provides the LIKE comparison operator which can be used for matching patterns in strings.
Partial strings can be specified using two reserved characters: % matches any string of
characters (zero or more), and the underscore (_) matches exactly one character.
In the above example, if we remembered the name, but can't remember whether it is Julia or
Julie, the query can be stated as
SELECT SNAME
FROM STUDENT
WHERE SNAME LIKE 'Juli_ Zhao';
The _ will match any character. In this case it will pick up either Julie or Julia. Of course it will
also pick up Julib and Juliz. We can accept this since it is unlikely that such a misspelling will be
in the database.
If we remember that her name is Julia, but can't remember anything about her last name
except that it begins with Z, the query can be stated as:
SELECT SNAME
FROM STUDENT
WHERE SNAME LIKE 'Julia Z%';
The % will match any string of zero or more characters. In this case it will find Julia Zhao. It will
also find Julia Zee and Julia Zimmerman, as well as any other Julias whose last name starts with
Z. Since we assume that even for a moderately sized database the retrieved set of tuples will be
somewhat small, we are willing to look through the retrieved tuples to find the Julia Z we want.
Suppose we want to find a faculty member whose first name we don't know, and where all we
know about his last name is that it starts with Ch. We can use the following query:
SELECT FNAME
FROM FACULTY
WHERE FNAME LIKE '%Ch%';
This will find all faculty with either first name or last name beginning with 'Ch'. Note that since
% can match zero characters, it will find first names beginning with 'Ch'. Since % can match
many characters, the % can be matched against the first name and the space between names,
which will then match last names beginning with 'Ch'. The pattern match is case sensitive, so it
will not match a lower-case 'ch' in the middle of a name, such as the 'ch' in 'Zachary White'.
As indicated in prior modules, first and last names are usually stored as separate attributes. The
above is just a sample of where the LIKE comparison operator can be useful.
Arithmetic Operators in Queries
SQL allows the use of arithmetic in queries. The arithmetic operations for addition (+),
subtraction (-), multiplication (*), and division (/) can be used with numeric values or attributes
with numeric domains. To see what would happen if all faculty were given a 5% raise, the
following query could be used.
SELECT FNAME, SALARY, 1.05 * SALARY AS RAISE
FROM FACULTY;
This would list the name, current salary, and proposed salary with a 5% raise for each tuple in
the faculty table.
To just see the total impact on the budget, the query can be stated as:
SELECT SUM (SALARY), SUM (1.05 * SALARY) AS RAISE
FROM FACULTY;
This would show the current total of salaries as well as the total of the proposed salaries.
Either of the two prior queries can be stated with a WHERE clause. For example, to see the
impact of giving the 5% raise to only the faculty in the Software Engineering department, the
query would be
SELECT SUM (SALARY), SUM (1.05 * SALARY) AS RAISE
FROM FACULTY
WHERE DEPT = 'SWENG';
An additional comparison operator, BETWEEN, is provided for convenience. An example is
SELECT *
FROM FACULTY
WHERE SALARY BETWEEN 50000 AND 70000;
This would show faculty tuples for all faculty whose salary is between $50,000 and $70,000
(inclusive). The query can also be stated as
SELECT *
FROM FACULTY
WHERE (SALARY >= 50000) AND (SALARY <= 70000);
Both queries return the same results, but many users find it easier to use the first query.
M 8.2: Introduction to Subqueries
In the last two modules we examined both basic SQL queries and SQL queries requiring the use
of a join. This module introduces an additional construct known as nested
queries or subqueries. In subqueries, complete select-from-where statements are nested as
blocks within another SQL query. This query containing the nested query is known as the outer
query.
Consider the query: Find the name of the department of the major for the student with
ID 17352. To answer this query we need to find the student tuple with id 17352 and get
the major code from that tuple. We then need to look in the department table to find the
tuple of the department with the matching code. From that tuple, we find the name of
the department.
One way to do this is a "manual" method. First run the following query:
SELECT Major
FROM STUDENT
WHERE Student_id = '17352';
This will find the tuple for student with id 17352 (the John Smith tuple in the example database)
and display the major (which is SWENG) from the tuple. We can write down this information
using paper and pencil. We then run the following, inserting the "SWENG" value we wrote
down from the first query:
SELECT Name
FROM DEPARTMENT
WHERE Code = 'SWENG';
This will find the tuple for the SWENG department and list the name of the department
contained in the tuple, which is Software Engineering.
While this "manual" two-step process works nicely to demonstrate the point, it doesn't work
well in practice, especially when we need to write down more than one or two short facts.
As shown in the last module, a second way to handle this is to use a join. That query saves the
"manual" step and retrieves the information with the following query:
SELECT DEPARTMENT.Name
FROM STUDENT, DEPARTMENT
WHERE Major = Code
AND Student_id = '17352';
Looking at this query, it may be viewed as finding the tuple from the STUDENT table with
student id 17352, and then joining that tuple to the tuple from the DEPARTMENT table where
the major/code matches. From the joined tuple, the department name is shown. Another way
to look at this, however, is to join all the matching tuples first, and then select all joined tuples
with student id 17352 (there will be only one such tuple - the joined John Smith tuple in the
example database). From the selected tuples from the join, display the department name. In
reality, the same result (Software Engineering) will be displayed by following either process.
Which method is used will be determined by the steps of query optimization. Query
optimization is a topic that will not be covered in this course. If you are interested in this topic,
there is more information provided in Chapters 18 & 19 in the text.
Another way this query can be addressed is by using a subquery. This is a solution provided by
SQL to model the "manual" approach given above. To use this approach, we will create an inner
or nested query to answer the first part of the question (what is the major code for student
with ID 17352?). We will then use the answer to this inner query to form part of the outer
query. In this example, the outer query will answer the second part of the question: now that
we know the major (which is a department code) for the student with ID 17352, what is the
department name that matches the code? The query would be stated as:
SELECT Name
FROM   DEPARTMENT
WHERE  Code =
       (SELECT Major
        FROM STUDENT
        WHERE Student_id = '17352');
The inner query is executed first, then the outer query.
This sub-module will close with an observation. Those who are new to writing SQL SELECT
statements, but who have a programming background, will be tempted to avoid using joins and
instead use subqueries. Try to resist this tendency and use joins unless there is a good reason to
use a subquery. Joins often seem a bit strange at first, but mastering them is important if you
want to continue with database work. Another advantage to using joins is that multiple levels
of nested subqueries can be difficult to write correctly and debugging them can prove difficult.
Having said that, there are certain cases where a subquery must be used to obtain the correct
results. This will be discussed in the next sub-module.
M 8.3: When Subqueries are Required
Although there are many cases where either a join or subquery can be used to answer a query,
there are some cases where a join will not work and a subquery must be used.
Consider the example from the last module. We described a small database table, called
SCORES, which contains the scores of students for a particular test. The attributes will be only
Student_id and Score, where score is a number between 0 and 100. Suppose we want to know
the id of the student (or students) who have the highest scores on the exam. At first glance, it
seems a relatively simple query will work.
SELECT Student_id, MAX(Score)
FROM   SCORES;
Or possibly
SELECT Student_id
FROM   SCORES
WHERE  Score = MAX(Score);
Unfortunately, neither of these queries will work. Both return an error. The issue is that the
above queries actually have two very different parts. First, the system must find the maximum
score on the exam. Then the system must find the student or students who achieved this score.
It is asking the system to perform two separate operations and then apply one to the other in
the correct order. This is asking the system to do something it is not able to do in a single step.
In this case, we must work more like a programmer and specify the two steps in the correct order. We must
specify that the system should find the maximum score on the exam first, and then use that
information to determine which students have that score. The first question is presented in the
subquery and the answer from the subquery (the max score) is used in the main query to
determine which student(s) earned that score. The query can be answered by using:
SELECT Student_id
FROM   SCORES
WHERE  Score =
       (SELECT MAX(Score)
        FROM SCORES);
Here the highest score on the exam is found in the subquery. The outer query then asks to
retrieve the tuple(s) of the students whose score value is equal to the score found in the
subquery.
Consider a second example where we use the expanded SCORES table from the last module
which also included an attribute named Major containing the code for the student's major. We
now want to find the Student IDs of the Software Engineering major(s) who had the highest score on the
exam (among the Software Engineering majors). The query would be posed this way:
SELECT Student_id
FROM   SCORES
WHERE  Major = 'SWENG'
AND    Score =
       (SELECT MAX(Score)
        FROM SCORES
        WHERE Major = 'SWENG');
This subquery adds the additional restriction to the query that we want to only look at students
in the SWENG major. One question the above SELECT statement raises is why the condition
Major = 'SWENG' is listed in both the outer query and the subquery. In the subquery it is
needed so we only select the tuples where the student is a SWENG major so the maximum
score is determined for only those tuples. If a CMPSC major has a higher score, that is not of
interest for this query. However, since only tuples with major of SWENG are used to calculate
the maximum score, why can't the query simply be stated as:
SELECT Student_id
FROM   SCORES
WHERE  Score =
       (SELECT MAX(Score)
        FROM SCORES
        WHERE Major = 'SWENG');
This seems like it will also answer the question that was asked; however, the query is not
correct. The subquery only returns a single number: the highest score on the exam for a
SWENG student. If that score is, say, 95, the outer query is working with the equivalent
condition of Score = 95. With this shortened query, all tuples (which includes all majors) will be
searched and those tuples with a score of 95 will be returned, regardless of major. This will list
all SWENG students with a score of 95 (there will be at least one), but it will also list students
from other majors who scored 95 on the exam. There may be no such student, but there also
may be one or several. These student IDs will also be listed in the result. Note that it is also
possible that one or more students from other majors actually scored higher than 95. Since an
equality test was stated (Score =), these students will not be listed. That is why WHERE Major =
'SWENG' must also be listed in the outer query. Once the maximum score is returned (again,
use 95 as an example), we only want the tuples of SWENG majors returned by the outer query.
M 8.4: Operators IN, NOT IN, ALL,
SOME, ANY
The IN and NOT IN Operators
The examples in the last sub-module used the = comparison operator for the results of the
subquery. This works in cases where the results from the subquery are a single value. If you
look back at the last sub-module, you will see that this was the case in all the queries shown
there. However, in the general case, the subquery will return a table, which is a set of tuples.
When a table is returned from the subquery, the comparison operator IN is used. Note that the
IN operator is more general than =, and it can be used as well as = when the subquery returns a
single value. For example, the query from the last sub-module:
SELECT Student_id
FROM   SCORES
WHERE  Score =
       (SELECT MAX(Score)
        FROM SCORES);
can also be stated as:
SELECT Student_id
FROM   SCORES
WHERE  Score IN
       (SELECT MAX(Score)
        FROM SCORES);
This query will return the same results as the first query. However, when returning a single
value, the first query is preferred. When a table of values is (or can be) returned, the second
query is required. For example, the query
SELECT Student_id
FROM   SCORES
WHERE  Score =
       (SELECT Score
        FROM SCORES
        WHERE Score > 80);
will return an error. Since the inner query returns a list of values, the WHERE clause in the outer
query cannot compare "SCORE = " against a set of values. In this case, IN must be used in the
query.
SELECT Student_id
FROM   SCORES
WHERE  Score IN
       (SELECT Score
        FROM SCORES
        WHERE Score > 80);
This will return the desired result.
The opposite of the IN operator is the NOT IN operator. This selects all tuples which do not
match the criteria. For example,
SELECT Student_id
FROM   SCORES
WHERE  Score NOT IN
       (SELECT MAX(Score)
        FROM SCORES);
will list the ids of all students who did not receive the highest score on the test.
Some DBMSs support using tuples with the IN clause. Assuming a larger set of scores in the
above database, to find out if any other students have the same major and score as the student
with id 12345, you can write the following query:
SELECT Student_id
FROM   SCORES
WHERE  (Major, Score) IN
       (SELECT Major, Score
        FROM SCORES
        WHERE Student_id = '12345');
The subquery returns the major and score for the student with id 12345. The subquery will find
only one tuple which matches the given student id, but it returns a tuple (Major, Score), not just
a single value. The outer query then matches all tuples in the SCORES table against the Major,
Score pair returned from the subquery. It will then display the student id of all students with
the same score and the same major as student 12345. Not all DBMSs accept this tuple syntax in
the IN clause. Java DB (Derby) does not support the tuple syntax.
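When the tuple form is not available, one possible alternative is a self-join using table aliases (a technique covered further in sub-module 8.5). This is only a sketch of one way to get the same result:
SELECT S.Student_id
FROM   SCORES AS S, SCORES AS T
WHERE  T.Student_id = '12345'
AND    S.Major = T.Major
AND    S.Score = T.Score;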
Subquery Comparison Operators ALL, SOME, ANY
The ALL keyword causes a value to be compared against all tuples in a given set. As an example, consider the
query:
SELECT Student_id
FROM   SCORES
WHERE  Score < ALL
       (SELECT Score
        FROM SCORES
        WHERE Major = 'SWENG');
The subquery first selects all tuples where the major is SWENG. The outer query then compares
each score against the entire set returned by the subquery and selects the tuples with scores less than
the scores of all of the SWENG majors (i.e. each and every SWENG major). In other words, the outer query finds the
tuples which have a score lower than the lowest score among all SWENG majors and then
displays the ids of such students. In addition to the < comparison operator, the =, >, >=, <=, and
<> comparison operators can be used with ALL.
In addition to ALL, the SOME and ANY keywords can be used. These two keywords are
synonyms and have the same effect. For example
SELECT Student_id
FROM   SCORES
WHERE  Score < ANY
       (SELECT Score
        FROM SCORES
        WHERE Major = 'SWENG');
The subquery will again first select and return all tuples where the major is SWENG. The outer
query then compares each score against the returned set and selects the tuples with scores less than at least one
of the score values in the returned set of SWENG major scores. In other words, the outer query
finds the tuples which have a score lower than the highest score among the SWENG majors and
then displays the ids of such students. As with ALL, in addition to the < comparison operator,
the =, >, >=, <=, and <> comparison operators can also be used.
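For instance, the same pattern with > ALL lists the students who scored higher than every SWENG major:
SELECT Student_id
FROM   SCORES
WHERE  Score > ALL
       (SELECT Score
        FROM SCORES
        WHERE Major = 'SWENG');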
M 8.5: Correlated Nested Queries
When a condition in the WHERE clause of a nested query references an attribute of a relation
declared in the outer query, the two queries are said to be correlated.
For example, consider the database you created in Assignment 7, which contained a FACULTY table
consisting of several attributes, including the faculty id (FAC_ID), the name of the faculty (FNAME),
and the faculty id of the chair of the faculty member's department (CHAIR). Suppose we want
to write a query to list the id and name of each faculty who reports to a chair. This can be done
using a join with the query:
SELECT F.FAC_ID, F.FNAME
FROM   FACULTY AS F, FACULTY AS C
WHERE  F.CHAIR = C.FAC_ID;
Conceptually, this can be viewed as having two copies of the FACULTY table. One copy, aliased
to F, can be thought of as the normal view of the table. The second copy, aliased to C, can be
thought of as an additional view of the table. The join builds a tuple by appending a tuple from
the C table to a tuple from the F table where the value of the FAC_ID attribute in the C table
matches the value of the CHAIR attribute in the F tuple.
In a correlated nested query, we would use the following.
SELECT F.FAC_ID, F.FNAME
FROM   FACULTY AS F
WHERE  F.FAC_ID IN
       (SELECT F.FAC_ID
        FROM   FACULTY AS C
        WHERE  F.CHAIR = C.FAC_ID);
Here, the subquery selects all tuples where the value in the chair field of a tuple in the "faculty
copy" of the FACULTY table (the F table) matches a value in the faculty id field of a tuple in the
"chair copy" of the FACULTY table (the C table). The tuples selected will be, in effect, all those
faculty who have a value in their chair field. The subquery then produces a result containing the
faculty id (of the "faculty copy") for all retrieved tuples. As a result of the WHERE clause, these
tuples are used by the outer query and the faculty id and name values in the "faculty copy" of
these tuples are displayed.
A note about the scope of names is useful at this point. In the first query, there are two copies
of the FACULTY table in the query, one given the alias name of F, the other the alias name of C.
Since aliases were not provided for the attribute names in either table, the attribute names of
both copies of the table are in scope for the entire SELECT statement. Using any of the attribute
names without qualification will produce an error since the name is ambiguous. In the second
query, all of the attribute names were again qualified. This does not produce an error, and
helps clarify which copy of the table we want to use in the various parts of the query. However,
with nested queries, the scope of an unqualified name is affiliated with the table or tables
specified in that part of the query. In the query above, any unqualified name in the subquery
will reference the C copy of the table. Any unqualified name in the outer query will reference
the F copy of the table. Based on this, any F qualifier can be dropped in the outer query and any
C qualifier can be dropped in the subquery. This means that the above query can be given as
SELECT FAC_ID, FNAME
FROM   FACULTY AS F
WHERE  FAC_ID IN
       (SELECT F.FAC_ID
        FROM   FACULTY AS C
        WHERE  F.CHAIR = FAC_ID);
While this version of the query will produce correct results identical to the results produced by
the query above, it is harder to read quickly. It seems easier to understand the intent of the
query where all attribute names are fully qualified.
M 8.6: The EXISTS Function
EXISTS is a Boolean function which returns TRUE or FALSE. The EXISTS function in SQL tests to
see whether the result of the subquery is empty (returned no tuples) or not. EXISTS will return
TRUE if the result contains at least one tuple. If no tuples are in the result, EXISTS returns FALSE.
Consider again the database you created in Assignment 7. We can restate the query from sub-module 8.5 using the EXISTS function. The query wishes to list the id and name of each faculty
who reports to a chair. A query which uses EXISTS to accomplish this is:
SELECT FAC_ID, FNAME
FROM   FACULTY AS F
WHERE  EXISTS
       (SELECT *
        FROM FACULTY AS C
        WHERE F.CHAIR = C.FAC_ID);
As in the last sub-module, the subquery selects all tuples where the value in the chair field of a
tuple in the "faculty copy" of the FACULTY table matches a value in the faculty id field of a tuple
in the "chair copy" of the FACULTY table. The tuples selected will be, in effect, all those faculty
who have a value in their chair field. The subquery then produces a result containing all
information (of the "faculty copy") for all retrieved tuples. In this query, it is irrelevant what
result attributes are listed since the EXISTS function only cares whether or not a tuple is
retrieved by the subquery.
When such a tuple exists, the id and name of the faculty is displayed. This can also be thought
of as taking each faculty tuple in the "faculty copy" one at a time and evaluating the subquery
using the chair field from the tuple and matching it with the faculty id field from the "chair
copy" of the FACULTY table. If there is a match, EXISTS returns TRUE and the faculty id and
name from the "faculty tuple" are displayed. If there is no match, EXISTS returns FALSE (this
faculty does not report to a chair) and nothing is displayed for this tuple.
The opposite can be accomplished by using the NOT EXISTS function. This function will return
TRUE if no tuples are in the result. If the result contains at least one tuple, NOT EXISTS returns
FALSE. For example, to list the id and name of each faculty who does not report to a chair, the
following query can be used.
SELECT FAC_ID, FNAME
FROM   FACULTY AS F
WHERE  NOT EXISTS
       (SELECT *
        FROM FACULTY AS C
        WHERE F.CHAIR = C.FAC_ID);
Here, again, the subquery selects all tuples where the value in the chair field of a tuple in the
"faculty copy" of the FACULTY table matches a value in the faculty id field of a tuple in the
"chair copy" of the FACULTY table. The tuples selected will be, in effect, all those faculty who
have a value in their chair field. When no such tuple exists, the id and name of the faculty is
displayed.
This can also be thought of as taking each faculty tuple in the "faculty copy" one at a time and
evaluating the subquery using the chair field from the tuple and matching it with the faculty id
field from the "chair copy" of the FACULTY table. If there is no match, NOT EXISTS returns TRUE
and the faculty id and name are displayed. If there is a match, NOT EXISTS returns FALSE (this
faculty does report to a chair) and nothing is displayed for this tuple.
M 8.7: Additional Functions and
Features
There are additional functions and features related to subqueries which we will not cover in this
course. Some of the following are not available on all DBMSs. These topics are listed below.
 There is a UNIQUE function. It is discussed very briefly in the text at the end of
Section 7.1.4.
 Subquery in GROUP BY (Section 7.1.8).
 There is a WITH construct briefly described in the first part of Section 7.1.9. This is
similar to creating a view, which will be discussed in the next module.
 There is a CASE construct briefly described in the second part of Section 7.1.9. This
can be used when a value can be different based on certain conditions. It is
somewhat similar to the SWITCH statement in C++.
 Recursive queries are briefly discussed in Section 7.1.10. An example of how such
queries can be useful would be an employee table where each employee's direct
supervisor is kept as an attribute in the employee tuple. The assumption is that,
unlike the faculty table in the example we have been using where only the chair is
stored in the table (one level of supervision), several layers of supervisors are stored
(multiple levels - supervisors have supervisors, etc.). Certain types of query can use
this feature.
 It isn't shown in the text, but it is possible to use a subquery in the FROM clause.
This is useful in some cases (see the sketch below).
 Subqueries can also be used in INSERT, DELETE, and UPDATE statements. We will
look at this in the next module.
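To illustrate the FROM-clause case, here is a minimal sketch using the SCORES table from earlier in this module (the passing threshold of 60 is just an assumed example value). The subquery in the FROM clause acts as a derived table that the outer query treats like any other table:
SELECT Major, AVG(Score) AS AVG_PASSING_SCORE
FROM   (SELECT Major, Score
        FROM SCORES
        WHERE Score >= 60) AS PASSING
GROUP BY Major;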
M 9.1: Specifying Constraints as
Assertions
In this sub-module we will discuss the CREATE ASSERTION statement in SQL. Back in Module 5
we discussed the inherent constraints that are built into the relational model. These included
primary and unique keys, entity integrity, and referential integrity. These constraints can be
specified in the CREATE TABLE statement in SQL. This was discussed in Module 6.
In Module 5 we also looked at schema-based or explicit constraints. These are expressed in the
schemas of the data model, usually by specifying them in the DDL. They include:
 domain constraints - specified as a data type of an attribute
 key constraints - specified as PRIMARY KEY or UNIQUE in a CREATE TABLE statement
 constraints on null values - specified as NOT NULL in an attribute definition
 entity integrity - automatically enforces that NULL values are not allowed in
attributes which make up the primary key
 referential integrity - set up using FOREIGN KEY ... REFERENCES
Also mentioned in Module 5 was an additional type of constraint that must be expressed and
enforced in a different way, often by using application programs. These are
called semantic constraints or business rules. Constraints of this type are often difficult
to express and enforce within the data model. They relate to the meaning and behavior of the
attributes. These constraints are often enforced by application programs that update the
database. In some cases, this type of constraint can be handled by assertions in SQL.
Note that although CREATE ASSERTION is specified in the SQL standard, it has not been
implemented in Java DB (Derby). It is presented here since at some point you might use one of
the DBMSs which has implemented the feature.
The syntax of the CREATE ASSERTION statement is:
CREATE ASSERTION <Constraint name>
CHECK (search condition);
The keywords CREATE ASSERTION are followed by a user defined constraint name, which is
used to identify the assertion. This is followed by the keyword CHECK, which is then followed by
a condition in parentheses. For the assertion to be satisfied, the condition must be true
for every database state. It is possible to disable a constraint, modify a constraint, or drop a
constraint. The constraint name is used when performing any of these actions. It is the job of
the DBMS to make sure that the constraint is not violated.
The condition can be any condition that is valid in a WHERE clause. However, in many cases a
constraint can be specified using the EXISTS and NOT EXISTS keywords. If one or more tuples in
the database cause the condition of the assert statement to be evaluated as FALSE, the
constraint is said to be violated. The constraint is said to be satisfied by a database state if no
combination of tuples in the database state violates the constraint.
One general technique for writing assertions is to write a query that selects any tuple which
violates the desired condition. If this type of query is included inside a NOT EXISTS clause, the
assertion specifies that the result of the query must be empty for the condition to be TRUE. This
implies that the assertion is violated if the result of the query is not empty.
As an example, consider the example database first introduced in Module 1 which showed only
a few sample tuples for each table. When the database is fully populated, it will contain several
thousand student tuples and several hundred faculty tuples. We want to make sure that no
faculty has an excessive number of student advisees. This ensures that every faculty has
sufficient time to work with all advisees who are assigned to him/her. We would like most
faculty to have fewer than 20 advisees, but we want no faculty member to have more than 30
advisees. This can be checked with the following assertion.
CREATE ASSERTION PREVENT_ADVISEE_OVERLOAD
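(A completion of the CHECK clause, sketched here to be consistent with the description that follows, would be:)
CHECK (NOT EXISTS (SELECT ADVISOR
                   FROM STUDENT
                   GROUP BY ADVISOR
                   HAVING COUNT(*) > 30));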
In this example, the name given to the assertion is PREVENT_ADVISEE_OVERLOAD. This is
followed by the keyword CHECK and the condition which must hold true for every database
state. In this condition, the SELECT statement will group all STUDENT tuples by ADVISOR. Any
advisor group having more than 30 tuples will be returned in the select. If any such tuple is
returned (meaning that the advisor has more than 30 advisees), NOT EXISTS will then be FALSE
(since at least one such tuple does exist). This causes the constraint to be violated.
Remember that in Module 6.3, CHECK was used to further restrict domain values. The example
used was:
If there is a university policy that no course may be offered for more than six credit hours, this can
be specified as:
Credit_hours INT NOT NULL CHECK (Credit_hours > 0 AND Credit_hours < 7);
This restricted the INT data type to only the values 1 - 6.
Module 6.3 also showed using CHECK to check values across a row or tuple. The example used
was a PRODUCT tuple where we did not want the sale price for a product set above the regular
price. We can make sure this does not happen by adding the following CHECK clause at the end
of the CREATE TABLE statement for the PRODUCT table.
CREATE TABLE PRODUCT(
rest of specification
CHECK (Sale_price <= Regular_price)
);
If the CHECK condition is violated, the insertion or modification of the offending tuple would
not be allowed to be stored in the database.
The two uses of CHECK from Module 6.3 are applied to individual attributes and domains and
to individual tuples. These are checked in SQL only when tuples are inserted or modified in
a specific table. This does allow the DBMS to implement the checking more efficiently in these
cases. This type of CHECK should be used where possible, but only when the designer is certain
that the constraint can only be violated by the insertion or modification of tuples. When this is
not the case, CREATE ASSERTION will need to be used. However, CREATE ASSERTION should
only be used when the more efficient simple checks cannot adequately address the desired
constraint.
M 9.2: Triggers
It is often desirable to specify an action which will be taken by the DBMS when certain events
occur or when certain conditions are satisfied. These actions are specified in SQL using the
CREATE TRIGGER statement. Only the basics of triggers are presented in this module and in
Chapter 7 in the text. If you are interested, additional information about triggers is presented in
Chapter 26, Section 1 in the text as part of the discussion of active database concepts. The basic
concept of triggers has existed since the early versions of relational databases. Triggers were
included in the SQL standard beginning with SQL-99. Today, many commercial DBMSs have
versions of triggers available, but many differ somewhat from the standard.
More specifically, CREATE TRIGGER has been implemented in Java DB (Derby). However, the
implementation is not as broad as that presented in the text. The broader version discussed in
the text is presented here since at some point you might use one of the DBMSs which has
implemented a more robust version of the feature. More specifics of the Java DB
implementation are described at the bottom of this page.
Triggers can be used to monitor the database. A trigger can be created to look for a particular
condition and take an appropriate action when the condition is met. The action is most often a
sequence of SQL statements. However, it can also be a database transaction or an external
program which is executed.
As an example of how this can be used, consider earlier examples where we included both
salary and the faculty id of the chair in the FACULTY table. Suppose we want to check when a
faculty member's salary is more than the salary of the chair. There are several events which can
lead to this condition being satisfied. When a new faculty member is added, the salaries will
need to be checked. They will also need to be checked when the salary of a faculty member is
changed. Finally, salaries will need to be checked when a new chair is appointed to lead the
department.
When this salary condition is met, the desired action is that a notification should be sent. One
possibility would be to inform the chair. However, depending on the university, the chair may
not have knowledge of the faculty salary information. In our example, we will assume the
notification is sent to someone with salary responsibility. For the example, we will assume that
someone in HR should receive the notification, and this person will follow-up with the
appropriate action.
This could be specified as follows (note that the specific syntax will vary from DBMS to DBMS).
CREATE TRIGGER SALARY_CONCERN
AFTER INSERT OR UPDATE OF SALARY, CHAIR ON FACULTY
FOR EACH ROW
WHEN (NEW.SALARY > (SELECT SALARY FROM FACULTY
WHERE FAC_ID = NEW.CHAIR))
INFORM_HR (NEW.CHAIR, NEW.FAC_ID);
The trigger is given a name: SALARY_CONCERN. This name can be used to remove or deactivate
the trigger at a later time. AFTER indicates that the DBMS will first allow the change of the tuple
to be made in the database, and then the condition of the trigger is checked. There is the option
BEFORE which will check the condition before the change is made. When using BEFORE, it is
sometimes the case that the trigger will prevent the update. The next part of the statement specifies
that the trigger will be examined when a faculty tuple is inserted or when either the salary or
chair of a faculty tuple is updated. FOR EACH ROW indicates that the trigger is checked for each
tuple that is modified. The NEW qualifier indicates that the value being checked is the value in
the tuple after the change has been made. So NEW.SALARY will represent the salary value after
the tuple has been inserted or updated. The subquery will select the salary of the chair of the
faculty member. This will cause the updated salary of the faculty member to be compared to
the salary of his/her chair. If the salary of the faculty is greater, the INFORM_HR stored
procedure will be executed. At this point, think of this stored procedure as a program which is
executed to send an email to HR indicating the salary concern. Consider (NEW.CHAIR,
NEW.FAC_ID) to be parameters passed to the stored procedure. These parameters will be
values included in the email.
Typically a trigger is regarded as an ECA (Event, Condition, Action) rule. It has three
components:
1. The event - usually a database update operation applied to the database. In the
example above, the events are inserting a new faculty tuple, changing the salary of a
faculty member, or changing the chair of a faculty member. The person writing the
trigger must make sure all possible related events are covered. It may be necessary
to write more than one trigger to cover all possible events. The events are specified
after the keyword AFTER in the above example.
2. The condition - determines whether the rule action should be executed. Once the
triggering event has occurred, an optional condition may be evaluated. If no
condition is specified, the action will be executed once the event occurs. If a
condition is specified, it is first evaluated and only if it evaluates to TRUE will the
rule action be executed. The condition is specified in the WHEN clause of the trigger.
3. The action to be taken - usually a sequence of SQL statements, but can be a
database transaction or external program that will be automatically executed. In this
example, the action is to execute the INFORM_HR stored procedure.
Triggers can be used in various applications such as maintaining database consistency,
monitoring database updates, and updating derived data automatically.
Specifically for Derby, there are a few differences from the presentation in the text. The
example above shows the event as "AFTER INSERT OR UPDATE ..." Both INSERT and UPDATE are
valid (as is DELETE), but using the "OR" is not allowed in Derby. The same result can be achieved
in Derby, but two CREATE TRIGGER statements would be needed. One would include the
"INSERT" and the other would include the "UPDATE OF ..." The Derby Developer's Guide does
not provide much detail about trigger actions. It states: "A trigger action is a simple SQL
statement." The exact restrictions as to what the "simple statement" can and cannot be are not
listed.
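Written in the same general syntax used above (this is a sketch of the split, not exact Derby syntax, and the trigger names are just illustrative), the two statements would look roughly like this:
CREATE TRIGGER SALARY_CONCERN_INS
AFTER INSERT ON FACULTY
FOR EACH ROW
WHEN (NEW.SALARY > (SELECT SALARY FROM FACULTY
                    WHERE FAC_ID = NEW.CHAIR))
INFORM_HR (NEW.CHAIR, NEW.FAC_ID);

CREATE TRIGGER SALARY_CONCERN_UPD
AFTER UPDATE OF SALARY, CHAIR ON FACULTY
FOR EACH ROW
WHEN (NEW.SALARY > (SELECT SALARY FROM FACULTY
                    WHERE FAC_ID = NEW.CHAIR))
INFORM_HR (NEW.CHAIR, NEW.FAC_ID);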
M 9.3: Views (Virtual Tables) in SQL:
Specification
The Concept of a View in SQL
SQL uses the term view to indicate a single table that is derived from other tables. Note that
this use of the SQL term view is different from the use of the term user view in the early
modules. Here the term view only includes one relation, but a user view might include many
relations. The other tables a view is derived from can be base tables or previously defined
views. A view does not necessarily exist in physical form: it is considered a virtual table. This is
unlike a base table whose tuples are always physically stored in the database. Since a view is
often not stored physically in the database, possible update operations that can be applied to
views are somewhat limited. There are, however, no limitations placed on querying a view.
A view can be thought of as a way of specifying a table that needs to be queried frequently, but
it may not exist physically. For example, consider the question presented in Module 7.2 for the
example database we have been using.
List the names of all courses taken by the student with ID 22458 during the Fall 2017 semester. In
looking at the database, we saw that the query would require joining three tables. Course
names are only available in the COURSE table, courses taken are only available in the
TRANSCRIPT table by looking at the section taken, and to find out which courses the sections
belong to we need to look in the SECTION table.
The query to answer the question was:
SELECT Name
FROM   COURSE, TRANSCRIPT, SECTION
WHERE  TRANSCRIPT.Section_id = SECTION.Section_id AND
       Course_number = Number AND
       Student_id = 22458 AND
       Semester = 'Fall' AND
       Year = 2017;
If we ask similar questions often, a view can be defined which specifies the result of these joins.
Queries can then be specified against the view as single table retrievals rather than queries
which require two joins on three tables. In this case, the COURSE, TRANSCRIPT, and SECTION
tables are called the defining tables of the view.
To define this view in SQL, the CREATE VIEW command is used. To use this command, we
provide a table name for the view (also known as the view name), a list of attribute names, and
a query to specify the contents of the view. In most cases, new attribute names for the view do
not need to be specified since the default is to use the attribute names from the defining
tables. If any of the view attributes are the result of applying functions or arithmetic
operations, then new attribute names will need to be specified for the view.
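As a sketch of that case (assuming the STUDENT table stores each student's advisor id in an ADVISOR attribute, as in the assertion example earlier), a view whose second attribute is computed by an aggregate needs explicit attribute names:
CREATE VIEW ADVISEE_COUNTS (ADVISOR, NUM_ADVISEES)
AS SELECT ADVISOR, COUNT(*)
   FROM STUDENT
   GROUP BY ADVISOR;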
Specifying Views in SQL
As an example, the view discussed above can be created by using:
CREATE VIEW TRANS_EXTEND
AS SELECT Name, Student_id, SECTION.Section_ID, Grade, Semester, Year
   FROM   COURSE, TRANSCRIPT, SECTION
   WHERE  TRANSCRIPT.Section_id = SECTION.Section_id
   AND    Course_number = Number;
This will create a view named TRANS_EXTEND. Note that Section_ID is used only once as an
attribute name in the view. Although Section_ID needs to be qualified in the CREATE VIEW
command, it will appear in the view as an attribute with its unqualified name: Section_ID.
When used in a query against the view, it will not need to be qualified.
To run the above query against the view, the SELECT statement is:
SELECT
FROM
WHERE
Name
TRANS_EXTEND
Student_id = 22458 AND
Semester = "Fall" AND
Year = 2017;
You can see how the view simplifies the above query as compared to the original query which
required the specification of the joins. This is one of the main advantages of views: the
simplification of the specification of some queries. This is especially advantageous for queries
that will be written frequently.
Views are also useful for certain types of security. This will be discussed later in the module.
A view should always be up-to-date. If tuples are modified in one or more of the base tables
used to define the view, the view should automatically reflect the changes. However, the view
does not need to be materialized or "populated" when it is defined, but it must be materialized
when a query is issued against the view. It is the DBMS, not the user, which must maintain the
view as up-to-date. How this can be accomplished by the DBMS is discussed in the next sub-module.
When a view is no longer needed, it can be removed using the DROP VIEW command. For
example, to remove the TRANS_EXTEND view, you can use the command:
DROP VIEW TRANS_EXTEND;
M 9.4: Views: Implementation and
Update
How can a DBMS effectively implement a view for efficient querying? There is not a simple
answer to this question.
Two main approaches have been proposed. One approach modifies or transforms a query
which specifies a view into a query on the underlying base tables. This approach is called query
modification. For example the query from the last sub-module:
SELECT Name
FROM   TRANS_EXTEND
WHERE  Student_id = 22458 AND
       Semester = 'Fall' AND
       Year = 2017;
would automatically be converted by the DBMS into:
SELECT Name
FROM   COURSE, TRANSCRIPT, SECTION
WHERE  TRANSCRIPT.Section_id = SECTION.Section_id AND
       Course_number = Number AND
       Student_id = 22458 AND
       Semester = 'Fall' AND
       Year = 2017;
The issue with this approach is that it is slow for views which are defined using complex queries
that take a relatively long time to execute. It saves time for the user when writing the query,
but has no execution advantage compared to the user writing out the "long" query. This
disadvantage is especially pronounced when many queries are run against the view in a
relatively short span of time.
The second approach is called view materialization. Using this approach, the DBMS creates a
physical copy of the view table when the view is first queried or created. The physical table can
be temporary or permanent and is kept with the assumption that there will soon be other
queries on the view. When this approach is used, an efficient method is needed to
automatically update the view table when the base tables are updated so that the view table
remains current. This is often done using an incremental update strategy. With this strategy,
the DBMS determines what tuples must be inserted, updated, or deleted in a materialized view
table when one of the defining base tables is updated. Using view materialization, the view is
generally kept as a physical copy as long as it is being queried. When the view has not been
queried for a set amount of time, the system can remove the physical view table. It will be
recreated when future queries use the view.
Various strategies for view materialization can be used. A strategy which updates the view as
soon as any of the base tables is changed is known as immediate update. Another strategy is to
update the view only when needed by a query. This is called lazy update. A strategy
called periodic update will update the view periodically, but has the drawback that it allows a
query to be run against the view when it is not completely current.
A query can always be run using a view. However, using a view in an INSERT, DELETE, or
UPDATE command is often not possible. Under some conditions, it is possible to update using a
view if it is based on only a single base table and is not constructed using any of the aggregate
functions.
If an update is attempted on a view which involves joins, it is often the case that the update can
be mapped to the underlying base relations in two or more ways. When this is the case, the
DBMS cannot determine which of the mappings is the intended mapping. All mappings will
provide the desired update to the view, but some of the mappings will create side effects that
will cause problems when querying one of the base tables directly. In general, an update
through a view is possible only when there is just one possible mapping to update the base
relations which will accomplish the desired update on the view. Any time this single mapping
criterion does not hold, the update is not permitted.
As an example, consider the small bookstore database we used in assignments 7 & 8.
If it is often desired to list the names of books bought by a customer, given the customer name,
we can create the following view.
CREATE VIEW SALE_BY_NAME
AS SELECT CUSTNAME, BOOKNAME, DATE, PRICE
   FROM   BOOK, CUSTOMER, SALES
   WHERE  BOOK.BOOKNUM = SALES.BOOKNUM
   AND    CUSTOMER.CUSTNUM = SALES.CUSTNUM;
This view can be used to list the books purchased by Roonil Wazlib as follows:
SELECT *
FROM   SALE_BY_NAME
WHERE  CUSTNAME = 'Roonil Wazlib';
Suppose we examine the list and realize that it shows that Roonil bought the book Half-Blood
Prince. We realize that Roonil did not buy that book, but rather bought the book Deathly
Hallows.
We attempt to correct the error as follows:
UPDATE SALE_BY_NAME
SET    BOOKNAME = 'Deathly Hallows'
WHERE  CUSTNAME = 'Roonil Wazlib'
AND    BOOKNAME = 'Half-Blood Prince';
This change can be made in the base tables in at least two different ways. The first is equivalent
to:
UPDATE SALES
SET    SALES.BOOKNUM = (SELECT BOOK.BOOKNUM
                        FROM BOOK
                        WHERE BOOKNAME = 'Deathly Hallows')
WHERE  SALES.CUSTNUM = (SELECT CUSTOMER.CUSTNUM
                        FROM CUSTOMER
                        WHERE CUSTNAME = 'Roonil Wazlib')
AND    SALES.BOOKNUM = (SELECT BOOK.BOOKNUM
                        FROM BOOK
                        WHERE BOOKNAME = 'Half-Blood Prince');
The second mapping for the change is equivalent to:
UPDATE BOOK
SET    BOOKNAME = 'Deathly Hallows'
WHERE  BOOKNAME = 'Half-Blood Prince';
Both mappings accomplish the goal when working through the SALE_BY_NAME view. When
listing the books through the view, as in the above query, the listing will now correctly show
that Roonil bought Deathly Hallows. The first mapping accomplishes this by changing
BOOKNUM in the tuple in the SALES base table to contain the BOOKNUM for Deathly
Hallows rather than the BOOKNUM for Half-Blood Prince. The second mapping finds the tuple
in the BOOK base table which has BOOKNAME Half-Blood Prince and changes the
BOOKNAME in that tuple to Deathly Hallows.
The first mapping is probably what was intended. The sales record is updated to reflect the
book number for Deathly Hallows rather than the book number for Half-Blood Prince. Making
the update in this manner should have no unintended side-effects. The second mapping does
not change the sales record, but changes the BOOK table to reflect that the book number which
had been assigned to Half-Blood Prince is now assigned to Deathly Hallows. While solving the
problem for the tuple when retrieved through the view, there are two side effects. First, there
are now two book numbers which show the book name Deathly Hallows and no book number
for Half-Blood Prince. Second, when looking through the view, all customers who actually
purchased Half-Blood Prince will be seen as having purchased Deathly Hallows.
Since either mapping can be argued to be "correct," DBMSs will not allow this type of update
through a view. Research is being conducted to determine which of the possible update
mappings is the most likely one. Some feel that DBMSs should use this strategy and allow
updates through a view. In the above case, the choice of the first mapping can be considered
the preferred mapping. Other researchers are looking into allowing the user to choose the
preferred mapping during view definition. Most commercial DBMSs do not currently support
either of these options.
M 9.5: Views as Authorization
Mechanisms
The basics of database security will be presented in a later module. However, one form of
security can be provided through views. Views can be used to hide certain attributes or tuples
from unauthorized users.
As an example of how this can be used, consider the faculty table which has attributes:
FACULTY_ID, NAME, DEPARTMENT, and SALARY. While the salary data must be kept in the
table for various administrative uses, it is not desirable to allow wide access to this data. One
possibility is that we want the chair of a department to have access to salary data, but only for
faculty in the department. To accomplish this, the following view can be created for the SWENG
department.
CREATE VIEW SWENG_FACULTY
AS SELECT *
   FROM   FACULTY
   WHERE  DEPARTMENT = 'SWENG';
Permission to access the view can then be given to the chair of the SWENG department, but not
to other faculty. This allows the chair to view tuples of SWENG faculty, but not view the tuples
in the base table which contains full faculty information for all faculty across the university.
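In SQL this permission is typically given with the GRANT statement; as a sketch (the account name SWENG_CHAIR is just a placeholder for the chair's database account):
GRANT SELECT ON SWENG_FACULTY TO SWENG_CHAIR;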
Consider the case where it is desired to make some faculty information, such as name and
department, widely available. However it is desired to limit additional information contained in
the table. This can be done by creating the following view.
CREATE VIEW GENERAL_FACULTY
AS SELECT NAME, DEPARTMENT
   FROM   FACULTY;
Granting access to the view but not the underlying base table would allow the specified users
to see the name and department for all faculty, but they would not be able to see any
additional faculty information.
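If the view is meant to be available to everyone, for example, access could be granted to the special PUBLIC group rather than to individual accounts:
GRANT SELECT ON GENERAL_FACULTY TO PUBLIC;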
M 9.6: Schema Change Statements in
SQL
This sub-module provides a brief discussion of commands which SQL provides to change a
schema. These commands can be executed while the database is operational, and the database
schema does not need to be recompiled after using the commands. The DBMS will perform
necessary checks to verify that the changes do not impact the database in such a manner that
the database becomes inconsistent.
The DROP Command
The DROP command is used to drop any element within a schema which has a name. It is also
used to drop an entire schema.
In a similar manner to what we saw in earlier sub-modules, the presentation of the DROP
command in the text and the implementation of the DROP command in Derby are somewhat
different. The information presented here first follows the material in the text. At the end of
the sub-module, the Derby implementation is discussed.
Remember that a schema can be thought of as an individual database within the DBMS. If the
full schema is no longer required, it can be removed by using the DROP SCHEMA command.
These keywords are followed by the name of the schema to be dropped. This is then followed
by one of two options, RESTRICT or CASCADE. For example,
DROP SCHEMA UNIVERSITY CASCADE;
will remove the UNIVERSITY schema from the system. When the CASCADE option is used, the
schema is removed along with all the tables, domains, and all other elements of the schema.
When the RESTRICT option is chosen instead of CASCADE, the schema is removed only if it is
empty. If any elements remain in the schema, the schema will not be dropped. All remaining
elements must be dropped individually before the schema can be dropped.
If a table (base relation) is no longer needed within the schema, the table can be removed by
using the DROP TABLE command. These keywords are followed by the name of the table to be
dropped. This is then followed by one of two options, RESTRICT or CASCADE. For example, if in
a company schema it is no longer necessary to keep the locations of the various departments,
the command
DROP TABLE DEPT_LOCATIONS CASCADE;
will remove the DEPT_LOCATIONS table from the schema. When the CASCADE option is used, in
addition to the table itself being dropped, all constraints, views, and any other elements which
reference the table are also dropped from the schema. If the command is successful, the
definition of the table will be removed from the catalog.
When the RESTRICT option is chosen instead of CASCADE, the table is removed only if it is not
referenced in any constraints, views, or other elements. Constraints include being referenced
by a foreign key in a different table, being referenced in an assertion, and being referenced in a
trigger. If any such references exist, they must be removed individually before the table can be
removed.
If the goal is to remove the tuples from the table, but have the table structure remain, the
tuples should be removed using the DELETE command.
The ALTER Command
The ALTER command can be used to change the definition of both base tables and other
schema elements which have a name. The modifications which can be made to a base table
include adding or dropping an attribute, changing the definition of an attribute, and adding or
dropping constraints which apply to the table. For example, suppose it was decided to add a
local contact phone number to the student table in the UNIVERSITY database. That can be done
by using the following command.
ALTER TABLE UNIVERSITY.STUDENT ADD COLUMN PHONE VARCHAR(15);
If this command is successful, each tuple in the table will now have an additional attribute.
There are two choices for the value assigned to the new attribute. The first option is to specify a
DEFAULT clause to assign the same value for the attribute in all tuples. The second option is to
not specify a DEFAULT clause. This will cause the value of the new attribute to be set to NULL.
When using this option, the NOT NULL constraint cannot be added to the attribute at this time.
In either case, the actual value desired must be added to each tuple individually using the
UPDATE command or a similar alternative. In the above command, the value of the new
attribute in each tuple will be set to NULL. Note that the actual data type is probably something
other than VARCHAR(15) which is used here just as an example. The actual data type depends
on exactly how the phone number will be stored and whether a standardized format for the
phone number will be used.
An attribute can be removed from a table by using the DROP COLUMN option. For example, if
at some point it is no longer desired to keep the phone number of students, the attribute can
be removed by:
ALTER TABLE UNIVERSITY.STUDENT DROP COLUMN PHONE CASCADE;
As with other commands, either CASCADE or RESTRICT must be specified. If CASCADE is chosen,
the column is dropped as are any constraints and views which reference the column. If
RESTRICT is chosen, the column is dropped only if it is not referenced by any constraints or
views.
Another use of ALTER TABLE is to add or remove a default value from a column. Examples of
this would be:
ALTER TABLE UNIVERSITY.STUDENT ALTER COLUMN PHONE DROP DEFAULT;
ALTER TABLE UNIVERSITY.STUDENT ALTER COLUMN PHONE SET DEFAULT '000-000-0000';
Specifically for Derby, there are differences from the presentation in the text.
 For DROP SCHEMA, only the RESTRICT option is available. Even though it is the only
option, it must be specified (see the example below).
 For DROP TABLE, neither CASCADE nor RESTRICT can be specified. The behavior is
similar to the CASCADE option shown above.
 ADD COLUMN works as indicated above. There are some additional options which
can be used.
 DROP COLUMN works as indicated above.
 ALTER COLUMN works as indicated above.
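For example, dropping the UNIVERSITY schema in Derby would require (and the schema would first have to be empty):
DROP SCHEMA UNIVERSITY RESTRICT;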
There are more commands for changing a schema than are presented either here or in the text.
In this course, these should be sufficient. Additional details can be found in the SQL standards
and in reference manuals for individual DBMSs.
M 10.1: Overview of Database
Programming
In this sub-module, we will introduce some methods that have been developed for accessing
databases from programs.
 Most database access is through software programs that implement database
applications. These are usually developed using general-purpose programming
languages such as Java and C/C++/C#.
 Also, scripting languages such as PHP, Python, and JavaScript are used to provide
database access from Web applications.
When database access statements are included in a program, the general purpose
programming language is called the host language. The language used by the database (SQL for
us) is called the data sublanguage.
Some specialized database programming languages have been developed for the purpose of
writing database applications. Although many such languages have been developed for
research purposes, only a few, such as Oracle's PL/SQL, have been widely adopted for
commercial applications.
Please note that database programming is a broad topic. There are many database
programming techniques, and each technique is realized differently in each specific DBMS. New
techniques continue to be developed and existing techniques continue to be updated and
modified. In addition, although there are SQL standards, the standards continue to evolve. Also,
each vendor has usually implemented some variations from the standard.
Some institutions provide a complete course which covers database programming. In this
module, we will only be able to present an overview of the topic along with one specific
example. This will show the general steps needed to interact with a database from a
programming language. However, as you work with other programming languages and
databases, the details will be different and you will need to see which languages the specific
database supports and what tools it has available to support each language.
Throughout the course we have used an interactive interface to the database. Using an
interactive interface, commands can be typed in directly or commands can be collected into a
file of commands and the file can be executed through the interactive interface. Most DBMSs
provide such an interface. This interface is often used for database definition and running ad
hoc queries.
However, in most organizations, the majority of database interaction is by programs which have
been thoughtfully designed and carefully tested. Such programs are usually known
as application programs or database applications. Becoming increasingly common are
application programs which implement web interfaces to databases. Despite their growing
popularity, web applications are not covered in this course. If you are interested in this topic
and have limited background with it, I suggest starting by reading Chapter 11 in the text. It
provides an example of using PHP to access a database. This will provide one example and give
you a basic background to pursue additional and more current information about the topic.
Approaches to Database Programming
1. Embed database commands in a general purpose programming language. With this
approach, database statements are embedded into the program. These statements
contain a special prefix identifying them as embedded statements, often SQL
commands. A precompiler or preprocessor is run on the source code. The
precompiler processes each embedded statement and converts it to function calls to
DBMS generated code. This technique is often called embedded SQL. Additional
information can be found from the examples provided in Section 10.2 in the text.
2. Use a library of database functions or classes. Whether using functions or classes,
the library provides functions (methods) for interacting with the database. For
example, there will be function calls to connect to the database, to prepare a query,
to execute a query, to execute an update, to loop over the query results a record at
a time, etc. The actual database commands and any additional information which is
required are provided as parameters to the function calls. This technique provides
an API (Application Programmer Interface) for accessing the database from a given
general purpose programming language. For OOPLs, a class library is used for
database access. As an example, JDBC is a class library for Java which provides
objects for working with a database. Each object has an associated set of methods
for performing the needed operations. Additional information can be found from
the examples provided in Section 10.3 in the text. We will work with JDBC (covered
in Section 10.3.2) in the remainder of the module.
3. Designing a new language. A database programming language is a language
designed specifically to work with the database model and query language.
Examples of such languages are PL/SQL written to work with Oracle databases, and
SQL/PSM provided as part of the SQL standard to specify stored procedures.
Additional information can be found from the examples provided in Section 10.4 in
the text.
The first two approaches are more common since many applications are already written in
general purpose programming languages and require database access. The third approach can
be used with applications designed specifically for database interaction. The first two
approaches must deal with the problem of impedance mismatch, while the third approach can
limit the portability of the application.
Impedance Mismatch
Impedance Mismatch is the name given to problems which may occur when there are
differences between the database model and the model used by the programming language.
One example of a potential problem is that there must be a mapping between data types of
attributes which are permitted by the DBMS and the data types allowed by the programming
language. This mapping specifies the data type to be used in the programming language for
each data type allowed by the attributes. The mapping is likely to be different for each
programming language supported by the DBMS since the data types allowed in different
programming languages are not all identical.
Another problem which must be addressed is that the results of queries are (in general) sets of
tuples. Each tuple is made up of a sequence of attribute values. For many application programs,
it is necessary to be able to access a single value from a single attribute of a single tuple. This
means that there must be a mapping from the query result table to an appropriate data
structure in the programming language. There then must be a way for the programming
language to process individual tuples and to obtain any required values from the tuple and
place the values into variables in the program. Most programming languages provide a cursor
as a way to loop through the result set and process each tuple individually.
Impedance mismatch is much less of a problem when a programming language is developed to
work specifically with a particular DBMS. In this case, the language is designed to use the same
data model and data types that are used by the DBMS.
Typical Sequence in Database Programming
When writing a program to access a database, the following general steps are followed. Note
that in many cases, the application program is not running on the same computer as the
database.
 Establish or open a connection to the database server. To do this, it is usually required to provide the URL of the machine running the server, as well as providing an account name and password for the database.
 Submit various database commands and process the results as required by the application.
 Terminate or close the connection once access to the database is no longer needed.
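As a concrete illustration of these three steps, here is a minimal sketch in Java using JDBC. The connection URL, account name, password, and the table and column names are placeholder assumptions, not values taken from the course examples.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class ConnectionSkeleton {
    public static void main(String[] args) {
        String url = "jdbc:derby://localhost:1527/UNIVERSITY";   // placeholder URL
        // Step 1: establish a connection using the URL, an account name, and a password.
        try (Connection con = DriverManager.getConnection(url, "dbuser", "dbpassword");
             Statement stmt = con.createStatement()) {
            // Step 2: submit commands and process the results as the application requires.
            try (ResultSet rs = stmt.executeQuery("SELECT * FROM STUDENT")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));   // print the first attribute of each tuple
                }
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
        // Step 3: the try-with-resources block closes the statement and the connection
        // automatically once access to the database is no longer needed.
    }
}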
M10.2: Using the JDBC Library
The text covers JDBC in Section 10.3.2, JDBC: SQL Class Library for Java Programming. You
should carefully read that section in the text and study the two program segments presented
there. This presentation will follow the same general flow as that in the text, but the program
segments presented here will work with a database built in NetBeans and will be slightly
different from those in the text.
We will walk through the JDBC programming steps using the small database used in Assignment
7. I have recreated the database and named it MODULE10. To do that, I simply followed the
instructions given at the beginning of Assignment 7, but I changed the name of the schema
from assignment7 to MODULE10. I then used the assgnment7.sql file to create the tables and
load some sample data. This database has the same tables and data as the assignment7
database, but it has a different name so the database here has a "fresh copy" of the data. It
would be a good idea for you to do the same and follow along with your own copy as we work
through the process.
After the database is set up, the next step is to write a JDBC program to access the database.
JDBC is a class library for Java which provides the interface to access a database. Java is
platform independent and widely used, so many RDBMS vendors provide JDBC drivers which
allow Java programs to access their DBMS. A JDBC driver is an implementation of classes,
objects, and function calls which are available in JDBC for a given vendor's RDBMS. If a Java
program includes JDBC objects and function calls, it can access any RDBMS which has a JDBC
driver available. In order to process the JDBC function calls in Java, the program must import
the JDBC class libraries. These libraries can be imported by importing java.sql.*. A JDBC driver
can be loaded explicitly by using the Class.forName( ) command. An example of this is shown
in the text for loading a driver for the Oracle RDBMS. When using NetBeans, as part of the
project which contains your Java code, right click on Libraries. In the drop down, click on Add
Library. That drop down will contain a list of libraries. Click on Java DB Driver. This will
provide the drivers to the project and drivers do not need to be loaded from inside the Java
program in NetBeans.
The drivers are already loaded, so they do not need to be loaded explicitly when accessing a Java DB database from a Java program in NetBeans.
A general program will use the following steps.
1. Import JDBC class library
2. Load the JDBC driver
3. Create appropriate variables
4. The Connection object
5. The Prepared Statement object
6. Setting the statement parameters
7. Binding the statement parameters
8. Executing the SQL statement
9. Processing the ResultSet object
The first example will be simple and will allow some of the steps to be skipped. Specifically, the
query will be "hard coded," so there will be no user input used to create the query. As such,
step 3 is not needed to produce the query. A Statement object will be used in Step 5, and Steps
6 & 7 will not be required. Also, as indicated above, the JDBC driver does not need to be loaded
from an external source, so Step 2 is not required in this example (or the remaining examples).
The following comments apply to Example Program 1 which can be viewed below.
Example Program 1 is available for download as Example Program 1.JPG, or as a Word document if you prefer to preview it.
Following the steps above for this specific example, you can see the following required steps.
1. Import the JDBC class library. This is done on line 7 where java.sql.* is imported.
2. Load the JDBC driver. This is not required for our use. An example of how this might
be used is shown on lines 11-22.
3. Create appropriate variables. Not required for this example.
4. The Connection object. This is shown on line 27. A Connection object is declared
and given the name "con". The getConnection method of DriverManager is called to
create the Connection object. One form of the method call takes three parameters.
The first is the URL of the connection. This is the URL you have been right clicking on
in the services tab to connect to the database. In my case it is
"jdbc:derby://localhost:1527/UNIVERSITY". The second parameter is the name used
to create the database, and the third is the password associated with the account.
5. The Prepared Statement object. As indicated above, we will use the Statement
object in this example. This is shown on lines 29-31. A Statement object is declared
and given the name "stmt". The createStatement method of the Connection object
is called for the "con" instance of the object to create the Statement object. The two
parameters are not actually needed in this example, but they will not hurt anything.
They will be discussed in a future example.
6. Setting the statement parameters. Not required for this example.
7. Binding the statement parameters. Not required for this example.
8. Executing the SQL statement. Executing the SQL statement, a query in this example,
returns a ResultSet object. This is shown in the example on line 35. When using the
executeQuery method, the parameter is a string which contains an SQL SELECT
statement. The SELECT statement in the string is what you would enter when
executing an interactive query as we have done in the past. This method is called
from the Statement object, "stmt", in the example. The returned value populates
the ResultSet object, which is named "rs" in the example. You can think of this as the
table of values returned (and displayed) when you execute the query interactively.
9. Processing the ResultSet object. This is shown on lines 37-41 in the example.
Although the ResultSet object contains the equivalent of a table, using the object
("rs" in the example) is equivalent to using a cursor into the table. When the query is
executed, rs refers to a tuple (which does not actually exist) BEFORE the first
tuple in the result set. The call to rs.next( ) on line 37 moves the cursor to the next tuple in the result set. It will return false if there are no more tuples in the result
set. In the example, the first call to rs.next( ) moves the cursor from "before the first
row" to the first row. The program can refer to the attributes in the current tuple
(at this point, the first tuple) by using various get methods. Line 38 shows
referencing the SNAME attribute in the current tuple by using
rs.getString("SNAME"). Here, SNAME is the name of the student name attribute in
the database, and getString is used since the attribute has the data type of
VARCHAR. Note that using the attribute name is one way to reference the attribute.
An alternative method, demonstrated by the example in the text, is to use the
attribute position in the table. In our table, SNAME is in the second position, so line
38 could have used rs.getString(2). The result of this is to store the value found in
the tuple (the student with the name "John Smith") into the variable "s". Line 39
shows getting the value of the ADVISOR attribute in the tuple, by using
rs.getInt("ADVISOR") since the ADVISOR attribute has data type INTEGER. As with
SNAME, the get could have used the positional value as rs.getInt(5), since ADVISOR
is the fifth column in the table. The assignment statement will store the returned
value (the number of the advisor of the student : 6721) in the variable "n". Line 40
prints the values. When the while loop returns to execute line 37 again, rs.next() will
return a reference to the next tuple in the result set. Since there is only one tuple in
the result set, this call will return false, and the loop will be exited.
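Since the example program itself is distributed as an image, the following is a minimal sketch of the same pattern rather than the actual file. The connection URL, account name, password, the SELECT statement, and the column names SID, SNAME, and ADVISOR are illustrative assumptions, and the line numbers will not match those discussed above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class Example1Sketch {
    public static void main(String[] args) throws SQLException {
        // Step 4: create the Connection object.
        Connection con = DriverManager.getConnection(
                "jdbc:derby://localhost:1527/MODULE10", "dbuser", "dbpassword");

        // Step 5: create a Statement object from the connection.
        Statement stmt = con.createStatement();

        // Step 8: execute a hard-coded query expected to return a single tuple.
        ResultSet rs = stmt.executeQuery(
                "SELECT * FROM STUDENT WHERE SNAME = 'John Smith'");

        // Step 9: process the ResultSet; the cursor starts before the first tuple,
        // and rs.next() returns false when no tuples remain.
        while (rs.next()) {
            String s = rs.getString("SNAME");   // by attribute name; rs.getString(2) would use position
            int n = rs.getInt("ADVISOR");       // getInt because ADVISOR is an INTEGER
            System.out.println(s + " " + n);
        }

        rs.close();
        stmt.close();
        con.close();
    }
}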
The output from the program is available in Example Program 1 Output.docx.
M10.3: Example of Processing a Result
Set Containing Multiple Tuples
The second example will also be fairly simple. Again, the query will be "hard coded," so there
will be no user input used to create the query. This example will produce a result set that
contains multiple tuples to compare it to Example Program 1 where the result set contained
only one tuple. Example Program 2 can be viewed below, and will then be followed by
comments.
Example Program 2 is available for download as Example Program 2.JPG, or as a Word document if you prefer to preview it.
Much of the code is the same as Example 1 and comments will not be repeated here.
Comments to the changes from Example 1 are:
8. Executing the SQL statement. As with Example 1, executing the SQL statement, a
query in this example, returns a ResultSet object. This is shown in the example on
line 26. This SQL statement returns all attributes and all tuples in the STUDENT table.
9. Processing the ResultSet object. This is shown on lines 28-41 in the example. As
with Example 1, the ResultSet object contains the equivalent of a table, and using
the object ("rs" in the example) is equivalent to using a cursor into the table. As in
the last example, when the query is executed, rs refers to a tuple (which does not
actually exist) BEFORE the first tuple in the result set. The call to rs.next( ) on line
28, moves the cursor to the next tuple in the result set. In the example, the first call
to rs.next( ) moves the cursor from "before the first row" to the first row. The
program refers to the attributes in the current tuple (at this point, the first tuple) by
using various get methods as shown in lines 30-34. Like Example 1, this example
shows referencing the attributes by name rather than by position. Lines 36-40 print
the values. When the while loop returns to execute line 28 again, rs.next() will move
the cursor from the first row to the second row. The next call will move the cursor
from the second row to the third row, etc. Finally, when there are no more tuples in the result set, the call will return false, and the loop will be exited.
The output is available in Example Program 2 Output.docx.
M10.4: Examples of Building a Query
Based on User Input
The third example shows how the program can accept user input before running a query. We
will ask the user to input a student id. The student id will be used to run a query to find and
then display the record of the student.
Example Program 3 is available for download.
Much of the program is similar to the previous examples. The new code is discussed here.
Line 8 indicates that the Scanner class will be included. We'll use that to obtain the student id
that the user is going to enter. The Scanner object will be declared on Line 16.
Line 22 includes the two parameters for the createStatement method. By default, a ResultSet
object is not updatable and has a cursor that moves forward only. In order to allow the cursor
to be "moved backward" (see the discussion of line 40 below), the result set can be permitted
to move backwards by indicating TYPE_SCROLL_INSENSITIVE in the connection object as shown.
The result set can be made updatable by inlcuding CONCUR_UPDATABLE as shown. This last
parameter will be further discussed in the next sub-module.
Lines 24-28 are used to build the statement. Remember that student id is stored as a CHAR in
the database, which is actually a String in Java. The SQL statement (also a String) is constructed
based on the user input. Since the SQL statement requires character values be enclosed in
quotes, the quotes must be added to the student id in the statement String. These are
included by the program. Otherwise, the user would need to enter them as part of the id being
input. This would be cumbersome and error prone.
Lines 34-41 deal with a potential issue. Without these lines, if the user enters an id number which is not currently in the STUDENT table, the program just ends with nothing being printed. This is because the while loop starting on Line 43 would never be executed since the first call to rs1.next( ) would immediately return false since no tuple exists in the result
set. With these lines, the first call to rs1.next( ) is checked. If it returns false, there are no tuples
in the result set and an appropriate message is printed before the program ends. The else
statement on Lines 39-41 addresses an issue that arises should there be one or more tuples in
the result set. Since the call to rs1.next( ) on Line 34 advances the cursor from "before the first
tuple" to the first tuple (if one exists), without the else, the call to rs1.next( ) on Line 43 would
advance the cursor from the first tuple to the second tuple. When the while loop is entered, the
first tuple would be skipped and only the second and remaining tuples would be displayed. If
the else statement is executed, there is at least one tuple in the result set, and Line 40 moves
the cursor back to "before the first tuple," allowing the while loop on Lines 43 to 55 to execute
beginning with the first tuple.
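A sketch of this pattern follows. The connection details, the column names, and the exact structure are assumptions and will not match the example file's line numbers.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Scanner;

public class Example3Sketch {
    public static void main(String[] args) throws SQLException {
        Scanner input = new Scanner(System.in);
        Connection con = DriverManager.getConnection(
                "jdbc:derby://localhost:1527/MODULE10", "dbuser", "dbpassword");

        // A scrollable result set is needed so the cursor can be moved back later.
        Statement stmt = con.createStatement(
                ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_UPDATABLE);

        System.out.print("Enter a student id: ");
        String sid = input.nextLine();

        // The id is stored as a CHAR, so the program supplies the surrounding quotes.
        ResultSet rs1 = stmt.executeQuery(
                "SELECT * FROM STUDENT WHERE SID = '" + sid + "'");

        if (!rs1.next()) {
            // The first call returned false: no tuple matches the id.
            System.out.println("No student found with id " + sid);
        } else {
            // At least one tuple exists; move the cursor back to "before the first tuple"
            // so the loop below starts with the first tuple instead of skipping it.
            rs1.beforeFirst();
            while (rs1.next()) {
                System.out.println(rs1.getString("SNAME") + " " + rs1.getInt("ADVISOR"));
            }
        }
        con.close();
    }
}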
The output is available in Example Program 3 Output.docx.
Since building a statement as we did was a bit cumbersome, the following code will
demonstrate an alternative which uses the PreparedStatement class.
Example Program 4 is available for download as Example Program 4.docx.
Much of this code is similar to that of Example 3, and both programs will look similar to the
user. The changes when going from Example 3 to Example 4 are noted here.
Line 24 is used to create a query string with one parameter. The place for the parameter to be
inserted is indicated by the ? character.
Line 25 shows that an object instance named pstmt of type PreparedStatement is created
based on the string in s1.
Lines 27 & 28 have the user provide the student id which is stored in the String variable sid.
Lines 30 & 31 show how a statement parameter is bound to a program variable. Line 30
demonstrates the suggested good practice of clearing all parameters before setting any new
values. Line 31 shows how a set function is used to bind a string parameter to a variable. There
are different set functions that match different parameter data types. Examples are the
setInt and setDouble functions as well as the setString function shown here. The
parameters to the set function are the position of the statement parameter in the statement and the variable whose value is bound to the parameter. In the code, the first (in this case
the only) parameter in pstmt is bound to the variable sid. Note that if there are n parameters in
a statement, n different set statements are required, one for each parameter.
Note that in Line 25, rather than being declared CONCUR_UPDATABLE, the Result Set was set as
CONCUR_READ_ONLY. This can be specified when no updates are being performed using the
statement. The same parameter value could also have been used in Example 3. Only simple
uses of settings which can be applied to Result Set objects are shown here. For additional
information about these parameters please consult the documentation which is provided for
the particular DBMS you are using.
The remaining code has the same function as that in Example 3.
Running this program will produce the same output as Example 3, so the output will not be
repeated here.
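A sketch of the PreparedStatement version follows; again, the connection details and column names are assumptions. One practical benefit is that, because the parameter is bound with setString, the program no longer needs to add quotes around the student id itself.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Scanner;

public class Example4Sketch {
    public static void main(String[] args) throws SQLException {
        Scanner input = new Scanner(System.in);
        Connection con = DriverManager.getConnection(
                "jdbc:derby://localhost:1527/MODULE10", "dbuser", "dbpassword");

        // The ? marks the position of the statement parameter.
        String s1 = "SELECT * FROM STUDENT WHERE SID = ?";
        PreparedStatement pstmt = con.prepareStatement(
                s1, ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_READ_ONLY);

        System.out.print("Enter a student id: ");
        String sid = input.nextLine();

        pstmt.clearParameters();   // good practice: clear old values before binding new ones
        pstmt.setString(1, sid);   // bind the first (and only) parameter to the value of sid

        ResultSet rs = pstmt.executeQuery();
        while (rs.next()) {
            System.out.println(rs.getString("SNAME") + " " + rs.getInt("ADVISOR"));
        }
        con.close();
    }
}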
M 10.5: Examples of Using a Program
to Update the Database
The fifth example shows how the database can be updated using a program. For simplicity, the
command will be hard-coded. Making a modification using Examples 3 & 4 as a guide would
allow the program to accept user input.
Example Program 5 is available for download.
Parts of the program are similar to the previous examples. The new code is discussed here.
Lines 18-24 retrieve a record from the database to show a "before" copy of the record. This is
similar to previous examples, but it does show retrieving the attribute by position rather than
by name. String s is assigned the value of the second attribute (SNAME), String t is assigned
the value of the fourth attribute (MAJOR), and int n is assigned the value of the fifth attribute
(ADVISOR). These values are then displayed by Line 23.
Line 26 sets the value of String variable sql to a string which contains the code to change the
major to "SWING" in all selected tuples.
Line 28 is where the update is actually executed. Note that unlike Line 18 and the SQL
statements executed in the previous examples, the function called is "executeUpdate" rather
than "executeQuery". This function returns the number of records (tuples) in the database
which were updated. In this case only one tuple was selected by the WHERE clause in the
update statement. This value is printed on Line 29.
Lines 32-38 retrieve the record from the database to show an "after" copy of the record. This is
identical to the code on lines 18-24. This shows that the value in the MAJOR attribute was
actually changed to "SWING".
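A sketch of the update pattern follows; the UPDATE statement, the WHERE clause, and the connection details are illustrative assumptions rather than the actual example file.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class Example5Sketch {
    public static void main(String[] args) throws SQLException {
        Connection con = DriverManager.getConnection(
                "jdbc:derby://localhost:1527/MODULE10", "dbuser", "dbpassword");
        Statement stmt = con.createStatement();

        // The update is hard coded; the WHERE clause selects a single student.
        String sql = "UPDATE STUDENT SET MAJOR = 'SWING' WHERE SNAME = 'John Smith'";

        // executeUpdate (not executeQuery) is used for statements that change the database;
        // it returns the number of tuples that were updated.
        int count = stmt.executeUpdate(sql);
        System.out.println(count + " record(s) updated");

        con.close();
    }
}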
The output from a run of the program is available in Example Program 5 Output.
The sixth example program provides an additional look at updating tuples, in this case updating
several tuples rather than just one.
Example Program 6 is available for download.
Although the intent of this program is similar to the program in Example 5, there are some
differences which are listed here.
Line 19 shows the initial query which retrieves all tuples from the STUDENT table.
Lines 21-34 display all attributes in the tuples. This is to get a more complete look at the
"before" status of the tuples.
Lines 36-39 build and execute the UPDATE string. The return value from the executeUpdate is
printed by line 39.
Lines 42-56 retrieve the tuples in the table after the update and display the "after" values in the
tuples.
The output from a run of the program is available in Example Program 6 Output.docx.
M 11.1: Informal Relation Schema
Design Guidelines
This sub-module discusses four informal guidelines that can be followed when designing
a relation schema. Note that these guidelines are for a relation schema, not
a relational schema. A relation schema references a single relation or table. A relational schema
references the entire database.
Clear Semantics for Attributes in Relations
We have discussed in earlier modules that a database is a model of a part of the real world. As
such, relations and their attributes have a real-world meaning. The semantics of a relation
indicate how the real-world meaning can be interpreted from the values of attributes in a tuple.
Following the design steps from earlier modules, the overall relational schema design should
have a clear meaning. The easier it is to explain the semantics of a relation, the better the
relation schema design will be.
This guideline states that a relation should be designed so that it is easy to explain its meaning.
If a relation schema represents one entity type or one relationship type its meaning is usually
clear. Including multiple entities and/or relationships in a single database relation tends to
cloud the meaning of the relation.
Avoid Redundant Information and Update Anomalies
We have discussed the goal of eliminating, or at least minimizing, redundant information in a
database. In addition to using extra storage space, redundant information can lead to a
problem known as update anomalies. Consider the FACULTY and DEPARTMENT relations we
have used in earlier modules. Information specific to the faculty members was kept in the
FACULTY relation, and information about the department was kept in the DEPARTMENT
relation. However, there is a relationship between the two, namely that each faculty belongs to
a department. This relationship was expressed by storing the primary key of DEPARTMENT as a
foreign key in FACULTY. This allows the relationship to be retained, but no DEPARTMENT
information was duplicated in the FACULTY relation.
Consider an alternative where the two tables are combined into one as a base table. To do this,
all department information such as name and building would be stored in the combined tuple
(which would, in effect, be a table resulting from a natural join of the two tables). If this were
done, the department name and the building it resides in would be repeated in several tuples,
once for each faculty member in the department. This can lead to update anomalies,
specifically insertion anomalies, deletion anomalies, and modification anomalies. An example of
how this would look is shown in the table below, which contains only a few tuples for simplicity.
FACULTY
Faculty ID   Faculty Name          Department Code   Department Name
2469         Huei Chuang           CMPSC             Com
5368         Christopher Johnson   MATH              Mat
6721         Sanjay Gupta          SWENG             Soft
7497         Martha Simons         SWENG             Soft
Insertion anomalies can be caused when inserting new records into the combined relation. This
can happen in two cases. To add a new faculty member, the department information must also
be included. If the faculty member being added is in the SWENG department, the department
name and building that is included must be identical to the information included in the tuples
for other faculty in the SWENG department. If there is not a match, there is now conflicting
information about the SWENG department stored in the database. So, if Sally Brown is added as
a new faculty member in the SWENG department, but in the tuple the building is incorrectly
entered as Nittany Hall, the table would show the SWENG department residing in two buildings:
Nittany Hall and Hopper Center. This is incorrect and is an example of an insertion anomaly.
Also, assuming that faculty id is the primary key of the combined relation, a new department
cannot be added if the department has no faculty currently assigned. To do so, NULL values
would need to be entered into the faculty-specific attributes of the tuple, which violates the
constraint that the primary key value cannot be NULL. In the table above, if a new Cyber
Engineering department (CYENG) is created, information about the department cannot be
added until a faculty member is assigned to the department. Neither of these problems exist in
the design which keeps the base tables separate.
Deletion anomalies can be caused when deleting records from the combined relation. If the
last faculty member of a department is deleted from the database, all information about the
department will be removed from the database. This may not be the intent of the deletion. In
the table, if Christopher Johnson leaves the university and is removed from the table, all
information about the Math department is also removed. Again, this problem does not exist in
the original design.
Modification anomalies can be caused when values in a tuple in the combined relation are
modified. For example, if a department is moved to a new building, this value must be changed
in all tuples for faculty in the department. If this is not done, the database will be in an
inconsistent state: some tuples will have the new building location and some will have the old
building location. In the table, if SWENG is moved to Nittany Hall and the change is made in
Sanjay Gupta's tuple, but not in Martha Simons' tuple, the table would show the SWENG
department residing in two buildings: Nittany Hall and Hopper Center. This is incorrect and is an
example of a modification anomaly. As with the other two examples, this problem does not
exist in the original design.
This guideline states that base relation schemas should be designed which do not allow update
anomalies to occur.
NULL Values in Tuples
If attributes are included in a relation which will contain NULL values in a large number of the
tuples, there are at least two issues which can arise. First, there will be a large amount of
wasted storage space since the value NULL will need to be stored in many tuples. Also, we saw
in earlier modules that the value of NULL can have different meanings (does not apply,
unknown, known but not yet available for the database). This can lead to unclear semantics for
the attribute.
This guideline states that the designer should avoid putting an attribute in a base relation if the
value of the attribute may frequently be NULL. If NULL values cannot be avoided, make sure
they apply to relatively few tuples in the relation.
It is often possible to create a separate relation whose attributes are the primary key of the
original relation and the attribute from the original relation which contains many null values.
Only tuples which do not have null values would need to be stored in the new relation. This
would lead to a relation with many fewer tuples, none of which would need to contain a NULL
value.
As an example, suppose some students serve as student assistants and some of the assistants
are assigned to offices. We want to keep track of the office assignments. If only 2% of the
students are student assistants who have offices, including a Stu_office_number attribute in
the STUDENT relation would add an attribute which has a NULL value in 98% of the student
tuples. Instead, a STU_OFFICES relation can be created with attributes Student_id and
Stu_office_number. This relation will include tuples for only the students who have been
assigned an office.
Generation of Spurious Tuples
When taking a table which included redundant information and splitting the table to remove
the redundancy, it is possible to split the table in a way that leads to a poor design. The design
will be poor if the tables are such that a natural join of the two tables produces more tuples
than would be contained in the original combined tables. The extra tuples not in the original
combined tables are called spurious tuples, because such tuples provide spurious (false)
information.
This guideline states that the designer should design relation schemas so that they can be
joined using (primary key, foreign key) pairs that are related in a manner that will not cause
spurious tuples. Do not design relations where (primary key, foreign key) attributes do not
appropriately match, since joining on these attributes may yield spurious tuples.
M 11.2: Functional Dependencies
Normalization is the most well known formal tool for the analysis of relational schemas. The
main concept on which much of normalization is based is that of functional dependency. The
concept of functional dependency is defined in this sub-module. The discussion presented here
is based on Section 14.2 in the text and can seem overly "mathematical" and somewhat
abstract, so a simple example is presented along with the formal definitions to illustrate the
concepts.
We can start by thinking of listing all the attributes we want to capture in the database.
Without regard for entities at this time, give each attribute a unique name. Consider defining
the database as one large relation which contains all the attributes. This relation schema can be
called a universal relation schema.
As an example which will be carried through the normalization process, we will consider a
subset of the database we have been using for most of the course. Consider the university
database, but to keep the example simpler, we will focus on some of the attributes required to
produce a transcript. Specifically, consider student id, student name, class rank, section id,
course number, course name, course credit hours, and grade.
Student_id   Section_id   Student_name   Class_rank   Course_number   Course_name   Credit_hours   Grade
Formally, if we call the universal relation schema R, a functional dependency can be defined as
follows. Consider two subsets of R which are denoted as X and Y. Y is functionally dependent on
X, denoted by X -> Y, if there is a constraint on a relation state r of R that for any two tuples in r
(which we will call t1 and t2) that if t1[X] = t2[X], then it must also be true that t1[Y] = t2[Y].
This means that the values of the attributes which make up the Y subset of the tuple depend on
(are determined by) the values which make up the X subset of the tuple. Looking at this in the
other direction, it can be said that the values which make up the X subset of the tuple uniquely
determine (or functionally determine) the values which make up the Y subset of the tuple.
Other words used to describe this are that there is a functional dependency from X to Y, or that
Y is functionally dependent on X. Functional dependency can be abbreviated FD. The set of
attributes making up X is called the left-hand side of the FD, and set Y is called the right-hand
side of the FD.
Whether or not a functional dependency holds is partly determined by assumptions about data.
For example, if the assumption is made that Student_id is unique, the functional dependency
Student_id -> Student_name will apply. This says that Student_name (the Y) is functionally
dependent on Student_id (the X). If two tuples (the t1 and t2 above) have the same Student_id,
they must also have the same Student_name. However, it would not be true that Student_id ->
Grade. Different grades would be associated with any individual student, so given the ID, you
could not uniquely determine the grade.
Since both X and Y are subsets, either or both can consist of multiple attributes. For example,
Student_id -> {Student_name, Class_rank}. Given a student id, both student name and class
rank are determined for the tuple. Similarly, {Student_id, Section_id} -> Grade. If both the
student id and section id are provided, the combination will determine the grade the student
earned for that section.
Note that if there is a further constraint on X that requires that only one tuple can exist with a
given X-value (which implies that X is a candidate key for the relation, as discussed in the next
sub-module), this implies X -> Y for any subset Y of R. If X is a candidate key for R, then X -> R.
This concept will be expanded in the next sub-module.
Also note that X -> Y in R, does not indicate whether or not Y -> X in R. As we showed above,
Student_id -> Student_name. Is it true that Student_name -> Student_id? Maybe. It depends on
the assumption about student name. Is it unique? If so, is it unique only in the current set of tuples (the relation state) or will it hold for any potential set of tuples? If the assumption is made that it is possible to have duplicate student names, then it is not true that Student_name -> Student_id. However, if a scheme is devised to guarantee that no two student names can ever
be identical, then it is true.
Functional dependency is not a property of the data, but rather it is a property of
the semantics or meaning of the attributes. Functional dependencies, therefore, are specified
by the database designer based on the designer's knowledge of dependencies which must hold
for all relation states r of R. Thus, functional dependencies represent constraints on the
attributes of a relation which must hold at all times.
A functional dependency is a property of the relation schema, not of a particular relation state.
Because of this, the functional dependency cannot be inferred from a given relation state, but
must be defined by a designer who knows the semantics of the attributes of R. When looking at
a relation state only, it is not possible to determine which FDs hold and which do not. Looking
at the data in a particular relation state, all that can be said is that an FD may exist between
certain attributes. However, it is possible to determine that a particular FD does not hold if the
data in even one tuple would violate the FD if it were to hold.
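To make the last point concrete, here is a small sketch (not part of the course materials; the sample tuples are made up) that scans a relation state for violations of an FD X -> Y. Finding no violating pair only shows that the FD may hold for that state; finding even one violating pair shows that the FD does not hold.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class FdCheck {
    // Returns true if no pair of tuples in the state violates X -> Y.
    static boolean noViolation(List<Map<String, Object>> state,
                               List<String> x, List<String> y) {
        for (Map<String, Object> t1 : state) {
            for (Map<String, Object> t2 : state) {
                // A violation: t1 and t2 agree on X but differ on Y.
                if (project(t1, x).equals(project(t2, x))
                        && !project(t1, y).equals(project(t2, y))) {
                    return false;
                }
            }
        }
        return true;
    }

    // Collect the values of the given attributes from one tuple.
    static List<Object> project(Map<String, Object> tuple, List<String> attrs) {
        List<Object> values = new ArrayList<>();
        for (String a : attrs) {
            values.add(tuple.get(a));
        }
        return values;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> state = List.of(
                Map.of("Student_id", 1, "Student_name", "Ann", "Grade", "A"),
                Map.of("Student_id", 1, "Student_name", "Ann", "Grade", "B"));
        // Student_id -> Student_name is not violated by this state (it may hold).
        System.out.println(noViolation(state, List.of("Student_id"), List.of("Student_name")));
        // Student_id -> Grade is violated, so it certainly does not hold.
        System.out.println(noViolation(state, List.of("Student_id"), List.of("Grade")));
    }
}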
M 11.3 The Normalization Process
The normalization process usually starts with a set of relations which have already been
proposed or are already in use. This set of relations can be developed by first creating an ER
diagram and then converting the ER diagram into a set of relations to form a database schema.
The set of relations can also come from an existing implementation of some type, whether it is
an existing database, a file processing system, or a hard-copy process using paper forms. We
generated a set of relations in earlier modules using an ER diagram and an algorithm to produce
the database schema from the ER diagram.
The normalization process was first proposed by E.F. Codd in 1972 in a paper which defined the
first three normal forms, which Codd called first, second, and third normal forms. Codd is
considered the father of the relational database, since much of the relational database theory is
based on a paper he wrote in 1970 while working for IBM. Several years later a stronger version
of third normal form was proposed, and this version is now known as Boyce-Codd Normal form
(BCNF). In the late 1970s two additional normal forms were proposed and are called fourth and
fifth normal form. We will cover the first three normal forms. We will not cover BCNF or the fourth and fifth normal forms in detail.
In addition to the concept of functional dependency, the normalization process relies on the
concept of keys. Keys were discussed in Module 5.1.3. The main concepts are repeated here for
convenience.
We indicated above that in the definition of a relation, no two tuples may have the exact same
values for all their elements. It is often the case that many subsets of attributes will also have
the property that no two tuples in the relation will have the exact same values for the subset of
attributes. Any such set of attributes is called a superkey of the relation. A superkey specifies a
uniqueness constraint in that no two tuples in the relation will have the exact same values for
the superkey. Note that at least one superkey must exist; it is the subset which consists of all
attributes. This subset must be a superkey by the definition of a relation. Consider the set of
attributes from the last sub-module.
Student_id   Section_id   Student_name   Class_rank   Course_number   Course_name   Credit_hours   Grade
As indicated above, the set of all eight attributes will always be a superkey. Similarly, the subset
of attributes {Student_id, Section_id, Student_name, Class_rank} is also a superkey since no
two tuples will have the exact same values for this subset of attributes. However, the subset of attributes {Student_id, Student_name, Grade} is not a superkey since it is possible (even likely)
that a given student will have earned an A in several courses over time, so there will be several
tuples with the same values for these three attributes.
A superkey may have redundant attributes - attributes which are not needed to ensure the
uniqueness of the tuple. An example of this can be seen from the above. All eight attributes are
a superkey, but a subset of just four of the attributes is also a superkey. This means that at least
four of the eight attributes were redundant.
A key of a relation is a superkey where the removal of any attribute from the set will result in a
set of attributes which is no longer a superkey. More specifically, a key will have the properties:
1. Two different tuples cannot have identical values for all attributes in the key. This is the uniqueness property.
2. A key must be a minimal superkey. That means that it must be a superkey which cannot have any attribute removed and still have the uniqueness property hold.
Note that this implies that a superkey may or may not be a key, but a key must be a superkey.
Also note that a key must satisfy the property that it is guaranteed to be unique as new tuples
are added to the relation.
In the above example, the superkey consisting of the subset of attributes {Student_id,
Section_id, Student_name, Class_rank} is still not a key. If Class_rank is removed, the remaining
three attributes still comprise a superkey. Further, if Student_name is also removed, the
remaining two attributes still comprise a superkey. However, the removal of either attribute
from the subset {Student_id, Section_id} will no longer guarantee the uniqueness property, so
the superkey consisting of the subset {Student_id, Section_id} is also a key.
In many cases, a relation schema may have more than one key. Each of the keys is called
a candidate key. One of the candidate keys is then selected to be the primary key of the
relation. The primary key will be underlined in the relation schema. If there are multiple
candidate keys, the choice of the primary key is somewhat arbitrary. Normally, a candidate key
with a single attribute, or only a small number of attributes is chosen to be the primary key. The
other candidate keys are often called either unique keys or alternate keys.
Again consider the above set of attributes. Also, add the assumption that Student_name is
unique (a dubious assumption as we have said, but we'll make the assumption to illustrate the
point). We saw above that the subset {Student_id, Section_id} is a key. Assuming the
uniqueness of the Student_name attribute, the subset {Student_name, Section_id} is also a key.
Both of these are therefore candidate keys. We choose one, let's say {Student_id, Section_id}
to be the primary key. Once this choice is made, {Student_name, Section_id} becomes
an alternate key.
In addition to the definition of keys, one additional definition is used in the normalization
process.
An attribute of a relation R is called a prime attribute of R if it is a member of at least one
candidate key of R. Any attribute which is not a prime attribute of R is called a nonprime
attribute of R. This means that any nonprime attribute of R is not a member of any candidate
key of R.
The normalization process is discussed in the following sub-modules.
M 11.4: First Normal Form
First normal form (1NF) requires that the domain of an attribute may contain only atomic
values. Atomic values are those which are simple and indivisible. Further, the value in any
attribute in a tuple must be a single value from the domain of the attribute. This requires that
an attribute value cannot be an array of atomic values, a set of atomic values, etc.
Note that in this course, in our earlier modules we forced attribute values to be atomic, so what
was developed was always in 1NF. As shown by some of the material in the text, this is not the
case for all methodologies.
To provide an example of the normalization process, we will continue with the subset of the
database we have been using for this module. Consider the university database with attributes
student id, student name, class rank, section id, course number, course name, course credit
hours, and grade. If we place all the attributes in one table, we might choose student id as the
primary key. If we do that, we see that in order to keep tuples unique, all the attributes except
student id, student name, and class rank would need to be multivalue attributes. Each student
will have taken many sections of courses. Each would need to be listed in the attribute. This
would also be true for the remaining attributes. In order to remove the need for multivalue
attributes, a primary key for this table will need to be determined so that each attribute has an atomic value in each tuple.
Examining the table and the attributes, we can list the functional dependencies as follows.
 Student_id -> Student_name
 Student_id -> Class_rank
 Section_id -> Course_number
 Section_id -> Course_name
 Section_id -> Credit_hours
 {Student_id, Section_id} -> Grade
To remove the need for multivalue attributes, section id should be added as part of the primary
key. This leads to each tuple having attributes that have atomic values. This would yield a table
in first normal form that would look like the following:
Student_id   Section_id   Student_name   Class_rank   Course_number   Course_name   Credit_hours   Grade
This "large" table meets the qualifications for a relation. Since each student id and each section
id combination has a unique value, each tuple in this table would be unique.
Hopefully, based on what we have covered in this course, you have the feeling that building a
database based on this large table would not be ideal. This feeling would be correct, so the
normalization process continues to second normal form.
M 11.5: Second Normal Form
In sub-module 11.2 we discussed functional dependency. Second normal form (2NF) requires
an extension of that concept. 2NF is based on the concept of full functional dependency. If X ->
Y represents a functional dependency, X -> Y is a full functional dependency if removing any
attribute A from X results in the functional dependency no longer holding. A functional
dependency X -> Y is a partial dependency if at least one attribute A can be removed from X
and the functional dependency X -> Y still holds.
For example, from our table in the previous sub-module, {Student_id, Section_id} -> Grade is a
full dependency since Student_id -> Grade does not hold, nor does Section_id -> Grade hold.
Thus neither attribute can be removed and have the dependency still hold. However,
{Student_id, Section_id} -> Student_name is only a partial dependency since Section_id can be
removed and Student_id -> Student_name is still a functional dependency.
A relation schema R is defined to be in 2NF if every nonprime attribute of R is fully functionally
dependent on the primary key of R. The normal forms are considered sequential, so testing a
relation for 2NF requires that the relation already be in 1NF.
Consider the table from the last sub-module which is in 1NF. Since the combination
{Student_id, Section_id} is the primary key of this relation, all non-key attributes are
functionally dependent on the primary key. The question is which are fully functionally
dependent.
STUDENT_GRADE
Student_id   Section_id   Student_name   Class_rank   Course_number   Course_name   Credit_hours   Grade
Functional dependencies:
 Student_id -> Student_name
 Student_id -> Class_rank
 Section_id -> Course_number
 Section_id -> Course_name
 Section_id -> Credit_hours
 {Student_id, Section_id} -> Grade
From this, only Grade is fully functionally dependent on the primary key. Second normal form
requires that the 1NF table be decomposed into tables where each nonprime attribute is fully
functionally dependent on its key. Note that if a primary key consists of a single attribute, that
relation must be in 2NF. Based on the above, the table would be decomposed into three tables.
One table requires the combined attributes for the primary key, one table requires only the
Student_id for the primary key, the third table requires only the Section_id for the primary key.
The tables would be as follows.
STUDENT
Student_id   Student_name   Class_rank

SECTION
Section_id   Course_number   Course_name   Credit_hours

GRADES
Student_id   Section_id   Grade
These tables represent the current state of the database. Each table is in 2NF. Since STUDENT
and SECTION now have a single attribute as primary key, the non-key attributes in these tables
must be fully functionally dependent on the primary key. In the GRADES table, as discussed
above, the Grade attribute is fully functionally dependent on the primary key since Grade is not
functionally dependent on either of the attributes which make up the primary key.
At this point, the STUDENT and GRADES tables look OK, but there is still something not quite
right about the SECTION table. This issue will be addressed as we move to third normal form.
M 11.6: Third Normal Form
Third normal form (3NF) is based on a concept known as transitive dependency. Informally, this
means that non-key attributes are not allowed to define other non-key attributes. Another way
to state this is that no non-key attribute can be functionally dependent on another non-key
attribute.
More formally, a functional dependency X -> Y in relation R is a transitive dependency if there
exists a set of attributes Z in R such that Z is not a candidate key nor is Z a subset of any key of
R, and both X -> Z and Z -> Y are true.
According to Codd's definition, a relation is in 3NF if it is in 2NF and no nonprime attribute of R
is transitively dependent on the primary key. This means that transitive dependencies must be
removed from tables to put them in 3NF.
Consider again the three tables from the last sub-module, all of which are in 2NF.
STUDENT
Student_id   Student_name   Class_rank

SECTION
Section_id   Course_number   Course_name   Credit_hours

GRADES
Student_id   Section_id   Grade
In earlier modules we indicated that the name of a student may not be unique. This means that
Student_name -> Class_rank does not hold. Also, since there are many students in each rank
(freshman, sophomore, etc.), Class_rank -> Student_name does not hold. Since neither non-key
attribute is dependent on the other, the STUDENT table is already in 3NF. Similarly, since the
GRADES table has only one non-key attribute, there are no transitive dependencies in this table
so it is also already in 3NF.
However, informally, we see a redundancy problem with the SECTION table. Since there are
many sections of the same course, Course_number, Course_name, and Credit_hours will be
duplicated in many tuples. We see that Course_number -> Course_name holds and
Course_number -> Credit_hours also holds. This means that the SECTION table contains
transitive dependencies since non-key attributes are functionally dependent on another non-key attribute. The SECTION table is in 2NF, but it is not in 3NF. This is resolved by creating
another table which contains the non-key attribute that functionally determines other non-key
attributes. The determining non-key attribute will be the primary key of the new table. All of
the impacted attributes will be removed from the original table and placed in the new table.
The exception is that the attribute which is the primary key in the new table will remain as a
foreign key in the old table.
Following this description, the SECTION table is split into two tables as follows.
SECTION
Section_id   Course_number

COURSE
Course_number   Course_name   Credit_hours
Since the new SECTION table has only one non-key attribute, it is now in 3NF. An interesting
question arises when looking at the COURSE table. It is clear that Credit_hours -> Course_name
does not hold since many, many courses are three credit hours. What about Course_name ->
Credit_hours? If the assumption is made that Course_name is not unique, then it does not
determine Credit_hours and the table is in 3NF. If the assumption is made that Course_name is
unique, then Course_name does determine Credit_hours. However, if Course_name is unique,
it is a candidate key which was not selected as the primary key. Looking carefully at the
definition above, a candidate key may determine a non-key attribute without causing a transitive
dependency, so the table shown above is in 3NF.
M 11.7: Further Normalization
As mentioned in sub-module 11.3, there are normal forms beyond third normal form. They will
not be covered in this course, but they will be presented in brief form here.
Boyce-Codd normal form was originally proposed as a simpler statement of 3NF. However, it
was discovered that it is not equivalent to 3NF, but is actually stricter. It turns out that every
relation in BCNF is in 3NF, but the converse is not necessarily true. While most relations in 3NF
are also in BCNF, there are some relations which are in 3NF but are not in BCNF. Because of
this, some refer to BCNF as 3.5NF, but most use the designation BCNF. Presenting an example
of BCNF is a bit more complex than providing examples of the earlier normal forms, and the
university database does not provide an example of a table which is in 3NF, but not in BCNF, so
an example will not be provided. Further information, including an example, can be found in
Section 14.5 in the text.
Fourth normal form (4NF) was defined to address an issue known as multivalued dependency,
and fifth normal form (5NF) was defined to address an issue known as join dependency. These
types of dependency occur very infrequently, and in practice, most database designs stop at
either 3NF or BCNF. Additional information is provided in Sections 14.6 and 14.7 in the text.
Regardless of how far the normalization process is taken, it is sometimes desirable to
denormalize part of a database to improve performance. Denormalization is the term used to
describe the process of taking two base tables and storing the join of the two tables as a base
relation. The joined table will be in a lower normal form than either of the two tables
participating in the join. If the joined table is used quite often in queries, storing the joined
table eliminates the need to perform the join for each query. This improves performance when
querying the table. Denormalization does lead, however, to the issues inherent in a table which
has a lower normal form. The tradeoff needs to be evaluated in each situation to determine
whether or not denormalization is appropriate for the situation.
M 12.2: Introduction to Object-Oriented Concepts and Features
The term object-oriented was first applied to programming languages, often referred to as
OOPLs. These date back as far as the late 1960s. One of the early OOPLs was Smalltalk,
developed in the 1970s. Smalltalk was designed to be an object-oriented language, so is known
as a "pure" OO programming language. This is different from a "hybrid" OOPL which takes an
already existing language and extends it with OO features. An example of this is that the C
programming language was extended in the 1980s to incorporate OO features. The extended
language is known as C++. Java was developed later and is considered by some to be a pure
OOPL, but many do not consider it to be a pure OOPL. Regardless of which side of the argument
someone takes, all would agree that it is closer to a pure OOPL than C++ is. The debate as to
whether or not Java is a pure OOPL is largely irrelevant for the purposes of this course and will
not be pursued here.
Note that much of what is covered in parts of this module you will likely have seen in the
context of working with an object-oriented programming language. For most of you, that is
probably Java or C++, possibly both. Various object-oriented concepts will be presented briefly
with the assumption that the basic concepts will be a review for you. How these concepts are
incorporated into ODBs will be the main focus of the module.
An object consists of two components: its current value or its state and its behavior or
its operations. Objects can, among other things, have a complex data structure and specific
operations defined by the programmer. In an OOPL, objects exist only for the life of the
program. Such objects are called transient objects. An OODB can save objects permanently in a
database. Such objects are called persistent objects. These objects exist after a program
terminates and can be retrieved later by another program. An OODB stores objects
permanently in secondary storage, which provides the ability to share the objects among
multiple programs and applications. This requires features common to other types of DBMSs
such as indexing to efficiently locate the objects, concurrency control to allow sharing, and
recovery from failures. An OODB will usually interface with at least one OOPL to provide
persistent and shared object capabilities.
In OOPLs, objects have instance variables, which hold internal state of the object. Instance
variables are somewhat similar to attributes in RDBMSs. Unlike attributes, however, instance
variables may be encapsulated and are not always made visible to end users. Instance variables
may be of arbitrarily complex data types. ODBs permit definition of operations (functions or
methods) that can be applied to objects of a particular type. Some OO models insist that the
operations be predefined. This restriction forces the complete encapsulation of objects. This
restriction has been relaxed in most OO data models. One reason is because users often need
to know attribute names to specify selection conditions to retrieve the desired objects. Another
reason is that complete encapsulation requires that even simple retrieval operations must have
a predefined operation. This makes it difficult to write ad hoc queries.
To promote the use of encapsulation, operations are defined in two parts. The first part is
the signature or interface of the operation. The signature is the name of the operation and its
parameters. The second part is the method or body of the operation. This provides
the implementation of the operation, often written in a general purpose programming
language. An operation is invoked by passing a message to an object. The message includes the
operation name and its parameters. The object will then execute the method for the operation.
Encapsulation enables the internal structure of an object to be modified without the need to
change the external program that calls the operation.
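As a brief reminder of how these ideas look in an OOPL, here is a generic Java illustration (not an example from the text): the instance variable is encapsulated, each operation has a signature and a method body, and callers interact with the object only by passing messages.

public class Account {
    // Internal state (instance variable), hidden behind the operations.
    private double balance;

    // The signature (interface) is the operation name and its parameters;
    // the body below is the method, i.e., the implementation of the operation.
    public void deposit(double amount) {
        balance = balance + amount;
    }

    public double getBalance() {
        return balance;
    }

    public static void main(String[] args) {
        Account a = new Account();
        a.deposit(100.0);                     // "passing a message": operation name plus parameters
        System.out.println(a.getBalance());   // callers never touch the internal structure directly
    }
}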
Another main concept of OO systems is that of type and class hierarchies and inheritance.
Inheritance permits creation of new types or classes that inherit much of their structure and
operations from previously defined types and classes.
Early OODBs had an issue with representing relationships among objects. Early insistence on
complete encapsulation led to the thought that relationships should not be specifically
represented, but rather described by defining methods to locate related objects. However, this
does not work well with complex databases which have a large number of relationships since in
these cases it is useful to identify relationships and make them visible to users. The ODMG
standard recognized this and the standard represents binary relationships as a pair of inverse
references.
Another object-oriented concept is operator overloading, or the ability to apply a given
operation to different types of objects. This is also called operator polymorphism. The next few
sub-modules present the main characteristics of ODBs.
M 12.3: Object Identity and Objects
versus Literals
One goal of ODBs is to preserve the direct correspondence between real-world and database
objects. To accomplish this, a unique identity is assigned to each independent object in the
database. The unique identity is usually accomplished by assigning a unique value. This value is
normally generated by the system. The value is known as the object identifier (OID). The OID
may not be visible to the external user, but is used internally by the database to identify each
object uniquely. The OID can be assigned to program variables of the appropriate type when
needed.
The main property of the OID is that it must be immutable. This means that the value does not
change. This preserves the identity of the real-world object that the database object
represents. To accomplish this, an ODB must have a method to generate OIDs and maintain
their immutability. It is also preferable to use each OID only once. Even if the object is deleted
from the database, the OID should not be assigned to a different object. This implies that the
OID should not depend on any attribute values of the object since the values might be changed.
The OID can be somewhat likened to the use of the primary key for tables in relational
databases.
It is also not a good idea to base the OID on the physical address where an object is stored,
since these addresses can change over time. Some early ODBs used physical addresses as OIDs
to increase efficiency. Indirect pointers were then used if the physical address of an object
changed. Now it is more common to use long integers as OIDs and use a hash table to map the
OID to the current physical address of the object.
Early OO models required that everything be an object. This forced basic values (integer, etc.)
to have an OID. This led to the possibility that a value, such as the integer 3, may have several
different OIDs. While this can be useful in a theoretical model, it is not very practical since it
would lead to too many OIDs. Most ODBs now allow for both objects (which get an OID)
and literals (which just have values) which are not assigned OIDs. This requires that literals
must be stored within an object and the literals cannot be referenced directly from other
objects. In many systems, it is allowable to create complex structured literals which do not need
an OID.
M 12.4: Complex Type Structures for
Objects and Literals
Objects and literals may have an arbitrarily complex type structure to contain all necessary
information to describe the object or literal. In RDBMSs, information about a particular object is
spread across many relations and tuples. This leads to a loss of the direct mapping between a
real world object and its database representation. In an ODB, it is possible to construct a
complex type from other types by nesting type constructors. The three most basic constructors
are atom, struct (tuple), and collection.
One constructor has been known as the atom constructor. This term is still used, but it does not appear in the latest standard. It includes the basic built-in types of the object model, which are similar to basic types such as integer, floating-point, etc. found in many programming languages. They are called single-valued or atomic types, since each value is considered a single atomic value.
Another constructor is called the struct or tuple constructor. As with the atom constructor, these terms are not used in the latest standard. This constructor can be used to create standard
structured types such as tuples of the relational model. This type is made up of several
components and is sometimes called a compound or composite type. The struct constructor is
not actually a type, but rather a type generator since many different struct types can be
created. An example would be a struct Name which is made up of FirstName (a string),
MiddleInitial (a char), and LastName (a string). Note that struct and atom type constructors are
the only two available in the original relational model.
There are also the collection (or multivalued) type constructors. These include
the set(T), list(T), bag(T), array(T), and dictionary(K,T) type constructors. They allow part of an
object/literal value to include a collection of other objects or values when needed. These are
also considered to be type generators since many different types can be created. For example,
set(string), set(integer), etc. are different types. All elements in a given collection value must
have the same type.
The atom constructor is used to represent basic atomic values. The tuple creates structured values and objects of the form <a1:i1, a2:i2, ..., an:in>, where ai is an attribute name (instance variable in OO terminology) and ii is a value or OID.
The other constructors are all different, but are collectively called collection types. The set
constructor creates a set of distinct elements, all of the same type. The bag constructor (also known as a multiset) is similar to the set constructor, but the elements in a bag need not be distinct. The list constructor creates an ordered list of OIDs or values of the same type. The array constructor creates a single-dimensional array of elements of the same type. Arrays and lists are similar, except that an array has a maximum size, while a list can contain an arbitrary number
of elements. The dictionary constructor creates a collection of key-value pairs. A key can be
used to retrieve its corresponding value.
An object definition language (ODL) is used to define object types for a particular ODB
application. An example of this using a simplified demonstration ODL described in the text
would be:
define type EMPLOYEE
    tuple ( Fname:       string;
            Minit:       char;
            Lname:       string;
            Ssn:         string;
            Birth_date:  DATE;
            Address:     string;
            Salary:      float;
            Supervisor:  EMPLOYEE;
            Dept:        DEPARTMENT; );

define type DATE
    tuple ( Year:        integer;
            Month:       integer;
            Day:         integer; );

define type DEPARTMENT
    tuple ( Dname:       string;
            Dnumber:     integer;
            Mgr:         tuple ( Manager:    EMPLOYEE;
                                 Start_date: DATE; );
            Employees:   set(EMPLOYEE);
            Projects:    set(PROJECT); );
In this example, the EMPLOYEE type is defined using the tuple constructor. Several of the types
or attributes within the tuple, such as Address and Salary, are defined by the atom constructor
using basic types. Three of the items, including Birth_date, refer to other objects, in this case to
the DATE object. Attributes of this kind are basically OIDs that are references to the other
object. These are used to represent the relationships among the various objects.
A binary relationship can be represented in one direction, such as with Birth_date. In other
cases, the relationship is represented in both directions, which is known as an inverse
relationship. An example of this is the representation of the relationship between EMPLOYEE
and DEPARTMENT. In the EMPLOYEE type, the Dept attribute is a reference to the
DEPARTMENT object where the employee works. In the DEPARTMENT type, the Employees
attribute has as its value a set of references to objects of the EMPLOYEE type. This set
represents the set of employees who work in the department.
M 12.5: Encapsulation of Operations
and Persistence of Objects
Encapsulation of Operations:
Encapsulation is one of the main characteristics of object oriented languages and systems.
Encapsulation is related to abstract data types and information hiding in programming
languages. This concept was not used in traditional database systems since the structure of
database objects was made visible to the users and application programs. For example, in
relational database systems, SQL commands can be used with any relation in the database.
In ODBs, the concept of encapsulation can be used since the behavior of a type of object is
based on the operations that can be externally used with the object. Some operations can be
used to update the object, others can be used to retrieve the current values of the object, and
others can be used to apply calculations to the object. Usually the implementation of an operation is specified in a general-purpose programming language. External users of the object are only made aware of the interface of the operations: the name and parameters of each operation. However, the actual implementation is hidden from the users. The interface is called
the signature, and the implementation is called the method.
The restriction that all objects should be completely encapsulated is too strict for database
applications. This restriction is eased by dividing the object into visible and hidden attributes
(the instance variables). Visible attributes are visible to the end user and can be accessed
directly through the query language. Hidden attributes are not visible and can be accessed only
through predefined operations.
The term class is used when referencing both the type definition and definitions of the
operations for the type. An example of how the EMPLOYEE type definition shown in the last
sub-module could be extended to include operations would be:
define class EMPLOYEE
    type tuple ( Fname:       string;
                 Minit:       char;
                 Lname:       string;
                 Ssn:         string;
                 Birth_date:  DATE;
                 Address:     string;
                 Salary:      float;
                 Supervisor:  EMPLOYEE;
                 Dept:        DEPARTMENT; );
    operations   age:         integer;
                 create_emp:  EMPLOYEE;
                 delete_emp:  boolean;
end EMPLOYEE;
In this example, an operation age is specified. It needs no parameters and returns the current
age of the employee by computing the age based on the birth date of the employee and the
current date. This method would be defined externally using a programming language. A
constructor operation is usually included. In the example it is given the name create_emp, but
many ODBs have a default name for the constructor, usually the name of the class. A destructor
operation is also usually included to delete an object from the database. In the example, the
destructor is named delete_emp and will return a Boolean indicating whether or not the object
was successfully deleted. An operation is invoked for a specific object by using the familiar dot notation, for example e.age for an EMPLOYEE object referenced by the variable e.
Specifying Object Persistence:
An ODB is often closely aligned with a particular OO programming language which is used to
specify the operation implementation. In some cases an object is not meant to be stored in the
database. These transient objects will not be kept once the program using them terminates.
This type of object is contrasted with persistent objects which are stored in the database and
remain after a program terminates.
The two methods used most often to make an object persistent are naming and reachability.
The naming technique involves giving an object a unique persistent name within the database. This can be done by giving an object a persistent object name in the program used
to create the object. The named objects are entry points to the database which are used to
access the object. Since it is not practical to give names to all objects in a large database, most
objects are made persistent using the second method: reachability. This works by making the
object reachable from some other persistent object. An object Y is said to be reachable from an
object X if a sequence of references can lead from X to Y.
It is interesting to note that in a relational database, all objects (tuples) are assumed to be
persistent. In an RDB, when a table is created, it represents both the type declaration and a
persistent set of all tuples. In an object-oriented database, declaring a class defines only the
type and operations for the class. The designer must separately declare an object to be
persistent.
M 12.6: Type Hierarchies and
Inheritance
Simplified Model for Inheritance:
Similar to OO programming languages, ODBs allow type hierarchies and inheritance. In Section
12.1.5, the text presents a simple OO model in which attributes and operations are treated
together since both can be inherited. Inheritance permits the definition of new types based on,
or inheriting from, existing types. This leads to a class hierarchy.
A type is defined by first specifying a type name and then by defining attributes and operations
for the type. In the simplified model, attributes and operations taken together are
called functions. A function name can be used to refer to the value of an attribute or to the
value generated as the result of an operation.
The simplest form of a type is a type name and a list of visible, or public, functions. In the
simplified model presented in the text, a type is specified in the format:
TYPE_NAME: function, function, function, ..., function
The functions are specified without parameters. Attributes would not have parameters, and operations are listed only by name for simplicity. Although they are not identical, an ODB type specification can look very much like a table definition in a relational database.
When a new type needs to be created and is not identical to but is similar to an already defined
type, ODBs allow a subtype to be created based on an existing type. The subtype inherits all of
the functions from the existing type. The existing type is now called the supertype.
For example, using the syntax shown above, a person type might be defined as:
PERSON: Name, Address, Birth_date, Age, Ssn
Similarly, a student type might be defined as:
STUDENT: Name, Address, Birth_date, Age, Ssn, Major, Gpa
It can be seen that these two types are somewhat similar. A student is a person, but it is desired
to include two additional attributes in the student type. Based on this, it is reasonable to derive
the STUDENT type from the PERSON type and add the two additional attributes. That can be
accomplished with syntax similar to the following:
STUDENT subtype-of PERSON: Major, Gpa.
Here, STUDENT will inherit all of the functions from PERSON. Only the Major and Gpa functions
will need to be defined for STUDENT.
Constraints on Extents Corresponding to a Type Hierarchy:
In most ODBs, an extent is defined to store the collection of persistent objects for each type or
subtype. If this is true for the ODB being used, there is a corresponding constraint that every
object in a subtype extent must also be a member of the supertype extent. Some ODBs have a
predefined system type (usually called either the ROOT class or the OBJECT class) whose extent
contains all objects in the system. Then all objects are placed into additional subtypes that are
meaningful based on the application. This creates a class hierarchy or type hierarchy for the
system. All extents then become subsets of the OBJECT class.
An extent can be defined as a named persistent object whose value is a collection of objects of the same type that are stored permanently in the database.
Such objects can be accessed and shared by multiple programs. It is also possible to create a
transient collection. For example, it is possible to create a transient collection to hold the result
of a query run on a persistent collection. The program can then work with the transient
collection which will go away when the program terminates.
The ODMG model is more complex than what is presented here. This standard allows two
types of inheritance. One is type inheritance, which the standard calls interface inheritance. The
other type is the extent inheritance constraint. Details of the standard will not be covered in
this course.
M 12.7: Additional Object-Oriented
Concepts and a Summary of ODB
Concepts
There are a few additional object-oriented concepts that you may have already seen in an
object-oriented programming class. One is the polymorphism of operations, which may be better known as operator overloading. This concept permits the same operator name (which
may be the symbol for the operator) to be applied differently depending on the type of object
the operator is applied to. In ODBs, the same concept can be used. In an ODB, different
implementations of a given method may need to be used depending on the actual type of
object the method is being applied to.
Another concept is multiple inheritance. Multiple inheritance occurs when a subtype inherits
from two or more types thereby inheriting the functions of both supertypes. There are some
issues with multiple inheritance such as the potential for ambiguity if any functions in the
supertypes have the same name. There are methods to handle the potential problems, but
some ODBs and OOPLs prefer to avoid the issue completely and do not allow multiple
inheritance.
The main concepts used in ODBs and object relational systems are:
• Object identity
• Type constructors
• Encapsulation of operations
• Programming language compatibility
• Type hierarchies and inheritance
• Extents
• Polymorphism and operator overloading
M 12.8: Object Database Extensions
to SQL
Some features from object databases were first added to SQL in the SQL standard known as
SQL:99. These features were first treated as an extension to SQL, but these and additional
features were included in the main part of SQL in the SQL:2008 standard.
As we have seen, a relational database which includes object database features is usually called
an object-relational database. Some of the object database features which have been included
in SQL are:
• Type constructors to specify complex objects
• A method for providing object identity using a reference type
• Encapsulation of operations through user-defined types (UDTs), which may include operations as part of their declaration. In addition, user-defined routines (UDRs) allow the definition of general operations (methods).
• Inheritance, which is provided using the keyword UNDER
Additional details and some of the SQL syntax that supports these features are provided in Section 12.2 in the text. These details will not be covered in this course.
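Even though the details are left to the text, a rough sketch in the spirit of these SQL object extensions may help; the type, table, and column names below are made up, and the exact syntax varies among the DBMSs that implement these features:

    CREATE TYPE PersonType AS (
        Name        VARCHAR(50),
        Address     VARCHAR(100),
        Birth_date  DATE
    )
    NOT FINAL
    METHOD age() RETURNS INTEGER;      -- operation encapsulated with the UDT (body defined separately)

    CREATE TYPE StudentType UNDER PersonType AS (   -- inheritance using the keyword UNDER
        Major  VARCHAR(20),
        Gpa    DECIMAL(3,2)
    )
    NOT FINAL;

    CREATE TABLE PERSON OF PersonType               -- a table whose rows are PersonType objects
        ( REF IS person_id SYSTEM GENERATED );      -- reference type supplies the object identity

Here the UDT PersonType encapsulates an operation, StudentType inherits its attributes and operations through UNDER, and the system-generated reference column person_id plays the role of an OID.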
M 13.1: Transactions
A transaction is the term used to describe a logical unit of processing in a database. For many years now, large databases with hundreds or even thousands of concurrent users have been used by almost everyone on a daily basis. The use of these databases continues to increase. Such
databases are often referred to as transaction processing systems. These systems are
exemplified by applications such as purchasing concert and similar event tickets, shopping
online, online banking, and numerous other applications.
As further discussed below, a transaction must be completed in its entirety or the database is
possibly left in an incorrect state. This module focuses on the basic concepts and theory used to
make sure that transactions are executed correctly. This includes dealing with concurrency
control problems which can occur when multiple transactions are submitted by different users.
The problem happens when the requests of the different users interfere with one another. Also
discussed is the issue of recovering the database when transactions fail.
Back in Module 2, we saw that one of the ways to classify DBMSs was by number of users:
single-user vs. multiuser. Although not a problem with single-user systems, there is a potential
problem of interference with multiuser systems since the database may be accessed by
many concurrent users.
Multiuser systems have been available for decades, and have been based on the concept of
multiprogramming, where an operating system (OS) permits the execution of multiple
programs (technically processes) at apparently the same time. For most of these decades,
computers running multiprogramming operating systems contained a single CPU. A single CPU
can actually run only one process at any given time. The OS runs a process for a short time,
suspends its execution and then begins or continues the execution of another process, and so
on. Since both the time slice given to a process and the necessary switching between processes
happen at computer speed, the execution appears to be simultaneous to the user. This leads to
interleaved concurrency in multiuser systems. Newer systems have multiple CPUs and these
systems can execute multiple processes concurrently. Details of these and related topics will be
left to an operating systems class and will not be further discussed here. Most of the theory
presented in this sub-module was developed many years ago (before multiple CPU systems)
and since the general concepts still apply, a single CPU system will be assumed for the
presentation in this sub-module.
A transaction includes one or more database access operations. The access can be a retrieval or
any of the update operations. A transaction can be specified using SQL or can be part of an
application program. In either case, both the beginning and end of a transaction must be
specified. The details of the specification can vary, but the statements will be similar to begin
transaction and end transaction. All database operations between the transaction delimiters
are considered to form a single transaction. If all the database operations within the transaction
only retrieve data, but do not update the database, the transaction is considered to be a read-only transaction. If one or more of the operations can possibly update the database, the
transaction is considered to be a read-write transaction. Since the OS may interrupt a program
in the middle of a transaction, it is possible that two different programs attempt to modify the
same data item at the same time.
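Written directly in SQL, the delimiters typically look like the sketch below; the ACCOUNT table and its columns are hypothetical:

    START TRANSACTION;                        -- begin transaction
    UPDATE ACCOUNT
        SET Balance = Balance - 100
        WHERE Acct_no = 'A-101';
    UPDATE ACCOUNT
        SET Balance = Balance + 100
        WHERE Acct_no = 'A-202';
    COMMIT;                                   -- end transaction

Because both operations update the database, this is a read-write transaction, and if the OS suspends the program between the two UPDATE statements, another program may attempt to modify the same rows in the meantime.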
This possibility can lead to problems in the database unless concurrently executing transactions
can be run with some type of control. As an example, consider purchasing a ticket online for an
event. To keep the example simple, consider that the event has open seating. The ticket allows
entry to the event but the seats are not reserved, so all tickets are "equal." Assume that
customers Nittany and Lion are both interested in purchasing tickets to the event. Assume the
timeline of their interactions looks something like this:
9:05 AM - Nittany: reads the record and sees that 50 tickets are available; begins to discuss price and other details with friends.
9:08 AM - Lion: reads the record and also sees that 50 tickets are available; begins to discuss details with friends.
9:17 AM - Nittany: a total of 6 decide to go, so an update is issued to get the tickets; this updates the record, leaving a total of 44 available tickets (50 - 6).
9:19 AM - Lion: a total of 7 decide to go, so an update is issued to get the tickets; this updates the record, leaving a total of 43 available tickets (50 - 7, based on the count of 50 read at 9:08).
Based on this example, both customers get their tickets. However, this leaves the database
showing that 43 tickets are still available when there are actually only 37 tickets remaining (50 -
6 - 7). Based on this scenario, the update from Nittany was overwritten by the update from
Lion. This is known as the lost update problem since the update from Nittany was lost.
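In SQL terms, the lost update occurs because each customer's session writes a new count computed from a stale read. A sketch of the interleaving, using a hypothetical TICKETS table, might look like this:

    -- Nittany (9:05): reads the record and sees 50 available
    SELECT Available FROM TICKETS WHERE Event_id = 'E1';
    -- Lion (9:08): also reads the record and sees 50 available
    SELECT Available FROM TICKETS WHERE Event_id = 'E1';
    -- Nittany (9:17): writes 44 (50 - 6)
    UPDATE TICKETS SET Available = 44 WHERE Event_id = 'E1';
    -- Lion (9:19): writes 43 (50 - 7), silently overwriting Nittany's update
    UPDATE TICKETS SET Available = 43 WHERE Event_id = 'E1';

Concurrency control, discussed in the next sub-module, is what prevents the second UPDATE from being based on the stale value read at 9:08.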
There are related issues which also must be addressed. The dirty read problem occurs when
transaction A updates a database record, say record R, and then the updated record R is read
by transaction B. However, transaction A is not able to complete, and the update to record R must be rolled back to its original value, the value it had before transaction A began. Since transaction B is looking at a value in record R that was later replaced by the older value, transaction B sees an invalid or dirty value in record R.
Another potential problem is called the incorrect summary problem. If one transaction is calculating an aggregate function over a set of tuples while other transactions are updating some of those tuples, the aggregate may be computed using some values from before the updates and other values from after the updates. This leads to incorrect results.
The unrepeatable read problem occurs when a transaction reads the same item twice and
another transaction changes the value of the item in between the two reads. The first
transaction sees different values the two times the value is read.
The next sub-module discusses some techniques which can be used to control concurrency.
The DBMS must guarantee that for every transaction, either all database operations requested
by the transaction complete successfully and are permanently recorded in the database or that
the transaction has no impact on the database. In the first case, the transaction is said to
be committed, while in the second case the transaction is said to be aborted. Therefore, if a
transaction makes some of its updates, but then fails before completing its remaining updates,
the completed updates must be undone so they will have no lasting impact on the database.
There are several reasons a transaction might fail. These include:
• A computer failure
• A transaction or system error
• Local errors or exception conditions detected by the transaction
• Concurrency control enforcement
• Disk failure
• Physical problems and catastrophes
The DBMS must account for the possibility of all of these failures and have a procedure in place
to recover from each type.
M 13.2: Concurrency
The last sub-module briefly discussed the need for concurrency control. This sub-module
discusses concurrency control techniques which can be used to ensure that transactions which
are running concurrently can be executed so they do not interfere with each other. Most of the
techniques guarantee serializability.
When transactions can be interleaved, it is impossible to determine the order in which the
transactions will be executed and how and when each might be interrupted by the other. If two
transactions don't interrupt each other, one of two situations will occur: either transaction A
will run first and at a later time transaction B will run, or transaction B will run first and at a
later time transaction A will run. Both orderings are considered to be correct. Both are
called serial schedules since the transactions run in a series: first one, then the other.
When the transactions interrupt each other, it is possible that they interact in a way that is not
desired. An example of this was shown in the last sub-module when discussing the lost update
problem. One way to prevent the problems discussed in the last sub-module is to simply
prohibit concurrent execution of transactions. Transactions would be required to execute in a
serial fashion. Once a transaction starts, no other transaction can start until the executing
transaction completes. While this is a valid solution to the problems, it is not acceptable in
practice since it would eliminate concurrency and cause a vast under-utilization of system
resources. Since most transactions will not interfere with each other, this loss of concurrency is
considered unacceptable.
In practice, concurrent transactions are permitted as long as it can be determined that the
results of their execution will be equivalent to a serial schedule. We will not further pursue the
detailed concepts surrounding serializability. If you are interested, please see Chapter 20,
Section 5 in the text.
An early method used to guarantee serializability of schedules is the use of locking techniques,
specifically two-phase locking. Although still in use today by some DBMSs, most consider
locking to have high overhead and have moved to other protocols. Since basic locking
techniques are relatively easy to understand and since many of the newer protocols are derived
from locking, the basics of locking are described here.
A lock is usually implemented as a variable associated with a data item which indicates which
operations can be applied to the data item at a given time. To start the discussion, we will
describe a binary lock. This type of lock is simple to implement and to understand, but it is too
restrictive to be used in database systems. A binary lock has two states: locked and unlocked
(which can be represented as 1 and 0). If the value of the lock is 1 (locked), the data item
cannot be accessed. When this happens, a transaction needing to access the item will wait until
the lock is unlocked. If the value of the lock is 0 (unlocked) the item can be accessed. If a
transaction finds the value of the lock is 0, it will change the lock value to 1 before accessing the
item so any transactions which follow will see the item as locked. Once the transaction finishes
modifying the data item, it will unlock the lock by resetting the value to 0.
A binary lock scheme is easy to implement. For any item which is locked, an entry is put in
a lock table. This entry contains the name of the data item and the ID of the locking
transaction. A queue is also kept of any transactions currently waiting on this item to be
unlocked. Any data item not in the lock table is unlocked.
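Conceptually, the lock table can be pictured as a small relation such as the sketch below; the names are hypothetical, and a real DBMS keeps this structure in main memory rather than as SQL tables:

    -- one row per currently locked data item
    CREATE TABLE LOCK_TABLE (
        Data_item_name  VARCHAR(50) PRIMARY KEY,   -- name of the locked item
        Locking_txn_id  INTEGER NOT NULL           -- transaction currently holding the lock
    );

    -- transactions waiting for an item to become unlocked
    CREATE TABLE LOCK_WAIT_QUEUE (
        Data_item_name  VARCHAR(50),
        Waiting_txn_id  INTEGER,
        Queue_position  INTEGER                    -- order in which waiters will be granted the lock
    );

Any item with no row in LOCK_TABLE is unlocked.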
The binary lock is too restrictive for database use since only one transaction at a time can hold a
lock on a particular item. More than one transaction should be allowed to access the same item
as long as all the transactions are accessing the item only to read the item. If any transaction
wants to modify the item, it will need an exclusive lock on the item. To allow this type of
locking, a different type of lock, which has three possible states, must be used. This type of lock
is called a shared/exclusive or read/write lock. The three states are unlocked, read-locked, and
write-locked. The read lock is a shared lock, and the write lock is an exclusive lock. Once a
transaction holds a write lock, it may modify the item since no other transaction can hold a lock
on the item at that time. When the transaction finishes modifying the item, it will unlock the
item. When a transaction only wants to read the item, it can ask for a read lock. If the item is
unlocked or other transactions hold read locks on the item, the read lock is granted. If another
transaction holds a write lock on the item, the lock is not granted. Once a transaction holding a
read lock is finished reading the item, it will release its read lock.
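Many SQL dialects expose shared and exclusive row locks directly. For example, PostgreSQL-style syntax (a dialect-specific assumption, not part of the SQL standard) uses FOR SHARE for a read lock and FOR UPDATE for a write lock, shown here against the same hypothetical TICKETS table used earlier:

    -- shared (read) lock: other readers may also lock the row, but writers must wait
    SELECT Available FROM TICKETS WHERE Event_id = 'E1' FOR SHARE;

    -- exclusive (write) lock: no other transaction may read-lock or write-lock the row
    SELECT Available FROM TICKETS WHERE Event_id = 'E1' FOR UPDATE;

Locks acquired this way are normally held until the enclosing transaction commits or aborts.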
For this type of locking scheme, the lock table entries need to be modified. In addition to the
name of the data item and locking transaction ID, the entry must contain the type of lock (read
or write), and if a read lock, the number of transactions holding the lock. If more than one
transaction holds a read lock on the item, the locking transaction ID becomes a list of the IDs of
all transactions holding the lock. When a read lock is released and more than one transaction is
holding the lock, the count is decreased by one and the ID of the transaction releasing the lock
is removed from the list.
An additional enhancement is to allow a transaction to convert a lock from one type of lock to
the other if certain conditions are met. It is possible to convert a read lock to a write lock if no
other transaction has a read lock on the item. This is called a lock upgrade. It is also possible to
convert a write lock to a read lock once a transaction is finished updating the data item. It
would perform this lock downgrade if it still needs to have the item available for reading. If it is
completely finished with the item, it would simply release the lock to unlock the item.
Serializability can be guaranteed if transactions follow a two-phase locking protocol. A two-phase locking protocol is followed if all locking operations, both read locks and write locks, are
completed before the first unlock operation is executed. This divides the transaction into two
parts. The first is called the expanding phase where locks are acquired but not released. The
second is called the shrinking phase where locks are released but no new locks can be acquired.
If lock conversion is permitted, upgrading from a read lock to a write lock must be done during
the expanding phase, and downgrading from a write lock to a read lock must be done during
the shrinking phase.
The above describes a general form of two-phase locking (2PL) that is known as basic 2PL.
There are other variations of 2PL. A variation known as conservative 2PL requires a transaction
to acquire all of its locks before the transaction can begin execution. If all the required locks
cannot be obtained, the transaction will not lock any item and will wait until all locks can be
obtained. This is a deadlock-free protocol (see below), but difficult to use since a transaction
may not know all the locks it will need prior to beginning execution.
A more widely used version is known as strict 2PL. In this variation, a transaction does not
release any write locks until after the transaction either commits or aborts. This will prohibit
another transaction from reading a value which may not end up being committed to the
database. This variation is not deadlock-free.
Another version, rigorous 2PL, is more restrictive than strict 2PL, but less restrictive than
conservative 2PL. Using rigorous 2PL, a transaction does not release any locks until after it
commits or aborts. This makes it easier to implement than strict 2PL.
One issue which must be addressed is deadlock. Deadlock occurs when two (or more)
transactions are waiting to obtain a lock which is held by the other. Since neither can proceed
until obtaining the required lock, neither will free the lock needed by the other.
One general method for dealing with deadlock is to prevent deadlock from happening in the
first place. These protocols are called deadlock prevention protocols. One such protocol is the
conservative 2PL protocol discussed above. There are other protocols which have been
developed to prevent deadlock.
A second general method for dealing with deadlock, deadlock detection, is to detect that
deadlock has occurred and then resolve the deadlock. This method usually provides more
concurrency, and is attractive in cases where there will be minimal interference between
transactions. This is usually the case when most transactions are short and lock only a few data
items. It is also usually the case when the transaction load is light. Once deadlock is detected, at
least one of the transactions involved in the deadlock must be aborted. This is known as victim
selection. There are several criteria that can be considered for selecting the "best" victim,
which is the one that causes the least amount of upheaval when it is aborted.
An additional issue which must be addressed when using locking is known as starvation. One
form of starvation is when a transaction must wait for an indefinite time to proceed while other
transactions are able to complete normally. This can happen with certain types of lock waiting
schemes, but there are modifications to give some types of transaction priority while still
making sure that no transaction is blocked for an extremely long period of time. A second form
of starvation occurs when a given transaction repeatedly becomes deadlocked and is chosen as
the victim each time, so it is not able to finish its execution. Again, there are schemes that can
be implemented to prevent this type of starvation.
Additional discussion of this topic can be found in Chapter 21 in the text, beginning with
Section 21.2.
M 13.3: Database Recovery
There are a variety of events that can affect or even destroy data in a database. DBMSs and
system administrators must be prepared to handle these events when and if they occur. The
event can be as simple as a valid user entering an incorrect value in a tuple or the event can be
something as massive as a fire, earthquake, or other disaster destroying the entire data center
and everything in it. Thus, the possibilities range from a single incorrect data value on one end
to the complete destruction of the database on the other end. Since it is almost certain that
such events will occur at least occasionally, tools and procedures must be in place to correct or
reconstruct the database. This falls under the heading of database recovery and the related
database backup.
One aspect of this is the concept of database backup which has been around for a very long
time. On a regular basis, the database must be backed up (copied) and the backup must be kept
in a safe place which is away from the main database site if the backup is placed on a physical
medium. This was the case for many years, but more recent advances in networking technology
and related speed improvements have made it possible to perform a backup directly to a
remote site or to the cloud.
The second basic backup task is to keep a disk log or journal of all changes which are made to
the data. This includes all updates (insertions, deletions, and modifications) but does not
include reads since they do not change the data. There are two basic types of database logs.
One is called a change log or a before and after image log. This type of log records the value of
a piece of data just before it is changed and then again just after it is changed. In addition to the
values, the log records the table name and the primary key of the tuple which is changed. The
other type of journal, normally called a transaction log, keeps a record of the program
(including interactive SQL) which changed the data and all the inputs that the program used.
Regardless of which type of log is used, a new log is started as soon as the data is backed up (by
making a backup copy).
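A before-and-after image log of this kind could be pictured as a relation like the following sketch; the column names are hypothetical, and actual DBMSs use compact binary log formats rather than SQL tables:

    CREATE TABLE CHANGE_LOG (
        Log_seq_no    BIGINT PRIMARY KEY,    -- position in the log, in time order
        Txn_id        INTEGER,               -- transaction that made the change
        Table_name    VARCHAR(50),
        Tuple_key     VARCHAR(50),           -- primary key of the changed tuple
        Before_image  VARCHAR(1000),         -- value just before the change (empty for an insert)
        After_image   VARCHAR(1000)          -- value just after the change (empty for a delete)
    );

Roll-forward recovery (described below) applies the after images in increasing Log_seq_no order, while rollback restores the before images in decreasing order.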
Given that backups and logs are available, how are they used for recovery? The answer is that it
depends on what type of recovery needs to be performed. If the problem is a major problem
such as the loss of a database table, the entire database, or even the loss of the disk which
contains the database table, then at least one table needs to be recreated. This is usually done
using a procedure called roll-forward recovery. Considering the recovery of one table (which
can be repeated if the need is to recover two or more tables), the process begins with the most
recent backup copy of the table. This is a copy of the table which was lost, but it does not
reflect the most recent changes to the table. To bring the table up to date, the log is used.
Starting from the beginning of the log (which began just after the backup) each log entry which
represents an update to the table in question is then applied to the table. Starting at the
beginning of the log and then rolling forward in time sequence order will update the table in
the order that the changes were made, thus bringing the table up to the point where it was just
before the table was "lost."
Now consider a different reason for recovery. Assume that during normal database operation
an error is found in a recent update to a data value. This can be caused by something as simple
as a user entering an incorrect value, or by something more complex like a program crashing
after updating some, but not all, records in a transaction. Since the program crashed, it was able neither to commit the transaction nor to abort it. At first glance, it seems that an easy solution
is to go in and update the affected values with the accurate data. However, this does not take
into account the possibility that after the error occurred but before it was discovered, other
programs made updates to the database based on the incorrect data. Because of this
possibility, all changes made to the database since the error occurred must be backed out of
the database. This is done with a process called rollback by starting with the database as it
currently exists (a backup copy is not needed in this scenario) and then starting at the end of
the log moving backwards through the log and restoring the data values to their "before"
values. Working through the log stops when it reaches the point in time where the error was
made. Once the database is in this state, the value that was in error can be changed to the
correct value. Then the transactions which made changes after the error was made can be
rerun. This may need to be done manually, but if a transaction log is kept, a utility program can
roll forward through the transaction log and automatically rerun all of the transactions which
occurred beginning at the point the error was made.
This sub-module only describes the basic concepts of database recovery. Ch. 22 in the text
discusses additional concepts in more detail and also provides an overview of various database
recovery algorithms. If you are interested, please read that chapter in the text.
M 13.4: Database Security
This sub-module provides a very brief introduction to database security. Database security can
be quite comprehensive and it is not the intent to provide a thorough discussion here. Also,
database security shares much in common with overall site and network security. If you have
taken a course in security, much of what is covered there, such as physical security of the site,
applies to the database system also. These general security techniques will not be specifically
discussed here. Our brief overview will be limited to database-specific issues.
Types of Security
Database security covers many areas, including the following:
• Legal and ethical issues regarding the right to access certain information.
• Policy issues at the governmental and corporate level.
• System-related issues: at what levels should various security functions be enforced? Some issues can be handled by the hardware or the operating system; others will need to be handled by the DBMS.
• The need of some organizations to provide multiple security levels such as top secret, etc.
Threats to Databases
• Loss of integrity: integrity is the requirement to protect data from improper modification. Integrity is lost by unauthorized changes to the data, whether intentional or accidental.
• Loss of availability: availability is defined as providing data access to humans or programs that have a right to access the data.
• Loss of confidentiality: confidentiality deals with the protection of data from unauthorized disclosure.
Database Security
When examining database security, it must be remembered that it is not implemented in a
vacuum. The database is usually networked and is part of a complete system which usually
includes firewalls, web servers, etc. The entire system needs to work together to implement
security. A DBMS usually has a security and authorization subsystem which provides security to
prevent unauthorized access to certain parts of a database.
Database security is usually split into two main types: discretionary and mandatory.
Discretionary security mechanisms involve the ability to grant various database privileges to
database system users. Mandatory security mechanisms involve putting users into various
security categories which are then used to implement the security policy of the organization.
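Discretionary mechanisms are what the SQL GRANT and REVOKE statements provide. A brief sketch, with hypothetical table and user names:

    -- allow user nittany to read and insert into the TICKETS table
    GRANT SELECT, INSERT ON TICKETS TO nittany;

    -- allow user lion to read TICKETS and to pass that privilege on to other users
    GRANT SELECT ON TICKETS TO lion WITH GRANT OPTION;

    -- later, take the insert privilege away again
    REVOKE INSERT ON TICKETS FROM nittany;

Mandatory mechanisms, by contrast, assign security classifications to data and users and cannot be expressed through simple GRANT statements.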
Control Measures
The four main control measures used to provide database security are:
• Access control: Used to restrict access to the database itself.
• Inference control: Statistical databases allow queries to retrieve information about group statistics while still protecting private individual information. Security for statistical databases must provide protection to prevent the private information from being retrieved or inferred from the statistical queries.
• Flow control: This prevents the flow of information in a way that it reaches users unauthorized to see the data.
• Data encryption: To add an extra security measure, data in the database can be encrypted. When this is done, even if an unauthorized user is able to get to the stored data, the unauthorized user will have trouble deciphering it.
The Role of the DBA
As we discussed in Module 1, the DBA has many responsibilities in the oversight of the
database. One of these responsibilities is for the overall security of the database. This includes
creating accounts, granting privileges to accounts, and assigning users to the appropriate
security category. The DBA will assign account numbers and passwords to users who will be
authorized to use the database. In most DBMSs, a log entry is kept each time a user logs in. The log will also keep track of all changes the user makes to the database. If there are unauthorized changes made to the database, this log can be used to determine the account (and therefore the user) that made the changes.
Additional Information
If you are further interested in this topic, additional information is provided in later parts of
Chapter 30. You might find of particular interest the discussion of the SQL injection threat
discussed in Section 30.4.
M 14.1 Introduction and Basic
Concepts
As discussed in Module 2, early DBMSs were based on a centralized architecture where the
DBMS ran on a mainframe and users accessed the database via "dumb" terminals. This was
dictated by the technology of the time when the centralized mainframe was the technology
which was available.
Early research was based on networking the mainframes, but for several years work in the area
remained at the research level.
As PCs moved from being considered only hobby machines to being used in business in the
1980s, PCs moved from stand-alone computers to computers which were networked over local
area networks. Some DBMSs were reduced in size to run stand-alone on PCs for small
databases. However, DBMSs were also deployed on small servers and used in a client/server
architecture. The DBMS was still centralized on the server, but some of the processing was
moved from the server to the PC client. This was enabled partly by the improvement in local
area networks (LANs).
In the late 1990s the web became more prevalent, especially with the development of the first
browser in 1994. Throughout the rest of the 1990s and the 2000s, improvements continued to
be made in both browsers and in the networking infrastructure supporting the web. This
fueled an explosion in web usage. As part of this explosion, web sites were hosting databases to
support their evolving online businesses. This led to a move to a three-tier architecture, with
one tier being the client running on a browser, the second being an application server running
various applications for the company, and the third being a database server which the
application server can query. Note that while this distributes the overall processing, the
database itself is not necessarily distributed.
What is a Distributed Database?
A distributed database (DDB) is a collection of multiple logically interrelated databases
distributed over a computer network. A distributed database management system (DDBMS) is a
software system which manages a distributed database while making the distribution
transparent to the user.
To be classified a distributed database, the database must meet the following:
• There are multiple computers called sites or nodes. The sites are connected by an underlying computer network which transmits data and commands among the sites.
• The information in the database nodes must be logically related.
• It is not necessary that all nodes have data, hardware, and software that is identical.
The sites may be connected by a LAN or connected by a wide-area network (WAN). When
considering database architecture issues at a high level, the network architecture does not
matter. The only requirement is that every node needs to be able to communicate with every
other node. The network design is, however, critical to overall distributed database
performance issues. Network design is a topic that is beyond the scope of this course and will
not be covered here.
Transparency
In this context, transparency refers to hiding implementation details from users. For a
centralized database, this refers just to logical and physical data independence.
In a distributed database, there are additional types of transparency which need to be
considered. They include the following.
• Distribution transparency: This deals with the issue that the user should not be concerned about the details of the network and the placement of data. This is divided into two parts. Location transparency is where the command used to perform a task is independent of the location of the data and of the location of the node issuing the command. Naming transparency provides that once named, an object can be accessed without additional location information being specified.
• Replication transparency: This allows the database to store data at multiple sites, but the user is unaware of the copies.
• Fragmentation transparency: This provides that a relation (table) can be split into two or more parts and the user is unaware of the existence of the fragments. This is further discussed in the next sub-module.
Availability and Reliability
In computer systems, in general, reliability is defined as the probability that the system will be
running at a specified point in time. Availability is the probability that the system will be continuously available during a specified time interval. In distributed databases, the term availability is used
to indicate both concepts. Availability is directly related to the faults, errors, and failures of the
database. These terms describe related concepts and the terms are described in slightly
different ways by various sources. Here we will describe a fault as a defect in the system which
can be the cause of an error. An error is when one component of the database enters a state
which is not desirable. An error may lead to a failure. A system failure occurs when the system
does not function according to its specifications. Potential improvement in availability is one of
the main advantages of a distributed database.
A reliable DDBMS tolerates failures of underlying components. The recovery manager of the
DDBMS needs to deal with failures from transactions, hardware, and communications
networks.
Scalability and Partition Tolerance
Scalability indicates how much a system can expand and still continue to operate without
interruption.
Horizontal scalability is the term used for expanding the number of nodes. As nodes are added,
it should be possible to distribute some of the data and processing to new nodes.
Vertical scalability is the term used for expanding the capacity of individual nodes. This would
include expanding the storage capacity or the processing power of the node.
If there is a fault in the underlying network, it is possible that the network may need to be
partitioned (for a time) into sub-networks. Partition tolerance says that the database system
should be able to continue operating while the network is partitioned.
Autonomy
Autonomy is the extent to which individual nodes or databases connected to a DDB can
operate independently. This is desirable to increase flexibility.
M 14.2: Distributed Database Design
We have looked at database design issues and concepts throughout the course. When working
with a distributed database, in addition to the general design of the database, other factors
must be considered.
Data Fragmentation and Sharding
When using a centralized database, there is no decision to be made as to where to store the
data: it is all stored in the database. When the database is distributed, it must be determined
which sites should store which portions of the database. For now, assume that the data is not
replicated; data is stored at one site only.
First, the logical units of the database must be determined so they can be distributed. The
simplest logical unit is a complete relation (table). The entire relation will be stored at one site.
However, a relation can be split into smaller logical units. One way to do this is by creating
a horizontal fragment (also called a shard) of a relation. This is a subset of the tuples of a
relation (the table is split horizontally). The fragment can be determined by specifying a
condition on one or more attributes of the relation. For example, in a large company with
multiple sales locations, the customer relation may be split into several horizontal fragments by
using the condition of which sales site is primarily responsible for providing service to the
customer. Each fragment can then be stored at the database site closest to the sales site.
A second way to split a table into smaller logical units is by vertical fragmentation. This is a
division of the relation by columns (attributes); the vertical fragment keeps only certain
attributes of the relation. If a relation is split into two parts using vertical fragmentation, at least
one of the attributes must be kept in both fragments so the original tuple can be recreated
from the two fragments. This common attribute kept in both tables must be the primary key (or
some unique key).
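As a sketch, both kinds of fragments could be materialized with queries like the following; the CUSTOMER table and its columns are hypothetical, and a real DDBMS creates fragments through its allocation machinery rather than ad hoc CREATE TABLE statements:

    -- horizontal fragment (shard): only the customers served by the East sales site
    CREATE TABLE CUSTOMER_EAST AS
        SELECT * FROM CUSTOMER WHERE Sales_site = 'EAST';

    -- vertical fragments: each keeps the primary key so the original tuples can be rebuilt
    CREATE TABLE CUSTOMER_CONTACT AS
        SELECT Cust_id, Name, Phone, Email FROM CUSTOMER;
    CREATE TABLE CUSTOMER_BILLING AS
        SELECT Cust_id, Credit_limit, Balance FROM CUSTOMER;

    -- the original relation can be reconstructed by joining the vertical fragments
    -- SELECT * FROM CUSTOMER_CONTACT NATURAL JOIN CUSTOMER_BILLING;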
The two types of fragmentation can be intermixed resulting in what is known
as mixed or hybrid fragmentation. A fragmentation schema includes a definition of the set of
fragments which must include all tuples and attributes and also must allow the entire database
to be recreated with a series of outer join and union operations.
An allocation schema describes the distribution of fragments to nodes of the DDB. If a
fragment is stored at more than one site, it is said to be replicated.
Data Replication and Allocation
Replication is used to improve the availability of data. At one end of the spectrum is a fully
replicated distributed database where the entire database is kept at every site. This improves
the availability of the database since the database can keep operating even if only one site is
up. This full replication also improves retrieval performance since results of queries can be
obtained locally. The disadvantage is that full replication can slow down update performance
since the update must be performed on every copy of the database.
At the other end of the spectrum is no replication. This provides for faster updates and the
database is easier to maintain. However, availability and retrieval performance can suffer if the
database is not replicated.
In between is a wide variety of partial replication options. Some fragments can be replicated,
but not others. The number of copies of each fragment can range from as few as one to as
many as the total number of sites. A description of the replication of fragments is called
the replication schema. The choice of how the database should be replicated is based on both
availability needs, and update and retrieval performance needs.
M 14.3: Concurrency Control and
Recovery
Several concurrency control methods for DDBs have been proposed. They all extend the
concurrency control techniques used in centralized databases. These techniques will be
discussed by looking at extending centralized locking. The general scheme is to designate one
copy of each data item as a distinguished copy. The locks for the data item are associated with
the distinguished copy, and all locking and unlocking requests are sent to the site which houses
the distinguished copy. All the methods based on this idea differ in how the distinguished
copies are chosen. In all the methods, the distinguished copy of a data item acts as a
coordinator site for concurrency control on that item.
Primary Site Technique
Using this technique, a single site is chosen to be the primary site and this site is used as the
coordinator site for all database items. This site, then, keeps all locks and all requests for
locking and unlocking are sent to this site. The advantage of this technique is that it is a simple
extension of the centralized locking approach. The disadvantages are that since all locking
requests are sent to a single site, the site might become overloaded and cause an overall
system bottleneck. Also, if this site goes down, it basically takes the entire database down since
all locking is done at this site. This limits both reliability and availability of the database.
Primary Site with Backup Site
This approach addresses the problem of the entire DDB being unable to operate if the primary
site goes down. With this approach, a second site is chosen as the backup site. All locking
information is maintained at both sites. If the primary site fails, the backup site takes over and a
new backup site is chosen. This scheme still suffers from the bottleneck issue, and the
bottleneck potential is actually worse than that with the primary site technique since all lock
requests and lock status information must be recorded at both the primary and backup sites,
thereby leading to two potential bottleneck points.
Primary Copy Technique
This method distributes the load of lock coordination. The distinguished copies of data items
are stored at different sites, reducing the bottleneck issue seen with the previous two
techniques. If one site fails, only transactions which need a lock on one of the data items whose
lock is at the site are affected. This method can also use backup sites to improve reliability and
availability.
Choosing a New Coordinator in Case of Site Failure
If a primary site fails, a new lock coordinator must be chosen. If a method with a backup site is
being used, all transaction processing of impacted data items is halted while the backup site
becomes the new primary. Transaction processing will not resume until a new backup site is
chosen and all lock information is copied from the new primary site to the new backup site. If
no backup site is being used, all executing transactions must be aborted and a lengthy recovery
process is then launched. As part of recovery, a new primary site must be chosen and a lock
manager process must be started at the chosen site. Once this is done, all lock information
must be created. If no backup site exists, an election process is used for the remaining active
sites to agree on a new primary site. The elected site then establishes the new lock manager.
Distributed Concurrency Control Based on Voting
In this technique, no distinguished copy is used. A lock request is sent to all sites that have a
copy of the data item. Each copy has a lock table for the item and can allow or refuse a lock for
the item. When a transaction is granted a lock on the item by a majority of the sites holding the
item, the transaction holds the lock and informs all copies that it has been granted the lock. If
the transaction does not receive the lock from a majority of sites within a timeout period, it
cancels the lock request and informs all sites of the cancellation.
This method is a true distributed concurrency control method. Studies have shown that this
method produces more message traffic than the techniques which use a distinguished copy.
Also, if the algorithm deals with the possibility that a site might fail during the voting process,
the algorithm becomes very complex.
Distributed Recovery
Recovery in a DDB is quite involved and details will not be discussed here. One issue that must
be considered is communication failure issues. If one site sends a message to a second site and
does not receive a response, there are several possibilities for not receiving the response. One
is that the second site is actually down. Another is that the second site did not get the message
because of a failure in the network delivery system. A third possibility is that the second site
received the message and sent a response, but the response was not delivered to the initial
site.
Another problem which must be addressed by distributed recovery is dealing with a distributed
commit. When a transaction is updating data at two or more sites, it cannot commit the
distributed transaction until it is sure that the data at every site has been updated correctly.
This means that each site must have recorded the effects of the transaction at the local site
before the distributed transaction can be committed.
M 14.4: Overview of Transaction
Management and Query Processing in
Distributed Databases
Distributed Transaction Management
With a distributed database, a transaction may require that tables be updated on different
nodes of the DDBMS. The concept of a transaction manager must be extended to include a
global transaction manager to support distributed transactions. The global transaction manager
coordinates the execution of the transaction with the local transaction manager of each
impacted site. It will ensure that the necessary copies of each table will be updated and
committed at each site. The transaction manager will pass information to the concurrency
controller to acquire and then eventually release the necessary locks.
Originally, a two-phase commit (2PC) protocol was used. With this protocol, in phase 1 each
participating database informs the coordinator that it has completed the changes required of it.
The coordinator then sends a message to all nodes indicating that they should prepare to
commit. Each node then writes to disk all information needed for local recovery and sends a
ready-to-commit message back to the coordinator. If a local database cannot commit its part of
the transaction, it instead sends a message to the coordinator that it cannot commit.
If the coordinator receives a ready-to-commit message from all participating nodes, it sends a
commit signal to all nodes, at which time the commit is completed locally at each node. If one
or more nodes sent a cannot-commit message, the coordinator sends a roll back message to all
nodes, and each node then rolls back its local part of the transaction.
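As a rough illustration of the coordinator's decision rule, the sketch below walks through the two phases in Python. The Participant class and its prepare, commit, and rollback methods are invented for the sketch; a real implementation exchanges messages over the network and force-writes log records at each of the steps described above.

# Minimal sketch of the two-phase commit decision logic (illustrative only).
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit

    def prepare(self):
        # Phase 1: the node would write its local recovery information to disk
        # here, then vote "ready to commit" (True) or "cannot commit" (False).
        return self.can_commit

    def commit(self):
        print(self.name, "commit")

    def rollback(self):
        print(self.name, "roll back")

def two_phase_commit(participants):
    # Phase 1: collect a vote from every participating node.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if every node voted ready; otherwise roll back everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.rollback()
    return "rolled back"

print(two_phase_commit([Participant("siteA"), Participant("siteB")]))                    # committed
print(two_phase_commit([Participant("siteA"), Participant("siteB", can_commit=False)]))  # rolled back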
The two-phase commit protocol has an issue when the global coordinator fails during a
transaction. Since this is a blocking protocol, any locks on other sites will continue to be held
until the coordinator recovers. This problem was resolved by extending the protocol to a three-phase commit (3PC) protocol. This extension divides the second phase of 2PC into two phases
which are known as prepare-to-commit and commit. The prepare-to-commit phase is used to
communicate the results of the replies from phase 1 to all participating nodes. If all replies were
yes, the coordinator indicates that all nodes should move to the prepare-to-commit state. The
commit phase is the same as the second part of the 2PC. With this extension, if the coordinator
crashes during this sub-phase, another participant can take over and continue the transaction
to completion, whether that be an eventual commit or an abort.
Distributed Query Processing
Since this course did not cover how queries are processed or optimized by a non-distributed
DBMS, there is no knowledge base to build upon to discuss the details of distributed query
processing. However, a few points about query processing with a distributed database can be
addressed at a high level.
When processing a distributed query, the data must first be located using the global conceptual
schema. If any of the data is replicated, the most appropriate site to be used to retrieve the
data for this query is identified. The distributed query processing algorithms must take several factors into consideration.
One is the data transfer costs. To complete some queries, it is necessary to move intermediate
results and possibly final results over the network. Depending on the type of network, this data
transfer might be relatively expensive, and if so, minimizing how much data is transferred is a
consideration in how the query is executed. Also, the initial design of the database should
consider the necessity of performing joins on tables which are not located at the same site. If
two tables will often need to be joined, the design should strongly consider storing a copy of
each table at the same site. When joins do need to be performed across the network, an
operation called a semijoin is often used. With a semijoin, only the joining column of one of the
tables is sent across the network to where the second table is located. The column is then
joined with the second table. Only the resultant tuples are then returned to the site of the first
table where they are joined with the first table. So rather than sending an entire table across
the network, only one column of the first table is sent over and then only the required subset of
tuples from the second table is sent back.
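To make the data movement concrete, the sketch below mimics the three steps of a semijoin in Python using two invented tables, EMPLOYEE at site 1 and DEPARTMENT at site 2; the table and column names are made up for the example.

# Minimal sketch of a semijoin between two sites (illustrative only).
employee = [                       # stored at site 1
    {"ssn": "111", "name": "Ann", "dno": 5},
    {"ssn": "222", "name": "Bob", "dno": 4},
]
department = [                     # stored at site 2
    {"dnumber": 4, "dname": "Admin"},
    {"dnumber": 5, "dname": "Research"},
    {"dnumber": 7, "dname": "Headquarters"},
]

# Step 1: ship only the joining column of EMPLOYEE to site 2.
join_values = {e["dno"] for e in employee}

# Step 2: at site 2, keep only the DEPARTMENT tuples that actually join.
matching = [d for d in department if d["dnumber"] in join_values]

# Step 3: ship the matching tuples back to site 1 and complete the join there.
result = [{**e, "dname": d["dname"]}
          for e in employee for d in matching
          if e["dno"] == d["dnumber"]]
print(result)   # the Headquarters tuple never crossed the network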
Query optimization is a very interesting topic which time limitations did not allow us to cover in
this course. If you would like to study this topic, start with Chapters 18 and 19 in the text. This
will provide the background to more completely study Section 23.5 in the text and additional
material in the literature.
M 14.5: Advantages and Disadvantages
This sub-module highlights some of the advantages and disadvantages of distributed databases.
Centralized Database
Advantages include:
• A high level of control over security, concurrency, and backup and recovery since it can be covered at one site.
• No need to deal with the various tables and directories needed to manage distribution.
• No need to worry about distributed joins and related issues.
Disadvantages include:
• All database access from outside the site where the database is located requires WAN communication and related costs.
• The database site can be a bottleneck.
• If the site goes down, there is no database access. This can cause availability issues.
Distributing Tables to Various Sites with No Replication and No
Partitioning
Advantages include:
• Local autonomy.
• Reduced communications costs when each table is located at the site that uses it most often.
• Improved availability, since some parts of the database remain available even if one or more of the sites are down.
Disadvantages include:
• Security, concurrency, and backup and recovery concerns are spread across multiple sites.
• Requires a distributed directory and related global tables, as well as the software required to support transparency.
• Requires distributed joins.
Distributed Including Replication
Advantages beyond distributing tables include:
• Reduced communications cost for queries, since copies of tables can be kept at all sites which heavily use the table.
• Additional improvement to availability: if a site goes down, another site may have copies of one or more of the tables hosted by the site which went down.
Disadvantages beyond distributing tables include:
• Additional concurrency control is required across multiple sites when data in replicated tables is updated.
Distributed Including Partitioning
Advantages include:
• Highest level of local autonomy: data at the tuple or attribute level can be stored at the site that most heavily uses it.
• Additional reduction of communication costs, for the same reason.
Disadvantages include:
• Retrieving an entire table or a large part of the table might require accessing multiple sites.
M 15.1: Introduction
Over time, many organizations found that they needed to manage large amounts of data which did not
fit nicely into the relational database model. Newer applications also required this ability.
Examples of such organizations include Google, Amazon, Facebook, and Twitter. The
applications include social media, Web links, user profiles, marketing and sales databases,
navigation applications, and email. The systems which were created to manage this type of data
are generally referred to as NoSQL systems, or Not Only SQL systems. Such systems focus on
the storage of semi-structured data, on high performance, on high availability, on data
replication, and on scalability. This contrasts with traditional databases which focus on
immediate data consistency, powerful query languages, and the storage of structured data.
Many of the needs of the above organizations and applications did not match well with the
features and services provided by traditional databases. Many services provided by the
traditional databases were not needed by many of the applications and the structured data
model was too restrictive. This led many companies such as Google, Amazon, and Facebook to
develop their own systems to provide for the data storage and retrieval needs of their various
applications. Additional NoSQL systems have been developed for research and other uses.
M 15.2: Characteristics and Categories
of NoSQL Systems
Characteristics of NoSQL Systems
Although NoSQL systems do differ from each other, there are some general characteristics
which can be listed. The characteristics can be divided into two main groups: those related to
distributed databases and distributed systems, and those related to data models and query
languages.
Many of the features required by NoSQL systems are related to the distributed database
concepts discussed in the last module. The characteristics of NoSQL systems related to
distributed databases and systems include the following.
• Scalability: Since the volume of data handled by these systems keeps growing, it is important to the organizations that the capacity of the systems can keep pace. This is most often done with horizontal scalability: adding more nodes to provide more storage and processing power.
• Availability, Replication, and Eventual Consistency: Many of these systems require very high system availability. To accomplish this, data is replicated over two or more nodes. We saw in the last module that replication improves availability and also improves read performance, since a request can be handled by any of the replicated nodes. However, update performance suffers since the update must be made at all replicated nodes. The slowness of update performance is primarily due to the distributed commit protocols introduced in the last module. These protocols provide the serializable consistency required by many applications. However, many NoSQL applications do not require serializable consistency, and they implement a less rigorous form of consistency known as eventual consistency.
• Replication Models: There are two main replication models used by NoSQL systems. The first is called master-slave replication. This designates one copy as the master copy. All write operations are applied to the master copy and then propagated to the slave copies. This model usually uses eventual consistency, meaning that the slave copies will eventually be the same as the master. If reading is done at a slave copy, the read cannot guarantee that the value seen is the same as the current value in the master. The second model is called master-master replication. With this model, both reads and writes are allowed at any copy of the data. Since there may be concurrent writes to two different copies, the data item value may become inconsistent. This model requires a scheme to reconcile the inconsistency.
• Sharding of Files: Since many of the applications using NoSQL systems have millions of records, it is often not practical to store the entire file on one node. In these systems sharding (horizontal partitioning) is often used along with replication to balance the load on the system.
• High-Performance Data Access: To achieve quick response time when millions of records are involved, records are usually stored based on key values using either hashing or range partitioning (a small sketch of hash-based placement follows this list).
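As a rough sketch of the placement idea behind sharding and key-based access, the snippet below hashes a record key to choose a primary node and places replicas on the next nodes in sequence. The node names, the replica count, and the use of an MD5 hash are arbitrary choices for the illustration, not how any particular NoSQL system assigns records.

# Minimal sketch of hash-based sharding with replication (illustrative only).
import hashlib

NODES = ["node0", "node1", "node2", "node3"]
REPLICAS = 2     # each record is stored on this many consecutive nodes

def nodes_for_key(key):
    """Hash the key to pick a primary node, then take the next nodes as replicas."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    primary = h % len(NODES)
    return [NODES[(primary + i) % len(NODES)] for i in range(REPLICAS)]

print(nodes_for_key("user:1001"))   # the two nodes holding this record
print(nodes_for_key("user:2002"))   # a different key usually lands elsewhere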
The characteristics related to data models and query languages include the following:
• Not Requiring a Schema: This provides flexibility to NoSQL systems. A partial schema may be used to improve storage efficiency, but it is not required. Any constraints on the data would be provided by application programs.
• Less Powerful Query Languages: Many applications do not require a query language nearly as powerful as SQL. A basic API is provided to application programmers to read and write data objects (a small sketch follows this list).
• Versioning: Some systems provide for storage of multiple versions of a data item. Each is stored with a timestamp as to when the data item was created.
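To make the last two characteristics concrete, here is a tiny sketch of the kind of get/put interface such systems expose, with timestamp-based versioning. The TinyStore class and its methods are invented for the illustration and do not correspond to any real product's API.

# Minimal sketch of a get/put API with timestamped versions (illustrative only).
import time

class TinyStore:
    def __init__(self):
        self.data = {}                         # key -> list of (timestamp, value) versions

    def put(self, key, value):
        self.data.setdefault(key, []).append((time.time(), value))

    def get(self, key):
        versions = self.data.get(key)
        return versions[-1][1] if versions else None   # latest version wins

    def get_versions(self, key):
        return self.data.get(key, [])          # every stored version with its timestamp

store = TinyStore()
store.put("user:1001", {"name": "Ann", "city": "State College"})
store.put("user:1001", {"name": "Ann", "city": "Pittsburgh"})
print(store.get("user:1001"))                  # the most recent value
print(len(store.get_versions("user:1001")))    # 2 versions are kept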
Categories of NoSQL Systems
NoSQL systems are generally grouped into four major categories. Some systems fit into two or
more of the categories.
• Document-based NoSQL Systems: These systems store data in the form of documents. The documents can be retrieved by document id, but can also be searched through other indexes. An example of this type of system is MongoDB, which was developed by a New York based company. An open source version of this database is now available. (A small sketch of the document-store idea follows this list.)
• NoSQL Key-value Stores: These systems provide fast access through the key to retrieve the associated value. The value can be a simple record or object or can be a complex data item. An example of this type of system is DynamoDB, developed by Amazon for its cloud based services. Oracle, best known for its RDBMS, also offers this type of NoSQL system, which it calls Oracle NoSQL Database.
• Column-based or Wide Column NoSQL Systems: These systems partition a table by column into column families. Each column family is stored in its own files. Note that this is a type of vertical partitioning. An example of this type of system is BigTable, which was developed by Google for several of its applications including Gmail and Google Maps.
• Graph-based NoSQL Systems: In these systems, data is represented as graphs, and related nodes are found by traversing edges of the graph. An example of this type of system is Neo4J, developed by a company based in both San Francisco and Sweden. An open source version is available.
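The document-store idea, and its contrast with simple key-value access, can be sketched with ordinary Python dictionaries as shown below. The student documents, their fields, and the secondary index are made up for the illustration; a real system such as MongoDB offers a much richer query interface.

# Minimal sketch of document storage with id lookup and a secondary index (illustrative only).
documents = {
    "s1001": {"_id": "s1001", "name": "Ann Lee", "major": "CMPSC",
              "courses": [{"code": "CMPSC 431W", "grade": "A"}]},
    "s1002": {"_id": "s1002", "name": "Bob Wong", "major": "MATH",
              "courses": [{"code": "MATH 230", "grade": "B+"}]},
}

# Retrieval by document id is a direct key-value lookup.
print(documents["s1001"]["name"])

# A secondary index on "major" allows documents to be found by another field.
major_index = {}
for doc in documents.values():
    major_index.setdefault(doc["major"], []).append(doc["_id"])
print(major_index["CMPSC"])    # ids of every document whose major is CMPSC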
M 15.3: The CAP Theorem
The CAP Theorem
The CAP theorem is used to discuss some of the competing requirements in a distributed
system which uses replication. In the last module we introduced concurrency control
techniques which can be used with distributed databases. The discussion assumed that a key
aspect of the database is consistency. The database should not allow two different copies of the
same data item to contain different values. We saw that the techniques used to ensure
consistency often came with the price of slower update performance. This is at odds with the
desire of NoSQL systems to create multiple copies to improve the performance and availability
of the database. When discussing distributed databases, there is a range of consistency which
can be applied to replicated items. The levels of consistency can range from weak to strong.
Applying the serializability constraint is considered the strongest form of consistency. Applying
this constraint reduces the performance of update operations.
CAP represents three desirable properties of distributed systems: consistency, availability,
and partition tolerance. Consistency means that all nodes will have the same copies of
replicated data visible to transactions. Availability means that each read or write request will
either be processed successfully or will receive a message that the operation cannot be
completed. Partition tolerance means that the system can continue to operate if the underlying
network has a fault that requires that the network be split into two or more partitions that can
communicate only within the partition.
The CAP theorem states that it is not possible to guarantee all three of the desirable properties
at the same time in a distributed system with replicated data. The system designer will need to
choose which two will be guaranteed. In traditional systems, consistency is considered of prime
importance and must be guaranteed. In NoSQL systems a weaker consistency level is often
acceptable and guaranteeing the other two properties is important. NoSQL systems often adopt
the weaker form of consistency known as eventual consistency.
The specific implementations of eventual consistency can vary from system to system, but a
simplified description is that if a data item is not changed for a period of time, all copies will
eventually contain the same value, so consistency is obtained "eventually." Eventual
consistency allows the possibility that different read requests will see different values of the
same data item when each is reading from a different copy of the data at the same time. This is
considered an acceptable tradeoff by many NoSQL applications in order to provide higher
availability and faster response times.
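A minimal sketch of what eventual consistency allows is given below: two replicas of the same item briefly disagree after a write and converge once the asynchronous propagation step runs. The dictionaries simply stand in for the replicated copies.

# Minimal sketch of eventual consistency between two replicas (illustrative only).
replica_a = {"x": 1}
replica_b = {"x": 1}

# A client updates x at replica A; the change has not yet been propagated.
replica_a["x"] = 2
print(replica_a["x"], replica_b["x"])   # 2 1 -> two readers can see different values

# Once the asynchronous propagation step runs, the copies converge.
replica_b.update(replica_a)
print(replica_a["x"], replica_b["x"])   # 2 2 -> consistency is reached "eventually"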