M 3.1: Database Design Flow

When examining the pieces which go into database design and implementation, it is useful to keep in mind the overall flow of the process.

- Requirements collection and analysis: This important step is often downplayed in courses due to the necessity of focusing on the main concepts of the course. This should in no way minimize the importance of this step.
- Conceptual Design: The data requirements from the requirements collection are used in the conceptual design phase to produce a conceptual schema for the database.
- Functional Analysis: In parallel with the conceptual design, functional requirements are used to determine the operations which will need to be applied to the database. This process will not be discussed here as it is covered in other courses.
- Logical Design: Implementation starts here; the conceptual schema is transformed from a high-level data model into an implementation data model, producing a database schema.
- Physical Design: Specifies internal storage, file organizations, indexes, etc., and produces an internal schema. This type of design is important for larger databases, but is not needed when working with relatively small databases. Physical design will not be covered in this course.

This module presents basic ER model concepts for conceptual schema design.

M 3.2: The Building Blocks of the ER Model

In the Entity-Relationship (ER) Model, data is described as entities, relationships, and attributes. We will start by looking at entities and attributes.

An entity is the basic concept that is represented in an ER model. It is a thing or object in the real world. An entity may be something with a physical existence, such as a student or a building. An entity may also be something with a conceptual existence, such as a bank account or a course at a university. The concept of an entity is similar to the concept of an object in C++ or Java. Each entity has attributes, which are properties that describe the entity.

Attributes

There are many choices of attributes for the student entity. Some you might have considered include name, age, height, weight, gender, hair color, eye color, photo, social security number, student id, home address, home telephone, cell phone number, campus (or local) address, campus telephone, major, class rank, spouse name, how many children (and names), whether working on campus, car make and model, car license, etc. There are, of course, many additional attributes which are not listed above, and you might have listed some of them. Since a database represents only some aspect of the real world (sometimes referred to as a miniworld in the text), not all attributes we can think of will be included in the database, but only those required for the miniworld the database is representing. Which attributes to include is determined during this design step and is directed by the requirements.

There are several types of attributes that can be used in an ER model. They can be classified as follows:

Composite vs. Simple (Atomic) Attributes: A composite attribute can be divided into parts which represent more basic attributes. An example of this is an address, which can be subdivided into street, city, state, and zip (of course there are additional ways to subdivide an address). If an attribute cannot be further subdivided, it is known as a simple or atomic attribute. If there is no need to look at the components individually (such as the city part of an address), then the attribute (address) can be treated as a simple attribute.
Single-Valued vs. Multivalued Attributes: Most attributes can take only one value. These are called single-valued attributes. An example of a single-valued attribute would be the social security number of a student. An example of a multivalued attribute would be the names of dependents.

Stored vs. Derived Attributes: When two or more attribute values are related, it is sometimes possible to derive the value of one given the value of the other. For example, if age and birth date are both attributes of an entity (such as student), age can be stored directly in the database, in which case it is a stored attribute. Alternatively, age can be calculated from the birth date and the current date. If this option is chosen, age is calculated from (or derived from) the birth date and is known as a derived attribute.

NULL Value: In some cases, an attribute does not have an applicable value. When this is true, a special value, called the NULL value, is used. This can be used in a few cases. The first is when the value does not exist, for example the SSN of a spouse when the person is not married. The second is when it is known that the value exists, but is missing, such as a person's unknown birth date. Finally, NULL is used when it is not known whether the value exists, for example the license plate number of a person's car (since the person may not own a car).

Complex Attributes: Composite and multivalued attributes can be nested, e.g., Address(Street_Address(Number, Street, Apartment), City, State, Zip). These are called complex attributes.

Entity Types and Entity Sets

An entity is defined by its attributes. This is the type of data we want to collect about the entity. As a very simple example, for a student entity, we may want to collect name, student id, and address for each student. This would define the entity type, which can be thought of as a template for the student entity. Each individual would, of course, have his/her own values for the three attributes. The collection of all individual entities (i.e., the collection of all entity instances) of a particular entity type in the database at any point in time is known as an entity set or entity collection.

Keys and Domains

Keys

One important constraint that must be met by an entity type is called the key constraint or uniqueness constraint. An entity type will have one attribute (or a combination of attributes) whose values are unique for each individual entity in the entity set. Such an attribute is called a key attribute. For example, in a student entity, the student id can be used as a key since the id will be assigned to one and only one student. The values of a key attribute can be used to uniquely identify an entity instance. This means that no two entity instances may have the same value for the key attribute. We will discuss keys further in this and later modules.

Value Sets (Domains) of Attributes

Each simple attribute of an entity type is associated with a value set (or domain of values). This indicates the set of allowable values which may be assigned to that attribute for each entity instance. Value sets are normally not displayed on basic ER diagrams. For example, if we want to include the weight of a student in the database, we may indicate that the weight will be in pounds and we will store the weight rounded to the nearest integer. The integer values may range from 50 to 500 pounds.
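To tie the attribute categories and the idea of a value set together, here is a minimal sketch of a student entity type in Python. The code is purely illustrative; the attribute names, the dataclass representation, and the 50-500 pound check are choices made for this sketch, not something prescribed by the text.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class Address:               # composite attribute: built from simpler parts
    street: str
    city: str
    state: str
    zip_code: str

@dataclass
class Student:               # entity type: a template for individual student entities
    student_id: str          # key attribute: unique for each entity instance
    name: str                # simple (atomic) attribute
    address: Address         # composite attribute
    weight: Optional[int] = None                          # value set: integer pounds, 50-500
    dependents: list[str] = field(default_factory=list)   # multivalued attribute
    birth_date: Optional[date] = None                     # None plays the role of NULL

    @property
    def age(self) -> Optional[int]:
        """Derived attribute: computed from birth_date rather than stored."""
        if self.birth_date is None:
            return None
        today = date.today()
        had_birthday = (today.month, today.day) >= (self.birth_date.month, self.birth_date.day)
        return today.year - self.birth_date.year - (0 if had_birthday else 1)

    def weight_in_domain(self) -> bool:
        """Check the value set (domain) for weight: integers from 50 to 500 pounds."""
        return self.weight is None or 50 <= self.weight <= 500
```

An entity set would then simply be the collection of all Student instances in the database at a point in time.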
Most basic domains are similar to data types such as integer, float, and string found in most programming languages.

M 3.3: Initial Design of Database

As indicated earlier, the database design starts with a set of requirements. We will work through a simple example where we build an ER diagram from a set of requirements. A simple set of requirements for the example is: This database will contain data about certain aspects of a university.

1. The university needs to keep information about its students. It wants to store the university-assigned ID of the student. This is recorded as a five-digit number. The university also wishes to keep the student's name. At this time, the name can be stored as a full name, in first name-last name order. There is no need to break the name down any further. The major program of the student needs to be recorded using the five-letter code assigned by the university. The student's class rank also needs to be kept as a two-letter code (FR for freshman, SO for sophomore, etc.). Finally, the university wants to keep track of the academic advisor assigned to the student. The advisor will be a faculty member in the student's department and will be identified by the faculty ID assigned to the advisor.

2. The university also needs to keep information about its faculty. It wants to store the university-assigned ID of the faculty member. This is recorded as a four-digit number. The faculty member's name must also be kept and, like the name of a student, it will be stored as a full name and will not need to be broken down any further. Finally, each faculty member is assigned to a department and this will be stored as a five-letter code, which also represents the major program of a student.

3. Information about each department will also be kept. This includes the assigned code which represents the department, the full department name, and a code for the building in which the department office is located.

4. Course information must also be kept. This includes the course number, which consists of the code for the department offering the course and the number of the course within the department. This is stored as a single item, such as CMPSC431. The course name will also be kept. This name will be as shown in the university catalog and may include abbreviations, such as Database Mgmt Sys. The credit hours earned for the course will be kept, as will the code for the department offering the course. The code will most often be the first part of the course number, but the department code will be repeated as a separate item.

5. Finally, information must be maintained about each section of a course. A section ID will be assigned to each section offered. This is a six-digit number representing a unique section - a specific offering of a specific course. For each section, the course number, the semester offered, and the year offered must also be stored.

These requirements are admittedly not as comprehensive as they would be had a full requirements study been done, but they will suffice for our simple example. The next step is to use the requirements to determine the entities which should be included in the design. For each entity, the attributes describing the entity should then be listed. After this, the first part of the ER diagram will be drawn. Take a few minutes to develop the list of entities and associated attributes on your own. Then watch the video where we will work through this together and introduce the first part of the ER diagram.
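For reference, here is one reasonable first pass at the entities and attributes, written as plain Python data. The particular attribute names (Student_id, Credit_hours, and so on) are one possible reading of the requirements for this sketch, not necessarily the names developed in the video.

```python
# A possible first-pass entity/attribute list drawn from requirements 1-5.
# The first attribute listed for each entity is a candidate key.
entities = {
    "STUDENT":    ["Student_id", "Name", "Major", "Class_rank", "Advisor"],
    "FACULTY":    ["Faculty_id", "Name", "Department"],
    "DEPARTMENT": ["Code", "Name", "Building"],
    "COURSE":     ["Number", "Name", "Credit_hours", "Department"],
    "SECTION":    ["Section_id", "Course_number", "Semester", "Year"],
}
```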
I encourage you to get in the habit of writing a "first draft" of your ER Diagram. Then take a look at the draft and determine how the positioning and spacing of the diagram looks. Then note modifications and adjustments you want to make to the visual presentation of the diagram. Then draw a "clean" copy of the diagram, possibly using a tool such as Visio. Sometimes you can draw a first draft followed by the final diagram. Other times you may need to work through several drafts before producing the final copy. Below is a final copy of the diagram for the draft copy shown on the white board in the above video. It also includes a few additional points concerning ER diagrams.

ER Diagram and Additional Notes

Figure 3.1: ER Diagram - Part 1

There are a few things to note here. As I suggested, I took my initial draft from the white board and adjusted it when I created the diagram. Also, I cheated a bit since I knew where the diagram was heading in the next step and "prearranged" the placement here. Normally this step and the next are taken together when drawing the draft diagram, so the initial draft would include more than this draft did. Note that in this diagram attribute names are reused for different entities (ID is an example of this). This works fine from a "syntax" standpoint, but is not encouraged unless the ID values are the same. In this case they are not, and it is better in most cases to give them unique names such as Student_id, Faculty_id, etc.

As a reminder, the following conventions are used:

- Entities are placed in a rectangular box. The entity name is included inside the box in all capitals.
- Attributes are placed in an oval. The attribute name is included inside the oval with only the first letter capitalized. If multiple words are used for clarity, they are separated by underscores. Words after the first are not capitalized.
- Attribute names are underlined for attributes which are the key (or part of the key) for the entity.
- Attributes are connected to the entity they represent by a line.

M 3.4: Relationship Types, Relationship Sets, Degree, and Cardinality

When looking at the preliminary design of entity types, it can be seen that there are some implicit relationships among some of the entity types. In general, when an attribute of one entity type refers to another entity type, some relationship exists between the two entities. In the example database, a faculty member is assigned to a department. The department attribute of the faculty entity refers to the department to which the faculty member is assigned. In this module, we will work through how to categorize these relationships and then how to represent them in an ER diagram.

In a pattern similar to entity types and entity instances, the term relationship type refers to a relationship between entity types. The term relationship set refers to the set of all relationship instances among the entity instances. The degree of a relationship type is the number of entities which participate in the relationship. The most common is a relationship between two entities. This is a relationship of degree two and is called a binary relationship. Also fairly common are relationships among three entities (relationships of degree three), which are called ternary relationships. These and other higher-degree relationships tend to be more complex and will be further discussed in the next module. We will also discuss in the next module the possibility of a relationship between an entity and itself.
This is often called a unary relationship, and is referred to in the text as a recursive or self-referencing relationship. For the rest of this module, we will focus only on binary relationships.

Relationship types normally have constraints which limit the possible combinations of entities that may participate in the relationship set. For example, the university may have a rule that each faculty member is placed administratively in exactly one department. This constraint should be captured in the schema. The two main types of constraints on binary relationships are cardinality and participation.

The cardinality ratio for a binary relationship is the maximum number of entity instances which can participate in a particular relationship. In the above example of faculty and departments, we can define a BELONGS_TO relationship. This is a binary relationship since it is between two entities, FACULTY and DEPARTMENT. Each faculty member belongs to exactly one department. Each department can have several faculty members. For the DEPARTMENT:FACULTY relationship, we say that the cardinality ratio is 1:N. This means that each department has several faculty (the "several" is represented by N, indicating no maximum number), while each faculty member belongs to at most one department. The cardinality ratios for binary relationships are 1:1, 1:N, and M:N. Note that the text also shows N:1, but this is usually converted to a 1:N by reversing the order of the entities. Note that so far we have considered the maximum number of entity instances which can participate in a relationship. We will wait until the next module to discuss whether it makes sense to talk about a minimum number of instances which must participate.

An example of a 1:1 binary relationship would be if we wished to keep the department chair in the database. To simplify, we will specify that the chair is a faculty member of the department, which is true in the vast majority of cases. In this case, a faculty member would be the chair of at most one department, and a department would have at most one chair. 1:1 binary relationships are explored in more depth in the next sub-module.

If the relationship between two entities is such that an instance of entity one may be related to several instances of entity two, and also an instance of entity two may be related to several instances of entity one, we say that this is an M:N binary relationship. The use of M and N, rather than using N twice, indicates that the number of related instances need not be the same in both directions. We will look more closely at M:N relationships in the next module.

The next step is to examine the entities and their attributes to determine any relationships which exist among the entities. For each relationship, the degree of the relationship should be determined, as should the cardinality ratio of the relationship. After this, the next part of the ER diagram will be drawn. Take a few minutes to develop the list of relationships on your own. Be sure to determine both the degree and cardinality ratio. To keep this simple at first, we will look at only 1:1 and 1:N binary relationships. Then watch the video where we will work through this together and add the next part of the ER diagram.
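A cardinality ratio can also be read as a constraint on the relationship set itself. The short sketch below (Python used only for illustration; the IDs and department codes are made up) checks the 1:N ratio for BELONGS_TO by verifying that no faculty member appears in more than one relationship instance.

```python
# BELONGS_TO represented as a set of relationship instances: (faculty_id, dept_code).
belongs_to = {
    ("1001", "CMPSC"),
    ("1002", "CMPSC"),   # a department may appear many times (the N side)
    ("2001", "STATS"),
}

def satisfies_1_to_n(pairs):
    """True if each faculty member belongs to at most one department (the 1 side)."""
    seen = set()
    for faculty_id, _dept_code in pairs:
        if faculty_id in seen:
            return False
        seen.add(faculty_id)
    return True

print(satisfies_1_to_n(belongs_to))   # True for the sample data above
```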
Hopefully this video reinforces the suggestion to develop the habit of writing a "first draft" of your ER Diagram. Positioning and spacing are less of an issue when beginning work with only the entities and attributes, as we did in the earlier video. When you begin to include relationships, the position of the entities becomes important to permit the relationships to be added while keeping a clean look to the diagram. Note that the overall positioning of the entities was not too bad in the video, but it would benefit from minor adjustments. The spacing between entities in the draft developed in the video was not sufficient in a few cases. The relationship diamond was too cramped between entity boxes. Making the lines longer would make the cardinality labels (the 1 and the N) much easier to read. The choice of names is not always straightforward. We will discuss some guidelines to consider and naming suggestions in the next module. Below is a final copy of the diagram for the draft copy shown on the white board in the above video. The attributes are also included in the diagram below.

ER Diagram and Additional Notes

Figure 3.2: ER Diagram - Part 2

There are a few things to note with this diagram also. The first item is that on the board I used lower case with an initial capital when I put the names in the relationship boxes. This does not follow the convention used in the text where relationship names are entered in all caps. I followed the text convention when I drew the final copy of the ER diagram. Next, binary relationship names are normally chosen so that the ER diagram is read from left to right and top to bottom. To make the diagram consistent with this guideline, I changed the name of the relationship between student and department from MAJORS to MAJORS_IN. Notice that a binary relationship can be "read" in either direction with a slight modification of the name. Since the SECTION entity was below the COURSE entity in the diagram on the board, I used the name OFFERING for the relationship. Since I moved SECTION to the left of COURSE in the final diagram, I changed the relationship name to OFFERING_OF since a section is a specific offering of a course. Finally, I was going to change the name of the relationship between DEPARTMENT and COURSE to OFFERS to make it read from top to bottom. I then realized that I did not like the use of a form of "offer" twice, so I changed the name of the relationship to OWNS. Also note that the diagram on the board was somewhat cramped. The cardinality ratios were appropriately placed on the participating edge, but it did not clearly show that it is customary to place the cardinality ratio closer to the relationship diamond than to the entity rectangle.

As a reminder, the new conventions are listed first below. They are followed by conventions we have used before.

- Relationships are placed in a diamond. The relationship name is included inside the diamond in all capitals.
- For a binary relationship, the cardinality ratio value is placed on the edge between the entity rectangle and the relationship diamond. It is placed closer to the relationship diamond.
- Entities are placed in a rectangular box. The entity name is included inside the box in all capitals.
- Attributes are placed in an oval. The attribute name is included inside the oval with only the first letter capitalized. If multiple words are used for clarity, they are separated by underscores. Words after the first are not capitalized.
- Attribute names are underlined for attributes which are the key (or part of the key) for the entity.
- Attributes are connected to the entity they represent by a line.

M 3.5: An Example of a 1:1 Binary Relationship

In the previous sub-module we proposed an example of a 1:1 binary relationship.
The example considered a desire to keep the department chair in the database. We added the simplification that the chair is a faculty member of the department, which is true in the vast majority of cases. In this case, a faculty member would be the chair of at most one department, and a department would have at most one chair. We can call the relationship CHAIR_OF. Another simplification is that each department has a chair (it is possible that the spot is vacant, but we will ignore that for now). Of course, there would be many faculty who are not chairs, but this does not invalidate the 1:1 relationship. This relationship will not be added to the ER diagram from the last sub-module. The ER diagram below shows how this part would be included. It includes the new CHAIR_OF relationship with the BELONGS_TO relationship included for context. No new symbols are needed for this relationship. The cardinality ratio is shown as before, but in this case both values are 1.

Figure 3.3: ER Diagram Showing 1:1 Binary Relationship

M 4.1: Continued Design of a Database

For our example in the last module, we worked from a set of requirements which produced a conceptual design containing several entities and their associated attributes. It also contained several 1:N relationships. Here, we expand the example. This starts by expanding the requirements and then by adding the necessary parts to the design and ER diagram. An expanded set of requirements for the example is given below. Note that only point #6 has been added; the rest of the requirements remain the same.

This database will contain data about certain aspects of a university.

1. The university needs to keep information about its students. It wants to store the university-assigned ID of the student. This is recorded as a five-digit number. The university also wishes to keep the student's name. At this time, the name can be stored as a full name, in first name-last name order. There is no need to break the name down any further. The major program of the student needs to be recorded using the five-letter code assigned by the university. The student's class rank also needs to be kept as a two-letter code (FR for freshman, SO for sophomore, etc.). Finally, the university wants to keep track of the academic advisor assigned to the student. The advisor will be a faculty member in the student's department and will be identified by the faculty ID assigned to the advisor.

2. The university also needs to keep information about its faculty. It wants to store the university-assigned ID of the faculty member. This is recorded as a four-digit number. The faculty member's name must also be kept and, like the name of a student, it will be stored as a full name and will not need to be broken down any further. Finally, each faculty member is assigned to a department and this will be stored as a five-letter code, which also represents the major program of a student.

3. Information about each department will also be kept. This includes the assigned code which represents the department, the full department name, and a code for the building in which the department office is located.

4. Course information must also be kept. This includes the course number, which consists of the code for the department offering the course and the number of the course within the department. This is stored as a single item, such as CMPSC431. The course name will also be kept. This name will be as shown in the university catalog and may include abbreviations, such as Database Mgmt Sys.
The credit hours earned for the course will be kept, as will the code for the department offering the course. The code will most often be the first part of the course number, but the department code will be repeated as a separate item.

5. Information must be maintained about each section of a course. A section ID will be assigned to each section offered. This is a six-digit number representing a unique section - a specific offering of a specific course. For each section, the course number, the semester offered, and the year offered must also be stored.

6. Finally, a transcript must be kept for each student. This must contain data which includes the ID of the student, the ID of the section, and the grade earned by the student in that section. Of course, there will be several instances for each student, one for each section that was taken by the student.

Although these requirements are slightly expanded, they are still not comprehensive. Again, they will suffice for our expanded, but still relatively simple, example. Since the earlier requirements were not modified, and the only change was an additional requirement (requirement #6), the next step is to use the modified requirements to determine if any of the original entities and their attributes need to be modified. If so, the modifications should be noted. Then any new entities which should be included in the design should be noted. For each new entity, the attributes describing the entity should then be listed. After this, an expanded ER diagram will be drawn. Take a few minutes to develop the list of modifications as well as new entities and associated attributes on your own. Then keep this list handy as you move to the next sub-module.

M 4.2: The M:N Relationship

Hopefully, as you looked at the new set of requirements, you realized that with the first five requirements remaining unchanged, the entities and attributes from the last module will not need to be modified. The relationships identified in the last module will also not need to be modified. That leads to the sixth (the new) requirement. Will it dictate changes to any of the earlier work with entities or attributes? Will new entities, attributes, and/or relationships need to be created?

At first it seems that a new entity, TRANSCRIPT, should be created. It will have attributes STUDENT_ID, SECTION_ID, and GRADE. After noting that a grade is associated with a specific student in a specific section, realize that a student will be in many sections while in school and each section will consist of many students. This leads to the realization that TRANSCRIPT is different. If STUDENT_ID is chosen as the key (remember that a key must be unique), SECTION_ID and GRADE would need to become multivalued attributes since each student would have many sections and grades associated with the STUDENT_ID. If SECTION_ID is chosen as the key, STUDENT_ID and GRADE would need to become multivalued attributes since each section would have many students and grades associated with the SECTION_ID. We will see in the next module that the relational model does not allow attributes to be multivalued. Since that is the case, neither STUDENT_ID nor SECTION_ID can be the key for a TRANSCRIPT entity. This will be further discussed in the module on normalization later in the course. Since a given student will take many sections over his/her time at the university, and since a given section will contain many students, the relationship between student and section is many-to-many.
A many-to-many relationship is called an M:N relationship. Based on this, rather than being an entity like we saw in the last module, the TRANSCRIPT "entity" is actually formed from the relationship between the STUDENT entity and the SECTION entity. It will be shown as a relationship (diamond) in the ER diagram.

What about the GRADE attribute? It will not be an attribute of the STUDENT entity since a given student will have many grades. This would lead to GRADE being a multivalued attribute. Similarly, GRADE will not be an attribute of the SECTION entity since each section having many students would again lead to GRADE being a multivalued attribute. It turns out that GRADE is an attribute of the relationship. As such, the GRADE attribute will be shown in the ER diagram in an oval which is connected to the diamond containing the TRANSCRIPT relationship. Note that TRANSCRIPT is not a good name for a relationship. Even though I used that name in the video, a more appropriate name for a relationship is used in the ER diagram presented after the video. We will see in later modules how this relationship will actually become an entity in the schema design.

What about keys for this relationship? Since it is represented in the diagram as a relationship with a single attribute, GRADE, there is no actual key identified for the relationship (note that GRADE does not qualify). As such, GRADE is not underlined in the ER diagram. Later, as we develop a schema design, we will develop this as an entity with the joint key of STUDENT_ID and SECTION_ID. This is not stated at the ER diagram level, but will come out during schema design.

At this point, take a few minutes to sketch how you think this would look when added to the ER diagram. Then watch the video, "ER Diagram - Part 3," where we will work through this together and update the ER diagram.

ER Diagram - Part 3

ER Diagram and Additional Notes

Figure 4.1: ER Diagram Showing Inclusion of M:N Relationship

There are a few things to note with this diagram. First, since grade is actually an attribute of the relationship rather than an attribute of one of the entities, the grade attribute is attached to the relationship diamond. On the board I used the name TRANSCRIPT for the relationship. Not following my sample diagram closely enough during the production of the video, I looked at the new requirement instead. Since the requirement discusses a transcript, I used that name on the board. Note that this is an entity-type name since "transcript" looks like an entity on first reading. Since this is actually a relationship, I changed the name to the more relationship-appropriate name ENROLLED_IN in the final diagram. This indicates that a student is (or was, depending on the time we are looking at it) enrolled in a particular section. Also on the board, I used N as the cardinality value on both sides of the transcript diamond and noted that the N can represent a different value on the two sides. It just means "many" in this context. This is common with many, but not all, authors even though the many-to-many relationship is indicated by M:N. The authors of the text use M on one side and N on the other as a reminder that the values can be different. I used this in the final diagram. Note as you review the chapter that they do switch to using N on both sides when they introduce the (min, max) cardinality notation later in the chapter. We need no new symbols to represent the M:N relationship.
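Because neither ID is unique on its own, a convenient way to picture the ENROLLED_IN relationship set is as a mapping keyed by the (student_id, section_id) pair, with GRADE attached to each relationship instance. The sketch below is only an illustration in Python; the IDs and grades are invented, and the pair becoming a joint key is the schema-design step described above.

```python
# ENROLLED_IN as an M:N relationship set; GRADE is an attribute of the relationship.
enrolled_in = {
    ("12345", "600001"): "A",    # student 12345 in section 600001
    ("12345", "600002"): "B+",   # same student, another section (the N side)
    ("54321", "600001"): "A-",   # same section, another student (the M side)
}

student_ids = [s for (s, _) in enrolled_in]
section_ids = [c for (_, c) in enrolled_in]

# Duplicates on both sides show that neither attribute alone can be the key;
# only the (student_id, section_id) pair identifies a relationship instance.
print(len(student_ids) != len(set(student_ids)))   # True
print(len(section_ids) != len(set(section_ids)))   # True
```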
ER Diagram Naming Guidelines

The choice of names for entity types, attributes, and relationship types is not always straightforward. We will discuss some guidelines to consider and naming suggestions. The choice of names for roles will be discussed in the Recursive Relationships sub-module.

- As much as possible, choose names that convey the meaning of the construct being named.
- The text chooses to use singular (rather than plural) names for entity types since the name applies to each individual entity instance in the entity type.
- Entity type and relationship type names are in all caps, while attribute names and role names have only the first letter capitalized.
- When looking at requirements for the database, some nouns lead to entity type names. Other nouns describe these nouns and lead to attribute names. Verbs lead to the names of relationship types.
- When choosing names for binary relationships, attempt to choose names that make the ER diagram readable from left to right and from top to bottom. Note that when it seems a relationship type name should read from bottom to top, adjusting the name can usually allow the reading to go from top to bottom.

M 4.3: Relationship Types of Degree Higher than Two

In Module 3 we defined the degree of a relationship type as the number of entities which participate in a relationship. So far, we have only looked at relationships of degree two, which are called binary relationships. It is worth noting again that degree and cardinality should not be confused. Binary relationships (degree two) have cardinality of 1:1, 1:N, and M:N. Here we examine relationships of degree three (ternary relationships). We will also look at the differences between binary and higher-degree relationships. We will consider when to choose each, and we will conclude by showing how to specify constraints on higher-degree relationships.

Consider a store where the entities CUSTOMER, SALESPERSON, and PRODUCT have been identified. Assume the store, such as an appliance store or a furniture store, sells higher priced items where, unlike other stores such as a grocery store, most customers buy only one or a few items at a time from a salesperson who is on commission. The customer is identified by a customer number, the product is identified by an item number, and the salesperson is identified by a salesperson number. It is desirable to keep track of which items a customer purchased on a particular day from which salesperson. Although not common, a customer may buy more than one of a particular item on a given day. An example would be to buy two matching chairs. Since that is a possibility, we want to record the quantity of a particular item which is purchased as well as the date purchased. We can then define a ternary relationship, which we'll call PURCHASE, between CUSTOMER, SALESPERSON, and PRODUCT. Note that other names for the relationship will work, and may be preferable, depending on the actual application. Also note that the three entities will have other attributes. To avoid clutter which may detract from the main point being demonstrated here, only the key attribute is shown for the three entities. How this is represented in ER diagram notation is shown in the figure below.

ER Diagram Representation of a Ternary Relationship

The ER diagram above shows how the three entities are related.
Since the date of sale and the quantity sold are both part of an individual purchase involving the customer, salesperson, and product (there may be several of these purchases over time), these two attributes are attached to the PURCHASE relationship rather than to any of the individual entities.

It is possible to define binary relationships between the three entities. We can define an M:N relationship between CUSTOMER and SALESPERSON. Let's call it BOUGHT_FROM. We can also define an M:N relationship between CUSTOMER and PRODUCT and name the relationship BOUGHT. Finally, we can define an M:N relationship between SALESPERSON and PRODUCT and call it SOLD. A representation of this in ER diagram notation is shown below.

ER Diagram Representation of Three Binary Relationships among the Three Entities

Are the three binary relationships, taken together, equivalent to the ternary relationship? At first glance, the answer appears to be "yes." However, closer examination shows that the two methods of representation are not exactly equal. If customer Fred buys two matching chairs from salesperson Barney on November 15, 2020, that information can be captured in the ternary relationship (PURCHASE in this case). In the three binary relationships we can capture much, but not all, of the information. The fact that Fred bought items from Barney can be captured in the BOUGHT_FROM relationship. The fact that Fred bought at least one chair can be captured in the BOUGHT relationship. The fact that Barney sold chairs can be captured in the SOLD relationship. What is not clearly captured is that Barney sold the chairs to Fred. It is possible that Fred bought something else, say a table, from Barney. Barney sold chairs to someone, say Wilma. Fred bought the chairs, but not from Barney; maybe he bought them from Betty. If you examine this closely, this set of facts will generate the same information in the binary relationships as the sale in the last paragraph did. Also, it would not be possible to capture the date of sale or quantity sold in the same manner in three binary relationships. This should become even clearer in the next module which discusses relational database design.
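The information loss can be made concrete with a small sketch. Below, two different sets of ternary PURCHASE facts are projected onto the three binary relationships; the customer, salesperson, and product names follow the example above, and the code is only an illustration of the argument, not part of the text.

```python
def to_binaries(purchases):
    """Project ternary PURCHASE facts (customer, salesperson, product) onto binaries."""
    bought_from = {(c, s) for (c, s, _p) in purchases}   # CUSTOMER-SALESPERSON
    bought      = {(c, p) for (c, _s, p) in purchases}   # CUSTOMER-PRODUCT
    sold        = {(s, p) for (_c, s, p) in purchases}   # SALESPERSON-PRODUCT
    return bought_from, bought, sold

# Scenario A: Fred buys the chairs from Barney.
scenario_a = {("Fred", "Barney", "chair")}

# Scenario B: Fred buys a table from Barney, Fred buys the chairs from Betty,
# and Barney sells chairs to Wilma.  Fred never buys chairs from Barney.
scenario_b = {("Fred", "Barney", "table"),
              ("Fred", "Betty", "chair"),
              ("Wilma", "Barney", "chair")}

bf, bo, so = to_binaries(scenario_b)
# Every binary fact produced by scenario A is also present in scenario B's binaries,
# so the three binary relationships cannot tell the two situations apart.
print(("Fred", "Barney") in bf, ("Fred", "chair") in bo, ("Barney", "chair") in so)
# -> True True True
```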
M 4.4: Recursive Relationships

In some cases, a single entity participates more than once in a relationship type. This happens when the entity takes on two different roles in the relationship. The text refers to this as a recursive relationship or self-referencing relationship. Looking at this further, this is a relationship of degree one since there is only one entity involved in the relationship. A relationship of degree one is known as a unary relationship. A cardinality can be assigned to such a relationship depending on the number of entity instances that can participate in one role and the number of entity instances that can participate in the other role.

As an example, consider the FACULTY entity. Assume that all department chairs are also faculty. As in the past, this is mostly true, but not always. We will make the assumption for simplicity. If dictated by the requirements, we can consider a SUPERVISES relationship where the chair supervises the other faculty in the department. Since only one entity instance represents the "chair side" of the relationship, but many entity instances represent the "other faculty in the department" side of the relationship, the cardinality of the relationship is 1:N. This, then, represents a 1:N unary relationship.

In a unary relationship, role names become important since we are trying to capture the role played by the participating entity instances. We can define one role as chair, and the other as regular faculty. There are probably better names for the roles (especially the second one). Can you think of any? This relationship will again not be added to the larger ER diagram. The ER diagram below shows how this relationship would be captured. Note how the roles are depicted.

Figure 4.4: ER Diagram Representation of a Recursive Relationship

M 4.5: Participation Constraints and Existence Dependencies

In the last module, we looked at cardinality ratios (1:1, 1:N, and M:N). These indicate the maximum number of entity instances which can participate in a particular relationship. Is there a concern about the minimum number of instances which can (in this case it would be must) participate? The text discusses this concept through participation constraints. The participation constraint indicates whether, in order to exist, an entity instance must be related to an instance of another entity involved in the relationship. Since this constraint specifies the minimum number of entities that can participate in the relationship, it is also called the minimum cardinality constraint. Participation constraints are split into two types - total and partial.

Going back to the ER diagram for the example database, FACULTY and DEPARTMENT were related by the BELONGS_TO relationship. If there is a requirement that all faculty members must be assigned to a department, then a faculty instance cannot exist unless it participates in at least one relationship instance. This means that a faculty instance cannot be included unless it is related to a department instance through a BELONGS_TO relationship instance. This type of participation is called total participation, which means that every faculty record must be related to a department record. Total participation is also called existence dependency.

As a different example, consider that STUDENT and DEPARTMENT are related by the MAJORS_IN relationship. If the requirement for students and majors specifies that a student is not required to have a declared major until junior year, that means that not every student is required to participate in the MAJORS_IN relationship at all times. This is an example of partial participation, meaning that only some or part of all students participate in the MAJORS_IN relationship.

The cardinality ratio and participation constraints taken together are known as the structural constraints of a relationship. This is also called the (min, max) cardinality. This is further discussed in the toggle at the end of this sub-module. For ER diagrams, the text uses cardinality ratio/single-line/double-line notation. To keep the example fairly simple, we will use Figure 4.1 as an example. This uses the cardinality ratio as we did in Figure 4.1. In that figure, we also used a line (single-line) to connect entities and attributes and to connect all entities and relationships. We will modify this to connect an entity to a relationship in which it totally participates by using a double line rather than a single line. If the participation is partial, then we will leave the connection as a single line.

Figure 4.1: ER Diagram Showing Inclusion of M:N Relationship

This forces us to obtain additional information about relationships as the requirements are gathered.
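Total participation is easy to state as a check over the relationship set. The sketch below (illustrative Python; the IDs and codes are made up) flags any FACULTY instance that does not participate in BELONGS_TO, which is exactly what the double line on the FACULTY side rules out.

```python
# Total participation of FACULTY in BELONGS_TO: every faculty instance must
# appear in at least one relationship instance.
faculty_ids = {"1001", "1002", "2001", "3001"}
belongs_to  = {("1001", "CMPSC"), ("1002", "CMPSC"), ("2001", "STATS")}

participating = {f for (f, _d) in belongs_to}
violations = faculty_ids - participating
print(violations)   # {'3001'}: this instance would not be allowed under total participation
```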
Below is shown the ER diagram of Figure 4.1 modified to include double-line connectors as required by the following assumptions.

- As in the example above, for the BELONGS_TO relationship we will assume that all faculty members must be assigned to a department. We will further assume that every department must have at least one faculty member. This implies that there is total participation in both directions, so there will be double lines connecting both entities to the relationship.
- As in the second example above, we will assume that a student is not required to declare a major until junior year. Since this indicates partial participation, there will be only a single line connecting STUDENT to the MAJORS_IN relationship. We will further assume that some departments may be "support departments" which do not offer majors, so this is partial participation also. This means that there will be single lines connecting both entities to the relationship.
- To continue with the relationships, assume that each student must have an advisor, but not all faculty advise students. So STUDENT fully participates in the ADVISED_BY relationship and should be connected by a double line. FACULTY, on the other hand, only partially participates, so it will be connected by a single line.
- For the OWNS relationship, we will assume that each department must own at least one course and every course must be owned by a department. This again means that there is total participation in both directions and there will be double lines connecting both entities to the relationship.
- For the OFFERING_OF relationship, each section must be an offering of a particular course, so there is total participation and a double line in that direction. To allow the possibility that a new course can be added without immediately being scheduled, we will not require that every course have a section associated with it (although we assume that it will in the near future). This is then partial participation and the connection is a single line.
- Finally, for the ENROLLED_IN relationship, we will allow a student (probably a new student) to be entered into the STUDENT entity even if the student is not yet enrolled in courses. This implies partial participation in the relationship. Similarly, we will allow a new section to be posted to the SECTION entity before any students are enrolled. Again, this implies partial participation in the relationship. Since both sides have partial participation, the single lines will remain in place.

The revised ER diagram below shows participation constraints.

Figure 4.5: ER Diagram Showing Participation Constraints

An Example of (min, max) Notation

We will start this example by again considering the BELONGS_TO relationship discussed above. In the example, we assumed that all faculty members must be assigned to a department. We will clarify that the faculty member can be assigned to only one department. We further assumed that every department must have at least one faculty member. We will clarify that most departments have several faculty members. Based on the assumptions, this gives DEPARTMENT:FACULTY a cardinality ratio of 1:N. Since every faculty member must be assigned to a department, and every department must have at least one faculty member, this implies that there is total participation in both directions. Total participation means that the minimum cardinality must be at least one. Based on this, the (min, max) notation for the department side of the relationship would be (1, N).
The department must have at least one faculty member and may have several faculty members. The (min, max) for the faculty side of the relationship is (1, 1). This indicates that the faculty member must be in at least one department and may be in at most one department. This was represented in Figure 4.5 by using double lines to connect both the FACULTY and DEPARTMENT entities to the BELONGS_TO relationship, indicating that the minimum cardinality on both sides is one. In Figure 4.5, the 1 on the DEPARTMENT side indicates that the maximum cardinality for FACULTY is one, while the N on the FACULTY side indicates that the maximum cardinality for DEPARTMENT is N.

Using the (min, max) notation in an ER diagram is one of the many alternative notations mentioned in the text. Using this notation for the BELONGS_TO relationship and the two entities would look like the following. Note that the attributes are omitted to allow focus on the notation.

Figure 4.6: ER Diagram Showing Use of (min, max) Notation

The notation used in the text for total participation (the double line) indicates that the minimum cardinality is one. What if the minimum cardinality is greater than one? This cannot be directly represented by the text notation. If there is a requirement that each department have at least three faculty assigned, that fact cannot be directly represented by the notation used in the text (nor can it be represented in most notations). It can only represent full or partial participation - a minimum of one or zero. The (min, max) notation allows representation of this. The (1, N) on the DEPARTMENT side would be replaced by (3, N), thereby indicating that there must be at least three faculty in the department.

A note is in order here about which side of the relationship should be used to place the various values. This is not consistent across various notations. Back in the 1990s, the terms Look Across and Look Here were introduced to indicate where the cardinality and participation constraints are placed in the various notations. Looking again at the BELONGS_TO relationship in Figure 4.5, the notation followed in the text uses Look Across for the cardinality constraints. A faculty member can work for only one department. A department can have many faculty. So the 1 for faculty is placed across the relationship diamond, on the DEPARTMENT side, while the N for the department is placed across the relationship, on the FACULTY side. Different notations use Look Here and would reverse the placement.

The text then switches to Look Here for the participation constraints. BELONGS_TO is not a good relationship to demonstrate this since the minimum cardinality on both sides of the relationship is one, so both entities are connected by double lines to the relationship. Looking at the ADVISED_BY relationship, we see that participation is partial on one side and total on the other side. Since a student must have an advisor, the STUDENT entity is connected by a double line to the ADVISED_BY relationship. Since the double line is on the STUDENT side, that is a Look Here notation. Similarly, some faculty do not have advisees, so that is partial participation (minimum cardinality of zero). This means that the FACULTY entity is connected to the relationship by a single line. Since this line is placed on the FACULTY side, it again shows the Look Here guidelines of the notation.

Returning to Figure 4.6, the text uses Look Here when illustrating the (min, max) notation.
A faculty member belongs to at least one and at most one department, so Look Here places the (1, 1) on the FACULTY side. A department has at least one, but possibly many, faculty members, so Look Here places the (1, N) on the DEPARTMENT side. As the text indicates, other conventions use Look Across when using (min, max) notation. To avoid confusion, we will stick with the primary notation used by the text. This is used in Figure 4.5. Just be aware that if you are involved with creating or reading ER diagrams in the future, be sure to check what notation is being used since it might be different from what we are using here.

M 4.6: Design Choices, Topics Not Covered, and Additional Notes

In Section 3.7.3, the text covers design choices for ER conceptual design. It is sometimes difficult to decide how to capture a particular concept in the miniworld. Should it be modeled as an entity type, an attribute, a relationship type, or something else? Some general guidelines are given in the text. It is important to again note that conceptual schema design overall, and the production of an ER diagram, should be thought of as an iterative process. An initial design should be created and then refined until the most appropriate design is obtained. The text suggests the following refinements.

In many cases, a concept is first modeled as an attribute and then refined into a relationship when it is determined that the attribute is a reference to another entity type. An example of this is the attribute "Advisor" for the STUDENT entity in sub-module 3.3. This attribute is a reference to the FACULTY entity type and is captured as the ADVISED_BY relationship type in sub-module 3.4. The text then indicates that, in the notation used in the text, once an attribute is replaced by a relationship, the attribute itself should be removed from the entity type to avoid duplication and redundancy. This removal was done in the "complete" ER diagram shown in Figure 3.2 in the text. However, this type of attribute was included in later figures (e.g., Figure 3.8) which show the development of an ER diagram and would be produced earlier in the design process. In the actual development process, Figure 3.2 would be produced at the end of the iterative process. The text shows it at the beginning of the chapter so it can refer back to it during the chapter. We have not removed those attributes in the diagrams produced so far. Many authors will not remove those attributes, depending on the ER diagram notation being used. They are removed in the diagram in the next module when the diagram is being used as an example for illustrating the steps of an algorithm for mapping an ER diagram to a relational model.

Sometimes an attribute is captured in several entity types. This occurred in the example where the department code was captured as an attribute in the FACULTY, COURSE, and STUDENT (as Major) entity types. This is an attribute which should be considered for "promotion" to its own entity type. Note that in the example, DEPARTMENT had already been captured as its own entity type, so this type of consideration did not apply to the example. If it had not already been captured as an entity type, such a promotion would be considered based on this guideline.

The reverse refinement can be considered in other cases. For example, suppose DEPARTMENT was identified in the initial design and the only attribute identified was Department_code.
Further, assume that this was only identified as needed as an attribute on one other entity type, say FACULTY. It should then be considered whether to "demote" DEPARTMENT to an attribute of FACULTY.

In order to avoid additional complexity, a few topics that could be covered in this module are intentionally not discussed in this course. Some are important in designing certain aspects of very large databases. Others can be included in the design at the conceptual (ER model) level, but will be "modified out" when the design is used to create a relational model. The intent here is to develop a good understanding of the basics. If you are working on a design, possibly even something you work on for a portion of your final project, and want to represent something that just doesn't seem to fit into what we have covered here, please come back to the text and explore the items we omitted. Some specifics are:

- Multivalued and composite attributes are not allowed in a relational schema. It is usually better to modify the conceptual design to not include them.
- Weak entity types (Section 3.5 in the text) can be handled by key selection in the relational schema. We will handle them the same way as we handled other entity types and relationships.
- Whether or not an attribute will be derived is a design choice better made when looking at the DBMS being used. A derived attribute will be specified like other attributes in the ER diagram.

It should again be pointed out that there are several alternate notations for drawing ER diagrams. Some of the alternate notations are shown in Appendix A of the text. Most have both good and not-so-good features. Often, it is a matter of what you have become accustomed to using or what your company has decided to use. It is not important in this course that you learn them all. You just need to be aware that alternatives do exist. We are not going to discuss (or use) UML Class Diagrams in this course. They are presented in Section 3.8 in the text. You are likely already familiar with them and have seen a more detailed presentation in earlier courses.

M 5.1: The Relational Database Model and Relational Database Constraints

The relational data model was first introduced in 1970 in a (now) classic paper by Ted Codd of IBM. The idea was appealing due to its simplicity and its underlying mathematical foundation. It is based on the concept of a mathematical relation which, in simple terms, is a table of values. The computers at that time did not have sufficient processing power to make a DBMS built on this concept commercially viable due to the slow response time. Two research relational DBMSs were built in the mid-1970s and became well known at the time: System R at IBM and Ingres at the University of California, Berkeley. During the 1980s, computer processing power had improved to the point where relational DBMSs became feasible for many commercial database applications, and several commercial products were released, most notably DB2 from IBM (based on System R), a commercial version of Ingres (which was faulted for its poor user interface), and Oracle (from a new company named Oracle). These products were well received, but at their initial release they still could not be used for very large databases, which required more processing power to use a relational database effectively. As computer processing power kept improving, by the 1990s most commercial database applications were built on a relational DBMS.
Today, major RDBMSs include DB2 and Oracle as well as Sybase (from SAP) and SQL Server and Microsoft Access (both from Microsoft). Open source systems are also available, such as MySQL and PostgreSQL. The mathematical foundations for the relational model include the relational algebra, which is the basis for many implementations of query processing and optimization. The relational calculus is the basis for SQL. Also included in the foundations are aspects of set theory as well as the concepts of AND, OR, NOT and other aspects of predicate logic. This topic is covered in much greater detail in Chapter 8 of the text. Please read this chapter if you are interested in this topic, but it will not be covered in this introductory course.

M 5.1.1: Concepts and Terminology

The relational model represents the database as a collection of relations. A relation is similar to a table of values, or a flat file of records. It is called a flat file since each record has a simple flat structure. A relation and a flat file are similar, but there are some key differences which will be discussed shortly. Note that a relational database will look similar to the example database from Module 1. When looking at a relation as a table of values, each row in the table represents a set of related values. This is called a tuple in relational terminology, but in common practice the formal term tuple and the informal term row are used interchangeably. The table name and the column names should be chosen to provide, or at least suggest, the meaning of the values in each row. All values in a given column have the same data type. Continuing with the more formal relational terminology, a column is called an attribute and a table is called a relation. Here also, in common practice the formal and informal terms are used interchangeably. The data type which indicates the types of values which are allowed in each column is called the domain of possible values. More details and additional concepts are discussed below.

Domains

A domain defines the set of possible values for an attribute. A domain is a set of atomic values. Atomic means that each value in the domain cannot be further divided as far as the relational model is concerned. Part of this definition does rely on design issues. For example, a domain of names which are to be represented as full names can be considered to be atomic in one design because there is no need to further subdivide the name for this design. Another design might require the first name and last name to be stored as separate items. In this design, the first name would be atomic and the last name would be atomic, but the full name would not be atomic since it is further divided in this design.

Each domain should be given a (domain) name, a data type, and a format. One example would be phone_number as a domain name. The data type can be specified as a character string. The format can be specified as ddd-ddd-dddd, where "d" represents a decimal digit. Note that there are some further restrictions (based on phone company standards) which are placed on both the first three digits (known as the area code) and the second set of three digits (known as the exchange). These can be completely specified to reduce the chance of a typo-like error, but this is usually not done. Another example would be item_weight as a domain name. The data type could be an integer, say from 0 to 500. It could also be a float from 0.0 to 500.0 if fractions are important, especially at lower weights. An additional format specification would not be needed in this case. However, in this case a unit of weight (pounds or kilograms) should be specified.
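As a concrete illustration of a domain as a name plus a data type plus a format, the sketch below checks values against the phone_number and item_weight domains just described. It enforces only the ddd-ddd-dddd format; the additional area-code and exchange restrictions mentioned above are not modeled, and the function names are invented for this sketch.

```python
import re

# phone_number domain: character string in ddd-ddd-dddd format.
PHONE_FORMAT = re.compile(r"^\d{3}-\d{3}-\d{4}$")

def in_phone_number_domain(value: str) -> bool:
    return bool(PHONE_FORMAT.match(value))

# item_weight domain: integer pounds from 0 to 500.
def in_item_weight_domain(value: int) -> bool:
    return isinstance(value, int) and 0 <= value <= 500

print(in_phone_number_domain("814-555-1234"), in_phone_number_domain("8145551234"))  # True False
print(in_item_weight_domain(150), in_item_weight_domain(900))                        # True False
```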
An additional format specification would not be needed in this case. However, in this case a unit of weight (pounds or kilograms) should be specified. It is possible for several attributes in a relation to have the same domain. For example, many database applications require a phone number to be stored. This can be assigned the phone_number domain given above. In addition, it is becoming common for databases to store home phone, work phone, and cell phone. In this case there would be three attributes. All would use the same phone_number domain.

Relation Schema

If we skip the mathematical nomenclature, a relation schema consists of a relation name and a list of attributes. Each attribute names a role played by some domain in the relation schema. The degree (also known as arity) of a relation is the number of attributes in the relation. If a relation is made up of five attributes, the relation is said to be a relation of degree five. For example, DEPARTMENT(Code, Name, Building) is a relation schema of degree three.

Relation State

The relation state (sometimes just relation) of a relation schema is defined by the set of rows currently in the relation. If the degree of the relation is n, each row consists of an n-tuple of values. Each value in the n-tuple is a member of the domain of the attribute which matches the position of the value in the tuple. The exception is that a tuple value may contain the value NULL, which is a special value. Sometimes, the schema is known as the relation intension and the state is known as the relation extension. The value NULL is assigned to an attribute value in a tuple in three different cases. First, it is used when a value does not apply. For example, many databases include information about a customer's company in their sales record. Assume the database keeps track of both corporate and individual customers. If the sale is personal (to an individual) and not on behalf of a company, all information related to the customer's company would be given the value NULL in that record. A second case where NULL is used is when a value does not exist. An example is a customer home phone. While the value does apply in general, many people no longer have land line phones at home. If a customer does not have a home phone, that value would be stored as NULL in the record. Finally, NULL is also used when it is unknown whether a value exists. Consider again the phone example above. In the earlier example we knew that the customer did not have a home phone. There is also a case where we don't know whether or not the customer has a home phone. If the customer does have a home phone, we do not know the number. We would need to record a NULL value in this case also.

M 5.1.2: Characteristics of Relations

A relation can be viewed as similar to a file or a table. There are similarities, but there are differences as well. Some of the characteristics which make a relation different are discussed below.

Ordering of Tuples in a Relation

A relation is defined as a set of tuples. In mathematics, there is no order to elements of a set. Thus, tuples in a relation are not ordered. This means that in a relation, the tuples may be in any order - one ordering is as good as any other. There is no preference for any particular order. In a file, on the other hand, records are stored physically (on a disk for example), so there is going to be an ordering of the records which indicates first, second, and so on until the last record. Similarly, when we display a relation in a table, there is an order in which the rows are displayed: first, second, etc.
This is just one particular display of the relation, and the next display of the same relation may show the table with the tuples in a different order. However, there is a restriction on relations that no two tuples may be identical. This is different from a file, where duplicate records may exist. In many file uses, there are checks to make sure that duplicate records do not exist, but the nature of a file does not place restrictions on duplicate records. In a relational DBMS, duplicate tuples will not be allowed, and no additional software checks will need to be performed.

Ordering of Values within a Tuple

The ordering of values in a tuple is important. The first value in a tuple indicates the value for the first attribute, the second value in a tuple indicates the value for the second attribute, etc. There is an alternative definition where each tuple is viewed as a set of ordered pairs, where each pair is represented as (<attribute>, <value>). This provides self-describing data, since the description of each value is included in the tuple. We will not use the alternative definition in this course.

Values in the Tuples

Each value in a tuple must be an atomic value. This means that it cannot be subdivided within the current relational design. Therefore, composite and multivalued attributes are not allowed. This is sometimes called the flat relational model. This is one of the assumptions behind the relational model. We will see this concept in more detail when we discuss normalization in a later module. Multivalued attributes will be represented by using separate relations. In the strictly relational model, composite attributes are represented by their simple component attributes. Some extensions of the strictly relational model, such as the object-relational model, allow complex-structured attributes. We will discuss this in later modules. For the next several modules, we will assume that all attributes must be atomic.

Interpretation (Meaning) of a Relation

A relation schema can be viewed as a type of assertion. As an example, consider the example database presented in Module 1. The COURSE relation asserts that a course entity has a Number, Name, Credit_hours, and Department. Each tuple can be viewed as a fact, or an instance of the assertion. For example, there is a COURSE with Number "CMPSC121", Name "Introduction to Programming", Credit_hours 3, and Department "CMPSC". Some relations, such as COURSE, represent facts about entities. Other relations represent facts about relationships, such as the TRANSCRIPT relation in the example database. The relational model represents facts about both entities and relationships as relations. In the next major section of this module, we will examine how different constructs in an ER diagram are converted into relations.

M 5.1.3: Relational Model Constraints

There are usually many restrictions or constraints that should be put on the values stored in a database. The constraints are derived from the requirements of the miniworld that the database is to represent. This sub-module discusses the restrictions that can be specified for a relational database. The constraints can be divided into three main categories.
1. Constraints from the data model, called inherent constraints. These are model-based.
2. Constraints which are schema-based or explicit constraints. These are expressed in the schemas of the data model, usually by specifying them in the DDL.
3. Constraints which cannot be expressed in either #1 or #2.
They must be expressed in a different way, often by the application programs. These are called semantic constraints or business rules. Constraints of type #1 are driven by the relational model itself and will not be further discussed here. An example of a constraint of this type would be the fact that no relation can have two identical tuples. Constraints of type #3 are often difficult to express and enforce within the data model. They relate to the meaning and behavior of the attributes. These constraints are often enforced by the application programs that update the database. In some cases, this type of constraint can be handled by assertions in SQL. We will discuss these in a future module. An additional category of constraints is called data dependencies. These include functional dependencies and multivalued dependencies. This category focuses on the quality of the relational database design and is used in database normalization. We will discuss normalization in a later module. Below, the type #2 (schema-based) constraints are listed and discussed.

Domain Constraints

Domain constraints require that the value of each attribute in each tuple of the relation come from the domain specified for that attribute. In an earlier sub-module, we discussed how domains are specified. Some data types associated with domains are: standard data types for integers and real numbers, characters, Booleans, and both fixed- and variable-length strings. Other special data types are available. These include date, time, timestamp, and others. This is further discussed in the next module. It is also possible to further restrict data types, for example by taking an int data type and restricting the range of integers which are allowed. It is also possible to provide a list of values which are allowed in the attribute.

Key Constraints and Constraints on NULL Values

We indicated above that in the definition of a relation, no two tuples may have the exact same values for all their elements. It is often the case that one or more subsets of attributes will also have the property that no two tuples in the relation have the exact same values for that subset of attributes. Any such set of attributes is called a superkey of the relation. A superkey specifies a uniqueness constraint in that no two tuples in the relation will have the exact same values for the superkey. Note that at least one superkey must exist: the subset which consists of all attributes. This subset must be a superkey by the definition of a relation. A superkey may have redundant attributes - attributes which are not needed to ensure the uniqueness of the tuple. A key of a relation is a superkey where the removal of any attribute from the set will result in a set of attributes which is no longer a superkey. More specifically, a key has the following properties:
1. Two different tuples cannot have identical values for all attributes in the key. This is the uniqueness property.
2. A key must be a minimal superkey. That means that it must be a superkey from which no attribute can be removed with the uniqueness property still holding.
For example, in a STUDENT relation, {Student_id, Name} is a superkey, but it is not a key, because Name can be removed and {Student_id} alone still identifies each tuple uniquely; {Student_id} is a key. Note that this implies that a superkey may or may not be a key, but a key must be a superkey. Also note that a key must satisfy the property that it is guaranteed to be unique as new tuples are added to the relation. In many cases, a relation schema may have more than one key. Each of the keys is called a candidate key. One of the candidate keys is then selected to be the primary key of the relation.
The primary key will be underlined in the relation schema. If there are multiple candidate keys, the choice of the primary key is somewhat arbitrary. Normally, a candidate key with a single attribute, or with only a small number of attributes, is chosen to be the primary key. The other candidate keys are often called either unique keys or alternate keys. One additional constraint that is applied to attributes specifies whether or not NULL values are permitted for the attribute. Considering the example database from Module 1, if every student must have a declared major, then the MAJOR attribute is constrained to be NOT NULL. If it is acceptable for a student to at times not have a declared major, then the attribute is not so constrained.

Entity Integrity, Referential Integrity, and Foreign Keys

The entity integrity constraint requires that no tuple may have a NULL value in any attribute which makes up the primary key. This follows from the fact that the primary key is used to uniquely identify tuples in a relation. Both entity integrity constraints and key constraints deal with a single relation. On the other hand, referential integrity constraints are used between two relations. They are used to maintain consistency between the tuples of the two relations. A referential integrity constraint indicates that a tuple in one of the relations refers to a tuple in a second relation. Further, it must refer to an existing tuple in the second relation. For example, in the example database, the MAJOR attribute in the STUDENT relation refers to, and must match, the value of the CODE attribute in some tuple of the DEPARTMENT relation. More specifically, we need to define the concept of a foreign key. A set of attributes in Relation One is a foreign key that references Relation Two if it satisfies:
1. The attributes of the foreign key in Relation One have the same domains as the primary key attributes in Relation Two. The foreign key is said to reference Relation Two.
2. The value of the foreign key in a tuple of Relation One must either occur as the primary key value of some tuple in Relation Two or be NULL.
Note that it is possible for a foreign key to refer to the primary key in its own relation. This can occur in a unary or recursive relationship. Integrity constraints should be indicated on the relational database schema. Many of these constraints can be specified in the DDL and automatically enforced by the DBMS.

M 5.1.4: Update Operations and Dealing with Constraint Violations

Operations applied to an RDBMS can be split into retrievals and updates. Queries are used to retrieve information from a database. Retrievals will be discussed later. Retrieving information does not violate any integrity constraints. There are three types of update operation:
1. Insert (sometimes referred to as create) is used to add a new tuple to the database.
2. Update (sometimes referred to as modify) is used to change the value of one or more attributes in an existing tuple.
3. Delete is used to remove a tuple from the database.
Each of the three update operations can potentially violate integrity constraints as indicated below. How the potential violations can be handled is also discussed.

Insert Operation

The insert operation provides a list of attribute values for a new tuple to be added to a relation. This operation can violate:
1. Domain constraints if a value is supplied for an attribute which is not of the correct data type or does not adhere to valid values in the domain.
2. Key constraints if the key value for the new tuple already exists in another tuple in the relation.
3. Entity integrity if any part of the primary key of the new tuple contains the NULL value.
4. Referential integrity if the value of any foreign key in the new tuple refers to a tuple which does not exist in the referenced relation.
In most cases, when an insert operation violates any of the above constraints, the insert is rejected. There are other options, but they are not commonly used.

Delete Operation

The delete operation indicates which tuple(s) should be deleted. This operation can only violate referential integrity. This violation happens when the tuple being deleted is referenced by foreign keys in other tuples in the database. When deleting a tuple would result in a referential integrity violation, there are three main options available:
1. Restrict - reject the deletion.
2. Cascade - try to propagate the deletion by deleting not only the tuple specified in the delete operation, but also all tuples in referencing relations that reference the tuple being deleted.
3. Set to NULL - set all referencing attribute values to NULL, then delete the tuple specified in the delete operation. Note that this option will not work if a referencing attribute is part of the primary key of the referencing relation, since setting it to NULL would cause a different violation: no part of a primary key may be NULL.

Update Operation

The update operation is used to change the values of one or more attributes in a tuple. This can lead to several possibilities:
1. Updating an attribute which is not part of a primary key or a foreign key is usually acceptable. However, this type of update can violate domain constraints if the new value supplied for an attribute is not of the correct data type or does not adhere to valid values in the domain.
2. Modifying the value of an attribute which is part of a primary key causes, in effect, a deletion of the original tuple and an insertion of a new tuple with the new value(s) forming the key of the new tuple. This can have any of the impacts for delete and insert discussed above.
3. Modifying the value of a foreign key attribute involves verifying that the new foreign key references an existing tuple in the referenced relation. The options for dealing with any resulting violation are similar to those discussed for the delete operation.

M 5.2: Relational Database Design Using ER-to-Relational Mapping

In previous modules we examined taking a set of requirements and creating a conceptual schema in the form of an ER diagram. The next step in the process is creating a logical database design by creating a relational database schema based on the conceptual schema. This is known as data model mapping. This part of the module presents a procedure for creating a database schema from an ER diagram. The text presents a seven-step algorithm for converting basic ER model concepts into relations. The algorithm presented here modifies the algorithm in the text to account for (intentional) omissions and modifications to some of the material presented in the text. Specifically, step 2 from the text has been eliminated since we are treating weak entity types like regular entity types and relationships. Also, step 6 from the text has been eliminated since we are not allowing multivalued attributes; they are eliminated during the design when the ER diagram is constructed. This leaves the five-step algorithm presented here.
Consider the following ER diagram as an example as we work through the steps of the algorithm. Note that this is similar to the ER diagram in the second ER diagram module, but it has been extended to illustrate additional points. Note that, following the design choices from the text which are highlighted in sub-module 4.6, the attributes which were refined into a binary relationship have been removed from the entity type. Also note that participation constraints are not needed in the mapping algorithm, so single lines are used for all connections between entities and relationships regardless of the nature of participation.

Example ER Diagram
Figure 5.1: ER Diagram to Use as Example for Algorithm Steps

The mapping algorithm follows.

Step 1: Mapping of Regular Entity Types

For each regular entity type in the ER schema, create a relation that includes all the attributes. Note that unlike the text, we did not allow composite attributes, so all attributes will be simple and will be included in the relation. Next consider all the candidate keys identified for an entity. Choose one of the candidate keys as the primary key. Repeat this for each relation created in this step. Note that foreign keys and relationship attributes (if there are any) are not addressed in this step. They will be addressed in the steps below. The relations created during this step are sometimes called entity relations because each tuple in these relations represents an entity instance. In our example ER diagram, the regular entity types STUDENT, FACULTY, DEPARTMENT, COURSE, and SECTION will lead to the creation of relations STUDENT, FACULTY, DEPARTMENT, COURSE, and SECTION. All attributes shown with the entities in the diagram will be included in the corresponding relation. For the STUDENT relation, (student) ID will be the primary key. It might be possible to declare (student) Name as a candidate key, but since there is at least a slight possibility that duplicate names (e.g. John R. Smith) will be present at some point during the life of the database, this cannot be a candidate/primary key unless we have a mechanism in the name attribute for keeping the names unique. An admittedly awkward way to do this might be "John R. Smith - 1" and "John R. Smith - 2" as values in the name field. Similarly, we would use (faculty) ID as the primary key for the FACULTY relation. We would use (department) Code as the primary key for the DEPARTMENT relation, Number as the primary key for the COURSE relation, and (section) ID as the primary key for the SECTION relation. Note that we made these key choices when originally drawing the ER diagram. If we do this, the key determinations at this step have already been made. This just reviews some of the decisions made at that earlier point. The result after this step is shown here.

Relations after Step 1:
STUDENT (Student_id, Name, Class_rank)
FACULTY (Faculty_id, Name)
DEPARTMENT (Code, Name)
COURSE (Number, Name)
SECTION (Section_id, Semester)
Figure 5.2: Entity Relations after Step 1

Step 2: Mapping of Binary 1:1 Relationship Types

For each binary 1:1 relationship type in the ER schema, identify the two relations that correspond to the entity types participating in the relationship. Call the two relations R1 and R2. There are three ways to handle this. By far the most common is the first approach. The other two are useful in special cases.
1. Foreign Key Approach: Choose one of the relations - we'll choose R1 - and include in R1, as a foreign key, the primary key of R2.
If there are any attributes assigned to the relationship (and not to either entity type directly), those attributes will be included as attributes of R1.
2. Merged Relation Approach: This approach involves merging the two entity types into a single relation. If this approach is chosen, it is preferable for both participations to be total, which implies that the two entity sets always contain the same number of corresponding instances.
3. Cross-reference or Relationship Relation Approach: This approach involves setting up a third relation, which we'll call RX, which exists for cross-referencing the primary keys of relations R1 and R2. We will see in a later step that this is the approach used for a binary M:N relationship. The relation RX is known as a relationship relation because each tuple in RX indicates a relationship between a tuple in R1 and a tuple in R2.
The pros and cons of the three approaches can be further illustrated with an example.

Example Illustrating the Three Approaches

In this simple example, suppose we are modeling a company. Two of the entities we have captured are SALESPERSON and OFFICE. Keeping the example simple, the primary key of the SALESPERSON entity is a unique Salesperson Number assigned to each salesperson. Also of interest are the attributes Salesperson Name, Salary, and Year of Hire. The primary key of the OFFICE entity is Office Number. Assume that all offices are in the same building, so the Office Number is unique. We are also interested in the size of the office and whether or not the office has a window. Also assume that this is a sales force that does most of its work from the office, so a salesperson will be in his/her office most of the time. Since a salesperson will be on the phone much of the time, only one salesperson will be assigned to a given office to minimize distractions during phone calls with customers. Note that in other cases where salespeople share offices, such as when they are usually on the road and only use an office occasionally, this would become a 1:N relationship and would be covered in the next step. Examining the foreign key approach first, we'll choose SALESPERSON as R1 and OFFICE as R2. We create the SALESPERSON relation with all its attributes and designate Salesperson Number as the primary key. We also include as an attribute the foreign key Office Number. We then create the OFFICE relation with primary key Office Number and attributes Size and Window. We could also have used the design where OFFICE is chosen as R1 and SALESPERSON is chosen as R2. With this design, SALESPERSON would not have a foreign key in its table, but OFFICE would now include the foreign key Salesperson Number. Since either design meets the outline of the approach, which design works better? It depends on the actual situation in the miniworld being modeled. If there is total participation by one entity type in the relationship, its relation should be chosen as R1. If our miniworld indicates that every salesperson has an office, but there is the possibility that one or more offices are empty (do not have a salesperson currently assigned), then there is total participation by SALESPERSON in the relationship (which might be called WORKS_IN), but not total participation by OFFICE. Therefore, we should choose SALESPERSON to be R1. Since every salesperson has an office, the foreign key attribute (which can also be called Office Number) will always have a value in each tuple. There may be one or more offices in the OFFICE relation which are not occupied.
That would not change the tuples in the OFFICE relation. It just may be that some of the tuples in the relation are not currently related to a SALESPERSON tuple. With the same assumptions, if we choose OFFICE to be R1, some of the OFFICE tuples will have NULL values in the foreign key field (which might be called Salesperson Number). In this scenario, the first choice of SALESPERSON as R1 is preferred. If the actual situation in the miniworld is modified, where all offices are occupied but not all salespersons have offices, the choice needs to be reexamined. This might be the case if some of the salespersons are "home based," but others travel. The mostly home-based salespersons are assigned an office, but those who mostly travel are not. In this case, all offices are occupied, but not all salespersons have an office. This means that there is total participation by OFFICE in the relationship (which this time we might call OCCUPIED_BY). Therefore OFFICE should be chosen as R1. Since every office is occupied, the foreign key attribute (which can also be called Salesperson Number) will always have a value in each tuple. There may be one or more salespersons in the SALESPERSON relation who do not have offices. That would not change the tuples in the SALESPERSON relation. It just may be that some of the tuples in the relation are not currently related to an OFFICE tuple. With the same assumptions, if we choose SALESPERSON to be R1, some of the SALESPERSON tuples will have NULL values in the foreign key field. In this scenario, the second design choice of OFFICE as R1 is preferred. Next we'll examine the merged relation approach. One reason that this approach is not favored is that the two entities are thought of as separate, and this is reinforced by the fact that they were drawn separately in the ER diagram. Despite this, the approach can work when both participations are total. In the example, this would mean that every salesperson works in an office and every office is occupied by a salesperson. Then either office number or salesperson number can be chosen as the primary key. If both participations are not total, then either there can be empty offices or some salespersons may not have offices, or both. If there are empty offices, then office number must be chosen as the primary key and in some tuples the information related to salespersons will be NULL. If some salespersons do not have offices, then salesperson number must be chosen as the primary key and in some tuples the information related to offices will be NULL. If neither participation is total, some of the tuples will have non-NULL values for all attributes, some will have NULL values for the salesperson information, and others will have NULL values for the office information. This leads to the question: which attribute will be the primary key? Closer examination will show that whichever attribute is picked, it is possible that the primary key value will be NULL for some tuples. This is not allowed, so this approach cannot be used when neither participation is total. Finally, there is the cross-reference or relationship relation approach. This involves keeping the R1 and R2 relations with only their immediate attributes. We then create a new relation RX to handle the cross-reference. In the example, we might call this relation WORK_LOCATION. This relation will have tuples with only two attributes: Salesperson Number and Office Number. The two are used together to form the primary key of this relation.
Each will also be a foreign key: one referencing the SALESPERSON table and one referencing the OFFICE table. This approach has the drawback that it requires an extra relation, which will require additional processing for certain queries. As mentioned, this approach is required for M:N relationships, but it is not common for 1:1 relationships. In our example from Figure 5.1, the CHAIR_OF relationship is a 1:1 binary relationship. We will choose the foreign key approach for mapping this relationship. This requires that we include the primary key of one of the relations as a foreign key in the other relation. If we include the primary key of DEPARTMENT (Code) as a foreign key in FACULTY, there will be many NULL values in the attribute set up as a foreign key since only a relatively small number of faculty are chairs. However, if we use the primary key of FACULTY (Faculty_id) as a foreign key in DEPARTMENT, there will be no NULL values in this attribute since every department must have a chair. So, we will use the second choice and include an attribute, which we'll call Chair_id, in the DEPARTMENT relation. Note that this will be a foreign key referencing the FACULTY relation. The modified DEPARTMENT relation is shown below.

Department Relation after Step 2:
DEPARTMENT (Code, Name, Chair_id)
Figure 5.3: Modified DEPARTMENT Relation after Step 2

Step 3: Mapping of Binary 1:N Relationship Types

For each binary 1:N relationship type in the ER schema, there are two possible approaches. By far the most common is the first approach since it reduces the number of tables.
1. Foreign Key Approach: Find the relation for the N-side of the relationship. We'll call this R1. Include in R1, as a foreign key, the primary key of the other relation, which can be called R2.
2. The Relationship Relation Approach: As with the third approach in the previous step, this approach involves setting up a third relation, which we'll call RX. This relation exists for cross-referencing the primary keys of relations R1 and R2. The attributes of RX are the primary keys of R1 and R2. This approach can be used if only a few tuples of R1 participate in the relationship; in that case, using the first approach would cause many NULL values in the foreign key field of R1.
In our example from Figure 5.1, there are five 1:N binary relationships. We will choose the foreign key approach for mapping these relationships. To do this, we will select the relation on the N-side of the relationship and include in this relation, as a foreign key, the primary key of the relation on the 1-side. Specifically, in our example: For the ADVISED_BY relationship, we will include in the STUDENT relation (the N-side) a foreign key which contains the primary key of the FACULTY relation (Faculty_id). We will call this new attribute Advisor_id. Note that this name is appropriate in the context of the STUDENT relation, but realize that it is the Faculty_id from the FACULTY relation. Using similar logic, for the BELONGS_TO relationship, we will include in the FACULTY relation (the N-side) a foreign key which contains the primary key of the DEPARTMENT relation. We will call this new attribute Department. For the OWNS relationship, we will include in the COURSE relation (the N-side) a foreign key which contains the primary key of the DEPARTMENT relation. We will call this new attribute Department. For the MAJORS_IN relationship, we will include in the STUDENT relation (the N-side) a foreign key which contains the primary key of the DEPARTMENT relation.
We will call this new attribute Major. For the OFFERING_OF relationship, we will include in the SECTION relation (the N-side) a foreign key which contains the primary key of the COURSE relation. We will call this new attribute Course_number. The modified relations are shown below.

Relations after Step 3:
STUDENT (Student_id, Name, Class_rank, Advisor_id, Major)
FACULTY (Faculty_id, Name, Department)
DEPARTMENT (Code, Name, Chair_id)
COURSE (Number, Name, Department)
SECTION (Section_id, Semester, Course_number)
Figure 5.4: Entity Relations after Step 3

Step 4: Mapping of Binary M:N Relationship Types

Since we are following the traditional relational model and not allowing multivalued attributes, the only option for M:N relationships is to use the relationship relation (cross-reference) approach. For each binary M:N relationship type, create a new relation RX to represent the relationship. Include in RX, as foreign keys, the primary keys of the relations that represent the entities involved in the relationship. The combination of the foreign keys will form the primary key of RX. Add to the relation RX any attributes which are associated with the relationship as shown in the ER diagram. In our example from Figure 5.1, there is one M:N binary relationship, ENROLLED_IN. We will create a new relation to represent the relationship. We could name the relation ENROLLED_IN. However, the requirement which led to identifying this relationship is based on a desire for transcript information, so we will name the new relation TRANSCRIPT. Following the algorithm, we include the primary keys of the two relations STUDENT and SECTION as foreign keys in the TRANSCRIPT relation. The two keys together will form the primary key of the TRANSCRIPT relation. We will also include any attributes associated with the relationship. In this case, there is only one: Grade. The new TRANSCRIPT relation is shown below.

TRANSCRIPT Relation after Step 4:
TRANSCRIPT (Student_id, Section_id, Grade)
Figure 5.5: TRANSCRIPT Relation after Step 4

Step 5: Mapping of N-ary Relationship Types

For this mapping, we again use the relationship relation approach. For each N-ary relationship type (where N > 2), create a new relation RX to represent the relationship. Include in RX, as foreign keys, the primary keys of the relations that represent all entities involved in the relationship. The combination of the foreign keys will form the primary key of RX. Add to the relation RX any attributes which are associated with the relationship as shown in the ER diagram. Note that this step does not apply to the ER diagram in Figure 5.1 since there are no N-ary relationships in the diagram. A separate example of how to apply this step is given below.

Example

Consider the example for N-ary relationships from Module 4.3. We looked at a store where the entities CUSTOMER, SALESPERSON, and PRODUCT have been identified. Assume the store sells higher-priced items, as in an appliance store or a furniture store, where, unlike a grocery store, most customers buy only one or a few items at a time from a salesperson who is on commission. The customer is identified by a customer number, the product is identified by an item number, and the salesperson is identified by a salesperson number. It is desirable to keep track of which items a customer purchased on a particular day from which salesperson. We want to record the quantity of a particular item which is purchased as well as the date purchased. We defined a ternary relationship, called PURCHASE, between CUSTOMER, SALESPERSON, and PRODUCT.
Quantity and Date Purchased become attributes of the PURCHASE relationship. In this example, we would define relations for CUSTOMER, SALESPERSON, and PRODUCT in step 1. In this step, we define a relation called PURCHASE. This relation contains, as foreign key attributes, the primary keys of CUSTOMER, SALESPERSON, and PRODUCT. These attributes are combined to form the primary key of PURCHASE. Also included as attributes of PURCHASE are Quantity and Date Purchased, since these are attributes of the relationship and not independent attributes of any of the three entities which participate in the relationship. If you want to refresh your memory, see Figure 4.2: ER Diagram Representation of Ternary Relationship.

M 6.1: SQL Background

Relational DBMSs are still the most widely used DBMSs in industry today. Almost all, if not all, RDBMSs support the SQL standard. The standard had its origins with the SEQUEL language developed by IBM in the 1970s for use with its experimental RDBMS, SYSTEM R. As relational systems began to move from prototype systems to production systems, each vendor had its own proprietary RDBMS and its own proprietary DML/DDL language. Many were similar to SEQUEL (later renamed SQL: Structured Query Language). The differences caused problems for users who wanted to switch RDBMS vendors. The cost of conversion was often quite high since switching systems often required table modification to run on the new system, and many programs which accessed the DBMS often required extensive modification to work with the new system. Customers who wanted to run two or more systems from different vendors faced similar problems since interface programs written for one system would not run directly on the other system. Under pressure from users, vendors moved to help alleviate this problem. Under the leadership of the American National Standards Institute (ANSI), a standard was developed for SQL (called SQL1 or SQL-86). This first standard was quite similar to IBM's version of SQL. Suggested improvements to the IBM version were delayed to SQL2 to allow the users of the IBM RDBMS (named DB2 when SYSTEM R was released as a production version) to delay likely necessary changes to their system. This made sense at the time since the majority of early RDBMS users ran DB2 on IBM computers. If vendors adhered to the standard, users could switch vendors or run systems from multiple vendors with much less rework required. Although most vendors continued to add features beyond those specified in the standard, users could choose to ignore the vendor-specific features to minimize rework when moving to a different system. The revised and greatly expanded standard known as SQL2 (officially SQL-92) was released in 1992. Creating an official standard takes several years, with the final approval process usually taking well over a year since there are opportunities for user comments. The committee discusses the comments and decides which to include in the official standard. Some vendors often have features in their DBMS which later become part of the standard. Also, most vendors have not implemented all the features which are included in the standard by the time the standard is released. It often takes vendors a few years before they include all the features specified in a new standard. The standards have been updated over time.
Major revisions have been:
o SQL:1999 (also known as SQL3)
o SQL:2003, which began support for XML
o SQL:2006, which added additional XML features
o SQL:2008, which added more object database features
o SQL:2011
o SQL:2016
o SQL:2019, which is currently in progress
SQL is comprehensive, including statements for data definitions, queries, and updates. This means that it is both a DDL and a DML. It includes many additional facilities, including rules for embedding SQL statements into programming languages such as Java and C++. Beginning in 1999, the standards were split into a core specification plus specialized extensions. The core is to be implemented by all SQL compliant vendors. The extensions can be implemented as optional modules if the vendor chooses.

M 6.2: SQL Data Definition and Data Types

SQL uses the (informal) terms table, row, and column for the formal relational terms relation, tuple, and attribute. These are most often used interchangeably when discussing SQL. Below we use all upper case for SQL keywords such as CREATE TABLE or PRIMARY KEY. However, SQL is case insensitive when examining keywords. Specifying CREATE TABLE, create table, or Create Table will all be equivalent in SQL. The same is not true for character string data stored in rows (tuples) of the database. The database is case sensitive when processing data. The concept of a schema was introduced in SQL2. This allows a user to group tables and related constructs that belong to the same schema (often just called a database). This feature allows an RDBMS to host many different databases. An SQL schema is given a schema name and an authorization identifier which indicates which user owns the schema. It also includes descriptors for each element in the schema. These schema elements include such items as tables, types, constraints, views, domains, and others. A schema is created using the CREATE SCHEMA statement. This must include the schema name. It can include the definitions for none, some, or all of the schema elements. The elements not defined in the statement can be defined later. An example for a university database would be:

CREATE SCHEMA UNIVERSITY;

Note that SQL statements end with a semicolon. The semicolon is optional in some DBMSs and SQL statements will run fine without it; it is required in other DBMSs. The next step in creating a database is to create the tables (relations). This is done using the CREATE TABLE command. This command is used to name the table and give its attributes and initial constraints. Often the schema name is inferred from the environment and need not be specified explicitly, but it can be specified if desired or if needed. An example without and then with the schema would be:

CREATE TABLE STUDENT ( rest of specification );
CREATE TABLE UNIVERSITY.STUDENT ( rest of specification );

Inside the parentheses, attributes are declared first. This includes the name given to the attribute, a data type to provide the domain, and optional attribute constraints such as NOT NULL. An example for two of the tables in the example database from Figure 1.2 follows.

CREATE TABLE STUDENT (
STU_ID CHAR(5) NOT NULL,
SNAME VARCHAR(25) NOT NULL,
RANK CHAR(2),
MAJOR VARCHAR(5),
ADVISOR INT,
PRIMARY KEY(STU_ID) );

CREATE TABLE DEPARTMENT (
CODE VARCHAR(5) NOT NULL,
DNAME VARCHAR(25) NOT NULL,
BUILDING VARCHAR(35) NOT NULL,
PRIMARY KEY(CODE) );

Tables (relations) created using CREATE TABLE are called base tables. This means that the tables and their rows are stored as actual files by the DBMS.
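As one more sketch in the same style, a COURSE table based on the example database (which has Number, Name, Credit_hours, and Department attributes) might be created as shown below. The column names and sizes used here are illustrative choices, not definitions taken from the example database.

CREATE TABLE COURSE (
CRS_NUMBER VARCHAR(10) NOT NULL,
CNAME VARCHAR(35) NOT NULL,
CREDIT_HOURS INT,
DEPARTMENT VARCHAR(5),
PRIMARY KEY(CRS_NUMBER) );

As with the STUDENT and DEPARTMENT tables above, only the primary key constraint is specified here; other constraints are discussed below.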
A later module will discuss the CREATE VIEW statement. This statement creates virtual tables, which may or may not be represented by a physical file. In SQL, attributes specified in a base table are considered to be in the order specified in the CREATE TABLE statement. Remember, however, that row ordering is not relevant, so rows are not considered to be ordered in a table. Keys were discussed in the last module. The primary key is specified as shown above. Foreign keys can be specified in the CREATE TABLE statement. However, doing this might lead to errors if the foreign key references a table which has not yet been created. The foreign keys can be added later using the ALTER TABLE statement. We will discuss this statement in a later module. Foreign keys have been left off of the CREATE TABLE examples above. There are several basic data types in SQL. They are listed below with brief descriptions. Additional details can be found in the text. We will cover additional details as they are needed.
o Numeric data types include integer numbers of different sizes (INT and SMALLINT). They also include floating-point numbers of different precisions (FLOAT, REAL, and DOUBLE PRECISION). SQL also includes formatted numbers using DEC(i,j), where i specifies the total number of digits and j specifies the number of digits after the decimal point.
o Character-string data types can be either fixed length, specified by CHAR(n) where n is the length of the string, or varying length, specified by VARCHAR(n) where n is the maximum length of the string.
o Bit-string data types are also either fixed length of n bits (BIT(n)) or variable length of maximum length n (BIT VARYING(n)).
o The Boolean data type has the traditional values of TRUE and FALSE. However, to allow for the possibility of NULL values, SQL also allows the value UNKNOWN. This three-valued logic is not quite the same as the traditional two-valued Boolean logic. It is discussed in more detail in a later module.
o The DATE and TIME data types store dates (with year, month, and day components) and times of day.
Additional, more specialized, data types exist. Some are not part of the standard, but have been implemented by individual vendors. They will be discussed if they are needed.

M 6.3: Specifying Constraints in SQL

There are several basic constraints which can be specified in SQL as part of table creation. These include key and referential integrity constraints, restrictions on attribute domains and use of NULLs, and constraints on individual tuples by using the CHECK clause. An additional type of constraint, called an assertion, is discussed in a future module.

Specifying Attribute Constraints and Attribute Defaults

By default, SQL allows NULL to be used as a value for an attribute. A NOT NULL constraint can be specified for any attribute where a value of NULL is not permitted. Examples of this were shown in the last sub-module. Remember from the last module that the entity integrity constraint requires that no tuple may have a NULL value in any attribute which makes up the primary key. Therefore, any attribute specified as part of the primary key will not allow NULLs by default, but any other attribute where NULL should not be permitted will require the NOT NULL constraint to be specified. Unless a default is defined, the default value assigned to an attribute in a tuple is NULL. If a different default value is desired, it can be defined by appending the "DEFAULT <value>" clause to the attribute definition.
An example of this would be to assign the default value of 3 to the credit hours attribute in the COURSE table. This would be done by specifying the attribute as:

Credit_hours INT NOT NULL DEFAULT 3;

An additional constraint can be added to restrict the domain values of an attribute. This can be done using the CHECK clause. If there is a university policy that no course may be offered for more than six credit hours, this can be specified as:

Credit_hours INT NOT NULL CHECK (Credit_hours > 0 AND Credit_hours < 7);

Specifying Key and Referential Integrity Constraints

The PRIMARY KEY clause specifies the primary key for the table. Examples of specifying the primary key were shown in the last sub-module. If multiple attributes are required to make up the primary key, all key attributes are listed in the parentheses. For example, in the TRANSCRIPT table, the primary key is a combination of student ID and section ID. This would be specified as:

PRIMARY KEY (Stu_id, Section_id);

We have already discussed alternate or candidate keys. These are also unique, but were not chosen as the primary key for the relation. This can be specified in a UNIQUE clause. For example, if we assume the department name is unique in the DEPARTMENT relation, we can specify:

UNIQUE (Dname);

For both primary and alternate keys, if the key is made up of only one attribute, the corresponding clause can be specified directly with the attribute. Examples for the DEPARTMENT table would be:

Code VARCHAR(5) PRIMARY KEY;
Dname VARCHAR(25) UNIQUE;

Referential integrity constraints are specified by using the FOREIGN KEY clause. Remember from an earlier module that it is possible to violate a referential integrity constraint when a tuple is added, deleted, or modified (the update operations). Specifically, the modification would be to a primary key or foreign key value. Also remember that there are three main options for handling an integrity violation: restrict, cascade, and set to NULL. The default for SQL is RESTRICT, which causes a rejection of the update operation that would cause the violation. However, the schema designer can specify an alternative action. For an insertion, RESTRICT is the only option which makes sense, so alternative actions are not specified for insert operations. The alternatives for modify and delete are SET NULL, CASCADE, and SET DEFAULT. The option must be qualified by ON DELETE or ON UPDATE (here UPDATE refers to the modify operation, not to all three operations which update the database).

Example Tables Referenced

STUDENT
Student ID   Name             Class Rank
17352        John Smith       SR
19407        Prashant Kumar   JR
22458        Rene Lopez       SO
24356        Rachel Adams     FR
27783        Julia Zhao       FR

FACULTY
Faculty ID   Name
1186         Huei Chuang
5368         Christopher Johnson
6721         Sanjay Gupta
7497         Martha Simons

Note that the values in the tables have been modified to better demonstrate the concepts below. For example, we might specify a foreign key in the STUDENT table of our example database by the clause:

FOREIGN KEY (Advisor_id) REFERENCES FACULTY (Faculty_id);

Since the default is RESTRICT, we will not be able to add a tuple to the STUDENT table unless the value in the Advisor_id attribute matches a value of Faculty_id in some tuple in the FACULTY table. Similarly, we cannot modify the value of the Faculty_id attribute if that faculty member has been assigned an advisee using the old Faculty_id. Neither can we modify the value of the Advisor_id attribute unless we modify it to a value of Faculty_id in some tuple of the FACULTY table.
Finally, we cannot delete a row from the FACULTY table if that faculty member is the advisor to at least one student. All these actions are prohibited by the RESTRICT option. Consider the tables above. For the sake of simplicity, assume this is our "entire" database for now. The RESTRICT option (default) would not allow the addition of a new student with an Advisor value of 4953, since there is no tuple in the FACULTY relation with the faculty id 4953. We could not modify the faculty id of Huei Chuang, since two students show their advisor as 1186. Changing that value in the FACULTY tuple would create a referential integrity violation. This would not be allowed with RESTRICT. In a similar manner, we cannot delete faculty Martha Simons, since that would remove a faculty id which is used as the advisor value for the two students who show their advisor as 7497. This deletion would again create a referential integrity violation. Suppose we modify the FOREIGN KEY clause as follows:

FOREIGN KEY (Advisor_id) REFERENCES FACULTY (Faculty_id) ON DELETE SET NULL ON UPDATE CASCADE;

The add operation would work as above since the RESTRICT option would still be in effect for add. When the row for a faculty member is deleted and that faculty member has advisees, the delete operation is now allowed, but for all students who had the (soon to be) deleted faculty member as advisor, the Advisor_id attribute would be changed to NULL. If the Faculty_id is modified, that operation is also now permitted, and with CASCADE the Advisor_id attribute for all students who have that faculty member as advisor would have its value changed to the new Faculty_id given to the faculty member. Again, consider the tables above. Since RESTRICT would still be in effect for add, the DBMS would still not allow the addition of a new student with an Advisor value of 4953, since there is no tuple in the FACULTY relation with the faculty id 4953. Since the option for DELETE is now SET NULL, we can delete faculty Martha Simons. This would be allowed and would cause the advisor value for the two students who show their advisor as 7497 (John Smith and Rene Lopez) to be changed from 7497 to NULL in their student tuples. Since the option for UPDATE is now CASCADE, we could modify the faculty id of Huei Chuang. If we change his ID to 1215, not only is the faculty id value in his faculty record changed to 1215, but the two students (Prashant Kumar and Rachel Adams) who have their advisor value as 1186 would automatically have the values modified to 1215. Note that other checks might restrict us from modifying Huei Chuang's id since it is the primary key of the relation, but the change would no longer be prevented by the referential integrity constraint.

Using CHECK to Specify Constraints on Tuples

In addition to being used to further restrict the domain of an attribute, the CHECK clause can be used to specify row-based constraints which are checked whenever a row is inserted or modified. This type of constraint is specified by adding the CHECK clause at the end of the other specifications in a CREATE TABLE statement. As an example of how this can be used, suppose our database contains a PRODUCT table. Among other attributes, we store the regular price at which the item is sold in the attribute Regular_price. However, we often put products on sale. We store the current sale price of the item in the attribute Sale_price. If an item is not currently on sale, we set the value of Sale_price to the value of Regular_price in the tuple.
We do want to make sure, however, that we do not accidentally set the sale price above the regular price. We can make sure this does not happen by adding the following CHECK clause at the end of the CREATE TABLE statement for the PRODUCT table.

CREATE TABLE PRODUCT ( rest of specification, CHECK (Sale_price <= Regular_price) );

If the CHECK condition is violated, the insertion or modification is rejected and the offending tuple is not stored in the database.

M 6.4: INSERT, DELETE, and UPDATE Statements in SQL

The three commands used to change the database are INSERT, DELETE, and UPDATE. They will be considered separately.

The INSERT Command

The basic form of the INSERT command is used to add a single row to a table. The table name and a list of values for the row must be specified. The values must be listed in the same order that the corresponding attributes were listed in the CREATE TABLE command. For example, in the STUDENT table created in sub-module 6.2, we can specify:

INSERT INTO STUDENT VALUES ('17352', 'John Smith', 'SR', 'SWENG', 1186);

Note that the first value (the student id) is specified in single quotes since it was specified as a CHAR(5) value and character values must be enclosed in single quotes. The last value (advisor) is not specified in quotes since it was specified as an INT and numbers are not enclosed in quotes.

The DELETE Command

The DELETE command removes tuples from a table. It includes a WHERE clause similar to the WHERE clause in the SELECT statement (which will be discussed in sub-module 6.6). The WHERE clause is used to specify which tuples should be deleted. The WHERE clause often specifies an equality condition on the primary key of the table whose tuple should be removed. However, it can be used to specify other conditions which may delete multiple tuples. One example of using the DELETE command is:

DELETE FROM STUDENT WHERE Student_id = '24356';

If a tuple is found with the specified student id, that tuple is removed from the table. If no tuple matches the student id, no action is taken. A second example is:

DELETE FROM STUDENT WHERE Major = 'CMPSC';

This removes any student tuple where the student is a CMPSC major. (I'm not sure why such a drastic action would be taken, but that is what the command would do.)

The UPDATE Command

The UPDATE command allows modification of attribute values of one or more tuples. Like DELETE, the UPDATE command includes a WHERE clause which selects the tuples to be modified from the indicated table. This command also contains a SET clause to specify which attribute(s) should be modified and what the new values should be. For example, to change the major of student 19497 to SWENG, we would issue the following:

UPDATE STUDENT SET Major = 'SWENG' WHERE Student_id = '19497';

This would update the value for one tuple. For an example using multiple tuples, consider that the faculty member with id 3468 is retiring and this faculty member has several advisees. All of the advisees are now being assigned to the faculty member with id 5793 as their new advisor. This would be accomplished with the following.

UPDATE STUDENT SET Advisor = 5793 WHERE Advisor = 3468;

It is also possible to specify NULL as the new attribute value. In these examples, we only specified one table in the command. It is not possible to specify multiple tables in an UPDATE command. To update multiple tables, an individual UPDATE command must be issued for each table, as sketched below.
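As a small sketch of that last point, suppose (hypothetically) that faculty member 1186 moves to the SWENG department and, at the same time, that faculty member's advisees are reassigned to faculty member 5368. Assuming the FACULTY table has a Department attribute (that table was not created in sub-module 6.2, so this is an assumption for illustration only), two separate UPDATE commands are required, one per table:

UPDATE FACULTY SET Department = 'SWENG' WHERE Faculty_id = 1186;
UPDATE STUDENT SET Advisor = 5368 WHERE Advisor = 1186;

Each statement is checked against the constraints of its own table; there is no single UPDATE that modifies both tables at once.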
M 6.5: Creating and Populating a Database Using SQL

In Assignments 1 & 2, you built and populated a database using the screens provided by NetBeans. In this module, we will start to build and populate a database using SQL commands. You will finish the build and population of the database in Assignment 6. You will then perform a few basic queries. We will discuss queries in upcoming sub-modules. In this module, using SQL, we are going to create a database named ASSIGNMENT6. We will then create a table and add some rows. The steps are listed below. Please follow along in your own version of NetBeans. At the end of the module you can click to download or view a Word document that contains more detail and screenshots which might be helpful. Steps required:
o Open NetBeans. If the Services tab is not visible, click on "Window" and then click the "Services" selection.
o If there is a + to the left of Java DB, click to expand. If there is no +, see the Word document.
o Right click on the “jdbc:derby://localhost…” entry and then click “Connect”.
o Right click on the “jdbc:derby://localhost…” entry again and this time click on “Execute Command…”
o Type in “CREATE SCHEMA ASSIGNMENT6;” (don’t forget the semicolon). Click on the “Run SQL” button (first button to the right of where you entered the command). The output window which is shown in the bottom part of the screen should indicate “Executed successfully…”
o You will probably not see ASSIGNMENT6 in the Services window. Click the + to expand "Other schemas". You should see ASSIGNMENT6. Right click on it and select “Set as Default Schema”. This should move it to the top and the name will be in bold.
o Expand, then right click on TABLES and select “Execute Command…” In this window, you need to type in the table definition. We will create the STUDENT table using:

CREATE TABLE STUDENT (
STU_ID CHAR(5) NOT NULL,
SNAME VARCHAR(25) NOT NULL,
RANK CHAR(2),
MAJOR VARCHAR(5),
ADVISOR INT,
PRIMARY KEY(STU_ID) );

Note that you can type the entire definition in the window. An alternate, and probably better, way is to type the definition into a simple text file (Notepad on Windows works well) and then copy and paste the entire definition from the text document into the window. Also note that unlike the guidelines we have been following from the text, I changed all attribute names to all upper case. This is because the SQL interpreter in Java DB treats attribute names the same way that it treats keywords - as case insensitive. If you want the attribute names kept in mixed case as we have used them so far, they have to be treated as delimited identifiers and enclosed in double quotes when you run queries. This can be a pain, so it is easier to put them into the database as all caps. Again click the “Run SQL” button. The output window should again indicate “Executed successfully…” Now that the table has been created, the next step is to add tuples (rows) to the table. In the Services tab, right click on the STUDENT table and select “Execute Command…” In this window, we will type in one INSERT command for each tuple. We will populate the STUDENT table by using:

INSERT INTO STUDENT VALUES ('17352', 'John Smith', 'SR', 'SWENG', 1186);
INSERT INTO STUDENT VALUES ('19407', 'Prashant Kumar', 'JR', 'CMPSC', 3572);
INSERT INTO STUDENT VALUES ('22458', 'Rene Lopez', 'SO', 'SWENG', 2842);
INSERT INTO STUDENT VALUES ('24356', 'Rachel Adams', 'FR', 'CMPSC', 4235);
INSERT INTO STUDENT VALUES ('27783', 'Julia Zhao', 'FR', 'SWENG', 3983);

Click the “Run SQL” button.
The output window should again indicate “Executed successfully…” five times, one for each of the INSERT commands. Finally, right click on STUDENT and select “View Data…” This should show the tuples you just entered. Now that we have created a database, added a table, and populated the table with a few tuples (rows), we will look at basic queries in the next sub-module.

M 6.6: Basic Retrieval Queries in SQL

There is one basic statement in SQL for retrieving information from (or querying) a database: the SELECT statement. This is a somewhat unfortunate choice of name for the keyword, since there is a SELECT operation in the relational algebra which forms some of the background for SQL. If you look into relational algebra, please note that the relational algebra operation and the SQL statement are NOT the same. The SELECT statement in SQL has numerous options and features. We will introduce them gradually over the next several modules.

Note that the results of many operations will form new tables in SQL. When some of these tables are created, they will contain tuples which are exact duplicates. This is OK in practical SQL, but it violates the definition of a relation in the formal model. When duplicate tuples are not desired, we will see methods for eliminating the duplicates. Since each base table will have a unique primary key, there will not be duplicate tuples in base tables.

The SELECT-FROM-WHERE Structure
The basic form of the SELECT statement is:

SELECT <attribute list>
FROM <table list>
WHERE <condition>;

where
<attribute list> is a list of attribute names whose values are retrieved by the query.
<table list> is a list of the names of the tables required to process the query.
<condition> is a conditional expression that indicates the tuples which should be retrieved by the query.

The basic operators for the condition are the comparison operators: =, <, >, <=, >=, and <>. You should already be familiar with these from C++ or Java. The exception might be the "<>" notation for "not equal".

An example from the database in Figure 1.2, with the attributes named as in Figure 5.6 (the figure is reproduced in sub-module 7.1), would be to find the class rank and major for Julia Zhao. This would be specified as:

SELECT Class_rank, Major
FROM STUDENT
WHERE Name = 'Julia Zhao';

Since this query only requires the STUDENT table, it is the only table name specified in the FROM clause. The query selects (relational algebra terminology) the tuple(s) from the table which meet the condition specified in the WHERE clause. Since the Name attribute is specified as one of the CHAR types, the value specified in the WHERE clause to be matched with values in the database must be enclosed in single quotes. In this case, there is only one tuple where the student's name is Julia Zhao. The result is projected (again relational algebra terminology) on the attributes listed in the SELECT clause. This means that the result has only those attributes listed. The result of the above query would look like:

Class_rank   Major
FR           SWENG

You can consider this query as working by "looping through" all tuples in the STUDENT table and applying the condition from the WHERE clause to each. If the tuple meets the condition, it is placed in what can be considered a "result set," where the term set is used somewhat loosely. If the tuple does not meet the condition, it is ignored. Then the attribute values which are listed in the SELECT clause are displayed for all tuples in the result set.
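As one more single-table sketch before moving on, the following query lists every freshman along with his or her major; with the sample data it would return Rachel Adams and Julia Zhao:

SELECT Name, Major
FROM STUDENT
WHERE Class_rank = 'FR';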
Another example would be to find all Juniors and Seniors and list their student ID, name, and class rank. This would be specified as:

SELECT Student_id, Name, Class_rank
FROM STUDENT
WHERE Class_rank = 'JR' OR Class_rank = 'SR';

Since this query again only requires the STUDENT table, it is the only table name specified in the FROM clause. The query selects (relational algebra terminology) the tuple(s) from the table which meet the condition specified in the WHERE clause. In this case, it selects from the table all tuples where the class rank is either JR or SR. The result is projected on the three attributes listed in the SELECT clause. This means that the result has only those attributes listed. The result of the above query would look like:

Student_id   Name             Class_rank
17352        John Smith       SR
19407        Prashant Kumar   JR

Note that we used the Boolean operator "OR" as part of the WHERE clause. Both AND and OR are valid keywords and perform their normal Boolean function. These are examples of simple SELECT statements using a single table. In the next module we will consider queries that require more than one table to answer the question.

Unspecified WHERE Clause and Use of the Asterisk
It is permissible to specify a query without a WHERE clause. When this is done, all tuples in the table specified in the FROM clause meet the condition and are selected. For example, consider:

SELECT Name
FROM STUDENT;

This would display the names of all the students. Note that by default, the student names can be displayed in any order. This order is usually based on the order in which the result can be retrieved and displayed most quickly. We will see shortly how we can specify an ordering. In order to retrieve all attribute values of the selected tuples, it is not necessary to list all the attribute names. We can simply specify the asterisk (*) in the SELECT clause. Consider the following example query.

SELECT *
FROM STUDENT
WHERE Major = 'SWENG';

This will list all attributes for the students who are majoring in software engineering.

Ordering of Query Results
As mentioned above, the default allows query results to be displayed in any order. The ordering can be specified by using the ORDER BY clause. Consider the following query.

SELECT Name, Advisor_id
FROM STUDENT
ORDER BY Name;

This query would select all student tuples (no WHERE clause) and retrieve the Name and Advisor_id of each. Rather than the default arbitrary order of the tuples, the tuples would be ordered by Name. They would be displayed in ascending order by default. Note that using the full name has kept things simple, but it will cause problems with the ordering, since the sort will be based on first name. This means that "John Smith" will be ordered before "Prashant Kumar" since the "J" sorts before "P" even though Smith sorts after Kumar. First name and last name would need to be separate attributes to perform the sort in what would be considered "normal" order. If a sort in descending order is desired, the keyword DESC is used. For example:

SELECT Name, Advisor_id
FROM STUDENT
ORDER BY Name DESC;

This would perform the same retrieval as the last example, but this time the results would be displayed in descending order. Again, because of the full-name issue, the results would be correct for a sort on the full name, but this would not be what is normally expected. Two or more attributes can be specified in the ORDER BY clause. The primary sort would be on the first attribute listed. A secondary sort would be performed based on the second attribute listed, etc.
SELECT Name, Advisor_id
FROM STUDENT
ORDER BY Advisor_id, Name;

Again, all STUDENT tuples are selected. Name and Advisor_id are displayed for all students. The main ordering would be by the value in the Advisor_id attribute. Since there would often be several students with the same advisor, for those tuples where the value of Advisor_id is the same, the tuples would be ordered by student name within that group. Without adding Name as a secondary sort value, all students would be put in order by Advisor_id value, but when the Advisor_id values are identical, those tuples would appear in arbitrary order within the group of those with the same advisor.

M 6.7: Querying a Database Using SQL

In sub-module 6.5 we built and populated a database using SQL commands. In this sub-module, we will run a few simple queries of the ASSIGNMENT6 database using the SQL SELECT command. You will write some additional queries as part of Assignment 6. To a large extent, this repeats the queries in sub-module 6.6. This module demonstrates running the queries in NetBeans. Please follow along in your own version of NetBeans. At the end of the module you can click to download or view a Word document that contains more detail and screenshots which might be helpful.

The steps for querying a database in NetBeans are very similar to the steps followed earlier in sub-module 6.5. To query an existing database using SQL in NetBeans, perform the following steps.
Open NetBeans.
Click on the Services tab.
Expand the Databases selection by clicking on the + box next to “Databases”.
Right click on the “jdbc:derby://localhost…” entry and then click “Connect”.
Right click on the “jdbc:derby://localhost…” entry again and this time click on “Execute Command…”
Enter the query directly into the screen, or put the query in a simple text document and copy and paste it into the screen.
Click the “Run SQL” button.
View the results.

We will demonstrate this with the queries below. To start with a simple query, let’s show the value of all attributes for all tuples in the table. Note, of course, that this is not practical once the number of tuples gets into the hundreds or thousands. This can be done with the statement:

SELECT *
FROM STUDENT;

Remember that the * indicates that all attributes should be displayed. All tuples are selected by not including a WHERE clause. Enter the above (again, remember you can type the statement into a simple text file and then copy and paste). Then click the “Run SQL” button to view the results.

Next, let’s run the query to find the class rank and major for Julia Zhao. We did this with the following:

SELECT RANK, MAJOR
FROM STUDENT
WHERE SNAME = 'Julia Zhao';

Note that this query uses the upper-case attribute names used in sub-module 6.5 when the database was actually created. Since the SNAME attribute has one of the char data types, there are single quotes around the name we are looking for. This query can be typed in or pasted over the prior query. Then click the “Run SQL” button to view the results.

Now, let’s run the query to list all attributes for the students who are majoring in software engineering. We did this with the following:

SELECT *
FROM STUDENT
WHERE MAJOR = 'SWENG';

Note that this query again uses the upper-case attribute name used in sub-module 6.5 when the database was actually created. Since the MAJOR attribute has one of the char data types, there are single quotes around the value. Enter the query, then click the “Run SQL” button to view the results.
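If you would like one more query to try at this point (an optional extra, not part of Assignment 6), the following combines two conditions with AND; with the sample rows entered in sub-module 6.5 it should return only John Smith:

SELECT SNAME
FROM STUDENT
WHERE MAJOR = 'SWENG' AND RANK = 'SR';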
Finally, we will run two queries to order the output, first in ascending order by student name, then in descending order by student name. To list the name and advisor for each student, with the output ordered by name, we use the ORDER BY clause in a query as follows:

SELECT SNAME, ADVISOR
FROM STUDENT
ORDER BY SNAME;

Again, note that using the full name has kept things simple, but it will cause problems with the ordering, since the sort will be based on first name. "John Smith" will be ordered before "Prashant Kumar" since the "J" sorts before "P" even though Smith sorts after Kumar. First name and last name would need to be separate attributes to perform the sort in what would be considered "normal" order. Enter the query, then click the “Run SQL” button to view the results. To sort in descending order, the keyword DESC is used. For example:

SELECT SNAME, ADVISOR
FROM STUDENT
ORDER BY SNAME DESC;

This would perform the same retrieval as the last example, but this time the results would be displayed in descending order. Enter the query, then click the “Run SQL” button to view the results. We will practice and extend what we did in sub-modules 6.5 and 6.7 in Assignment 6.

M 7.1: Introduction to Joins

The basic use of the SELECT statement was presented in the last module. You might have realized at some point during the module that all questions in that module could be answered by using a SELECT statement that required only one table from the database. This meant that only one table needed to be specified in the FROM clause. However, there are some questions which cannot be answered with information from only one table. Information from two or more tables is required. This type of query is accomplished by joining the tables.

Figure 1.2 with the attributes named as in Figure 5.6

STUDENT
Student_id   Name             Class_rank   Major
17352        John Smith       SR           SWENG
19407        Prashant Kumar   JR           CMPSC
22458        Rene Lopez       SO           SWENG
24356        Rachel Adams     FR           CMPSC
27783        Julia Zhao       FR           SWENG

FACULTY
Faculty_id   Name
2469         Huei Chuang
5368         Christopher Johnson
6721         Sanjay Gupta
7497         Martha Simons

DEPARTMENT
Code    Name                   Building
CMPSC   Computer Science       Hopper Center
MATH    Mathematics            Nittany Hall
SWENG   Software Engineering   Hopper Center

COURSE
Number     Name                                    Credit_hours
CMPSC121   Introduction to Programming             3
CMPSC122   Intermediate Programming                3
MATH140    Calculus with Analytic Geometry         4
SWENG311   O-O Software Design and Construction    3

SECTION
Section_id   Course_number   Semester
044592       CMPSC121        Fall
046879       MATH140         Fall
059834       CMPSC122        Spring
061340       MATH140         Spring
063256       CMPSC121        Fall
063593       SWENG311        Fall

TRANSCRIPT
Student_id   Section_id   Grade
22458        044592       A
22458        046879       B
22458        059834       A
22458        063256       C
24356        061340       A
24356        063256       B

Figure 1.2 An example database that stores information about a university

An example of this, again using the example database from Figure 1.2, would be to list the IDs of all students majoring in software engineering. Since students have their major listed by code, not department name, we need to use the DEPARTMENT table to find the department code for the department named Software Engineering. That code (which turns out to be SWENG) is then used in the STUDENT table to find all students with the major of SWENG. So, in the query, we need information from both the STUDENT and DEPARTMENT tables.
To answer the question asked, the query would be:

SELECT Student_id
FROM STUDENT, DEPARTMENT
WHERE DEPARTMENT.Name = 'Software Engineering' AND Code = Major;

The result would look like:

Student_id
17352
22458
27783

There are several things to note here. Since both the STUDENT and DEPARTMENT tables are needed, both are listed in the FROM clause. Since we need all students whose department name is "Software Engineering", we have the first condition in the WHERE clause. Since Name is the name of an attribute in both tables, if

WHERE Name = 'Software Engineering' AND Code = Major;

had been specified in the WHERE clause, the system would not know whether to use the Name attribute in the STUDENT table or the Name attribute in the DEPARTMENT table. In this database, then, the attribute name Name is ambiguous. Using the same attribute name in different tables is allowed in SQL. However, doing so then requires that in a SELECT statement which includes both tables, the attribute name must be qualified with the name of the table we are referencing. The attribute name is qualified by prefixing the table name to the attribute name and separating the two by a period. Since we need to look up "Software Engineering" in the DEPARTMENT table (and not in the STUDENT table), we must use DEPARTMENT.Name in the WHERE clause. This will select the tuple from the DEPARTMENT table with the name Software Engineering. This part of the WHERE clause is known as the selection condition.

We then must match the Code value from the selected tuple in the DEPARTMENT table with the Major value from tuples in the STUDENT table. When we combine tuples from two tables it is known as a join. How the tuples are combined is specified in the above query by the second part of the WHERE clause, Code = Major. This is known as the join condition and will combine the two tuples where the student's major matches the department code associated with the name Software Engineering. Note that since Code and Major are unique attribute names across the two tables, qualification by table name is not required. It can be added for clarity, but as shown here, it is not required. The attribute specified in the SELECT clause, in this case Student_id, then has its values displayed for all joined tuples which satisfy the conditions in the WHERE clause. As a reminder, we used the Boolean operator "AND" as part of the WHERE clause. Both AND and OR are valid keywords and perform their normal Boolean function.

Another example would be to write a query to list the name of John Smith's advisor. Since the advisor's name is not available in the STUDENT table, to answer this question we need to look up the ID of John Smith's advisor in the STUDENT table. Once we get that ID, we need to go to the FACULTY table and find the name of the faculty member with the ID found in the STUDENT table. That would be accomplished with a join. The query would be:

SELECT FACULTY.Name
FROM STUDENT, FACULTY
WHERE STUDENT.Name = 'John Smith' AND Advisor = Faculty_id;

Looking at the sample data in the tables, it can be seen that the ID of the advisor for John Smith is 1186. There is no faculty member in the sample data with ID 1186. If we assume that in the full database there is a faculty record with a faculty ID of 1186, and the name of the faculty member with this ID is Mary Hart, the result of the above query would look like:

FACULTY.Name
Mary Hart

In this query, the selection condition is STUDENT.Name = 'John Smith'.
All tuples with the matching student name (in this case only one) will be selected. The join condition is Advisor = Faculty_id. This will combine the two tuples where the faculty id in the FACULTY table matches the advisor value from the STUDENT table associated with the name John Smith. From the result of the join, the name of the faculty member is displayed. Note again that the attribute Name is ambiguous since it is used in both tables. It must be qualified by table name wherever it is used in the SELECT statement. In the first query, an ambiguous name was used only in the WHERE clause. In this query it is needed both in the WHERE clause and in the SELECT clause since both contain an ambiguous attribute name. Any number of selection and join conditions may be specified in the WHERE clause of an SQL query. We will look at this in the next sub-module.

M 7.2: Joins - Continued

To continue the discussion of joins, consider the following question. List the names of all courses taken by the student with ID 22458 during the Fall 2017 semester. In looking at the database, course names are only available in the COURSE table. We need to look in the TRANSCRIPT table to see what courses were taken by the student with ID 22458. Unfortunately, that table lists only the unique section ID of the courses taken. To find out which courses the sections belong to, we need to look up the course number which corresponds to the section ID in the SECTION table. So, to answer the question, the query would be:

SELECT Name
FROM COURSE, TRANSCRIPT, SECTION
WHERE TRANSCRIPT.Section_id = SECTION.Section_id
  AND Course_number = Number
  AND Student_id = '22458'
  AND Semester = 'Fall'
  AND Year = 2017;

The result would look like:

Name
Introduction to Programming
Calculus with Analytic Geometry

We can see that this query requires information from three database tables. When three tables are needed, two join conditions must be specified in the WHERE clause. The first join condition, TRANSCRIPT.Section_id = SECTION.Section_id, combines the tuples from the TRANSCRIPT and SECTION tables where the Section_id values in the tuples match. The second join condition, Course_number = Number, joins the SECTION and COURSE tables where the course numbers in the two tables match. Note that the second join condition does not require the attribute names to be qualified by the table names since the attribute names are unique across the two tables. If it seems to clarify the join condition, it can be specified with the qualified names as SECTION.Course_number = COURSE.Number.

You can think of the joins as creating one large tuple containing all the information of the three separate tuples which have been joined together. From these joined tuples, the selection conditions will select all tuples containing student id 22458. From these, it will further select only those tuples from the fall semester of 2017. This will be the final result set, where the Name (in this case the course name, since that is the only attribute in the three tables with that name) will be printed for each tuple in the result.

Note that in the last sub-module, the selection condition was stated first in the WHERE clause and this was followed by the join condition. Here, the join conditions were listed first and these were followed by the selection conditions. Both work the same as far as the system is concerned. Many prefer to list the join conditions first, but the ordering of the conditions is based on user preference as to which seems clearer.
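For readers who prefer to see every attribute qualified, here is the same three-table query written with full qualification. This is only a sketch for comparison; it assumes that Semester and Year are attributes of the SECTION table, which is what the unqualified query implies:

SELECT COURSE.Name
FROM COURSE, TRANSCRIPT, SECTION
WHERE TRANSCRIPT.Section_id = SECTION.Section_id
  AND SECTION.Course_number = COURSE.Number
  AND TRANSCRIPT.Student_id = '22458'
  AND SECTION.Semester = 'Fall'
  AND SECTION.Year = 2017;

Both versions should return the same result.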
Before performing the actual retrieval, the system will examine all conditions and determine the order in which the conditions should be applied in order to produce the results most efficiently. The details of how this is done are a very interesting topic, but will not be covered in this basic course. If you are interested in query processing and query optimization, they are discussed in Chapters 18 and 19 in the text.

Aliasing and Recursive Relationships
In sub-module 4.4 we discussed recursive relationships. The example in that sub-module considered the FACULTY entity. We made the assumption that all department chairs are also faculty. We then considered a SUPERVISES relationship where the chair supervises the other faculty in the department. This relationship involved only the FACULTY entity, but it was on both sides of the relationship. We saw that only one entity instance represented the "chair side" of the relationship, but many entity instances represented the "other faculty in the department" side of the relationship. This makes the cardinality of the relationship 1:N, representing a 1:N unary relationship.

To capture this relationship, we would add an attribute, call it Chair_id, to the FACULTY table. This would be considered a foreign key referencing the Faculty_id of the chair in the same table. This requires us to join the FACULTY table to itself. This can be thought of as two different copies of the table, with one copy representing the role of chair and the other copy representing the role of "regular faculty." To demonstrate how this is used in a SELECT statement, consider a query to list the names of all faculty and the names of the chair supervising each faculty member. To answer the question, the query would be:

SELECT F.Name, C.Name
FROM FACULTY AS F, FACULTY AS C
WHERE F.Chair_id = C.Faculty_id;

Since the Chair_id attribute doesn't actually exist in the example database, an actual result cannot be shown. The join condition will cause each tuple on the "regular faculty" side to be joined with the corresponding tuple on the chair side. However, since there is only a join condition and no selection condition, all joined tuples will be included in the result. From the joined tuples, the name of the faculty member and the name of his/her chair will be displayed. The result table would look like the following:

F.Name               C.Name
first faculty name   corresponding chair name
etc.                 etc.

Since the FACULTY relation is used twice, we need to be able to distinguish which copy of the relation we are referencing. Alternative relation names must be used for this. The alternative names are called aliases or tuple variables. The query above shows that an alias can be specified following the keyword AS. Note that it is also possible to rename the relation attributes within the query by giving them aliases. For example, we can write FACULTY AS F(Fid, Nm, De, Cid) in the FROM clause. This provides an alias for each of the attribute names: Fid for Faculty_id, etc.

M 7.3: Types of Joins

We have looked at joins in the last two sub-modules. What we have done so far is the basic join, more specifically the equijoin. There are additional types of joins available, and they will be discussed below. These types come from relational algebra and, because they are useful in many situations, they have been incorporated into SQL. We will not cover relational algebra in this course, but if you would like additional details, see Sections 8.3.2 and 8.4.4 in the text. In the sections below, a description of each join will be followed by an example.
The examples are based on the following database tables. Note that the tables have been modified slightly from the earlier examples. Also note that both the queries and the results shown might be somewhat different depending on the DBMS you are using. They are shown to illustrate the various points.

STUDENT
Student_id   Name             Class_rank   Major    Advisor
17352        John Smith       SR           SWENG    1186
19407        Prashant Kumar   JR           CMPSC    3572
22458        Rene Lopez       SO           SWENG    2842
24356        Rachel Adams     FR           <null>   4235
27783        Julia Zhao       FR           SWENG    3983

DEPARTMENT
Code    Name                   Building
CMPSC   Computer Science       Hopper Center
MATH    Mathematics            Nittany Hall
SWENG   Software Engineering   Hopper Center

Assume there is a request to list student name and ID along with the code of their major department and the name of the department. Based on what we have done so far, we would write a query like this:

SELECT Student_id, STUDENT.Name, Major, DEPARTMENT.Name
FROM STUDENT, DEPARTMENT
WHERE Major = Code;

This should produce the following result:

Student_id   STUDENT.Name     Major   DEPARTMENT.Name
17352        John Smith       SWENG   Software Engineering
19407        Prashant Kumar   CMPSC   Computer Science
22458        Rene Lopez       SWENG   Software Engineering
27783        Julia Zhao       SWENG   Software Engineering

To look at the full joined table, we could execute:

SELECT *
FROM STUDENT, DEPARTMENT
WHERE Major = Code;

and we would see:

Student_id   STUDENT.Name     Class_rank   Major   Advisor   Code    DEPARTMENT.Name        Building
17352        John Smith       SR           SWENG   1186      SWENG   Software Engineering   Hopper Center
19407        Prashant Kumar   JR           CMPSC   3572      CMPSC   Computer Science       Hopper Center
22458        Rene Lopez       SO           SWENG   2842      SWENG   Software Engineering   Hopper Center
27783        Julia Zhao       FR           SWENG   3983      SWENG   Software Engineering   Hopper Center

So far we have been specifying both the join condition(s) and the selection condition(s) in the WHERE clause. This is often convenient for simple queries. It is also allowable to specify the join condition(s) in the FROM clause. This format may make complex queries more understandable. An example of using this format would be:

SELECT *
FROM STUDENT JOIN DEPARTMENT ON Major = Code;

This query would produce the same result shown above. Note that in this resultant joined table, both the Major and Code attributes are shown. Based on the join condition, the values in the two columns will be identical. To remove one of the duplicate columns, you can specify a natural join. In a natural join, the join is performed on identically named columns. The resultant table has the identically named column(s) repeated only once. The natural join is based on column names, so you do not specify the ON clause. With the above tables, there are no identically named columns on which we want to join, so to perform a natural join we need to use the AS construct to rename one of the relations and its attributes so that one of the attribute names matches an attribute name in the other table. We can perform a natural join with a query like the following:

SELECT *
FROM (STUDENT AS STUDENT (Student_id, Name, Class_rank, Code, Advisor) NATURAL JOIN DEPARTMENT);

The resulting table would look like:

Code    Student_id   STUDENT.Name     Class_rank   Advisor   DEPARTMENT.Name        Building
SWENG   17352        John Smith       SR           1186      Software Engineering   Hopper Center
CMPSC   19407        Prashant Kumar   JR           3572      Computer Science       Hopper Center
SWENG   22458        Rene Lopez       SO           2842      Software Engineering   Hopper Center
SWENG   27783        Julia Zhao       FR           3983      Software Engineering   Hopper Center

The Major (aliased to Code) column is now shown only once in the joined table. This is the distinguishing feature of the natural join. Many systems will show the common name used in the join as the first column(s) in the result.
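If a particular DBMS does not support the renaming form, or if it simply feels awkward, the duplicated column can also be avoided by listing the desired columns once instead of using *. A sketch of the same inner join written that way:

SELECT Student_id, STUDENT.Name, Class_rank, Major, Advisor, DEPARTMENT.Name, Building
FROM STUDENT JOIN DEPARTMENT ON Major = Code;

This returns one copy of the matching value (as Major) along with the remaining columns from both tables.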
The default type of join we have used, and the type used in the earlier examples, is called an EQUIJOIN. Both the equijoin and the natural join are forms of inner joins. An inner join is a join where a tuple is included in the result only if a matching tuple exists in the other relation. In the above example, a student such as Rachel Adams, who does not have a declared major, will not appear in the result. Similarly, a department with no declared majors, such as Math, will also not appear in the result. Sometimes this is desirable; other times it is not. In situations where it is desirable to include tuples without a matching tuple in the other relation, a different type of join, called an OUTER JOIN, must be performed. There are three different variations, and each is described below.

Using the tables above, if we want to show all student tuples whether or not there is a matching tuple in the DEPARTMENT table, we use a LEFT OUTER JOIN. The select statement would be:

SELECT *
FROM (STUDENT LEFT OUTER JOIN DEPARTMENT ON Major = Code);

The resulting table would be:

Student_id   STUDENT.Name     Class_rank   Major    Advisor   Code     DEPARTMENT.Name        Building
17352        John Smith       SR           SWENG    1186      SWENG    Software Engineering   Hopper Center
19407        Prashant Kumar   JR           CMPSC    3572      CMPSC    Computer Science       Hopper Center
22458        Rene Lopez       SO           SWENG    2842      SWENG    Software Engineering   Hopper Center
24356        Rachel Adams     FR           <null>   4235      <null>   <null>                 <null>
27783        Julia Zhao       FR           SWENG    3983      SWENG    Software Engineering   Hopper Center

In this case Rachel Adams does not have a declared major. As such, there is a NULL value in her Major field. Since NULL does not match any department tuple, with the left outer join Rachel's full information is shown and NULL values are shown for the department part of the tuple. I have shown NULL values the way they are shown in Java DB. They might be shown in a different manner in other DBMSs. Note that if Rachel had a value for Major that does not match a department code, the value for Major would be displayed in the joined tuple, but the other values above (those associated with the DEPARTMENT table) would still be NULL.

Similarly, if we wished to display all department tuples whether or not there is a student tuple with that major, we would use a RIGHT OUTER JOIN. The select statement would be:

SELECT *
FROM (STUDENT RIGHT OUTER JOIN DEPARTMENT ON Major = Code);

The resulting table would be (showing the ordering from Java DB):

Student_id   STUDENT.Name     Class_rank   Major    Advisor   Code    DEPARTMENT.Name        Building
19407        Prashant Kumar   JR           CMPSC    3572      CMPSC   Computer Science       Hopper Center
<null>       <null>           <null>       <null>   <null>    MATH    Mathematics            Nittany Hall
17352        John Smith       SR           SWENG    1186      SWENG   Software Engineering   Hopper Center
22458        Rene Lopez       SO           SWENG    2842      SWENG   Software Engineering   Hopper Center
27783        Julia Zhao       FR           SWENG    3983      SWENG   Software Engineering   Hopper Center

Again, note that the Code for the Mathematics department is not NULL. However, there is no matching Major value in the student tuples. As such, the Mathematics department is represented, but the values for the student side of the tuple are shown as NULL. Unlike the last example, the value for Code is present here (not NULL), so the Code value is shown in the joined tuple.

The final type of outer join is the FULL OUTER JOIN. The concept is that all tuples from both tables are included in the result.
When there is a match, information from both matching tuples is included in the joined tuple. If there is no match for a tuple in the left table, the values for the right table appear as NULL in the joined tuple. If there is no match for a tuple in the right table, the values for the left table appear as NULL in the joined tuple. The SQL statement would be:

SELECT *
FROM (STUDENT FULL OUTER JOIN DEPARTMENT ON Major = Code);

The result would be a "combination" of the above two result tables. The full outer join is now in the SQL standard, but many DBMSs have still not implemented it. It has not yet been implemented in Java DB.

A final type of "join" is the full Cartesian product of two tables. This pairs every tuple in the left table with every tuple in the right table. There is no matching of attribute values in this type of join. This may produce many, many joined tuples and is useful in very few cases. Since there is a complete pairing, the resulting set has m * n tuples, where m is the number of tuples in the left table and n is the number of tuples in the right table. Since the Cartesian product is also known as a cross product, it is known as a CROSS JOIN in SQL syntax. This is implemented in Java DB. The syntax is:

SELECT *
FROM (STUDENT CROSS JOIN DEPARTMENT);

No join conditions are specified since all pairings of tuples are included. In our small example tables, we have five student tuples and three department tuples. This produces a result set with 5 * 3 or 15 tuples. Again, caution is called for when using the cross join since the result set can be very large. As mentioned at the beginning of this sub-module, if you are interested, additional details can be found in the relational algebra chapter in the text (Chapter 8), particularly Sections 8.3.2 and 8.4.4.

M 7.4: Aggregate Functions

Aggregate functions are provided to summarize (or aggregate) information from multiple tuples into a single tuple. There are several aggregate functions which are built into SQL. These include COUNT, SUM, MAX, MIN, and AVG. The COUNT function returns the number of tuples retrieved by a query. The other four functions are applied to a set of numeric values and return the sum, maximum, minimum, and average of the values in the set. These functions can be used in a SELECT clause, as will be demonstrated below. They can also be used in a HAVING clause, as discussed in the next sub-module. MAX and MIN can also be used with attributes which are not numeric as long as the domain of the attribute provides a total ordering of possible values. This means that for any two values in the domain, it can be determined which of the two comes first in the defined order. An example of this would be alphabetic strings, which can be ordered alphabetically. Additional examples are the DATE and TIME domains, which we have not yet discussed.

Although there are some numeric values in the example database we have been using, the attributes in that database do not provide good examples of the aggregate functions. For example, it makes no sense, in general, to ask for the maximum student id or for the sum of the student ids. As a better example, assume there is a small database table, called SCORES, which contains the scores of students for a particular test. The attributes will be only Student_id and Score, where Score is a number between 0 and 100. It then does make sense to ask for the maximum score, the minimum score, and the average score on the test.
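If you want to experiment with these functions yourself, a minimal version of the hypothetical SCORES table could be created and populated along the following lines. The column types and the sample rows are assumptions made purely for illustration (the Major attribute used later in this module could be added as an additional column):

CREATE TABLE SCORES (
    Student_id CHAR(5) NOT NULL,
    Score INT,
    PRIMARY KEY (Student_id)
);
INSERT INTO SCORES VALUES ('17352', 88);
INSERT INTO SCORES VALUES ('19407', 95);
INSERT INTO SCORES VALUES ('22458', 76);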
The maximum score can be found by using the following query.

SELECT MAX(Score)
FROM SCORES;

This will return a single row containing the highest score on the test. Several aggregate functions can be included as part of the SELECT clause. For example,

SELECT MAX(Score), MIN(Score), AVG(Score)
FROM SCORES;

will return a single row containing the highest score, the lowest score, and the average (mean) score on the test. The COUNT function can be used to display the number of tuples. Continuing with the above example,

SELECT COUNT(*)
FROM SCORES;

will display the number of students (and thus the number of tuples) in the table. The use of (*) with the COUNT function is common. However, using the * will provide a count of all tuples, including those which contain NULL values in certain attributes. Often this is the desired result, but not always. The use of an attribute name instead of * is discussed below.

Up to this point, we have applied the aggregate functions to the entire table since the WHERE clause was not included. We can, however, include the WHERE clause to choose only some of the tuples to be used. To expand the above example, assume the SCORES table also includes an attribute named Major which contains the code for the student's major. Further assume that the course is open to all students. It is primarily taken by CMPSC and SWENG students, but a few students from various other majors also take the course. If we want the test summary for only the SWENG students, the following query can be used.

SELECT MAX(Score), MIN(Score), AVG(Score)
FROM SCORES
WHERE Major = 'SWENG';

To determine how many SWENG students took the test, we can use:

SELECT COUNT(*)
FROM SCORES
WHERE Major = 'SWENG';

Here, all tuples are retrieved where the value in the Major attribute is SWENG. The tuples retrieved by this condition in the WHERE clause are then counted and the number of tuples is returned.

The * in the above COUNT examples causes the number of tuples to be counted. It is also possible to specify an attribute rather than the *. We could have specified:

SELECT COUNT(Score)
FROM SCORES;

This will often result in the same count as when the * is specified. The difference is that specifying an attribute name will count tuples with non-NULL values for the attribute rather than all tuples. If all tuples have a value in the specified attribute (column), the count will be the same. However, if any of the tuples have a NULL for the value in Score, those tuples will not be counted. In general, when an attribute is specified in an aggregate function, any NULL values are discarded before the function is applied to the values. If we desire to count only the number of unique scores on the exam, we can include DISTINCT, as in:

SELECT COUNT(DISTINCT Score)
FROM SCORES;

This will eliminate duplicate scores and provide a count of the number of unique scores on the test.

M 7.5: GROUP BY and HAVING Clauses

In the last sub-module, we applied aggregate functions to either the tuples in the entire table or to all tuples which satisfied the selection condition in the WHERE clause. It is sometimes desirable to apply an aggregate function to subgroups of tuples in a table. Consider the small SCORES table from the last sub-module where the attributes are Student_id, Score, and Major. Score is a number between 0 and 100. Suppose we want to find the average score on the test for each major. To do this, we need to partition the table into non-overlapping subsets of tuples. These subsets are referred to as groups in SQL.
Each group will consist of tuples that have the same value for an attribute or attributes, called the grouping attribute(s). The aggregate function is then applied to each subgroup independently. This produces the summary information about each group. The GROUP BY clause is used in SQL to accomplish the grouping. The grouping attribute(s) should be listed as part of the SELECT clause. This provides a "label" for the value(s) shown in the aggregate function. The grouping would be accomplished by the query:

SELECT Major, AVG(Score)
FROM SCORES
GROUP BY Major;

The results would look something like the following:

Major   AVG(Score)
CMPSC   88
SWENG   90
MATH    86
PSYCH   85

The query will run if you do not follow the above suggestion and you leave the grouping attribute off of the SELECT clause. For example, the query:

SELECT AVG(Score)
FROM SCORES
GROUP BY Major;

will run, and the result will look something like:

AVG(Score)
88
90
86
85

A result display such as this is usually somewhat meaningless. However, it can be the desired outcome in some instances. Note that the rows in the above queries are not ordered. ORDER BY can be used to specify a row order in this type of query also. Further note that when GROUP BY is used, non-aggregated attributes (such as Major above) listed in the SELECT clause must be included in the GROUP BY clause. For example, in the first query above, it would cause an error to start the SELECT clause with "SELECT Major, Student_id, AVG(Score)". Because Major is specified in the GROUP BY clause, the same value of Major belongs to everyone in the group, so a single value applies to the whole group and this value can be listed in the result. However, many different Student_id values are contained within a given group, so a single value cannot be listed. Since the DBMS does not have an appropriate value to list, an error condition is generated.

The number of students in each group can be included with the following:

SELECT Major, COUNT(*), AVG(Score)
FROM SCORES
GROUP BY Major;

This would provide a result similar to:

Major   COUNT(*)   AVG(Score)
CMPSC   14         88
SWENG   12         90
MATH    3          86
PSYCH   1          85

Conditions can also be put on groups. The selection conditions put in the WHERE clause are used to limit the tuples to which the functions are applied. For example, if we only wish to include tuples where the major is either SWENG or CMPSC, we would use:

SELECT Major, COUNT(*), AVG(Score)
FROM SCORES
WHERE Major = 'SWENG' OR Major = 'CMPSC'
GROUP BY Major;

This would result in a table that looks like the following, since only tuples with one of the two majors are included in the set to which the aggregate functions are applied.

Major   COUNT(*)   AVG(Score)
CMPSC   14         88
SWENG   12         90

The HAVING clause is used to provide for selection based on groups that satisfy certain conditions, rather than applying conditions to tuples prior to the grouping. For example, to list results for only those majors which have at least two students in the class, the following query would be used:

SELECT Major, COUNT(*), AVG(Score)
FROM SCORES
GROUP BY Major
HAVING COUNT(*) >= 2;

That would produce the following:

Major   COUNT(*)   AVG(Score)
CMPSC   14         88
SWENG   12         90
MATH    3          86
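WHERE and HAVING can also appear together in a single statement: the WHERE condition filters individual tuples before the groups are formed, and the HAVING condition then filters the groups themselves. A sketch (the cutoff of 60 is just an illustrative value, not something from the examples above):

SELECT Major, COUNT(*), AVG(Score)
FROM SCORES
WHERE Score >= 60
GROUP BY Major
HAVING COUNT(*) >= 2;

This would report, for each major with at least two students scoring 60 or better, how many such students there are and their average score.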
M 7.6: Comparisons Involving NULL and Three-Valued Logic

As discussed in sub-module 5.1.1, NULL is used to represent a missing value, but it can be used in three different cases.
1. The value is unknown. An example would be if a person's birth date is not known. The value would be represented as NULL in the database.
2. The value is unavailable or withheld. An example would be that a person has a home phone, but does not wish to provide the value for the database. Again, the home phone would be represented as a NULL in the database.
3. The value is not applicable. For example, suppose the database splits the name into first name, middle name, last name, and suffix. The suffix would be a value such as Jr. or III. In many cases this would not apply to a particular individual, so the suffix would be represented as a NULL in the database.

Sometimes it can be determined which of the three cases is involved, but other times it cannot. If you consider a NULL in a home phone attribute, it could be NULL for any of the three reasons. Therefore, SQL does not try to distinguish among the various meanings of NULL. SQL considers each NULL value to be different from every other NULL value in the tuples. When a tuple with a NULL value for an attribute is involved in a comparison operation, the result is deemed to be UNKNOWN, meaning it could be TRUE or it could be FALSE. Because of this, the standard two-valued (Boolean) AND, OR, and NOT logic does not apply. It must be modified to use a three-valued logic.

The three-valued logic is shown in the following tables, which can be read as follows. In the first two tables, the operator is shown in the upper left-hand corner of the table. The values in the first column represent the first operand, and the values in the first row represent the second operand. The value at the intersection of a row and column is the result. For example, the result of (TRUE AND UNKNOWN) is UNKNOWN. The value of (TRUE OR UNKNOWN) is TRUE. The third table represents the unary NOT operator. The first column represents the operand, while the second column shows the result in the corresponding row.

AND       TRUE      FALSE   UNKNOWN
TRUE      TRUE      FALSE   UNKNOWN
FALSE     FALSE     FALSE   FALSE
UNKNOWN   UNKNOWN   FALSE   UNKNOWN

OR        TRUE   FALSE     UNKNOWN
TRUE      TRUE   TRUE      TRUE
FALSE     TRUE   FALSE     UNKNOWN
UNKNOWN   TRUE   UNKNOWN   UNKNOWN

NOT
TRUE      FALSE
FALSE     TRUE
UNKNOWN   UNKNOWN

In general, when used in a WHERE clause, only those combinations of tuple values that result in a value of TRUE are selected. When the result is either FALSE or UNKNOWN, the tuple is not selected.

It is possible to check whether an attribute value is NULL. Instead of using = or <> to compare an attribute value to NULL, the comparison operators IS and IS NOT are used. Since all NULL values are considered distinct from all other NULL values, an equality comparison will not work for NULL values. In our example database, if a student is allowed to not have a declared major, the value of NULL is stored in the Major column for that student. The following query would be used to list the students who currently do not have a declared major.

SELECT Student_id, Name
FROM STUDENT
WHERE Major IS NULL;
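The complementary test, IS NOT NULL, selects the tuples that do have a value. For example, the following sketch lists the students who have declared a major:

SELECT Student_id, Name
FROM STUDENT
WHERE Major IS NOT NULL;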
M 8.1: Substring Pattern Matching and Arithmetic Operators

Before moving on to subqueries, which are the main topic of this module, this sub-module covers two smaller topics not yet addressed: substring pattern matching and arithmetic operators.

Substring Pattern Matching
So far, when comparing strings, we have used the equality comparison operator. This requires an exact match to the value in the character field, and this is often what we want. For example:

SELECT *
FROM STUDENT
WHERE MAJOR = 'SWENG';

will retrieve all tuples where the value in the MAJOR attribute is exactly 'SWENG'. That is the desired action in this case. Similarly, the query

SELECT RANK, MAJOR
FROM STUDENT
WHERE SNAME = 'Julia Zhao';

will retrieve all tuples where the value in the SNAME attribute is exactly 'Julia Zhao'. This is the desired action in this case as long as we know (remember) Julia's first and last name and the correct spelling of each. It also requires that the name be entered correctly in the database. For example, if the person entering Julia's name inadvertently put two spaces between her first and last name, the above query would not retrieve the tuple. For these and other reasons, SQL provides the LIKE comparison operator, which can be used for matching patterns in strings. Partial strings can be specified using two reserved characters: % can be replaced by any number of characters (zero or more), and the underscore (_) is replaced by exactly one character.

In the above example, if we remember the name but can't remember whether it is Julia or Julie, the query can be stated as:

SELECT SNAME
FROM STUDENT
WHERE SNAME LIKE 'Juli_ Zhao';

The _ will match any single character. In this case it will pick up either Julie or Julia. Of course, it will also pick up Julib and Juliz. We can accept this since it is unlikely that such a misspelling will be in the database. If we remember that her name is Julia, but can't remember anything about her last name except that it begins with Z, the query can be stated as:

SELECT SNAME
FROM STUDENT
WHERE SNAME LIKE 'Julia Z%';

The % will match any string of zero or more characters. In this case it will find Julia Zhao. It will also find Julia Zee and Julia Zimmerman, as well as any other Julias whose last name starts with Z. Since we assume that even for a moderately sized database the retrieved set of tuples will be somewhat small, we are willing to look through the retrieved tuples to find the Julia Z we want.

Suppose we want to find a faculty member whose first name we don't know, and where all we know about his last name is that it starts with Ch. We can use the following query:

SELECT FNAME
FROM FACULTY
WHERE FNAME LIKE '%Ch%';

This will find all faculty with either a first name or a last name beginning with 'Ch'. Note that since % can match zero characters, it will find first names beginning with 'Ch'. Since % can match many characters, the % can be matched against the first name and the space between names, which will then match last names beginning with 'Ch'. The pattern match is case sensitive, so it will not match a 'ch' in the middle of either the first or last name, such as in 'Zachary White'. As indicated in prior modules, first and last names are usually stored as separate attributes. The above is just a sample of where the LIKE comparison operator can be useful.

Arithmetic Operators in Queries
SQL allows the use of arithmetic in queries. The arithmetic operators for addition (+), subtraction (-), multiplication (*), and division (/) can be used with numeric values or attributes with numeric domains. To see what would happen if all faculty were given a 5% raise, the following query could be used.

SELECT FNAME, SALARY, 1.05 * SALARY AS RAISE
FROM FACULTY;

This would list the name, current salary, and proposed salary with a 5% raise for each tuple in the FACULTY table. To see just the total impact on the budget, the query can be stated as:

SELECT SUM(SALARY), SUM(1.05 * SALARY) AS RAISE
FROM FACULTY;

This would show the current total of salaries as well as the total of the proposed salaries. Either of the two prior queries can be stated with a WHERE clause.
For example, to see the impact of giving the 5% raise to only the faculty in the Software Engineering department, the query would be:

SELECT SUM(SALARY), SUM(1.05 * SALARY) AS RAISE
FROM FACULTY
WHERE DEPT = 'SWENG';

An additional comparison operator, BETWEEN, is provided for convenience. An example is:

SELECT *
FROM FACULTY
WHERE SALARY BETWEEN 50000 AND 70000;

This would show the faculty tuples for all faculty whose salary is between $50,000 and $70,000 (inclusive). The query can also be stated as:

SELECT *
FROM FACULTY
WHERE (SALARY >= 50000) AND (SALARY <= 70000);

Both queries return the same results, but many users find it easier to use the first query.

M 8.2: Introduction to Subqueries

In the last two modules we examined both basic SQL queries and SQL queries requiring the use of a join. This module introduces an additional construct known as nested queries or subqueries. In subqueries, complete select-from-where statements are nested as blocks within another SQL query. The query containing the nested query is known as the outer query.

Consider the query: find the name of the department of the major for the student with ID 17352. To answer this query we need to find the student tuple with id 17352 and get the major code from that tuple. We then need to look in the DEPARTMENT table to find the tuple of the department with the matching code. From that tuple, we find the name of the department. One way to do this is a "manual" method. First run the following query:

SELECT Major
FROM STUDENT
WHERE Student_id = '17352';

This will find the tuple for the student with id 17352 (the John Smith tuple in the example database) and display the major (which is SWENG) from the tuple. We can write down this information using paper and pencil. We then run the following, inserting the "SWENG" value we wrote down from the first query:

SELECT Name
FROM DEPARTMENT
WHERE Code = 'SWENG';

This will find the tuple for the SWENG department and list the name of the department contained in the tuple, which is Software Engineering. While this "manual" two-step process works nicely to demonstrate the point, it doesn't work well in practice, especially when we need to write down more than one or two short facts. As shown in the last module, a second way to handle this is to use a join. That approach saves the "manual" step and retrieves the information with the following query:

SELECT DEPARTMENT.Name
FROM STUDENT, DEPARTMENT
WHERE Major = Code
  AND Student_id = '17352';

Looking at this query, it may be viewed as finding the tuple from the STUDENT table with student id 17352, and then joining that tuple to the tuple from the DEPARTMENT table where the major/code matches. From the joined tuple, the department name is shown. Another way to look at this, however, is to join all the matching tuples first, and then select all joined tuples with student id 17352 (there will be only one such tuple - the joined John Smith tuple in the example database). From the selected tuples from the join, the department name is displayed. In reality, the same result (Software Engineering) will be displayed by following either process. Which method is used will be determined by the steps of query optimization. Query optimization is a topic that will not be covered in this course. If you are interested in this topic, there is more information provided in Chapters 18 & 19 in the text.

Another way this query can be addressed is by using a subquery. This is a solution provided by SQL to model the "manual" approach given above.
To use this approach, we will create an inner or nested query to answer the first part of the question (what is the major code for the student with ID 17352?). We will then use the answer to this inner query to form part of the outer query. In this example, the outer query will answer the second part of the question: now that we know the major (which is a department code) for the student with ID 17352, what is the department name that matches the code? The query would be stated as:

SELECT Name
FROM DEPARTMENT
WHERE Code = (SELECT Major
              FROM STUDENT
              WHERE Student_id = '17352');

The inner query is executed first, then the outer query.

This sub-module will close with an observation. Those who are new to writing SQL SELECT statements, but who have a programming background, will be tempted to avoid using joins and instead use subqueries. Try to resist this tendency and use joins unless there is a good reason to use a subquery. Joins often seem a bit strange at first, but mastering them is important if you want to continue with database work. Another advantage of using joins is that multiple levels of nested subqueries can be difficult to write correctly and debugging them can prove difficult. Having said that, there are certain cases where a subquery must be used to obtain the correct results. This will be discussed in the next sub-module.

M 8.3: When Subqueries are Required

Although there are many cases where either a join or a subquery can be used to answer a query, there are some cases where a join will not work and a subquery must be used. Consider the example from the last module. We described a small database table, called SCORES, which contains the scores of students for a particular test. The attributes are only Student_id and Score, where Score is a number between 0 and 100. Suppose we want to know the id of the student (or students) who have the highest score on the exam. At first glance, it seems a relatively simple query will work:

SELECT Student_id, MAX(Score)
FROM SCORES;

Or possibly:

SELECT Student_id
FROM SCORES
WHERE Score = MAX(Score);

Unfortunately, neither of these queries will work. Both return an error. The issue is that the above queries actually have two very different parts. First, the system must find the maximum score on the exam. Then the system must find the student or students who achieved this score. It is asking the system to perform two separate operations and then apply one to the other in the correct order. This is asking the system to do something it is not able to do. In this case, we must work more like a programmer and specify the two steps in the correct order. We must specify that the system should find the maximum score on the exam first, and then use that information to determine which students have that score. The first question is presented in the subquery and the answer from the subquery (the maximum score) is used in the main query to determine which student(s) earned that score. The query can be answered by using:

SELECT Student_id
FROM SCORES
WHERE Score = (SELECT MAX(Score)
               FROM SCORES);

Here the highest score on the exam is found in the subquery. The outer query then asks to retrieve the tuple(s) of the students whose score value is equal to the score found in the subquery.
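The same pattern applies whenever the outer query needs to compare against a value that must be computed first. For instance, a sketch that lists the students who scored above the class average:

SELECT Student_id
FROM SCORES
WHERE Score > (SELECT AVG(Score)
               FROM SCORES);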
Consider a second example where we use the expanded SCORES table from the last module, which also includes an attribute named Major containing the code for the student's major. We now want to find the student IDs of the Software Engineering major(s) who had the highest score on the exam (among the Software Engineering majors). The query would be posed this way:

SELECT Student_id
FROM SCORES
WHERE Major = 'SWENG'
  AND Score = (SELECT MAX(Score)
               FROM SCORES
               WHERE Major = 'SWENG');

This query adds the additional restriction that we want to look only at students in the SWENG major. One question the above SELECT statement raises is why the condition Major = 'SWENG' is listed in both the outer query and the subquery. In the subquery it is needed so that only the tuples where the student is a SWENG major are selected, and the maximum score is therefore determined over only those tuples. If a CMPSC major has a higher score, that is not of interest for this query. However, since only tuples with a major of SWENG are used to calculate the maximum score, why can't the query simply be stated as:

SELECT Student_id
FROM SCORES
WHERE Score = (SELECT MAX(Score)
               FROM SCORES
               WHERE Major = 'SWENG');

This seems like it will also answer the question that was asked; however, the query is not correct. The subquery only returns a single number: the highest score on the exam for a SWENG student. If that score is, say, 95, the outer query is working with the equivalent condition of Score = 95. In the second query, all tuples (which includes all majors) will be searched and those tuples with a score of 95 will be returned, regardless of major. This will list all SWENG students with a score of 95 (there will be at least one), but it will also list students from other majors who scored 95 on the exam. There may be no such student, but there also may be one or several. These student IDs will also be listed in the result. Note that it is also possible that one or more students from other majors actually scored higher than 95. Since an equality test was stated (Score =), these students will not be listed. That is why WHERE Major = 'SWENG' must also be listed in the outer query. Once the maximum score is returned (again, use 95 as an example), we only want the tuples of SWENG majors returned by the outer query.

M 8.4: Operators IN, NOT IN, ALL, SOME, ANY

The IN and NOT IN Operators
The examples in the last sub-module used the = comparison operator with the results of the subquery. This works in cases where the result from the subquery is a single value. If you look back at the last sub-module, you will see that this was the case in all the queries shown there. However, in the general case, the subquery will return a table, which is a set of tuples. When a table is returned from the subquery, the comparison operator IN is used. Note that the IN operator is more general than =, and it can be used as well as = when the subquery returns a single value. For example, the query from the last sub-module:

SELECT Student_id
FROM SCORES
WHERE Score = (SELECT MAX(Score)
               FROM SCORES);

can also be stated as:

SELECT Student_id
FROM SCORES
WHERE Score IN (SELECT MAX(Score)
                FROM SCORES);

This query will return the same results as the first query. However, when a single value is returned, the first query is preferred. When a table of values is (or can be) returned, the second form is required. For example, the query

SELECT Student_id
FROM SCORES
WHERE Score = (SELECT Score
               FROM SCORES
               WHERE Score > 80);

will return an error. Since the inner query returns a list of values, the WHERE clause in the outer query cannot compare "Score =" against a set of values. In this case, IN must be used in the query.
SELECT Student_id
FROM SCORES
WHERE Score IN (SELECT Score
                FROM SCORES
                WHERE Score > 80);

This will return the desired result.

The opposite of the IN operator is the NOT IN operator. This selects all tuples which do not match the criteria. For example,

SELECT Student_id
FROM SCORES
WHERE Score NOT IN (SELECT MAX(Score)
                    FROM SCORES);

will list the ids of all students who did not receive the highest score on the test.

Some DBMSs support using tuples with the IN clause. Assuming a larger set of scores in the above database, to find out if any other students have the same major and score as the student with id 12345, you can write the following query:

SELECT Student_id
FROM SCORES
WHERE (Major, Score) IN (SELECT Major, Score
                         FROM SCORES
                         WHERE Student_id = '12345');

The subquery returns the major and score for the student with id 12345. The subquery will find only one tuple which matches the given student id, but it returns a tuple (Major, Score), not just a single value. The outer query then matches all tuples in the SCORES table against the (Major, Score) pair returned from the subquery. It will then display the student id of all students with the same score and the same major as student 12345. Not all DBMSs accept this tuple syntax in the IN clause. Java DB (Derby) does not support the tuple syntax.

Subquery Comparison Operators ALL, SOME, ANY

The ALL keyword causes the comparison to be applied to all tuples in a given set. As an example, consider the query:

SELECT Student_id
FROM SCORES
WHERE Score < ALL (SELECT Score
                   FROM SCORES
                   WHERE Major = 'SWENG');

The subquery first selects all tuples where the major is SWENG. The outer query then compares each tuple's score against the entire set returned by the subquery and selects the tuples with scores less than those of all of the SWENG majors (i.e. each and every SWENG major). In other words, the outer query finds the tuples which have a score lower than the lowest score among all SWENG majors and then displays the ids of such students. In addition to the < comparison operator, the =, >, >=, <=, and <> comparison operators can be used with ALL.

In addition to ALL, the SOME and ANY keywords can be used. These two keywords are synonyms and have the same effect. For example,

SELECT Student_id
FROM SCORES
WHERE Score < ANY (SELECT Score
                   FROM SCORES
                   WHERE Major = 'SWENG');

The subquery will again first select and return all tuples where the major is SWENG. The outer query then compares each tuple's score against the returned set and selects the tuples with scores less than at least one of the score values in the set of SWENG major scores. In other words, the outer query finds the tuples which have a score lower than the highest score among the SWENG majors and then displays the ids of such students. As with ALL, in addition to the < comparison operator, the =, >, >=, <=, and <> comparison operators can also be used.

M 8.5: Correlated Nested Queries

When a condition in the WHERE clause of a nested query references an attribute of a relation declared in the outer query, the two queries are said to be correlated. For example, consider the database you created in Assignment 7, which contained a FACULTY table consisting of several attributes, including the faculty id (FAC_ID), the name of the faculty member (FNAME), and the faculty id of the chair of the faculty member's department (CHAIR). Suppose we want to write a query to list the id and name of each faculty member who reports to a chair.
This can be done using a join with the query:

SELECT F.FAC_ID, F.FNAME
FROM FACULTY AS F, FACULTY AS C
WHERE F.CHAIR = C.FAC_ID;

Conceptually, this can be viewed as having two copies of the FACULTY table. One copy, aliased to F, can be thought of as the normal view of the table. The second copy, aliased to C, can be thought of as an additional view of the table. The join builds a tuple by appending a tuple from the C table to a tuple from the F table where the value of the FAC_ID attribute in the C table matches the value of the CHAIR attribute in the F tuple.

In a correlated nested query, we would use the following.

SELECT F.FAC_ID, F.FNAME
FROM FACULTY AS F
WHERE F.FAC_ID IN (SELECT F.FAC_ID
                   FROM FACULTY AS C
                   WHERE F.CHAIR = C.FAC_ID);

Here, the subquery selects all tuples where the value in the chair field of a tuple in the "faculty copy" of the FACULTY table (the F table) matches a value in the faculty id field of a tuple in the "chair copy" of the FACULTY table (the C table). The tuples selected will be, in effect, all those faculty who have a value in their chair field. The subquery then produces a result containing the faculty id (of the "faculty copy") for all retrieved tuples. As a result of the WHERE clause, these tuples are used by the outer query, and the faculty id and name values in the "faculty copy" of these tuples are displayed.

A note about the scope of names is useful at this point. In the first query, there are two copies of the FACULTY table in the query, one given the alias name of F, the other the alias name of C. Since aliases were not provided for the attribute names in either table, the attribute names of both copies of the table are in scope for the entire SELECT statement. Using any of the attribute names without qualification will produce an error since the name is ambiguous. In the second query, all of the attribute names were again qualified. This does not produce an error, and it helps clarify which copy of the table we want to use in the various parts of the query. However, with nested queries, the scope of an unqualified name is affiliated with the table or tables specified in that part of the query. In the query above, any unqualified name in the subquery will reference the C copy of the table. Any unqualified name in the outer query will reference the F copy of the table. Based on this, any F qualifier can be dropped in the outer query, and any C qualifier can be dropped in the subquery. This means that the above query can be given as

SELECT FAC_ID, FNAME
FROM FACULTY AS F
WHERE FAC_ID IN (SELECT F.FAC_ID
                 FROM FACULTY AS C
                 WHERE F.CHAIR = FAC_ID);

While this version of the query will produce results identical to those produced by the query above, it is harder to read quickly. It seems easier to understand the intent of the query when all attribute names are fully qualified.

M 8.6: The EXISTS Function

EXISTS is a Boolean function which returns TRUE or FALSE. The EXISTS function in SQL tests whether the result of the subquery is empty (returned no tuples) or not. EXISTS will return TRUE if the result contains at least one tuple. If no tuples are in the result, EXISTS returns FALSE. Consider again the database you created in Assignment 7. We can restate the query from sub-module 8.5 using the EXISTS function. The query wishes to list the id and name of each faculty member who reports to a chair.
A query which uses EXISTS to accomplish this is:

SELECT FAC_ID, FNAME
FROM FACULTY AS F
WHERE EXISTS (SELECT *
              FROM FACULTY AS C
              WHERE F.CHAIR = C.FAC_ID);

As in the last sub-module, the subquery selects all tuples where the value in the chair field of a tuple in the "faculty copy" of the FACULTY table matches a value in the faculty id field of a tuple in the "chair copy" of the FACULTY table. The tuples selected will be, in effect, all those faculty who have a value in their chair field. The subquery then produces a result containing all information (of the "faculty copy") for all retrieved tuples. In this query, it is irrelevant which result attributes are listed, since the EXISTS function only cares whether or not a tuple is retrieved by the subquery. When such a tuple exists, the id and name of the faculty member are displayed.

This can also be thought of as taking each faculty tuple in the "faculty copy" one at a time and evaluating the subquery using the chair field from the tuple and matching it with the faculty id field from the "chair copy" of the FACULTY table. If there is a match, EXISTS returns TRUE and the faculty id and name from the "faculty tuple" are displayed. If there is no match, EXISTS returns FALSE (this faculty member does not report to a chair) and nothing is displayed for this tuple.

The opposite can be accomplished by using the NOT EXISTS function. This function will return TRUE if no tuples are in the result. If the result contains at least one tuple, NOT EXISTS returns FALSE. For example, to list the id and name of each faculty member who does not report to a chair, the following query can be used.

SELECT FAC_ID, FNAME
FROM FACULTY AS F
WHERE NOT EXISTS (SELECT *
                  FROM FACULTY AS C
                  WHERE F.CHAIR = C.FAC_ID);

Here, again, the subquery selects all tuples where the value in the chair field of a tuple in the "faculty copy" of the FACULTY table matches a value in the faculty id field of a tuple in the "chair copy" of the FACULTY table. The tuples selected will be, in effect, all those faculty who have a value in their chair field. When no such tuple exists, the id and name of the faculty member are displayed.

This can also be thought of as taking each faculty tuple in the "faculty copy" one at a time and evaluating the subquery using the chair field from the tuple and matching it with the faculty id field from the "chair copy" of the FACULTY table. If there is no match, NOT EXISTS returns TRUE and the faculty id and name are displayed. If there is a match, NOT EXISTS returns FALSE (this faculty member does report to a chair) and nothing is displayed for this tuple.

M 8.7: Additional Functions and Features

There are additional functions and features related to subqueries which we will not cover in this course. Some of the following are not available on all DBMSs. These topics are listed below.

There is a UNIQUE function. It is discussed very briefly in the text at the end of Section 7.1.4.

A subquery can be used in GROUP BY (Section 7.1.8).

There is a WITH construct briefly described in the first part of Section 7.1.9. This is similar to creating a view, which will be discussed in the next module.

There is a CASE construct briefly described in the second part of Section 7.1.9. This can be used when a value can be different based on certain conditions. It is somewhat similar to the switch statement in C++.

Recursive queries are briefly discussed in Section 7.1.10. An example of how such queries can be useful would be an employee table where each employee's direct supervisor is kept as an attribute in the employee tuple.
The assumption is that, unlike the faculty table in the example we have been using, where only the chair is stored in the table (one level of supervision), several layers of supervisors are stored (multiple levels: supervisors have supervisors, etc.). Certain types of queries can use this feature.

It isn't shown in the text, but it is possible to use a subquery in the FROM clause. This is useful in some cases; a brief sketch is shown at the end of this list.

Subqueries can also be used in INSERT, DELETE, and UPDATE statements. We will look at this in the next module.
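As a rough sketch of the FROM-clause case mentioned above, the SCORES table from earlier in this module can be used to find, for each major, the student(s) who earned the top score in that major. The derived-table alias T and the column name Top_Score are simply illustrative choices, not names from the text, and support for this syntax varies by DBMS.

SELECT S.Student_id
FROM SCORES AS S,
     (SELECT Major, MAX(Score) AS Top_Score
      FROM SCORES
      GROUP BY Major) AS T
WHERE S.Major = T.Major
AND S.Score = T.Top_Score;

Here the subquery in the FROM clause builds a temporary two-column table of majors and their highest scores, and the outer query joins the SCORES table against that temporary table.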
M 9.1: Specifying Constraints as Assertions

In this sub-module we will discuss the CREATE ASSERTION statement in SQL. Back in Module 5 we discussed the inherent constraints that are built into the relational model. These included primary and unique keys, entity integrity, and referential integrity. These constraints can be specified in the CREATE TABLE statement in SQL. This was discussed in Module 6.

In Module 5 we also looked at schema-based or explicit constraints. These are expressed in the schemas of the data model, usually by specifying them in the DDL. They include:

domain constraints - specified as the data type of an attribute
key constraints - specified as PRIMARY KEY or UNIQUE in a CREATE TABLE statement
constraints on null values - specified as NOT NULL in an attribute definition
entity integrity - automatically enforces that NULL values are not allowed in attributes which make up the primary key
referential integrity - set up using FOREIGN KEY ... REFERENCES

Also mentioned in Module 5 was an additional type of constraint that must be expressed and enforced in a different way, often by using application programs. These are called semantic constraints or business rules. Constraints of this type are often difficult to express and enforce within the data model. They relate to the meaning and behavior of the attributes, and they are often enforced by application programs that update the database. In some cases, this type of constraint can be handled by assertions in SQL. Note that although CREATE ASSERTION is specified in the SQL standard, it has not been implemented in Java DB (Derby). It is presented here since at some point you might use one of the DBMSs which has implemented the feature.

The syntax of the CREATE ASSERTION statement is:

CREATE ASSERTION <Constraint name> CHECK (search condition);

The keywords CREATE ASSERTION are followed by a user-defined constraint name, which is used to identify the assertion. This is followed by the keyword CHECK, which is then followed by a condition in parentheses. For the assertion to be satisfied, the condition must be true for every database state. It is possible to disable a constraint, modify a constraint, or drop a constraint. The constraint name is used when performing any of these actions. It is the job of the DBMS to make sure that the constraint is not violated.

The condition can be any condition that is valid in a WHERE clause. However, in many cases a constraint can be specified using the EXISTS and NOT EXISTS keywords. If one or more tuples in the database cause the condition of the assertion to be evaluated as FALSE, the constraint is said to be violated. The constraint is said to be satisfied by a database state if no combination of tuples in the database state violates the constraint. One general technique for writing assertions is to write a query that selects any tuple which violates the desired condition. If this type of query is included inside a NOT EXISTS clause, the assertion specifies that the result of the query must be empty for the condition to be TRUE. This implies that the assertion is violated if the result of the query is not empty.

As an example, consider the example database first introduced in Module 1, which showed only a few sample tuples for each table. When the database is fully populated, it will contain several thousand student tuples and several hundred faculty tuples. We want to make sure that no faculty member has an excessive number of student advisees. This ensures that every faculty member has sufficient time to work with all advisees who are assigned to him/her. We would like most faculty to have fewer than 20 advisees, but we want no faculty member to have more than 30 advisees. This can be checked with the following assertion.

CREATE ASSERTION PREVENT_ADVISEE_OVERLOAD
CHECK (NOT EXISTS (SELECT ADVISOR
                   FROM STUDENT
                   GROUP BY ADVISOR
                   HAVING COUNT(*) > 30));

In this example, the name given to the assertion is PREVENT_ADVISEE_OVERLOAD. This is followed by the keyword CHECK and the condition which must hold true for every database state. In this condition, the SELECT statement will group all STUDENT tuples by ADVISOR. Any advisor group having more than 30 tuples will be returned by the SELECT. If any such tuple is returned (meaning that the advisor has more than 30 advisees), NOT EXISTS will then be FALSE (since at least one such tuple does exist). This causes the constraint to be violated.

Remember that in Module 6.3, CHECK was used to further restrict domain values. The example used was: if there is a university policy that no course may be offered for more than six credit hours, this can be specified as:

Credit_hours INT NOT NULL CHECK (Credit_hours > 0 AND Credit_hours < 7);

This restricted the INT data type to only the values 1 - 6. Module 6.3 also showed using CHECK to check values across a row or tuple. The example used was a PRODUCT tuple where we did not want the sale price for a product set above the regular price. We can make sure this does not happen by adding the following CHECK clause at the end of the CREATE TABLE statement for the PRODUCT table.

CREATE TABLE PRODUCT(
    rest of specification
    CHECK (Sale_price <= Regular_price) );

If the CHECK condition is violated, the insertion or modification of the offending tuple will not be allowed to be stored in the database. The two uses of CHECK from Module 6.3 are applied to individual attributes and domains and to individual tuples. These are checked in SQL only when tuples are inserted or modified in a specific table. This allows the DBMS to implement the checking more efficiently in these cases. This type of CHECK should be used where possible, but only when the designer is certain that the constraint can only be violated by the insertion or modification of tuples. When this is not the case, CREATE ASSERTION will need to be used. However, CREATE ASSERTION should only be used when the more efficient simple checks cannot adequately address the desired constraint.

M 9.2: Triggers

It is often desirable to specify an action which will be taken by the DBMS when certain events occur or when certain conditions are satisfied. These actions are specified in SQL using the CREATE TRIGGER statement. Only the basics of triggers are presented in this module and in Chapter 7 in the text. If you are interested, additional information about triggers is presented in Chapter 26, Section 1 in the text as part of the discussion of active database concepts.
The basic concept of triggers has existed since the early versions of relational databases. Triggers were included in the SQL standard beginning with SQL-99. Today, many commercial DBMSs have versions of triggers available, but many differ somewhat from the standard. More specifically, CREATE TRIGGER has been implemented in Java DB (Derby). However, the implementation is not as broad as that presented in the text. The broader version discussed in the text is presented here since at some point you might use one of the DBMSs which has implemented a more robust version of the feature. More specifics of the Java DB implementation are described at the bottom of this page.

Triggers can be used to monitor the database. A trigger can be created to look for a particular condition and take an appropriate action when the condition is met. The action is most often a sequence of SQL statements. However, it can also be a database transaction or an external program which is executed.

As an example of how this can be used, consider earlier examples where we included both the salary and the faculty id of the chair in the FACULTY table. Suppose we want to check when a faculty member's salary is more than the salary of the chair. There are several events which can lead to this condition being satisfied. When a new faculty member is added, the salaries will need to be checked. They will also need to be checked when the salary of a faculty member is changed. Finally, salaries will need to be checked when a new chair is appointed to lead the department. When this salary condition is met, the desired action is that a notification should be sent. One possibility would be to inform the chair. However, depending on the university, the chair may not have knowledge of the faculty salary information. In our example, we will assume the notification is sent to someone with salary responsibility. For the example, we will assume that someone in HR should receive the notification, and this person will follow up with the appropriate action. This could be specified as follows (note that the specific syntax will vary from DBMS to DBMS).

CREATE TRIGGER SALARY_CONCERN
AFTER INSERT OR UPDATE OF SALARY, CHAIR ON FACULTY
FOR EACH ROW
WHEN (NEW.SALARY > (SELECT SALARY
                    FROM FACULTY
                    WHERE FAC_ID = NEW.CHAIR))
INFORM_HR (NEW.CHAIR, NEW.FAC_ID);

The trigger is given a name: SALARY_CONCERN. This name can be used to remove or deactivate the trigger at a later time. AFTER indicates that the DBMS will first allow the change to the tuple to be made in the database; then the condition of the trigger is checked. There is also the option BEFORE, which will check the condition before the change is made. When using BEFORE, it is sometimes the case that the trigger will prevent the update. The next part of the statement specifies that the trigger will be examined when a faculty tuple is inserted or when either the salary or the chair of a faculty tuple is updated. FOR EACH ROW indicates that the trigger is checked for each tuple that is modified. The NEW qualifier indicates that the value being checked is the value in the tuple after the change has been made. So NEW.SALARY will represent the salary value after the tuple has been inserted or updated. The subquery will select the salary of the chair of the faculty member. This will cause the updated salary of the faculty member to be compared to the salary of his/her chair. If the salary of the faculty member is greater, the INFORM_HR stored procedure will be executed.
At this point, think of this stored procedure as a program which is executed to send an email to HR indicating the salary concern. Consider (NEW.CHAIR, NEW.FAC_ID) to be parameters passed to the stored procedure. These parameters will be values included in the email.

Typically a trigger is regarded as an ECA (Event, Condition, Action) rule. It has three components:

1. The event - usually a database update operation applied to the database. In the example above, the events are inserting a new faculty tuple, changing the salary of a faculty member, or changing the chair of a faculty member. The person writing the trigger must make sure all possible related events are covered. It may be necessary to write more than one trigger to cover all possible events. The events are specified after the keyword AFTER in the above example.

2. The condition - determines whether the rule action should be executed. Once the triggering event has occurred, an optional condition may be evaluated. If no condition is specified, the action will be executed once the event occurs. If a condition is specified, it is first evaluated, and only if it evaluates to TRUE will the rule action be executed. The condition is specified in the WHEN clause of the trigger.

3. The action to be taken - usually a sequence of SQL statements, but it can be a database transaction or an external program that will be automatically executed. In this example, the action is to execute the INFORM_HR stored procedure.

Triggers can be used in various applications such as maintaining database consistency, monitoring database updates, and updating derived data automatically.

Specifically for Derby, there are a few differences from the presentation in the text. The example above shows the event as "AFTER INSERT OR UPDATE ..." Both INSERT and UPDATE are valid (as is DELETE), but using the "OR" is not allowed in Derby. The same result can be achieved in Derby, but two CREATE TRIGGER statements would be needed: one would include the "INSERT" and the other would include the "UPDATE OF ...", as sketched below. The Derby Developer's Guide does not provide much detail about trigger actions. It states: "A trigger action is a simple SQL statement." The exact restrictions as to what the "simple statement" can and cannot be are not listed.
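To make the two-trigger idea concrete, here is a rough sketch of how the SALARY_CONCERN trigger could be split by event. The trigger names are made up for the illustration, and the statements follow the generic syntax of the example above rather than Derby's exact trigger syntax, so some adjustment would be needed for a specific DBMS.

CREATE TRIGGER SALARY_CONCERN_INS
AFTER INSERT ON FACULTY
FOR EACH ROW
WHEN (NEW.SALARY > (SELECT SALARY
                    FROM FACULTY
                    WHERE FAC_ID = NEW.CHAIR))
INFORM_HR (NEW.CHAIR, NEW.FAC_ID);

CREATE TRIGGER SALARY_CONCERN_UPD
AFTER UPDATE OF SALARY, CHAIR ON FACULTY
FOR EACH ROW
WHEN (NEW.SALARY > (SELECT SALARY
                    FROM FACULTY
                    WHERE FAC_ID = NEW.CHAIR))
INFORM_HR (NEW.CHAIR, NEW.FAC_ID);

Each trigger handles one triggering event, and both use the same condition and action.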
M 9.3: Views (Virtual Tables) in SQL: Specification

The Concept of a View in SQL

SQL uses the term view to indicate a single table that is derived from other tables. Note that this use of the SQL term view is different from the use of the term user view in the early modules. Here the term view only includes one relation, but a user view might include many relations. The other tables a view is derived from can be base tables or previously defined views. A view does not necessarily exist in physical form: it is considered a virtual table. This is unlike a base table, whose tuples are always physically stored in the database. Since a view is often not stored physically in the database, the possible update operations that can be applied to views are somewhat limited. There are, however, no limitations placed on querying a view.

A view can be thought of as a way of specifying a table that needs to be queried frequently but may not exist physically. For example, consider the question presented in Module 7.2 for the example database we have been using: list the names of all courses taken by the student with ID 22458 during the Fall 2017 semester. In looking at the database, we saw that the query would require joining three tables. Course names are only available in the COURSE table, courses taken are only available in the TRANSCRIPT table by looking at the section taken, and to find out which courses the sections belong to we need to look in the SECTION table. The query to answer the question was:

SELECT Name
FROM COURSE, TRANSCRIPT, SECTION
WHERE TRANSCRIPT.Section_id = SECTION.Section_id
AND Course_number = Number
AND Student_id = 22458
AND Semester = 'Fall'
AND Year = 2017;

If we ask similar questions often, a view can be defined which specifies the result of these joins. Queries can then be specified against the view as single-table retrievals rather than queries which require two joins on three tables. In this case, the COURSE, TRANSCRIPT, and SECTION tables are called the defining tables of the view.

To define this view in SQL, the CREATE VIEW command is used. To use this command, we provide a table name for the view (also known as the view name), a list of attribute names, and a query to specify the contents of the view. In most cases, new attribute names for the view do not need to be specified, since the default is to use the attribute names from the defining tables. If any of the view attributes are the result of applying functions or arithmetic operations, then new attribute names will need to be specified for the view.

Specifying Views in SQL

As an example, the view discussed above can be created by using:

CREATE VIEW TRANS_EXTEND
AS SELECT Name, Student_id, SECTION.Section_id, Grade, Semester, Year
FROM COURSE, TRANSCRIPT, SECTION
WHERE TRANSCRIPT.Section_id = SECTION.Section_id
AND Course_number = Number;

This will create a view named TRANS_EXTEND. Note that Section_id is used only once as an attribute name in the view. Although Section_id needs to be qualified in the CREATE VIEW command, it will appear in the view as an attribute with its unqualified name: Section_id. When used in a query against the view, it will not need to be qualified. To run the above query against the view, the SELECT statement is:

SELECT Name
FROM TRANS_EXTEND
WHERE Student_id = 22458
AND Semester = 'Fall'
AND Year = 2017;

You can see how the view simplifies the above query as compared to the original query, which required the specification of the joins. This is one of the main advantages of views: the simplification of the specification of some queries. This is especially advantageous for queries that will be written frequently. Views are also useful for certain types of security. This will be discussed later in the module.

A view should always be up-to-date. If tuples are modified in one or more of the base tables used to define the view, the view should automatically reflect the changes. However, the view does not need to be materialized or "populated" when it is defined, but it must be materialized when a query is written using the view. It is the DBMS, not the user, which must maintain the view as up-to-date. How this can be accomplished by the DBMS is discussed in the next sub-module.

When a view is no longer needed, it can be removed using the DROP VIEW command. For example, to remove the TRANS_EXTEND view, you can use the command:

DROP VIEW TRANS_EXTEND;

M 9.4: Views: Implementation and Update

How can a DBMS effectively implement a view for efficient querying? There is not a simple answer to this question. Two main approaches have been proposed. One approach modifies or transforms a query which specifies a view into a query on the underlying base tables. This approach is called query modification.
For example, the query from the last sub-module:

SELECT Name
FROM TRANS_EXTEND
WHERE Student_id = 22458
AND Semester = 'Fall'
AND Year = 2017;

would automatically be converted by the DBMS into:

SELECT Name
FROM COURSE, TRANSCRIPT, SECTION
WHERE TRANSCRIPT.Section_id = SECTION.Section_id
AND Course_number = Number
AND Student_id = 22458
AND Semester = 'Fall'
AND Year = 2017;

The issue with this approach is that it is slow for views which are defined using complex queries that take a relatively long time to execute. It saves time for the user when writing the query, but it has no execution advantage compared to the user writing out the "long" query. This disadvantage is especially pronounced when many queries are run against the view in a relatively short span of time.

The second approach is called view materialization. Using this approach, the DBMS creates a physical copy of the view table when the view is first queried or created. The physical table can be temporary or permanent and is kept with the assumption that there will soon be other queries on the view. When this approach is used, an efficient method is needed to automatically update the view table when the base tables are updated so that the view table remains current. This is often done using an incremental update strategy. With this strategy, the DBMS determines what tuples must be inserted, updated, or deleted in a materialized view table when one of the defining base tables is updated. Using view materialization, the view is generally kept as a physical copy as long as it is being queried. When the view has not been queried for a set amount of time, the system can remove the physical view table. It will be recreated when future queries use the view.

Various strategies for view materialization can be used. A strategy which updates the view as soon as any of the base tables is changed is known as immediate update. Another strategy is to update the view only when needed by a query. This is called lazy update. A strategy called periodic update will update the view periodically, but it has the drawback that it allows a query to be run against the view when the view is not completely current.

A query can always be run using a view. However, using a view in an INSERT, DELETE, or UPDATE command is often not possible. Under some conditions, it is possible to update using a view if it is based on only a single base table and is not constructed using any of the aggregate functions. If an update is attempted on a view which involves joins, it is often the case that the update can be mapped to the underlying base relations in two or more ways. When this is the case, the DBMS cannot determine which of the mappings is the intended mapping. All mappings will provide the desired update to the view, but some of the mappings will create side effects that will cause problems when querying one of the base tables directly. In general, an update through a view is possible only when there is just one possible mapping to update the base relations which will accomplish the desired update on the view. Any time this single-mapping criterion does not hold, the update is not permitted.

As an example, consider the small bookstore database we used in Assignments 7 & 8.

Bookstore Schema from Assignment 7.docx (download)

If it is often desired to list the names of books bought by a customer, given the customer name, we can create the following view.
CREATE VIEW SALE_BY_NAME
AS SELECT CUSTNAME, BOOKNAME, DATE, PRICE
FROM BOOK, CUSTOMER, SALES
WHERE BOOK.BOOKNUM = SALES.BOOKNUM
AND CUSTOMER.CUSTNUM = SALES.CUSTNUM;

This view can be used to list the books purchased by Roonil Wazlib as follows:

SELECT *
FROM SALE_BY_NAME
WHERE CUSTNAME = 'Roonil Wazlib';

Suppose we examine the list and realize that it shows that Roonil bought the book Half-Blood Prince. We realize that Roonil did not buy that book, but rather bought the book Deathly Hallows. We attempt to correct the error as follows:

UPDATE SALE_BY_NAME
SET BOOKNAME = 'Deathly Hallows'
WHERE CUSTNAME = 'Roonil Wazlib'
AND BOOKNAME = 'Half-Blood Prince';

This change can be made in the base tables in at least two different ways. The first is equivalent to:

UPDATE SALES
SET SALES.BOOKNUM = (SELECT BOOK.BOOKNUM
                     FROM BOOK
                     WHERE BOOKNAME = 'Deathly Hallows')
WHERE SALES.CUSTNUM = (SELECT CUSTOMER.CUSTNUM
                       FROM CUSTOMER
                       WHERE CUSTNAME = 'Roonil Wazlib')
AND SALES.BOOKNUM = (SELECT BOOK.BOOKNUM
                     FROM BOOK
                     WHERE BOOKNAME = 'Half-Blood Prince');

The second mapping for the change is equivalent to:

UPDATE BOOK
SET BOOKNAME = 'Deathly Hallows'
WHERE BOOKNAME = 'Half-Blood Prince';

Both mappings accomplish the goal when working through the SALE_BY_NAME view. When listing the books through the view, as in the above query, the listing will now correctly show that Roonil bought Deathly Hallows. The first mapping accomplishes this by changing BOOKNUM in the tuple in the SALES base table to contain the BOOKNUM for Deathly Hallows rather than the BOOKNUM for Half-Blood Prince. The second mapping finds the tuple in the BOOK base table which has BOOKNAME Half-Blood Prince and changes the BOOKNAME in that tuple to Deathly Hallows.

The first mapping is probably what was intended. The sales record is updated to reflect the book number for Deathly Hallows rather than the book number for Half-Blood Prince. Making the update in this manner should have no unintended side effects. The second mapping does not change the sales record, but changes the BOOK table to reflect that the book number which had been assigned to Half-Blood Prince is now assigned to Deathly Hallows. While solving the problem for the tuple when retrieved through the view, there are two side effects. First, there are now two book numbers which show the book name Deathly Hallows and no book number for Half-Blood Prince. Second, when looking through the view, all customers who actually purchased Half-Blood Prince will be seen as having purchased Deathly Hallows.

Since either mapping can be argued to be "correct," DBMSs will not allow this type of update through a view. Research is being conducted to determine which of the possible update mappings is the most likely one. Some feel that DBMSs should use this strategy and allow updates through a view. In the above case, the choice of the first mapping can be considered the preferred mapping. Other researchers are looking into allowing the user to choose the preferred mapping during view definition. Most commercial DBMSs do not currently support either of these options.

M 9.5: Views as Authorization Mechanisms

The basics of database security will be presented in a later module. However, one form of security can be provided through views. Views can be used to hide certain attributes or tuples from unauthorized users. As an example of how this can be used, consider the faculty table which has attributes: FACULTY_ID, NAME, DEPARTMENT, and SALARY.
While the salary data must be kept in the table for various administrative uses, it is not desirable to allow wide access to this data. One possibility is that we want the chair of a department to have access to salary data, but only for faculty in the department. To accomplish this, the following view can be created for the SWENG department.

CREATE VIEW SWENG_FACULTY
AS SELECT *
FROM FACULTY
WHERE DEPARTMENT = 'SWENG';

Permission to access the view can then be given to the chair of the SWENG department, but not to other faculty. This allows the chair to view tuples of SWENG faculty, but not the tuples in the base table which contains full faculty information for all faculty across the university.

Consider the case where it is desired to make some faculty information, such as name and department, widely available. However, it is desired to limit access to the additional information contained in the table. This can be done by creating the following view.

CREATE VIEW GENERAL_FACULTY
AS SELECT NAME, DEPARTMENT
FROM FACULTY;

Granting access to the view but not the underlying base table would allow the specified users to see the name and department for all faculty, but they would not be able to see any additional faculty information.

M 9.6: Schema Change Statements in SQL

This sub-module provides a brief discussion of commands which SQL provides to change a schema. These commands can be executed while the database is operational, and the database schema does not need to be recompiled after using the commands. The DBMS will perform the necessary checks to verify that the changes do not impact the database in such a manner that the database becomes inconsistent.

The DROP Command

The DROP command is used to drop any element within a schema which has a name. It is also used to drop an entire schema. In a similar manner to what we saw in earlier sub-modules, the presentation of the DROP command in the text and the implementation of the DROP command in Derby are somewhat different. The information presented here first follows the material in the text. At the end of the sub-module, the Derby implementation is discussed.

Remember that a schema can be thought of as an individual database within the DBMS. If the full schema is no longer required, it can be removed by using the DROP SCHEMA command. These keywords are followed by the name of the schema to be dropped. This is then followed by one of two options, RESTRICT or CASCADE. For example,

DROP SCHEMA UNIVERSITY CASCADE;

will remove the UNIVERSITY schema from the system. When the CASCADE option is used, the schema is removed along with all the tables, domains, and all other elements of the schema. When the RESTRICT option is chosen instead of CASCADE, the schema is removed only if it is empty. If any elements remain in the schema, the schema will not be dropped. All remaining elements must be dropped individually before the schema can be dropped.

If a table (base relation) is no longer needed within the schema, the table can be removed by using the DROP TABLE command. These keywords are followed by the name of the table to be dropped. This is then followed by one of two options, RESTRICT or CASCADE. For example, if in a company schema it is no longer necessary to keep the locations of the various departments, the command

DROP TABLE DEPT_LOCATIONS CASCADE;

will remove the DEPT_LOCATIONS table from the schema.
When the CASCADE option is used, in addition to the table itself being dropped, all constraints, views, and any other elements which reference the table are also dropped from the schema. If the command is successful, the definition of the table will be removed from the catalog. When the RESTRICT option is chosen instead of CASCADE, the table is removed only if it is not referenced in any constraints, views, or other elements. Such references include being referenced by a foreign key in a different table, being referenced in an assertion, and being referenced in a trigger. If any such references exist, they must be removed individually before the table can be removed. If the goal is to remove the tuples from the table, but have the table structure remain, the tuples should be removed using the DELETE command.

The ALTER Command

The ALTER command can be used to change the definition of both base tables and other schema elements which have a name. The modifications which can be made to a base table include adding or dropping an attribute, changing the definition of an attribute, and adding or dropping constraints which apply to the table. For example, suppose it was decided to add a local contact phone number to the student table in the UNIVERSITY database. That can be done by using the following command.

ALTER TABLE UNIVERSITY.STUDENT ADD COLUMN PHONE VARCHAR(15);

If this command is successful, each tuple in the table will now have an additional attribute. There are two choices for the value assigned to the new attribute. The first option is to specify a DEFAULT clause to assign the same value for the attribute in all tuples. The second option is to not specify a DEFAULT clause. This will cause the value of the new attribute to be set to NULL. When using this option, the NOT NULL constraint cannot be added to the attribute at this time. In either case, the actual value desired must be added to each tuple individually using the UPDATE command or a similar alternative. In the above command, the value of the new attribute in each tuple will be set to NULL. Note that the actual data type is probably something other than VARCHAR(15), which is used here just as an example. The actual data type depends on exactly how the phone number will be stored and whether a standardized format for the phone number will be used.

An attribute can be removed from a table by using the DROP COLUMN option. For example, if at some point it is no longer desired to keep the phone number of students, the attribute can be removed by:

ALTER TABLE UNIVERSITY.STUDENT DROP COLUMN PHONE CASCADE;

As with other commands, either CASCADE or RESTRICT must be specified. If CASCADE is chosen, the column is dropped, as are any constraints and views which reference the column. If RESTRICT is chosen, the column is dropped only if it is not referenced by any constraints or views.

Another use of ALTER TABLE is to add or remove a default value from a column. Examples of this would be:

ALTER TABLE UNIVERSITY.STUDENT ALTER COLUMN PHONE DROP DEFAULT;

ALTER TABLE UNIVERSITY.STUDENT ALTER COLUMN PHONE SET DEFAULT '000-000-0000';

Specifically for Derby, there are differences from the presentation in the text. For DROP SCHEMA, only the RESTRICT option is available. Even though it is the only option, it must be specified. For DROP TABLE, neither CASCADE nor RESTRICT can be specified. The behavior is similar to the CASCADE option shown above. ADD COLUMN works as indicated above. There are some additional options which can be used.
DROP COLUMN works as indicated above. ALTER COLUMN works as indicated above.

There are more commands for changing a schema than are presented either here or in the text. For this course, these should be sufficient. Additional details can be found in the SQL standards and in the reference manuals for individual DBMSs.

M 10.1: Overview of Database Programming

In this sub-module, we will introduce some methods that have been developed for accessing databases from programs. Most database access is through software programs that implement database applications. These are usually developed using general-purpose programming languages such as Java and C/C++/C#. Also, scripting languages such as PHP, Python, and JavaScript are used to provide database access from Web applications. When database access statements are included in a program, the general-purpose programming language is called the host language. The language used by the database (SQL for us) is called the data sublanguage. Some specialized database programming languages have been developed for the purpose of writing database applications. Although many such languages have been developed for research purposes, only a few, such as Oracle's PL/SQL, have been widely adopted for commercial applications.

Please note that database programming is a broad topic. There are many database programming techniques, and each technique is realized differently in each specific DBMS. New techniques continue to be developed, and existing techniques continue to be updated and modified. In addition, although there are SQL standards, the standards continue to evolve. Also, each vendor has usually implemented some variations from the standard. Some institutions provide a complete course which covers database programming. In this module, we will only be able to present an overview of the topic along with one specific example. This will show the general steps needed to interact with a database from a programming language. However, as you work with other programming languages and databases, the details will be different, and you will need to see which languages the specific database supports and what tools it has available to support each language.

Throughout the course we have used an interactive interface to the database. Using an interactive interface, commands can be typed in directly, or commands can be collected into a file of commands and the file can be executed through the interactive interface. Most DBMSs provide such an interface. This interface is often used for database definition and running ad hoc queries. However, in most organizations, the majority of database interaction is by programs which have been thoughtfully designed and carefully tested. Such programs are usually known as application programs or database applications. Becoming increasingly common are application programs which implement web interfaces to databases. Despite their growing popularity, web applications are not covered in this course. If you are interested in this topic and have limited background with it, I suggest starting by reading Chapter 11 in the text. It provides an example of using PHP to access a database. This will provide one example and give you a basic background to pursue additional and more current information about the topic.

Approaches to Database Programming

1. Embed database commands in a general-purpose programming language. With this approach, database statements are embedded into the program.
These statements contain a special prefix identifying them as embedded database (usually SQL) statements. A precompiler or preprocessor is run on the source code. The precompiler processes each embedded statement and converts it to function calls to DBMS-generated code. This technique is often called embedded SQL. Additional information can be found from the examples provided in Section 10.2 in the text.

2. Use a library of database functions or classes. Whether using functions or classes, the library provides functions (methods) for interacting with the database. For example, there will be function calls to connect to the database, to prepare a query, to execute a query, to execute an update, to loop over the query results a record at a time, etc. The actual database commands and any additional information which is required are provided as parameters to the function calls. This technique provides an API (Application Programmer Interface) for accessing the database from a given general-purpose programming language. For OOPLs, a class library is used for database access. As an example, JDBC is a class library for Java which provides objects for working with a database. Each object has an associated set of methods for performing the needed operations. Additional information can be found from the examples provided in Section 10.3 in the text. We will work with JDBC (covered in Section 10.3.2) in the remainder of the module.

3. Design a new language. A database programming language is a language designed specifically to work with the database model and query language. Examples of such languages are PL/SQL, written to work with Oracle databases, and SQL/PSM, provided as part of the SQL standard to specify stored procedures. Additional information can be found from the examples provided in Section 10.4 in the text.

The first two approaches are more common since many applications are already written in general-purpose programming languages and require database access. The third approach can be used with applications designed specifically for database interaction. The first two approaches must deal with the problem of impedance mismatch, while the third approach can limit the portability of the application.

Impedance Mismatch

Impedance mismatch is the name given to problems which may occur when there are differences between the database model and the model used by the programming language. One example of a potential problem is that there must be a mapping between the data types of attributes which are permitted by the DBMS and the data types allowed by the programming language. This mapping specifies the data type to be used in the programming language for each data type allowed for the attributes. The mapping is likely to be different for each programming language supported by the DBMS, since the data types allowed in different programming languages are not all identical.

Another problem which must be addressed is that the results of queries are (in general) sets of tuples. Each tuple is made up of a sequence of attribute values. For many application programs, it is necessary to be able to access a single value from a single attribute of a single tuple. This means that there must be a mapping from the query result table to an appropriate data structure in the programming language. There then must be a way for the programming language to process individual tuples and to obtain any required values from the tuple and place the values into variables in the program.
Most programming languages provide a cursor as a way to loop through the result set and process each tuple individually. Impedance mismatch is much less of a problem when a programming language is developed to work specifically with a particular DBMS. In this case, the language is designed to use the same data model and data types that are used by the DBMS.

Typical Sequence in Database Programming

When writing a program to access a database, the following general steps are followed. Note that in many cases, the application program is not running on the same computer as the database.

Establish or open a connection to the database server. To do this, it is usually required to provide the URL of the machine running the server, as well as an account name and password for the database.

Submit various database commands and process the results as required by the application.

Terminate or close the connection once access to the database is no longer needed.

M 10.2: Using the JDBC Library

The text covers JDBC in Section 10.3.2, JDBC: SQL Class Library for Java Programming. You should carefully read that section in the text and study the two program segments presented there. This presentation will follow the same general flow as that in the text, but the program segments presented here will work with a database built in NetBeans and will be slightly different from those in the text.

We will walk through the JDBC programming steps using the small database used in Assignment 7. I have recreated the database and named it MODULE10. To do that, I simply followed the instructions given at the beginning of Assignment 7, but I changed the name of the schema from assignment7 to MODULE10. I then used the assignment7.sql file to create the tables and load some sample data. This database has the same tables and data as the assignment7 database, but it has a different name so the database here has a "fresh copy" of the data. It would be a good idea for you to do the same and follow along with your own copy as we work through the process.

After the database is set up, the next step is to write a JDBC program to access the database. JDBC is a class library for Java which provides the interface to access a database. Java is platform independent and widely used, so many RDBMS vendors provide JDBC drivers which allow Java programs to access their DBMS. A JDBC driver is an implementation of the classes, objects, and function calls which are available in JDBC for a given vendor's RDBMS. If a Java program includes JDBC objects and function calls, it can access any RDBMS which has a JDBC driver available.

In order to process the JDBC function calls in Java, the program must import the JDBC class libraries. These libraries can be imported by importing java.sql.*. A JDBC driver can be loaded explicitly by using the Class.forName( ) command. An example of this is shown in the text for loading a driver for the Oracle RDBMS. When using NetBeans, as part of the project which contains your Java code, right click on Libraries. In the drop down, click on Add Library. That drop down will contain a list of libraries. Click on Java DB Driver. This will provide the drivers to the project, so the drivers do not need to be loaded from inside the Java program in NetBeans; they are already loaded.

A general program will use the following steps.
1. Import the JDBC class library
2. Load the JDBC driver
3. Create appropriate variables
4. The Connection object
5. The PreparedStatement object
6. Setting the statement parameters
7. Binding the statement parameters
8. Executing the SQL statement
9. Processing the ResultSet object

The first example will be simple and will allow some of the steps to be skipped. Specifically, the query will be "hard coded," so there will be no user input used to create the query. As such, Step 3 is not needed to produce the query. A Statement object will be used in Step 5, and Steps 6 & 7 will not be required. Also, as indicated above, the JDBC driver does not need to be loaded from an external source, so Step 2 is not required in this example (or the remaining examples).

The following comments apply to Example Program 1, which can be viewed below.

Example Program 1.JPG (download)
Example Program 1 in Word (download), if you prefer to preview it

Following the steps above for this specific example, you can see the following required steps.

1. Import the JDBC class library. This is done on line 7, where java.sql.* is imported.

2. Load the JDBC driver. This is not required for our use. An example of how this might be done is shown on lines 11-22.

3. Create appropriate variables. Not required for this example.

4. The Connection object. This is shown on line 27. A Connection object is declared and given the name "con". The getConnection method of DriverManager is called to create the Connection object. One form of the method call takes three parameters. The first is the URL of the connection. This is the URL you have been right clicking on in the Services tab to connect to the database. In my case it is "jdbc:derby://localhost:1527/UNIVERSITY". The second parameter is the name used to create the database, and the third is the password associated with the account.

5. The PreparedStatement object. As indicated above, we will use the Statement object in this example. This is shown on lines 29-31. A Statement object is declared and given the name "stmt". The createStatement method of the Connection object is called on the "con" instance to create the Statement object. The two parameters are not actually needed in this example, but they will not hurt anything. They will be discussed in a future example.

6. Setting the statement parameters. Not required for this example.

7. Binding the statement parameters. Not required for this example.

8. Executing the SQL statement. Executing the SQL statement, a query in this example, returns a ResultSet object. This is shown in the example on line 35. When using the executeQuery method, the parameter is a string which contains an SQL SELECT statement. The SELECT statement in the string is what you would enter when executing an interactive query as we have done in the past. This method is called from the Statement object, "stmt", in the example. The returned value populates the ResultSet object, which is named "rs" in the example. You can think of this as the table of values returned (and displayed) when you execute the query interactively.

9. Processing the ResultSet object. This is shown on lines 37-41 in the example. Although the ResultSet object contains the equivalent of a table, using the object ("rs" in the example) is equivalent to using a cursor into the table.
When the query is executed, rs refers to a tuple (which does not actually exist) BEFORE the first tuple in the result set. The call to rs.next( ) on line 37 moves the cursor to the next tuple in the result set. It will return FALSE if there are no more tuples in the result set. In the example, the first call to rs.next( ) moves the cursor from "before the first row" to the first row. The program can refer to the attributes in the current tuple (at this point, the first tuple) by using various get methods. Line 38 shows referencing the SNAME attribute in the current tuple by using rs.getString("SNAME"). Here, SNAME is the name of the student name attribute in the database, and getString is used since the attribute has the data type VARCHAR. Note that using the attribute name is one way to reference the attribute. An alternative method, demonstrated by the example in the text, is to use the attribute position in the table. In our table, SNAME is in the second position, so line 38 could have used rs.getString(2). The result of this is to store the value found in the tuple (the student with the name "John Smith") into the variable "s". Line 39 shows getting the value of the ADVISOR attribute in the tuple by using rs.getInt("ADVISOR"), since the ADVISOR attribute has data type INTEGER. As with SNAME, the get could have used the positional value, rs.getInt(5), since ADVISOR is the fifth column in the table. The assignment statement will store the returned value (the number of the advisor of the student: 6721) in the variable "n". Line 40 prints the values. When the while loop returns to execute line 37 again, rs.next( ) tries to advance to the next tuple in the result set. Since there is only one tuple in the result set, this call will return FALSE, and the loop will be exited.

The output from the program is shown here: Example Program 1 Output.docx (download)

M 10.3: Example of Processing a Result Set Containing Multiple Tuples

The second example will also be fairly simple. Again, the query will be "hard coded," so there will be no user input used to create the query. This example will produce a result set that contains multiple tuples, to compare it to Example Program 1 where the result set contained only one tuple. Example Program 2 can be viewed below and will then be followed by comments.

Example Program 2.JPG (download)
Example Program 2 in Word (download), if you prefer to preview it

Much of the code is the same as Example 1, and those comments will not be repeated here. Comments on the changes from Example 1 are:

8. Executing the SQL statement. As with Example 1, executing the SQL statement, a query in this example, returns a ResultSet object. This is shown in the example on line 26. This SQL statement returns all attributes and all tuples in the STUDENT table.

9. Processing the ResultSet object. This is shown on lines 28-41 in the example. As with Example 1, the ResultSet object contains the equivalent of a table, and using the object ("rs" in the example) is equivalent to using a cursor into the table. As in the last example, when the query is executed, rs refers to a tuple (which does not actually exist) BEFORE the first tuple in the result set. The call to rs.next( ) on line 28 moves the cursor to the next tuple in the result set. In the example, the first call to rs.next( ) moves the cursor from "before the first row" to the first row.
The program refers to the attributes in the current tuple (at this point, the first tuple) by using various get methods as shown in lines 30-34. Like Example 1, this example shows referencing the attributes by name rather than by position. Lines 36-40 print the values. When the while loop returns to execute line 28 again, rs.next() will move the cursor from the first row to the second row. The next call will move the cursor from the second row to the third row, etc. Finally, when there are no more tuples in the result set, the call will return false, and the loop will be exited. Click here to see Example Program 2 Output.docx.

M 10.4: Examples of Building a Query Based on User Input

The third example shows how the program can accept user input before running a query. We will ask the user to input a student id. The student id will be used to run a query to find and then display the record of the student. Click here to see Example Program 3. Much of the program is similar to the previous examples. The new code is discussed here. Line 8 indicates that the Scanner class will be included. We'll use that to obtain the student id that the user is going to enter. The Scanner object will be declared on Line 16. Line 22 includes the two parameters for the createStatement method. By default, a ResultSet object is not updatable and has a cursor that moves forward only. In order to allow the cursor to be "moved backward" (see the discussion of Line 40 below), the result set can be permitted to move backwards by indicating TYPE_SCROLL_INSENSITIVE in the createStatement call as shown. The result set can be made updatable by including CONCUR_UPDATABLE as shown. This last parameter will be further discussed in the next sub-module. Lines 24-28 are used to build the statement. Remember that student id is stored as a CHAR in the database, which is actually a String in Java. The SQL statement (also a String) is constructed based on the user input. Since the SQL statement requires that character values be enclosed in quotes, the quotes must be added to the student id in the statement String. These are included by the program. Otherwise, the user would need to enter them as part of the id being input. This would be cumbersome and error prone. Lines 34-41 deal with a potential issue. If these lines were not included and the user entered an id number which is not currently in the STUDENT table, the program would just end with nothing being printed. This is because the while loop starting on Line 43 would never be executed, since the first call to rs1.next( ) would immediately return false since no tuple exists in the result set. With these lines, the first call to rs1.next( ) is checked. If it returns false, there are no tuples in the result set and an appropriate message is printed before the program ends. The else statement on Lines 39-41 addresses an issue that arises should there be one or more tuples in the result set. Since the call to rs1.next( ) on Line 34 advances the cursor from "before the first tuple" to the first tuple (if one exists), without the else, the call to rs1.next( ) on Line 43 would advance the cursor from the first tuple to the second tuple. When the while loop is entered, the first tuple would be skipped and only the second and remaining tuples would be displayed.
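The overall shape of this check, together with the display loop that follows it, is roughly the sketch below. It is not the exact code of Example Program 3; in particular, beforeFirst() is an assumption about how Line 40 repositions the cursor (it is available because the statement was created with TYPE_SCROLL_INSENSITIVE), and the message text is illustrative.

    if (!rs1.next()) {
        // Lines 34-38: the result set is empty, so report it
        // (Example 3 then ends the program)
        System.out.println("No student found with that id.");
    } else {
        // Lines 39-41: at least one tuple exists; move the cursor back to
        // "before the first tuple" so the loop below starts with the first tuple
        rs1.beforeFirst();
    }
    // Lines 43-55: display every tuple in the result set
    while (rs1.next()) {
        System.out.println(rs1.getString("SNAME"));
    }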
If the else statement is executed, there is at least one tuple in the result set, and Line 40 moves the cursor back to "before the first tuple," allowing the while loop on Lines 43 to 55 to execute beginning with the first tuple. Click here to see Example Program 3 Output.docx. Since building a statement as we did was a bit cumbersome, the following code will demonstrate an alternative which uses the PreparedStatement class. Click here to see Example Program 4.docx. Much of this code is similar to that of Example 3, and both programs will look similar to the user. The changes when going from Example 3 to Example 4 are noted here. Line 24 is used to create a query string with one parameter. The place for the parameter to be inserted is indicated by the ? character. Line 25 shows that an object instance named pstmt of type PreparedStatement is created based on the string in s1. Lines 27 & 28 have the user provide the student id, which is stored in the String variable sid. Lines 30 & 31 show how a statement parameter is bound to a program variable. Line 30 demonstrates the suggested good practice of clearing all parameters before setting any new values. Line 31 shows how a set function is used to bind a string parameter to a variable. There are different set functions that match different parameter data types. Examples are the setInt and setDouble functions as well as the setString function shown here. The parameters to the set function are the position of the statement parameter in the statement and the variable being bound to the parameter. In the code, the first (in this case the only) parameter in pstmt is bound to the variable sid. Note that if there are n parameters in a statement, n different set statements are required, one for each parameter. Note that in Line 25, rather than being declared CONCUR_UPDATABLE, the ResultSet was set as CONCUR_READ_ONLY. This can be specified when no updates are being performed using the statement. The same parameter value could also have been used in Example 3. Only simple uses of the settings which can be applied to ResultSet objects are shown here. For additional information about these parameters, please consult the documentation which is provided for the particular DBMS you are using. The remaining code has the same function as that in Example 3. Running this program will produce the same output as Example 3, so the output will not be repeated here.

M 10.5: Examples of Using a Program to Update the Database

The fifth example shows how the database can be updated using a program. For simplicity, the command will be hard-coded. Making a modification using Examples 3 & 4 as a guide would allow the program to accept user input. Click here to see Example Program 5. Parts of the program are similar to the previous examples. The new code is discussed here. Lines 18-24 retrieve a record from the database to show a "before" copy of the record. This is similar to previous examples, but it does show retrieving the attributes by position rather than by name. String s is assigned the value of the second attribute (SNAME), String t is assigned the value of the fourth attribute (MAJOR), and int n is assigned the value of the fifth attribute (ADVISOR). These values are then displayed by Line 23.
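Before walking through the remaining lines, here is a condensed sketch of how the update itself could also be expressed with a PreparedStatement in the style of Example 4. It is not the code of Example Program 5 (which hard-codes the statement and uses a plain Statement object); the column name SID and the id value are assumptions used only for illustration.

    // con is an open Connection, as in the earlier examples
    PreparedStatement pstmt = con.prepareStatement(
            "UPDATE STUDENT SET MAJOR = 'SWING' WHERE SID = ?");

    pstmt.clearParameters();             // good practice before setting new values
    pstmt.setString(1, "900123456");     // bind parameter 1 to an id value (illustrative)

    // executeUpdate (not executeQuery) returns the number of tuples changed
    int count = pstmt.executeUpdate();
    System.out.println(count + " record(s) updated");

    pstmt.close();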
Line 26 sets the value of String variable sql to a string which contains the code to change the major to "SWING" in all selected tuples. Line 28 is where the update is actually executed. Note that unlike Line 18 and the SQL statements executed in the previous examples, the function called is "executeUpdate" rather than "executeQuery". This function returns the number of records (tuples) in the database which were updated. In this case only one tuple was selected by the WHERE clause in the update statement. This value is printed on Line 29. Lines 32-38 retrieve the record from the database to show an "after" copy of the record. This is identical to the code on lines 18-24. This shows that the value in the MAJOR attribute was actually changed to "SWING". The output from a run of the program is shown below. Click here to see Example Program 5 Output. The sixth example program provides an additional look at updating tuples, in this case updating several tuples rather than just one. Click here to see Example Program 6. Although the intent of this program is similar to the program in Example 5, there are some differences which are listed here. Line 19 shows the initial query which retrieves all tuples from the STUDENT table. Lines 21-34 display all attributes in the tuples. This is to get a more complete look at the "before" status of the tuples. Lines 36-39 build and execute the UPDATE string. The return value from the executeUpdate is printed by Line 39. Lines 42-56 retrieve the tuples in the table after the update and display the "after" values in the tuples. The output from a run of the program is shown below. Click here to see Example Program 6 Output.docx.

M 11.1: Informal Relation Schema Design Guidelines

This sub-module discusses four informal guidelines that can be followed when designing a relation schema. Note that these guidelines are for a relation schema, not a relational schema. A relation schema references a single relation or table. A relational schema references the entire database.

Clear Semantics for Attributes in Relations

We have discussed in earlier modules that a database is a model of a part of the real world. As such, relations and their attributes have a real-world meaning. The semantics of a relation indicate how the real-world meaning can be interpreted from the values of attributes in a tuple. Following the design steps from earlier modules, the overall relational schema design should have a clear meaning. The easier it is to explain the semantics of a relation, the better the relation schema design will be. This guideline states that a relation should be designed so that it is easy to explain its meaning. If a relation schema represents one entity type or one relationship type, its meaning is usually clear. Including multiple entities and/or relationships in a single database relation tends to cloud the meaning of the relation.

Avoid Redundant Information and Update Anomalies

We have discussed the goal of eliminating, or at least minimizing, redundant information in a database. In addition to using extra storage space, redundant information can lead to a problem known as update anomalies. Consider the FACULTY and DEPARTMENT relations we have used in earlier modules. Information specific to the faculty members was kept in the FACULTY relation, and information about the department was kept in the DEPARTMENT relation.
However, there is a relationship between the two, namely that each faculty member belongs to a department. This relationship was expressed by storing the primary key of DEPARTMENT as a foreign key in FACULTY. This allows the relationship to be retained, but no DEPARTMENT information was duplicated in the FACULTY relation. Consider an alternative where the two tables are combined into one as a base table. To do this, all department information such as name and building would be stored in the combined tuple (which would, in effect, be a table resulting from a natural join of the two tables). If this were done, the department name and the building it resides in would be repeated in several tuples, once for each faculty member in the department. This can lead to update anomalies, specifically insertion anomalies, deletion anomalies, and modification anomalies. An example of how this would look is shown in the table below, which contains only a few tuples for simplicity.

FACULTY
Faculty ID    Faculty Name           Department Code
2469          Huei Chuang            CMPSC
5368          Christopher Johnson    MATH
6721          Sanjay Gupta           SWENG
7497          Martha Simons          SWENG

Insertion anomalies can be caused when inserting new records into the combined relation. This can happen in two cases. To add a new faculty member, the department information must also be included. If the faculty member being added is in the SWENG department, the department name and building that is included must be identical to the information included in the tuples for other faculty in the SWENG department. If there is not a match, there is now conflicting information about the SWENG department stored in the database. So, if Sally Brown is added as a new faculty member in the SWENG department, but in the tuple the building is incorrectly entered as Nittany Hall, the table would show the SWENG department residing in two buildings: Nittany Hall and Hopper Center. This is incorrect and is an example of an insertion anomaly. Also, assuming that faculty id is the primary key of the combined relation, a new department cannot be added if the department has no faculty currently assigned. To do so, NULL values would need to be entered into the faculty-specific attributes of the tuple, which violates the constraint that the primary key value cannot be NULL. In the table above, if a new Cyber Engineering department (CYENG) is created, information about the department cannot be added until a faculty member is assigned to the department. Neither of these problems exists in the design which keeps the base tables separate. Deletion anomalies can be caused when deleting records from the combined relation. If the last faculty member of a department is deleted from the database, all information about the department will be removed from the database. This may not be the intent of the deletion. In the table, if Christopher Johnson leaves the university and is removed from the table, all information about the Math department is also removed. Again, this problem does not exist in the original design. Modification anomalies can be caused when values in a tuple in the combined relation are modified. For example, if a department is moved to a new building, this value must be changed in all tuples for faculty in the department. If this is not done, the database will be in an inconsistent state: some tuples will have the new building location and some will have the old building location.
In the table, if SWENG is moved to Nittany Hall and the change is made in Sanjay Gupta's tuple, but not in Martha Simons' tuple, the table would show the SWENG department residing in two buildings: Nittany Hall and Hopper Center. This is incorrect and is an example of a modification anomaly. As with the other two examples, this problem does not exist in the original design. This guideline states that base relation schemas should be designed which do not allow update anomalies to occur. NULL Values in Tuples If attributes are included in a relation which will contain NULL values in a large number of the tuples, there are at least two issues which can arise. First, there will be a large amount of wasted storage space since the value NULL will need to be stored in many tuples. Also, we saw in earlier modules that the value of NULL can have different meanings (does not apply, unknown, known but not yet available for the database). This can lead to unclear semantics for the attribute. This guideline states that the designer should avoid putting an attribute in a base relation if the value of the attribute may frequently be NULL. If NULL values cannot be avoided, make sure they apply to relatively few tuples in the relation. It is often possible to create a separate relation whose attributes are the primary key of the original relation and the attribute from the original relation which contains many null values. Only tuples which do not have null values would need to be stored in the new relation. This would lead to a relation with many fewer tuples, none of which would need to contain a NULL value. As an example, suppose some students serve as student assistants and some of the assistants are assigned to offices. We want to keep track of the office assignments. If only 2% of the students are student assistants who have offices, including a Stu_office_number attribute in the STUDENT relation, would add an attribute which has a NULL value in 98% of the student tuples. Instead, a STU_OFFICES relation can be created with attributes Student_id and Stu_office_number. This relation will include tuples for only the students who have been assigned an office. Generation of Spurious Tuples When taking a table which included redundant information and splitting the table to remove the redundancy, it is possible to split the table in a way that leads to a poor design. The design will be poor if the tables are such that a natural join of the two tables produces more tuples than would be contained in the original combined tables. The extra tuples not in the original combined tables are called spurious tuples, because such tuples provide spurious (false) information. This guideline states that the designer should design relation schemas so that they can be joined using (primary key, foreign key) pairs that are related in a manner that will not cause spurious tuples. Do not design relations where (primary key, foreign key) attributes do not appropriately match, since joining on these attributes may yield spurious tuples. M 11.2: Functional Dependencies Normalization is the most well known formal tool for the analysis of relational schemas. The main concept on which much of normalization is based is that of functional dependency. The concept of functional dependency is defined in this sub-module. The discussion presented here is based on Section 14.2 in the text and can seem overly "mathematical" and somewhat abstract, so a simple example is presented along with the formal definitions to illustrate the concepts. 
We can start by thinking of listing all the attributes we want to capture in the database. Without regard for entities at this time, give each attribute a unique name. Consider defining the database as one large relation which contains all the attributes. This relation schema can be called a universal relation schema. As an example which will be carried through the normalization process, we will consider a subset of the database we have been using for most of the course. Consider the university database, but to keep the example simpler, we will focus on some of the attributes required to produce a transcript. Specifically, consider student id, student name, class rank, section id, course number, course name, course credit hours, and grade.

Student_id   Section_id   Student_name   Class_rank   Course_number   Course_name   Credit_hours   Grade

Formally, if we call the universal relation schema R, a functional dependency can be defined as follows. Consider two subsets of R which are denoted as X and Y. Y is functionally dependent on X, denoted by X -> Y, if there is a constraint on a relation state r of R such that for any two tuples in r (which we will call t1 and t2), if t1[X] = t2[X], then it must also be true that t1[Y] = t2[Y]. This means that the values of the attributes which make up the Y subset of the tuple depend on (are determined by) the values which make up the X subset of the tuple. Looking at this in the other direction, it can be said that the values which make up the X subset of the tuple uniquely determine (or functionally determine) the values which make up the Y subset of the tuple. Other words used to describe this are that there is a functional dependency from X to Y, or that Y is functionally dependent on X. Functional dependency can be abbreviated FD. The set of attributes making up X is called the left-hand side of the FD, and set Y is called the right-hand side of the FD. Whether or not a functional dependency holds is partly determined by assumptions about the data. For example, if the assumption is made that Student_id is unique, the functional dependency Student_id -> Student_name will apply. This says that Student_name (the Y) is functionally dependent on Student_id (the X). If two tuples (the t1 and t2 above) have the same Student_id, they must also have the same Student_name. However, it would not be true that Student_id -> Grade. Different grades would be associated with any individual student, so given the ID, you could not uniquely determine the grade. Since both X and Y are subsets, either or both can consist of multiple attributes. For example, Student_id -> {Student_name, Class_rank}. Given a student id, both student name and class rank are determined for the tuple. Similarly, {Student_id, Section_id} -> Grade. If both the student id and section id are provided, the combination will determine the grade the student earned for that section. Note that if there is a further constraint on X that requires that only one tuple can exist with a given X-value (which implies that X is a candidate key for the relation, as discussed in the next sub-module), this implies X -> Y for any subset Y of R. If X is a candidate key for R, then X -> R. This concept will be expanded in the next sub-module. Also note that X -> Y in R does not indicate whether or not Y -> X in R. As we showed above, Student_id -> Student_name. Is it true that Student_name -> Student_id? Maybe. It depends on the assumption about student name. Is it unique?
If so, is it unique only in the current set of tuples (the relation state) or will it hold for any potential set of tuples? If the assumption is made that it is possible to have duplicate student names, then it is not true that Student_name -> Student_id. However, if a scheme is devised to guarantee that no two student names can ever be identical, then it is true. Functional dependency is not a property of the data, but rather it is a property of the semantics or meaning of the attributes. Functional dependencies, therefore, are specified by the database designer based on the designer's knowledge of dependencies which must hold for all relation states r of R. Thus, functional dependencies represent constraints on the attributes of a relation which must hold at all times. A functional dependency is a property of the relation schema, not of a particular relation state. Because of this, the functional dependency cannot be inferred from a given relation state, but must be defined by a designer who knows the semantics of the attributes of R. When looking at a relation state only, it is not possible to determine which FDs hold and which do not. Looking at the data in a particular relation state, all that can be said is that an FD may exist between certain attributes. However, it is possible to determine that a particular FD does not hold if the data in even one tuple would violate the FD if it were to hold.

M 11.3: The Normalization Process

The normalization process usually starts with a set of relations which have already been proposed or are already in use. This set of relations can be developed by first creating an ER diagram and then converting the ER diagram into a set of relations to form a database schema. The set of relations can also come from an existing implementation of some type, whether it is an existing database, a file processing system, or a hard-copy process using paper forms. We generated a set of relations in earlier modules using an ER diagram and an algorithm to produce the database schema from the ER diagram. The normalization process was first proposed by E.F. Codd in 1972 in a paper which defined the first three normal forms, which Codd called first, second, and third normal forms. Codd is considered the father of the relational database, since much of relational database theory is based on a paper he wrote in 1970 while working for IBM. Several years later a stronger version of third normal form was proposed, and this version is now known as Boyce-Codd Normal Form (BCNF). In the late 1970s two additional normal forms were proposed and are called fourth and fifth normal form. We will cover the first three normal forms. We will not cover BCNF or the fourth and fifth normal forms in detail. In addition to the concept of functional dependency, the normalization process relies on the concept of keys. Keys were discussed in Module 5.1.3. The main concepts are repeated here for convenience. We indicated above that in the definition of a relation, no two tuples may have the exact same values for all their elements. It is often the case that many subsets of attributes will also have the property that no two tuples in the relation will have the exact same values for the subset of attributes. Any such set of attributes is called a superkey of the relation. A superkey specifies a uniqueness constraint in that no two tuples in the relation will have the exact same values for the superkey.
Note that at least one superkey must exist; it is the subset which consists of all attributes. This subset must be a superkey by the definition of a relation. Consider the set of attributes from the last sub-module.

Student_id   Section_id   Student_name   Class_rank   Course_number   Course_name   Credit_hours   Grade

As indicated above, the set of all eight attributes will always be a superkey. Similarly, the subset of attributes {Student_id, Section_id, Student_name, Class_rank} is also a superkey since no two tuples will have the exact same values for this subset of attributes. However, the subset of attributes {Student_id, Student_name, Grade} is not a superkey since it is possible (even likely) that a given student will have earned an A in several courses over time, so there will be several tuples with the same values for these three attributes. A superkey may have redundant attributes - attributes which are not needed to ensure the uniqueness of the tuple. An example of this can be seen from the above. All eight attributes are a superkey, but a subset of just four of the attributes is also a superkey. This means that at least four of the eight attributes were redundant. A key of a relation is a superkey where the removal of any attribute from the set will result in a set of attributes which is no longer a superkey. More specifically, a key will have the following properties:
1. Two different tuples cannot have identical values for all attributes in the key. This is the uniqueness property.
2. A key must be a minimal superkey. That means that it must be a superkey which cannot have any attribute removed and still have the uniqueness property hold.
Note that this implies that a superkey may or may not be a key, but a key must be a superkey. Also note that a key must satisfy the property that it is guaranteed to be unique as new tuples are added to the relation. In the above example, the superkey consisting of the subset of attributes {Student_id, Section_id, Student_name, Class_rank} is still not a key. If Class_rank is removed, the remaining three attributes still comprise a superkey. Further, if Student_name is also removed, the remaining two attributes still comprise a superkey. However, the removal of either attribute from the subset {Student_id, Section_id} will no longer guarantee the uniqueness property, so the superkey consisting of the subset {Student_id, Section_id} is also a key. In many cases, a relation schema may have more than one key. Each of the keys is called a candidate key. One of the candidate keys is then selected to be the primary key of the relation. The primary key will be underlined in the relation schema. If there are multiple candidate keys, the choice of the primary key is somewhat arbitrary. Normally, a candidate key with a single attribute, or only a small number of attributes, is chosen to be the primary key. The other candidate keys are often called either unique keys or alternate keys. Again consider the above set of attributes. Also, add the assumption that Student_name is unique (a dubious assumption as we have said, but we'll make the assumption to illustrate the point). We saw above that the subset {Student_id, Section_id} is a key. Assuming the uniqueness of the Student_name attribute, the subset {Student_name, Section_id} is also a key. Both of these are therefore candidate keys. We choose one, let's say {Student_id, Section_id}, to be the primary key. Once this choice is made, {Student_name, Section_id} becomes an alternate key.
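As noted in sub-module 11.2, a relation state can never prove that an FD (or a key's uniqueness constraint) holds, but even one pair of tuples can refute it. The following small Java sketch illustrates that check; the method and the map-based tuple representation are purely illustrative and are not taken from the text.

    import java.util.*;

    public class FdCheck {
        // Returns true if the proposed FD X -> Y is violated by the given tuples.
        // Each tuple is represented as a map from attribute name to value.
        static boolean violatesFd(List<Map<String, String>> tuples,
                                  List<String> x, List<String> y) {
            Map<List<String>, List<String>> seen = new HashMap<>();
            for (Map<String, String> t : tuples) {
                List<String> xVals = new ArrayList<>();
                for (String a : x) xVals.add(t.get(a));
                List<String> yVals = new ArrayList<>();
                for (String a : y) yVals.add(t.get(a));
                List<String> prev = seen.putIfAbsent(xVals, yVals);
                if (prev != null && !prev.equals(yVals)) {
                    return true;   // same X-values but different Y-values: the FD cannot hold
                }
            }
            return false;          // no violation found; the FD *may* hold for this state
        }
    }

For example, violatesFd(tuples, List.of("Student_id"), List.of("Student_name")) returning true would show that Student_id -> Student_name does not hold for that relation state; returning false would not prove that it does.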
In addition to the definition of keys, one additional definition is used in the normalization process. An attribute of a relation R is called a prime attribute of R if it is a member of at least one candidate key of R. Any attribute which is not a prime attribute of R is called a nonprime attribute of R. This means that any nonprime attribute of R is not a member of any candidate key of R. The normalization process is discussed in the following sub-modules.

M 11.4: First Normal Form

First normal form (1NF) requires that the domain of an attribute may contain only atomic values. Atomic values are those which are simple and indivisible. Further, the value in any attribute in a tuple must be a single value from the domain of the attribute. This requires that an attribute value cannot be an array of atomic values, a set of atomic values, etc. Note that in this course, in our earlier modules we forced attribute values to be atomic, so what was developed was always in 1NF. As shown by some of the material in the text, this is not the case for all methodologies. To provide an example of the normalization process, we will continue with the subset of the database we have been using for this module. Consider the university database with attributes student id, student name, class rank, section id, course number, course name, course credit hours, and grade. If we place all the attributes in one table, we might choose student id as the primary key. If we do that, we see that in order to keep tuples unique, all the attributes except student id, student name, and class rank would need to be multivalued attributes. Each student will have taken many sections of courses. Each would need to be listed in the attribute. This would also be true for the remaining attributes. In order to remove the need for multivalued attributes, a primary key for this table will need to be determined so that each attribute holds an atomic value in each tuple. Examining the table and the attributes, we can list the functional dependencies as follows.

Student_id -> Student_name
Student_id -> Class_rank
Section_id -> Course_number
Section_id -> Course_name
Section_id -> Credit_hours
{Student_id, Section_id} -> Grade

To remove the need for multivalued attributes, section id should be added as part of the primary key. This leads to each tuple having attributes that have atomic values. This would yield a table in first normal form that would look like the following:

Student_id   Section_id   Student_name   Class_rank   Course_number   Course_name   Credit_hours   Grade

This "large" table meets the qualifications for a relation. Since each combination of student id and section id is unique, each tuple in this table would be unique. Hopefully, based on what we have covered in this course, you have the feeling that building a database based on this large table would not be ideal. This feeling would be correct, so the normalization process continues to second normal form.

M 11.5: Second Normal Form

In sub-module 11.2 we discussed functional dependency. Second normal form (2NF) requires an extension of that concept. 2NF is based on the concept of full functional dependency. If X -> Y represents a functional dependency, X -> Y is a full functional dependency if removing any attribute A from X results in the functional dependency no longer holding. A functional dependency X -> Y is a partial dependency if at least one attribute A can be removed from X and the functional dependency X -> Y still holds.
For example, from our table in the previous sub-module, {Student_id, Section_id} -> Grade is a full dependency since Student_id -> Grade does not hold, nor does Section_id -> Grade hold. Thus neither attribute can be removed and have the dependency still hold. However, {Student_id, Section_id} -> Student_name is only a partial dependency since Section_id can be removed and Student_id -> Student_name is still a functional dependency. A relation schema R is defined to be in 2NF if every nonprime attribute of R is fully functionally dependent on the primary key of R. The normal forms are considered sequential, so testing a relation for 2NF requires that the relation already be in 1NF. Consider the table from the last sub-module, which is in 1NF. Since the combination {Student_id, Section_id} is the primary key of this relation, all non-key attributes are functionally dependent on the primary key. The question is which are fully functionally dependent.

STUDENT_GRADE
Student_id   Section_id   Student_name   Class_rank   Course_number   Course_name   Credit_hours   Grade

Functional dependencies:
Student_id -> Student_name
Student_id -> Class_rank
Section_id -> Course_number
Section_id -> Course_name
Section_id -> Credit_hours
{Student_id, Section_id} -> Grade

From this, only Grade is fully functionally dependent on the primary key. Second normal form requires that the 1NF table be decomposed into tables where each nonprime attribute is fully functionally dependent on its key. Note that if a primary key consists of a single attribute, that relation must be in 2NF. Based on the above, the table would be decomposed into three tables. One table requires the combined attributes for the primary key, one table requires only the Student_id for the primary key, and the third table requires only the Section_id for the primary key. The tables would be as follows.

STUDENT
Student_id   Student_name   Class_rank

SECTION
Section_id   Course_number   Course_name   Credit_hours

GRADES
Student_id   Section_id   Grade

These tables represent the current state of the database. Each table is in 2NF. Since STUDENT and SECTION now have a single attribute as primary key, the non-key attributes in these tables must be fully functionally dependent on the primary key. In the GRADES table, as discussed above, the Grade attribute is fully functionally dependent on the primary key since Grade is not functionally dependent on either of the attributes which make up the primary key by itself. At this point, the STUDENT and GRADES tables look OK, but there is still something not quite right about the SECTION table. This issue will be addressed as we move to third normal form.

M 11.6: Third Normal Form

Third normal form (3NF) is based on a concept known as transitive dependency. Informally, this means that non-key attributes are not allowed to define other non-key attributes. Another way to state this is that no non-key attribute can be functionally dependent on another non-key attribute. More formally, a functional dependency X -> Y in relation R is a transitive dependency if there exists a set of attributes Z in R such that Z is neither a candidate key nor a subset of any key of R, and both X -> Z and Z -> Y are true. According to Codd's definition, a relation is in 3NF if it is in 2NF and no nonprime attribute of R is transitively dependent on the primary key. This means that transitive dependencies must be removed from tables to put them in 3NF. Consider again the three tables from the last sub-module, all of which are in 2NF.
STUDENT
Student_id   Student_name   Class_rank

SECTION
Section_id   Course_number   Course_name   Credit_hours

GRADES
Student_id   Section_id   Grade

In earlier modules we indicated that the name of a student may not be unique. This means that Student_name -> Class_rank does not hold. Also, since there are many students in each rank (freshman, sophomore, etc.), Class_rank -> Student_name does not hold. Since neither non-key attribute is dependent on the other, the STUDENT table is already in 3NF. Similarly, since the GRADES table has only one non-key attribute, there are no transitive dependencies in this table, so it is also already in 3NF. However, informally, we see a redundancy problem with the SECTION table. Since there are many sections of the same course, Course_number, Course_name, and Credit_hours will be duplicated in many tuples. We see that Course_number -> Course_name holds and Course_number -> Credit_hours also holds. This means that the SECTION table contains transitive dependencies, since non-key attributes are functionally dependent on another non-key attribute. The SECTION table is in 2NF, but it is not in 3NF. This is resolved by creating another table which contains the non-key attribute that functionally determines other non-key attributes. The determining non-key attribute will be the primary key of the new table. All of the impacted attributes will be removed from the original table and placed in the new table. The exception is that the attribute which is the primary key in the new table will remain as a foreign key in the old table. Following this description, the SECTION table is split into two tables as follows.

SECTION
Section_id   Course_number

COURSE
Course_number   Course_name   Credit_hours

Since the new SECTION table has only one non-key attribute, it is now in 3NF. An interesting question arises when looking at the COURSE table. It is clear that Credit_hours -> Course_name does not hold since many, many courses are three credit hours. What about Course_name -> Credit_hours? If the assumption is made that Course_name is not unique, then it does not determine Credit_hours and the table is in 3NF. If the assumption is made that Course_name is unique, then Course_name does determine Credit_hours. However, if Course_name is unique, it is a candidate key which was not selected as the primary key. Looking carefully at the definition above, a candidate key may determine a non-key attribute without causing a transitive dependency, so the table shown above is in 3NF.

M 11.7: Further Normalization

As mentioned in sub-module 11.3, there are normal forms beyond third normal form. They will not be covered in this course, but they will be presented in brief form here. Boyce-Codd normal form was originally proposed as a simpler statement of 3NF. However, it was discovered that it is not equivalent to 3NF, but is actually stricter. It turns out that every relation in BCNF is in 3NF, but the converse is not necessarily true. While most relations in 3NF are also in BCNF, there are some relations which are in 3NF but are not in BCNF. Because of this, some refer to BCNF as 3.5NF, but most use the designation BCNF. Presenting an example of BCNF is a bit more complex than providing examples of the earlier normal forms, and the university database does not provide an example of a table which is in 3NF but not in BCNF, so an example will not be provided. Further information, including an example, can be found in Section 14.5 in the text.
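Before moving on to the higher normal forms, it may help to see the final set of relations from sub-modules 11.4-11.6 written out as table definitions. The sketch below issues the definitions through JDBC, as in Module 10; the SQL data types and column lengths are assumptions, since this module only names the attributes.

    // stmt is a Statement created from an open Connection, as in Module 10
    stmt.executeUpdate("CREATE TABLE STUDENT ("
            + "Student_id CHAR(9) PRIMARY KEY, "
            + "Student_name VARCHAR(40), "
            + "Class_rank VARCHAR(10))");

    stmt.executeUpdate("CREATE TABLE COURSE ("
            + "Course_number CHAR(8) PRIMARY KEY, "
            + "Course_name VARCHAR(40), "
            + "Credit_hours INTEGER)");

    stmt.executeUpdate("CREATE TABLE SECTION ("
            + "Section_id CHAR(8) PRIMARY KEY, "
            + "Course_number CHAR(8) REFERENCES COURSE(Course_number))");

    stmt.executeUpdate("CREATE TABLE GRADES ("
            + "Student_id CHAR(9) REFERENCES STUDENT(Student_id), "
            + "Section_id CHAR(8) REFERENCES SECTION(Section_id), "
            + "Grade CHAR(2), "
            + "PRIMARY KEY (Student_id, Section_id))");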
Fourth normal form (4NF) was defined to address an issue known as multivalued dependency, and fifth normal form (5NF) was defined to address an issue known as join dependency. These types of dependency occur very infrequently, and in practice, most database designs stop at either 3NF or BCNF. Additional information is provided in Sections 14.6 and 14.7 in the text. Regardless of how far the normalization process is taken, it is sometimes desirable to denormalize part of a database to improve performance. Denormalization is the term used to describe the process of taking two base tables and storing the join of the two tables as a base relation. The joined table will be in a lower normal form than either of the two tables participating in the join. If the joined table is used quite often in queries, storing the joined table eliminates the need to perform the join for each query. This improves performance when querying the table. Denormalization does lead, however, to the issues inherent in a table which has a lower normal form. The tradeoff needs to be evaluated in each situation to determine whether or not denormalization is appropriate for the situation.

M 12.2: Introduction to Object-Oriented Concepts and Features

The term object-oriented was first applied to programming languages, often referred to as OOPLs. These date back as far as the late 1960s. One of the early OOPLs was Smalltalk, developed in the 1970s. Smalltalk was designed to be an object-oriented language, so it is known as a "pure" OO programming language. This is different from a "hybrid" OOPL, which takes an already existing language and extends it with OO features. An example of this is that the C programming language was extended in the 1980s to incorporate OO features. The extended language is known as C++. Java was developed later and is considered by some, but not all, to be a pure OOPL. Regardless of which side of the argument someone takes, all would agree that it is closer to a pure OOPL than C++ is. The debate as to whether or not Java is a pure OOPL is largely irrelevant for the purposes of this course and will not be pursued here. Note that much of what is covered in parts of this module you will likely have seen in the context of working with an object-oriented programming language. For most of you, that is probably Java or C++, possibly both. Various object-oriented concepts will be presented briefly with the assumption that the basic concepts will be a review for you. How these concepts are incorporated into ODBs will be the main focus of the module. An object consists of two components: its state (its current value) and its behavior (its operations). Objects can, among other things, have a complex data structure and specific operations defined by the programmer. In an OOPL, objects exist only for the life of the program. Such objects are called transient objects. An OODB can save objects permanently in a database. Such objects are called persistent objects. These objects exist after a program terminates and can be retrieved later by another program. An OODB stores objects permanently in secondary storage, which provides the ability to share the objects among multiple programs and applications. This requires features common to other types of DBMSs, such as indexing to efficiently locate the objects, concurrency control to allow sharing, and recovery from failures. An OODB will usually interface with at least one OOPL to provide persistent and shared object capabilities.
In OOPLs, objects have instance variables, which hold internal state of the object. Instance variables are somewhat similar to attributes in RDBMSs. Unlike attributes, however, instance variables may be encapsulated and are not always made visible to end users. Instance variables may be of arbitrarily complex data types. ODBs permit definition of operations (functions or methods) that can be applied to objects of a particular type. Some OO models insist that the operations be predefined. This restriction forces the complete encapsulation of objects. This restriction has been relaxed in most OO data models. One reason is because users often need to know attribute names to specify selection conditions to retrieve the desired objects. Another reason is that complete encapsulation requires that even simple retrieval operations must have a predefined operation. This makes it difficult to write ad hoc queries. To promote the use of encapsulation, operations are defined in two parts. The first part is the signature or interface of the operation. The signature is the name of the operation and its parameters. The second part is the method or body of the operation. This provides the implementation of the operation, often written in a general purpose programming language. An operation is invoked by passing a message to an object. The message includes the operation name and its parameters. The object will then execute the method for the operation. Encapsulation enables the internal structure of an object to be modified without the need to change the external program that calls the operation. Another main concept of OO systems is that of type and class hierarchies and inheritance. Inheritance permits creation of new types or classes that inherit much of their structure and operations from previously defined types and classes. Early OODBs had an issue with representing relationships among objects. Early insistence on complete encapsulation led to the thought that relationships should not be specifically represented, but rather described by defining methods to locate related objects. However, this does not work well with complex databases which have a large number of relationships since in these cases it is useful to identify relationships and make them visible to users. The ODMG standard recognized this and the standard represents binary relationships as a pair of inverse references. Another object-oriented concept is operator overloading, or the ability to apply a given operation to different types of objects. This is also called operator polymorphism. The next few sub-modules present the main characteristics of ODBs. M 12.3: Object Identity and Objects versus Literals One goal of ODBs is to preserve the direct correspondence between real-world and database objects. To accomplish this, a unique identity is assigned to each independent object in the database. The unique identity is usually accomplished by assigning a unique value. This value is normally generated by the system. The value is known as the object identifier (OID). The OID may not be visible to the external user, but is used internally by the database to identify each object uniquely. The OID can be assigned to program variables of the appropriate type when needed. The main property of the OID is that it must be immutable. This means that the value does not change. This preserves the identity of the real-world object that the database object represents. 
To accomplish this, an ODB must have a method to generate OIDs and maintain their immutability. It is also preferable to use each OID only once. Even if the object is deleted from the database, the OID should not be assigned to a different object. This implies that the OID should not depend on any attribute values of the object since the values might be changed. The OID can be somewhat likened to the use of the primary key for tables in relational databases. It is also not a good idea to base the OID on the physical address where an object is stored, since these addresses can change over time. Some early ODBs used physical addresses as OIDs to increase efficiency. Indirect pointers were then used if the physical address of an object changed. Now it is more common to use long integers as OIDs and use a hash table to map the OID to the current physical address of the object. Early OO models required that everything be an object. This forced basic values (integer, etc.) to have an OID. This led to the possibility that a value, such as the integer 3, may have several different OIDs. While this can be useful in a theoretical model, it is not very practical since it would lead to too many OIDs. Most ODBs now allow for both objects (which get an OID) and literals (which just have values) which are not assigned OIDs. This requires that literals must be stored within an object and the literals cannot be referenced directly from other objects. In many systems, it is allowable to create complex structured literals which do not need an OID. M 12.4: Complex Type Structures for Objects and Literals Objects and literals may have an arbitrarily complex type structure to contain all necessary information to describe the object or literal. In RDBMSs, information about a particular object is spread across many relations and tuples. This leads to a loss of the direct mapping between a real world object and its database representation. In an ODB, it is possible to construct a complex type from other types by nesting type constructors. The three most basic constructors are atom, struct (tuple), and collection. One constructor has been known as the atom constructor. This term is still used, but was not used in latest standard. It includes the basic built-in types of the object model, which are similar to basic types such as integer, floating-point, etc. found in many programming languages. They are called single-valued or atomic types, since the value is considered an atomic single value. Another constructor is called the struct or tuple constructor. As with the atom constructor, these terms are not used in latest standard. This constructor can be used to create standard structured types such as tuples of the relational model. This type is made up of several components and is sometimes called a compound or composite type. The struct constructor is not actually a type, but rather a type generator since many different struct types can be created. An example would be a struct Name which is made up of FirstName (a string), MiddleInitial (a char), and LastName (a string). Note that struct and atom type constructors are the only two available in the original relational model. There are also the collection (or multivalued) type constructors. These include the set(T), list(T), bag(T), array(T), and dictionary(K,T) type constructors. They allow part of an object/literal value to include a collection of other objects or values when needed. 
These are also considered to be type generators since many different types can be created. For example, set(string), set(integer), etc. are different types. All elements in a given collection value must have the same type. The atom constructor is used to represent basic atomic values. The tuple constructor creates structured values and objects of the form <a1:i1, a2:i2, ...> where each ai is an attribute name (an instance variable in OO terminology) and each ii is a value or OID. The other constructors are all different, but are collectively called collection types. The set constructor creates a set of distinct elements, all of the same type. The bag constructor (also known as a multiset) is similar to the set constructor, but the elements in a bag need not be distinct. The list constructor creates an ordered list of OIDs or values of the same type. The array constructor creates a single-dimensional array of elements of the same type. Arrays and lists are similar, except that an array has a maximum size, while a list can contain an arbitrary number of elements. The dictionary constructor creates a collection of key-value pairs. A key can be used to retrieve its corresponding value. An object definition language (ODL) is used to define object types for a particular ODB application. An example of this using a simplified demonstration ODL described in the text would be:

define type EMPLOYEE
    tuple ( Fname:       string;
            Minit:       char;
            Lname:       string;
            Ssn:         string;
            Birth_date:  DATE;
            Address:     string;
            Salary:      float;
            Supervisor:  EMPLOYEE;
            Dept:        DEPARTMENT; );

define type DATE
    tuple ( Year:   integer;
            Month:  integer;
            Day:    integer; );

define type DEPARTMENT
    tuple ( Dname:      string;
            Dnumber:    integer;
            Mgr:        tuple ( Manager:    EMPLOYEE;
                                Start_date: DATE; );
            Employees:  set(EMPLOYEE);
            Projects:   set(PROJECT); );

In this example, the EMPLOYEE type is defined using the tuple constructor. Several of the types or attributes within the tuple, such as Address and Salary, are defined by the atom constructor using basic types. Three of the items, including Birth_date, refer to other objects; Birth_date, in this case, refers to the DATE type. Attributes of this kind are basically OIDs that are references to the other object. These are used to represent the relationships among the various objects. A binary relationship can be represented in one direction, such as with Birth_date. In other cases, the relationship is represented in both directions, which is known as an inverse relationship. An example of this is the representation of the relationship between EMPLOYEE and DEPARTMENT. In the EMPLOYEE type, the Dept attribute is a reference to the DEPARTMENT object where the employee works. In the DEPARTMENT type, the Employees attribute has as its value a set of references to objects of the EMPLOYEE type. This set represents the set of employees who work in the department.

M 12.5: Encapsulation of Operations and Persistence of Objects

Encapsulation of Operations: Encapsulation is one of the main characteristics of object-oriented languages and systems. Encapsulation is related to abstract data types and information hiding in programming languages. This concept was not used in traditional database systems since the structure of database objects was made visible to the users and application programs. For example, in relational database systems, SQL commands can be used with any relation in the database. In ODBs, the concept of encapsulation can be used since the behavior of a type of object is based on the operations that can be externally used with the object.
Some operations can be used to update the object, others can be used to retrieve the current values of the object, and others can be used to apply calculations to the object. Usually the implementation of an operation is specified in a general-purpose programming language. External users of the object are only made aware of the interface of the operations: the name and parameters of each operation. However, the actual implementation is hidden from the users. The interface is called the signature, and the implementation is called the method. The restriction that all objects should be completely encapsulated is too strict for database applications. This restriction is eased by dividing the object into visible and hidden attributes (the instance variables). Visible attributes are visible to the end user and can be accessed directly through the query language. Hidden attributes are not visible and can be accessed only through predefined operations. The term class is used when referencing both the type definition and the definitions of the operations for the type. An example of how the EMPLOYEE type definition shown in the last sub-module could be extended to include operations would be:

define class EMPLOYEE
    type tuple ( Fname:       string;
                 Minit:       char;
                 Lname:       string;
                 Ssn:         string;
                 Birth_date:  DATE;
                 Address:     string;
                 Salary:      float;
                 Supervisor:  EMPLOYEE;
                 Dept:        DEPARTMENT; );
    operations   age:         integer;
                 create_emp:  EMPLOYEE;
                 delete_emp:  boolean;
end EMPLOYEE;

In this example, an operation age is specified. It needs no parameters and returns the current age of the employee by computing the age based on the birth date of the employee and the current date. This method would be defined externally using a programming language. A constructor operation is usually included. In the example it is given the name create_emp, but many ODBs have a default name for the constructor, usually the name of the class. A destructor operation is also usually included to delete an object from the database. In the example, the destructor is named delete_emp and will return a Boolean indicating whether or not the object was successfully deleted. An operation is invoked for a specific object by using the familiar dot notation.

Specifying Object Persistence: An ODB is often closely aligned with a particular OO programming language which is used to specify the operation implementation. In some cases an object is not meant to be stored in the database. These transient objects will not be kept once the program using them terminates. This type of object is contrasted with persistent objects, which are stored in the database and remain after a program terminates. The two methods used most often to make an object persistent are naming and reachability. The naming technique involves giving an object a unique persistent name within the database. This can be done by giving an object a persistent object name in the program used to create the object. The named objects are entry points to the database which are used to access the object. Since it is not practical to give names to all objects in a large database, most objects are made persistent using the second method: reachability. This works by making the object reachable from some other persistent object. An object Y is said to be reachable from an object X if a sequence of references can lead from X to Y. It is interesting to note that in a relational database, all objects (tuples) are assumed to be persistent.
In an RDB, when a table is created, it represents both the type declaration and a persistent set of all tuples. In an object-oriented database, declaring a class defines only the type and operations for the class. The designer must separately declare an object to be persistent. M 12.6: Type Hierarchies and Inheritance Simplified Model for Inheritance: Similar to OO programming languages, ODBs allow type hierarchies and inheritance. In Section 12.1.5, the text presents a simple OO model in which attributes and operations are treated together since both can be inherited. Inheritance permits the definition of new types based on, or inheriting from, existing types. This leads to a class hierarchy. A type is defined by first specifying a type name and then by defining attributes and operations for the type. In the simplified model, attributes and operations taken together are called functions. A function name can be used to refer to the value of an attribute or to the value generated as the result of an operation. The simplest form of a type is a type name and a list of visible, or public, functions. In the simplified model presented in the text, a type is specified in the format: TYPE_NAME: function, function, function, ..., function The functions are specified without parameters. Attributes would not have parameters, and operations are listed only by name for simplicity. Although they are not identical, an ODB type specification can look very much like a table definition in a relational database.. When a new type needs to be created and is not identical to but is similar to an already defined type, ODBs allow a subtype to be created based on an existing type. The subtype inherits all of the functions from the existing type. The existing type is now called the supertype. For example, using the syntax shown above, a person type might be defined as: PERSON: Name, Address, Birth_date, Age, Ssn Similarly, a student type might be defined as: STUDENT: Name, Address, Birth_date, Age, Ssn, Major, Gpa It can be seen that these two types are somewhat similar. A student is a person, but it is desired to include two additional attributes in the student type. Based on this, it is reasonable to derive the STUDENT type from the PERSON type and add the two additional attributes. That can be accomplished with syntax similar to the following: STUDENT subtype-of PERSON: Major, Gpa. Here, STUDENT will inherit all of the functions from PERSON. Only the Major and Gpa functions will need to be defined for STUDENT. Constraints on Extents Corresponding to a Type Hierarchy: In most ODBs, an extent is defined to store the collection of persistent objects for each type or subtype. If this is true for the ODB being used, there is a corresponding constraint that every object in a subtype extent must also be a member of the supertype extent. Some ODBs have a predefined system type (usually called either the ROOT class or the OBJECT class ) whose extent contains all objects in the system. Then all objects are placed into additional subtypes that are meaningful based on the application. This creates a class hierarchy or type hierarchy for the system. All extents then become subsets of the OBJECT class. An extent can be defined as a named persistent object whose value is a persistent collection that holds a collection of objects of the same type that are stored permanently in the database. Such objects can be accessed and shared by multiple programs. It is also possible to create a transient collection. 
For example, it is possible to create a transient collection to hold the result of a query run on a persistent collection. The program can then work with the transient collection, which will go away when the program terminates. The ODMG model is more complex than what is presented here. This standard allows two types of inheritance. One is type inheritance, which the standard calls interface inheritance. The other type is the extent inheritance constraint. Details of the standard will not be covered in this course. M 12.7: Additional Object-Oriented Concepts and a Summary of ODB Concepts There are a few additional object-oriented concepts that you may have already seen in an object-oriented programming class. One is the polymorphism of operations, which may be better known as operator overloading. This concept permits the same operator name (which may be the symbol for the operator) to be applied differently depending on the type of object the operator is applied to. In ODBs, the same concept can be used. In an ODB, different implementations of a given method may need to be used depending on the actual type of object the method is being applied to. Another concept is multiple inheritance. Multiple inheritance occurs when a subtype inherits from two or more types, thereby inheriting the functions of both supertypes. There are some issues with multiple inheritance, such as the potential for ambiguity if any functions in the supertypes have the same name. There are methods to handle the potential problems, but some ODBs and OOPLs prefer to avoid the issue completely and do not allow multiple inheritance. The main concepts used in ODBs and object-relational systems are:
o Object identity
o Type constructors
o Encapsulation of operations
o Programming language compatibility
o Type hierarchies and inheritance
o Extents
o Polymorphism and operator overloading
M 12.8: Object Database Extensions to SQL Some features from object databases were first added to SQL in the SQL standard known as SQL:99. These features were first treated as an extension to SQL, but these and additional features were included in the main part of SQL in the SQL:2008 standard. As we have seen, a relational database which includes object database features is usually called an object-relational database. Some of the object database features which have been included in SQL are:
o Type constructors to specify complex objects
o A method for providing object identity using a reference type
o Encapsulation of operations through user-defined types (UDTs), which may include operations as part of their declaration. In addition, user-defined routines (UDRs) allow the definition of general operations (methods).
o Inheritance, provided using the keyword UNDER
Additional details and some of the syntax SQL specifies to accomplish the above are provided in Section 12.2 in the text. These details will not be covered in this course. M 13.1: Transactions A transaction is the term used to describe a logical unit of processing in a database. For many years now, large databases with hundreds, even thousands, of concurrent users have been used by almost everyone on a daily basis. The use of these databases continues to increase. Such databases are often referred to as transaction processing systems. These systems are exemplified by applications such as purchasing concert and similar event tickets, shopping online, online banking, and numerous other applications.
As further discussed below, a transaction must be completed in its entirety or the database may be left in an incorrect state. This module focuses on the basic concepts and theory used to make sure that transactions are executed correctly. This includes dealing with concurrency control problems which can occur when multiple transactions are submitted by different users. The problem happens when the requests of the different users interfere with one another. Also discussed is the issue of recovering the database when transactions fail. Back in Module 2, we saw that one of the ways to classify DBMSs was by number of users: single-user vs. multiuser. Although not a problem with single-user systems, there is a potential problem of interference with multiuser systems since the database may be accessed by many concurrent users. Multiuser systems have been available for decades, and have been based on the concept of multiprogramming, where an operating system (OS) permits the execution of multiple programs (technically processes) at apparently the same time. For most of these decades, computers running multiprogramming operating systems contained a single CPU. A single CPU can actually run only one process at any given time. The OS runs a process for a short time, suspends its execution, and then begins or continues the execution of another process, and so on. Since both the time slice given to a process and the necessary switching between processes happen at computer speed, the execution appears to be simultaneous to the user. This leads to interleaved concurrency in multiuser systems. Newer systems have multiple CPUs and these systems can execute multiple processes concurrently. Details of these and related topics will be left to an operating systems class and will not be further discussed here. Most of the theory presented in this sub-module was developed many years ago (before multiple-CPU systems), and since the general concepts still apply, a single-CPU system will be assumed for the presentation in this sub-module. A transaction includes one or more database access operations. The access can be a retrieval or any of the update operations. A transaction can be specified using SQL or can be part of an application program. In either case, both the beginning and end of a transaction must be specified. The details of the specification can vary, but the statements will be similar to begin transaction and end transaction. All database operations between the transaction delimiters are considered to form a single transaction. If all the database operations within the transaction only retrieve data, but do not update the database, the transaction is considered to be a read-only transaction. If one or more of the operations can possibly update the database, the transaction is considered to be a read-write transaction. Since the OS may interrupt a program in the middle of a transaction, it is possible that two different programs attempt to modify the same data item at the same time. This possibility can lead to problems in the database unless concurrently executing transactions can be run with some type of control. As an example, consider purchasing a ticket online for an event. To keep the example simple, consider that the event has open seating. The ticket allows entry to the event but the seats are not reserved, so all tickets are "equal." Assume that customers Nittany and Lion are both interested in purchasing tickets to the event.
Assume the timeline of their interactions looks something like this:

9:05 AM - Nittany reads the record and sees that 50 tickets are available, then begins to discuss price and other details with friends.
9:08 AM - Lion reads the record and also sees that 50 tickets are available, then begins a similar discussion with friends.
9:17 AM - In Nittany's group, a total of 6 decide to go, so an update is issued to get the tickets; this updates the record, leaving a total of 44 available tickets (50 - 6).
9:19 AM - In Lion's group, a total of 7 decide to go, so an update is issued to get the tickets; because this update is based on the value read at 9:08, the record is left showing 43 available tickets (50 - 7).

Based on this example, both customers get their tickets. However, this leaves the database showing that 43 tickets are still available when there are actually only 37 tickets remaining (50 - 6 - 7). Based on this scenario, the update from Nittany was overwritten by the update from Lion. This is known as the lost update problem since the update from Nittany was lost. There are related issues which also must be addressed. The dirty read problem occurs when transaction A updates a database record, say record R, and then the updated record R is read by transaction B. However, transaction A is not able to complete, and the update to record R must be rolled back to its original value, the value it had before transaction A began. Since transaction B read a value in record R that was later replaced by the older value, transaction B sees an invalid or dirty value in record R. Another potential problem is called the incorrect summary problem. If one transaction is calculating values for an aggregate function while other transactions are updating the value(s) of one or more tuples being used in the computed (summarized) value by the aggregate function, it is possible that some of the tuples are read with the values they had before the update while other tuples are read with the values they have after the update. This leads to incorrect results. The unrepeatable read problem occurs when a transaction reads the same item twice and another transaction changes the value of the item in between the two reads. The first transaction sees different values the two times the value is read. The next sub-module discusses some techniques which can be used to control concurrency. The DBMS must guarantee that for every transaction, either all database operations requested by the transaction complete successfully and are permanently recorded in the database or the transaction has no impact on the database. In the first case, the transaction is said to be committed, while in the second case the transaction is said to be aborted. Therefore, if a transaction makes some of its updates, but then fails before completing its remaining updates, the completed updates must be undone so they will have no lasting impact on the database. There are several reasons a transaction might fail. These include:
o A computer failure
o A transaction or system error
o Local errors or exception conditions detected by the transaction
o Concurrency control enforcement
o Disk failure
o Physical problems and catastrophes
The DBMS must account for the possibility of all of these failures and have a procedure in place to recover from each type. M 13.2: Concurrency The last sub-module briefly discussed the need for concurrency control. This sub-module discusses concurrency control techniques which can be used to ensure that transactions which are running concurrently can be executed so they do not interfere with each other. Most of the techniques guarantee serializability. When transactions can be interleaved, it is impossible to determine the order in which the transactions will be executed and how and when each might be interrupted by the other.
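The interleaving above can be made concrete with a small sketch. This is a simulation only, using an in-memory dictionary in place of a real database record; the names and values are hypothetical.

# Simulated "database" record for the event; no real DBMS or locking involved.
db = {"tickets_available": 50}

# Interleaved execution with no concurrency control:
nittany_sees = db["tickets_available"]        # 9:05 AM - Nittany reads 50
lion_sees = db["tickets_available"]           # 9:08 AM - Lion also reads 50
db["tickets_available"] = nittany_sees - 6    # 9:17 AM - Nittany buys 6, record shows 44
db["tickets_available"] = lion_sees - 7       # 9:19 AM - Lion buys 7, record shows 43

# Nittany's update has been overwritten (lost): only 37 tickets actually remain.
print(db["tickets_available"])                # prints 43

Running the two updates one after the other, in either order, would leave the record at the correct value of 37.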
If two transactions don't interrupt each other, one of two situations will occur: either transaction A will run first and at a later time transaction B will run, or transaction B will run first and at a later time transaction A will run. Both orderings are considered to be correct. Both are called serial schedules since the transactions run in a series: first one, then the other. When the transactions interrupt each other, it is possible that they interact in a way that is not desired. An example of this was shown in the last sub-module when discussing the lost update problem. One way to prevent the problems discussed in the last sub-module is to simply prohibit concurrent execution of transactions. Transactions would be required to execute in a serial fashion. Once a transaction starts, no other transaction can start until the executing transaction completes. While this is a valid solution to the problems, it is not acceptable in practice since it would eliminate concurrency and cause a vast under-utilization of system resources. Since most transactions will not interfere with each other, this loss of concurrency is considered unacceptable. In practice, concurrent transactions are permitted as long as it can be determined that the results of their execution will be equivalent to a serial schedule. We will not further pursue the detailed concepts surrounding serializability. If you are interested, please see Chapter 20, Section 5 in the text. An early method used to guarantee serializability of schedules is the use of locking techniques, specifically two-phase locking. Although still in use today by some DBMSs, most consider locking to have high overhead and have moved to other protocols. Since basic locking techniques are relatively easy to understand and since many of the newer protocols are derived from locking, the basics of locking are described here. A lock is usually implemented as a variable associated with a data item which indicates which operations can be applied to the data item at a given time. To start the discussion, we will describe a binary lock. This type of lock is simple to implement and to understand, but it is too restrictive to be used in database systems. A binary lock has two states: locked and unlocked (which can be represented as 1 and 0). If the value of the lock is 1 (locked), the data item cannot be accessed. When this happens, a transaction needing to access the item will wait until the lock is unlocked. If the value of the lock is 0 (unlocked) the item can be accessed. If a transaction finds the value of the lock is 0, it will change the lock value to 1 before accessing the item so any transactions which follow will see the item as locked. Once the transaction finishes modifying the data item, it will unlock the lock by resetting the value to 0. A binary lock scheme is easy to implement. For any item which is locked, an entry is put in a lock table. This entry contains the name of the data item and the ID of the locking transaction. A queue is also kept of any transactions currently waiting on this item to be unlocked. Any data item not in the lock table is unlocked. The binary lock is too restrictive for database use since only one transaction at a time can hold a lock on a particular item. More than one transaction should be allowed to access the same item as long as all the transactions are accessing the item only to read the item. If any transaction wants to modify the item, it will need an exclusive lock on the item. 
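The binary lock table just described might be sketched as follows, with each entry holding the ID of the locking transaction and a queue of waiting transactions. The function and field names are illustrative and not taken from any particular DBMS.

# Binary lock table: any item not present in the table is unlocked.
lock_table = {}

def lock_item(item, tx_id):
    """Grant the lock if the item is unlocked; otherwise queue the transaction."""
    if item not in lock_table:
        lock_table[item] = {"holder": tx_id, "waiting": []}
        return True
    lock_table[item]["waiting"].append(tx_id)   # transaction must wait
    return False

def unlock_item(item):
    """Release the lock; if a transaction is waiting, it becomes the new holder."""
    entry = lock_table.pop(item)
    if entry["waiting"]:
        next_tx = entry["waiting"].pop(0)
        lock_table[item] = {"holder": next_tx, "waiting": entry["waiting"]}
        return next_tx
    return None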
To allow this kind of shared read access, a different type of lock, which has three possible states, must be used. This type of lock is called a shared/exclusive or read/write lock. The three states are unlocked, read-locked, and write-locked. The read lock is a shared lock, and the write lock is an exclusive lock. Once a transaction holds a write lock, it may modify the item since no other transaction can hold a lock on the item at that time. When the transaction finishes modifying the item, it will unlock the item. When a transaction only wants to read the item, it can ask for a read lock. If the item is unlocked or other transactions hold read locks on the item, the read lock is granted. If another transaction holds a write lock on the item, the lock is not granted. Once a transaction holding a read lock is finished reading the item, it will release its read lock. For this type of locking scheme, the lock table entries need to be modified. In addition to the name of the data item and locking transaction ID, the entry must contain the type of lock (read or write), and if a read lock, the number of transactions holding the lock. If more than one transaction holds a read lock on the item, the locking transaction ID becomes a list of the IDs of all transactions holding the lock. When a read lock is released and more than one transaction is holding the lock, the count is decreased by one and the ID of the transaction releasing the lock is removed from the list. An additional enhancement is to allow a transaction to convert a lock from one type of lock to the other if certain conditions are met. It is possible to convert a read lock to a write lock if no other transaction has a read lock on the item. This is called a lock upgrade. It is also possible to convert a write lock to a read lock once a transaction is finished updating the data item. It would perform this lock downgrade if it still needs to have the item available for reading. If it is completely finished with the item, it would simply release the lock to unlock the item. Serializability can be guaranteed if transactions follow a two-phase locking protocol. A two-phase locking protocol is followed if all locking operations, both read locks and write locks, are completed before the first unlock operation is executed. This divides the transaction into two parts. The first is called the expanding phase, where locks are acquired but not released. The second is called the shrinking phase, where locks are released but no new locks can be acquired. If lock conversion is permitted, upgrading from a read lock to a write lock must be done during the expanding phase, and downgrading from a write lock to a read lock must be done during the shrinking phase. The above describes a general form of two-phase locking (2PL) that is known as basic 2PL. There are other variations of 2PL. A variation known as conservative 2PL requires a transaction to acquire all of its locks before the transaction can begin execution. If all the required locks cannot be obtained, the transaction will not lock any item and will wait until all locks can be obtained. This is a deadlock-free protocol (see below), but difficult to use since a transaction may not know all the locks it will need prior to beginning execution. A more widely used version is known as strict 2PL. In this variation, a transaction does not release any write locks until after the transaction either commits or aborts.
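The shared/exclusive lock table entries described earlier in this sub-module (lock type, list of holder IDs, and a read count) might be sketched as follows. This sketch ignores waiting queues and the two-phase discipline itself, and the names are illustrative only.

# Shared/exclusive lock table: each entry records the lock type, the IDs of
# the transactions holding it, and (for read locks) the number of holders.
locks = {}

def read_lock(item, tx_id):
    entry = locks.get(item)
    if entry is None:                               # unlocked: grant the read lock
        locks[item] = {"type": "read", "holders": [tx_id], "count": 1}
        return True
    if entry["type"] == "read":                     # already read-locked: share it
        entry["holders"].append(tx_id)
        entry["count"] += 1
        return True
    return False                                    # write-locked: must wait

def write_lock(item, tx_id):
    entry = locks.get(item)
    if entry is None:                               # unlocked: grant the write lock
        locks[item] = {"type": "write", "holders": [tx_id], "count": 1}
        return True
    if entry["type"] == "read" and entry["holders"] == [tx_id]:
        entry["type"] = "write"                     # lock upgrade: sole reader only
        return True
    return False                                    # otherwise must wait

def unlock(item, tx_id):
    entry = locks[item]
    entry["holders"].remove(tx_id)
    entry["count"] -= 1
    if entry["count"] == 0:
        del locks[item]                             # item is now unlocked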
Strict 2PL prohibits another transaction from reading a value which may not end up being committed to the database. This variation is not deadlock-free. Another version, rigorous 2PL, is more restrictive than strict 2PL, but less restrictive than conservative 2PL. Using rigorous 2PL, a transaction does not release any locks until after it commits or aborts. This makes it easier to implement than strict 2PL. One issue which must be addressed is deadlock. Deadlock occurs when two (or more) transactions are waiting to obtain a lock which is held by the other. Since neither can proceed until obtaining the required lock, neither will free the lock needed by the other. One general method for dealing with deadlock is to prevent deadlock from happening in the first place. These protocols are called deadlock prevention protocols. One such protocol is the conservative 2PL protocol discussed above. There are other protocols which have been developed to prevent deadlock. A second general method for dealing with deadlock, deadlock detection, is to detect that deadlock has occurred and then resolve the deadlock. This method usually provides more concurrency, and is attractive in cases where there will be minimal interference between transactions. This is usually the case when most transactions are short and lock only a few data items. It is also usually the case when the transaction load is light. Once deadlock is detected, at least one of the transactions involved in the deadlock must be aborted. This is known as victim selection. There are several criteria that can be considered for selecting the "best" victim, which is the one that causes the least amount of upheaval when it is aborted. An additional issue which must be addressed when using locking is known as starvation. One form of starvation is when a transaction must wait for an indefinite time to proceed while other transactions are able to complete normally. This can happen with certain types of lock waiting schemes, but there are modifications to give some types of transactions priority while still making sure that no transaction is blocked for an extremely long period of time. A second form of starvation occurs when a given transaction repeatedly becomes deadlocked and is chosen as the victim each time, so it is not able to finish its execution. Again, there are schemes that can be implemented to prevent this type of starvation. Additional discussion of this topic can be found in Chapter 21 in the text, beginning with Section 21.2. M 13.3: Database Recovery There are a variety of events that can affect or even destroy data in a database. DBMSs and system administrators must be prepared to handle these events when and if they occur. The event can be as simple as a valid user entering an incorrect value in a tuple or the event can be something as massive as a fire, earthquake, or other disaster destroying the entire data center and everything in it. Thus, the possibilities range from a single incorrect data value on one end to the complete destruction of the database on the other end. Since it is almost certain that such events will occur at least occasionally, tools and procedures must be in place to correct or reconstruct the database. This falls under the heading of database recovery and the related database backup. One aspect of this is the concept of database backup, which has been around for a very long time.
On a regular basis, the database must be backed up (copied) and the backup must be kept in a safe place which is away from the main database site if the backup is placed on a physical medium. This was the case for many years, but more recent advances in networking technology and related speed improvements have made it possible to perform a backup directly to a remote site or to the cloud. The second basic backup task is to keep a disk log or journal of all changes which are made to the data. This includes all updates (insertions, deletions, and modifications) but does not include reads since they do not change the data. There are two basic types of database logs. One is called a change log or a before-and-after image log. This type of log records the value of a piece of data just before it is changed and then again just after it is changed. In addition to the values, the log records the table name and the primary key of the tuple which is changed. The other type of journal, normally called a transaction log, keeps a record of the program (including interactive SQL) which changed the data and all the inputs that the program used. Regardless of which type of log is used, a new log is started as soon as the data is backed up (by making a backup copy). Given that backups and logs are available, how are they used for recovery? The answer is that it depends on what type of recovery needs to be performed. If the problem is a major problem such as the loss of a database table, the entire database, or even the loss of the disk which contains the database table, then at least one table needs to be recreated. This is usually done using a procedure called roll-forward recovery. Considering the recovery of one table (which can be repeated if the need is to recover two or more tables), the process begins with the most recent backup copy of the table. This is a copy of the table which was lost, but it does not reflect the most recent changes to the table. To bring the table up to date, the log is used. Starting from the beginning of the log (which began just after the backup), each log entry which represents an update to the table in question is then applied to the table. Starting at the beginning of the log and then rolling forward in time sequence order will update the table in the order that the changes were made, thus bringing the table up to the point where it was just before the table was "lost." Now consider a different reason for recovery. Assume that during normal database operation an error is found in a recent update to a data value. This can be caused by something as simple as a user entering an incorrect value, or by something more complex like a program crashing after updating some, but not all, records in a transaction. Since the program crashed, it was able neither to commit the transaction nor to abort it. At first glance, it seems that an easy solution is to go in and update the affected values with the accurate data. However, this does not take into account the possibility that after the error occurred but before it was discovered, other programs made updates to the database based on the incorrect data. Because of this possibility, all changes made to the database since the error occurred must be backed out of the database.
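Roll-forward recovery, described above, might be sketched as follows. The log entries, table layout, and names here are hypothetical; a real change log would record the table name, primary key, and before and after images for every update made since the backup.

# Hypothetical change-log entries recorded since the last backup, oldest first.
log = [
    {"table": "ACCOUNT", "pk": 101, "before": 500, "after": 450},
    {"table": "ACCOUNT", "pk": 202, "before": 300, "after": 350},
]

def roll_forward(backup_copy, table_name, log):
    """Rebuild a lost table: start from the backup and reapply changes in order."""
    table = dict(backup_copy)          # the table as it was at backup time
    for entry in log:                  # oldest change first
        if entry["table"] == table_name:
            table[entry["pk"]] = entry["after"]   # reapply the after image
    return table

backup = {101: 500, 202: 300}                     # backup copy of ACCOUNT
print(roll_forward(backup, "ACCOUNT", log))       # {101: 450, 202: 350}

Rollback, discussed next, works in the opposite direction: it restores before images while moving backwards from the end of the log.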
This backing out is done with a process called rollback: starting with the database as it currently exists (a backup copy is not needed in this scenario), the process starts at the end of the log and moves backwards through it, restoring the data values to their "before" values. Working through the log stops when it reaches the point in time where the error was made. Once the database is in this state, the value that was in error can be changed to the correct value. Then the transactions which made changes after the error was made can be rerun. This may need to be done manually, but if a transaction log is kept, a utility program can roll forward through the transaction log and automatically rerun all of the transactions which occurred beginning at the point the error was made. This sub-module only describes the basic concepts of database recovery. Ch. 22 in the text discusses additional concepts in more detail and also provides an overview of various database recovery algorithms. If you are interested, please read that chapter in the text. M 13.4: Database Security This sub-module provides a very brief introduction to database security. Database security can be quite comprehensive and it is not the intent to provide a thorough discussion here. Also, database security shares much in common with overall site and network security. If you have taken a course in security, much of what is covered there, such as physical security of the site, applies to the database system also. These general security techniques will not be specifically discussed here. Our brief overview will be limited to database-specific issues. Types of Security Database security covers many areas. These include the following:
o Legal and ethical issues regarding the right to access certain information.
o Policy issues at the governmental and corporate level.
o System-related issues - at what levels should various security functions be enforced? Some issues can be handled by the hardware or the operating system, others will need to be handled by the DBMS.
o The need of some organizations to provide multiple security levels such as top secret, etc.
Threats to Databases
o Loss of integrity: integrity is the requirement to protect data from improper modification. Integrity is lost by unauthorized changes to the data, whether intentional or accidental.
o Loss of availability: availability is defined as providing data access to humans or programs that have a right to access the data.
o Loss of confidentiality: confidentiality deals with the protection of data from unauthorized disclosure.
Database Security When examining database security, it must be remembered that it is not implemented in a vacuum. The database is usually networked and is part of a complete system which usually includes firewalls, web servers, etc. The entire system needs to work together to implement security. A DBMS usually has a security and authorization subsystem which provides security to prevent unauthorized access to certain parts of a database. Database security is usually split into two main types: discretionary and mandatory. Discretionary security mechanisms involve the ability to grant various database privileges to database system users. Mandatory security mechanisms involve putting users into various security categories which are then used to implement the security policy of the organization. Control Measures The four main control measures used to provide database security are:
o Access control: Used to restrict access to the database itself.
o Inference control: Statistical databases allow queries to retrieve information about group statistics while still protecting private individual information. Security for statistical databases must provide protection to prevent the private information from being retrieved or inferred from the statistical queries.
o Flow control: This prevents information from flowing in a way that allows it to reach users who are not authorized to see the data.
o Data encryption: To add an extra security measure, data in the database can be encrypted. When this is done, even if an unauthorized user is able to get to the stored data, the unauthorized user will have trouble deciphering it.
The Role of the DBA As we discussed in Module 1, the DBA has many responsibilities in the oversight of the database. One of these responsibilities is the overall security of the database. This includes creating accounts, granting privileges to accounts, and assigning users to the appropriate security category. The DBA will assign account numbers and passwords to users who will be authorized to use the database. In most DBMSs, logs will be kept each time a user logs in. The log will also keep track of all changes the user makes to the database. If there are unauthorized changes made to the database, this log can be used to determine the account number (and therefore the user) that made the changes. Additional Information If you are further interested in this topic, additional information is provided in later parts of Chapter 30. You might find of particular interest the discussion of the SQL injection threat in Section 30.4. M 14.1: Introduction and Basic Concepts As discussed in Module 2, early DBMSs were based on a centralized architecture where the DBMS ran on a mainframe and users accessed the database via "dumb" terminals. This was dictated by the technology of the time, when the centralized mainframe was the technology which was available. Early research was based on networking the mainframes, but for several years work in the area remained at the research level. As PCs moved from being considered only hobby machines to being used in business in the 1980s, PCs moved from stand-alone computers to computers which were networked over local area networks. Some DBMSs were reduced in size to run stand-alone on PCs for small databases. However, DBMSs were also deployed on small servers and used in a client/server architecture. The DBMS was still centralized on the server, but some of the processing was moved from the server to the PC client. This was enabled partly by the improvement in local area networks (LANs). In the late 1990s the web became more prevalent, helped by the appearance of the first widely used graphical browsers in the mid-1990s. Throughout the rest of the 1990s and the 2000s, improvements continued to be made in both browsers and in the networking infrastructure supporting the web. This fueled an explosion in web usage. As part of this explosion, web sites were hosting databases to support their evolving online businesses. This led to a move to a three-tier architecture, with one tier being the client running on a browser, the second being an application server running various applications for the company, and the third being a database server which the application server can query. Note that while this distributes the overall processing, the database itself is not necessarily distributed. What is a Distributed Database?
A distributed database (DDB) is a collection of multiple logically interrelated databases distributed over a computer network. A distributed database management system (DDBMS) is a software system which manages a distributed database while making the distribution transparent to the user. To be classified as a distributed database, the database must meet the following:
o There are multiple computers called sites or nodes.
o The sites are connected by an underlying computer network which transmits data and commands among the sites.
o The information in the database nodes must be logically related.
o It is not necessary that all nodes have data, hardware, and software that is identical.
The sites may be connected by a LAN or connected by a wide-area network (WAN). When considering database architecture issues at a high level, the network architecture does not matter. The only requirement is that every node needs to be able to communicate with every other node. The network design is, however, critical to overall distributed database performance issues. Network design is a topic that is beyond the scope of this course and will not be covered here. Transparency In this context, transparency refers to hiding implementation details from users. For a centralized database, this refers just to logical and physical data independence. In a distributed database, there are additional types of transparency which need to be considered. They include the following. Distribution transparency: This deals with the issue that the user should not be concerned about the details of the network and the placement of data. This is divided into two parts. Location transparency is where the command used to perform a task is independent of the location of the data and of the location of the node issuing the command. Naming transparency provides that once named, an object can be accessed without additional location information being specified. Replication transparency: This allows the database to store data at multiple sites, but the user is unaware of the copies. Fragmentation transparency: This provides that a relation (table) can be split into two or more parts and the user is unaware of the existence of the fragments. This is further discussed in the next sub-module. Availability and Reliability In computer systems in general, reliability is defined as the probability that the system will be running at a specified point in time. Availability is the probability that the system will be continuously available during a specified time interval. In distributed databases, the term availability is used to indicate both concepts. Availability is directly related to the faults, errors, and failures of the database. These terms describe related concepts and are described in slightly different ways by various sources. Here we will describe a fault as a defect in the system which can be the cause of an error. An error is when one component of the database enters a state which is not desirable. An error may lead to a failure. A system failure occurs when the system does not function according to its specifications. Potential improvement in availability is one of the main advantages of a distributed database. A reliable DDBMS tolerates failures of underlying components. The recovery manager of the DDBMS needs to deal with failures from transactions, hardware, and communications networks. Scalability and Partition Tolerance Scalability indicates how much a system can expand and still continue to operate without interruption.
Horizontal scalability is the term used for expanding the number of nodes. As nodes are added, it should be possible to distribute some of the data and processing to the new nodes. Vertical scalability is the term used for expanding the capacity of individual nodes. This would include expanding the storage capacity or the processing power of the node. If there is a fault in the underlying network, it is possible that the network may need to be partitioned (for a time) into sub-networks. Partition tolerance says that the database system should be able to continue operating while the network is partitioned. Autonomy Autonomy is the extent to which individual nodes or databases connected to a DDB can operate independently. This is desirable to increase flexibility. M 14.2: Distributed Database Design We have looked at database design issues and concepts throughout the course. When working with a distributed database, in addition to the general design of the database, other factors must be considered. Data Fragmentation and Sharding When using a centralized database, there is no decision to be made as to where to store the data: it is all stored in the database. When the database is distributed, it must be determined which sites should store which portions of the database. For now, assume that the data is not replicated; data is stored at one site only. First, the logical units of the database must be determined so they can be distributed. The simplest logical unit is a complete relation (table). The entire relation will be stored at one site. However, a relation can be split into smaller logical units. One way to do this is by creating a horizontal fragment (also called a shard) of a relation. This is a subset of the tuples of a relation (the table is split horizontally). The fragment can be determined by specifying a condition on one or more attributes of the relation. For example, in a large company with multiple sales locations, the customer relation may be split into several horizontal fragments by using the condition of which sales site is primarily responsible for providing service to the customer. Each fragment can then be stored at the database site closest to the sales site. A second way to split a table into smaller logical units is by vertical fragmentation. This is a division of the relation by columns (attributes); the vertical fragment keeps only certain attributes of the relation. If a relation is split into two parts using vertical fragmentation, at least one of the attributes must be kept in both fragments so the original tuple can be recreated from the two fragments. This common attribute kept in both tables must be the primary key (or some unique key). The two types of fragmentation can be intermixed, resulting in what is known as mixed or hybrid fragmentation. A fragmentation schema includes a definition of the set of fragments, which must include all tuples and attributes and also must allow the entire database to be recreated with a series of outer join and union operations. An allocation schema describes the distribution of fragments to nodes of the DDB. If a fragment is stored at more than one site, it is said to be replicated. Data Replication and Allocation Replication is used to improve the availability of data. At one end of the spectrum is a fully replicated distributed database where the entire database is kept at every site. This improves the availability of the database since the database can keep operating even if only one site is up.
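Before continuing with the replication trade-offs, the horizontal and vertical fragments described above can be sketched with a small, hypothetical CUSTOMER relation; the attribute names and data are illustrative only.

# A hypothetical CUSTOMER relation, represented as a list of tuples (dicts).
customers = [
    {"cust_id": 1, "name": "Nittany", "sales_site": "East", "balance": 120.0},
    {"cust_id": 2, "name": "Lion",    "sales_site": "West", "balance":  75.5},
]

# Horizontal fragment (shard): a subset of the tuples, chosen by a condition
# on an attribute; this fragment could be stored at the East site.
east_fragment = [row for row in customers if row["sales_site"] == "East"]

# Vertical fragment: only certain attributes are kept, and the primary key
# (cust_id) is retained so the original tuples can be reconstructed by a join.
billing_fragment = [{"cust_id": row["cust_id"], "balance": row["balance"]}
                    for row in customers]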
A fully replicated database also improves retrieval performance since the results of queries can be obtained locally at any site. The disadvantage is that full replication can slow down update performance, since the update must be performed on every copy of the database. At the other end of the spectrum is no replication. This provides for faster updates and the database is easier to maintain. However, availability and retrieval performance can suffer if the database is not replicated. In between is a wide variety of partial replication options. Some fragments can be replicated, but not others. The number of copies of each fragment can range from as few as one to as many as the total number of sites. A description of the replication of fragments is called the replication schema. The choice of how the database should be replicated is based on both availability needs and update and retrieval performance needs. M 14.3: Concurrency Control and Recovery Several concurrency control methods for DDBs have been proposed. They all extend the concurrency control techniques used in centralized databases. These techniques will be discussed by looking at extending centralized locking. The general scheme is to designate one copy of each data item as a distinguished copy. The locks for the data item are associated with the distinguished copy, and all locking and unlocking requests are sent to the site which houses the distinguished copy. All the methods based on this idea differ in how the distinguished copies are chosen. In all the methods, the distinguished copy of a data item acts as a coordinator site for concurrency control on that item. Primary Site Technique Using this technique, a single site is chosen to be the primary site and this site is used as the coordinator site for all database items. This site, then, keeps all locks, and all requests for locking and unlocking are sent to this site. The advantage of this technique is that it is a simple extension of the centralized locking approach. The disadvantages are that since all locking requests are sent to a single site, the site might become overloaded and cause an overall system bottleneck. Also, if this site goes down, it basically takes the entire database down since all locking is done at this site. This limits both reliability and availability of the database. Primary Site with Backup Site This approach addresses the problem of the entire DDB being unable to operate if the primary site goes down. With this approach, a second site is chosen as the backup site. All locking information is maintained at both sites. If the primary site fails, the backup site takes over and a new backup site is chosen. This scheme still suffers from the bottleneck issue, and the bottleneck potential is actually worse than that of the primary site technique since all lock requests and lock status information must be recorded at both the primary and backup sites, thereby leading to two potential bottleneck points. Primary Copy Technique This method distributes the load of lock coordination. The distinguished copies of data items are stored at different sites, reducing the bottleneck issue seen with the previous two techniques. If one site fails, only transactions which need a lock on one of the data items whose lock is at the site are affected. This method can also use backup sites to improve reliability and availability. Choosing a New Coordinator in Case of Site Failure If a primary site fails, a new lock coordinator must be chosen.
If a method with a backup site is being used, all transaction processing of impacted data items is halted while the backup site becomes the new primary. Transaction processing will not resume until a new backup site is chosen and all lock information is copied from the new primary site to the new backup site. If no backup site is being used, all executing transactions must be aborted and a lengthy recovery process is then launched. As part of recovery, a new primary site must be chosen and a lock manager process must be started at the chosen site. Once this is done, all lock information must be recreated. If no backup site exists, an election process is used for the remaining active sites to agree on a new primary site. The elected site then establishes the new lock manager. Distributed Concurrency Control Based on Voting In this technique, no distinguished copy is used. A lock request is sent to all sites that have a copy of the data item. Each copy has a lock table for the item and can allow or refuse a lock for the item. When a transaction is granted a lock on the item by a majority of the sites holding the item, the transaction holds the lock and informs all copies that it has been granted the lock. If the transaction does not receive the lock from a majority of sites within a timeout period, it cancels the lock request and informs all sites of the cancellation. This method is a true distributed concurrency control method. Studies have shown that this method produces more message traffic than the techniques which use a distinguished copy. Also, if the algorithm deals with the possibility that a site might fail during the voting process, the algorithm becomes very complex. Distributed Recovery Recovery in a DDB is quite involved and details will not be discussed here. One issue that must be considered is communication failure. If one site sends a message to a second site and does not receive a response, there are several possibilities for not receiving the response. One is that the second site is actually down. Another is that the second site did not get the message because of a failure in the network delivery system. A third possibility is that the second site received the message and sent a response, but the response was not delivered to the initial site. Another problem which must be addressed by distributed recovery is dealing with a distributed commit. When a transaction is updating data at two or more sites, it cannot commit the distributed transaction until it is sure that the data at every site has been updated correctly. This means that each site must have recorded the effects of the transaction at the local site before the distributed transaction can be committed. M 14.4: Overview of Transaction Management and Query Processing in Distributed Databases Distributed Transaction Management With a distributed database, a transaction may require that tables be updated on different nodes of the DDBMS. The concept of a transaction manager must be extended to include a global transaction manager to support distributed transactions. The global transaction manager coordinates the execution of the transaction with the local transaction manager of each impacted site. It will ensure that the necessary copies of each table will be updated and committed at each site. The transaction manager will pass information to the concurrency controller to acquire and then eventually release the necessary locks. Originally, a two-phase commit (2PC) protocol was used.
With this protocol, in phase 1 each participating database will inform the coordinator that it has completed the changes required. The coordinator will then send a message to all nodes indicating that they should prepare to commit. Each node will then write to disk all information needed for local recovery, and then send a ready-to-commit message back to the coordinator. If any local database has an issue where it cannot commit its part of the transaction, it will send a message to the coordinator that it cannot commit. If the coordinator receives a ready-to-commit message from all participating nodes, the coordinator sends a commit signal to all nodes, at which time the commit is completed by the local controller of each node. If one or more nodes had sent a cannot-commit message to the coordinator, the coordinator will send a roll back message to all nodes. Each node will then roll back the local part of the transaction. The two-phase commit protocol has an issue when the global coordinator fails during a transaction. Since this is a blocking protocol, any locks on other sites will continue to be held until the coordinator recovers. This problem was resolved by extending the protocol to a three-phase commit (3PC) protocol. This extension divides the second phase of 2PC into two phases, which are known as prepare-to-commit and commit. The prepare-to-commit phase is used to communicate the results of the replies from phase 1 to all participating nodes. If all replies were yes, the coordinator indicates that all nodes should move to the prepare-to-commit state. The commit phase is the same as the second part of the 2PC. With this extension, if the coordinator crashes during this sub-phase, another participant can take over and continue the transaction to completion, whether that be an eventual commit or an abort. Distributed Query Processing Since this course did not cover how queries are processed or optimized by a non-distributed DBMS, there is no knowledge base to build upon to discuss the details of distributed query processing. However, a few points about query processing with a distributed database can be addressed at a high level. When processing a distributed query, the data must first be located using the global conceptual schema. If any of the data is replicated, the most appropriate site to be used to retrieve the data for this query is identified. These algorithms must take several items into consideration. One is the data transfer costs. To complete some queries, it is necessary to move intermediate results and possibly final results over the network. Depending on the type of network, this data transfer might be relatively expensive, and if so, minimizing how much data is transferred is a consideration in how the query is executed. Also, the initial design of the database should consider the necessity of performing joins on tables which are not located at the same site. If two tables will often need to be joined, the design should strongly consider storing a copy of each table at the same site. When joins do need to be performed across the network, an operation called a semijoin is often used. With a semijoin, only the joining column of one of the tables is sent across the network to where the second table is located. The column is then joined with the second table. Only the resultant tuples are then returned to the site of the first table, where they are joined with the first table.
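A minimal sketch of the semijoin idea, simulating the two sites with in-memory lists; the relation and attribute names are hypothetical.

# Site A holds ORDERS, site B holds CUSTOMERS; the joining column is cust_id.
orders_site_a = [{"order_id": 10, "cust_id": 1}, {"order_id": 11, "cust_id": 3}]
customers_site_b = [{"cust_id": 1, "name": "Nittany"},
                    {"cust_id": 2, "name": "Lion"}]

# Step 1: send only the joining column of ORDERS across the network to site B.
join_keys = {row["cust_id"] for row in orders_site_a}

# Step 2: at site B, keep only the customer tuples that match; send them back.
matching_customers = [c for c in customers_site_b if c["cust_id"] in join_keys]

# Step 3: at site A, complete the join with the (small) set of returned tuples.
by_key = {c["cust_id"]: c for c in matching_customers}
result = [{**o, **by_key[o["cust_id"]]}
          for o in orders_site_a if o["cust_id"] in by_key]
print(result)   # [{'order_id': 10, 'cust_id': 1, 'name': 'Nittany'}]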
So rather than sending an entire table across the network, only one column of the first table is sent over and then only the required subset of tuples from the second table is sent back. Query optimization is a very interesting topic which time limitations did not allow us to cover in this course. If you would like to study this topic, start with Chapters 18 and 19 in the text. This will provide the background to more completely study Section 23.5 in the text and additional material in the literature. M 14.5: Advantages and Disadvantages This sub-module highlights some of the advantages and disadvantages of distributed databases. Centralized Database Advantages include:
o A high level of control over security, concurrency, and backup and recovery since it can be covered at one site.
o No need to deal with the various tables and directories needed to manage distribution.
o No need to worry about distributed joins and related issues.
Disadvantages include:
o All database access from outside the site where the database is located requires WAN communication and related costs.
o The database site can be a bottleneck.
o If the site goes down, there is no database access. This can cause availability issues.
Distributing Tables to Various Sites with No Replication and No Partitioning Advantages include:
o Local autonomy.
o Reduced communications costs when each table is located at the site that uses it the most often.
o Improved availability since some parts of the database are available even if one or more of the sites are down.
Disadvantages include:
o Security, concurrency, and backup and recovery concerns are spread across multiple sites.
o Requires a distributed directory and related global tables as well as the software required to support transparency.
o Requires distributed joins.
Distributed Including Replication Advantages beyond distributing tables include:
o Reduced communications cost for queries since copies of tables can be kept at all sites which heavily use the table.
o Additional improvement to availability since if a site goes down, it is possible that another site may have copies of one or more tables hosted by the site which went down.
Disadvantages beyond distributing tables include:
o Additional concurrency control is required across multiple sites when data in replicated tables is updated.
Distributed Including Partitioning Advantages include:
o Highest level of local autonomy.
o Data at the tuple or attribute level can be stored at the site that most heavily uses it, which further reduces communication costs.
Disadvantages include:
o Retrieving an entire table or a large part of the table might require accessing multiple sites.
M 15.1: Introduction Over time, many organizations started to need to manage large amounts of data which did not fit nicely into the relational database model. Newer applications also required this ability. Examples of such organizations include Google, Amazon, Facebook, and Twitter. The applications include social media, Web links, user profiles, marketing and sales databases, navigation applications, and email. The systems which were created to manage this type of data are generally referred to as NoSQL systems, or Not Only SQL systems. Such systems focus on the storage of semi-structured data, on high performance, on high availability, on data replication, and on scalability.
This contrasts with traditional databases, which focus on immediate data consistency, powerful query languages, and the storage of structured data. Many of the needs of the above organizations and applications did not match well with the features and services provided by traditional databases. Many services provided by the traditional databases were not needed by many of the applications, and the structured data model was too restrictive. This led many companies such as Google, Amazon, and Facebook to develop their own systems to provide for the data storage and retrieval needs of their various applications. Additional NoSQL systems have been developed for research and other uses. M 15.2: Characteristics and Categories of NoSQL Systems Characteristics of NoSQL Systems Although NoSQL systems do differ from each other, there are some general characteristics which can be listed. The characteristics can be divided into two main groups: those related to distributed databases and distributed systems, and those related to data models and query languages. Many of the features required by NoSQL systems relate to the distributed database concepts discussed in the last module. The characteristics of NoSQL systems related to distributed databases and systems include the following. Scalability: Since the volume of data handled by these systems keeps growing, it is important to the organizations that the capacity of the systems can keep pace. This is most often done with horizontal scalability: adding more nodes to provide more storage and processing power. Availability, Replication, and Eventual Consistency: Many of these systems require very high system availability. To accomplish this, data is replicated over two or more nodes. We saw in the last module that replication improves availability and also improves read performance since the request can be handled by any of the replicated nodes. However, update performance suffers since the update must be made at all replicated nodes. The slowness of update performance is primarily due to the distributed commit protocols introduced in the last module. These protocols provide the serializable consistency required by many applications. However, many NoSQL applications do not require serializable consistency, and they implement a less rigorous form of consistency known as eventual consistency. Replication Models: There are two main replication models used by NoSQL systems. The first is called master-slave replication. This designates one copy as the master copy. All write operations are applied to the master copy and then propagated to the slave copies. This model usually uses eventual consistency, meaning that the slave copies will eventually be the same as the master. If reading is done at a slave copy, the read cannot guarantee that the value seen is the same as the current value in the master. The second model is called master-master replication. With this model, both reads and writes are allowed at any copy of the data. Since there may be a concurrent write to two different copies, the data item value may be inconsistent. This model requires a scheme to reconcile the inconsistency. Sharding of Files: Since many of the applications using NoSQL systems have millions of records, it is often not practical to store the entire file in one node. In these systems, sharding (horizontal partitioning) is often used along with replication to balance the load on the system.
High-Performance Data Access: To achieve quick response time when millions of records are involved, records are usually stored based on key values using either hashing or range partitioning. The characteristics related to data models and query languages include the following. Not Requiring a Schema: This provides flexibility to NoSQL systems. A partial schema may be used to improve storage efficiency, but it is not required. Any constraints on the data would be provided by application programs. Less Powerful Query Languages: Many applications do not require a query language nearly as powerful as SQL. A basic API is provided to application programmers to read and write data objects. Versioning: Some systems provide for storage of multiple versions of a data item. Each is stored with a timestamp as to when the data item was created. Categories of NoSQL Systems NoSQL systems have generally been grouped into four major categories. Some systems fit into two or more of the categories. Document-based NoSQL Systems: These systems store data in the form of documents. The documents can be retrieved by document id, but can also be searched through other indexes. An example of this type of system is MongoDB, which was developed by a New York-based company. An open source version of this database is now available. NoSQL Key-value Stores: These systems provide fast access through the key to retrieve the associated value. The value can be a simple record or object or can be a complex data item. An example of this type of system is DynamoDB, developed by Amazon for its cloud-based services. Oracle, best known for its RDBMS, also offers this type of NoSQL system, which it calls Oracle NoSQL Database. Column-based or Wide Column NoSQL Systems: These systems partition a table by column into column families. Each column family is stored in its own files. Note that this is a type of vertical partitioning. An example of this type of system is BigTable, which was developed by Google for several of its applications including Gmail and Google Maps. Graph-based NoSQL Systems: In these systems, data is represented as graphs, and related nodes are found by traversing edges of the graph. An example of this type of system is Neo4J, developed by a company based in both San Francisco and Sweden. An open source version is available. M 15.3: The CAP Theorem The CAP theorem is used to discuss some of the competing requirements in a distributed system which uses replication. In the last module we introduced concurrency control techniques which can be used with distributed databases. The discussion assumed that a key aspect of the database is consistency. The database should not allow two different copies of the same data item to contain different values. We saw that the techniques used to ensure consistency often came with the price of slower update performance. This is at odds with the desire of NoSQL systems to create multiple copies to improve the performance and availability of the database. When discussing distributed databases, there is a range of consistency which can be applied to replicated items. The levels of consistency can range from weak to strong. Applying the serializability constraint is considered the strongest form of consistency. Applying this constraint reduces the performance of update operations. CAP represents three desirable properties of distributed systems: consistency, availability, and partition tolerance.
Consistency means that all nodes will have the same copies of replicated data visible to transactions. Availability means that each read or write request will either be processed successfully or will receive a message that the operation cannot be completed. Partition tolerance means that the system can continue to operate if the underlying network has a fault that requires that the network be split into two or more partitions that can communicate only within the partition. The CAP theorem states that it is not possible to guarantee all three of the desirable properties at the same time in a distributed system with replicated data. The system designer will need to choose which two will be guaranteed. In traditional systems, consistency is considered of prime importance and must be guaranteed. In NoSQL systems a weaker consistency level is often acceptable and guaranteeing the other two properties is important. NoSQL systems often adopt the weaker form of consistency known as eventual consistency. The specific implementations of eventual consistency can vary from system to system, but a simplified description is that if a data item is not changed for a period of time, all copies will eventually contain the same value, so consistency is obtained "eventually." Eventual consistency allows the possibility that different read requests will see different values of the same data item when each is reading from a different copy of the data at the same time. This is considered an acceptable tradeoff by many NoSQL applications in order to provide higher availability and faster response times.
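To close, here is a small sketch of eventual consistency under master-slave replication. It simulates asynchronous propagation with a timer inside a single process, so the node names, values, and delay are purely illustrative and not tied to any real NoSQL system.

import threading
import time

# Hypothetical master copy and one replica. Writes go to the master and are
# propagated to the replica only after a delay, so reads at the replica can
# temporarily return a stale value (eventual consistency).
master = {"x": 1}
replica = {"x": 1}

def write(key, value):
    master[key] = value
    # Propagate asynchronously; until the timer fires, the replica is stale.
    threading.Timer(0.5, lambda: replica.update({key: value})).start()

write("x", 2)
print(master["x"], replica["x"])   # likely prints "2 1": a stale read at the replica
time.sleep(1)                      # wait for propagation to complete
print(master["x"], replica["x"])   # now prints "2 2": the copies have converged

If the data item is not changed again, all copies eventually hold the same value, which is the trade-off many NoSQL applications accept in exchange for higher availability and faster response times.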