BACS485 (485Data.Doc)
DATA CHARACTERISTICS AND MODELS

Copyrighted to Jay Lightfoot, Ph.D.

I. Introduction

The purpose of this chapter is to provide the background needed to understand the rest of the course. It emphasizes data in an organizational context. There are a lot of terms and a lot of disjoint abstract concepts, but they are necessary for all that we do later.

Lecture Objectives:
- To distinguish between data and information
- To describe the 3 levels of data abstraction: reality, metadata, and data
- To define the various associations between data entities: 1:1, 1:M, M:M, and conditional
- To introduce the use of graphical notations (bubble charts, E/R) to model data and associations
- To understand the basics of the ANSI/SPARC 3-level model
- To define and illustrate data independence
- To introduce the semantics of data models
- To introduce the relational, semantic, and object-data models

A main point is that if you cannot represent the data unambiguously in logical terms, then you cannot implement a database that serves the needs of the organization.

Another key point is to realize that there is no one best model for all situations. You have to match the model to the task (and the level of data abstraction needed).

Let's begin by talking about the "nature of data"...

II. Nature of Data

When you talk about data it is important to distinguish between objects in the real world, the structure of the database, and the data stored in the database.

There are actually 3 levels of abstraction to be considered when you talk about data:
- Reality
- Metadata
- Data itself

Note, the term abstraction will be used a lot. It means to present only the essential factors, or to pull back from and ignore detail. You use abstractions all the time (but don't realize it). For example, the table of contents is an abstraction of a book, a model airplane is an abstraction of the real thing, and your resume is an abstraction of your job history.

1st level of data is reality...

A. Reality

This is the level of the real world.
At this level you often talk about the enterprise and the organization for which the database is designed. For example, a bank, a school, a government branch...

The most basic building blocks at this level are entities.

1. Entity

Entities represent a "thing" of interest in the real world. An entity may be an object with a physical existence (e.g., a particular person, car, house, employee...) or it may be an object with a conceptual existence (e.g., a company, a job, a university). Sort of like a noun (person, place, or thing).

You would not want to create entities about everything, only those things of interest to the organization.

2. Attribute

Each entity has particular properties called attributes that describe it. For example, an employee entity may be described by the employee's name, age, address, salary, ...

Again, you only record the attributes about an entity that the organization needs to know. For example, employee eye color is seldom needed by an organization, so it is not collected.

An attribute that is composed of several more basic attributes is called composite, while attributes that are not divisible are called simple or atomic attributes. Composite attributes can form a hierarchy. For example, Address can be further subdivided into 3 simple attributes--street, city, zip. If you always refer to a composite attribute as a whole then there is no need to subdivide it.

a) Value

A particular entity will have a value for each of its attributes. Note that I said a particular entity. You don't have a single value for a group of entities. For example, each individual employee has a distinct value for social security number and a value for job title. Some of the job titles may be duplicated among different employees, but each employee still has their own.

Most attributes have a single value for a particular entity and are called single-valued.
For example, a specific person entity has one value for Age, so Age is a single-valued attribute of person. In other cases, an attribute can have a set of values and is called multi-valued. For example, a Major attribute for a student could have two values if dual majors are allowed.

b) Identifier

Each specific entity must have at least one attribute (or several in combination) that uniquely distinguishes it from all other similar entities. This is called the identifier. For example, the social security number attribute uniquely identifies each employee entity.

An entity identifier is said to functionally determine the other attributes in the entity. (This is not a mathematical function.) This is an important point for later.

3. Entity Class

You usually have groups of similar entities in a company. For example, there may be thousands of employee entities in an organization. These employee entities share the same attributes, but each has its own value(s) for each attribute. All of these entities have identical structure, so you can group them into what is called an entity class or entity type.

This is a concept similar to a data file made up of records (except at a different level of abstraction... remember we are talking about reality). The records all have identical structure but distinct values.

Again, every entity class must have an attribute (or combination of attributes) that uniquely identifies each distinct entity.

The set of individual entity instances at a particular moment in time is called an extension of the entity class. The entity class does not change often, but the extension normally does.

4. Associations

Attributes are properties of individual entities. Another essential type of property is the association. This is the relationship between 2 or more entities.
The associations can exist between entities in the same entity class and between entities in different entity classes.

Associations are one of the keys to the power of the database approach over the traditional file-based approach. In the traditional approach, these associations were hard-coded into the programs, making them difficult to maintain. The database approach captures this aspect of reality directly.

Note that there are association types and association instances. For example, an association type says that Departments have multiple Employees. An association instance says that the Accounting entity has 12 Employee instances.

Remember, however, we are still talking about reality. We aren't up to anything concerning the computer yet.

Associations can be within and between entity classes...

a) Between Entities (within entity class)

Associations can be between entities within an entity class. For example, in the Employee entity class, some employees are managers while the rest are workers. An association called MANAGES lets you determine which employees work for a given manager (who is also an employee). This kind of association is called recursive.

b) Between Entity Classes (between entity classes)

Associations can be between entities in different entity classes. This is the equivalent of associations between files (on another level of data abstraction). For example, the PRODUCT entity is associated with the CUSTOMER entity by the association ORDERED-BY. Or, the STUDENT entity is associated with the CLASSES entity by the association ENROLLED-IN.

This (again) is where database is different from traditional file processing. The database approach can capture this information in the data itself, while the file-based approach cannot.

The 2nd level of data...

B. Metadata (Structure)

You can't really directly do much with reality on a computer system. You are normally limited to capturing the essence of reality. This is the metadata level.
This information is normally stored in the Data Dictionary or the Repository.

You will notice some parallels between metadata and reality. This is intentional. The metadata is supposed to capture the essential elements of reality (and thus form an abstraction or model of reality that can be coded into the database). More specifically:

    reality                metadata
    entity class   ----->  record type
    attribute      ----->  data item type
    associations   ----->  relationships

Be very careful to distinguish metadata from data. This is analogous to the distinction between an entity class and an entity instance.

1st element of metadata is the data item...

1. Data Item

A data item is the smallest named unit of stored data. It is also known as a data element, field, or attribute. Field usually has a physical connotation (an area on a disk drive). Attribute is usually associated with the reality level.

A data item is indivisible by definition. In other words, the organization has no need to view its component parts individually, if they exist. For example, employee social security number, salary, and department number are all data items at the metadata level.

Data items have certain characteristics normally used to describe their structure to the data dictionary (or repository).

a) Name

Data items must have names (so you can refer to them). Because the names can be "qualified" with the name of their record type, they do not need to be unique in the entire database. They do need to be unique within a single record type, however.

For example, you can have a data item named ADDRESS in both the EMPLOYEE record type and the CUSTOMER record type without problems. This would be notated EMPLOYEE.ADDRESS and CUSTOMER.ADDRESS.

Some implementation models put restrictions on data item names, but conceptually there are none.

b) Type

The "type" of a data item determines what kind of data can be stored in it and what operations can be performed on it.
For example, you have:
- NUMERIC -- numbers
- CHARACTER -- text
- DATE -- dates

You can add and subtract NUMERIC and DATE data, and you can also multiply and divide NUMERIC data, but the only "addition" you can perform on CHARACTER data is string concatenation.

There are lots of data types (different for each DBMS) and you can sometimes define your own (called "abstract data types"). For example, a data type for suits in playing cards would allow HEART, CLUB, SPADE, and DIAMOND.

c) Length

Determines the number of characters allowed in the data item. This is the maximum number allowed (less is usually OK).

It should not have anything to do with the way the data item is actually stored, but it usually does (i.e., 1 char = 1 byte).

4th characteristic of data items...

d) Source (actual/virtual)

The "source" of a data item tells where it comes from. "Actual" data items are actually stored somewhere on disk. "Virtual" data items are not. They are derived when needed.

For example, the EMPLOYEE record type might have both a birthdate data item and an AGE data item. You would not really store the AGE data item because it would keep getting out-of-date. Instead you calculate it when needed using the birthdate and the system date. The user is not aware of this.

This is similar to the difference between "logical" and "physical" data items. Physical data is actually stored in the format presented to the user while logical data is not. A logical view can leave out data items or rearrange them.

e) Domain

The "domain" of a data item is the set of allowable values it may take on. For example, the domain for the GPA data item is real numbers between 0.0 and 4.0 inclusive. Thus, 4.3 is not in the domain, so it is illegal.

There are 2 types of domains:
- Implicit domain - determined from the type and range of the data item
- Explicit domain - a list of specific allowable values

Domain is useful for integrity checks and should be built into the metadata of the database.

6th (last) characteristic of data items...

f) Value

The "value" is the specific domain value stored in each data item instance.

If you don't know the value, a DBMS usually uses the special value called NULL to hold the place. This is different from blank or 0 because it means that the value is unknown. This is the source of many problems in advanced database use.

2nd level of metadata...

2. Data Aggregate

Data items are sometimes grouped together to form data aggregates. These are named groups of data items. They are used to connect several related data items together. For example, ADDRESS could be made up of STREET, STATE, and ZIP.

Normally you create data aggregates because you need to be able to see the individual data items in some cases and the group of data items in other cases.

You can build arbitrarily complex hierarchies of data aggregates (i.e., ones made up of other items and aggregates).

You can store the same characteristics for data aggregates that you store for data items (e.g., NAME, LENGTH, TYPE), or you can let system defaults take over for some of the characteristics (not all).

3rd level of metadata...

3. Record

A record is a named collection of data items and/or data aggregates. Usually, all data items of interest about a specific entity class are stored in a record type.

The table below shows how the terms covered so far are related to each other.

    Metadata            Data                      Reality
    RECORD        --->  RECORD OCCURRENCE  --->   ENTITY
    RECORD TYPE   --->  FILE               --->   ENTITY CLASS
    DATA ITEM     --->  FIELD              --->   ATTRIBUTE

For example, for the student entity class you have a STUDENT record type and a data file.

A record contains several data items (one for each attribute of an entity). In the STUDENT record type you could have data items for ID, NAME, ADDRESS, MAJOR, GPA, ...

The same characteristics (e.g., name, length, components...) that are stored in the repository for data items are also stored for record types, with a few additions.
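The data item characteristics described above (name, type, length, source, domain, value) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the DataItem class, the STUDENT layout, and the helper functions are hypothetical names invented for this sketch, not part of any DBMS or data dictionary product. Python's None stands in for NULL.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Optional


@dataclass
class DataItem:
    """Metadata for one data item (hypothetical structure, not a real DBMS API)."""
    name: str
    type_: type
    length: Optional[int] = None                        # maximum characters allowed
    domain: Optional[Callable[[object], bool]] = None   # True for legal values
    derive: Optional[Callable[[dict, date], object]] = None  # set only for virtual items

    @property
    def is_virtual(self) -> bool:
        # A virtual item is never stored; it is derived when needed.
        return self.derive is not None

    def is_valid(self, value) -> bool:
        if value is None:                 # None plays the role of NULL (unknown)
            return True
        if not isinstance(value, self.type_):
            return False
        if self.length is not None and len(str(value)) > self.length:
            return False
        return self.domain is None or self.domain(value)


def age_on(birthdate: date, today: date) -> int:
    """Whole years between birthdate and today."""
    before_birthday = (today.month, today.day) < (birthdate.month, birthdate.day)
    return today.year - birthdate.year - int(before_birthday)


# A STUDENT record type as a list of data items (illustrative layout only).
STUDENT = [
    DataItem("ID", str, length=9),
    DataItem("NAME", str, length=30),
    DataItem("GPA", float, domain=lambda v: 0.0 <= v <= 4.0),
    DataItem("BIRTHDATE", date),
    DataItem("AGE", int, derive=lambda rec, today: age_on(rec["BIRTHDATE"], today)),
]


def read_item(item: DataItem, record: dict, today: date):
    """Return the stored value for an actual item, or compute a virtual one."""
    if item.is_virtual:
        return item.derive(record, today)
    return record.get(item.name)
```

With this sketch, a GPA of 4.3 is rejected by the domain check, and AGE never appears in the stored record occurrence; read_item computes it from BIRTHDATE and the date supplied by the caller.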
1st additional characteristic of records...

a) Keys (Primary, Secondary)

A key is a data item (or several data items put together) used to identify a record or group of records. This is the equivalent concept to the identifier in the reality realm.

For example, the "key" of the EMPLOYEE record type would probably be the social security number because it identifies each specific employee record.

There are several types of keys; we'll look at 2 for now...

(1) Primary Key

A primary key is one or more data items that uniquely identify a specific record. As stated above, the primary key for EMPLOYEE would be social security number. For a PURCHASE-ORDER it could be the PO-NUM printed on the top of each sheet.

There may be several potential primary keys (called candidate keys) in a record. There also are cases where the primary key is made up of several data items (called a composite key or concatenated key).

Every record must have a primary key so you can tell specific records apart, and that key usually cannot be NULL.

(2) Secondary Key

A secondary key is one or more data items that identify several records with the same value for the data item(s). For example, in the EMPLOYEE record the JOB-TITLE data item could be a secondary key because it would allow you to quickly identify all employees with the same job title.

Secondary keys do not uniquely identify record instances (because then you would call it a "candidate key").

2nd additional characteristic of records...

b) Intersection Records

Some record types describe associations between entities instead of describing the entity itself. These are called intersection records.

Intersection records are used when you want to store data about the relationship or when the entities are related in a complex way called many-to-many (more on this later).

4th (last) aspect of metadata...

4. Relationship

Relationships are the metadata level version of associations between entities and entity classes.

NOTE: You can have relationship types and relationship instances. A type connects entity classes while an instance connects specific entity occurrences. The first is at the metadata level while the second is at the data level.

As stated above, you can capture relationships at the metadata level using intersection records. Thus, relationships can have attributes just like entity classes. For example, you have an EMPLOYEE entity class and a PROJECT entity class associated by a WORKS-ON relationship type. You could have a NUMBER-OF-HOURS attribute in the relationship type to denote hours for specific instances of employees working on projects.

Traditional file-based processing is unable to capture this information in the data. Instead, you had to write programs to accomplish the association.

3rd level of data... (reality, metadata, data)

C. Data (occurrences)

The 3rd level of data is concerned with the actual data in the database itself. It consists of data instances or occurrences.

For each entity in the real world there is an occurrence of a corresponding record in the database. For example, for each student in the university there is an occurrence of a student record. So, while there is only one STUDENT record type (which is described in the data dictionary and corresponds to the student entity class), there may be thousands of student record occurrences.

Similarly, there are many instances of each of the data item types that correspond to attributes. So each record instance is made up of a group of data item instances that correspond to attributes in reality.

1st aspect of data level...

1. Record Occurrence

A record occurrence holds data about a specific entity in reality. For example, the university has a specific record occurrence for you on file.

2nd aspect of data level...

2. Field

Each record occurrence is built of fields.
The fields hold values concerning the attributes of the entity. For example, your record occurrence has a field with your address, your GPA, your major...

3. File

A file is a named collection of all occurrences of a given record type. For example, the record occurrences for all students in the university make up the STUDENT file.

A file can be visualized as a 2-dimensional table. This is called a "flat file" and is the highest level of abstraction possible in the traditional file-based approach.

Note that the "flat file" limitation does not allow you to store information about associations.

Highest level of data level...

4. Database

A database is a named collection of interrelated files. Thus it is able to describe both the data occurrences and the associations between them.

III. Associations Between Data Items

Now that you can distinguish between the 3 levels of data, you need some background on associations between data items. This section introduces you to the different types of associations and to ways of graphically representing data items and the associations between them.

A. Types of Associations

There are 4 different types of associations:
- none
- 1:1
- 1:N
- M:N

We will ignore "none" and concentrate on the 3 primary associations between data items.

Remember, the purpose is to capture the essence of reality. You want to model (or abstract) the way things are so you can simulate the organization in the computer.

An association implies that the values for the associated data items are in some way dependent on each other.

1. One-to-One Association

A one-to-one association means that at a particular moment in time each value of one data item 'X' is associated with up to 1 value of data item 'Y'. This is typically written 1:1.

For example, there is a one-to-one association between the data item STUDENT-NUM and STUDENT-NAME. It is also true that a 1:1 association exists for the reverse.
In real life, true 1:1 associations are rare. In our culture, husband-to-wife is a 1:1 association.

One common way to graph this is with the following bubble chart.

    manager-name <---------------> department-name

2. One-to-Many Association

A one-to-many association means that it is possible for each value of data item 'X' to be associated with 0, 1 or more values of data item 'Y'.

For example, one STUDENT-NUM data item can be associated with 1 or more COURSE-NAME data items. However, each COURSE-NAME data item is associated with exactly 1 value for the STUDENT-NUM data item.

    STUDENT-NUM <--------------->> COURSE-NAME

When you reverse a one-to-many association it becomes a many-to-one. (Same, except viewed from the other side.)

The key to this is that the association at the metadata level is between data items, while at the data level it is between specific instances of field values.

These are very common at the reality level. Can you think of examples? Try father-to-child, department-to-employee.

3. Many-to-Many Associations

A many-to-many association means that a value of 'X' is associated with 0, 1 or more values of 'Y'. Likewise, each value of 'Y' is associated with 0, 1 or more values of 'X'.

For example, each EMPLOYEE-NUM can work on 1 or more PROJECT-IDs, and each PROJECT-ID can have 1 or more workers.

    PROJECT-ID <<---------->> EMPLOYEE-NUM

These are also fairly common in the real world. They are difficult conceptually because each individual data item can be associated with many others. Usually an intermediate entity (an intersection record) is created to handle the mapping.

4. Conditional Associations / Existence Dependency

Technically, we have been talking about the cardinality of the association. Cardinality is a restriction on the number of values allowed to participate in the association. Lines between bubbles represent mappings and arrow heads represent the cardinality.
Another aspect of cardinality is the concept of conditional associations. In these, you put a range on the number of values allowed to participate.

For example, a conditional association from SEAT-NUMBER to STUDENT-NUM could indicate that a seat would have 0 or 1 student at any moment in time.

    seat-no <-------------O--> student-num

Conditional associations can also be placed on both sides of the relationship. Also, they can be combined with 1:1, 1:N, and M:N. For example, one TEACHER-NAME could have 0, 1, or many CLASS-NAMEs and one CLASS-NAME could have 0 or 1 TEACHER-NAME (until the schedule is firmed up).

    teacher-name <-O-----------O-->> class-name

The opposite of a conditional association is an existence dependency. This says that one instance of the data value cannot exist without another. The effect is to place a lower bound of 1 on the conditional range of the association.

For example, when the semester gets underway, each CLASS must have at least 1 TEACHER and each TEACHER may teach 0, 1, or many CLASSES.

    teacher-name <--|---------------O-->> class-name

There are other aspects of associations that we will cover later.

B. Graphing Data Associations

There are numerous ways to graph associations between data items. I'll cover the one that we'll use in this course (it is one of the most popular methods in industry).

1. Bubble Chart

We have been using bubble charts for the last few examples. In them, data items are represented by named bubbles, mappings by lines, and cardinality by arrows. We also introduced some conditional notation with small circles and lines.

Another notation used is to underline the name of the bubble that is the identifier (key) of the record type.

Bubble charts are useful for grouping data items into records and for deriving more complex data models.

Note that you can represent record types and record occurrences with bubble charts.
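The cardinalities that bubble-chart arrows encode can also be checked against a set of observed value pairs. The sketch below is a minimal illustration, assuming the pairs (x, y) list which value of data item X is associated with which value of data item Y at one moment in time; the function name and the sample values are hypothetical. Keep in mind that real cardinality is a rule of the organization: data can contradict a claimed 1:1 or 1:N rule, but it cannot prove one.

```python
from collections import defaultdict


def classify_association(pairs):
    """Classify the observed X-to-Y association as 1:1, 1:N, N:1 or M:N.

    pairs is an iterable of (x, y) tuples. Only the pairs present are
    inspected, so the result describes this extension, not the rule itself.
    """
    ys_per_x = defaultdict(set)
    xs_per_y = defaultdict(set)
    for x, y in pairs:
        ys_per_x[x].add(y)
        xs_per_y[y].add(x)

    # Does any X value map to several Y values, and vice versa?
    x_has_many = any(len(ys) > 1 for ys in ys_per_x.values())
    y_has_many = any(len(xs) > 1 for xs in xs_per_y.values())

    if x_has_many and y_has_many:
        return "M:N"
    if x_has_many:
        return "1:N"
    if y_has_many:
        return "N:1"
    return "1:1"


# Hypothetical sample data for the associations discussed in this section.
manager_department = [("Jones", "Accounting"), ("Smith", "Sales")]
student_course = [("S1", "Math"), ("S1", "Art"), ("S2", "Biology")]
project_employee = [("P1", "E1"), ("P1", "E2"), ("P2", "E1")]

print(classify_association(manager_department))
print(classify_association(student_course))
print(classify_association(project_employee))
```

Reversing each pair turns a 1:N result into N:1, which matches the point made earlier that a reversed one-to-many association is a many-to-one.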
We will use bubble charts to represent functional dependencies in the conceptual design lectures later in the semester.

IV. Associations Between Records

When you group data items into record types, you can represent a higher level of data associations. These are associations between records.

Again, remember the difference between record type and record occurrence (different levels of data). Associations between record types are at the metadata level while associations between record occurrences are at the data level.

You can graph these associations in several ways. Initially I will use what are called data structure diagrams. These are blocks connected by lines with "crow's feet". The lines are labeled with the name of the association. Other methods exist and we will cover them later.

    [Data structure diagram: STUDENT (#, ADDRESS, MAJOR, GPA, CLASSIFICATION) connected by TAKES to COURSE (#, ROOM, CREDIT, TIME, INSTRUCTOR)]

A. Types of Associations

As before, there are 4 types of associations. The first is none, and we generally ignore it because we are only interested in situations of interest to the organization.

1. One-to-One

Two record types can be associated by a one-to-one relationship. An association between record types means the same thing as one between data items. That means the relationship goes both ways. The difference, of course, is the level of abstraction of the data (you are working with a bigger chunk with records).

Note that the diagrams make it look like data items are still connected. This is not correct. The whole record is associated with the other, not just the place where the lines are drawn.

You can also have conditional relationships with record types.

2. One-to-Many

A one-to-many relationship says that one occurrence of a record type is related to 0, 1 or more instances of another record type. If you reverse it, it becomes a many-to-one association.

3. Many-to-Many
A many-to-many relationship connects 0, 1 or more instances of one record type to one or more of another. The reverse of a M:N is denoted a N:M.

With data structure diagrams you can represent a M:N relationship directly. This is handy and allows you to capture the true essence of the real organization.

Depending upon what entities you are modeling, a M:N can be represented as two 1:N relationships. For example, the following say the same thing:

    invoice <<--------->> product

OR

    invoice <------->> line_item <<--------> product

4. Loop/Cycle (recursive)

A cycle is a path that begins at an occurrence of a given record type, proceeds through a set of related occurrences of different types, and eventually leads back to the original starting type (though not necessarily the same occurrence).

A loop (also called a recursive relationship) is a 1:N relationship among occurrences of the same type.

5. Required or Optional Existence

Record types can also have conditional associations. This is the situation where you limit the range of the 1:1, 1:N, and M:N.

For example, assume that a company is set up so that each manager has 0 or 1 secretary and each secretary is either unassigned or assigned to 1 manager (no multi-manager secretaries).

    secretary <--O---------------O--> manager

As stated before, 1:1 associations are fairly rare.

An example of a conditional 1:N association is a hospital where each patient can have 0, 1, or many tests, but each test must be associated with exactly 1 patient.

    patient <--|----------------O-->> test

An example of a conditional M:N association is a hospital where each patient can have 1 or many physicians and each physician can have 0, 1, or many patients.

    patient <<--O------------|-->> physician

B. Graphing Record Associations

Graphing the structure of an organization is very important.
The information in verbal form may be correct, but it is not of much use to database designers (because of the ambiguities inherent in language).

Once you know the basics of graphing entity types and relationships, you can build organizational models of arbitrary complexity that are unambiguous and "semantically rich".

1. Data Structure Diagram (Bachman)

You have already been introduced to DSDs. They are also called Bachman charts after Charles Bachman (the man who first used them).

True Bachman charts use single and double arrows without labels. There is no way to show conditional associations. Because of these limitations, DSDs have been modified to the form we learned.

These diagrams have the advantage of being simple to understand and easy to draw; however, they don't give a whole lot of information. Because of this, other diagramming techniques have been developed.

2. Entity-Relationship (E/R)

The Entity-Relationship (E/R) diagram puts more emphasis on the relationship between entities than does the DSD method.

Some textbooks jump right into the complex form of the E/R diagram where you indicate conditional relationships for every line. I prefer to ease you into it, so I won't add the extra cardinality symbols for now.

The box symbol still represents record types and the line indicates mapping. The diamond symbol is new. It stands for the relationship itself and must be named.

For example, for husband---marriage---wife the 1:1 relationship using an E/R diagram would be:

                1                        1
    husband ---------- marriage ------------- wife

Note how there are no arrows or crow's feet, thus it is a 1:1.

If you wanted to include conditional information you would do the following:

    husband --|------------|-- marriage --|--------------|-- wife

A 1:N E/R example would be:

                   1                      N
    department ---------- employs -------------- employees

OR

    department ---------- employs -------------< employees

A M:N E/R example would be:

                 M                     N
    projects ----------- have --------------- tasks

OR

    projects >---------- have ---------------< tasks

Note how the name of the relationship implies the perspective of the relationship. The direction implied is part of the semantics of the organization you are modeling.

                 M                        N
    products ------------ contain ------------ parts

              M                           N
    parts ----------------- make up ------------ products

                M                          N
    teacher --------------- teaches ------------ students

In the E/R model, bubbles represent the attributes attached to entities and relationships. An underlined bubble represents a primary key.

3. Relational Notation (not really a diagramming technique)

This is not really a diagramming technique, but it is often used to denote entities, attributes, and primary keys. It can also be used to imply the relationships (though not directly).

I call this relational notation but different books call it different things. It is really just a DSD without the boxes and the lines.

For example:

    PRODUCT (PRODUCT-NUM, DESCRIPTION, PRICE, QTY-ON-HAND)
    VENDOR (VENDOR-NUM, VENDOR-NAME, VENDOR-CITY)
    SUPPLIES (VENDOR-NUM, PRODUCT-NUM, VENDOR-PRICE)

Note that SUPPLIES has the primary keys of both entities, so you could "look up" the price if you know both VENDOR-NUM and PRODUCT-NUM.

Technically, SUPPLIES is an intersection record that resolves the M:N relationship between VENDOR and PRODUCT, thereby creating two 1:N relationships.

V. ANSI 3-Level Model

Now you are able to represent the semantics (meaning) of a real-world organization via the DSD and E/R diagramming methods. You can generate a model of the organization.

Models are useful because they represent the essential basics of a system without all the clutter of detail. For example, a working model of a house can help you locate the furniture. Likewise, a scale model of a city in an earthquake can help you see if buildings will stand up. Models are useful abstractions of reality.

Models have always been used in the database field.
However, originally designers tried to code the model directly into a big program. TOO COMPLEX! When reality changed, the model was out-of-date and it was too hard to update it. (Exact same problems as the traditional file-based system.)

In the 1970s the ANSI/X3/SPARC committee proposed that a way to avoid these problems was to have 3 levels of model. In this way you could change the low level physical details and the implementation details without having to bother the users.

There are 3 levels to the ANSI/SPARC model.

1. Conceptual

You haven't known it, but I have been diagramming the middle layer of this model in the discussion so far. This layer is known as the conceptual model of the organization.

The E/R diagramming technique is normally used to capture the structure (metadata) of the enterprise (notice how I tied the level of data to the level of the model -- not an accident).

There is 1 conceptual model for an organization. All the semantics of interest to that organization are captured in the model.

At the conceptual level, the DBA is concerned with entities, attributes, and relationships. The conceptual view is totally independent of the hardware used and of the data that specific users want to see.

A. External (view, sub-schema)

The external model (or view, or subschema) is concerned with the way the user views the organization. This is a subset of the conceptual model because users do not need (or want) to see the whole database. Thus, there are many external views (contrast that with the single conceptual model).

Each user is able to define the entities, attributes, and relationships they require, but they cannot touch other areas of the database. Sort of "free security".

These views are also independent of the hardware and the DBMS used.

B. Internal (Implementation)

The internal level (also called the implementation model or the schema) defines the whole database in a technology-dependent style.
In other words, this level is concerned with the specific computer and the physical details of the DBMS. This level is needed because the conceptual level is hardware independent.

There are 3 basic implementation models:
- hierarchical
- network
- relational

1. Physical (not really a model)

Below the internal level you have the physical reality of the computer. This includes the low-level details of access methods, pointers, disk mirroring, etc. The purpose of the ANSI/SPARC 3-level model is that you don't have to worry about this stuff above the internal level.

C. Data Independence

The ANSI 3-level model helps "insulate" the upper levels from low-level detail. In that way you can change the computer without affecting users. Also, you should be able to change the DBMS without affecting the users. This is called data independence. Officially, it can be defined as the capacity to change the model at one level of a database system without having to change the model at the next higher level.

There are 2 types of data independence...

1. Physical Data Independence

When physical independence exists, changes can be made to the physical characteristics of the data (like moving to a new disk or changing an indexed field) without affecting existing programs at the external level.

2. Logical Data Independence

When logical independence is present, fields can be added or deleted (or their names and specs changed) without having to recompile existing programs that do not use those fields. In other words, the logical link between how the data is defined and how it is used is flexible (independent).

Each type of independence is really a mapping between two adjacent models. When the lower model changes, all you have to do is update the mapping. The data dictionary (repository) holds these mappings.

VI. Semantics of Data Models

Now that you are familiar with data modeling and the ANSI 3-level approach, we can go a little deeper into the semantics of data models.
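Before going on, the mapping idea behind data independence (Section V.C) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the sample rows, and the functions fetch and payroll_report are all hypothetical, and a real DBMS keeps such mappings in its data dictionary rather than in a Python dictionary.

```python
# Sketch of data independence. The "user program" asks for fields by their
# conceptual names; a mapping says where each field physically lives.
# All names and sample values here are hypothetical.

def fetch(rows, mapping, fields):
    """Return the requested fields of every stored row as dictionaries."""
    return [{f: row[mapping[f]] for f in fields} for row in rows]

def payroll_report(rows, mapping):
    """A user program: it only knows the names 'name' and 'salary'."""
    return fetch(rows, mapping, ["name", "salary"])

# Original internal layout: (emp_id, name, salary)
old_rows = [("E1", "Ada", 5200), ("E2", "Grace", 6100)]
old_mapping = {"emp_id": 0, "name": 1, "salary": 2}

# Reorganized layout: a field was added and the field order changed.
new_rows = [("Ada", "E1", "Denver", 5200), ("Grace", "E2", "Greeley", 6100)]
new_mapping = {"name": 0, "emp_id": 1, "city": 2, "salary": 3}

before = payroll_report(old_rows, old_mapping)
after = payroll_report(new_rows, new_mapping)

# The report program itself was not touched; only the mapping was updated.
print(before == after)
```

The point of the sketch is that payroll_report never mentions a position or a disk layout. When the lower level changes, the mapping absorbs the change and the program above it keeps working.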
Remember that semantics are the meaning of the data. A good model captures the basic meaning without getting into too much detail. Actually, a good model is a balance between too much and not enough semantic detail. The more detail you capture, the more realistic the model is (but there is a limit beyond which too much information is counter-productive).

What follows are typical data semantics captured by data models.

A. Cardinality (1:1, 1:N, M:N)

We discussed cardinality before when we talked about 1:1, 1:N, and M:N relationships between entity classes. To review, the cardinality describes the nature of the relationship between entities. It can be 1:1, 1:N, or M:N.

B. Degree

The degree of a relationship describes the number of entity classes that participate in the relationship.

1. Unary, Binary, Ternary

Up to now we have seen strictly binary and unary (loop/cycle) relationships, but relationships of higher degree are possible. Ternary relationships happen occasionally.

In general, you cannot break a ternary relationship into 3 independent binary relationships without losing meaning. Instead, you treat the relationship itself as an entity involved in three 1:N relationships.

Higher-degree relationships are possible, but stay away from them if you can because they get too complex.

C. Existence Dependency (Referential Integrity)

Existence dependency is the opposite of conditional association. In other words, it says that one entity cannot exist without the existence of an instance of some other entity. For example, a CUSTOMER order cannot exist without an associated CUSTOMER. Or, a STUDENT-GRADE cannot exist without a related STUDENT.

Existence dependency is also called referential integrity. This is important in database systems because you want to make sure the relationships between entities are "in sync". For example, if you delete a CUSTOMER, you must also delete all ORDERS for that customer or they will be hanging out there forever.

D.
Time

One of the more complex aspects of database theory is time. We won't get into this deeply in this course. In general, commercial DBMSs handle the time semantic poorly. They assume that you are only interested in the current situation, not in any cumulative history.

Some of the difficulty occurs because not all of the database is modified at once. Only one or two data items may change. So you wind up storing things called difference files and having to reconstruct the database for the desired time-frame. The problem is that the difference file chain soon becomes long and bulky (tough to process).

An additional complication is that you can add and remove metadata entities over time. This means that some data items are not valid except during a certain time-frame, so you also have to store time-related metadata.

Then there is the problem with derived data... it goes on and on.

E. Uniqueness

The primary concern with uniqueness is related to primary keys. At least one data item (or combination of data items) must uniquely identify each record in a record type.

Another aspect of uniqueness requires that specific instances of a record be associated with exactly 1 instance of another record. For example, consider the M:N relationship between STUDENTS and GRADE with relationship SECTION. Each STUDENT gets exactly 1 GRADE.

A 3rd type of uniqueness is called exclusivity. Exclusivity is like an either-or filter between entity types. Either one relationship between the entities or another may exist, but not both. For example, the book mentions how 2 unary relationships can exist between EMPLOYEES, the MANAGES and the MARRIED relationships. Either can exist between 2 occurrences, but not both (i.e., you cannot be married to your boss).

F. Class/Subclass (Generalization and Aggregation)

Sometimes an entity is made up of several classes of similar "sub-entities" that occasionally should be handled differently.
For example, the conceptual model of an organization may have an EMPLOYEE entity. This entity could be broken down into HOURLY-EMPLOYEES and SALARY-EMPLOYEES for a specific purpose (e.g., payroll). However, that division would cause problems with the rest of the model (you would need 2 relationships instead of one for everything that connected to the old EMPLOYEE).

To solve this problem you have a class/subclass structure. One entity is the "parent" and the others are the "children". The children can be thought of as specializations of the parent, and the parent can be thought of as a generalization (abstraction) of the children.

In this way you get the best of all worlds. You can have the EMPLOYEE entity and the subclass HOURLY and SALARY entities in the same model.

Usually the models that allow class/subclass relationships also have inheritance. Inheritance allows the children to implicitly keep the structure of the parent without duplicating it on the lower level. Any specific differences between parent and child are stored in the children (for example, HOURLY would have HOURLY-RATE and SALARY would have MONTHLY-WAGE).

In diagramming this, usually an oval with ISA is drawn. It says that the child ISA parent (e.g., HOURLY-EMPLOYEE ISA EMPLOYEE). In the diagram, specialization is down and generalization is up.

Next, we move on to looking at the actual data models. You have already seen DSD and E/R models as conceptual data models. Next, I will present 3 more conceptual models:
- Relational model
- Semantic data model
- Object-Oriented model

But first, a glance at the extended E/R model.

VII. Extended E/R Data Model

The original E/R model was just boxes, diamonds, and arrows. It did not capture the semantics of conditionality, existence dependency, class-subclass, participation... . Since then, it has been extended several times to make it a more powerful tool.
MultiValued Attribute - A double circle represents an attribute that can have more than one value (e.g., STUDENT-MAJOR).

Conditional Association - Small circles show when the cardinality is 0, 1, or more for a specific relationship. Sometimes this is denoted with a dotted line.

Existence Dependency - A short line shows when the cardinality range starts at 1 (i.e., mandatory existence).

Exclusive Relationship - An arc says that either one or the other (but not both) relationship types are used (e.g., a service is performed by a nurse or a doctor, but not both).

Exclusive Occurrence - A convex lens says that a specific occurrence may participate in one or the other relationship, but not both (e.g., MANAGE or MARRIAGE, but not both).

Class/Subclass - Shaded boxes denote generalization (up) and specialization (down). Sometimes an ISA oval is used instead.

We won't use this much in class; however, I do want you to realize that it exists and to be familiar with the symbols given in the book. A few years after you go out and get jobs this probably will be used frequently.

VIII. Relational Model

The relational model was first proposed by Codd in 1970. It is based on a branch of mathematics called set theory (thus the name "relational", from the mathematical notion of a relation defined over sets).

The relational model is a definitional one -- that is, it is intended to define the metadata to the DBMS. For humans to use it, however, it does have a common graphical representation. A relation can be viewed as a 2-dimensional table that is similar to what you normally would call a flat file.

The relational model is currently the most important DBMS model. It is both a high-level conceptual model (sort of) and an internal-level implementation model.
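To make the "relation as a table" idea concrete, the relational notation shown earlier (PRODUCT, VENDOR, SUPPLIES) can be sketched as three tables, with the existence dependency from Section VI.C enforced by the DBMS. This is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in relational DBMS. The column types, the sample rows, and the helper names build_db, vendor_price, and is_rejected are assumptions made for illustration, not part of the course material.

```python
import sqlite3

def build_db():
    """Create the three relations in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when asked to.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE product (
            product_num INTEGER PRIMARY KEY,
            description TEXT,
            price       REAL,
            qty_on_hand INTEGER
        );
        CREATE TABLE vendor (
            vendor_num  INTEGER PRIMARY KEY,
            vendor_name TEXT,
            vendor_city TEXT
        );
        -- Intersection record: its key is the pair of foreign keys.
        CREATE TABLE supplies (
            vendor_num   INTEGER NOT NULL REFERENCES vendor(vendor_num),
            product_num  INTEGER NOT NULL REFERENCES product(product_num),
            vendor_price REAL,
            PRIMARY KEY (vendor_num, product_num)
        );
    """)
    # Hypothetical sample data.
    conn.execute("INSERT INTO product VALUES (100, 'widget', 9.99, 40)")
    conn.execute("INSERT INTO vendor VALUES (10, 'Acme', 'Greeley')")
    conn.execute("INSERT INTO supplies VALUES (10, 100, 4.25)")
    conn.commit()
    return conn

def vendor_price(conn, vendor_num, product_num):
    """Look up a price using both halves of the SUPPLIES key."""
    row = conn.execute(
        "SELECT vendor_price FROM supplies "
        "WHERE vendor_num = ? AND product_num = ?",
        (vendor_num, product_num),
    ).fetchone()
    return row[0] if row else None

def is_rejected(conn, sql):
    """Return True if the DBMS refuses the statement on integrity grounds."""
    try:
        conn.execute(sql)
        rejected = False
    except sqlite3.IntegrityError:
        rejected = True
    conn.rollback()  # this helper only probes, so undo any change
    return rejected

conn = build_db()
print(vendor_price(conn, 10, 100))
# A SUPPLIES row for a vendor that does not exist:
print(is_rejected(conn, "INSERT INTO supplies VALUES (99, 100, 1.00)"))
# Deleting a vendor that still has SUPPLIES rows:
print(is_rejected(conn, "DELETE FROM vendor WHERE vendor_num = 10"))
```

Both probes are refused: the DBMS will not let a SUPPLIES row exist without its VENDOR, and it will not let a VENDOR disappear while SUPPLIES rows still point at it.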
The relational model is popular for several reasons:
- it is fairly intuitive, so non-technical users can understand it
- it is designed around sets and set operations, so you can design technically "good" structures more easily (you have the tools built in)
- it maps easily from the E/R model to the implementation version of a relational DBMS
- it has been implemented on PC platforms for a reasonable price, and is thus widely available
- it has a mathematical basis, so you can design one on paper to work before starting to code
- it is flexible (allows ad hoc queries via SQL and QBE) and it can be efficient on large batch jobs

Relations have the following properties:
- each column contains values about the same attribute and each cell must have a single value
- each column has a distinct name and the order of the columns is immaterial
- each row is distinct (no duplicates)
- the order of rows is immaterial

A. Table

A table is the physical representation of a relation. Tables hold information about entity classes for the database. A relation is represented as a 2-dimensional table in which the rows correspond to individual entity occurrences (records) and the columns correspond to attributes (fields).

B. Tuple

Each row of a table corresponds to an individual record instance. These rows are called tuples.

C. Attributes

Each column of the table contains values of a single attribute. It is like a field on the data level. For example, the STUDENT-ID column of the table contains the STUDENT-ID values of all tuples.

D. Degree

The degree of the relation is a count of the number of attributes. This has nothing to do with the semantic term degree, which means the number of entities participating in a relationship.

E. Cardinality

The cardinality of a relation is a count of the number of tuples. Again, this has no connection to the semantic term cardinality (1:1, 1:N, M:N).

F. Domain

A domain is the set of possible values for an attribute.
For example, the domain of GPAs is the real numbers between 0 and 4 inclusive. The domain for CITY names is alphabetic strings (restricted to names of valid cities).

G. Key (primary, foreign, cross-reference)

Each relation must have at least one attribute (or combination of attributes) that uniquely distinguishes each tuple from the others. This attribute is called the primary key. If it is a combination of attributes then it is called a concatenated key.

If more than one attribute (or combination) can do this, then each one is called a candidate key; the candidates not chosen as the primary key are called alternate keys.

If an attribute in one relation is the primary key of another relation, it is called a cross-reference key or a foreign key. These are how the relational model connects distinct relations.

IX. Semantic Data Model

The Semantic Data Model (SDM) is not really intended for diagrams (although it does have some nice graphical methods). It is a way to define the semantics of an organization (a model).

The SDM is intended to capture all the meaning of an organization's data and to describe it to the DBMS as a set of metadata structures and integrity constraints -- pretty ambitious. It contains all the features described for the extended E/R data model and more. (Extended E/R got its ideas from SDM.)

The SDM is not a single standard model. There are many active research projects currently going on. It is all very interesting and very new. It is most significant because it has generated interest in methods to store the meaning of data and because it donated some key concepts to the object-oriented data model.

X. Object-Oriented Data Model

The object-oriented data model combines procedures (called methods) and data into an inseparable package called an object. Essentially, an object knows how to process its own data. In that way you simply send it a message saying "print yourself" and the internal procedure does what is necessary.

The structure of objects is defined so that you have a network of subclasses and superclasses.
Subclass objects inherit attributes and methods from their superclass objects.

An object can be composed of any kind of data; for example, text, procedures, pictures, other objects... . A major goal of the OODM is to model the organization's behavior, not just its structure.

The object-oriented model was developed for dynamic situations where the structure changes often and the data required to satisfy a request would normally be widely scattered in traditional DBMS metadata. It also suits systems where there are few object instances relative to the number of object classes (different from business processing). For example, CAD/CAM and electronic circuit design/simulation.

A. Object

In an OODB, each object represents a physical entity, a concept, an idea, an event, or some aspect of interest to the database application. Each object instance is a self-contained mixture of data and procedure.

B. Object Class

The structure of the OODB is captured in the network of object classes. This is similar to the schema in more traditional DBMSs.

Subclasses can inherit the methods and attribute values of their superclasses.

C. Object Instance

Equivalent to a record instance. A specific instance of an object class.

D. Message

Objects communicate and perform all operations via messages. A message consists of an object (or several objects) followed by a method to be applied to these objects. An alternate approach is to send the name of the message and any needed attribute values to the object instance.

E. Method

A method is similar to a procedure. Object methods are stored in the object class hierarchy and are available to all object instances. If several methods have the same name but perform different actions, it is called operator overloading or polymorphism.

F. Actor (Demon)

Objects that wait for messages before they do something are called passive objects.
Ones that perform operations based on other stimuli are called active objects or demons. Active objects are started without specific messages and can be used to perform background work and "watch dog" type functions.
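The object-oriented terms above (class, subclass, inheritance, method, message, polymorphism) can be tied together with a small Python sketch. It reuses the EMPLOYEE / HOURLY / SALARY example from the class/subclass discussion. Everything specific here is an assumption made for illustration: the class names, the attribute names, the sample figures, and the send helper, which only imitates message passing by looking up a method by name. Python classes are not an OODB, so this shows only the modeling ideas, not storage.

```python
class Employee:
    """Superclass (generalization): structure shared by all employees."""

    def __init__(self, name):
        self.name = name

    def describe(self):
        # Inherited by every subclass; relies on the subclass's own pay rule.
        return f"{self.name}: {self.monthly_pay():.2f} per month"

    def monthly_pay(self):
        raise NotImplementedError("each subclass defines its own pay rule")


class HourlyEmployee(Employee):
    """Subclass (specialization): adds HOURLY-RATE and hours worked."""

    def __init__(self, name, hourly_rate, hours):
        super().__init__(name)
        self.hourly_rate = hourly_rate
        self.hours = hours

    def monthly_pay(self):
        return self.hourly_rate * self.hours


class SalaryEmployee(Employee):
    """Subclass (specialization): adds MONTHLY-WAGE."""

    def __init__(self, name, monthly_wage):
        super().__init__(name)
        self.monthly_wage = monthly_wage

    def monthly_pay(self):
        return self.monthly_wage


def send(obj, message, *args):
    """Deliver a message: find the method with that name and run it."""
    return getattr(obj, message)(*args)


# Hypothetical object instances.
staff = [HourlyEmployee("Ada", 20.0, 160), SalaryEmployee("Grace", 4000.0)]

for person in staff:
    # Same message, different behavior per class (polymorphism).
    print(send(person, "describe"))
```

Note that describe is written once in the superclass and inherited, while monthly_pay has the same name in both subclasses but does different work in each. The sender of the message does not need to know which kind of employee it is talking to.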