CHAPTER 4 DATABASES AND DATA WAREHOUSES A Gold Mine of Information 4-2 Introduction Today, Organizations Need... Information to compete effectively Information just to stay alive in the information age Information organized in such a way that you can easily and quickly get to it Information-processing tools that help you work with information 4-3 Introduction YOUR FOCUS IN THIS CHAPTER The Difference Between Logical and Physical Views of Information Databases and Database Management Systems How You Can Develop Database Applications Data Warehouses and Data Mining Tools 4-4 Information Revisited THREE THINGS ORGANIZATIONS DO WITH INFORMATION 1.Process information in the form of transactions 2.Use information to make a decision 3.Manage information while it’s used 4-5 Information Revisited PROCESSING INFORMATION IN THE FORM OF TRANSACTIONS Such as payroll processing, order processing, and handling your registration requests for classes. This is called ONLINE TRANSACTION PROCESSING (OLTP) - the gathering of input information, processing that information, and updating existing information to reflect the gathered and processed information. Operational databases support OLTP. 4-6 Information Revisited USING INFORMATION TO MAKE A DECISION For answering such questions as, “How many senior-level marketing majors have not taken statistics?” This is called ONLINE ANALYTICAL PROCESSING (OLAP) - the manipulation of information to support decision making. Data warehouses support OLAP. 4-7 Information Revisited MANAGING INFORMATION WHILE IT’S USED Determining who can view or use information Specifying how to back up information Identifying what storage technologies to use Most importantly, managing information includes organizing it so that people can logically use it without having to know anything about its physical structure. The difference between logical and physical is key. 4-8 Information Revisited physical deals with In managing information, the structure of information as it resides on various storage media. Logical deals with how knowledge workers view their information needs, and includes such terms as: – – – – – – CHARACTER - our smallest unit of information. FIELD - group of related characters. RECORD - group of related fields. FILE - group of related records. DATABASE - group of logically associated files. DATA WAREHOUSE - information from many databases. 4-9 Databases DATABASE a collection of information that you organize and access according to the logical structure of that information. A database is actually composed of two parts: 1. the information itself – the files that are logically associated 2. the logical structure of the information – called the data dictionary 4-10 Databases A Database Is a Collection of Information Most databases contain two or more files with related information. The Inventory database (Figure 4.4, page 125) contains two files - Part and Facility. These two files are logically related because parts are stored in facilities and because you would use both of these files to manage your inventory. 4-11 Databases A Database Contains a Logical Structure You organize and access a database by its logical structure, not its physical position. DATA DICTIONARY - contains the logical structure of information in a database. The data dictionary contains the logical properties that describe information in a database. See Figure 4.5 (page 126) for the data dictionary of the Percentage Markup field in the Inventory database. 4-12 Databases A Database Has Logical Ties Among the Information A PRIMARY KEY is a field in a database file that uniquely describes each record. A FOREIGN KEY is a primary key of one file that also appears in another file. So, foreign keys specify how files are logically related. For example, the Part and Facility files are logically related. So, in Figure 4.4 you can see that Facility Number (the primary key for the Facility file) exists in the Part file (where it’s a foreign key). 4-13 Databases A Database Contains Built-in Integrity Constraints An INTEGRITY CONSTRAINT is a rule that helps assure the quality of the information in a database. A registration database at your school includes integrity constraints concerning prerequisites for certain classes. Our Inventory database includes an integrity constraint that says a part in the Part file cannot be assigned to a facility that does not exist in the Facility file. 4-14 Database Management Systems DATABASE MANAGEMENT SYSTEM (DBMS) the software you use to specify the logical organization for a database and access it. A DBMS contains 5 software components: 1. DBMS engine 2. Data definition subsystem 3. Data manipulation subsystem 4. Application generation subsystem 5. Data administration subsystem 4-15 DBMSs DBMS ENGINE accepts logical requests from the various other DBMS subsystems, converts them to their physical equivalent, and actually accesses the database and data dictionary as they exist on a storage device. Recall that: PHYSICAL VIEW deals with how information is physically arranged, stored, and accessed on some type of secondary storage device. LOGICAL VIEW focuses on how you need to arrange and access information to meet your particular business needs. 4-16 DBMSs DATA DEFINITION SUBSYSTEM helps you create and maintain the data dictionary and define the structure of the files in a database. You use this subsystem to define the information logical structure when you first create a database. Once you’ve created a database, you use this subsystem to define new fields, delete fields, or change field properties. Figure 4.5 (page 126) contains this subsystem screen for the Part file. 4-17 DBMSs DATA MANIPULATION SUBSYSTEM helps you add, change, and delete information in a database and mine it for valuable information. This subsystem is most often the primary interface between you as a user and the information contained in a database. Tools in this subsystem include views, report generators, query-by-example tools, and structured query language. 4-18 DBMSs DATA MANIPULATION TOOLS VIEW - allows you to see the content of a database file, make whatever changes you want, perform simple sorting, and query to find the location of specific information. See Figure 4.7 page 129. REPORT GENERATOR - helps you quickly define formats of reports and what information you want to see in a report. See Figures 4.8 and 4.9 page 130. 4-19 DBMSs DATA MANIPULATION TOOLS QUERY-BY-EXAMPLE (QBE) TOOL - helps you graphically design the answer to a question. Figure 4.10 (page 130) shows the QBE for displaying the names and phone numbers of facility managers in charge of parts that cost more than $10. STRUCTURED QUERY LANGUAGE (SQL) a standardized fourth-generation language found in most database environments. SQL is the same as QBE, except that you perform a query by creating a statement instead of pointing, clicking, dragging. 4-20 DBMSs APPLICATION GENERATION SUBSYSTEM contains facilities to help you develop transaction-intensive applications. This subsystem includes: Tools for creating data entry screens (See Figure 4.12 page 131 for an example) Programming languages specific to a particular DBMS Interfaces to commonly used programming languages that are independent of any DBMS. 4-21 DBMSs DATA ADMINISTRATION SUBSYSTEM helps you manage the overall database environment by providing facilities for: Backup and recovery Security management Query optimization Reorganization Concurrency control Change management 4-22 Database Models THE RELATIONAL DATABASE MODEL a database model that uses a series of twodimensional tables or files to store information. This is the most popular model. Each table is called a RELATION. A relation contains information about a particular ENTITY CLASS (a concept - people, places, or things - about which you wish to store information and that you can identify with a unique key). 4-23 Database Models Figure 4.14 (page 136) shows a relational database for a video rental store. The entity classes are Customer, Video, Video Rental, and Distributor. Notice how these tables are related to each other through the use of foreign keys. In the Video Rental relation, you’ll find a primary key that uses more than one one field to create a unique description. This is called a COMPOSITE PRIMARY KEY. A primary key that uses only one field is called an ATOMIC PRIMARY KEY. 4-24 Database Models THE OBJECT-ORIENTED (O-O) DATABASE MODEL a database model that brings together, stores, and allows you to work with both information and procedures that act on the information. An OBJECT-ORIENTED DATABASE MANAGEMENT SYSTEM (O-O DBMS) is the DBMS software that allows you to develop and work with an O-O database. 4-25 Database Models This model takes advantage of the concept of an OBJECT - a software module containing information that describes an entity class along with a list of procedures that can act on the information describing the entity class. Figure 4.15 (page 138) shows the same video rental store using the O-O database model. Notice that the objects (entity classes) - which include Customer, Video Rental, Video, and Distributor - contain both information and procedures for working with that information. See Appendix C for more on objects. 4-26 Developing Databases DEVELOPING YOUR OWN DATABASE Being able to develop your own database is a part of knowledge worker computing. Building a database for your personal needs includes the following 4 steps: 1. Defining entity classes and primary keys 2. Defining relationships among entity classes 3. Defining information (fields) for each relation 4. Using a data definition language to create the database Follow along as we build the database to support the report in Figure 4.16 on page 140. 4-27 Developing Databases #1 - DEFINING ENTITY CLASSES AND PRIMARY KEYS From the report in Figure 4.16, you can identify the entity classes as Employee, Department, and Job. Now, for each entity class, you must define a primary key that provides a unique description. These include: • Employee entity class - Emp ID (e.g., 2345 for Smith) • Department entity class - Dept (e.g., 15) • Job entity class - Job (e.g., 14 for Acct) 4-28 Developing Databases #2 - DEFINING RELATIONSHIPS AMONG ENTITY CLASSES For this step, use an ENTITY-RELATIONSHIP (E-R) DIAGRAM, a graphical method of representing entity classes and their relationships. See Figure 4.17 (page 140) for the initial E-R diagram of our database and a listing of E-R diagram symbols. 4-29 EMPLOYEE Developing Databases M:1 DEPARTMENT An Employee must be assigned to a Department. An Employee cannot be assigned to more than one Department. A Department may have many Employees assigned to it. A Department is not required to have any Employees assigned to it. 4-30 Developing Databases After building the initial E-R diagram, you must follow the process of normalization. NORMALIZATION is a process of assuring that a relational database structure can be implemented as a series of two-dimensional tables. Normalization includes the following 3 steps: 1.Eliminate repeating groups or M:M relationships 2.Assure that each field in a relation depends only on the primary key of that relation 3.Remove all derived fields from the relations. 4-31 Developing Databases The first rule of normalization states that no M:M relationships can exist. There is an M:M between Employee and Job. You eliminate this by creating an INTERSECTION RELATION - a relation you create to eliminate a repeating group. An intersection relation will have a composite primary key that consists of the primary key fields from the two intersecting relations. In Figure 4.18 (page 142), we created an intersection relation called Employee-Job to eliminate the M:M relationship. 4-32 Developing Databases #3 - DEFINING INFORMATION (FIELDS) FOR EACH RELATION In this step, you follow rules #2 and #3 of normalization. Your goal here is two-fold: 1.Make sure that the information in each relation is indeed in the correct relation 2.Make sure that the information cannot be derived from other information. 4-33 Developing Databases To determine if information is in the correct relation, ask: “Does this piece of information depend only on the primary key for this relation?” If the answer is yes, the information is in the correct relation. In the Employee relation (Figure 4.20 page 144), we currently store Dept Sup. Does Dept Sup depend on Emp ID? The answer is no - Dept Sup depends on Dept, so it should be in the Department relation. 4-34 Developing Databases Derived information - information that can be mathematically determined from other information - should not be stored in your database. For example, # Emp is a field in the Department relation. However, we can simply count the number of occurrences of each Dept in the Employee relation and determine the number of employees. So, we remove # Emp from the database. 4-35 Developing Databases #4 - USING A DATA DEFINITION LANGUAGE TO CREATE THE DATABASE The final step is to actually create the relations you identified in steps 1-3. You do this with a data definition language. This step includes: – Developing a data dictionary – Defining the various relations – Defining primary keys and relationships 4-36 Data Warehouses DATA WAREHOUSE a logical collection of information - gathered from many different operational databases - that supports business analysis activities and decision-making tasks. Data warehouses... are a logical extension of databases support OLAP are among the newest and hottest buzz words and concepts in the IT field. 4-37 Data Warehouses DATA WAREHOUSE FEATURES Data warehouses combine information from different databases – Making them a true repository of all an organization’s information Data warehouses are multi-dimensional – As opposed to 2 dimensions in the relational model – Often called hypercubes (See Figure 4.23 page 148) Data warehouses support decision making – While databases support OLTP, data warehouses support OLAP 4-38 Data Warehouses DATA MINING TOOLS the software tools you use to query information in a data warehouse. QUERY-AND-REPORTING TOOLS - QBE tools, SQL, and report generators. INTELLIGENT AGENTS - various artificial intelligence tools that form the basis for “information discovery” in OLAP. MULTIDIMENSIONAL ANALYSIS (MDA) TOOLS - slice-and-dice techniques that allow you to view multidimensional information from different perspectives. 4-39 Data Warehouses IMPORTANT CONSIDERATIONS IN USING A DATA WAREHOUSE Do you need a data warehouse? Do you already have a data warehouse? Who will the users be? How up-to-date must the information be? What data mining tools do you need? 4-40 Managing Information MANAGING THE INFORMATION RESOURCE How will changes in technology affect organizing and managing information? What types of database models and databases are most appropriate? Who should oversee the organization’s information? 4-41 Managing Information OVERSEEING YOUR ORGANIZATION’S INFORMATION CHIEF INFORMATION OFFICER (CIO) is the IT manager who directs all IT systems and personnel while communicating directly with the highest levels of the organization. DATA ADMINISTRATION plans for, oversees the development of, and monitors the information resource. DATABASE ADMINISTRATION is responsible for the more technical and operational aspects of managing information in databases. 4-42 Managing Information MANAGING THE INFORMATION RESOURCE Is information ownership a consideration? What are the ethics involved in organizing and managing information? How should databases and database applications be developed and maintained? 4-43 TO SUMMARIZE How we view information: – The physical view of information deals with how information is physically arranged, stored, and accessed on some type of secondary storage device. – The logical view of information focuses on how you need to arrange and access information to meet your particular business needs. A database is a collection of information that you organize and access according to the logical structure of that information. The data dictionary contains the logical structure of information in a database. 4-44 TO SUMMARIZE A database management system is the software you use to specify the logical organization for a database and access it. Popular database models include the relational model and the object-oriented model. The four steps of developing a personal database application include: 1. Define entity classes and primary keys 2. Define relationships among entity classes 3. Define information (fields) for each relation 4. Use a data definition language to create the database 4-45 TO SUMMARIZE Data warehouses are a logical collection of information - gathered from many different operational databases - that supports business analysis activities and decision-making tasks. Data mining tools - the software tools you use to query information in a data warehouse include query-and-reporting tools, intelligent agents, and multidimensional analysis (MDA) tools.