Database Theory and Terminology, Part 1

Theory and Practice
• We will begin by looking at the theory and the terminology of table design.
• After that, we will focus on the practical side, using Access to:
  – Create a new database
  – Design tables
  – Create relationships
• That’s what assignment 1 will be about.

Modern Databases
• Your reading assignment for this week is the first two chapters of a book called “Databases Demystified.”
• This book contains a general definition of a database: a database is a collection of interrelated data items that are managed as a single unit.
• This definition is deliberately vague, allowing it to cover most of the different types of databases that have been used over the past five decades or so.
• For a relational database, the definition can be more focused, at least on the logical level:
• Database: A collection of tables, the relationships between them, and auxiliary items such as views and stored procedures.

DBMS
• A database is managed, strangely enough, by something called a “database management system” (DBMS).
• Popular DBMSs include Oracle, MySQL, DB2, and Microsoft’s SQL Server (for large-scale databases) and Access (for smaller databases).
• Large-scale DBMSs like Oracle and SQL Server typically run on specialized computers called servers, which provide data for many computers (clients) over networks. They typically store their data in many files, frequently spread across many hard drives, and even many different servers.

Access
• While Access can be used on networks, it is more of a “personal” DBMS, running on the user’s computer instead of a separate server. It stores everything in a single file (*.mdb for Access 2003 and earlier; *.accdb for Access 2007).
• The single-file feature of Access is why we will use it in this class. It makes it easy for me to share entire databases with you, and for you to turn in databases for assignments and projects.
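The logical definition given earlier (tables, the relationships between them, and auxiliary items such as views) can be sketched in a few lines. This is only an illustration, not the course's method: it uses Python's built-in sqlite3 module as a stand-in for Access, since SQLite also keeps a whole database in a single file. The Orders table, the CustomerOrders view, and the sample data are hypothetical.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained; passing a file name
# instead would store the whole database in one file on disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
    -- Two tables (hypothetical names and columns).
    CREATE TABLE Customers (
        CustomerID INTEGER PRIMARY KEY,
        LastName   TEXT NOT NULL,
        FirstName  TEXT NOT NULL
    );
    CREATE TABLE Orders (
        OrderID    INTEGER PRIMARY KEY,
        -- The relationship: every order points at one customer.
        CustomerID INTEGER NOT NULL REFERENCES Customers(CustomerID),
        OrderDate  TEXT NOT NULL
    );
    -- An auxiliary item: a view that joins the two tables for display.
    CREATE VIEW CustomerOrders AS
        SELECT c.LastName, c.FirstName, o.OrderID, o.OrderDate
        FROM Customers AS c
        JOIN Orders AS o ON o.CustomerID = c.CustomerID;
""")

conn.execute("INSERT INTO Customers VALUES (1, 'Smith', 'Pat')")
conn.execute("INSERT INTO Orders VALUES (10, 1, '2009-09-14')")

# List everything the database contains, then read through the view.
objects = conn.execute(
    "SELECT type, name FROM sqlite_master ORDER BY type, name"
).fetchall()
rows = conn.execute("SELECT * FROM CustomerOrders").fetchall()

print(objects)
print(rows)
```

The data lives only in the two tables; the view stores nothing and simply presents the stored data in a form meant for display.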
Microsoft Access
• Access is not the best or most powerful DBMS; it is just the most convenient for use in this class.
• Nevertheless, it is good enough and powerful enough that it serves very nicely as a training database: most of what you need to know about databases you can learn using Access.
• You’ll start learning how to use Access in the next lecture and in lab.

Key Point!
• Good relational database design is about optimizing how the data is STORED, not how it is DISPLAYED.
• Most “tables” you have seen (in books, in lectures, on the web) were probably optimized for display, not for storage.
• Relational database tables are designed for consistency and to reduce redundancy. They are not designed for appearance.
• When we learn SQL and Visual Basic, we will look at various ways to display the data stored in relational database tables.

Why “relational?”
• We’ll be learning about database relationships later in this mega-lecture.
• However, relationships are not what gave the relational database its name.
• The term “relational” comes from the mathematical concept of “relation,” which refers to a set of ordered pairs (or triplets, etc.; the generic term is “tuple”) of items. A mathematical function is a special type of relation.

Entities
• In a properly designed relational database, each relation (table) represents a single “entity”.
• An entity is sort of a generic noun. For example, the concept of Customer is an entity, but one particular customer is not an entity.
• “Customers” could be the name of the table representing the Customer entity. Each row in the table would represent a particular real customer.
• In object-oriented programming (OOP), an entity is typically represented by something called a “class.” An individual instance of that class (a particular customer, for example) is called an “object.”

In each group, which is the entity?
• IOE 373, Course
• Team, Jets
• Person, Barack Obama
• Tom Brady, Player
• Company, Microsoft
• University, Michigan
• Honda Accord, Car
• Product, Crest Toothpaste
• This Slide, Slide

Tuples, Records, Rows
• What Codd referred to as a “tuple” is generally referred to as a “record” or a “row” in practical database discussions.
• Records are called rows because they are represented by the horizontal rows in database tables.
• The items in the previous slide that were not entities were records, that is, instances of the entity.
• In OOP, a record (row) is called an “object,” which is an instance of a class.

Attributes, Fields, Columns
• Codd and other theorists called the items in each tuple “attributes”, properties which all instances of an entity have, but which aren’t always the same for each instance.
• For example, attributes of the Customer entity could include LastName and FirstName. Each customer has these attributes, but the particular values are different.
• In practical database discussions, attributes are referred to as “fields” or “columns”. (“Columns” since they are represented by the vertical columns in a table.)
• In OOP, an attribute is generally referred to as a “property.”

Tables as Templates
• You can think of a table as a template for describing or designing things of a particular type.
• For example, the “car” entity can be described by attributes such as make, model, year, color, miles per gallon, etc.
• You can then use a table with those fields to describe individual cars (rows, records):

Typical Beginners’ Tables
• I’ve often seen the case where people with no formal database training have tried to create their own databases: perhaps for their CD or recipe collection, or to run their small business, or for some event they are organizing.
• Frequently they will turn to Excel. They will take a single worksheet and try to put all of their data into a single table.
Eventually, these single-table databases run into some serious problems: the problems that Codd was trying to address.

Guate Tours
• Here’s an example of a simple table that starts to grow, and the problems created by it.
• It is based on a database designed by a student last year. She actually designed the database properly; I took her data and made the table worse so I could use it as an example.

Guate Tours
• Guate Tours is a company in Guatemala which conducts tours of the historical, cultural, and natural treasures of that country.
• The company's manager, Cristina Rodriguez (“C-Rod”), has used Excel to prepare a list of the tours that they offer.

A student named Rita who works for Guate Tours in the summer returns home from her IOE 373 class, and tells the manager that they should add a Tour ID number to their table to serve as a primary key, especially since several of their tours have similar names. In the past, the manager has told employees to lead the Antigua tour, and the employees then took the customers on the WRONG Antigua tour. So having a numeric key for each tour may help.

Each tour is generally led by the same guide, but the manager sometimes forgets which employee leads which tour. Therefore, she decides to add the employee's name to the list.

Then one day the father of one of the customers is in a bad accident back in Italy and is taken to the hospital. The customer's brother calls Guate Tours and asks them to get in touch with the customer, but the manager has trouble finding the cellphone number of the guide. When the customer finally finds out about the accident, he is irate and threatens to sue the company. The manager decides it might be a good idea to add her employees' contact info to the list. So this is what the list looks like.

Eventually, the manager gets sick of managing and decides to give up the tour business and go into VB programming. On her last day, she appoints Rita as the new manager and removes herself from the list.
(Figure callout: “Manager’s name was here”)

After this, Rita looks more closely at the list and realizes that she is identified as the manager if you look at one row, but as a tour guide if you look at another. She recalls that this is the “update anomaly.” When C-Rod made Rita the manager, she only changed her title in one place. Rita changes her title to Manager in the other place now.

A few days later Rita gets a call from a customer from California who really wants to take the Rio Dulce trip. Rita looks at the list and says, “Do you mean the Rio Dulce Boat Trip?” The customer replies “No, I don't ever want to get that close to crocodiles again! I want that five-star hotel.” Rita then realizes the mistake: when C-Rod removed herself from the list, she took the whole Rio Dulce trip along with her. Rita remembers learning in 373 that this is called the “delete anomaly.” Too much data was stored in a single row in the list. She adds the Rio Dulce trip back into the list, and realizes that she doesn't have anyone to lead it.

Rita is now overworked. She is still leading two different tours AND managing the company, and she's still got the Rio Dulce trip with nobody to guide it. She decides to hire someone new. She hasn't quite decided which of her trips she is going to give to the new employee, so she tries to just insert his name and contact info into the table. She knows something is wrong: this is supposed to be a table of trips, and now she has a row in the table which doesn’t have a trip or a unique numeric identifier. She recalls learning something about this problem back in 373: the “insert anomaly”, it was called.

The Three Anomalies
• The story highlights what are known as the three anomalies:
  – Insert Anomaly: This occurs when you are storing information about one entity in a table focused on a different entity. In the story, the table was designed around tours.
When a new employee was hired, there was no logical place to put his information because he was not yet assigned to a tour.
  – Update Anomaly: For the same reason, updating employee information, like a change of title or phone number, is problematic in a table focused on a different entity. If you are not careful, inconsistencies can arise. In the story, Rita’s title was changed in one row, but not another. The table was no longer consistent as to what her title really was.
  – Delete Anomaly: When employee information is stored in the tours table, deleting a tour may result in the loss of all information about an employee.
• You should know these anomalies for the quiz on 9/21.

The Three Anomalies
• The insert, update, and delete anomalies were frequently encountered in early attempts to create useful databases back in the 1960s.
• Dr. E.F. Codd developed relational database theory as a way of avoiding these anomalies.
• All three problems arise because data about multiple entities are stored in a single table.
• In the Guate Tours example, the entities are Tours and Employees.

Two Entities in One Table
(Figure labels: Tours, Employees)

The cure: Third Normal Form
• Codd and others defined the rules for the proper structure of database tables; these have become known as “normal forms.” While six (and sometimes more) normal forms have been defined, the acceptable level of “normalization” for a database is Third Normal Form (3NF).
• You will learn the details of 3NF later in this lengthy slideshow.
• I will teach you 3NF mostly by example, but I’ll give you a brief rundown of what the first three normal forms mean.

OLTP vs. OLAP
• Modern databases tend to fall into one of two categories:
  – Online Transaction Processing (OLTP)
  – Online Analytical Processing (OLAP)
• We’ll spend a couple of weeks learning about OLTP databases;
• We’ll only spend a few slides learning about OLAP databases.
• Normalization is only necessary and desirable for OLTP databases.
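Returning to the three anomalies from the Guate Tours story, here is a minimal sketch that reproduces each one on a single wide table. It is an illustration under stated assumptions, not the student's actual database: it uses Python's built-in sqlite3 module instead of Access, and the table name TourList, the column names, the tour names, and the phone numbers are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One wide table holding two entities (tours and employees), as in the story.
# All names and values here are made up for illustration.
conn.execute("""
    CREATE TABLE TourList (
        TourID     INTEGER PRIMARY KEY,
        TourName   TEXT NOT NULL,
        GuideName  TEXT,
        GuideTitle TEXT,
        GuidePhone TEXT
    )
""")
conn.executemany(
    "INSERT INTO TourList VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Antigua Walking Tour", "Rita", "Tour Guide", "555-0101"),
        (2, "Rio Dulce Boat Trip", "Rita", "Tour Guide", "555-0101"),
        (3, "Rio Dulce Hotel Stay", "Cristina Rodriguez", "Manager", "555-0100"),
    ],
)

# Update anomaly: the title is changed in one row but not in the other,
# so the table now disagrees with itself about Rita's title.
conn.execute("UPDATE TourList SET GuideTitle = 'Manager' WHERE TourID = 1")
rita_titles = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT GuideTitle FROM TourList "
        "WHERE GuideName = 'Rita' ORDER BY GuideTitle"
    )
]

# Delete anomaly: removing the departing manager also removes her tour.
conn.execute("DELETE FROM TourList WHERE GuideName = 'Cristina Rodriguez'")
remaining_tours = [
    row[0] for row in conn.execute("SELECT TourName FROM TourList ORDER BY TourID")
]

# Insert anomaly: a new hire with no tour yet has no valid row to live in,
# because every row in this table must describe a tour.
try:
    conn.execute(
        "INSERT INTO TourList (TourName, GuideName, GuideTitle, GuidePhone) "
        "VALUES (NULL, 'New Guide', 'Tour Guide', '555-0102')"
    )
    insert_rejected = False
except sqlite3.IntegrityError:
    insert_rejected = True

print(rita_titles)
print(remaining_tours)
print(insert_rejected)
```

In this sketch the NOT NULL rule on TourName is what rejects the new hire. A plain Excel sheet has no such rule, so the row would be accepted there and simply sit in the list without a trip, which is what Rita noticed.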
OLAP: Online Analytical Processing
• OLAP databases are typically composed of historical data, such as sales records from previous years. This data is often gathered (using SQL) from an OLTP database.
• While there are technical differences in the definitions, an OLAP database is often called a “data warehouse.”

OLAP
• Since the data is historical, it doesn’t need to be updated. Therefore, the insert, update, and delete anomalies that can plague an OLTP database are rarely a concern with OLAP databases.

Uses for OLAP databases
• OLAP databases are generally used to help management make decisions. They can be used to answer questions like:
  – Which products sell best in which regions?
  – Who are our best (and worst) sales people?
  – Which stores should we close, and where should we open new stores?
  – Will we be better off trying to sell more low-volume, high-price items, or more high-volume, low-price items?

Characteristics of OLAP databases
• Since questions (queries) like these can take a long time to run when the data is stored in multiple tables, OLAP databases are frequently “denormalized”: the data from several tables is combined into a single table which is not in third normal form. Queries on single tables run faster than queries across multiple tables.
• Since these tables do not need to be updated, denormalization does not pose a problem.

OLAP, we hardly knew ya!
• Unfortunately, that’s all the time we have to spend on OLAP databases.
• This doesn’t mean they’re not important, just that they are easier to understand than OLTP databases; many typical Excel sheets are pretty close to being OLAP databases.
• If you want to learn more, there are many books (including Databases Demystified, chapter 12) and websites which cover the topic.

OLTP
• OLTP (Online Transaction Processing) databases are the type we’ll be focusing on in this class.
• OLTP databases are designed to handle:
  – Frequent changes (inserts, updates, deletes)
  – Frequent requests for small amounts of data, such as a customer or product record.
  – Multiple concurrent users.

OLTP databases: Is there anything they can’t do?
• Well, yes. Properly normalized OLTP databases are not optimized for:
  – Quickly retrieving large amounts of summary data.
  – Using in pivot tables to analyze historical data.
• In general, you run your day-to-day business using an OLTP database.
• You then periodically transfer that data to a non-normalized OLAP database.
• You use the OLAP database to analyze your business’s performance.

Practice
• Would you be more likely to find a table called “2009Sales” in an OLAP or OLTP database?
• How about “Customers”?
• “QuarterlySummary”?

OLTP
• Because the data in OLTP databases is frequently modified, they are highly susceptible to problems caused by the three anomalies (insert, update, delete).
• The vaccine for these anomalies is normalization; in particular, getting the tables into third normal form (3NF).
• Reminder: Normalization is not required for OLAP databases because their data is not updated frequently. In fact, it is common to denormalize OLAP databases.

Reminder!
• Good relational database design is about optimizing how the data is STORED, not how it is DISPLAYED.
• Most “tables” you have seen (in books, in lectures, on the web) were probably optimized for display, not for storage.
• Relational database tables are designed for consistency and to reduce redundancy. They are not designed for appearance.
• When we learn SQL and Visual Basic, we will look at various ways to display the data stored in relational database tables.

Normalization: A Database Thing
• While it is common to talk about normalizing tables, normalization is a process that refers to an entire database.
• Normalizing a table generally involves creating more tables.
• Before we can learn more about normalizing a database, we need to know about designing tables.
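As a preview of the point that normalizing a table generally involves creating more tables, the sketch below splits the Guate Tours list into one table per entity and then combines the two back into a single flat table of the kind an OLAP database might hold. This is an illustration only, not a full 3NF design: it uses Python's built-in sqlite3 module rather than Access, and every table name, column name, and data value is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# One table per entity; a tour refers to its guide by key (hypothetical schema).
conn.executescript("""
    CREATE TABLE Employees (
        EmployeeID INTEGER PRIMARY KEY,
        Name       TEXT NOT NULL,
        Title      TEXT NOT NULL,
        Phone      TEXT
    );
    CREATE TABLE Tours (
        TourID   INTEGER PRIMARY KEY,
        TourName TEXT NOT NULL,
        GuideID  INTEGER REFERENCES Employees(EmployeeID)  -- may be unassigned
    );
    INSERT INTO Employees VALUES
        (1, 'Cristina Rodriguez', 'Manager', '555-0100'),
        (2, 'Rita', 'Tour Guide', '555-0101');
    INSERT INTO Tours VALUES
        (1, 'Antigua Walking Tour', 2),
        (2, 'Rio Dulce Boat Trip', 2),
        (3, 'Rio Dulce Hotel Stay', 1);
""")

# Update: Rita's title is stored once, so one change keeps it consistent.
conn.execute("UPDATE Employees SET Title = 'Manager' WHERE EmployeeID = 2")

# Delete: the old manager leaves, but her tour stays (now without a guide).
conn.execute("UPDATE Tours SET GuideID = NULL WHERE GuideID = 1")
conn.execute("DELETE FROM Employees WHERE EmployeeID = 1")

# Insert: a new hire can be recorded before being assigned to any tour.
conn.execute(
    "INSERT INTO Employees (Name, Title, Phone) "
    "VALUES ('New Guide', 'Tour Guide', '555-0102')"
)

# Periodic transfer to a denormalized, OLAP-style table: one wide table
# built by joining the normalized tables.
conn.execute("""
    CREATE TABLE TourReport AS
        SELECT t.TourID, t.TourName, e.Name AS GuideName, e.Title AS GuideTitle
        FROM Tours AS t
        LEFT JOIN Employees AS e ON e.EmployeeID = t.GuideID
""")

employees = conn.execute(
    "SELECT Name, Title FROM Employees ORDER BY EmployeeID"
).fetchall()
report = conn.execute("SELECT * FROM TourReport ORDER BY TourID").fetchall()

print(employees)
print(report)
```

The two normalized tables are the ones that get edited day to day. The TourReport table is only a copy made for analysis, so the repeated guide names and titles in it are harmless as long as nobody updates it by hand.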