RELATIONAL DATA MODELING
MIS2502 Data Analytics

A Brief Review
• Gathering data
• Storing data
• Retrieving data
• Interpreting data
Will be talked about throughout the class

A Brief Review
Transactional Database
• Supports management of an organization’s data
• For everyday transactions
• This is what is commonly thought of as “database management”
Analytical Data Store
• Supports managerial decision-making
• For periodic analysis
• This is the foundation for business intelligence

The Information Architecture of an Organization
[Diagram labels: Data entry, Transactional Database, Data extraction, Analytical Data Store, Data analysis]
Transactional Database
• Stores real-time transactional data
• Called OLTP: Online transaction processing
Analytical Data Store
• Stores historical transactional and summary data
• Called OLAP: Online analytical processing

A Brief Review
Transactional Database
• Based on Relational paradigm
• Storage of real-time transactional data
• Optimized for storage efficiency and data integrity
• Supports day-to-day operations
Analytical Data Store
• Based on Dimensional paradigm
• Storage of historical transactional data
• Optimized for data retrieval and summarization
• Supports periodic and on-demand analysis

What is a model?
• Representation of something in the real world

Modeling a database
• A representation of the structure of the data
• Describes the data contained in the database
• Explains how the data interrelates
• A student is part of a section, which is part of a course

Why bother modeling?
• Creates a blueprint before you start building the database
• Gets the story straight: easy for non-technical people to understand
• Minimizes having to go back and make changes in the implementation stage

The process of analysis and design
• Systems Analysis
  • Analysis of complex, large-scale systems and the interactions within those systems
    http://en.wikipedia.org/wiki/Systems_analysis
• Systems Design
  • The process of defining the hardware and software architectures, components, models, interfaces, and data for a computer system to satisfy specified requirements
    http://en.wikipedia.org/wiki/Systems_design
Notice that they are not the same!

Basically…
In the context of database development:
• Systems Analysis is the process of modeling the problem
  • Requirements-oriented
  • What should we do?
  This is where we define and understand the business scenario.
• Systems Design is the process of modeling a solution
  • Functionality-oriented
  • How should we do it?
  This is where we implement that scenario as a database.

Start with a problem statement
• “We want a database to track orders.”
• That’s too vague to create a useful system, so we then gather requirements to learn more
• Gather documentation
  • About the business process
  • About existing systems
• Conduct interviews
  • Employees directly involved in the process
  • Other stakeholders (e.g., customers)
  • Management
Why is each of these important? Are there others?

Start with a problem statement
• Refine the problem statement
  • Getting iterative feedback from the client
• End up with a scenario like this:
  • The system must track customer orders
  • Multiple products can go into an order
  • A customer is described by their name, address, and a unique Customer ID number
  • An order is described by the date on which it was placed, what was bought, and how much it costs
The specification “what was bought” is a little vague, and that will cause us a problem a little later.
But let’s leave it for now…

First lecture on data modeling stops here.

Review Questions
• What is the key difference between a transactional database and an analytical data store?
• What is a model?
• What is the first step in building a database?
• Let’s build a relational database!

The Entity Relationship Diagram (ERD)
• The primary way of modeling a relational database
• Part of the “analysis” process
• Implemented as a picture with three key elements
  • Entity (use rectangle): a uniquely identifiable thing (e.g., person, order)
  • Relationship (use diamond): describes how two entities relate to one another (e.g., makes)
  • Attribute (use oval): a characteristic of an entity or relationship (e.g., first name, order number)

A very simple example
[ERD: Customer (Customer ID, First name, Last name, City, State, Zip) places Order (Order number, Order Date, Product name, Price)]

The primary key
• Entities need to be uniquely identifiable
  • So you can tell them apart when you retrieve them
• Use a primary key
  • An attribute (or a set of attributes) that uniquely identifies an entity
• Customer ID: uniquely identifies a customer
• Order number: uniquely identifies an order
How about these as primary keys for Customer:
• First name and/or last name?
• Social security number?

One-to-many relationship (ERD)
[ERD: Customer (at least one, at most one) places Order (at least one, at most many)]
This is a one-to-many relationship:
• One customer can have many orders
• One order can only belong to one customer

Many-to-one relationship (ERD)
[ERD: Order (at least one, at most many) is associated with Customer (at least one, at most one)]
This is a many-to-one relationship.

Many-to-many relationship (ERD)
[ERD: Employee (at least one, at most many) has Office (at least one, at most many)]
First read this way (Employee to Office). Then read this way (Office to Employee)!

Crow’s Foot Notation
[ERD: Customer makes Order]
So called because this… [the “many” end of the relationship line] …looks something like this [picture of a crow’s foot].
There are other ways of denoting cardinality, but this one is pretty standard.
There are also variations of the crow’s foot notation!
Cardinality is defined by business rules
• What would the cardinality be in these situations?
  • Order (?) contains Product (?)
  • Course (?) has Section (?)
  • Employee (?) has Office (?)

But we have a problem with our ERD
[ERD: Customer (Customer ID, First name, Last name, City, State, Zip) makes Order (Order number, Order Date, Product name, Price)]
This assumes every order contains only one product. So if I want two products, I have to make two orders!
The problem: Product is defined as an attribute, not an entity.
(Because we didn’t define our requirements clearly enough?)

Here’s a solution
[ERD: Customer (Customer ID, First name, Last name, City, State, Zip) places Order (Order number, Order Date); Order contains Product (Product name, Price), with Quantity on the contains relationship]
• Now
  • A customer can place multiple orders
  • An order can contain multiple products
  • A product can be part of multiple orders

So far for the 2nd class of ERD…

Implementing the ERD
• As a database schema
  • A map of the tables and fields in the database
  • This is what is implemented in the database management system
  • Part of the “design” process
• A schema actually looks a lot like the ERD
  • Entities become tables
  • Attributes become fields
  • Relationships can become additional tables (many:many)

Structure of a database
Data element   Description
Character      Single letter or number (“A”, “Z”, “1”)
Field          Set of related characters (first name)
Record         Set of related fields (all information about a customer)
Table          Set of related records (all customers in the company)
Database       Set of related tables (all information about the company)

Database Structure
• character, field, record, table, database

The Rules
1. Create a table for every entity
2. Create table fields for every entity’s attributes
3.
Implement relationships between the tables
1:many relationships
• Primary key field of “1” table put into “many” table as foreign key field
many:many relationships
• Create new table
• 1:many relationships with original tables
1:1 relationships
• Primary key field of one table put into other table as foreign key field

Our Order Database schema
[Schema diagram, with callouts: Original 1:n relationship; Original n:n relationship]
• Order-Product is a decomposed many-to-many relationship
• Order-Product has a 1:n relationship with Order and Product
• Now an order can have multiple products, and a product can be associated with multiple orders

What the Customer and Order tables look like

Customer Table
CustomerID  FirstName  LastName  City        State  Zip
1001        Greg       House     Princeton   NJ     09120
1002        Lisa       Cuddy     Plainsboro  NJ     09123
1003        James      Wilson    Pittsgrove  NJ     09121
1004        Eric       Foreman   Warminster  PA     19111

Order Table
Order Number  OrderDate  Customer ID
101           3-2-2011   1001
102           3-3-2011   1002
103           3-4-2011   1001
104           3-6-2011   1004

Note that there are no repeating records
• Every customer is unique
• Every order is unique
This is an example of normalization.
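The 1:many rule above can be tried out directly. The following is a minimal sketch, assuming Python's built-in sqlite3 module as the database management system (the slides do not name one). The table and column names are illustrative choices: the Order table is called orders here because ORDER is a reserved word in SQL, and Zip is stored as text to keep its leading zero. Order 105 and customer 9999 are hypothetical values used only to show a rejected insert.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when this is on

# Rule 1 and 2: one table per entity, one field per attribute.
# Rule 3 (1:many): the primary key of the "1" table (customer)
# is placed in the "many" table (orders) as a foreign key.
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    first_name  TEXT,
    last_name   TEXT,
    city        TEXT,
    state       TEXT,
    zip         TEXT
);
CREATE TABLE orders (
    order_number INTEGER PRIMARY KEY,
    order_date   TEXT,
    customer_id  INTEGER NOT NULL REFERENCES customer(customer_id)
);
""")

# Rows copied from the Customer and Order tables on the slide.
customers = [
    (1001, "Greg", "House", "Princeton", "NJ", "09120"),
    (1002, "Lisa", "Cuddy", "Plainsboro", "NJ", "09123"),
    (1003, "James", "Wilson", "Pittsgrove", "NJ", "09121"),
    (1004, "Eric", "Foreman", "Warminster", "PA", "19111"),
]
orders = [
    (101, "3-2-2011", 1001),
    (102, "3-3-2011", 1002),
    (103, "3-4-2011", 1001),
    (104, "3-6-2011", 1004),
]
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?, ?, ?)", customers)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)


def rejected(sql, params):
    """Return True if the database refuses the insert."""
    try:
        conn.execute(sql, params)
        return False
    except sqlite3.IntegrityError:
        return True


# A second customer 1001 violates the primary key (every customer is unique).
duplicate_customer_rejected = rejected(
    "INSERT INTO customer VALUES (?, ?, ?, ?, ?, ?)", customers[0]
)
# Hypothetical order 105 points at a customer that does not exist.
orphan_order_rejected = rejected(
    "INSERT INTO orders VALUES (?, ?, ?)", (105, "3-7-2011", 9999)
)
# One customer can have many orders.
orders_for_1001 = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE customer_id = ?", (1001,)
).fetchone()[0]

print(duplicate_customer_rejected, orphan_order_rejected, orders_for_1001)
```

The primary key is what makes "every customer is unique" enforceable, and the foreign key is what ties each order to exactly one existing customer.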
Normalization
• Organizing data to minimize redundancy (repeated data)
• This is good for two reasons
  • The database takes up less space
  • You have a lower chance of inconsistencies in your data
• If you want to make a change to a record, you only have to make it in one place (but you do not change the primary key, the unique identifier)
  • The relationships take care of the rest
• But you will usually need to link the separate tables together in order to retrieve information

To figure out who ordered what
• Match the Customer IDs of the two tables, starting with the table with the foreign key (Order):

Order Table                           | Customer Table
Order Number  OrderDate  Customer ID  | Customer ID  FirstName  LastName  City        State  Zip
101           3-2-2011   1001         | 1001         Greg       House     Princeton   NJ     09120
102           3-3-2011   1002         | 1002         Lisa       Cuddy     Plainsboro  NJ     09123
103           3-4-2011   1001         | 1001         Greg       House     Princeton   NJ     09120
104           3-6-2011   1004         | 1004         Eric       Foreman   Warminster  PA     19111

• We now know which order belonged to which customer
• This is called a join
• But it’s an inefficient way to store data (redundancies)
• So we normalize

Now the many:many relationship

Order Table
Order Number  OrderDate  Customer ID
101           3-2-2011   1001
102           3-3-2011   1002
103           3-4-2011   1001
104           3-6-2011   1004

Order-Product Table
OrderProductID  Order Number  Product ID  Quantity
1               101           2251        2
2               101           2282        3
3               101           2505        1
4               102           2251        5
5               102           2282        2
6               103           2505        3
7               104           2505        8

Product Table
ProductID  ProductName   Price
2251       Cheerios      3.99
2282       Bananas       1.29
2505       Eggo Waffles  2.99

The Order-Product table relates Order and Product to each other!
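The three tables above can be loaded and matched on their keys. This is a minimal sketch, assuming Python's built-in sqlite3 module; the table and column names (orders, order_product, product, and so on) are illustrative, and the Customer table is left out to keep the example short. The per-order total is an added illustration of the requirement that an order records "how much it costs".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # have SQLite enforce the foreign keys

# The many:many relationship becomes its own table (order_product)
# with a 1:many relationship to each of the original tables.
conn.executescript("""
CREATE TABLE orders (
    order_number INTEGER PRIMARY KEY,
    order_date   TEXT,
    customer_id  INTEGER
);
CREATE TABLE product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    price        REAL
);
CREATE TABLE order_product (
    order_product_id INTEGER PRIMARY KEY,
    order_number     INTEGER NOT NULL REFERENCES orders(order_number),
    product_id       INTEGER NOT NULL REFERENCES product(product_id),
    quantity         INTEGER
);
""")

# Rows copied from the slide.
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (101, "3-2-2011", 1001), (102, "3-3-2011", 1002),
    (103, "3-4-2011", 1001), (104, "3-6-2011", 1004),
])
conn.executemany("INSERT INTO product VALUES (?, ?, ?)", [
    (2251, "Cheerios", 3.99), (2282, "Bananas", 1.29), (2505, "Eggo Waffles", 2.99),
])
conn.executemany("INSERT INTO order_product VALUES (?, ?, ?, ?)", [
    (1, 101, 2251, 2), (2, 101, 2282, 3), (3, 101, 2505, 1),
    (4, 102, 2251, 5), (5, 102, 2282, 2), (6, 103, 2505, 3),
    (7, 104, 2505, 8),
])

# Start from the table holding the foreign keys and match each key
# to its parent table: this is the join.
order_contents = conn.execute("""
    SELECT op.order_number, o.order_date, p.product_name, op.quantity, p.price
    FROM order_product AS op
    JOIN orders  AS o ON o.order_number = op.order_number
    JOIN product AS p ON p.product_id   = op.product_id
    ORDER BY op.order_product_id
""").fetchall()

# Cost of each order: quantity times price, summed per order.
order_totals = dict(conn.execute("""
    SELECT op.order_number, ROUND(SUM(op.quantity * p.price), 2)
    FROM order_product AS op
    JOIN product AS p ON p.product_id = op.product_id
    GROUP BY op.order_number
""").fetchall())

for row in order_contents:
    print(row)
print(order_totals)
```

Each product's name and price are stored once in product, yet they appear on every joined row that mentions that product.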
To figure out what each order contains
• Match the Product IDs and Order IDs of the tables, starting with the table with the foreign keys (Order-Product):

Order-Product Table                                | Order Table                            | Product Table
OrderProductID  Order Number  Product ID  Quantity | Order Number  Order Date  Customer ID  | Product ID  Product Name  Price
1               101           2251        2        | 101           3-2-2011    1001         | 2251        Cheerios      3.99
2               101           2282        3        | 101           3-2-2011    1001         | 2282        Bananas       1.29
3               101           2505        1        | 101           3-2-2011    1001         | 2505        Eggo Waffles  2.99
4               102           2251        5        | 102           3-3-2011    1002         | 2251        Cheerios      3.99
5               102           2282        2        | 102           3-3-2011    1002         | 2282        Bananas       1.29
6               103           2505        3        | 103           3-4-2011    1001         | 2505        Eggo Waffles  2.99
7               104           2505        8        | 104           3-6-2011    1004         | 2505        Eggo Waffles  2.99

Now there is redundant product data as a result of the join!

Why redundant data is a big deal

Customer ID  Product ID  Product Name  Price
1001         2251        Cheerios      3.99
1001         2282        Bananas       1.29
1001         2505        Eggo Waffles  2.99
1002         2251        Cheerios      3.99
1002         2282        Bananas       1.29
1001         2505        Eggo Waffles  2.99
1004         2505        Eggo Waffles  2.99

Customer ID  First Name  Last Name  City        State  Zip
1001         Greg        House      Princeton   NJ     09120
1002         Lisa        Cuddy      Plainsboro  NJ     09123
1001         Greg        House      Princeton   NJ     09120
1004         Eric        Foreman    Warminster  PA     19111

The redundant data seems harmless, but:
• What if the price of “Eggo Waffles” changes?
• And what if Greg House changes his address?
• And if there are 1,000,000 records?

Best practices for normalization
• Create new entities when there are collections of related attributes, especially when they would repeat
• For example, consider a modified Product entity
Don’t do this…
[ERD: Product (Product name, Price, Vendor Name, Vendor Address, Vendor Phone)]
…do this.
[ERD: Vendor (Vendor ID, Vendor Name, Vendor Address, Vendor Phone) sells Product (Product name, Price)]
Then you won’t have to repeat vendor information for each product.
Why did we introduce VendorID?

Best practices for normalization
• Create new entities to enforce data entry standards
…but this can be even better.
[ERD: Customer (Customer ID, First name, Last name, City, State, Zip)]
This is fine…
[ERD: Customer (Customer ID, First name, Last name, Zip) linked to City (City ID, City Name) and State (State ID, State Name)]
The city name is entered only once in the City table; CityID is used in the Customer table.

City and State as “lookup tables”
• Why this can be a better way of doing it
[ERD: Customer (Customer ID, First name, Last name, Zip) linked to City (City ID, City Name) and State (State ID, State Name)]

Customer
CustomerID  FirstName  LastName  CityID  StateID  Zip
1001        Greg       House     1       1        09120
1002        Lisa       Cuddy     2       1        09123
1003        James      Wilson    3       1        09121
1004        Eric       Foreman   4       2        19111

City
CityID  CityName
1       Princeton
2       Plainsboro
3       Pittsgrove
4       Warminster

State
StateID  StateName     Abbr
1        New Jersey    NJ
2        Pennsylvania  PA

This helps prevent inconsistent spellings (Pennsylvania is always entered as “2”).

The three-way relationship
• Sometimes three entities are necessary to capture what happens in a transaction
• This would be modeled as a many-to-many-to-many relationship
[ERD: Mechanic, Repair, and Car joined by the Performs relationship. Attribute labels: Employee ID, Name, Salary; Repair code, Description; Repair date, Charge; VIN, Make, Model]

The many:many:many table
• The many-to-many-to-many relationship would still be represented as a separate table
• Just with three foreign keys, instead of two

RepairID  RepairCode  EmployeeID  VIN          Repair date
1         101         9112        10192919201  2011-3-1
2         201         2313        19292919291  2011-3-2
3         302         1231        102010023    2011-3-3
4         223         2132        393848383    2011-3-4

[ERD: Mechanic, Repair, and Car joined by the Performs relationship. Attribute labels: Employee ID, Name, Salary; Repair code, Description; Repair date, Charge; VIN, Make, Model]
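The many:many:many table can be sketched the same way as a two-way one. The following is a minimal illustration, assuming Python's built-in sqlite3 module; the table and column names are illustrative choices, not taken from the slides. The slides give no names, salaries, descriptions, makes, or models, so the parent rows are created with their keys only, and Charge is left out because the extracted diagram does not make clear which entity it belongs to. The VIN "00000000000" is a hypothetical value used only to show a rejected insert.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # have SQLite enforce the foreign keys

conn.executescript("""
CREATE TABLE mechanic (employee_id INTEGER PRIMARY KEY, name TEXT, salary REAL);
CREATE TABLE repair   (repair_code INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE car      (vin TEXT PRIMARY KEY, make TEXT, model TEXT);

-- The relationship table: three foreign keys instead of two.
CREATE TABLE performs (
    repair_id   INTEGER PRIMARY KEY,
    repair_code INTEGER NOT NULL REFERENCES repair(repair_code),
    employee_id INTEGER NOT NULL REFERENCES mechanic(employee_id),
    vin         TEXT    NOT NULL REFERENCES car(vin),
    repair_date TEXT
);
""")

# Rows copied from the slide: RepairID, RepairCode, EmployeeID, VIN, Repair date.
performs_rows = [
    (1, 101, 9112, "10192919201", "2011-3-1"),
    (2, 201, 2313, "19292919291", "2011-3-2"),
    (3, 302, 1231, "102010023", "2011-3-3"),
    (4, 223, 2132, "393848383", "2011-3-4"),
]

# Parent rows must exist before the relationship rows can refer to them.
# Only the keys are known, so the other fields stay NULL.
conn.executemany("INSERT INTO repair (repair_code) VALUES (?)",
                 [(r[1],) for r in performs_rows])
conn.executemany("INSERT INTO mechanic (employee_id) VALUES (?)",
                 [(r[2],) for r in performs_rows])
conn.executemany("INSERT INTO car (vin) VALUES (?)",
                 [(r[3],) for r in performs_rows])
conn.executemany("INSERT INTO performs VALUES (?, ?, ?, ?, ?)", performs_rows)

# Ask SQLite which tables performs points at (column 2 is the parent table).
fk_targets = sorted(row[2] for row in
                    conn.execute("PRAGMA foreign_key_list(performs)"))

# A repair on a car that is not in the car table is refused.
try:
    conn.execute("INSERT INTO performs VALUES (?, ?, ?, ?, ?)",
                 (5, 101, 9112, "00000000000", "2011-3-5"))
    unknown_car_rejected = False
except sqlite3.IntegrityError:
    unknown_car_rejected = True

print(fk_targets, unknown_car_rejected)
```

Each row of performs records one transaction: which mechanic performed which repair on which car, and when.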