Introduction to Database Design
July 2006
Ken Nunes
knunes @ sdsc.edu
SAN DIEGO SUPERCOMPUTER CENTER

Database Design Agenda
•Introductions
•General Design Considerations
•Entity-Relationship Model
•Normalization
•Overview of SQL
•Star Schemas
•Additional Information
•Q&A

General Design Considerations
•Users
•Application Requirements
•Legacy Systems/Data

Users
•Who are they?
  •Administrative
  •Scientific
  •Technical
•Impact
  •Access Controls
  •Interfaces
  •Service levels

Application Requirements
•What kind of database?
  •OnLine Analytical Processing (OLAP)
  •OnLine Transactional Processing (OLTP)
•Budget
•Platform / Vendor
•Workflow?
  •order of operations
  •error handling
  •reporting

Legacy Systems/Data
•What systems are currently in place?
•Where does the data come from?
•How is it generated?
•What format is it in?
•What is the data used for?
•Which parts of the system must remain static?

Entity-Relationship Model
A logical design method which emphasizes simplicity and readability.
•Basic objects of the model are:
  •Entities
  •Relationships
  •Attributes

Entities
Data objects detailed by the information in the database.
•Denoted by rectangles in the model.
[Diagram: rectangles for Employee and Department]

Attributes
Characteristics of entities or relationships.
•Denoted by ellipses in the model.
[Diagram: Employee with attributes Name, SSN; Department with attributes Name, Budget]

Relationships
Represent associations between entities.
•Denoted by diamonds in the model.
[Diagram: Employee (Name, SSN) - works in (Start date) - Department (Name, Budget)]

Relationship Connectivity
Constraints on the mapping of the associated entities in the relationship.
•Denoted by variables between the related entities.
•Generally, values for connectivity are expressed as “one” or “many”.
[Diagram: Employee (Name, SSN) N - work (Start date) - 1 Department (Name, Budget)]

Connectivity
  one-to-one:    Department 1 - has - 1 Manager
  one-to-many:   Department 1 - has - N Project
  many-to-many:  Employee M - works on - N Project

ER example
Retailer wants to create an online webstore.
•The retailer requires information on:
  •Customers
  •Items
  •Orders

Webstore Entities & Attributes
•Customers - name, credit card, address
•Items - name, price, inventory
•Orders - item, quantity, cost, date, status
[Diagram: Customers (name, credit card, address); Items (name, price, inventory); Orders (item, quantity, cost, date, status)]

Webstore Relationships
Identify the relationships.
•The orders are recorded each time a customer purchases items, so the customer and order entities are related.
•Each customer may make several purchases, so the relationship is one-to-many.
[Diagram: Customer 1 - purchase - N Order]

Webstore Relationships
Identify the relationships.
•The order consists of the items a customer purchases, but each item can be found in multiple orders.
•Since an order can contain multiple items and an item can appear in multiple orders, the relationship is many-to-many.
[Diagram: Order M - consists - N Item]

Webstore ER Diagram
[Diagram: Customers (name, credit card, address) 1 - purchase - N Orders (status, date, item, quantity, cost); Orders M - consists - N Items (name, price, inventory)]

Logical Design to Physical Design
Creating relational SQL schemas from entity-relationship models.
•Transform each entity into a table with the key and its attributes.
•Transform each relationship as either a relationship table (many-to-many) or a “foreign key” (one-to-one and one-to-many).

Entity tables
Transform each entity into a table with a key and its attributes.
[Diagram: Employee with attributes Name, SSN]
  create table employee
    (emp_no number,
     name varchar2(256),
     ssn number,
     primary key (emp_no));

Foreign Keys
Transform each one-to-one or one-to-many relationship as a “foreign key”.
•Foreign key is a reference in the child (many) table to the primary key of the parent (one) table.
[Diagram: Department 1 - has - N Employee]
  create table department
    (dept_no number,
     name varchar2(50),
     primary key (dept_no));
  create table employee
    (emp_no number,
     dept_no number,
     name varchar2(256),
     ssn number,
     primary key (emp_no),
     foreign key (dept_no) references department);

Foreign Key
Department
  dept_no  Name
  1        Accounting
  2        Human Resources
  3        IT
Employee
  emp_no  dept_no  Name
  1       2        Nora Edwards
  2       3        Ajay Patel
  3       2        Ben Smith
  4       1        Brian Burnett
  5       3        John O'Leary
  6       3        Julia Lenin
Accounting has 1 employee: Brian Burnett
Human Resources has 2 employees: Nora Edwards, Ben Smith
IT has 3 employees: Ajay Patel, John O'Leary, Julia Lenin

Many-to-Many tables
Transform each many-to-many relationship as a table.
•The relationship table will contain the foreign keys to the related entities as well as any relationship attributes.
[Diagram: Project N - has (Start date) - M Employee]
  create table project_employee_details
    (proj_no number,
     emp_no number,
     start_date date,
     primary key (proj_no, emp_no),
     foreign key (proj_no) references project,
     foreign key (emp_no) references employee);

Many-to-Many tables
Project
  proj_no  Name
  1        Employee Audit
  2        Budget
  3        Intranet
Project_employee_details
  proj_no  emp_no  start_date
  1        4       4/7/03
  3        6       8/12/02
  3        5       3/4/01
  2        6       11/11/02
  3        2       12/2/03
  2        1       7/21/04
Employee
  emp_no  dept_no  Name
  1       2        Nora Edwards
  2       3        Ajay Patel
  3       2        Ben Smith
  4       1        Brian Burnett
  5       3        John O'Leary
  6       3        Julia Lenin
Employee Audit has 1 employee: Brian Burnett
Budget has 2 employees: Julia Lenin, Nora Edwards
Intranet has 3 employees: Julia Lenin, John O'Leary, Ajay Patel

Normalization
A logical design method which minimizes data redundancy and reduces design flaws.
•Consists of applying various “normal” forms to the database design.
•The normal forms break down large tables into smaller subsets.

First Normal Form (1NF)
Each attribute must be atomic.
• No repeating columns within a row.
• No multi-valued columns.
1NF simplifies attributes.
• Queries become easier.

1NF
Employee (unnormalized)
  emp_no  name           dept_no  dept_name  skills
  1       Kevin Jacobs   201      R&D        C, Perl, Java
  2       Barbara Jones  224      IT         Linux, Mac
  3       Jake Rivera    201      R&D        DB2, Oracle, Java
Employee (1NF)
  emp_no  name           dept_no  dept_name  skills
  1       Kevin Jacobs   201      R&D        C
  1       Kevin Jacobs   201      R&D        Perl
  1       Kevin Jacobs   201      R&D        Java
  2       Barbara Jones  224      IT         Linux
  2       Barbara Jones  224      IT         Mac
  3       Jake Rivera    201      R&D        DB2
  3       Jake Rivera    201      R&D        Oracle
  3       Jake Rivera    201      R&D        Java

Second Normal Form (2NF)
Each attribute must be functionally dependent on the primary key.
• Functional dependence - the property of one or more attributes that uniquely determines the value of other attributes.
• Any non-dependent attributes are moved into a smaller (subset) table.
2NF improves data integrity.
• Prevents update, insert, and delete anomalies.

Functional Dependence
Employee (1NF)
  emp_no  name           dept_no  dept_name  skills
  1       Kevin Jacobs   201      R&D        C
  1       Kevin Jacobs   201      R&D        Perl
  1       Kevin Jacobs   201      R&D        Java
  2       Barbara Jones  224      IT         Linux
  2       Barbara Jones  224      IT         Mac
  3       Jake Rivera    201      R&D        DB2
  3       Jake Rivera    201      R&D        Oracle
  3       Jake Rivera    201      R&D        Java
Name, dept_no, and dept_name are functionally dependent on emp_no (emp_no -> name, dept_no, dept_name).
Skills is not functionally dependent on emp_no, since one emp_no can have several skills values.

2NF
Employee (1NF): the table shown above.
Employee (2NF)
  emp_no  name           dept_no  dept_name
  1       Kevin Jacobs   201      R&D
  2       Barbara Jones  224      IT
  3       Jake Rivera    201      R&D
Skills (2NF)
  emp_no  skills
  1       C
  1       Perl
  1       Java
  2       Linux
  2       Mac
  3       DB2
  3       Oracle
  3       Java

Data Integrity
Using the Employee (1NF) table shown above:
• Insert Anomaly - adding null values. e.g., inserting a new department that has no employees yet leaves emp_no, the primary key, null.
• Update Anomaly - a single name change requires updates to multiple rows, which causes performance degradation. e.g., changing IT dept_name to IS.
• Delete Anomaly - deleting wanted information. e.g., deleting the IT department removes employee Barbara Jones from the database.

Third Normal Form (3NF)
Remove transitive dependencies.
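As a rough illustration of why separating department data out of the employee table helps, the sketch below uses Python's built-in sqlite3 module. That choice is an assumption made so the example can run anywhere; the slides themselves use Oracle-style SQL, and the SQLite column types here are stand-ins. It reuses the employee rows and the IT-to-IS rename from the anomaly examples and shows that, once department names live in their own table, the rename touches a single row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table department (dept_no integer primary key, dept_name text);
create table employee (
    emp_no  integer primary key,
    name    text,
    dept_no integer references department
);
insert into department values (201, 'R&D'), (224, 'IT');
insert into employee values
    (1, 'Kevin Jacobs', 201),
    (2, 'Barbara Jones', 224),
    (3, 'Jake Rivera', 201);
""")

# Renaming a department updates one row, however many employees it has.
cur = conn.execute("update department set dept_name = 'IS' where dept_no = 224")
rows_changed = cur.rowcount

# A join rebuilds the combined employee/department view when it is needed.
employees_with_dept = conn.execute("""
    select e.emp_no, e.name, d.dept_name
    from employee e join department d on d.dept_no = e.dept_no
    order by e.emp_no
""").fetchall()

print(rows_changed)
print(employees_with_dept)
```

A new department can also be inserted into department without inventing an employee, and deleting an employee no longer deletes the department's name.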
• Transitive dependence - a non-key attribute depends on another non-key attribute; in effect, two separate entities exist within one table.
• Any transitive dependencies are moved into a smaller (subset) table.
3NF further improves data integrity.
• Prevents update, insert, and delete anomalies.

Transitive Dependence
Employee (2NF)
  emp_no  name           dept_no  dept_name
  1       Kevin Jacobs   201      R&D
  2       Barbara Jones  224      IT
  3       Jake Rivera    201      R&D
Dept_no and dept_name are functionally dependent on emp_no; however, department can be considered a separate entity.

3NF
Employee (2NF): the table shown above.
Employee (3NF)
  emp_no  name           dept_no
  1       Kevin Jacobs   201
  2       Barbara Jones  224
  3       Jake Rivera    201
Department (3NF)
  dept_no  dept_name
  201      R&D
  224      IT

Other Normal Forms
Boyce-Codd Normal Form (BCNF)
• Strengthens 3NF by requiring the determinant of every non-trivial functional dependency to be a superkey (a column or columns that uniquely identify a row).
Fourth Normal Form (4NF)
• Eliminate non-trivial multivalued dependencies whose determinant is not a superkey.
Fifth Normal Form (5NF)
• Eliminate join dependencies not implied by the candidate keys.

Normalizing our webstore (1NF)
orders
  order_id  cust_id  item_id  quantity  cost  date     status
  405       45       34       2         100   2/306    shipped
  405       45       35       1         50    2/306    shipped
  405       45       56       3         75    2/306    shipped
  408       78       56       2         50    3/5/06   refunded
  410       102      72       2         150   3/10/06  shipped
  410       102      81       1         175   3/10/06  shipped
items
  item_id  name          price  inventory
  34       sweater red   50     21
  35       sweater blue  50     10
  56       t-shirt       25     76
  72       jeans         75     5
  81       jacket        175    9
customers
  cust_id  name          address
  45       Mike Speedy   123 A St.
  45       Mike Speedy   123 A St.
  45       Mike Speedy   123 A St.
  78       Frank Newmon  2 Main St.
  102      Joe Powers    343 Blue Blvd.
  102      Joe Powers    343 Blue Blvd.
customers (continued; remaining columns, same row order)
  credit_card_num  credit_card_type
  45154            visa
  32499            mastercard
  12834            discover
  45698            visa
  94065            mastercard
  10532            discover

Normalizing our webstore (2NF & 3NF)
customers
  cust_id  name          address
  45       Mike Speedy   123 A St.
  78       Frank Newmon  2 Main St.
  102      Joe Powers    343 Blue Blvd.
credit_cards
  cust_id  num    type
  45       45154  visa
  45       32499  mastercard
  45       12834  discover
  78       45698  visa
  102      94065  mastercard
  102      10532  discover

Normalizing our webstore (2NF & 3NF)
items
  item_id  name          price  inventory
  34       sweater red   50     21
  35       sweater blue  50     10
  56       t-shirt       25     76
  72       jeans         75     5
  81       jacket        175    9
orders
  order_id  cust_id  date     status
  405       45       2/306    shipped
  408       78       3/5/06   refunded
  410       102      3/10/06  shipped
order details
  order_id  item_id  quantity  cost
  405       34       2         100
  405       35       1         50
  405       56       3         75
  408       56       2         50
  410       72       2         150
  410       81       1         175

Revisit webstore ER diagram
[Diagram: Customers (name, address) 1 - have - N Credit card (card type, card number); Customers 1 - purchase - N Orders (status, date); Orders 1 - consists - N Order details (quantity, cost); Order details M - consists - N Items (name, price, inventory)]

Structured Query Language
SQL is the standard language for data definition and data manipulation for relational database systems.
• Nonprocedural
• Universal

Data Definition Language
The aspect of SQL that defines and manipulates objects in a database.
• create tables
• alter tables
• drop tables
• create views

Create Table
[Diagram: Customer (name, address) 1 - have - N Credit card (card type, card number)]
  create table customer
    (cust_id number,
     name varchar(50) not null,
     address varchar(256) not null,
     primary key (cust_id));
  create table credit_card
    (cust_id number not null,
     credit_card_type varchar(10) not null,
     credit_card_num number not null,
     foreign key (cust_id) references customer);

Modifying Tables
  alter table customer modify name varchar(256);
  alter table customer add credit_limit number;
  drop table customer;

Data Manipulation Language
The aspect of SQL used to manipulate the data in a database.
• queries
• updates
• inserts
• deletes

Select command
Used to query data from database tables.
• Format:
  select <columns> from <table> where <condition>;

Query example
customers
  cust_id  name          address
  45       Mike Speedy   123 A St.
  78       Frank Newmon  2 Main St.
  102      Joe Powers    343 Blue Blvd.
  select name from customers;
result:
  Mike Speedy
  Frank Newmon
  Joe Powers

Query example
customers: the table shown above.
  select name from customers where address = '123 A St.';
result:
  Mike Speedy

Query example
customers: the table shown above.
credit_cards
  cust_id  num
  45       45154
  45       32499
  45       12834
  78       45698
  102      94065
  102      10532
  select * from customers, credit_cards
    where customers.cust_id = credit_cards.cust_id and type = 'visa';
returns (each row shows its customers columns, then its credit_cards columns on the next line):
  cust_id  name          address
           cust_id  num    type
  45       Mike Speedy   123 A St.
           45       45154  visa
  78       Frank Newmon  2 Main St.
           78       45698  visa
credit_cards (continued; type column, same row order): visa, mastercard, discover, visa, mastercard, discover

Changing Data
There are 3 commands that change data in a table.
Insert:
  insert into <table> (<columns>) values (<values>);
  insert into customer (cust_id, name) values (3, 'Fred Flintstone');
Update:
  update <table> set <column> = <value> where <condition>;
  update customer set name = 'Mark Speedy' where cust_id = 45;
Delete:
  delete from <table> where <condition>;
  delete from customer where cust_id = 45;

Star Schemas
Designed for data retrieval.
• Best for use in decision support tasks such as Data Warehouses and Data Marts.
• Denormalized - allows for faster querying due to fewer joins.
• Slow performance for insert, delete, and update transactions.
• Composed of two types of tables: facts and dimensions.

Fact Table
The main table in a star schema is the Fact table.
• Contains groupings of measures of an event to be analyzed.
•Measure - numeric data
Invoice Facts: units sold, unit amount, total sale price

Dimension Table
Dimension tables are groupings of descriptors and measures of the fact.
•Descriptor - non-numeric data
Customer Dimension: cust_dim_key, name, address, phone
Location Dimension: loc_dim_key, store number, store address, store phone
Time Dimension: time_dim_key, invoice date, due date, delivered date
Product Dimension: prod_dim_key, product, price, cost

Star Schema
The fact table forms a one-to-many relationship with each dimension table: one dimension row relates to many fact rows.
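The one-to-many link between a dimension and the fact table can be sketched with Python's built-in sqlite3 module. This is a minimal sketch under stated assumptions: the table names invoice_facts, customer_dim, and product_dim, the reduced column sets, and all sample rows are hypothetical and only loosely follow the invoice example; SQLite types stand in for the Oracle-style types used elsewhere in these slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions hold descriptors; each has a single-column key.
create table customer_dim (cust_dim_key integer primary key, name text);
create table product_dim  (prod_dim_key integer primary key, product text);

-- The fact table holds one foreign key per dimension plus numeric measures.
create table invoice_facts (
    cust_dim_key     integer references customer_dim,
    prod_dim_key     integer references product_dim,
    units_sold       integer,
    total_sale_price integer
);

-- Hypothetical sample rows.
insert into customer_dim values (1, 'Customer A'), (2, 'Customer B');
insert into product_dim  values (10, 'Widget'), (20, 'Gadget');
insert into invoice_facts values
    (1, 10, 2, 40),
    (1, 20, 1, 30),
    (2, 10, 5, 100);
""")

# Analysis needs one join per dimension used, then an aggregate over measures.
sales_by_product = conn.execute("""
    select p.product, sum(f.units_sold), sum(f.total_sale_price)
    from invoice_facts f
    join product_dim p on p.prod_dim_key = f.prod_dim_key
    group by p.product
    order by p.product
""").fetchall()

print(sales_by_product)
```

Each product row is referenced by several fact rows, which is the "many" side of the relationship; grouping by a dimension column and summing a measure is the typical query shape this layout is meant to serve.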
[Diagram: each dimension table (1) connects to Invoice Facts (N)]
Invoice Facts: cust_dim_key, loc_dim_key, time_dim_key, prod_dim_key, units sold, unit amount, total sale price
Customer Dimension: cust_dim_key, name, address, phone
Location Dimension: loc_dim_key, store number, store address, store phone
Time Dimension: time_dim_key, invoice date, due date, delivered date
Product Dimension: prod_dim_key, product, price, cost

Analyzing the webstore
The manager needs to analyze the orders obtained from the webstore.
• From this we will use the order table to create our fact table.
Order Facts: date, items, customers

Webstore Dimension
We have 2 dimensions for the schema: customers and items.
Customer Dimension: cust_dim_key, name, address, credit_card_type
Item Dimension: item_dim_key, name, price, inventory

Webstore Star Schema
[Diagram: Customer Dimension 1 - N Order Facts N - 1 Item Dimension]
Order Facts: date, items, customers
Customer Dimension: cust_dim_key, name, address, credit_card_type
Item Dimension: item_dim_key, name, price, inventory

Books and Reference
•Database Design for Mere Mortals, Michael J. Hernandez
•Information Modeling and Relational Databases, Terry Halpin
•Database Modeling and Design, Toby J. Teorey

Continuing Education
UCSD Extension
Data Management Courses
DBA Certificate Program
Database Application Developer Certificate Program

Data Central
The Data Services Group provides Data Allocations for the research community.
• http://datacentral.sdsc.edu/
•Tools and expertise for making data collections available to the broader scientific community.
•Provide disk, tape, and database storage resources.
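Returning to the webstore star schema, the following is a minimal sketch of how the manager's analysis might look, written with Python's built-in sqlite3 module. Several things here are assumptions rather than part of the slides: the fact table is named order_facts and holds one row per ordered item with quantity and cost as measures, the original cust_id and item_id values are reused as dimension keys, the date column is left out, and SQLite types replace the Oracle-style ones. The rows come from the normalized webstore tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table customer_dim (cust_dim_key integer primary key, name text);
create table item_dim (item_dim_key integer primary key, name text, price integer);

-- Assumed fact layout: one row per ordered item; quantity and cost are measures.
create table order_facts (
    cust_dim_key integer references customer_dim,
    item_dim_key integer references item_dim,
    quantity     integer,
    cost         integer
);

insert into customer_dim values
    (45, 'Mike Speedy'), (78, 'Frank Newmon'), (102, 'Joe Powers');
insert into item_dim values
    (34, 'sweater red', 50), (35, 'sweater blue', 50), (56, 't-shirt', 25),
    (72, 'jeans', 75), (81, 'jacket', 175);
insert into order_facts values
    (45, 34, 2, 100), (45, 35, 1, 50), (45, 56, 3, 75),
    (78, 56, 2, 50), (102, 72, 2, 150), (102, 81, 1, 175);
""")

# Total order cost per customer: join the facts to the customer dimension.
cost_by_customer = conn.execute("""
    select c.name, sum(f.cost)
    from order_facts f
    join customer_dim c on c.cust_dim_key = f.cust_dim_key
    group by c.cust_dim_key, c.name
    order by c.name
""").fetchall()

# Units ordered per item: join the facts to the item dimension.
units_by_item = conn.execute("""
    select i.name, sum(f.quantity)
    from order_facts f
    join item_dim i on i.item_dim_key = f.item_dim_key
    group by i.item_dim_key, i.name
    order by i.item_dim_key
""").fetchall()

print(cost_by_customer)
print(units_by_item)
```

Both questions are answered with a single join from the fact table to one dimension, which is the property the star layout trades normalization for.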