DATA SCIENCE MIS0855 | Spring 2016 Designing Data SungYong Um sungyong.um@temple.edu Pivot Table Data Use a Flat File Structure EmpNo 101 102 103 101 102 103 101 102 103 Ename Abigail Bob Carolyn Abigail Bob Carolyn Abigail Bob Carolyn DeptNo 10 20 10 10 20 10 10 20 10 DeptName Marketing Purchasing Marketing Marketing Purchasing Marketing Marketing Purchasing Marketing Date Jan 2014 Jan 2014 Jan 2014 Feb 2014 Feb 2014 Feb 2014 Mar 2014 Mar 2014 Mar 2014 Expenses $1,000 $500 $1,500 $250 $1,000 $900 $400 $1,750 $2,000 All related data is in the same column EmpNo 101 102 103 101 102 103 101 102 103 Ename Abigail Bob Carolyn Abigail Bob Carolyn Abigail Bob Carolyn DeptNo 10 20 10 10 20 10 10 20 10 DeptName Marketing Purchasing Marketing Marketing Purchasing Marketing Marketing Purchasing Marketing Date Jan 2014 Jan 2014 Jan 2014 Feb 2014 Feb 2014 Feb 2014 Mar 2014 Mar 2014 Mar 2014 Expenses $1,000 $500 $1,500 $250 $1,000 $900 $400 $1,750 $2,000 This allows you to aggregate data by column value EmpNo 101 102 103 101 102 103 101 102 103 Ename DeptNo DeptName Date Expenses Abigail 10 Marketing Jan 2014 $1,000 Bob 20 Purchasing Jan 2014 $500 Carolyn 10 Marketing Jan 2014 $1,500 Abigail 10 Marketing Feb 2014 $250 Bob 20 Purchasing Feb 2014 $1,000 Carolyn 10 Marketing Feb 2014 $900 Abigail 10 Marketing Mar 2014 $500 Bob 20 Purchasing Mar 2014 $1,750 Carolyn 10 Marketing Mar 2014 $2,000 How many people work in marketing? How much did Abigail expense in January through March? Who expensed the most in March? Which month had the greatest amount expensed? Relational Data Model Data Model: a formal way to express data relationships to a database Relational Data Model: a type of model that represents its information in the form of logically-related 2-D tables The most common, intuitive, de facto model consists of Entities, Attributes, and Relationships Entity-Relationship Diagram (ERD) Entities and Attributes (1/2) Entity, aka table, is an object or event about which information is stored. Object: e.g. Customer, Product, Employee, Factory Event: e.g. Order, Registration, Contract, Payment A company is usually not an entity unless a database stores information of multiple companies. Attributes, aka fields or columns, are characteristics or proper- ties of an entity. A customer (entity) can be described by customer number, name, address, phone number (attributes). Entities and Attributes (2/2) How to distinguish between Entity and Attribute? An Attribute is a characteristic or property of an Entity. An Attribute becomes a column of a Table. Instances An entity consists of multiple Instances, aka records or rows. Entity Instance Attribute Entity Identifier (Primary Key) Entity Identifier, aka primary key, is an attribute that ensures each instance has a unique value that distinguishes it from every other instance. Every entity must have an entity identifier. TUID # Social Security # Plate # other examples? How to draw Entities and Attributes (1/2) Entity Identifier How to draw Entities and Attributes (2/2) Entity Identifier Entity Customer Order Item Distributor Customer Number First Name Last Name Street City State Zip Code Phone Number Credit Card # Credit Card Exp. Order Number Customer Number Order Date Order Filled Item Number Title Distributor # Price Release Date Genre Distributor # Name Street City State Zip Code Phone Number Contact Name Contact Phone Attributes Example SportTech Events puts on athletic events for local high school athletes. The company needs a database designed to keep track of the sponsor for the event and where the event is located. Each event needs a description, date, and cost. Separate costs are negotiated for each event. The company would also like to have a list of potential sponsors that includes each sponsor’s contact information such as the name, phone number, and address. Each event will have a single sponsor, but a particular sponsor may sponsor more than one event. Each location will need an ID, contact person, and phone number. A particular event will use only one location, but a location may be used for multiple events. ID Descripti on Date ID Cost Name Event Sponsor Phone Add-ress Location ID Contact Phone Relationship An entity should be related to one or more other entities. Any relationship between the entities below? One-to-One (1:1) Relationship A relationship between two entities in which an instance of entity A can be related to only one instance of entity B and an instance of entity B can be related to only one instance of entity A. Name of relationship usually with a verb One-to-Many (1:M) Relationship A relationship between two entities, in which an instance of entity A can be related to one or more instances of entity B and an instance of entity B can be related to only one instance of entity A. Many-to-Many (M:N) Relationship A relationship between two entities, in which an instance of entity A can be related to one or more instances of entity B and an instance of entity B can be related to one or more instances of entity A. ID Descripti on Date ID Cost Name Event M Support 1 Sponsor Phone M Add-ress Host 1 Location ID Contact Phone Relationship Depends on Business Rules There is no right answer in relationship types. It depends on business descriptions and rules. If a customer can place only one order, it’s 1:1 relationship. If an order can have multiple customers, it’s M:N. Database Normalization (1/2) EmpNo 101 102 Ename Abigail Bob DeptNo 10 20 DeptName Marketing Purchasing Date Jan 2014 Jan 2014 Expenses $1,000 $500 103 101 102 104 Carolyn Abigail Bob Kevin 10 10 20 30 Marketing Marketing Purchasing R&D Jan 2014 Feb 2014 Feb 2014 Feb 2014 $1,500 $250 $1,000 $900 101 102 103 Abigail Bob Carolyn 10 20 10 Marketing Purchasing Marketing Mar 2014 Mar 2014 Mar 2014 $400 $1,750 $2,000 Database Normalization (1/2) EmpNo Ename DeptNo DeptName Date Expenses 101 Abigail 10 Marketing Jan 2014 $1,000 102 Bob 20 Purchasing Jan 2014 $500 103 Carolyn 10 Marketing Jan 2014 $1,500 101 Abigail 10 Marketing Feb 2014 $250 102 Bob 20 Purchasing Feb 2014 $1,000 104 Kevin 30 R&D Feb 2014 $900 101 Abigail 10 Marketing Mar 2014 $400 102 Bob 20 Purchasing Mar 2014 $1,750 103 Carolyn 10 Marketing Mar 2014 $2,000 There are too many redundant data (Department name, employee names). What if there are millions of records? Normalization : a process to remove redundant data Database Normalization (2/2) DeptNo 10 20 30 DeptName Marketing Purchasing R&D EmpNo 101 102 Ename Abigail Bob DeptNo 10 20 103 104 Carolyn Kevin 10 30 EmpNo 101 102 Date Jan 2014 Jan 2014 Expenses $1,000 $500 103 101 102 Jan 2014 Feb 2014 Feb 2014 $1,500 $250 $1,000 104 101 Feb 2014 Mar 2014 $900 $400 102 103 Mar 2014 Mar 2014 $1,750 $2,000 EmpNo Ename DeptNo Employee M Belong to Make M Expense Date Dept Dept Name 1 ExpID 1 Expense