Data Management & Data Warehouses
MIS 320
Kraig Pencil
Summer 2014
PPT Slides by Dr. Craig Tyran & Kraig Pencil

Game Plan
• Introduction
• Why use a relational database?
• Database management systems
• Data warehouses
• Data mining
• Data marts

A. Why use a relational database?
1. A database sounds great, but why don’t we just store all our data in one big table in an Excel spreadsheet?
  – Example: Can you foresee any hassles or potential difficulties associated with entering/storing order information in the following Excel table?

A. Why use a relational database? (cont.)
1. Why don’t we just store our data in one spreadsheet table? (cont.)
  – Potential problems
    • May have “redundant” data entry
    • Potential for data entry errors (different/wrong phone numbers)
    • Updates can be a hassle/inefficient (e.g., change phone no.)
  – Solution
    • “Normalize” the data … Break up the table into a set of linked tables in a database (instead of having one spreadsheet)
      – See example

Example: Normalized Tables (and the advantages of a database)
Questions:
a) Any unneeded redundancy?
b) Is it now efficient to update customer info?
c) Where is the foreign key?

Example: Non-Normalized Data Table for an Auto Shop (Rainer & Turban, Fig 4.6)
• Examples of redundancy

B. Database Management Systems
1. What is a “database management system” (DBMS)?
  – Software that allows one to create, store, organize, manage, and use data
    • Example of a DBMS?
2. Key components
  – Data Definition subsystem
  – Data Manipulation subsystem
  – Application Generation subsystem
  – Data Administration subsystem
  – DBMS engine

[Figure: DBMS Components, annotated with Lab Tutorials 1, 2; Lab Tutorials 3, 5; Lab Tutorials 4, 6]

B. Database Management Systems (cont.)
3. Examples of DBMS components in Access
  – Data Definition subsystem
    • Data dictionary (“Design view” for a table)
  – Data Manipulation subsystem: Move, change, and “ask questions”
    • View of a table (“Datasheet view” for a table)
    • Query-by-example (QBE) tool
    • Structured query language (SQL)
  – Application Generation subsystem: the “front end”
    • Design of forms and reports
  – Data Administration subsystem
    • Optimize query performance
    • Security settings with password

B. Database Management Systems (cont.)
4. What aspects of data need to be specified?
  – Lots of aspects!!!
    • Recall table creation in MS Access (Tutorials 1 & 2)
  – Common data properties
    • Data “type” (number, text, date, etc.)
    • Description
    • Field size
    • Required/not required
    • Etc.
  – An important reference for a database system: Data dictionary
    • Stores information about the data in a database

Access Example:
• Data “type” (number, text, date, etc.)
• Description
• Field size
• Required/not required
• Information about the “Gender” field is specified in the “Field Properties” section

Access Example: Data Manipulation Subsystem (Low Stock Products query)
• QBE or SQL may be used to prepare a query. Which approach would be easier for most people?

Access Example: Application Generation Subsystem (Employer Information Form)

Access Example: Data Administration (Performance Analysis for a Database)

B. Database Management Systems (cont.)
5. DBMS: Example products
  – You are very likely to work with, and possibly help develop, a database using one or more of the following:
    • Small-Midsize DBMS: Microsoft Access, dBase, Paradox
    • Mid-to-Large DBMS: Microsoft SQL Server, Oracle, DB2, Informix, IMS
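The normalization idea from Section A and the SQL tool from Section B can be tried out in a few lines. The sketch below is only an illustration, not the Access example from the slides: it uses Python's built-in sqlite3 module, and the table names, column names, and sample rows (customers, orders, 'Pat Lee', the phone numbers) are all made up. It shows a customer table linked to an order table through a foreign key, a phone number updated in exactly one place, and a SQL query that joins the two tables back together.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two linked (normalized) tables."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this setting is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE customers (
            customer_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL,
            phone       TEXT NOT NULL
        );
        CREATE TABLE orders (
            order_id    INTEGER PRIMARY KEY,
            -- Foreign key: each order points at exactly one customer row.
            customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
            item        TEXT NOT NULL
        );
    """)
    # Hypothetical sample data: the phone number is stored once, not per order.
    conn.execute("INSERT INTO customers VALUES (1, 'Pat Lee', '555-0100')")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(101, 1, "Widget"), (102, 1, "Gadget")],
    )
    return conn


def orders_with_phone(conn):
    """Join the two tables to rebuild the 'one big table' view on demand."""
    cursor = conn.execute("""
        SELECT o.order_id, c.name, c.phone, o.item
        FROM orders AS o
        JOIN customers AS c ON c.customer_id = o.customer_id
        ORDER BY o.order_id
    """)
    return cursor.fetchall()


conn = build_demo_db()
# One UPDATE changes the phone number for every order of that customer.
conn.execute("UPDATE customers SET phone = '555-0199' WHERE customer_id = 1")
rows = orders_with_phone(conn)
for row in rows:
    print(row)
```

Because the phone number lives only in the customers table, there is no second copy that could be mistyped or left out of date, which is the redundancy problem the single spreadsheet table had.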
C. Data Warehouses
1. Business problem:
  • Difficult for larger organizations to analyze organizational data from multiple sources
  • Solution: Data warehouse
2. Gather/integrate information from existing operational databases into a “warehouse”
  • Create “Business Intelligence” system
  • See next figure

[Figure: Create a Data Warehouse from Operational Databases. From Haag, et al., MIS for the Information Age, 2004]

C. Data Warehouses (cont.)
3. Data warehouse features
  • Designed to support business decision making
    – Not transactions!
  • Supports OLAP
    – On-line Analytical Processing
  • Crosses functional boundaries of an organization
  • Can be very large
  • Note: Warehouse is “read only”
    – Why?
4. Can be a significant strategic resource for a company
  • Can yield a high ROI
  • Examples
    – ???

C. Data Warehouses (cont.)
5. Implementation issues
  • People may be reluctant to share information
  • “ETL” process is not easy
    – Extraction, transformation, load
    – Expensive

D. Data Mining
1. Provides a means to extract patterns and relationships from large amounts of data (e.g., a data warehouse)
2. Mining analogy
  – Sift through raw dirt/rock to find something of value
  – Large volumes of data are sifted in an attempt to find something worthwhile
3. Example: market basket analysis
  – Identify products that may be attractive to a customer
    • See next slide: Amazon.com buyer suggestions

[Figure: Data Mining: Example of pattern discovered via mining]

D. Data Mining (cont.)
4. Identify previously unknown patterns
  – e.g., What are the characteristics of customers likely to default on a bank loan?
  – “Target knows before it shows”
    • How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did
    • How Companies Learn Your Secrets: NYTimes
  – e.g., Suppose you discovered that beer and diapers* were often found in the same purchase?
    • “Market basket analysis”
    • What could you do with that information to improve sales of one, the other, or both?
  * This is a common example, not an actual case.

E. Data Marts
• Warehouses can be overwhelming/difficult to implement … Some organizations create “data marts”
• A subset of a data warehouse
  – Simpler, scaled-down version
  – Focuses on/integrates a specific area (e.g., Sales department)
  – Provides useful decision-making tools
Haggen photo from: www.callhugh.com/ferndale.php
MiniMart photo from: http://www.ae.gatech.edu/research/controls/pictures/f020801_gtar/Mini%20Mart.JPG

[Figure: Data Marts: Subsets of Data Warehouse]

Data Mining – Business Intelligence
• A few videos to watch and think about …
  • http://www.youtube.com/user/SASsoftware?v=C14GVhNt7Do&feature=pyv&ad=4782573666&kw=CRM
  • http://www.youtube.com/user/ibm?#p/c/13/fFdITHMuy2w
  • http://www.youtube.com/user/SASsoftware?v=2677nWVNg9M&feature=pyv&ad=4782551166&kw=business%20analytics
  • http://www.youtube.com/watch?v=El_lSd6G5WU
  • http://www.youtube.com/watch?v=uP89kaDU40c
  • http://www.youtube.com/user/SASsoftware?v=C14GVhNt7Do&feature=pyv&ad=4782573666&kw=CRM#p/u/35/ecqk0JUKvAI

Big Data
• Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. (Wikipedia)
Global Big Data: +2.5 exabytes/day
• The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s
• As of 2012, every day 2.5 quintillion (2.5×10^18) bytes of data were created.
• (Wikipedia)

Big Data
[Chart: Total World Data Storage Capacity (in CDs @ 730MB/CD), 1993 to 2014; vertical axis from 500,000,000,000 to 3,000,000,000,000]
• The next frontier in data?
• http://www.eweek.com/c/a/Data-Storage/Big-Data-Analytics-Is-Just-Starting-to-Reach-Its-Potential-10-Reasons-Why457684/?kc=EWKNLEAU07102012STR1
• Some terms:
  – Hadoop (distributed file organization)
  – Distributed databases and server clusters
  – Cassandra (Not only SQL DBMS)
  – MapReduce (breaking computation into smaller pieces, then combining the results of each computation)
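To make the MapReduce term concrete, here is a minimal sketch of the idea in plain, single-process Python. It is not Hadoop's API and nothing in it comes from the slides: the basket data, the chunk size, and the function names (split_into_chunks, map_chunk, reduce_counts) are all hypothetical. It reuses the market basket example from the data mining section. Each chunk of baskets is counted on its own (the "map" step), and the partial counts are then merged (the "reduce" step).

```python
from collections import Counter
from functools import reduce
from itertools import combinations

# Hypothetical purchase data, made up for illustration only.
BASKETS = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["milk", "bread"],
    ["beer", "chips"],
    ["diapers", "milk", "beer"],
]


def split_into_chunks(items, chunk_size):
    """Break the full data set into smaller pieces."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def map_chunk(baskets):
    """Map step: count item pairs using only the baskets in one chunk."""
    counts = Counter()
    for basket in baskets:
        # Sort so that (beer, diapers) and (diapers, beer) are the same pair.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


chunks = split_into_chunks(BASKETS, 2)
# In a real cluster each chunk could be handled by a different machine;
# here the chunks are simply processed one after another.
partial_counts = [map_chunk(chunk) for chunk in chunks]
pair_counts = reduce(reduce_counts, partial_counts, Counter())

print(pair_counts.most_common(3))
```

The design point is that map_chunk never needs to see the whole data set, so the work can be spread across many servers, and merging the partial counts gives the same totals as counting everything in one pass.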