Chapter 9: Data Warehousing
Jason C. H. Chen, Ph.D.
Professor of MIS, School of Business Administration
Gonzaga University, Spokane, WA 99258
chen@jepson.gonzaga.edu
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems

Objectives
• Definition of terms
• Reasons for information gap between information needs and availability
• Reasons for need of data warehousing
• Describe three levels of data warehouse architectures (ETL)
• Describe two components of star schema
• Estimate fact table size
• Design a data mart
• Develop requirements for a data mart
• OLAP, data mining and its applications

A Solution to the Information Gap
• A solution to bridging the information gap is the data warehouse, which consolidates and integrates information from many different sources and arranges it in a meaningful format for making accurate business decisions.

Two Things to Know About Data Warehousing
• 1. A major factor that drives the need for data warehousing:
  – Businesses need an integrated view of company information.
• 2. Which of the following organizational trends does not encourage the need for data warehousing?
  – a) Multiple, nonsynchronized systems
  – b) Focus on customer relationship management
  – c) Downsizing
  – d) Focus on supplier relationship management
  – Answer: c) Downsizing

Need for Data Warehousing
• Integrated, company-wide view of high-quality information (from disparate databases)
• Separation of operational and informational systems and data (for improved performance)
See Table 9-1: Comparison of Operational and Informational Systems
Data Warehouse Fundamentals
• Data warehouse: a logical collection of information, gathered from many different operational databases, that supports business analysis activities and decision-making tasks
• The primary purpose of a data warehouse is to aggregate information throughout an organization into a single repository for decision-making purposes

Definition
• Data warehouse: a subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes
  – Subject-oriented: organized around key high-level entities of the enterprise (e.g., customers, patients, students, products)
  – Integrated: consistent naming conventions, formats, and encoding structures; drawn from multiple data sources
  – Time-variant: data in the warehouse contain a time dimension so that they may be used to study trends and changes
  – Non-updatable: read-only, periodically refreshed
• Data mart: a data warehouse that is limited in scope; it contains a subset of data warehouse information

History Leading to Data Warehousing
• Improvement in database technologies, especially relational DBMSs
• Advances in computer hardware, including mass storage and parallel architectures
• Emergence of end-user computing with powerful interfaces and tools
• Advances in middleware, enabling heterogeneous database connectivity
• Recognition of difference between operational and informational systems
Issues with Company-Wide View
• Inconsistent key structures
• Synonyms
• Free-form vs. structured fields
• Inconsistent data values
• Missing data
See Figure 9-1 for an example.

Figure 9-1 Examples of heterogeneous data

Database vs. Data Warehouse
[Figure labels: DBMS, Database; Data Mining, Data Warehouse]

Data Marts and the Data Warehouse
• Legacy systems feed data to the warehouse.
• The warehouse feeds specialized information to departments (data marts).
[Figure labels: Legacy Systems; Operational Data Stores; ETL; Organizational Data Warehouse; Finance, Sales, Marketing, and Accounting Data Marts]

The Data Mart is More Specialized
The data mart serves the needs of one business unit, not the organization.
Organizational Data Warehouse      | Data Marts
Corporate                          | Departmentalized
Highly granular data               | Summarized, aggregated data
Normalized design                  | Star join design
Robust historical data             | Limited historical data
Large data volume                  | Limited data volume
Data model driven data             | Requirements driven data
Versatile                          | Focused on departmental needs
General purpose DBMS technologies  | Multi-dimensional DBMS technologies

Organizational Trends Motivating Data Warehouses
• No single system of records
• Multiple systems not synchronized
• Organizational need to analyze activities in a balanced way
• Customer relationship management
• Supplier relationship management

Separating Operational and Informational Systems
• Operational system: a system that is used to run a business in real time, based on current data; also called a system of record
• Informational system: a system designed to support decision making based on historical, point-in-time, and prediction data for complex queries or data-mining applications

Position of the Data Warehouse Within the Organization

Data Warehouse Fundamentals (cont.)
• Extraction, transformation, and loading (ETL): a process that extracts information from internal and external databases, transforms the information using a common set of enterprise definitions, and loads the information into a data warehouse
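The ETL definition above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the two source systems, their field names, and the `warehouse_sales` table are hypothetical and do not come from the slides. An in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical source systems with inconsistent field names and formats.
SALES_SYSTEM = [{"cust": "ACME CORP ", "amt": "120.50", "date": "2024-01-15"}]
BILLING_SYSTEM = [{"customer_name": "acme corp", "amount_cents": 9900, "day": "2024-01-20"}]


def extract():
    """Extract: pull raw rows from each source, tagged with their origin."""
    for row in SALES_SYSTEM:
        yield "sales", row
    for row in BILLING_SYSTEM:
        yield "billing", row


def transform(source, row):
    """Transform: map each source layout onto one common set of definitions."""
    if source == "sales":
        name, amount, day = row["cust"], float(row["amt"]), row["date"]
    else:
        name, amount, day = row["customer_name"], row["amount_cents"] / 100, row["day"]
    return (name.strip().upper(), round(amount, 2), day, source)


def load(conn, rows):
    """Load: write the conformed rows into the (hypothetical) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS warehouse_sales "
        "(customer TEXT, amount REAL, sale_date TEXT, source TEXT)"
    )
    conn.executemany("INSERT INTO warehouse_sales VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, [transform(source, row) for source, row in extract()])

# Once loaded, both systems' rows answer one integrated query.
result = conn.execute(
    "SELECT customer, SUM(amount) FROM warehouse_sales GROUP BY customer"
).fetchall()
```

After the load, the two differently formatted customer records appear under a single customer name, which is the integrated view the warehouse is meant to provide.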
Data Warehouse Architectures
• Independent data mart
• Dependent data mart and operational data store
• Logical data mart and real-time data warehouse
• Three-layer architecture
All involve some form of extraction, transformation, and loading (ETL).

Figure 9-2 Independent data mart data warehousing architecture
• Data marts: mini-warehouses, limited in scope
• Separate ETL for each independent data mart
• Data access complexity due to multiple data marts

Figure 9-3 Dependent data mart with operational data store: a three-level architecture
• ODS provides option for obtaining current data
• Single ETL for enterprise data warehouse (EDW)
• Dependent data marts loaded from EDW
• Simpler data access

Figure 9-4 Logical data mart and real-time warehouse architecture
• ODS and data warehouse are one and the same
• Near real-time ETL for data warehouse
• Data marts are NOT separate databases, but logical views of the data warehouse
• Easier to create new data marts

The ETL Process: another perspective and example
• Capture/Extract (E)
• Scrub or data cleansing
• Transform (T)
• Load and Index (L)
ETL = extract, transform, and load

Capture/Extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
• Static extract = capturing a snapshot of the source data at a point in time
• Incremental extract = capturing changes that have occurred since the last static extract

Scrub/Cleanse: uses pattern recognition and AI techniques to upgrade data quality
• Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
• Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data

Transform = convert data from format of operational system to format of data warehouse
• Record-level:
  – Selection: data partitioning
  – Joining: data combining
  – Aggregation: data summarization
• Field-level:
  – Single-field: from one field to one field
  – Multi-field: from many fields to one, or one field to many

Load/Index = place transformed data into the warehouse and create indexes
• Refresh mode: bulk rewriting of target data at periodic intervals
• Update mode: only changes in source data are written to data warehouse

Information Cleansing or Scrubbing
• An organization must maintain high-quality data in the data warehouse
• Information cleansing or scrubbing: a process that weeds out and fixes or discards inconsistent, incorrect, or incomplete information
• Standardizing customer name from operational systems
• Accurate and complete information

Representation of Data in DW
• Dimensional modeling: a retrieval-based system that supports high-volume query access
  – It not only accommodates but also boosts the processing of complex multidimensional queries.
• Two approaches:
  – 1. Star schema: the most commonly used and the simplest style of dimensional modeling
    • Contains a fact table surrounded by and connected to several dimension tables
    • The fact table contains the measures (numerical values) needed to perform decision analysis and query reporting; foreign keys link it to the dimension tables.
    • Dimension tables contain classification and aggregation information about the values in the fact table (i.e., attributes describing the data contained within the fact table).
  – 2. Snowflake schema: an extension of star schema where the diagram resembles a snowflake in shape

Fact Table vs. Dimension Table
[Figure: two dimension tables, each with a primary key (pk), joined through a fact table whose composite primary key (cpk) is made up of foreign keys (fk); the relationship between the dimensions is many-to-many (M:N)]

Figure 9-5 Three-layer data architecture for a data warehouse

Data Characteristics: Status vs. Event Data
Figure 9-6 Example of DBMS log entry
• Event = a database action (create/update/delete) that results from a transaction
[Figure labels: Status, Event, Status]

Data Characteristics: Transient vs. Periodic Data
Figure 9-7 Transient operational data
• With transient data, changes to existing records are written over previous records, thus destroying the previous data content

Figure 9-8 Periodic warehouse data
• Periodic data are never physically altered or deleted once they have been added to the store
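The difference between transient and periodic data can be sketched in a few lines. This is a minimal illustration, not the mechanism of any particular DBMS: the customer key `C001`, the balances, and the dates are made-up values.

```python
from datetime import date


def update_transient(table, key, new_row):
    """Transient: overwrite in place, so the earlier content is lost."""
    table[key] = new_row


def update_periodic(store, key, new_row, as_of):
    """Periodic: append a dated row and never alter or delete existing rows."""
    store.append({"key": key, "as_of": as_of, **new_row})


operational = {}  # key -> current row only (transient)
warehouse = []    # append-only history (periodic)

# Apply the same two changes to both stores.
for as_of, balance in [(date(2024, 1, 1), 750), (date(2024, 1, 5), 1000)]:
    update_transient(operational, "C001", {"balance": balance})
    update_periodic(warehouse, "C001", {"balance": balance}, as_of)

history = [row["balance"] for row in warehouse if row["key"] == "C001"]
```

The operational store ends up holding only the latest balance, while the periodic store keeps both dated rows, which is what makes trend analysis possible in the warehouse.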
Other Data Warehouse Changes
• New descriptive attributes
• New business activity attributes
• New classes of descriptive attributes
• Descriptive attributes become more refined
• Descriptive data are related to one another
• New source of data

Data Reconciliation
• Typical operational data is:
  – Transient: not historical
  – Not normalized (perhaps due to denormalization for performance)
  – Restricted in scope: not comprehensive
  – Sometimes poor quality: inconsistencies and errors
• After ETL, data should be:
  – Detailed: not summarized yet
  – Historical: periodic
  – Normalized: 3rd normal form or higher
  – Comprehensive: enterprise-wide perspective
  – Timely: data should be current enough to assist decision-making
  – Quality controlled: accurate with full integrity

Derived Data
• Objectives
  – Ease of use for decision support applications
  – Fast response to predefined user queries
  – Customized data for particular target audiences
  – Ad-hoc query support
  – Data mining capabilities
• Characteristics
  – Detailed (mostly periodic) data
  – Aggregate (for summary)
  – Distributed (to departmental servers)
Most common data model = star schema (also called "dimensional model")

Figure 9-9 Components of a star schema
• Fact tables contain factual or quantitative data (numerical values)
• 1:N relationship between dimension tables and fact tables
• Dimension tables are denormalized to maximize performance
• Dimension tables contain descriptions about the subjects of the business (values in the fact table)
• Excellent for ad-hoc queries, but bad for online transaction processing
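The star schema components listed above can be sketched with an in-memory SQLite database. The table names, columns, and sample rows below are illustrative assumptions rather than the contents of the textbook figures. The fact table holds numeric measures plus one foreign key per dimension, and an ad-hoc query joins the fact table to a dimension to aggregate by a descriptive attribute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: surrogate key plus descriptive attributes.
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, description TEXT, category TEXT);
CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE period_dim  (period_key  INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

-- Fact table: one foreign key per dimension plus numeric measures.
CREATE TABLE sales_fact (
    product_key  INTEGER REFERENCES product_dim(product_key),
    store_key    INTEGER REFERENCES store_dim(store_key),
    period_key   INTEGER REFERENCES period_dim(period_key),
    units_sold   INTEGER,
    dollars_sold REAL,
    PRIMARY KEY (product_key, store_key, period_key)
);
""")

conn.executemany("INSERT INTO product_dim VALUES (?, ?, ?)",
                 [(1, "Desk", "Furniture"), (2, "Lamp", "Lighting")])
conn.executemany("INSERT INTO store_dim VALUES (?, ?)",
                 [(1, "Spokane"), (2, "Seattle")])
conn.executemany("INSERT INTO period_dim VALUES (?, ?, ?)",
                 [(1, 2024, 1), (2, 2024, 2)])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, 1, 3, 900.0), (2, 1, 1, 5, 100.0), (1, 2, 2, 2, 600.0)])

# Ad-hoc query: total dollars by product category (fact joined to one dimension).
rows = conn.execute("""
    SELECT p.category, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```

Each dimension row can be referenced by many fact rows (the 1:N relationship), and the fact table's primary key is the combination of its foreign keys.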
Figure 9-10 Star schema example
• Fact table provides statistics for sales broken down by product, period, and store dimensions

Figure 9-11 Star schema with sample data

Surrogate Dimension Keys
• Dimension table keys should be surrogate (non-intelligent and non-business related), because:
  – Business keys may change over time
  – Helps keep track of nonkey attribute values for a given production key
  – Surrogate keys are simpler and shorter
  – Surrogate keys can be same length and format for all keys

Grain of the Fact Table
• Granularity of fact table: what level of detail do you want?
  – Transactional grain: finest level
  – Aggregated grain: more summarized
  – Finer grain gives better market basket analysis capability
  – Finer grain means more dimension tables and more rows in the fact table
  – In Web-based commerce, finest granularity is a click

Duration of the Database
  – Natural duration: 13 months or 5 quarters
  – Financial institutions may need longer duration
  – Older data is more difficult to source and cleanse

Size of Fact Table
• Depends on the number of dimensions and the grain of the fact table
• Number of rows = product of number of possible values for each dimension associated with the fact table
• Example: assume the following for Figure 9-11:
• Total rows calculated as follows (assuming only half the products record sales for a given month):

Break! (Ch. 9)
Exercise #5: a, b, c (p. 422), with the following assumptions:
  1. The length of a fiscal period is one month
  2. The data mart will contain five years of historical data
  3. Approximately 5 percent of the policies experience some type of change each month
  4. There are 8 fields in each record (row)
HW #3: a, b, c (p. 422). Assume one professor per course section.
ALL computations for b and c should be shown to receive credit.

Figure 9-12 Modeling dates
• Fact tables contain time-period data
• Date dimensions are important

Variations of the Star Schema
• Multiple fact tables
  – Can improve performance
  – Often used to store facts for different combinations of dimensions
  – Conformed dimensions
• Factless fact tables
  – No nonkey data, but foreign keys for associated dimensions
  – Used for:
    • Tracking events
    • Inventory coverage

Figure 9-13 Conformed dimensions
• Two fact tables form two (connected) star schemas
• A conformed dimension is associated with multiple fact tables

Figure 9-14a Factless fact table showing occurrence of an event
• No data in fact table, just keys associating dimension records
• Fact table forms an n-ary relationship between dimensions
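Returning to the sizing rule from the Size of Fact Table slide (number of rows = product of the possible values of each dimension), the arithmetic can be sketched as follows. All numbers here are illustrative assumptions chosen for the sketch and are not taken from the slides or from the exercise: 1,000 stores, 10,000 products, 24 monthly periods, half of the possible combinations actually occurring, and 6 fields of 4 bytes per row. The function name `estimate_fact_table` is hypothetical.

```python
import math


def estimate_fact_table(dimension_sizes, fill_fraction, fields_per_row, bytes_per_field):
    """Return (rows, bytes) for a fact table.

    rows  = product of dimension cardinalities, scaled by the fraction of
            combinations that actually occur
    bytes = rows * fields per row * bytes per field
    """
    possible_rows = math.prod(dimension_sizes.values())
    rows = round(possible_rows * fill_fraction)
    return rows, rows * fields_per_row * bytes_per_field


# Illustrative assumptions only (not the slide's or the exercise's figures).
dims = {"store": 1_000, "product": 10_000, "period": 24}
rows, size_bytes = estimate_fact_table(dims, fill_fraction=0.5,
                                       fields_per_row=6, bytes_per_field=4)
size_gb = size_bytes / 1e9
```

Under these assumptions the estimate is 1,000 × 10,000 × 24 × 0.5 = 120,000,000 rows, or about 2.88 GB at 24 bytes per row. The same function applies to any grain: a finer grain adds dimensions or increases their cardinalities, and the row count grows multiplicatively.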
Normalizing Dimension Tables
• Multivalued dimensions
  – Facts qualified by a set of values for the same business subject
  – Normalization involves creating a table for an associative entity between dimensions
• Hierarchies
  – Sometimes a dimension forms a natural, fixed-depth hierarchy
  – Design options:
    • Include all information for each level in a single denormalized table
    • Normalize the dimension into a nested set of 1:M table relationships

Figure 9-15 Multivalued dimension
• Helper table is an associative entity that implements an M:N relationship between dimension and fact.

Figure 9-16 Fixed product hierarchy
• Dimension hierarchies help to provide levels of aggregation for users wanting summary information in a data warehouse.

Slowly Changing Dimensions (SCD)
• Need to maintain knowledge of the past
• One option: for each changing attribute, create a current value field and many old-value fields (multivalued)
• Better option: create a new dimension table row each time the dimension object changes, with all dimension characteristics at the time of change

Figure 9-18 Example of Type 2 SCD Customer dimension table
• The dimension table contains several records for the same customer. The specific customer record to use depends on the key and the date of the fact, which should be between the start and end dates of the SCD customer record.

Figure 9-19 Dimension segmentation
• For rapidly changing attributes (hot attributes), the Type 2 SCD approach creates too many rows and too much redundant data.
• Use segmentation instead.

10 Essential Rules for Dimensional Modeling
• Use atomic facts
• Create single-process fact tables
• Include a date dimension for each fact table
• Enforce consistent grain
• Disallow null keys in fact tables
• Honor hierarchies
• Decode dimension tables
• Use surrogate keys
• Conform dimensions
• Balance requirements with actual data

Other Data Warehouse Advances
• Columnar databases
  – Issue of Big Data (huge volume, often unstructured)
  – Columnar databases optimize storage for summary data of few columns (different need than OLTP)
  – Data compression
  – Sybase, Vertica, Infobright
• NoSQL
  – "Not only SQL"
  – Deals with unstructured data
  – MongoDB, CouchDB, Apache Cassandra

The User Interface: Metadata (data catalog)
• Identify subjects of the data mart
• Identify dimensions and facts
• Indicate how data is derived from enterprise data warehouses, including derivation rules
• Indicate how data is derived from operational data store, including derivation rules
• Identify available reports and predefined queries
• Identify data analysis techniques (e.g., drill-down)
• Identify responsible people

Online Analytical Processing (OLAP) Tools
• The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques
• Relational OLAP (ROLAP)
  – Traditional relational representation
• Multidimensional OLAP (MOLAP)
  – Cube structure
• OLAP operations
  – Cube slicing: come up with a 2-D view of data
  – Drill-down: going from summary to more detailed views
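The two OLAP operations named above can be sketched with plain Python dictionaries. This is a toy model under stated assumptions, not how a ROLAP or MOLAP engine is implemented: the three dimensions, the cell values, and the helper names `slice_cube` and `roll_up` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical cube cells: (product, region, month) -> units sold.
DIMS = ("product", "region", "month")
CUBE = {
    ("Shoes", "West", "Jan"): 10,
    ("Shoes", "West", "Feb"): 12,
    ("Shoes", "East", "Jan"): 7,
    ("Boots", "West", "Jan"): 4,
    ("Boots", "East", "Feb"): 9,
}


def slice_cube(cube, dim, value):
    """Slice: fix one dimension at a value and return the remaining 2-D view."""
    i = DIMS.index(dim)
    return {key[:i] + key[i + 1:]: v for key, v in cube.items() if key[i] == value}


def roll_up(cube, keep):
    """Summarize by the dimensions in `keep`, summing over all the others."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for key, v in cube.items():
        totals[tuple(key[i] for i in idx)] += v
    return dict(totals)


summary = roll_up(CUBE, ["product"])                 # summary view by product
detail = roll_up(CUBE, ["product", "month"])         # drill-down: add the month level
shoes_slice = slice_cube(CUBE, "product", "Shoes")   # 2-D view: region x month
```

Drill-down here is simply the same aggregation with one more dimension kept, which is why the detailed cells always sum back to the summary cell they came from.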
Multidimensional Analysis
• Databases contain information in a series of two-dimensional tables
• In a data warehouse and data mart, information is multidimensional: it contains layers of columns and rows
  – Dimension: a particular attribute of information
• Cube: common term for the representation of multidimensional information

Figure 9-21 Slicing a data cube
[Figure axis labels: Region, Customer]

Figure 9-22 Example of drill-down
• Starting with summary data, users can obtain details for particular cells
[Figure panels: Summary report; Drill-down with color added]

Business Performance Mgmt (BPM)
Figure 9-25 Sample Dashboard
• BPM systems allow managers to measure, monitor, and manage key activities and processes to achieve organizational goals.
• Dashboards are often used to provide an information system in support of BPM.
• Charts like these are examples of data visualization, the representation of data in graphical and multimedia formats for human analysis.

OLAP and its Applications
• What software and function enable you to create OLAP applications?
• Answer: Excel with PivotTable
Multidimensional Analysis
• Data mining: the process of analyzing data to extract information not offered by the raw data alone
• To perform data mining, users need data-mining tools
  – Data-mining tool: uses a variety of techniques to find patterns and relationships in large volumes of information and infers rules that predict future behavior and guide decision making
• An example: a grocery store in the U.K.

CRM and Data Mining (BI) Example
• A grocery store in the U.K. found the following patterns:
  – Every Thursday afternoon
  – Young fathers (why?) shopping at the store
  – Two of the following are always included in their shopping list: diapers and beer
• What other decisions should a store manager make (in terms of store layout)?
• Short term vs. long term
  – This is an example of cross-selling
  – Other types of promotion: up-sell, bundled-sell
• IT (e.g., BI) helps to find valuable information; decision makers then make timely, correct decisions to improve or create competitive advantages.

More on OLTP vs. OLAP
Fig. Extra-a: A simple database with a relation between two tables.
• The figure depicts a relational database environment with two tables.
• The first table contains information about pet owners; the second, information about pets. The tables are related by the single column they have in common: Owner_ID.
• By relating tables to one another, we can reduce redundancy of data and improve database performance.
• The process of breaking tables apart and thereby reducing data redundancy is called normalization.
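To make the pet-owner example concrete, here is a minimal sketch of two normalized tables related through Owner_ID, using an in-memory SQLite database. Only the Owner_ID column comes from the slide; the other column names and all sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Each owner is stored once; pets refer to the owner by Owner_ID.
CREATE TABLE owners (owner_id INTEGER PRIMARY KEY, owner_name TEXT);
CREATE TABLE pets (
    pet_id   INTEGER PRIMARY KEY,
    pet_name TEXT,
    owner_id INTEGER REFERENCES owners(owner_id)
);
""")
conn.executemany("INSERT INTO owners VALUES (?, ?)", [(1, "Pat"), (2, "Lee")])
conn.executemany("INSERT INTO pets VALUES (?, ?, ?)",
                 [(10, "Rex", 1), (11, "Tom", 1), (12, "Bo", 2)])

# Reading pets together with their owners requires relating the two tables.
pets_with_owner = conn.execute("""
    SELECT p.pet_name, o.owner_name
    FROM pets p
    JOIN owners o ON o.owner_id = p.owner_id
    ORDER BY p.pet_id
""").fetchall()
```

The owner name "Pat" is stored once even though two pets refer to it, which is the redundancy reduction that normalization provides. The cost is that any query spanning both tables has to relate them through Owner_ID.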
• Most relational databases that are designed to handle a high number of reads and writes (retrievals and updates of information) are referred to as OLTP (OnLine Transaction Processing) systems.
• OLTP systems are very efficient for high-volume activities such as cashiering, where many items are being recorded via bar code scanners in a very short period of time.
• However, using OLTP databases for analysis is generally not very efficient, because in order to retrieve data from multiple tables at the same time, a query containing joins must be used.
• In order to keep our transactional databases running quickly and smoothly, we may wish to create a data warehouse. A data warehouse is a type of large database (including both current and historical data) that has been denormalized and archived.
• Denormalization is the process of intentionally combining some tables into a single table in spite of the fact that this may introduce duplicate data in some columns.

Fig. Extra-b: A combination of the tables into a single dataset.
• The figure depicts what our simple example data might look like if it were in a data warehouse. When we design databases in this way, we reduce the number of joins necessary to query related data, thereby speeding up the process of analyzing our data.
• Databases designed in this manner are called OLAP (OnLine Analytical Processing) systems.
• Transactional systems and analytical systems have conflicting purposes when it comes to database speed and performance. For this reason, it is difficult to design a single system which will serve both purposes. This is why data warehouses generally contain archived data. Archived data are data that have been copied out of a transactional database.
• Denormalization typically takes place at the time data are copied out of the transactional system. It is important to keep in mind that if a copy of the data is made in the data warehouse, the data may become out-of-synch. This happens when a copy is made in the data warehouse and then later, a change to the original record is made in the source database.
• Data mining activities performed on out-of-synch records may be useless, or worse, misleading.
• An alternative archiving method would be to move the data out of the transactional system. This ensures that data won't get out-of-synch; however, it also makes the data unavailable should a user of the transactional system need to view or update it.

Data Mining
• Knowledge discovery using a blend of statistical, AI, and computer graphics techniques
• Goals:
  – Explain observed events or conditions
  – Confirm hypotheses
  – Explore data for new or unexpected relationships
• Data-mining software includes many forms of AI such as neural networks and expert systems

Data Mining Examples
• A telephone company used a data mining tool to analyze its customer data warehouse. The data mining tool found about 10,000 supposedly residential customers that were spending over $1,000 monthly in phone bills.
• After further study, the phone company discovered that they were really small business owners trying to avoid paying business rates.
• 65% of customers who did not use the credit card in the last six months are 88% likely to cancel their accounts.
• If age < 30 and income <= $25,000 and credit rating < 3 and credit amount > $25,000 then the minimum loan term is 10 years.
• 82% of customers who bought a new TV 27" or larger are 90% likely to buy an entertainment center within the next 4 weeks.

Sustainable Competitive Advantages
• Any sustainable competitive advantages?
• How can an organization sustain its competitive advantage?
• Firms may create or improve their competitive advantages only if they:
  – have the capacity to learn
  – employ a revenue management approach
  – learn to learn and learn to change (life-long learning environment)

Business Intelligence
• Business intelligence: information that people use to support their decision-making efforts
• Principal BI enablers include:
  – Technology
  – People
  – Culture

Working Smarter, Not Harder
• Overlapping human, organizational (culture, process), and technological factors in BI/KM
[Figure labels: People, Organizational Processes, Technology, Knowledge]

Essential Value Propositions for a Successful Company
• Business model
• Core competency
• Execution
  – Set corporate goals and get executive sponsorship for the initiative

Relationship between Organizational Knowledge and Core Competency
[Figure labels: Core competency; A specific business context; Can be transferred and reused efficiently and effectively across functional areas (sharing and collaboration); Best Practices; IT; People; Culture; Organizational knowledge]
BI: Big Data and Data Warehousing
• Two paradigms in BI: Data Warehouse and Big Data.
  – Both compete with each other in turning data into actionable information.
• However, in recent years, the variety and complexity of data have made the data warehouse incapable of keeping up with changing needs.
• Big Data: a new paradigm that the world of IT was forced to develop, not because of the volume of the structured data but because of the variety and the velocity.

Introduction to Big Data Analytics
• Big Data?
  – Not just big!
  – Volume
  – Variety
  – Velocity
  – Structured, unstructured, or in a stream
• Two aspects for studying Big Data:
  – Storing and processing/analyzing Big Data
  – Push computation to the data instead of pushing data to a computing node.