Data Warehousing • High quality information is vital to good decision making • Data is different from information • Organizations have a lot of data, may not have a lot of information • information gap Data Warehousing • Databases, in an organization, may have been developed in a fragmented manner over time need to reorganize, filter, .. the data before mining it • Most systems are designed for operational processing, not informational processing Data Warehousing • Operational processing: the processing of daily data • Information processing: analysis of summarized data in order to support decision making Data Warehousing • Ex 1: WalMart: what to stock, when, .. • Ex 2: Airlines: how much to overbook a flight by (forecast no-shows, ..) • Ex 3: Amazon: people who bought this book also bought these books Data Warehousing • A lot of transactions are done via the web today • fully automated, lots of data, real time opportunities to mine the data In this chapter • How data warehouses relate to operational systems • 3-tier architecture for data warehousing • Extracting data from operational systems and loading them into a data warehouse • Big data / NoSql • Data mining Data Warehousing: Basic Concepts • A data warehouse is: – – – – subject oriented (organized around key topics) Integrated (formats, names, ..) time variant (time dimension trends, … non-updateable (loaded by operation systems, not by end users) Data Warehousing: Brief History • Improvements in database technology • Advances in computer hardware (cpu, memory, storage, ..) • End user computing • Advances in middleware products (networks, internet, ..) Need for Data Warehousing • Business require an integrated, companywide view of high-quality information • IS department must separate informational from operational systems to improve performance in managing company data Need for a company-wide view • Data scattered around company – – – – – Different formats Different names/synonyms Free data vs structured data Inconsistent data (?) Missing data Operational vs informational systems • Operational system: designed to manage day to day operations; reservation systems, sales transactions, .. • Informational systems: designed to support decision making; trends, data mining, forecasting, planning, .. Informational systems • A data warehouse is centralized • A data warehouse adds value to data by improving its quality and consistency • A data warehouse eliminates the contention/bottleneck for resource use Data Warehouse Architectures • Independent data marts • Dependent data marts • Logical data marts Independent data marts • • • • Figure 9-2: 4 steps from left to right Data is extracted from various sources Data is transformed and integrated Data warehouse = collection of data marts; contains both summary and detailed data • Tools for users Data Mart • A data warehouse that is limited in scope, whose data is generated from either a data warehouse or other source data systems • Independent data mart = data mart filled with data extracted from the operational environment, not from a data warehouse Dependent data marts • Independent data marts have some limitations: • Need separate process for each data mart • Possible consistency issues • Limited drill down capabilities • Scaling costs can be high Dependent data mart • A data mart filled exclusively from an enterprise data warehouse and its reconciled data • EDW = Enterprise Data Warehouse = a centralized, integrated data warehouse that is the entry point and single source of data for decision support applications Dependent data marts • • • • • Figure 9-3: 4 steps from left to right Data is extracted from various sources Data is transformed and integrated EDW and data marts Tools for users, pulling data from EDW and/or data marts Logical data marts and Real time Data Warehousing • Figure 9-4: 3 steps from left to right • Data is extracted from various sources • Data is transformed and integrated and fed real-time to data marts • Tools for users, pulling data from EDW and/or data marts Real time Data Warehouse • An Enterprise Data Warehouse that accepts near-real-time feeds of transactional data from the systems of record, analyzes warehouse data, and in near real-time relays business rules to the data warehouse Real time Data Warehouse • Enable real time data analysis and response • Capture customer data at time of event • Analyze customer behavior and possibly predict customer response • Develop rules for optimizing customer interaction • Take immediate action with customer Real time Data Warehouse • E-commerce, abandoned shopping cart can trigger instant promotional email message • Fraud detection in credit card transaction (unusual pattern detected) triggers call to credit card owner Three-Layer Data Architecture • • • • • Figure 9-5: Operational data (Operational systems) Reconciled data (EDW) Derived data (Data marts) Note: for each data, there is metadata associated with it (describes the data, ..) Data warehouse data • • • • Status vs event data Figure 9-6 Status data = data before and after event Event data = data of the event Data warehouse data • Transient vs periodic data • Figure 9-7 • Transient data = can be overwritten less data needed, but less information • Periodic data = never altered more data needed, but more information Transient vs periodic data • Likes of a post • Do we just store the number of likes (transient)? Can we prevent somebody from voting twice? Probably not • Do we store every like (periodic)? aggregate to get total The Derived Data Layer • Data layer associated with physical and logical data marts (from figure 9.5) • Users interact with this layer for their decision support applications Characteristics of Derived Data • The source of derived data is the reconciled data • Typically, derived data is aggregated (per topic, user group, ..) it looks more like information than data The Star Schema • A simple database design in which dimensional data are separated from fact or event data. A dimensional model is another name for the star schema. The Star Schema • Figure 9.9: a start schema includes: – Dimension Tables (descriptive data) – Fact table (actual data) • ER model looks like a star The Star Schema • Figure 9.10: example of a star schema: – Dimension Tables: product, period, store – Fact table: sales • Primary key in fact table (sales) is a combination of all primary keys in the dimension tables The Star Schema • A row of the product table contains information for a particular product • A row of the period table contains information about a unit of time • A row of the store table contains information about a store The Star Schema • A row of the sales table contains information about the sales of a particular product in a particular product during a particular unit of time: unit sold, dollars sold, and dollars cost • Example with actual data in figure 9.11 Fact table • Can be big: example Sears • Assumption: 1000 stores, 10000 products, 5 years of daily data • 1000 X 100000 * ( 5 * 365) rows Grain of the Fact table • = level of detail of the fact table = intersection of primary keys of the dimension tables • The finer the grain, the bigger the table • Tradeoff between size and information quality/detail Fact table Variations • If we are interested in storing events, we can have a “factless” fact table • it contains foreign key values only; no extra data • Example figure 9.14: an event took place at that time, between these people, at that facility, on that topic (course) BIG DATA • A database whose size strains the ability of RDBMS to capture, manage, and process data efficiently scalability issues • Large scale data gathering and analytics (particularly in web applications) big data challenging for traditional relational database management systems BIG DATA • NewSQL Make SQL based, relational databases more scalable • NoSQL alternative databases to SQL NoSQL databases • NoSQL Stands for Not Only SQL • Less functionality than SQL • Higher performance (fast), more scalable than SQL NoSQL • • • • Non relational No schema Scalable Hash tables and similar data structures NoSQL • No joins • No constraints • No complex transactions (do this but if that, rollback the transaction) NoSQL Data Structures • Hash tables: Key, Value pairs • Key, Document pairs (can retrieve the document via its key, then query the document) • Graph Hash tables • Key, Value pairs • MD Maryland, CA California, NY New York, .. • Jan January, Feb February, .. • originalPostId the whole thread for this original post (this value contains a lot of data)