The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 2: Data Warehousing Johannes Gehrke johannes@cs.cornell.edu http://www.cs.cornell.edu/johannes Overview • Conceptual design • Querying relational data • Dimensional data modeling: OLTP versus decision support We will design and analyze a data mart with click-stream data as an illustrative example. The Database Design Process • Requirement analysis • Conceptual design using the entityrelationship (ER) model • Schema refinement • Normalization • Physical tuning Overview of Database Design • Conceptual design: (ER Model is used at this stage.) • What are the entities and relationships in the enterprise? • What information about these entities and relationships should we store in the database? • What are the integrity constraints or business rules that hold? • A database `schema’ in the ER Model can be represented pictorially (ER diagrams). • Can map an ER diagram into a relational schema. ER Model Basics name • Entity: ssn lot Real-world object distinguishable from Employees other objects. An entity is described (in DB) using a set of attributes. • Entity Set: A collection of similar entities. E.g., all employees. • All entities in an entity set have the same set of attributes. • Each entity set has a key. • Each attribute has a domain. ER Model Basics (Contd.) • Relationship: Association among two or more entities. E.g., Johannes works in the computer science department. • Relationship set: Collection of similar relationships. since name ssn dname lot Employees did Works_In budget Departments ER-Model Basics (Contd.) • An n-ary relationship set R relates n entity sets E1 ... En; each relationship in R involves entities e1, ..., en • Same entity set could participate in different relationship sets, or in different “roles” in same set. name ssn address Employees supervisor subordinate Reports_To Key Constraints • Consider Works_In: An employee can work in many departments; a department can have many employees. • In contrast, each deptartment has at most one manager, according to the key constraint on Manages. since name ssn dname lot Employees did Manages budget Departments Key Constraints (Contd.) • Several types of key-constraints: 1-to-1 1-to Many Many-to-1 Many-to-Many Key constraints: Examples • Example Scenario 1: An inventory database contains information about parts and manufacturers. Each part is constructed by exactly one manufacturer. • Example Scenario 2: A customer database contains information about customers and sales persons. Each customer has exactly one primary sales person. • What do the ER diagrams look like? Participation Constraints Does every department have a manager? If so, this is a participation constraint: The participation of Departments in Manages is said to be total (vs. partial). (Compare with foreign key constraints.) since name ssn dname did lot Employees Manages Works_In since budget Departments Participation Constraints: Examples • Example Scenario 1 (Contd.): Each part is constructed by exactly one manufacturer. • Example Scenario 2: Each customer has exactly one primary sales person. ER Modeling: Case Study Drugwarehouse.com has offered you a free life-time supply of prescription drugs (no questions asked) if you design its database schema. Given the rising cost of health care, you agree. Here is the information that you gathered: • Patients are identified by their SSN, and we also store their names and age. • Doctors are identified by their SSN, and we also store their names and specialty. • Each patient has one primary care physician, and we want to know since when the patient has been with her primary care physician. • Each doctor has at least one patient. ER Modeling: Summary • After the requirement analysis, the conceptual design develops a high-level description of the data • Main components: • • • • Entities Relationships Attributes Integrity constraints: Key constraints and participation constraints • We covered only a subset Querying Relational Databases • “The” relational query language: SQL Structured Query Language SELECT target-list FROM relation-list WHERE qualifications • relation-list: A list of relation names • target-list: A list of attributes of relations in relation-list • qualification: Comparisons (Attr op const or Attr1 op Attr2, where op is one of <,>,=,<=,>=,<>) combined using AND, OR and NOT. Recall: Customer Relation • Relation schema: Customers(cid: integer, name: string, byear: integer, state: string) • Relation instance: cid 1 2 3 name Jones Smith Smith byear 1960 1974 1950 state NY CA NY Example Query • Example Schema: Customers( cid: integer, name: string, byear: integer, state: string) • Query: SELECT Customers.cid, Customers.name, Customers.byear, Customers.state FROM Customers WHERE cid = 1950 cid 1 2 3 name Jones Smith Smith byear 1960 1974 1950 state NY CA NY cid 3 name Smith byear 1950 state NY Example Query SELECT Customers.cid, Customers.name, Customers.byear, Customers.state FROM Customers WHERE cid = 1960 cid 1 2 3 name Jones Smith Smith byear 1960 1974 1950 state NY CA NY cid 1 name Jones byear 1960 state NY Range Variables SELECT Customers.cid, Customers.name, Customers.byear, Customers.state FROM Customers C WHERE cid = 1960 cid 1 2 3 name Jones Smith Smith byear 1960 1974 1950 state NY CA NY cid 1 name Jones byear 1960 state NY Example Query A range variable is a substitute for a relation name. • Query: SELECT C.cid, C.name FROM Customers C WHERE C.byear = 1960 cid 1 2 3 name Jones Smith Smith byear 1960 1974 1950 state NY CA NY cid 1 name Jones Common Shortcuts • “*” is a shortcut for all fields • Query: SELECT * FROM Customers C WHERE C.name=“Smith” cid 1 2 3 name Jones Smith Smith cid 2 3 name Smith Smith byear 1960 1974 1950 byear 1974 1950 state NY CA NY state CA NY Example Query SELECT C.state FROM Customers C cid 1 2 3 name Jones Smith Smith state NY CA NY byear 1960 1974 1950 state NY CA NY DISTINCT Keyword DISTINCT: Eliminates duplicates in the output Query: SELECT DISTINCT C.state FROM Customers C cid 1 2 3 name Jones Smith Smith byear 1960 1974 1950 state NY CA NY state NY CA Example Query • Query: SELECT C.cid, C.name, C.byear, C.state FROM Customers C WHERE C.name=“Smith” • Answer: cid 2 3 name Smith Smith byear 1974 1950 state CA NY Selection Predicates • Query: SELECT C.cid, C.name, C.byear, C.state FROM Customers C WHERE C.name=“Smith” AND C.state=“NY” • Answer: cid 3 name Smith byear 1950 state NY Combining Relations: Joins SELECT P.pid, P.pname FROM Products P, Transactions T WHERE P.pid = T.tid AND T.tdate < “2/1/2000” pid 1 2 3 4 pname Intel PIII-700 MS Office Pro IBM DB2 Thinkpad 600E price 300.00 500.00 5000.00 5000.00 pid 1 2 4 category hardware software software hardware pname Intel PIII-700 MS Office Pro Thinkpad 600E tid 1 1 2 3 3 tdate 1/1/2000 1/1/2000 1/1/2000 2/1/2000 2/1/2000 cid pid 1 1 1 2 1 4 2 3 2 4 Example Query • Query: “Find the names and ids of customers who have made purchases before February 1, 2000.” • SQL: SELECT C.name, C.id FROM Customers C, Transactions T WHERE C.cid = T.cid AND T.tdate < “2/1/2000” cid 1 2 3 tid 1 1 2 3 3 name Jones Smith Smith byear 1960 1974 1950 tdate 1/1/2000 1/1/2000 1/1/2000 2/1/2000 2/1/2000 state NY CA NY cid pid 1 1 1 2 1 4 2 3 2 4 Example Query • Query: “Find the names and ids of the customers who have purchased MS Office Pro.” • SQL: SELECT C.name, C.id FROM Customers C, Transactions T, Products P WHERE C.cid = T.cid AND T.pid = P.pid AND P.pname = “MS Office Pro” Aggregate Operators • SQL allows computation of summary statistics for a collection of records. • Operators: • • • • • MAX (maximum value) MIN (minimum value) SUM AVG (average) COUNT (distinct number) Example Queries • Query: “Tell me the minimum and maximum price of all products.” • SQL: SELECT MAX(P.price), MIN(P.price) FROM Products P • Query: “How many different products do we have?” • SQL: SELECT COUNT(*) FROM Products GROUP BY and HAVING • Instead of applying aggregate operators to all (qualifying) tuples, apply aggregate operators to each of several groups of tuples. • Example: For each year, show the number of customers who are born that year. • Conceptually, many queries, one query per year. • Suppose we know that years are between 1900 and 2000, we can write 1001 queries that look like this: SELECT COUNT(*) FROM Customers C WHERE C.byear = i Queries With GROUP BY and HAVING • Extended SQL Query Structure: SELECT [DISTINCT] target-list FROM relation-list WHERE tuple-qualification GROUP BY grouping-list HAVING group-qualification Example: GROUP BY • Example: For each year, show the number of customers who are born that year. • SQL: SELECT C.byear, COUNT(*) FROM Customers C GROUP BY C.byear Example • Query: “For each customer, list the price of the most expensive product she purchased.” • SQL: SELECT C.cid, C.name, MAX(P.price) FROM Customers C, Transactions T, Products P WHERE C.cid = T.cid and T.pid = P.pid GROUP BY C.cid, C.name Example • Query: “For each product that has been sold at least twice, output how often it has been sold so far.” • SQL: SELECT P.pid, P.pname, COUNT(*) FROM Products P, Transactions T WHERE P.pid = T.pid GROUP BY P.pid, P.pname HAVING COUNT(*) > 1 Notes on GROUP BY and HAVING SELECT FROM WHERE GROUP BY HAVING [DISTINCT] attribute-list, aggregate-list relation-list record-qualification grouping-list group-qualification • The query generates one output record per group. A group is a set of tuples that have the same value for all attributes in grouping-list. Notes on GROUP BY and HAVING SELECT FROM WHERE GROUP BY HAVING [DISTINCT] attribute-list, aggregate-list relation-list record-qualification grouping-list group-qualification • The attribute list must be a subset of the grouping-list. Why? Each answer record corresponds to a group, and there must be a single value per group. • The aggregate-list generates one value per group. • What about the group-qualification? Summary: SQL • Powerful query language for relational database systems • End-users usually do not write SQL, but graphical user front-ends generate SQL queries • SQL completely isolates users from the physical structure of the DBMS You can tune your DBMS for performance and your applications do not change (This is physical data independence!) From OLTP To The Data Warehouse • Traditionally, database systems stored data relevant to current business processes • Old data was archived or purged • Your database stores the current snapshot of your business: • • • • Your current customers with current addresses Your current inventory Your current orders My current account balance The Data Warehouse • The data warehouse is a historical collection of all your data for analysis purposes • Examples: • Current customers versus all customers • Current orders versus history of all orders • Current inventory versus history of all shipments • Thus the data warehouse stores information that might be useless for the operational part of your business Terminology • • • • OLTP (Online Transaction Processing) DSS (Decision Support System) DW (Data Warehouse) OLAP (Online Analytical Processing) OLTP Architecture Clients OLTP DBMSs Cash Register Product Purchase Inventory Update DW Architecture Clients Information Sources Data Warehouse Server OLAP Servers MOLAP OLTP DBMSs Analysis Query/Reporting Other Data Extract Clean Sources Transform Aggregate Load Update Data Mining Data Marts ROLAP OLTP Versus Data Warehousing OLTP Data Warehouse Typical user Clerical Management System usage Regular business Analysis Workload Read/Write Read only Types of queries Predefined Ad-hoc Unit of interaction Transaction Query Level of isolation required High Low No of records accessed <100 >1,000,000 No of concurrent users Thousands Hundreds Focus Data in and out Information out The Data Warehouse Market • Market forecast for warehousing tools in 2002: $8 billion (IDC 7/1999) • Revenue forecast for data warehouse front-end tools: 1998 $1.6 billion 1999 $2.1 billion 2000 $2.8 billion (Gartner Group 2/1999) Dimensional Data Modeling • Recall: The relational model. The dimensional data model: • Relational model with two different types of attributes and tables. • Attribute level: Facts (numerical, additive, dependent) versus dimensions (descriptive, independent). • Table level: Fact tables (large tables with facts and foreign keys to dimensions) versus dimension tables (small tables with dimensions). Dimensional Modeling (Contd.) • Fact (attribute): Measures performance of a business. • Example facts: • Sales, budget, profit, inventory • Example fact table: • Transactions (timekey, storekey, pkey, promkey, ckey, units, price) • Dimension (attribute): Specifies a fact. • Example dimension: • Product, customer data, sales person, store • Example dimension table: • Customer (ckey, firstname, lastname, address, dateOfBirth, occupation, …) OLTP Versus The Data Warehouse OLTP • Regular relational schema • Update queries change data in the database: One instance of a customer with a unique customerID • Queries return information about the current state of affairs The data warehouse • Dimensional model • Update queries create new records in the database: Several instances of the same customer (with different data), in case the customer moved • Queries return aggregate information about historical facts Example: Dimensional Data Modeling ckey cid name byear state timekey Day ckey timekey pkey Month #units pkey pid pname price Customers: Dimension Table Year $price category Time: Dim. Table Transactions: Fact Table Products: Dim. Table Another View: Star Schema Time Customers Transactions (timekey, storekey, pkey, promkey, ckey, units, price) Promotions Store Products Grain • The grain defines the level of resolution of a single record in the fact table. • Example fact tables: • Transactions (timekey, storekey, pkey, promkey, ckey, units, price); grain is individual item • Transactions(timekey, storekey, ckey, units, price); grain is one market basket • CustomerSessions(timekey, ckey, pagekey, sessionkey, sessionSeconds, pagesVisited); what is the grain? Tips • Fact tables are usually very large; they can grow to several hundred GB and TB • Dimension tables are usually smaller (although can grow large, e.g., Customers table), but they have many fields • Queries over fact tables usually involve many records • Indexes usually increase the size of each table a factor of three or four Typical Queries • SQL: SELECT FROM WHERE GROUP BY HAVING D1.d1, …, Dk.dk, agg1(F.f1,) Dimension D1, …, Dimension Dk, Fact F D1.key = F.key1 AND … AND Dk.keyk = F.keyk AND otherPredicates D1.d1, …, Dk.dk groupPredicates • This query is called a “Star Join”. Example Query • “Break down sales by year and category for the last two years; show only categories with more than $1M in sales.” • SQL: SELECT T.year, P.category, SUM(X.units * X.price) FROM Time T, Products P, Transactions X WHERE T.year = 1999 OR T.year = 2000 GROUP BY T.year, P.category HAVING SUM(X.units * X.price) > 1000000 Design of the Clickstream Schema • Dimensions: • Date. One record for each calendar day. • Time. One record per second (is that accurate enough?). • Customer. One record per customer. Several groups of customers, depending on knowledge about the customer: • Anonymous web site visitor with cookie ID • Know name, address, and customer ID is assigned • Know demographics • Can you think of other dimensions? Schema Design (Contd.) Often used “clickstream” dimensions: • Page. Grain: One record per page or one record per page type? Page(pagekey, pageFunction, pageType, contentsType) • Event. Captures what the users does at a page (enter data, click on a link, etc.). Event(eventkey, eventtype) • Session. Attributes that characterize short-term behavior. Session(sessionkey, sessiontype, sequencetype, context, status) • Referral. Captures how the user arrived at our site. Referral(referralkey, referraltype, url, site, domain) Schema Design (Contd.) • Example clickstream fact table schema for analyzing sessions: Sessionfacts(datekey, timekey, ckey, totalSessionTime, numPages, numItems, orderAmount) • Foreign keys: datekey, timekey, ckey • Facts: totalSessionTime, number of pages visited, number of items ordered, total order amount • Example clickstream fact table schema for analyzing page use: Pagefacts(datekey, timekey, ckey, pagekey, pagetime, numItems) • Foreign keys: datekey, timekey, ckey, pagekey • Facts: pagetime, number of items orderd, Building a Data Warehouse: Tips • A data warehouse is a collection of data marts. A data mart contains one dimensional star schema that captures one business aspect. • Notes: • It is crucial to centralize the logical definition and format of dimensions and facts (political challenge; assign a dimension authority to each dimension). Everything else is a distributed effort throughout your company (technical challenge). • Each data mart will have its own fact table, but we will duplicate dimension tables over several data marts. Summary: Dimensional Design • A dimensional data model is similar to a relational data model, but we plan for historical data • Facts versus dimensions • Queries are in star join format Questions? (In the third lecture: Data analysis)