The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 3: An Introduction To Data Mining (I) Johannes Gehrke johannes@cs.cornell.edu http://www.cs.cornell.edu/johannes Lectures Three and Four • Data preprocessing • Multidimensional data analysis • Data mining • Association rules • Classification trees • Clustering Why Data Preprocessing? • Quality decisions come from quality data. • Problems with real life data: • Data needs to be integrated from different sources • Missing values • Noisy and inconsistent values • Data is not at the right level of aggregation Recall: The Relational Data Model • • A relational database is a set of relations A relation has two components: 1. The relation instance. Basically a table, with rows (also: records, tuples) and columns (also: fields, attributes). Number of records in the relation instance: Cardinality. 2. The relation schema. Specifies name of the relation, plus name and type of each column. • Turing award (Nobel price in CS) for Codd in 1981 for his work on the relational model Example: Customer Relation • Relation schema: Customers (cid: integer, name: string, byear: integer, state: string) • Relation instance: cid 1 2 3 name Jones Smith Smith byear 1960 1974 1950 state NY CA NY Data Integration • Integrate data from multiple sources into a common format for data mining. • Note: A good data warehouse has already taken care of this step. Data Warehouse Server OLTP DBMSs Other Data Sources Extract, clean, transform, aggregate, load, update Data Marts Data Integration (Contd.) Problem: Heterogeneous schema integration • Different attribute names cid 1 2 3 name Jones Smith Smith byear 1960 1974 1950 Customer-ID 1 2 3 state NY CA NY • Different units: Sales in $, sales in Yen, sales in DM Data Integration (Contd.) Problem: Heterogeneous schema integration • Different scales: Sales in dollars versus sales in pennies • Derived attributes: Annual salary versus monthly salary cid 1 2 3 monthlySalary 5000 2400 3000 cid 6 7 8 Salary 50,000 100,000 40,000 Data Integration (Contd.) Problem: Inconsistency due to redundancy • Customer with customer-id 150 has three children in relation1 and four children in relation2 cid 1 numChildren 3 cid 1 numChildren 4 • Computation of annual salary from monthly salary in relation1 does not match “annual-salary” attribute in relation2 cid 1 2 monthlySalary 5000 6000 cid 1 2 Salary 60,000 80,000 Missing Values Often the values of some attributes in a record are not known. Reasons for missing values: • • • • • Attribute does not apply (e.g., maiden name) Inconsistency with other recorded data Equipment malfunction Human errors Attribute introduced recently (e.g., email address) Missing Values: Approaches • Ignore the record • Complete the missing value: • Manual completion: Tedious and likely to be infeasible • Fill in a global constant, e.g., “NULL”, “unknown” • Use the attribute mean or mode • Construct a data mining model that predicts the missing value Noisy Data • Examples: • • • • • • Faulty data collection instruments Data entry problems, misspellings Data transmission problems Technology limitation Inconsistency in naming conventions Duplicate records with different values for a common field Noisy Data: Remove Outliers Noisy Data: Smoothing y Y1 Y1’ X1 x Noisy Data: Normalization • Scale data to fall within a small, specified range • • • • Leave out extreme order statistics Min-max normalization Z-score normalization Normalization by decimal scaling Data Reduction Problem: • Data might not be at the right scale for analysis. Example: Individual phone calls versus monthly phone call usage • Complex data mining tasks might run a very long time. Example: Multi-terabyte data warehouses One disk drive: About 20MB/s Data Reduction: Attribute Selection • Select the “relevant” attributes for the data mining task • If there are k attributes, there are 2k-1 different subsets • Example: {salary,children,byear}, {salary,children}, {salary,byear}, {children,byear}, {salary}, {children}, {byear} • Choice of the right subset depends on: • Data mining task • Underlying probability distribution Attribute Selection (Contd.) • How to choose relevant attributes: • Forward selection: Select greedily one attribute at a time • Backward elimination: Start with all attributes, eliminate irrelevant attributes • Combination of forward selection and backward elimination Data Reduction: Parametric Models • Main idea: • Fit a parametric model to the data (e.g., multivariate normal distribution) • Store the model parameters, discard the data (except for outliers) Parametric Models: Example • Instead of storing (x,y) pairs, store only the x-value. Then recompute the y-value using y = ax + b y x Data Reduction: Sampling • Choose a representative subset of the data • Simple random sampling may have very poor performance in the presence of skew Data Reduction: Sampling (Contd.) Stratified sampling: Biased sampling • Example: Keep population group ratios • Example: Keep minority population group count Data Reduction: Histograms • Divide data into buckets and store average (sum) for each bucket • Can be constructed “optimally” for one attribute using dynamic programming • Example: Dataset: 1,1,1,1,1,1,1,1,2,2,2,2,3,3,4,4,5,5,6,6, 7,7,8,8,9,9,10,10,11,11,12,12 Histogram: (range, count, sum) (1-2,12,16), (3-6,8,36), (7-9,6,48), (10-12,6,66) Histograms (Contd.) • Equal-width histogram • Divides the domain of an attribute into k intervals of equal size • Interval width = (Max – Min)/k • Computationally easy • Problems with data skew and outliers • Example: Dataset: 1,1,1,1,1,1,1,1,2,2,2,2,3,3,4,4,5,5,6,6, 7,7,8,8,9,9,10,10,11,11,12,12 Histogram: (range, count, sum) (1-3,14,22), (4-6,6,30), (7-9,6,48), (10-12,6,66) Histograms (Contd.) • Equal-depth histogram • Divides the domain of an attribute into k intervals, each containing the same number of records • Variable interval width • Computationally easy • Example: Dataset: 1,1,1,1,1,1,1,1,2,2,2,2,3,3,4,4,5,5,6,6, 7,7,8,8,9,9,10,10,11,11,12,12 Histogram: (range, count, sum) (1,8,8), (2-4,8,22), (5-8,8,52), (9-12,8,84) Data Reduction: Discretization • Same concept as histograms • Divide domain of a numerical attribute into intervals. • Replace attribute value with label for interval. • Example: • Dataset (age; salary): (25;30,000),(30;80,000),(27;50,000), (60;70,000),(50;55,000),(28;25,000) • Discretized dataset (age, discretizedSalary): (25,low),(30,high),(27,medium),(60,high), (50,medium),(28,low) Discretization: Natural Hierarchies • Replace low-level concepts with high-level concepts • Example replacements: • Product SKU by category • City by state Industry Country Category State Product City Year Quarter Month Week Day Data Reduction: Aggregation • Natural hierarchies on attributes can be used to aggregate data along the hierarchy Industry Country Category State Product City Year Quarter Month Week Day Aggregation: Example cid 1 1 1 1 1 2 2 2 2 2 3 4 4 Date 3/1/2000 3/5/2000 3/29/2000 4/2/2000 4/6/2000 3/2/2000 3/5/2000 3/10/2000 3/11/2000 4/7/2000 4/2/2000 3/17/2000 4/25/2000 Amount $150 $50 $200 $300 $200 $100 $250 $100 $50 $200 $300 $250 $100 cid Month Year Amount Visits 1 3 2000 $400 3 1 3 2000 $500 2 2 3 2000 $500 4 2 4 2000 $200 1 3 4 2000 $300 1 4 3 2000 $250 1 4 4 2000 $100 1 Data Reduction: Other Methods • Principal component analysis • Fourier transformation • Wavelet transformation Data Preprocessing: Summary • Problems during data integration: • • • • • Different attribute names Different units Different scales Derived attributes Redundant data • Missing values • Imputation • Prediction • Noisy data: • Outlier removal • Smoothing • Normalization • Data Reduction: • • • • • • Attribute selection Fitting parametric models Sampling Histograms Discretization Aggregation Data Analysis • Data preprocessing • Multidimensional data analysis • Data mining • Association rules • Classification trees • Clustering Multidimensional Data Analysis Recall: Transactions(ckey, timekey, pkey, units, price) Customers(ckey, cid, name, byear, city, state, country) Time(tkey, day, month, quarter, year) Products(pkey, pname, price, pid, category, industry) Hierarchies on dimensions: Industry Country Category State Product City Year Quarter Month Week Day Multidimensional Data Analysis NY CA WI Industry1 $1000 $2000 $1000 Industry2 $500 $1000 $500 Industry3 $3000 $3000 $3000 Industry Country=“USA” Category State Product City Year Quarter Month Week Day Corresponding Query in SQL • SELECT SUM(units) FROM Transactions T, Products P, Customers C WHERE T.pkey = P.pkey AND T.ckey = C.ckey AND C.country = “USA” GROUP BY P.industry, C.state • We think that Industry3 in CA is interesting. Industry Country=“USA” Category State Product City Year Quarter Month Week Day Slice and Drill-Down Category1 San Francisco $300 San Jose $300 Los Angeles $400 Category2 $300 $300 $400 Category3 $100 $800 $100 Industry=“Industry3” Country Category State=“CA” Product City Year Quarter Month Week Day Corresponding Query in SQL • SELECT SUM(units) FROM Transactions T, Products P, Customers C WHERE T.pkey = P.pkey AND T.ckey = C.ckey AND P.industry = “Industry3” AND C.state = “CA” GROUP BY P.category, C.city • We think that Category3 is interesting. Industry=“Industry3” Country Category State=“CA” Product City Year Quarter Month Week Day Slice and Drill-Down Product1 San Francisco $20 San Jose $160 Los Angeles $20 Product2 $20 $160 $20 Product3 $60 $480 $60 Industry Country Category=“Category3” State=“CA” Product City Year Quarter Month Week Day Corresponding Query in SQL • SELECT SUM(units) FROM Transactions T, Products P, Customers C WHERE T.pkey = P.pkey AND T.ckey = C.ckey AND C.state = “CA” AND P.category = “Category3” GROUP BY P.product, C.city • Nothing new in this view of the data. Industry Country Category=“Category3” State=“CA” Product City Year Quarter Month Week Day Pivot To (City, Year) 1997 San Francisco $20 San Jose $100 Los Angeles $20 1998 $20 $600 $20 1999 $60 $100 $60 Industry Country Category=“Category3” State=“CA” Product City Year Quarter Month Week Day Corresponding Query in SQL • SELECT SUM(units) FROM Transactions T, Products P, Customers C WHERE T.pkey = P.pkey AND T.ckey = C.ckey AND C.state = “CA” AND P.category = “Category3” GROUP BY C.city, T.year Industry Country Category=“Category3” State=“CA” Product City Year Quarter Month Week Day Multidimensional Data Analysis Set of data manipulation operators • Roll-up: Go up one step in a dimension hierarchy. Example: month -> quarter • Drill-down: Go down one step in a dimension hierarchy. Example: quarter -> month • Slice: Select a subset of the values of a dimension. Example: All categories -> only Category3 • Dice: Select all values of a dimension. Example: Only Category3 -> all categories • Pivot: Select new dimensions to visualize the data. Example: Pivot to Time(quarter) and Customer(state) Visual Intuition: Cube roll-up to category Customer Data Mart roll-up to state SH SF Product LA Product1 Product2 Product3 Product4 Product5 Product6 20 30 20 15 10 50 roll-up to week M T W Th F S S Time 50 Units of Product6 sold on Monday in LA OLAP Server Architectures • Relational OLAP (ROLAP) • Relational DBMS stores data mart • OLAP middleware: • Aggregation and navigation logic • Optimized for DBMS in the background, but slow and complex • Basically only one vendor: Microstrategy • Multidimensional OLAP (MOLAP) • Specialized array-based storage structure • Vendors: Hyperion (Essbase), Appix (iTM1), Oracle, Microsoft OLAP Server Architectures • Desktop OLAP (DOLAP) • Performs OLAP directly at your PC • Vendors: Cognos (Powerplay), Business Objects, Brio Technology, Hummingbird • Hybrids and Application OLAP • More: www.olapreport.com The OLAP Market Revenues For The OLAP Market (OLAP Report 11/1998) 3.5 Billion $ 3.0 2.5 2.0 Revenue 1.5 1.0 0.5 0.0 1994 1995 1996 1997 1998 1999 2000 Enterprise Reporting Tools Market Size Forecast for Enterprise Reporting Tools (IDC 11/1998) 1000 Million $ 800 600 Market Size 400 200 0 1997 1998 1999 2000 2001 2002 Summary: Multidimensional Analysis • Spreadsheet style data analysis • Roll-up, drill-down, slice, dice, and pivot your way to interesting cells in the CUBE • Mainstream technology • Established enterprises already have OLAP installations • When establishing your e-business, OLAP will be your first step in data analysis Data Mining Definition Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data. Example pattern (Census Bureau Data): If (relationship = husband), then (gender = male). 99.6% Definition (Cont.) Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data. Valid: The patterns hold in general. Novel: We did not know the pattern beforehand. Useful: We can devise actions from the patterns. Understandable: We can interpret and comprehend the patterns. Why Use Data Mining Today? Human analysis skills are inadequate: • Volume and dimensionality of the data • High data growth rate Availability of: • • • • Data Storage Computational power Off-the-shelf software An Abundance of Data • • • • • • • • • Supermarket scanners, POS data Preferred customer cards Credit card transactions Direct mail response Call center records Web server logs Customer web site trails ATM machines Demographic data Evolution of Database Technology • 1960s: IMS and network DBMS • 1970s: The relational data model, small relational DBMS implementations • 1980s: Maturing RDBMS, application-specific DBMS (spatial data, scientific data, images, etc.) • 1990s: Mature, high-performance RDBMS technology, parallel DBMS, terabyte data warehouses, object-relational DBMS, middleware and web technology • 2000s: High availability, maintainability, seamless integration into business processes Computational Power • Moore’s Law: In 1965, Intel Corp. cofounder Gordon Moore predicted that the density of transistors in an integrated circuit would double every year. • Later changed to reflect 18 months progress. • Experts on ants estimate that there are 1016 to 1017 ants on earth. In the year 1997, we produced one transistor per ant. OTS-Software for Data Mining • • • • • • • ANGOSS KnowledgeStudio IBM Intelligent Miner Metaputer PolyAnalyst SAS Enterprise Miner SGI Mineset SPSS Clementine Many others • More at http://www.kdnuggets.com/software Why Use Data Mining Today? And … competitive pressure! “The secret of success is to know something that nobody else knows.” Aristotle Onassis • Competition on service, not only on price (Banks, phone companies, hotel chains, rental car companies) • Personalization • Your competitors already apply data mining The Knowledge Discovery Process Steps: 1. Identify business problem 2. Data preprocessing • • • • • Data selection: Identify target datasets and relevant fields Data cleaning: Remove noise and outliers Data transformation Create common units Generate new fields 3. Data mining and model evaluation 4. Deployment and measurement of the results Preprocessing and Mining Knowledge Patterns Target Data Preprocessed Data Interpretation Model Construction Original Data Preprocessing Data Integration and Selection What is a Data Mining Model? A data mining model is a description of a specific aspect of a dataset. It produces output values for an assigned set of input values. Examples: • Linear regression model • Classification model • Clustering Data Mining Models (Contd.) A data mining model can be described at two levels: • Functional level: • Describes model in terms of its intended usage. Examples: Classification, clustering • Representational level: • Specific representation of a model. Example: Log-linear model, classification tree, nearest neighbor method. • Black-box models versus transparent models Data Mining Techniques • Supervised learning • Classification and regression • Unsupervised learning • Clustering • Dependency modeling • Associations, summarization, causality • Outlier and deviation detection • Trend analysis and change detection Fields Related to Data Mining • • • • • Database systems Machine learning Statistics Visualization Parallel computing Traditional Analysis Tools • Predefined set of reports • Capture most important known business questions • Generated at fixed points in time • No support for impromptu queries • SQL: “Write your own SQL query.” • Statistics department within a company • Slow reaction to users needs • Limited understanding of the business questions • Problems communicating the findings Example Application: Sports IBM Advanced Scout analyzes NBA game statistics • Shots blocked • Assists • Fouls • http://www.research.ibm.com/scout/home.html Advanced Scout • Example pattern: An analysis of the data from a game played between the New York Knicks and the Charlotte Hornets revealed that “When Glenn Rice played the shooting guard position, he shot 5/6 (83%) on jump shots." • Pattern is interesting: The average shooting percentage for the Charlotte Hornets during that game was 54%. Example Application: Sky Survey • Input data: 3 TB of image data with 2 billion sky objects, took more than six years to complete • Goal: Generate a catalog with all objects and their type • Method: Use decision trees as data mining model • Results: • 94% accuracy in predicting sky object classes • Increased number of faint objects classified by 300% • Helped team of astronomers to discover 16 new high red-shift quasars in one order of magnitude less observation time Gold Nuggets? • Investment firm mailing list: Discovered that old people do not respond to IRA mailings • Bank clustered their customers. One cluster: Older customers, no mortgage, less likely to have a credit card • “Bank of 1911” • Customer churn example Market Basket Analysis • Consider shopping cart filled with several items • Market basket analysis tries to answer the following questions: • Who makes purchases? • What do customers buy together? • In what order do customers purchase items? Market Basket Analysis Given: • A database of customer transactions • Each transaction is a set of items • Example: Transaction with TID 111 contains items {Pen, Ink, Milk, Juice} TID 111 111 111 111 112 112 112 113 113 114 114 114 CID 201 201 201 201 105 105 105 106 106 201 201 201 Date 5/1/99 5/1/99 5/1/99 5/1/99 6/3/99 6/3/99 6/3/99 6/5/99 6/5/99 7/1/99 7/1/99 7/1/99 Item Pen Ink Milk Juice Pen Ink Milk Pen Milk Pen Ink Juice Qty 2 1 3 6 1 1 1 1 1 2 2 4 Market Basket Analysis (Contd.) • Coocurrences • 80% of all customers purchase items X, Y and Z together. • Association rules • 60% of all customers who purchase X and Y also buy Z. • Sequential patterns • 60% of customers who first buy X also purchase Y within three weeks. Confidence and Support We prune the set of all possible association rules using two interestingness measures: • Confidence of a rule: • X => Y has confidence c if P(Y|X) = c • Support of a rule: • X => Y has support s if P(XY) = s We can also define • Support of an itemset (a coocurrence) XY: • XY has support s if P(XY) = s Example Examples: • {Pen} => {Milk} Support: 75% Confidence: 75% • {Ink} => {Pen} Support: 100% Confidence: 100% TID 111 111 111 111 112 112 112 113 113 114 114 114 CID 201 201 201 201 105 105 105 106 106 201 201 201 Date 5/1/99 5/1/99 5/1/99 5/1/99 6/3/99 6/3/99 6/3/99 6/5/99 6/5/99 7/1/99 7/1/99 7/1/99 Item Pen Ink Milk Juice Pen Ink Milk Pen Milk Pen Ink Juice Qty 2 1 3 6 1 1 1 1 1 2 2 4 Exercise • Find all itemsets with support >= 75%? TID 111 111 111 111 112 112 112 113 113 114 114 114 CID 201 201 201 201 105 105 105 106 106 201 201 201 Date 5/1/99 5/1/99 5/1/99 5/1/99 6/3/99 6/3/99 6/3/99 6/5/99 6/5/99 7/1/99 7/1/99 7/1/99 Item Pen Ink Milk Juice Pen Ink Milk Pen Milk Pen Ink Juice Qty 2 1 3 6 1 1 1 1 1 2 2 4 Exercise • Can you find all association rules with support >= 50%? TID 111 111 111 111 112 112 112 113 113 114 114 114 CID 201 201 201 201 105 105 105 106 106 201 201 201 Date 5/1/99 5/1/99 5/1/99 5/1/99 6/3/99 6/3/99 6/3/99 6/5/99 6/5/99 7/1/99 7/1/99 7/1/99 Item Pen Ink Milk Juice Pen Ink Milk Pen Milk Pen Ink Juice Qty 2 1 3 6 1 1 1 1 1 2 2 4 Extensions • Imposing constraints • • • • • • • Only find rules involving the dairy department Only find rules involving expensive products Only find “expensive” rules Only find rules with “whiskey” on the right hand side Only find rules with “milk” on the left hand side Hierarchies on the items Calendars (every Sunday, every 1st of the month) Market Basket Analysis: Applications • Sample Applications • • • • • Direct marketing Fraud detection for medical insurance Floor/shelf planning Web site layout Cross-selling Beyond Support and Confidence Example: 5000 students • 3000 students play basketball • 3750 students eat cereal • 2000 students both play basketball and eat cereal Cereal No cereal Sum Basketball No basketball 2000 1750 1000 250 3000 2000 Sum 3750 1250 5000 Misleading Association Rules Basketball No basketball Cereal 2000 1750 No cereal 1000 250 Sum 3000 2000 Sum 3750 1250 5000 • Basketball => Cereal (support: 40%, confidence: 66.7%) is misleading because 75% of students eat cereal • Basketball => No cereal (support: 20%, confidence: 33.3%) is more interesting, although with lower support and confidence Interest Interest of rule A => B: P(AB)/(P(A)*P(B)) • Symmetric (uses both P(A) and P(B)) • Note that confidence is not symmetric (confidence of rule A => B: P(AB)/P(A)) Interest values: • Interest = 1: A and B are independent (P(AB)=P(B)*P(A)) • Interest > 1: A and B are positively correlated • Interest < 1: A and B are negatively correlated Interest: Example Basketball No basketball Cereal 2000 1750 No cereal 1000 250 Sum 3000 2000 Itemset Cereal, basketball Cereal, no basketball No cereal, basketball No cereal, no basketball Sum 3750 1250 5000 Support Interest 40% 35% 20% 5% 1.125 0.857 0.750 2.000 Extensions • Imposing constraints • • • • • • • Only find rules involving the dairy department Only find rules involving expensive products Only find “expensive” rules Only find rules with “whiskey” on the right hand side Only find rules with “milk” on the left hand side Hierarchies on the items Calendars (every Sunday, every 1st of the month) Applications • Sample Applications • • • • • Direct marketing Fraud detection for medical insurance Floor/shelf planning Web site layout Cross-selling • Case study Summary • Data preprocessing • Quality analysis comes from quality data • Multidimensional data analysis • Interactive, spreadsheet-style data analysis • Data mining • The knowledge discovery process • Association rules Questions? (In the fourth lecture: More data mining.)