Chapter Two
Principles of data mining

Data Mining Techniques and Applications, 1st edition Hongbo Du ISBN 978-1-84480-891-5 © 2010 Cengage Learning

Chapter Overview
• The process of data mining
• Approaches to data mining
• Categories of data mining problems
• Information patterns to be discovered
• Overview of data mining solutions
• Importance of evaluation
• Undertaking a data mining task in Weka
• Review of basic concepts in statistics and probability

Data Mining Process
[Figure: Input data → Preparing input data → Mining patterns → Post-processing patterns → Output patterns. Legend: a data mining stage; flow of control from one stage to the next stage; flow of control from one stage to the previous stage; repetition of the tasks at one stage.]

Data Mining Process
• Preparation
  – Integrating data
  – Getting necessary data details
  – Selecting relevant features
  – Selecting relevant records
  – Data cleaning
  – Dealing with unknown data
  – Data transformation
  – Formatting data into a form acceptable to the mining tool
[Figure: data sets at successive preparation stages: original data sets, collected data set, target data set, pre-processed data set, formatted data set.]

Data Mining Process
• Mining
  – Determining data mining tasks
  – Assigning roles to data for certain tasks
  – Selecting data mining solution(s) for each task
  – Setting necessary parameters for the solution
  – Collecting result patterns
[Figure: the formatted data set and parameter settings feed the mining solutions, Solution1 (p1, p2, …, pn), Solution2 (t1, t2, …, tr) and Solution3 (w1, w2, …, wm), which produce patterns.]

Data Mining Process
• Post-processing
  – Pattern evaluation
  – Pattern selection
  – Pattern interpretation
[Figure: patterns are accepted or rejected against evaluation criteria; valid patterns are filtered by selection criteria; selected patterns go through pattern interpretation to become knowledge learnt.]

Data Mining Process
• Roles of participants in data mining
  – Participants include:
    • Data miners / data analysts: main participants of a DM project
    • Domain experts: main collaborators of a DM project
    • Decision makers: clients of a DM project
  – Risk of human bias in the discovery process
  – Important roles of domain experts:
    • Pattern interpretation (for usefulness)
    • Pattern evaluation (for significance)
    • Mining options (for suitable tasks, limited)
    • Advice on data pre-processing (for suitable operations, limited)
  – Balancing the strengths of human and machine

Data Mining Approaches
• Hypothesis testing approach
  – Top-down, led by a hypothesis statement
  – Procedure:
    1. Forming a hypothesis statement
    2. Collecting and selecting data of relevance
    3. Conducting data analysis and collecting patterns
    4. Interpreting the patterns to accept/reject the hypothesis
• Discovery approach
  – Bottom-up, without a hypothesis in mind
  – Procedure:
    1. Collecting and preparing data of interest
    2. Conducting data analysis and discovering possible patterns
    3. Evaluating the importance and interestingness of the patterns

Data Mining Approaches
• Discovery approach (cont’d)
  – Directed discovery (supervised learning):
    • Certain aspects of the outcome of the discovery, i.e. the goal, have been specified. The discovery is to find the patterns that satisfy the goal, e.g. patterns relating to the outcome of a class variable.
  – Undirected discovery (unsupervised learning):
    • The goal of the discovery is not specified. The discovery is to find patterns that are significant in some way, e.g. associative links among some attribute values.

Data Mining: Problems & Patterns
• Classification
  – Construct a classification model to determine the class of a given record
[Figure: (a) Model development phase: example data set → model construction method → classification model. (b) Model use phase: unseen data record with undetermined class (input features, class ?) → classification model → data record with the determined class (input features, class Ci).]

Data Mining: Problems & Patterns
• Various forms of classification models
  – Instance space
  – Neural network
  – Decision tree
  – List of ordered classification rules
  – Function (linear regression)
  – Many more …

Data Mining: Problems & Patterns
• Cluster detection
  – Measure similarity among data objects and group them into clusters accordingly
[Figure: input data points → clustering method → cluster memberships of data points.]

Data Mining: Problems & Patterns
• Forms of clustering results
  – Clusters of various shapes
  – Hierarchical clustering results
  – Ellipse-shaped clusters

Data Mining: Problems & Patterns
• Association rule mining
  – Discover significant relationships between data objects
[Figure: association mining method producing rules of the form X → Y.]
• Various associations
  – Between values, e.g. Apple → Coke
  – Between categories of values, e.g. Food → Magazine
  – Between values of attributes, e.g. Married:yes → OwnHouse:yes
  – Over a time period, e.g. year 1: Database → year 2: Data Mining

Data Mining: Problems & Patterns
• An example

StudentID  Gender  Country  Major Subject  Age  TotalUnits  Degree Class
1          M       UK       Computing      22   360         1st Class
2          F       UK       Computing      21   360         2nd Lower
3          M       FRANCE   Psychology     24   345         2nd Lower
4          M       SPAIN    Accounting     23   360         1st Class
5          F       UK       Psychology     22   300         Pass
6          F       USA      History        30   345         2nd Upper
7          M       UK       Computing      35   360         1st Class
8          F       FRANCE   Psychology     25   360         3rd Class
9          F       GERMANY  History        23   360         2nd Upper
10         M       UK       Accounting     22   360         1st Class
11         M       SPAIN    History        20   345         2nd Upper
12         F       UK       Law            45   300         Pass

  – Classification model?
  – Clusters?
  – Association rules?

Data Mining Solutions: An Overview
• Classification solutions
  – Decision tree, e.g. ID3
  – k nearest neighbour (kNN), e.g. PEBLS
  – Rules, e.g. Sequential Cover
  – Bayes' theorem, e.g. Naïve Bayes
  – Artificial neural network
• Clustering solutions
  – Partition-based methods, e.g. K-means
  – Hierarchical methods, e.g. agglomeration
  – Density-based methods, e.g. DBScan
  – Model-based methods, e.g. Expectation-Maximisation
  – Graph-based methods, e.g. Chameleon

Data Mining Solutions: An Overview
• Association rule solutions
  – Greedy methods, e.g. Apriori
  – Graph-based methods, e.g. FP-Growth
  – Methods for various associations
    • Boolean associations
    • Generalised associations (multi-level associations)
    • Quantitative associations (multidimensional associations)
    • Sequential associations (sequential patterns)
Since one type of data mining problem can be transformed into another, some solutions for one type can also be applied to another.

Evaluation of Patterns
• Importance of evaluating result patterns
  – A classification model must be accurate enough to be credible
  – Clusters must genuinely exist
  – Association rules must be strong enough to be believed
  – Data descriptions must be general enough to cover a large part of the data set
How do we evaluate the discovered patterns?

Evaluation of Patterns
• Possible measures of interestingness
  – Objective measures based on data and pattern
    • Conciseness of pattern, e.g. minimum description length
    • Coverage, e.g. coverage for classification rules
    • Reliability, e.g. accuracy of a classification model
    • Peculiarity, e.g. measures of difference from the norm
    • Diversity, e.g. tendency of clusters
  – Subjective measures based on domain knowledge
    • Novelty
    • Surprisingness
    • Usefulness
    • Applicability

Evaluation of Patterns
• Commonly used measures
  – Accuracy rate or error rate for classification models (see section 6.5.1)
    • True positive
    • False positive
    • False negative
  – Quality of clusters (see section 4.5.1)
    • Quality of a cluster
    • Overall quality of all clusters
  – Strengths of associations (see sections 8.1.2 and 8.6)
    • Support
    • Confidence
    • Lift

Data Mining in Weka Explorer
• The roadmap
[Figure: Weka Explorer roadmap showing the Preprocess tab page, the Classify tab page, the Cluster tab page, the Associate tab page and the Tree Visualiser window, with steps marked (1), (2) and (3).]

Data Mining in Weka Explorer
• Preprocess
  – Open data set from different sources
  – Generate random data set
  – Display & edit data
  – Save data set into a file
  – Filters for pre-processing
  – Data summary
  – Selected attribute summary
  – Attribute display, selection & removal from the opened data set
  – Visualise all attributes
  – Selected attribute visualisation
  – Feedback messages

Data Mining in Weka Explorer
• Classify (as an example)
  – Method selection & parameter setting
  – Test option setting
  – Result display window
  – Task list, with a menu of options available on right click
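The Classify workflow (pick a method, pick a test option, read the result) can be mimicked outside Weka to see what the accuracy rate and error rate mean. The following is a minimal Python sketch and not Weka's API. It assumes a majority-class predictor (the idea behind Weka's ZeroR baseline) as the "method" and a fixed train/test split as the "test option", applied to the Degree Class column of the 12-student example. The function names and the split size of 8 are illustrative assumptions.

```python
from collections import Counter

# Degree Class column of the 12-student example, in StudentID order.
labels = [
    "1st Class", "2nd Lower", "2nd Lower", "1st Class", "Pass", "2nd Upper",
    "1st Class", "3rd Class", "2nd Upper", "1st Class", "2nd Upper", "Pass",
]


def majority_class(train_labels):
    """Hypothetical 'method': always predict the most frequent training class."""
    return Counter(train_labels).most_common(1)[0][0]


def split_accuracy(all_labels, train_size):
    """Hypothetical 'test option': train on the first records, test on the rest."""
    train, test = all_labels[:train_size], all_labels[train_size:]
    predicted = majority_class(train)
    correct = sum(1 for actual in test if actual == predicted)
    return correct / len(test)


accuracy = split_accuracy(labels, train_size=8)
error_rate = 1 - accuracy
print(f"accuracy = {accuracy:.2f}, error rate = {error_rate:.2f}")
```

With the first 8 records used for training, the predictor always outputs "1st Class" and gets 1 of the 4 held-out records right, so the accuracy is 0.25 and the error rate 0.75. A baseline this weak is mainly useful as a reference point against which real classification models are compared.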
Data Mining in Weka Explorer
• Classify (as an example)
  – Method list
  – Selecting a specific method
  – Selecting & changing parameters

Data Mining in Weka Explorer
• Visualisation
  – Scatter plot of data objects of different classes
  – An example decision tree

Probability & Statistics: A Brief Review
• Where are probability and statistics used?
  – Patterns found from data are probabilistic in nature
  – In various measures of evaluation, e.g. the confidence measure of association rules
  – In the data exploration stage for better understanding, e.g. maximum, minimum, mean, variance, skewness
  – During the mining process to assist the discovery of patterns, e.g. information gain for decision tree induction
  – As a part of patterns, e.g. naïve Bayes, Gaussian mixture model
  – In comparison of patterns, e.g. a classification model with significantly better accuracy

Probability & Statistics: A Brief Review
• Probability and conditional probability
  – Probability of an event, P(E), and its meaning when P(E) = 0, P(E) = 1 and 0 < P(E) < 1
  – Probabilities of multiple events: P(E and F), and P(E or F) = P(E) + P(F) - P(E and F)
  – Mutually exclusive events: P(E and F) = 0, and therefore P(E or F) = P(E) + P(F)
  – Conditional probability of event E given event F: P(E|F) = P(E and F)/P(F)
  – Independent events: P(E and F) = P(E)P(F), and P(E|F) = P(E)

Probability & Statistics: A Brief Review
• Probability & conditional probability (example)

StudentID  Gender  Country  Major Subject  Age  TotalUnits  Degree Class
1          M       UK       Computing      22   360         1st Class
2          F       UK       Computing      21   360         2nd Lower
3          M       FRANCE   Psychology     24   345         2nd Lower
4          M       SPAIN    Accounting     23   360         1st Class
5          F       UK       Psychology     22   300         Pass
6          F       USA      History        30   345         2nd Upper
7          M       UK       Computing      35   360         1st Class
8          F       FRANCE   Psychology     25   360         3rd Class
9          F       GERMANY  History        23   360         2nd Upper
10         M       UK       Accounting     22   360         1st Class
11         M       SPAIN    History        20   345         2nd Upper
12         F       UK       Law            45   300         Pass

  – P(Gender = M) = 6/12 = 1/2
  – P(Gender = M or Gender = F) = 1
  – P(Gender = M and Gender = F) = 0
  – P(Gender = F | Country = UK) = 1/2 (3 of the 6 UK students are female)

Probability & Statistics: A Brief Review
• Probability distribution of random variables
  – Discrete random variable: $P(X = x)$
  – Continuous random variable: $P(a \le X < b)$
[Figure: probability distribution plots, with areas labelled 68% and 95%.]

Probability & Statistics: A Brief Review
• Basic statistics
  – Sample mean, median and mode:
    $\bar{x} = \frac{\sum x_i}{n}$; for the example, $\bar{x}_{age} = 26$, $median_{age} = 23$, $mode_{age} = 22$
  – Variance and standard deviation:
    $s_x^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$; for the example, $s_{age}^2 = 53.636$ and $s_{age} = 7.324$
  – Skewness:
    $skewness = \frac{3(\bar{x} - Median)}{s_x}$; for the example, $skewness_{age} = \frac{3(26 - 23)}{7.324} = 1.229$

Probability & Statistics: A Brief Review
• Confidence interval estimate
  – The sample mean is only an estimate of the true mean of the data population.
  – Central limit theorem: sample means follow a normal distribution in which:
    a. The mean is the true population mean $\mu_X$
    b. The standard deviation is $\sigma / \sqrt{n}$
  – Based on the central limit theorem, and using the sample standard deviation in place of the true one, the following expression is used to estimate the interval for the true mean at a confidence level of $1 - \alpha$:
    $P\left(\bar{x} - t \frac{s_X}{\sqrt{n}} \le \mu_X \le \bar{x} + t \frac{s_X}{\sqrt{n}}\right) = 1 - \alpha$

Probability & Statistics: A Brief Review
• Confidence interval estimate (example)
For this data set, $n = 12$, $\bar{x}_{age} = 26$ and $s_{age} = 7.324$. At a confidence level of 95%, i.e. $1 - \alpha = 0.95$ and $\alpha/2 = 0.025$, with $n - 1 = 11$ degrees of freedom, $t = 2.201$. The interval estimate is:
    $P\left(26 - 2.201 \times \frac{7.324}{\sqrt{12}} \le \mu_{age} \le 26 + 2.201 \times \frac{7.324}{\sqrt{12}}\right) = 0.95$
The interval is estimated as [21.347, 30.653] at a confidence level of 95%.

Probability & Statistics: A Brief Review
• Hypothesis testing
  – An introduction to statistical inference and statistical significance
  – Procedure:
    a. Forming null and alternative hypotheses
    b. Deciding the level of significance p
    c. Determining a test statistic and calculating its value
    d. Comparing the calculated value against the known critical value and deciding whether the null hypothesis should be rejected

Probability & Statistics: A Brief Review
• Hypothesis testing (example)
  – Assuming $\mu_{age} = 25$
  – Hypotheses:
    Null: $\mu_{age} = 25$
    Alternative: $\mu_{age} \ne 25$
  – Calculating the statistic t as:
    $t = \frac{\bar{x}_{age} - \mu_{age}}{s_{age} / \sqrt{n}} = \frac{26 - 25}{7.324 / \sqrt{12}} = 0.473$
    This is less than $t = 2.201$ for $p/2 = 0.025$ and $n - 1 = 11$.
  – Conclusion: the null hypothesis is not rejected, i.e. the difference between the sample mean and the population mean is insignificant.

Chapter Summary
• The data mining process involves preparation of data, mining of patterns and post-processing of the patterns.
• Top-down and bottom-up approaches are both useful. The discovery approach can be directed or undirected.
• Three main streams of data mining tasks and various forms of patterns and models are introduced.
• Specific solutions are required for specific types of problems.
• The importance of evaluation of patterns must be appreciated.
• The normal procedure for conducting data mining in Weka is explained.
• Some important basic concepts in probability and statistics are reviewed.

References
Read Chapter 2 of Data Mining Techniques and Applications.
Useful further references:
• Han, J. and Kamber, M. (2006), Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann Publishers, Chapter 1
• Berry, M. J. A. and Linoff, G. (2004), Data Mining Techniques: For Marketing, Sales and Customer Relationship Management, 2nd ed., Wiley Computer Publishing, Chapters 1–2
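The worked examples in the statistics review (basic statistics, confidence interval and hypothesis test on the Age column of the student data set) can be checked with a short script. This is a sketch using only Python's standard library. The critical value t = 2.201 is taken from the slides rather than computed, and the variable names are illustrative assumptions.

```python
import math
import statistics

# Age column of the 12-student example, in StudentID order.
ages = [22, 21, 24, 23, 22, 30, 35, 25, 23, 22, 20, 45]
n = len(ages)

# Basic statistics.
mean = statistics.mean(ages)
median = statistics.median(ages)
mode = statistics.mode(ages)
variance = statistics.variance(ages)  # sample variance, divides by n - 1
std_dev = math.sqrt(variance)
skewness = 3 * (mean - median) / std_dev  # formula used in the slides

# Confidence interval for the true mean at the 95% level.
T_CRITICAL = 2.201  # from the slides: alpha/2 = 0.025, n - 1 = 11
std_error = std_dev / math.sqrt(n)
interval = (mean - T_CRITICAL * std_error, mean + T_CRITICAL * std_error)

# Two-sided t-test of the null hypothesis that the population mean is 25.
MU_0 = 25
t_statistic = (mean - MU_0) / std_error
reject_null = abs(t_statistic) > T_CRITICAL

print(f"mean={mean}, median={median}, mode={mode}")
print(f"variance={variance:.3f}, std_dev={std_dev:.3f}, skewness={skewness:.3f}")
print(f"95% interval=[{interval[0]:.3f}, {interval[1]:.3f}]")
print(f"t={t_statistic:.3f}, reject null: {reject_null}")
```

The script reproduces the figures on the slides to three decimal places: variance 53.636, standard deviation 7.324, skewness 1.229, interval [21.347, 30.653] and t = 0.473, so the null hypothesis is not rejected.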