G54DMT – Data Mining Techniques and Applications
http://www.cs.nott.ac.uk/~jqb/G54DMT
Dr. Jaume Bacardit
jqb@cs.nott.ac.uk

Topic 3: Data Mining
Lecture 5: Regression and Association Rules

Some slides from chapter 5 of Data Mining: Concepts and Techniques by Han & Kamber

Outline of the lecture
• Regression
  – Definition
  – Representations
• Association rules
  – Definition
  – Methods
• Resources

Regression
• Regression problems are supervised problems where the output variable is continuous
• Many techniques with different names are included in this category
  – Regression
  – Function approximation
  – Modelling
  – Curve fitting
• Given an input vector X and a corresponding output y, we want to find a function f such that y' = f(X) is as close as possible to the true y

Evaluating regression
• Supervised learning: we know the true outputs, so we check how far the predictions are from them
  – Mean Absolute Error: MAE = (1/n) · Σᵢ |f(Xᵢ) − yᵢ|
  – Mean Squared Error: MSE = (1/n) · Σᵢ (f(Xᵢ) − yᵢ)²
  – Root Mean Squared Error: RMSE = √MSE

Linear Regression
• Most classic (and most widespread in statistics) type of regression
• f(X) is modelled as
  – y' = w0 + w1·x1 + w2·x2 + … + wn·xn
[Figure: linear regression fit – http://upload.wikimedia.org/wikipedia/en/thumb/1/13/Linear_regression.png/400px-Linear_regression.png]

Linear regression
• Simple but limited in expressive power
  – The same linear model would fit each of the four datasets of Anscombe's quartet (http://en.wikipedia.org/wiki/Anscombe%27s_quartet)

Linear regression
• How do we find W?
  – Many mathematical methods available
    • Least squares (an illustrative code sketch follows the regression slides)
    • Ridge regression
    • Lasso
    • Etc.
  – We can also use some kind of metaheuristic (e.g. a Genetic Algorithm)

Polynomial regression
• More complex and sophisticated functions
  – y = w0 + w1·x + w2·x² + …
  – y = w0 + w1·x1 + w2·x2 + w3·x1·x2 + …
• Now the job is twofold
  – Choosing the correct function (human inspection may help)
  – Adjusting the weights of the model
• Still, would a single mathematical function fit any type of data?

Piece-wise regression
• The input space is partitioned into regions
• A local regression model is generated from the training examples that fall inside each region
  – [Figure: approximating a sine function with linear regressions (Butz, 2010)]

Piece-wise regression
• How to partition the input space
  – Using a series of rules
    • With a (hyper)rectangular condition (XCSF)
    • With a (hyper)ellipsoidal condition (XCSF, LWPR)
    • With a neural condition (XCSF)
  – Using a tree-like structure (CART, M5)
• How to perform the regression process for each local approximation
  – Pick any of the functions discussed before
  – Plus some truly non-linear methods (SVR)

Piece-wise approximation with hyperellipsoids
• Using XCSF (Wilson, 02) with hyperellipsoid conditions (Butz et al., 08)
  – [Figure: test function and XCSF's population (Stalph et al., 2010)]

Other regression methods
• Neural networks
  – An MLP is natively a regression method
    • Classification is done by discretising the output of the network
  – It has been proven that an MLP with enough hidden nodes can approximate any continuous function
• Support Vector Regression
  – As in SVM, depending on the kernel we get linear or non-linear regression
  – The margin specifies a tube around the approximated function. All points inside the tube have their errors ignored
  – Support Vectors are the points that lie outside the tube
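Illustrative code (not part of the original slides): a minimal Python/numpy sketch that fits a least-squares linear model y' = w0 + w1·x1 + w2·x2 and reports the MAE, MSE and RMSE defined in the "Evaluating regression" slide. The synthetic dataset, function names and use of numpy are assumptions for illustration only.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Ordinary least squares: find w minimising ||Xw - y||^2.

    A column of ones is appended so that w[0] plays the role of the intercept w0.
    """
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    w, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return w

def predict(w, X):
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    return X1 @ w

def regression_errors(y_true, y_pred):
    """The three error measures from the 'Evaluating regression' slide."""
    mae = np.mean(np.abs(y_pred - y_true))
    mse = np.mean((y_pred - y_true) ** 2)
    rmse = np.sqrt(mse)
    return mae, mse, rmse

# Illustrative data: y depends linearly on two inputs plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

w = fit_linear_regression(X, y)
mae, mse, rmse = regression_errors(y, predict(w, X))
print("weights:", w)
print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}")
```

The same error measures apply unchanged to any of the other regression models on these slides (polynomial, piece-wise, SVR); only the fitting step differs.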
Association Rules
• Association rule mining looks for patterns of items that frequently appear together in the dataset
• It can use the class label but it does not have to, so we can consider it an unsupervised learning paradigm
• Two types of elements are generated
  – Association rules: they have an antecedent and a consequent
  – Frequent itemsets: they just have an antecedent
• Both antecedent and consequent are logic predicates (generally of conjunctive form)

Association rules mining
• Witten and Frank, 2005 (http://www.cs.waikato.ac.nz/~eibe/Slides2edRev2.zip)

Origin of Association Rules
• These methods were originally employed to analyse shopping carts
• The database is specified as a set of transactions, each of which includes one or more of a set of items
• A frequent itemset is a set of items that appears in many transactions
• These databases are extremely sparse

  Tid   Items
  10    A, C, D
  20    B, C, E
  30    A, B, C, E
  40    B, E

Beers and diapers
• An urban myth about association rules says that, when applied to analyse a very large volume of shopping carts, they discovered a very simple pattern
  – "Customers that buy beer also tend to buy diapers"
• This story has changed through time. You can find an article about it here
• It is a good example of data mining, as it was able to find an unexpected pattern

Why Is Freq. Pattern Mining Important?
• Discloses an intrinsic and important property of data sets
• Forms the foundation for many essential data mining tasks
  – Association, correlation, and causality analysis
  – Sequential, structural (e.g., sub-graph) patterns
  – Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  – Classification: associative classification
  – Cluster analysis: frequent pattern-based clustering
  – Data warehousing: iceberg cube and cube-gradient
  – Semantic data compression: fascicles
  – Broad applications

Evaluation of association rules
• Support
  – Percentage of examples covered by the predicate in the antecedent
  – Applies to both association rules and frequent itemsets
• Confidence
  – Percentage of the examples matched by the antecedent that also match the consequent
  – Only applies to association rules
• Typically, the user specifies a minimum support and confidence and the algorithm finds all rules above the thresholds

Scalable Methods for Mining Frequent Patterns
• The downward closure property of frequent patterns
  – Any subset of a frequent itemset must be frequent
  – If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  – i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
• Scalable mining methods: three major approaches
  – Apriori (Agrawal & Srikant @ VLDB'94)
  – Frequent pattern growth (FP-growth – Han, Pei & Yin @ SIGMOD'00)
  – Vertical data format approach (CHARM – Zaki & Hsiao @ SDM'02)

Apriori: A Candidate Generation-and-Test Approach
• Apriori pruning principle: if there is any itemset which is infrequent, its supersets should not be generated/tested! (Agrawal & Srikant @ VLDB'94; Mannila et al. @ KDD'94)
• Method (an illustrative code sketch follows this slide):
  – Initially, scan the DB once to get the frequent 1-itemsets
  – Generate length (k+1) candidate itemsets from length k frequent itemsets
  – Test the candidates against the DB
  – Terminate when no frequent or candidate set can be generated
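Illustrative code (not part of the original slides): a minimal Python sketch of the candidate-generation-and-test loop described above, run on the toy transaction database used in the worked example on the next slide. The candidate step here simply joins items of frequent k-itemsets and prunes by downward closure, a simplified version of the classic join-and-prune step; all function and variable names are illustrative.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support count} for all frequent itemsets."""
    # 1st scan: count single items and keep the frequent 1-itemsets (L1)
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_sup}
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        # Generate length (k+1) candidates from items in frequent k-itemsets,
        # pruning any candidate that has an infrequent k-subset (downward closure)
        items = sorted({i for s in frequent for i in s})
        candidates = set()
        for combo in combinations(items, k + 1):
            if all(frozenset(sub) in frequent for sub in combinations(combo, k)):
                candidates.add(frozenset(combo))
        # Scan the DB to count candidate support
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:          # candidate contained in the transaction
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_sup}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# Toy transaction database from the example slide, with min support = 2
tdb = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"), frozenset("BE")]
freq = apriori(tdb, min_sup=2)
for itemset, sup in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), "support =", sup)

# Confidence of the rule {B, C} -> {E}: support({B,C,E}) / support({B,C})
conf = freq[frozenset("BCE")] / freq[frozenset("BC")]
print("confidence of {B,C} -> {E} =", conf)   # 2/2 = 1.0
```

Running it reproduces L1, L2 and L3 from the next slide, and the final lines show how support counts also give rule confidence as defined in the evaluation slide.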
The Apriori Algorithm – An Example (min support = 2)

Database TDB (1st scan):
  Tid   Items
  10    A, C, D
  20    B, C, E
  30    A, B, C, E
  40    B, E

C1 (candidate 1-itemsets): {A}:2  {B}:3  {C}:3  {D}:1  {E}:3
L1 (frequent 1-itemsets):  {A}:2  {B}:3  {C}:3  {E}:3

C2 (candidate 2-itemsets): {A,B}  {A,C}  {A,E}  {B,C}  {B,E}  {C,E}
2nd scan:                  {A,B}:1  {A,C}:2  {A,E}:1  {B,C}:2  {B,E}:3  {C,E}:2
L2 (frequent 2-itemsets):  {A,C}:2  {B,C}:2  {B,E}:3  {C,E}:2

C3 (candidate 3-itemsets): {B,C,E}
3rd scan, L3:              {B,C,E}:2

The Apriori Algorithm
• Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent 1-itemsets};
  for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
      increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with support ≥ min_support;
  end
  return ∪k Lk;

Resources
• "The Elements of Statistical Learning" by Hastie et al. contains a lot of detail about statistical regression
• List of regression and association rule methods in KEEL
• Weka also contains both kinds of methods
• Chapter 5 of the Han and Kamber book is all about association rules (Han created the FP-growth method)
• Review of evolutionary algorithms for association rule mining

Questions?