Data Mining III: Numeric Estimation
Computer Science 105, Boston University
Spring 2012
David G. Sullivan, Ph.D.

Review: Numeric Estimation
• Numeric estimation is like classification learning.
  • it involves learning a model that works like this:
        input attributes → model → output attribute/class
  • the model is learned from a set of training examples that include the output attribute
• In numeric estimation, the output attribute is numeric.
  • we want to be able to estimate its value

Example Problem: CPU Performance
• We want to predict how well a CPU will perform on some task, given the following information about the CPU and the task:
  • CTIME: the processor's cycle time (in nanoseconds)
  • MMIN: minimum amount of main memory used (in KB)
  • MMAX: maximum amount of main memory used (in KB)
  • CACHE: cache size (in KB)
  • CHMIN: minimum number of CPU channels used
  • CHMAX: maximum number of CPU channels used
• We need a model that will estimate a performance score for a given combination of values for these attributes:
        CTIME, MMIN, MMAX, CACHE, CHMIN, CHMAX → model → performance (PERF)

Example Problem: CPU Performance (cont.)
• The data was originally published in a 1987 article in the Communications of the ACM by Phillip Ein-Dor and Jacob Feldmesser of Tel-Aviv University.
• There are 209 training examples. Here are five of them (CTIME through CHMAX are the input attributes; PERF is the class/output attribute):

        CTIME   MMIN    MMAX    CACHE   CHMIN   CHMAX   PERF
        125      256     6000    256     16      128    198
         29     8000    32000     32      8       32    269
         29     8000    32000     32      8       32    172
        125     2000     8000      0      2       14     52
        480      512     8000     32      0        0     67

Linear Regression
• The classic approach to numeric estimation is linear regression.
• It produces a model that is a linear function (i.e., a weighted sum) of the input attributes.
  • example for the CPU data:
        PERF = 0.066 CTIME + 0.0143 MMIN + 0.0066 MMAX + 0.4945 CACHE – 0.1723 CHMIN + 1.2012 CHMAX – 66.48
  • this type of model is known as a regression equation
• The general format of a linear regression equation is:
        y = w1x1 + w2x2 + … + wnxn + c
  where:
  • y is the output attribute
  • x1, …, xn are the input attributes
  • w1, …, wn are numeric weights; linear regression learns these values
  • c is an additional numeric constant, which is also learned

Linear Regression (cont.)
• Once the regression equation is learned, it can estimate the output attribute for previously unseen instances.
• example: to estimate CPU performance for the instance
        CTIME = 480, MMIN = 1000, MMAX = 4000, CACHE = 0, CHMIN = 0, CHMAX = 0, PERF = ?
  we plug the attribute values into the regression equation (a short code version of this computation follows below):
        PERF = 0.066 CTIME + 0.0143 MMIN + 0.0066 MMAX + 0.4945 CACHE – 0.1723 CHMIN + 1.2012 CHMAX – 66.48
             = 0.066 * 480 + 0.0143 * 1000 + 0.0066 * 4000 + 0.4945 * 0 – 0.1723 * 0 + 1.2012 * 0 – 66.48
             = 5.9

Linear Regression with One Input Attribute
• Linear regression is easier to understand when there's only one input attribute, x1.
        [figure: a scatter plot of the training examples, with a fitted line crossing the y-axis at c]
• In that case:
  • the training examples are ordered pairs of the form (x1, y), shown as points in the graph above
  • the regression equation has the form y = w1x1 + c, shown as the line in the graph above
  • w1 is the slope of the line; c is the y-intercept
• Linear regression finds the line that "best fits" the training examples.

Linear Regression with One Input Attribute (cont.)
        [figure: the same scatter plot, with dotted vertical bars connecting each point to the line y = w1x1 + c]
• The dotted vertical bars show the differences between:
  • the actual y values (the ones from the training examples)
  • the estimated y values (the ones given by the equation)
• Why do these differences exist? Because real data is noisy: the examples almost never fall exactly on a single line, so any line can only approximate them.
• Linear regression finds the parameter values (w1 and c) that minimize the sum of the squares of these differences (see the second sketch below).
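A quick check of the worked example above, as a minimal sketch in plain Python (this code is not part of the original slides; the dictionaries simply mirror the regression equation and the unseen instance):

    # Weights and constant from the regression equation for the CPU data.
    weights = {'CTIME': 0.066, 'MMIN': 0.0143, 'MMAX': 0.0066,
               'CACHE': 0.4945, 'CHMIN': -0.1723, 'CHMAX': 1.2012}
    constant = -66.48

    # The previously unseen instance from the example above.
    instance = {'CTIME': 480, 'MMIN': 1000, 'MMAX': 4000,
                'CACHE': 0, 'CHMIN': 0, 'CHMAX': 0}

    # The estimate is the weighted sum of the inputs, plus the constant.
    perf = sum(weights[a] * instance[a] for a in weights) + constant
    print(round(perf, 1))    # prints 5.9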
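To make "minimize the sum of the squares of these differences" concrete, here is a sketch of the standard closed-form least-squares solution for the one-attribute case. The (x1, y) pairs are made-up toy data, not values from the CPU dataset:

    # Find the w1 and c that minimize the sum of squared differences
    # between the actual and the estimated y values.
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]    # toy x1 values
    ys = [2.1, 3.9, 6.2, 8.0, 9.9]    # toy y values

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Closed form: the slope is covariance(x1, y) / variance(x1) ...
    w1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    # ... and the best-fit line passes through the point of means.
    c = mean_y - w1 * mean_x

    print(w1, c)    # roughly 1.97 and 0.11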
Linear Regression with Multiple Input Attributes
• When there are n input attributes, linear regression finds the analogue of a line in (n+1) dimensions: an n-dimensional plane, or hyperplane.
  • here again, it is the one that "best fits" the training examples
• The equation has the form we mentioned earlier:
        y = w1x1 + w2x2 + … + wnxn + c
• Here again, linear regression finds the parameter values (the weights w1, …, wn and the constant c) that minimize the sum of the squares of the differences between the actual and estimated y values (a sketch of this fit appears at the end of these notes).

Linear Regression in Weka
• Use the Classify tab in the Weka Explorer.
• Click the Choose button to change the algorithm.
  • linear regression is in the folder labelled functions
• By default, Weka employs attribute selection, which means it may not include all of the attributes in the regression equation.

Linear Regression in Weka (cont.)
• On the CPU dataset with M5 attribute selection, Weka learns the following equation:
        PERF = 0.0661 CTIME + 0.0142 MMIN + 0.0066 MMAX + 0.4871 CACHE + 1.1868 CHMAX – 66.60
  • it does not include the CHMIN attribute
• To eliminate attribute selection, you can click on the name of the algorithm and change the attributeSelectionMethod parameter to "No attribute selection".
  • doing so produces our earlier equation:
        PERF = 0.066 CTIME + 0.0143 MMIN + 0.0066 MMAX + 0.4945 CACHE – 0.1723 CHMIN + 1.2012 CHMAX – 66.48

Evaluating a Regression Equation
• To evaluate the goodness of a regression equation, we again set aside some of the examples for testing.
  • do not use these examples when learning the equation
  • use the equation on the test examples and see how well it does
• Weka provides a variety of error measures, which are based on the differences between the actual and estimated y values.
  • we want to minimize them
• The correlation coefficient measures the degree of correlation between the actual and estimated y values.
  • it ranges from –1.0 to 1.0; for any model that is at all useful, it falls between 0.0 and 1.0
  • we want to maximize it
  (a sketch of these measures appears at the end of these notes)

Handling Non-Numeric Input Attributes
• We employ numeric estimation when the output attribute is numeric.
• Linear regression – and many other techniques for numeric estimation – also requires that the input attributes be numeric.
• If we have a non-numeric input attribute, it may be possible to convert it to a numeric one.
  • example: if we have a binary attribute (yes/no or true/false), we can convert the two values to 0 and 1

Handling Non-Numeric Input Attributes (cont.)
• There are techniques for numeric estimation that can handle both numeric and non-numeric (categorical) attributes.
• One option (a toy sketch appears at the end of these notes):
  • build a decision tree
  • have each classification be a numeric value that is the average of the values for the training examples in that subgroup
  • the result is called a regression tree
• Another option is to have a separate regression equation for each classification in the tree, based on the training examples in that subgroup.
  • this is called a model tree

Regression Tree for the CPU Dataset
        [figure: the regression tree learned from the CPU dataset]

Regression and Model Trees in Weka
• To build a tree for estimation in Weka, select the M5P algorithm in the trees folder.
  • by default, it builds model trees
  • you can click on the name of the algorithm and tell it to build regression trees instead
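The "best fit" for the multiple-attribute case can be computed directly with a linear-algebra routine. The sketch below assumes NumPy is available and reuses the five CPU training examples shown earlier; note that five examples are far too few to pin down seven parameters (the full dataset has 209), and Weka adds refinements such as attribute selection, so the weights printed here will not match the slides:

    import numpy as np

    # One row per training example (the five CPU examples shown earlier);
    # the columns are CTIME, MMIN, MMAX, CACHE, CHMIN, CHMAX.
    X = np.array([[125,  256,  6000, 256, 16, 128],
                  [ 29, 8000, 32000,  32,  8,  32],
                  [ 29, 8000, 32000,  32,  8,  32],
                  [125, 2000,  8000,   0,  2,  14],
                  [480,  512,  8000,  32,  0,   0]], dtype=float)
    y = np.array([198, 269, 172, 52, 67], dtype=float)

    # Appending a column of ones lets the solver learn the constant c too.
    X1 = np.hstack([X, np.ones((len(X), 1))])

    # Least-squares solution: minimizes the sum of squared differences
    # between the actual and the estimated y values.
    params, *_ = np.linalg.lstsq(X1, y, rcond=None)
    weights, c = params[:-1], params[-1]
    print(weights, c)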
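The error measures and the correlation coefficient from "Evaluating a Regression Equation" can be computed as follows. This sketch again assumes NumPy; the actual and estimated values are made-up numbers standing in for a real test set:

    import numpy as np

    actual    = np.array([198.0, 269.0, 172.0, 52.0, 67.0])   # test-set y values
    estimated = np.array([185.0, 250.0, 200.0, 60.0, 55.0])   # equation's estimates

    diffs = actual - estimated
    mae  = np.mean(np.abs(diffs))           # mean absolute error
    rmse = np.sqrt(np.mean(diffs ** 2))     # root mean squared error

    # correlation coefficient between the actual and estimated values
    r = np.corrcoef(actual, estimated)[0, 1]

    print(mae, rmse, r)    # we want small errors and r close to 1.0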
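Finally, a toy sketch of the regression-tree idea from "Handling Non-Numeric Input Attributes (cont.)": each leaf predicts the average PERF of the training examples that reach it. The single split on CHMAX is invented for illustration; a real learner such as Weka's M5P chooses its splits automatically:

    # Each leaf's prediction is the average output value of its subgroup.
    def leaf_average(examples):
        return sum(e['PERF'] for e in examples) / len(examples)

    # The five CPU examples, reduced to the one attribute we split on.
    examples = [{'CHMAX': 128, 'PERF': 198}, {'CHMAX': 32, 'PERF': 269},
                {'CHMAX': 32, 'PERF': 172}, {'CHMAX': 14, 'PERF': 52},
                {'CHMAX':  0, 'PERF': 67}]

    # An invented split: CHMAX <= 14 goes to one leaf, the rest to the other.
    left  = [e for e in examples if e['CHMAX'] <= 14]
    right = [e for e in examples if e['CHMAX'] > 14]

    print(leaf_average(left), leaf_average(right))    # 59.5 and 213.0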