
Data Mining III: Numeric Estimation
Computer Science 105
Boston University
Spring 2012
David G. Sullivan, Ph.D.
Review: Numeric Estimation
• Numeric estimation is like classification learning.
• it involves learning a model that works like this:
[diagram: input attributes → model → output attribute/class]
• the model is learned from a set of training examples
that include the output attribute
• In numeric estimation, the output attribute is numeric.
• we want to be able to estimate its value
Example Problem: CPU Performance
• We want to predict how well a CPU will perform on some task,
given the following info. about the CPU and the task:
• CTIME: the processor's cycle time (in nanosec)
• MMIN: minimum amount of main memory used (in KB)
• MMAX: maximum amount of main memory used (in KB)
• CACHE: cache size (in KB)
• CHMIN: minimum number of CPU channels used
• CHMAX: maximum number of CPU channels used
• We need a model that will estimate a performance score for
a given combination of values for these attributes.
[diagram: CTIME, MMIN, MMAX, CACHE, CHMIN, CHMAX → model → performance (PERF)]
Example Problem: CPU Performance (cont.)
• The data was originally published in a 1987 article in the
Communications of the ACM by Phillip Ein-Dor and
Jacob Feldmesser of Tel-Aviv University.
• There are 209 training examples. Here are five of them:
input attributes:                                class/output attribute:
CTIME   MMIN    MMAX    CACHE   CHMIN   CHMAX    PERF
125     256     6000    256     16      128      198
29      8000    32000   32      8       32       269
29      8000    32000   32      8       32       172
125     2000    8000    0       2       14       52
480     512     8000    32      0       0        67
Linear Regression
• The classic approach to numeric estimation is linear regression.
• It produces a model that is a linear function
(i.e., a weighted sum) of the input attributes.
• example for the CPU data:
PERF = 0.066CTIME + 0.0143MMIN + 0.0066MMAX +
0.4945CACHE – 0.1723CHMIN + 1.2012CHMAX – 66.48
• this type of model is known as a regression equation
• The general format of a linear regression equation is:
y = w1x1 + w2x2 + … + wnxn + c
where
y is the output attribute
x1, …, xn are the input attributes
w1, …, wn are numeric weights
c is an additional numeric constant
(linear regression learns the values of the weights and the constant)
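• a minimal Python sketch of evaluating an equation of this form (the weights, constant, and input values below are hypothetical, not learned from real data):

    # Sketch: evaluating a linear regression equation y = w1*x1 + ... + wn*xn + c.
    def predict(weights, constant, inputs):
        """Return the weighted sum of the inputs plus the constant."""
        return sum(w * x for w, x in zip(weights, inputs)) + constant

    weights = [2.0, -0.5, 0.25]   # w1, w2, w3 (hypothetical)
    constant = 10.0               # c (hypothetical)
    inputs = [3.0, 8.0, 4.0]      # x1, x2, x3 for one instance

    print(predict(weights, constant, inputs))   # 2*3 - 0.5*8 + 0.25*4 + 10 = 13.0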
Linear Regression (cont.)
• Once the regression equation is learned, it can estimate
the output attribute for previously unseen instances.
• example: to estimate CPU performance for the instance
CTIME   MMIN    MMAX    CACHE   CHMIN   CHMAX    PERF
480     1000    4000    0       0       0        ?
we plug the attribute values into the regression equation:
PERF = 0.066CTIME + 0.0143MMIN + 0.0066MMAX +
0.4945CACHE – 0.1723CHMIN + 1.2012CHMAX – 66.48
= 0.066 * 480 + 0.0143 * 1000 + 0.0066 * 4000 +
0.4945 * 0 – 0.1723 * 0 + 1.2012 * 0 – 66.48
= 5.9
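• a minimal sketch of checking this arithmetic in Python, using the coefficients from the equation above and the instance's attribute values:

    # Coefficients from the regression equation above
    # (order: CTIME, MMIN, MMAX, CACHE, CHMIN, CHMAX).
    weights = [0.066, 0.0143, 0.0066, 0.4945, -0.1723, 1.2012]
    constant = -66.48

    # The previously unseen instance: CTIME=480, MMIN=1000, MMAX=4000, CACHE=0, CHMIN=0, CHMAX=0
    instance = [480, 1000, 4000, 0, 0, 0]

    perf = sum(w * x for w, x in zip(weights, instance)) + constant
    print(round(perf, 1))   # 5.9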
Linear Regression with One Input Attribute
• Linear regression is easier
to understand when there's
only one input attribute, x1.
[graph: the training examples plotted as points, with x1 on the horizontal axis and y on the vertical axis; the fitted line crosses the y-axis at c]
• In that case:
• the training examples are ordered pairs of the form (x1, y)
• shown as points in the graph above
• the regression equation has the form y = w1x1 + c
• shown as the line in the graph above
• w1 is the slope of the line; c is the y-intercept
• Linear regression finds the line that "best fits"
the training examples.
Linear Regression with One Input Attribute (cont.)
[graph: the same training examples and the line y = w1x1 + c, with dotted vertical bars between each point and the line]
• The dotted vertical bars show the differences between:
• the actual y values (the ones from the training examples)
• the estimated y values (the ones given by the equation)
Why do these differences exist? (because the training examples don't all fall exactly on any single line)
• Linear regression finds the parameter values (w1 and c) that
minimize the sum of the squares of these differences.
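• for the one-attribute case, these best-fit values can be computed in closed form from the training examples; a minimal Python sketch with a few hypothetical (x1, y) pairs:

    # Sketch: least-squares fit of y = w1*x1 + c to hypothetical training examples.
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [2.1, 3.9, 6.2, 8.1, 9.8]

    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n

    # w1 minimizes the sum of squared differences; c then follows from the means.
    w1 = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / \
         sum((xi - x_mean) ** 2 for xi in x)
    c = y_mean - w1 * x_mean

    print(w1, c)   # roughly 1.96 and 0.14 for this made-up data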
Linear Regression with Multiple Input Attributes
• When there are n input attributes, linear regression finds
the equation of the analogue of a line (a hyperplane) in (n+1) dimensions.
• here again, it is the one that "best fits" the training examples
• The equation has the form we mentioned earlier:
y = w1x1 + w2x2 + … + wnxn + c
• Here again, linear regression finds the parameter values
(for the weights w1, …, wn and constant c) that minimize
the sum of the squares of the differences between the
actual and predicted y values.
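• numerically, this least-squares problem is usually handed to a library routine; a minimal sketch using numpy on a small hypothetical dataset with two input attributes:

    import numpy as np

    # Hypothetical training data: each row is (x1, x2); y was generated as 3*x1 - 2*x2 + 5.
    X = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 4.0],
                  [4.0, 2.0],
                  [5.0, 5.0]])
    y = np.array([4.0, 9.0, 6.0, 13.0, 10.0])

    # Append a column of ones so the constant c is learned along with the weights.
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])

    # Least-squares solution: minimizes the sum of squared differences.
    params, *_ = np.linalg.lstsq(X1, y, rcond=None)
    print(params)   # approximately [3.0, -2.0, 5.0] = (w1, w2, c)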
Linear Regression in Weka
• Use the Classify tab in the Weka Explorer.
• Click the Choose button to change the algorithm.
• linear regression is in the folder labelled functions
• By default, Weka employs attribute selection, which means
it may not include all of the attributes in the regression equation.
Linear Regression in Weka (cont.)
• On the CPU dataset with M5 attribute selection,
Weka learns the following equation:
PERF = 0.0661CTIME + 0.0142MMIN + 0.0066MMAX +
0.4871CACHE + 1.1868CHMAX – 66.60
• it does not include the CHMIN attribute
• To eliminate attribute selection, you can click on the name
of the algorithm and change the attributeSelectionMethod
parameter to "No attribute selection".
• doing so produces our earlier equation:
PERF = 0.066CTIME + 0.0143MMIN + 0.0066MMAX +
0.4945CACHE – 0.1723CHMIN + 1.2012CHMAX – 66.48
Evaluating a Regression Equation
• To evaluate the goodness of a regression equation,
we again set aside some of the examples for testing.
• do not use these examples when learning the equation
• use the equation on the test examples and
see how well it does
• Weka provides a variety of error measures, which are based on
the differences between the actual and estimated y values.
• we want to minimize them
• The correlation coefficient measures the degree of correlation
between the actual and estimated y values.
• it is between -1.0 and 1.0
• we want to maximize it
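• a minimal sketch of the calculations behind two common error measures (mean absolute error and root mean squared error) and the correlation coefficient, using hypothetical actual and estimated values on a test set:

    import numpy as np

    # Hypothetical actual vs. estimated output values on held-out test examples.
    actual    = np.array([100.0, 150.0,  80.0, 220.0, 60.0])
    estimated = np.array([110.0, 140.0,  90.0, 200.0, 70.0])

    errors = estimated - actual
    mae  = np.mean(np.abs(errors))               # mean absolute error: smaller is better
    rmse = np.sqrt(np.mean(errors ** 2))         # root mean squared error: smaller is better
    corr = np.corrcoef(actual, estimated)[0, 1]  # correlation coefficient: larger is better

    print(mae, rmse, corr)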
Handling Non-Numeric Input Attributes
• We employ numeric estimation when the output attribute
is numeric.
• Linear regression – and many other techniques for numeric
estimation – also require that the input attributes be numeric.
• If we have a non-numeric input attribute, it may be possible
to convert it to a numeric one.
• example: if we have a binary attribute (yes/no or true/false),
we can convert the two values to 0 and 1
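• a minimal sketch of this conversion in Python (the binary attribute name is hypothetical):

    # Hypothetical examples with a binary attribute "ecc" (yes/no).
    examples = [
        {"ctime": 125, "ecc": "yes"},
        {"ctime": 480, "ecc": "no"},
    ]

    # Convert yes/no to 1/0 so the attribute can be used in a regression equation.
    for ex in examples:
        ex["ecc"] = 1 if ex["ecc"] == "yes" else 0

    print(examples)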
Handling Non-Numeric Input Attributes (cont.)
• There are techniques for numeric estimation that can handle
both numeric and non-numeric (categorical) attributes.
• One option:
• build a decision tree
• have each classification be a numeric value that is
the average of the values for the training examples
in that subgroup
• the result is called a regression tree
• Another option is to have a separate regression equation
for each classification in the tree – based on the training
examples in that subgroup.
• this is called a model tree
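• as an illustration of the regression-tree idea (using scikit-learn's decision-tree regressor, not Weka's M5P algorithm), each leaf predicts the average of the training y values that reach it; a minimal sketch on hypothetical data:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # Hypothetical training data: one input attribute, numeric output.
    X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
    y = np.array([5.0, 6.0, 7.0, 50.0, 55.0, 60.0])

    # A shallow regression tree: each leaf predicts the average y of its training examples.
    tree = DecisionTreeRegressor(max_depth=1).fit(X, y)
    print(tree.predict([[2.5], [11.5]]))   # roughly [6.0, 55.0], the two leaf averages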
Regression Tree for the CPU Dataset
Regression and Model Trees in Weka
• To build a tree for estimation in Weka, select the M5P algorithm
in the trees folder.
• by default, it builds model trees
• you can click on the name of the algorithm and tell it
to build regression trees