
1
Finding the Gold in Your Data
SAS™ Global Forum Workshop
David A. Dickey
March, 2014
SAS™ and its products are registered trademarks of SAS Institute, Cary, NC
2
The tree method (recursive splitting) divides data into homogeneous rectangular regions. How?
When do we stop splitting? What happens with missing values of X variables?
3
Note: there are several other variables in the data set that are not chosen as splitting variables
anywhere in the tree. Throughout we will use examples, some in SAS software and some using
Enterprise Miner. One tree will also use smoking (yes, no) in a split. Examples next:
4
Apply tree techniques using defaults. Smoking comes into play for young people with high
cholesterol. Blood pressure matters for older people.
Now try some options (Gini splits, Average Square Error Assessment, fixed (N=4) number of
leaves). Youth is protective; smoking does not come into play. Note left leaf, bottom row.
Incidence rate less than overall rate but based on only 51 people!
Missing inputs are handled by placing them alternately on either side of a 2-way split, then
picking the better allocation based on prediction of the target.
5
Left table: Observed
Right table: Expected numbers if independent.
Degrees of freedom = (r-1)(c-1) for an r x c table. The chi-square statistic, the sum over cells of
(Observed - Expected)^2 / Expected, is compared to a chi-square distribution with that many degrees of freedom:
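A minimal SAS sketch of this test (the data set and variable names are hypothetical; the table is assumed pre-tabulated with cell counts in N):

proc freq data=crosstab;
  tables row*column / chisq expected;  /* prints observed counts, expected counts, and the chi-square test */
  weight n;                            /* n = cell count when the data are already tabulated               */
run;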
6
Intuitive explanations of “p-values”:
Hypothesis testing is like a court trial. H0: innocent versus H1: guilty. Do not reject innocence
unless evidence against it is beyond reasonable (p-value < 0.05?) doubt. Example of Chi-square:
Did status (first, second, third class, crew) affect Titanic survival? H0: no vs. H1: yes.
Demo 1: Titanic.sas – adds “Bonferroni Correction” to logworth and considers gender splits.
Try Titanic analysis in Enterprise Miner. Consider status as unordered and consider “gender”
(man, woman, child). Need to set up project and diagram within project.
7
8
Diagram specifies two trees: Default (top) and 4 leaf tree using Gini splits (bottom). These give
the trees shown earlier.
9
Candidate split 1 is gender (2 choices) and candidate 2 is age (87 possibilities). Is that fair? If I
get 87 tries to win a prize and you get only 2, is that OK with you? Some adjustment is needed!
The Bonferroni adjustment multiplies the p-value by the number of tests (the number of potential split points).
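As a toy illustration of the arithmetic (the numbers below are made up, not from the Titanic data):

data adjust;
  p_raw = 0.001;              /* best raw p-value for a candidate input  */
  m     = 87;                 /* number of candidate split points tried  */
  p_adj = min(1, m*p_raw);    /* Bonferroni adjusted p-value             */
  logworth = -log10(p_adj);   /* about 1.06 here                         */
  put p_adj= logworth=;
run;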
10
The way PRUNING is done depends on which of 3 major goals you have in mind:
(1) Ranking (e.g. who are top n candidates? )
(2) Decisions (e.g. is it malignant or benign?)
(3) Prediction (e.g. what is the probability of heart disease given your risk factors?)
Example: A split might be helpful for prediction, moving from a default probability of 0.20 in a parent node
to probabilities 0.07 and 0.32 in the child nodes, but if the goal is decisions, then both children
have less than a 50% chance of defaulting (predict no default), the same as the parent. The split is
not helpful for decisions if we decide based only on which of the 2 outcomes is more likely.
11
For pruning, we may look at the misclassification rate, average squared error, or some other measure of
“diversity”. The area under the ROC curve (discussed later) is the same as the C statistic below.
12
The cost (consequence) of accidentally saying a tumor is malignant when it is really benign may
be quite different from the cost of saying a tumor is benign when it is really malignant. When the
costs differ by decision, they should be combined with the probabilities to make the decision.
Recall that items within a leaf cannot be distinguished further – we’ve used all useful information
to distinguish this leaf from others – so we must make the same decision for all items in a leaf.
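A small sketch of how costs and leaf probabilities combine into an expected-cost decision; the probability and costs below are hypothetical and chosen only to show the arithmetic:

data decide;
  p_malignant = 0.32;   /* leaf probability of malignancy                     */
  cost_fn = 100;        /* cost of calling a malignant tumor benign           */
  cost_fp = 1;          /* cost of calling a benign tumor malignant           */
  exp_cost_benign_call    = p_malignant*cost_fn;      /* expected cost if we say "benign"    */
  exp_cost_malignant_call = (1-p_malignant)*cost_fp;  /* expected cost if we say "malignant" */
  if exp_cost_malignant_call < exp_cost_benign_call then decision = 'malignant';
  else decision = 'benign   ';
  put exp_cost_benign_call= exp_cost_malignant_call= decision=;
run;

With these hypothetical costs the cheaper decision is to call the tumor malignant even though its probability of malignancy is under 50%.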
So far we have been dealing with a categorical (binary in fact) response. What if the response is
continuous? The Chi-square contingency table would no longer be available for decisions about
splitting.
13
Perhaps the two X variables are age and debt to income ratio and the response is a credit card
balance in hundreds of dollars. Consider first split (vertical line in the middle of the plot above).
This divides the region into two groups, each with a mean. Thinking of these (high X1, low X1)
as 2 treatment groups we can do an analysis of variance F test and get its p-value for each
possible split on each variable. Now that we have a p-value, we can proceed just as in the
categorical response case. For example:
Example split on X1 (age), with N = n1 + n2 observations on the response (Balance):

X1 (age) = young: Y1, Y2, Y3, ..., Yn1, with error sum of squares SSE(1)
X1 (age) = old:   Yn1+1, Yn1+2, Yn1+3, ..., YN, with error sum of squares SSE(2)

F numerator (model mean square) = [SS(total) - SSE(1) - SSE(2)] / 1 df
F denominator (error mean square) = MSE = [SSE(1) + SSE(2)] / (N-2) df
p-value = Pr > F
(# possible splits)(p-value) = Bonferroni adjusted p-value
-Log10[(# possible splits)(p-value)] = logworth of split
Keep on splitting as usual.
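A minimal SAS sketch of evaluating one candidate split this way; the data set CREDIT, the variables AGE and BALANCE, and the cutoff 40 are hypothetical. In practice the F test is repeated for every candidate cutoff on every input and the smallest p-value is Bonferroni adjusted as above.

data one_split; set credit;
  side = (age <= 40);      /* 1 = "young" side of the split, 0 = "old" side */
run;
proc glm data=one_split;
  class side;
  model balance = side;    /* one-way ANOVA: model MS / error MS = F, p-value = Pr > F */
run; quit;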
14
Here the prediction Pi is just the mean of the Y values in leaf i. The surface is a step function.
A real data example: cost to society (in units of cost for one fatality) for traffic accidents in
Portugal (personal communication from Dr. G. Torrao). Note: vehicle 1 is the one deemed to
have caused the accident.
15
Variable importance here is the reduction in variance obtained in splits using the given variable.
If a variable is never chosen as a split variable then its importance is 0. Variables that were
runners up for splits can also enter the calculation in a down-weighted fashion. Only two
variables have nonzero importance in the (hypothetical) chemical data here, allowing 3-D
graphics. The response is potency (continuous).
From the EM output we can infer the order and nature of the splits. Note that the seemingly
single blue horizontal surface is actually several slightly different rectangular surfaces at very
similar heights off the “floor” of the plot. Also note that C and D have 0 importance.
16
From this output
we infer this split pattern:
17
We need a function f(X) that stays between 0 and 1 no matter what X is. The logistic function
f(X) = exp(a+bX)/(1+exp(a+bX)) = 1/(exp(-(a+bX))+1) works. Depending on a and b we get
something that looks like a steep (large |b|) or mild (small |b|) ski slope:
Q: How do we find a and b for a given set of data?
A: Find a and b to maximize the likelihood of the observed data.
That is, find values of a and b that maximize the joint probability density for the observed
configuration of points.
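For the record, the quantity being maximized for a binary 0/1 response is the log likelihood (written here in LaTeX form), with f(X) the logistic function above:

\ell(a,b) = \sum_{i=1}^{n} \Big[\, Y_i \log f(X_i) + (1-Y_i)\log\big(1 - f(X_i)\big) \Big]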
18
-2.6 + 0.23 X
DATA Ignition;
INPUT time Y @@;
cards;
 3 0   5 0   7 1   9 0  10 0  11 1  12 1
13 0  14 1  15 1  16 0  17 1  25 1  30 1
 4 .   6 .   8 .  18 .  19 .  20 .  22 .
24 .  26 .  28 .  30 .
;
proc print;
PROC LOGISTIC;
Model Y(event='1') = time;
output out=out1 predicted=Plogistic; run;
proc sort data=out1; by time; run;
19
Proc sgplot data=out1;
reg x=time y=Y;
series x=time y=Plogistic;
run;
Example: Challenger mission.
20
There were 24 missions prior to Challenger. Each mission carries 6 O-rings used to keep the fuel
elements from combining before they should. Upon return, O-rings are inspected and we write 0
if an O-ring is intact, 1 if it showed signs of erosion or blowby. We predict this binary variable
from air temperature at launch. Optional SAS Demo: Shuttle.sas.
Let p(x) = probability that one specific ring fails at temperature x.
Pr{two or more fail (any 2)} = 1 - Pr{0 fail} - Pr{1 fails}
  = 1 - C(6,0)[p(x)]^0[1-p(x)]^6 - C(6,1)[p(x)]^1[1-p(x)]^5
  = 1 - [1-p(x)]^6 - 6 p(x)[1-p(x)]^5
21
A neural network is a composition of functions f(g(x)) where g often is a hyperbolic tangent
function H which in turn is just a slightly reparameterized logistic function with range stretching
from -1 to 1 rather than from 0 to 1.
22
 H1 = tanh(-10 - 0.4*X1 + 0.8*X2)
Advantage: Can fit very wiggly surface.
Disadvantage: Fitted function can be too wiggly (fits training data but does not hold up in future
data)
Relationship of the hyperbolic tangent function H(x) = (e^x - e^-x)/(e^x + e^-x)
to the logistic function L(2x) = e^2x/(e^2x + 1):

H(x) = (e^x - e^-x)/(e^x + e^-x)
     = (e^2x - 1)/(e^2x + 1)
     = [2e^2x - (e^2x + 1)]/(e^2x + 1)
     = 2e^2x/(e^2x + 1) - 1
     = 2L(2x) - 1
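A quick numerical check of this identity in a SAS data step:

data check;
  do x = -3 to 3 by 0.5;
    H    = tanh(x);
    L2x  = exp(2*x)/(1 + exp(2*x));   /* logistic function evaluated at 2x */
    diff = H - (2*L2x - 1);           /* should be zero up to rounding     */
    output;
  end;
run;
proc print data=check; run;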
The parameters (“biases” and “weights”) are estimated iteratively in steps. Approaches (2) and
(3) give model convergence for the training data. In approach (3), the validation fit deteriorates
from iteration 1 on, implying that no fitting is helpful. Possibly this model is too complex for the
task at hand. We reject this approach and proceed with the complex model from step 2,
comparing it later to simpler models.
23
How should models be compared? Statisticians are used to using average squared error as a
measure of how well a model fits. Suppose you are more interested in making decisions as to
whether an event will occur or not. The slide below uses a measure called lift. The small labels
on the vertical axis go from about 0.09 to 0.30 where 0.30 is about 3.3 times 0.09 and 0.09 is the
overall event rate, for example it might be the overall proportion of customers buying some
product. The curve is constructed by using the model to identify the 5% most likely to purchase
(horizontal coordinate) and plotting the actual observed response rate in that group of customers, in
our case 30% (0.30 vertical coordinate). This is repeated for the most likely 10%, 15%, 20%,
etc., resulting in 20 points which are then connected together in a “lift chart”. The lift is a ratio:
the numerator is the actual response rate in the p% judged most likely to respond based on the model,
and the denominator is the overall response rate. On the right side (p=100%) the ratio is of course 1.
This graph is just for illustration. It did not arise from any of our example data sets.
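One way to build such lift points by hand in SAS (a sketch only; the data set SCORED with model score P_HAT and actual response BOUGHT is hypothetical):

proc rank data=scored out=ranked groups=20 descending;
  var p_hat;            /* model's predicted probability             */
  ranks grp;            /* grp = 0 is the most likely 5%, and so on  */
run;
proc sort data=ranked; by grp; run;
data liftpts;
  set ranked; by grp;
  cum_n    + 1;                     /* running count of customers            */
  cum_resp + bought;                /* running count of responders           */
  if last.grp then do;
    top_pct  = 5*(grp + 1);         /* top 5%, 10%, 15%, ...                 */
    top_rate = cum_resp/cum_n;      /* observed response rate in the top p%  */
    output;
  end;
run;

Dividing TOP_RATE by the overall response rate gives the lift at each point.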
24
Another graph of interest for model assessment is the Receiver Operating
Characteristic curve (ROC curve), which for a binary decision starts with a model
prediction variable along the horizontal axis and separate histograms for the events (1s in the
data) and non-events (0s in the data). For example, suppose logit scores are along the horizontal
axis and the histograms for the 1s (to the left) and 0s (to the right) are bell shaped. We then
move a proposed “critical value” along the horizontal axis, looking at what happens if we declare
that logits to its left represent 1s and those to the right represent 0s. We would capture some
percent of the 1’s (our Y coordinate) and mistakenly declare some 0s to be 1s (our X coordinate).
These coordinates trace out the ROC curve.
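In SAS, PROC LOGISTIC will produce these coordinates directly; a sketch with hypothetical data set and variable names:

proc logistic data=train plots=roc;
  model y(event='1') = x1 x2 / outroc=rocpts;   /* rocpts gets _SENSIT_ and _1MSPEC_        */
run;
proc print data=rocpts(obs=10); run;            /* the (X,Y) = (_1MSPEC_, _SENSIT_) points  */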
ROC point by point:
25
26
We want to move from the bottom left corner up as fast as possible (higher sensitivity) while
moving to the right as slowly as possible (less misclassification of 0’s, so higher specificity).
Sensitivity: Pr{ calling a 1 a 1 } = Y coordinate.
Specificity: Pr{ calling a 0 a 0 } = 1 - X coordinate.
Earlier it was claimed that the area under this Receiver Operating Characteristic curve (ROC
curve) is the proportion of concordant pairs plus half the number of ties. This is not particularly
intuitive so we now look at why this is so using a tree example.
27
The leaves from a three-leaf tree are listed in this slide from the one with the highest
proportion of 1’s (leaf 1) to the one with the lowest proportion (leaf 3).
Root:  100 0s, 100 1s
  Leaf 1:            20 0s, 60 1s
  (right branch):    80 0s, 40 1s
      Leaf 2:        20 0s, 20 1s
      Leaf 3:        60 0s, 20 1s
Note that:
(1) We cannot treat the elements of any leaf differently (we have used all discriminatory
information) – we must declare each leaf to be all 1’s or all 0’s.
(2) If we declare every leaf to be 1 we have captured all the 1’s (Y=1) and have misclassified
100% of the 0’s (X=1). If we declare none of the leaves to be 1’s then we have captured none of
the 1’s (Y=0) and have misclassified none of the 0’s (X=0), so (0,0) and (1,1) are on the ROC curve.
There are 100 1’s and 100 0’s, so we can work with counts and then convert to proportions. The number of
1’s in a leaf (height) multiplied by the number of 0’s in that leaf or another leaf (width) gives the
number of tied or concordant pairs, which can be interpreted as an area (height times width).
28
In the previous slide we accounted for the area in the bottom row of rectangles. That was for a
cut point at leaf 1. Moving the cut point to leaf 2 (leaf 1 and 2 declared to be 1’s, leaf 3 declared
to be 0’s) we pick up some more ties (20x20=400 from leaf 2) and some more concordant pairs
(any combination of the 20 1’s in leaf 2 and the 60 0’s in leaf 3 so 1200 more). This gives a
middle row of rectangles in our diagram. Finally, when the cut point is beyond leaf 3, everything
is declared to be 1 so this adds only ties and a single rectangle to the top of our diagram.
The lines shown split the rectangles associated with ties in half. Changing from counts to
proportions (with 100 1’s and 100 0’s, the proportions are just the counts divided by 100),
these points and their connecting lines, extracted from the graph in the slide above, form the
ROC curve by definition.
These observations about a tree show why the area under the ROC curve equals the
proportion of concordant pairs plus half the proportion of tied pairs.
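A small check of the arithmetic for the three-leaf tree above:

data auc;
  /* leaves ordered from highest to lowest proportion of 1's */
  ones1 = 60; zeros1 = 20;     /* leaf 1 */
  ones2 = 20; zeros2 = 20;     /* leaf 2 */
  ones3 = 20; zeros3 = 60;     /* leaf 3 */
  concordant = ones1*(zeros2 + zeros3) + ones2*zeros3;       /* 6000 pairs                      */
  tied       = ones1*zeros1 + ones2*zeros2 + ones3*zeros3;   /* 2800 pairs                      */
  pairs      = 100*100;                                      /* all 1-0 pairs                   */
  area       = (concordant + 0.5*tied)/pairs;                /* 0.74 = area under the ROC curve */
  put concordant= tied= area=;
run;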
29
Generated data from hypothetical cell phone monitoring:
Conclusion: Tree model is clearly the best in terms of lift in both training and validation data,
likely because the cell signals are detecting streets which form rectangular regions. These are
exactly the kind of thing that trees give as results. The smooth function based models are unable
to rise quickly enough.
30
Conclusion: The Neural Network performs as well as or better than the others on every measure;
however, no validation data are used and there is no penalty for the quite large number of
parameters the neural net has. These measures are not like information criteria (e.g. AIC, SBC) with built-in
complexity penalties.
31
Compare several models for the breast cancer data using the Model Comparison node. First, use
the Decision Tree node to select variables; only 2 seem to help.
32
We can save the score code to score (predict) across a grid of the two predictor variables.
Validation misclassification is least for the tree model, though the other models beat it on the other
2 validation criteria. For cancer screening, decisions might be more important than ranking or
probability estimation; misclassification would then trump ROC area or average squared error as a
criterion. Fitted surfaces for our three models (score code applied over a grid in SAS/GRAPH) look like
this:
DATA SURFACE;
DO BARE = ____ TO ____;
DO SIZE = ____ TO ____;
/* insert saved score code here */
OUTPUT;
END;
END;
PROC G3D; (etc.)
33
Next topic: Association Analysis. This is just basic probability under new names. A rule is
something like “If a person buys item B they will also buy A,” or just B=>A.
Data Mining to Statistics Dictionary for rule B=>A:
  Support             = Pr{A and B}
  Confidence          = Pr{A given B} = Pr{A|B}
  Expected Confidence = Pr{A} (unconditional)
  Lift                = Pr{A|B}/Pr{A}
  Gain                = 100(Lift - 1)%
Example: The probability of buying a shirt given that you bought a tie is 0.75. This seems like a high
probability, but what if the overall (unconditional) probability of buying a shirt is 0.90? Using tie
purchase as a criterion for suggesting a customer buy a shirt is not a good strategy. You would be
better off marketing shirts to a random sample of customers, or better yet to people who did not
buy a tie. This motivates the idea of lift. Here the lift would be 0.75/0.90 = 0.83 (less than 1).
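The same arithmetic in a short SAS step, using made-up counts that reproduce the shirt-and-tie numbers:

data rule;
  n = 1000; n_tie = 200; n_shirt = 900; n_both = 150;   /* hypothetical counts */
  support    = n_both/n;             /* Pr{shirt and tie} = 0.15 */
  confidence = n_both/n_tie;         /* Pr{shirt | tie}   = 0.75 */
  expected   = n_shirt/n;            /* Pr{shirt}         = 0.90 */
  lift       = confidence/expected;  /* 0.83, less than 1        */
  gain       = 100*(lift - 1);       /* about -17%               */
  put support= confidence= expected= lift= gain=;
run;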
34
Example: Generated data representing contents of a large number of grocery carts:
Demo 6 (EM): Groceries (Association Analysis)
Association node output is shown above. The association node has a graphical tool called a “link
graph” whose line width and color are related to confidence ( Pr{A|B} ). There is a slider bar
that suppresses lines for confidence less than the value associated with the slider’s position. In
this example, 62% indicates that the slider is 62% of the way from the least confidence to the
most confidence. As it moves from right (all links shown) to left (no links shown) the less strong
relationships disappear. The node colors and sizes indicate the number of carts containing the
item combinations.
35
Unsupervised learning means that we have only features of each item but no target. For example
we might want to group people in terms of favorite hobbies, favorite music, favorite literature,
education level, etc. to find groups that might be socially compatible or perhaps a likely audience
for a play we are producing. From that group some may and some may not attend but we do not
now have data on attendance (that would allow supervised learning). Further, cluster
membership might be used as a (nominal) predictor variable for a second analysis that uses
supervised learning, for example looking at which group had the highest attendance level at our
play after it had its run. Perhaps we can target them again for a similar play we will produce
next time. As an example let’s see what would happen if we clustered the cell features in the
breast cancer data as though we did not have a benign versus malignant diagnosis. We can then
see (1) if the system picks just 2 clusters and (2) if these are related to the diagnosis.
Demo 7 (EM): Cancer Data Revisited (Clustering)
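Outside Enterprise Miner, the same idea can be sketched with PROC FASTCLUS (k-means); the data set and feature names below are hypothetical, and the target is used only afterwards to see how the clusters line up with the diagnosis:

proc fastclus data=cancer maxclusters=2 out=clustered;
  var score1-score9;                 /* cell-feature scores; the target is NOT used            */
run;
proc freq data=clustered;
  tables cluster*target;             /* how do the 2 clusters relate to benign vs. malignant?  */
run;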
36
Cluster 1 (left panel top pie slice) is mostly benign cells (target = 0, right panel top pie slice).
Clustering indicates 2 somewhat distinguishable tumor types based on features.
Enterprise Miner has a segment profiler. There are 2 segments (clusters) diagnosed by the
cluster algorithm and thus 2 rows of plots. There are 10 features for each cluster – 10 panels per
row of which we see 9 below. The skeletal plots show the overall distribution of scores for each
feature. We see for cluster (row) 1 that a large number of cells have low scores on most of the 9
variables shown. For most features, in fact, over 80% have low scores (left histogram bars) versus
about 62% of cells overall (skeletal bars). In cluster 2 (second row), less than 20% of the cells have
such low scores. As with the first row, the exact size of the left
histogram bar varies by feature (by panel).
Recall our cluster variables are scores on 10 cancer cell properties.
37
Among other things, text mining attempts to cluster documents based on a spreadsheet of word
counts in a collection of documents. SAS Text Miner has tools for preparing the spreadsheet for
analysis. For example, we might alias words like work and task. We might have to decide if
‘work’ is a verb (I will work on a project), a noun (That’s quite a piece of work), or an attributive
noun (I need a work permit). Is a word negated (I will not, after my most recent disastrous
experience, work with him again. His idea will never work)? This natural language processing
will not be discussed here. We might want to truncate (stem) words (working, worked, works) and
ignore others (the, and, in, a, an). Here we have a small example to illustrate what happens to the
spreadsheet once constructed.
38
We have W=13 words in D=14 documents here.
The bottom left graph (not for our current data) has one dot for each of D=14 documents. The
horizontal coordinate is the count for word 2 (perhaps ‘football’) and the vertical coordinate the count for
word 1 (perhaps ‘opera’) for an example with W=2 words only. Notice that most of the
variation in these points is along an axis, Prin1, running from upper left to lower right,
indicating a negative correlation between these words from document to document. Documents
with high counts of ‘football’ tend to have low counts of ‘opera’ and vice versa. This axis is
called the first principal component in statistics and its length is proportional to the square root of
the largest eigenvalue of the 2x2 correlation matrix of the counts of these 2 words. The
eigenvalues of this matrix, a WxW matrix in general, indicate the proportion of variation
39
accounted for along the first few principal component axes. For our 13 words, the first
eigenvalue accounts for 7.1095/13 = 0.55 = 55% of the variation among documents of all 13
words. The first two axes (of 13 possible) account for 72.42% of the variation.
For the two word example, we can approximate the distance between plotted dots by looking at
the distance of their coordinates on the Prin1 axis. This would emphasize the gap between the
football and opera clusters and would likely produce 2 clusters based on only the 1-dimensional
(as opposed to W=13 dimensional) representation. To compute the D=14 points (documents)
projected onto this first, or longest, principal component axis we use weights that are given, for
example, in SAS PROC PRINCOMP. For any document we take the counts of each of the 13
words, multiply each by the weight shown here (and above) and then add them together to
“score” the document.
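In SAS this weighting and scoring is done for us; a sketch with hypothetical variable names for the 13 word counts:

proc princomp data=docs out=docscores n=2;
  var word1-word13;        /* eigenvalues, weights (eigenvectors), and scores                  */
run;                       /* docscores gains Prin1, Prin2 = each document's component scores  */
proc print data=docscores; var prin1 prin2; run;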
We are fortunate here because we can interpret the weights. First we see a big gap between
-0.08011 and 0.273525. Documents with large counts of the first 6 words in our list and small
counts of the last 7 would score low on the principal component axis. Low scores seem to be
associated with documents discussing sports. In contrast, documents with low counts of those
first 6 words and high counts of the other 7 seem to be about politics. Plotting the weights helps
40
illustrate this point.
We can apply these ideas to our 14 documents by (1) computing the score for each document and then
(2) plotting these on a line to see if they cluster. Alternatively we could use these scores as
values of a response variable and try to relate them to predictor variables like gender, age,
education level, etc.
Again we get lucky in that the coordinates of the documents fall into 2 fairly distinct clusters.
The biggest gap in the document scores lies between document 8 which plots at a coordinate
-1.79370 on the number line and document 1 which plots at -0.00738. We have already decided
that documents with lower scores are about sports and those with higher coordinates are about
politics. The gap gives us a convenient division point between high and low scores.
41
SAS clustering procedures (PROC CLUSTER, PROC TREE) can be used to produce clusters
and make a dendrogram (a tree-like diagram).
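A sketch of those two steps, assuming the DOCSCORES data set from the PROC PRINCOMP step above and a document identifier DOC:

proc cluster data=docscores method=ward outtree=treedat;
  var prin1;               /* cluster on the first principal component score(s)       */
  id doc;
run;
proc tree data=treedat nclusters=2 out=clusout;   /* dendrogram plus 2-cluster output */
run;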
Not surprisingly the SAS clustering algorithm reproduces the visually obvious 2 clusters we saw
in the graph. Notice that the clustering algorithm will work with 2 or more principal component
dimensions. In more than 3 dimensions, visual assessment is not practical.
42
Optional Topic: Nearest Neighbor Method:
This is a simple, tried-and-true method. The training data have classes and multiple inputs. As an
example here we have 2 classes, indicated by colors, and 2 predictors.
A probe point is set into the feature space (here the two-dimensional space of features, or
predictor variables) as seen in the blowup of the small region shown on the right. The smallest
circle containing k points and centered at the probe point is inscribed. The default in Enterprise
Miner is k=16. The sample probability, i/k = i/16 where i is the number of blue points, is used as the
probability of blue in this example. This is used for decisions. Here is the plot of predicted
classes for the training data:
The plot on the right is obtained from the “Exported Data” using the explore and graph options.
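The same k-nearest-neighbor idea can be sketched outside Enterprise Miner with nonparametric PROC DISCRIM (data set and variable names below are hypothetical):

proc discrim data=train testdata=grid testout=scoredgrid
             method=npar k=16;       /* classify each grid point by its 16 nearest training points */
  class color;                       /* the 2 classes                                              */
  var x1 x2;                         /* the 2 predictors                                           */
run;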
43
Enterprise Miner also allows importing a “score” data set having possibly only the features
or predictor variables, then connecting the MBR node and the new score data set to a Score node
where the model is applied to the score data, which in this example is a grid.
We see increasing clarity as we move to the right in this set of plots.
44
Optional Topic: Fisher’s Linear Discriminant Function:
For a given X (financial stress index), from which population did it most likely
come? Example: X=22 most likely came from the middle (pays only part of credit
bill) population because the green curve is highest at X=22. Note: equal variances
are assumed here.
45
Unequal variances change the situation, adding a second green interval:
So far, the area under each curve has been 1, but if our overall population has 40%, 40%, and
20% in the three subpopulations respectively, the area under the right hand (defaulters) curve
should be half that of the others to reflect the unequal “priors.” (Prior: without seeing X we give
probabilities .4, .4, .2 to the three subpopulations, but these change to “posterior” probabilities
when the value of X is observed). Note the two green regions.
We need a mechanism to find the posterior probabilities (when X has been observed). The
computation depends on the “Fisher Linear Discriminant” for each subpopulation when the
variance is constant and the ”Fisher Quadratic Discriminant” when variances differ. Here the
Fisher Linear Discriminant Function is aX + b, where a = μ/σ², b = -μ²/(2σ²), and μ is the mean of
46
the subpopulation to which the observation with predictor variable X is being compared.
Exponentiating these functions, one per subpopulation, gives three numbers proportional to the
three probabilities so, dividing each by their sum gives the probabilities.
The probability densities f(X) for the three populations have the square bracketed term [ ] above
in common, so the exponentials outside the [ ] are proportional to the probability densities. We
divide each (line 3 below, for X=21) by their sum to get the probabilities (line 4).
Priors enter the computations when subpopulations are not equally likely.
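A sketch of the whole posterior calculation for one observed X; the three means, the common standard deviation, and the priors below are hypothetical numbers chosen only to show the arithmetic:

data posterior;
  x = 21;  sigma = 3;
  array mu{3}    _temporary_ (15 20 25);      /* hypothetical subpopulation means */
  array prior{3} _temporary_ (0.4 0.4 0.2);   /* prior probabilities              */
  array escore{3}; array post{3};
  total = 0;
  do i = 1 to 3;
    d = (mu{i}/sigma**2)*x - mu{i}**2/(2*sigma**2);  /* Fisher linear discriminant aX + b */
    escore{i} = prior{i}*exp(d);                     /* proportional to the posterior     */
    total = total + escore{i};
  end;
  do i = 1 to 3;
    post{i} = escore{i}/total;                       /* posteriors sum to 1               */
  end;
  put post1= post2= post3=;
run;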
47
Line 2 in the table above uses X=21 in the three discriminant functions. The functions are linear
functions of X, one for each of the three subpopulations. A similar linear form with more terms
arises in bivariate and general multivariate situations as long as the variance matrix is the same in
each subpopulation.
Boundaries between classification regions are just cutoff points in the univariate credit card
example. Bivariate cases are more interesting.
We can plot three bivariate normal populations for a bivariate example. It appears that with a
constant common variance matrix, the boundaries are lines. Only the highest of the three
subpopulation densities is plotted, for clarity.
48
The boundaries can be shown to be linear and here are the regions defined by them. The data are
made up but we’ll think of the two predictors as scaled debt and income scores of some sort and
the target classes as defaulters, those who pay some but not all (e.g. the minimum), and those
who pay the debt off each month.
The plots above use theoretical normal distributions. We can also generate data from the three
distributions. Here we have 1000 observations from each of the three groups.
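A sketch of generating such data (the group means, the common standard deviation, and the seed are made-up choices); the result plays the role of data set A in the PROC DISCRIM step below:

data a;
  call streaminit(12345);
  do group = 1 to 3;                        /* defaulters, partial payers, full payers */
    do i = 1 to 1000;
      x1 = rand('normal', 10*group, 3);     /* scaled debt score   */
      x2 = rand('normal', 20 + 5*group, 3); /* scaled income score */
      output;
    end;
  end;
  drop i;
run;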
For these data we can run a discriminant analysis with PROC DISCRIM and then score a gridded
data set B to produce a scored data set C with the results.
PROC DISCRIM data=a testdata=b testout=c;
CLASS group; VAR X1 X2;
run;
49
50
In summary, data mining is a collection of tools, and in some programs, like Enterprise Miner, a
graphical user interface is used to access them. The tools must work fast on large data sets.
Because the data sets are large, an honest model comparison can be made by reserving a
validation data set on which to compare results.