1  Finding the Gold in Your Data. SAS Global Forum Workshop, David A. Dickey, March 2014. SAS and its products are registered trademarks of SAS Institute, Cary, NC.

2  The tree method (recursive splitting) divides data into homogeneous rectangular regions. How? When do we stop splitting? What happens with missing values of X variables?

3  Note: there are several other variables in the data set that are not chosen as splitting variables anywhere in the tree. Throughout we will use examples, some in SAS software and some using Enterprise Miner. One tree will also use smoking (yes, no) in a split. Examples next.

4  Apply tree techniques using defaults. Smoking comes into play for young people with high cholesterol. Blood pressure matters for older people. Now try some options (Gini splits, average squared error assessment, a fixed number of leaves, N=4). Youth is protective; smoking does not come into play. Note the left leaf in the bottom row: its incidence rate is less than the overall rate, but it is based on only 51 people! Missing inputs are handled by placing them alternately on either side of a two-way split and then picking the better allocation based on prediction of the target.

5  Left table: observed counts. Right table: expected counts under independence. The degrees of freedom number is (r-1)(c-1) for an r x c table. The test compares observed to expected counts (the chi-square statistic sums (observed - expected)^2 / expected over the cells).

6  Intuitive explanation of "p-values": hypothesis testing is like a court trial. H0: innocent versus H1: guilty. Do not reject innocence unless the evidence against it is beyond reasonable doubt (p-value < 0.05?). Example of chi-square: did status (first, second, third class, crew) affect Titanic survival? H0: no vs. H1: yes. Demo 1: Titanic.sas adds a Bonferroni correction to the logworth and considers gender splits. Try the Titanic analysis in Enterprise Miner. Consider status as unordered and consider "gender" (man, woman, child). You need to set up a project and a diagram within the project.

7, 8  The diagram specifies two trees: the default tree (top) and a 4-leaf tree using Gini splits (bottom). These give the trees shown earlier.

9  Candidate split 1 is gender (2 choices) and candidate split 2 is age (87 possibilities). Is that fair? If I get 87 tries to win a prize and you get only 2, is that OK with you? Some adjustment is needed! The Bonferroni adjustment multiplies the p-value by the number of tests (the number of potential split points).

10  The way PRUNING is done depends on which of three major goals you have in mind: (1) ranking (e.g., who are the top n candidates?), (2) decisions (e.g., is it malignant or benign?), (3) prediction (e.g., what is the probability of heart disease given your risk factors?). Example: a split might be helpful in going from a probability of defaulting of 0.20 in a parent node to probabilities of 0.07 and 0.32 in the child nodes, but if the goal is decisions, then both children have less than a 50% chance of defaulting (predict no default), the same as the parent. The split is not helpful for decisions if we decide based only on which of the two outcomes is more likely.

11  For pruning, we may look at the misclassification rate, average squared error, or some other measure of "diversity". The area under the ROC curve (discussed later) is the same as the C statistic below.

12  The cost (consequence) of accidentally saying a tumor is malignant when it is really benign may be quite different from that of saying a tumor is benign when it is really malignant. Where the costs differ by decision, the cost should be combined with the probability when making the decision.
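As a minimal sketch of such a cost-weighted decision (the costs and data set name below are made up for illustration; the leaf probabilities echo the 0.07 and 0.32 child nodes in the pruning example above):

    /* Hypothetical costs: calling a benign case malignant costs 1 unit,        */
    /* calling a malignant case benign costs 10 units. Values are illustrative. */
    data cost_decisions;
       input leaf p_malignant;
       length decision $ 9;
       cost_if_call_malignant = (1 - p_malignant) * 1;    /* expected cost of declaring malignant */
       cost_if_call_benign    = p_malignant * 10;         /* expected cost of declaring benign    */
       if cost_if_call_malignant < cost_if_call_benign then decision = 'malignant';
       else decision = 'benign';
       cards;
    1 0.07
    2 0.32
    ;
    proc print data=cost_decisions; run;

With these made-up costs the two leaves receive different decisions even though "benign" is the more likely outcome in both, which is exactly why costs belong in the decision.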
Recall that items within a leaf cannot be distinguished further (we have used all the useful information there is for distinguishing this leaf from others), so we must make the same decision for all items in a leaf. So far we have been dealing with a categorical (in fact binary) response. What if the response is continuous? The chi-square contingency table would no longer be available for decisions about splitting.

13  Perhaps the two X variables are age and debt-to-income ratio and the response is a credit card balance in hundreds of dollars. Consider the first split (the vertical line in the middle of the plot above). It divides the region into two groups, each with a mean. Thinking of these (high X1, low X1) as two treatment groups, we can do an analysis of variance F test and get its p-value for each possible split on each variable. Once we have a p-value we can proceed just as in the categorical response case. For example, splitting on X1 (age) = young versus X1 (age) = old with balance as the response and N = n1 + n2 observations:

    Young:  Y1, Y2, Y3, ..., Yn1        with error sum of squares SSE(1)
    Old:    Yn1+1, Yn1+2, ..., YN       with error sum of squares SSE(2)

    F numerator (model mean square)  = [SS(total) - SSE(1) - SSE(2)] / 1 df
    F denominator (error mean square) = MSE = [SSE(1) + SSE(2)] / (N-2) df
    p-value = Pr > F
    Bonferroni adjusted p-value = (# possible splits)(p-value)
    logworth of split = -log10[(# possible splits)(p-value)]

Keep on splitting as usual.

14  Here the prediction for leaf i is just the mean of the Y values in that leaf, so the fitted surface is a step function. A real data example: cost to society (in units of the cost of one fatality) for traffic accidents in Portugal (personal communication from Dr. G. Torrao). Note: vehicle 1 is the one deemed to have caused the accident.

15  Variable importance here is the reduction in variance obtained in splits using the given variable. If a variable is never chosen as a split variable then its importance is 0. Variables that were runners-up for splits can also enter the calculation in a down-weighted fashion. Only two variables have nonzero importance in the (hypothetical) chemical data here, allowing 3-D graphics. The response is potency (continuous). From the EM output we can infer the order and nature of the splits. Note that the seemingly single blue horizontal surface is actually several slightly different rectangular surfaces at very similar heights off the "floor" of the plot. Also note that C and D have 0 importance.

16  From this output we infer this split pattern:

17  We need a function f(X) that stays between 0 and 1 no matter what X is. The logistic function f(X) = exp(a+bX)/(1+exp(a+bX)) = 1/(1+exp(-(a+bX))) works. Depending on a and b we get something that looks like a steep (large |b|) or mild (small |b|) ski slope. Q: How do we find a and b for a given set of data? A: Find a and b to maximize the likelihood of the observed data, that is, find the values of a and b that maximize the joint probability of the observed configuration of points.

18  Fitted: a + bX = -2.6 + 0.23 X.

    DATA Ignition;
       INPUT time Y @@;
    cards;
     3 0  5 0  7 1  9 0 10 0 11 1 12 1 13 0 14 1 15 1 16 0 17 1 25 1 30 1
     4 .  6 .  8 . 18 . 19 . 20 . 22 . 24 . 26 . 28 . 30 .
    ;
    proc print; run;

    PROC LOGISTIC;
       Model Y(event='1') = time;
       output out=out1 predicted=Plogistic;
    run;
    proc sort data=out1; by time; run;

19
    proc sgplot data=out1;
       reg Y=Y X=time;
       series Y=Plogistic X=time;
    run;

Example: the Challenger mission.

20  There were 24 missions prior to Challenger. Each mission carries 6 O-rings used to keep the fuel elements from combining before they should. Upon return, the O-rings are inspected and we write 0 if an O-ring is intact, 1 if it showed signs of erosion or blowby. We predict this binary variable from air temperature at launch. Optional SAS Demo: Shuttle.sas. Let p(x) be the probability that one specific ring fails at temperature x. Then

    Pr{two or more fail (any 2)} = 1 - Pr{0} - Pr{1}
        = 1 - (6 choose 0)[p(x)]^0 [1-p(x)]^6 - (6 choose 1)[p(x)]^1 [1-p(x)]^5
        = 1 - [1-p(x)]^6 - 6 p(x)[1-p(x)]^5.
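A minimal sketch of evaluating this risk over a range of launch temperatures; the intercept and slope below are placeholders, not the estimates from the Shuttle.sas fit:

    /* Placeholder logistic coefficients for Pr{one specific ring fails} vs. temperature. */
    /* Substitute the fitted values from the shuttle demo; these are made up.             */
    data oring_risk;
       a = 15;  b = -0.25;
       do temp = 30 to 80 by 5;
          p      = 1 / (1 + exp(-(a + b*temp)));     /* per-ring failure probability        */
          p2plus = 1 - (1-p)**6 - 6*p*(1-p)**5;      /* Pr{two or more of the 6 rings fail} */
          output;
       end;
    run;
    proc print data=oring_risk; var temp p p2plus; run;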
21  A neural network is a composition of functions f(g(x)) where g is often a hyperbolic tangent function H, which in turn is just a slightly reparameterized logistic function with range stretching from -1 to 1 rather than from 0 to 1.

22  H1 = tanh(-10 - 0.4*X1 + 0.8*X2). Advantage: can fit a very wiggly surface. Disadvantage: the fitted function can be too wiggly (it fits the training data but does not hold up in future data). Relationship of the hyperbolic tangent function H(x) = (e^x - e^-x)/(e^x + e^-x) to the logistic function L(2x) = e^(2x)/(e^(2x) + 1):

    H(x) = (e^x - e^-x)/(e^x + e^-x) = (e^(2x) - 1)/(e^(2x) + 1)
         = [2e^(2x) - (e^(2x) + 1)]/(e^(2x) + 1) = 2 L(2x) - 1.

The parameters ("biases" and "weights") are estimated iteratively in steps. Approaches (2) and (3) give model convergence for the training data. In approach (3), the validation fit deteriorates from iteration 1 on, implying that no fitting is helpful; possibly this model is too complex for the task at hand. We reject this approach and proceed with the complex model from step 2, comparing it later to simpler models.

23  How should models be compared? Statisticians are used to using average squared error as a measure of how well a model fits. Suppose you are more interested in making decisions as to whether an event will occur or not. The slide below uses a measure called lift. The small labels on the vertical axis go from about 0.09 to 0.30, where 0.30 is about 3.3 times 0.09 and 0.09 is the overall event rate; for example, it might be the overall proportion of customers buying some product. The curve is constructed by using the model to pick the 5% of customers judged most likely to purchase (horizontal coordinate) and plotting the actual observed response rate in that group, in our case 30% (0.30 vertical coordinate). This is repeated for the most likely 10%, 15%, 20%, etc., giving 20 points which are then connected in a "lift chart". The lift itself is a ratio: the numerator is the actual response rate in the p% judged most likely to respond based on the model, and the denominator is the overall response rate. On the right side (p = 100%) the ratio is of course 1. This graph is just for illustration; it did not arise from any of our example data sets.

24  Another graph of interest for assessing models is the Receiver Operating Characteristic (ROC) curve. For a binary decision it starts with a model prediction variable along the horizontal axis and separate histograms for the events (1s in the data) and non-events (0s in the data). For example, suppose logit scores are along the horizontal axis and the histograms for the 1s (to the left) and 0s (to the right) are bell shaped. We then move a proposed "critical value" along the horizontal axis, looking at what happens if we declare that logits to its left represent 1s and those to its right represent 0s. We would capture some percent of the 1s (our Y coordinate) and mistakenly declare some 0s to be 1s (our X coordinate). These coordinates trace out the ROC curve. ROC point by point:

25, 26  We want to move from the bottom left corner upward as fast as possible (higher sensitivity) while moving to the right as slowly as possible (less misclassification of 0s, so higher specificity). Sensitivity: Pr{calling a 1 a 1} = Y coordinate. Specificity: Pr{calling a 0 a 0} = 1 - X coordinate.
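A minimal sketch of tracing these points in SAS, reusing the ignition data and variable names from the earlier demo; PROC LOGISTIC's OUTROC= data set holds one row per candidate cutoff, with _SENSIT_ as the Y coordinate and _1MSPEC_ as the X coordinate:

    proc logistic data=Ignition;
       model Y(event='1') = time / outroc=rocdat;   /* one row per cutoff */
    run;
    proc sgplot data=rocdat;
       series y=_SENSIT_ x=_1MSPEC_;                /* sensitivity vs. 1 - specificity */
       lineparm x=0 y=0 slope=1 / lineattrs=(pattern=dash);   /* chance line */
    run;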
Earlier it was claimed that the area under this Receiver Operating Characteristic (ROC) curve is the proportion of concordant pairs plus half the proportion of tied pairs. This is not particularly intuitive, so we now look at why it is so using a tree example.

27  The leaves from a three-leaf tree are listed in this slide from the one with the highest proportion of 1s (leaf 1) to the one with the lowest proportion (leaf 3):

                      [ 100 0s   100 1s ]
                      /                 \
          [ 20 0s   60 1s ]       [ 80 0s   40 1s ]
               Leaf 1              /              \
                        [ 20 0s   20 1s ]   [ 60 0s   20 1s ]
                             Leaf 2              Leaf 3

Note that: (1) We cannot treat the elements of any leaf differently (we have used all the discriminatory information), so we must declare each leaf to be all 1s or all 0s. (2) If we declare every leaf to be 1 we have captured all the 1s (Y=1) and have misclassified 100% of the 0s (X=1). If we declare none of the leaves to be 1s then we have captured none of the 1s (Y=0) and misclassified none of the 0s (X=0), so (0,0) and (1,1) are on the ROC curve. There are 100 1s and 100 0s, so we can work with counts and then convert to proportions. The number of 1s in a leaf (height) times the number of 0s in that or another leaf (width) gives an area that counts the tied or concordant pairs, so the pair counts can be interpreted as areas.

28  In the previous slide we accounted for the area in the bottom row of rectangles. That was for a cut point at leaf 1. Moving the cut point to leaf 2 (leaves 1 and 2 declared to be 1s, leaf 3 declared to be 0s) we pick up some more ties (20 x 20 = 400 from leaf 2) and some more concordant pairs (any combination of the 20 1s in leaf 2 and the 60 0s in leaf 3, so 1200 more). This gives a middle row of rectangles in our diagram. Finally, when the cut point is beyond leaf 3, everything is declared to be 1, so this adds only ties and a single rectangle to the top of our diagram. The lines shown split the rectangles associated with ties in half. These lines form the ROC curve, by definition, once we change from counts to proportions; with 100 1s and 100 0s, the proportions are just the counts divided by 100. These adjusted points and connecting lines, extracted from the graph in the slide above, form the ROC curve by definition. These observations about a tree show why the area under the ROC curve equals the proportion of concordant pairs plus half the proportion of tied pairs. (A small numerical check of this arithmetic appears after the model comparisons below.)

29  Generated data from hypothetical cell phone monitoring. Conclusion: the tree model is clearly the best in terms of lift in both training and validation data, likely because the cell signals are detecting streets, which form rectangular regions, exactly the kind of thing that trees give as results. The smooth-function-based models are unable to rise quickly enough.

30  Conclusion: the neural network performs as well as or better than the others on every measure; however, no validation data were used and there is no penalty for the quite large number of parameters the neural net has. These fit measures are not like information criteria (e.g., AIC, SBC), which have built-in complexity penalties.

31  Compare several models for the breast cancer data using the Model Comparison node. First, use the Decision Tree node to select variables; only 2 seem to help.

32  We can save the score code to score (predict) across a grid of the two predictor variables. Validation misclassification is lowest for the tree model, though the other models beat it on the other two validation criteria.
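Here is the numerical check of the ROC-area claim promised above; it uses only the leaf counts shown in the three-leaf tree diagram:

    /* Leaf counts (0s, 1s), ordered from highest to lowest proportion of 1s:   */
    /* Leaf 1 = (20, 60), Leaf 2 = (20, 20), Leaf 3 = (60, 20).                  */
    data auc_check;
       concordant = 60*(20+60) + 20*60;      /* 1s in a leaf paired with 0s in lower-ranked leaves */
       ties       = 60*20 + 20*20 + 20*60;   /* 1s and 0s in the same leaf                          */
       pairs      = 100*100;                 /* every 1 paired with every 0                         */
       auc        = (concordant + 0.5*ties) / pairs;
       put concordant= ties= auc=;           /* 6000, 2800, 0.74 */
    run;

The 0.74 is (concordant + half the ties) as a proportion of all 10,000 one-zero pairs, matching the area swept out by the full rectangles and the halved tie rectangles in the diagram.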
For cancer screening, decisions might be more important than ranking or probability estimation. Misclassification would then trump ROC area or average squared error as a criterion. Fitted surfaces for our three models (score code applied over a grid, plotted in SAS/GRAPH) look like this:

    DATA SURFACE;
       DO BARE = ____ TO ____;
          DO SIZE = ____ TO ____;
             *--- insert saved score code here ---;
             OUTPUT;
          END;
       END;
    PROC G3D; (etc.)

33  Next topic: Association Analysis. This is just basic probability under new names. A rule is something like "If a person buys item B they will also buy A," or just B=>A. Data mining to statistics dictionary for the rule B=>A:

    Support              Pr{A and B}
    Confidence           Pr{A given B} = Pr{A|B}
    Expected confidence  Pr{A}  (unconditional)
    Lift                 Pr{A|B} / Pr{A}
    Gain                 100(Lift - 1)%

Example: the probability of buying a shirt given that you bought a tie is 0.75. This seems like a high probability, but what if the overall (unconditional) probability of buying a shirt is 0.90? Using tie purchase as a criterion for suggesting that a customer buy a shirt is not a good strategy; you would be better off marketing shirts to a random sample of customers, or better yet to people who did not buy a tie. This motivates the idea of lift. Here the lift would be 0.75/0.90 = 0.83 (less than 1).

34  Example: generated data representing the contents of a large number of grocery carts. Demo 6 (EM): Groceries (Association Analysis). The Association node output is shown above. The Association node has a graphical tool called a "link graph" whose line width and color are related to confidence (Pr{A|B}). There is a slider bar that suppresses lines for confidence less than the value associated with the slider's position. In this example, 62% indicates that the slider is 62% of the way from the least confidence to the most confidence. As it moves from right (all links shown) to left (no links shown) the weaker relationships disappear. The node colors and sizes indicate the number of carts containing the item combinations.

35  Unsupervised learning means that we have only features of each item but no target. For example, we might want to group people in terms of favorite hobbies, favorite music, favorite literature, education level, etc. to find groups that might be socially compatible, or perhaps a likely audience for a play we are producing. From that group some may and some may not attend, but we do not now have data on attendance (data that would allow supervised learning). Further, cluster membership might be used as a (nominal) predictor variable in a second analysis that uses supervised learning, for example looking at which group had the highest attendance level at our play after its run. Perhaps we can target that group again for a similar play we produce next time. As an example, let's see what would happen if we clustered the cell features in the breast cancer data as though we did not have a benign versus malignant diagnosis. We can then see (1) whether the system picks just 2 clusters and (2) whether these are related to the diagnosis. Demo 7 (EM): Cancer Data Revisited (Clustering).

36  Cluster 1 (left panel, top pie slice) is mostly benign cells (target = 0, right panel, top pie slice). Clustering indicates 2 somewhat distinguishable tumor types based on the features. Enterprise Miner has a segment profiler. There are 2 segments (clusters) produced by the clustering algorithm and thus 2 rows of plots. There are 10 features for each cluster, 10 panels per row, of which we see 9 below. The skeletal plots show the overall distribution of scores for each feature. We see for cluster (row) 1 that a large number of cells have low scores on most of the 9 variables shown. For most features, in fact, over 80% have low scores (left histogram bars) versus about 62% of cells overall (skeletal bars). Less than 20% of the cells in the second cluster (second row) have such low scores. As with the first row, the exact size of the left histogram bar varies by feature (by panel). Recall that our cluster variables are scores on 10 cancer cell properties.
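A rough stand-in for this kind of profiling outside Enterprise Miner is to summarize each feature by cluster; the data set name, the _SEGMENT_ cluster variable, and the feature names below are assumptions for illustration:

    /* Assumed: cluster_out is the Cluster node's exported data set and _SEGMENT_  */
    /* is its cluster assignment; the 9 feature names are placeholders.            */
    proc means data=cluster_out n mean std maxdec=2;
       class _SEGMENT_;
       var clump_thickness cell_size cell_shape adhesion epith_size
           bare_nuclei bland_chromatin normal_nucleoli mitoses;
    run;

Comparing these per-cluster summaries with the same statistics computed on the whole data set plays the role of the skeletal (overall) bars in the profiler.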
37  Among other things, text mining attempts to cluster documents based on a spreadsheet of word counts in a collection of documents. SAS Text Miner has tools for preparing the spreadsheet for analysis. For example, we might alias words like work and task. We might have to decide whether "work" is a verb (I will work on a project), a noun (That's quite a piece of work), or an attributive noun (I need a work permit). Is a word negated (I will not, after my most recent disastrous experience, work with him again. His idea will never work)? This natural language processing will not be discussed here. We might want to truncate words (working, worked, works) and ignore others (the, and, in, a, an). Here we have a small example to illustrate what happens to the spreadsheet once it is constructed.

38  We have W = 13 words in D = 14 documents here. The bottom left graph (not for our current data) has one dot for each of D = 14 documents. The horizontal coordinate is the count for word 2 (perhaps "football") and the vertical coordinate is the count for word 1 (perhaps "opera") for an example with W = 2 words only. Notice that most of the variation in these points is along an axis, Prin1, running from upper left to lower right, indicating a negative correlation between these words from document to document: documents with high counts of "football" tend to have low counts of "opera" and vice versa. This axis is called the first principal component in statistics, and its length is proportional to the square root of the largest eigenvalue of the 2x2 correlation matrix of the counts of these 2 words. The eigenvalues of this matrix, a WxW matrix in general, indicate the proportion of variation

39  accounted for along the first few principal component axes. For our 13 words, the first eigenvalue accounts for 7.1095/13 = 0.55 = 55% of the variation among documents in all 13 words. The first two axes (of 13 possible) account for 72.42% of the variation. For the two-word example, we can approximate the distance between plotted dots by the distance between their coordinates on the Prin1 axis. This would emphasize the gap between the football and opera clusters and would likely produce 2 clusters based on only the 1-dimensional (as opposed to W = 13 dimensional) representation. To compute the D = 14 points (documents) projected onto this first, or longest, principal component axis we use weights that are given, for example, by SAS PROC PRINCOMP. For any document we take the counts of each of the 13 words, multiply each by the weight shown here (and above), and then add them together to "score" the document. We are fortunate here because we can interpret the weights. First we see a big gap between -0.08011 and 0.273525. Documents with large counts of the first 6 words in our list and small counts of the last 7 would score low on the principal component axis; low scores seem to be associated with documents discussing sports. In contrast, documents with low counts of those first 6 words and high counts of the other 7 seem to be about politics.
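A minimal sketch of that scoring step; the data set name counts and the variable names w1-w13 are placeholders for the 13 word-count columns:

    proc princomp data=counts out=doc_scores n=2;
       var w1-w13;               /* the 13 word-count variables (names assumed) */
    run;
    /* doc_scores now holds Prin1 and Prin2 for each document; Prin1 is the     */
    /* weighted sum of the (standardized) counts discussed above.               */
    proc print data=doc_scores; var Prin1 Prin2; run;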
Plotting the weights helps illustrate this point.

40  We can apply these ideas to our 14 documents by (1) computing the score for each document and then (2) plotting the scores on a line to see whether they cluster. Alternatively, we could use these scores as values of a response variable and try to relate them to predictor variables like gender, age, education level, etc. Again we get lucky in that the coordinates of the documents fall into 2 fairly distinct clusters. The biggest gap in the document scores lies between document 8, which plots at coordinate -1.79370 on the number line, and document 1, which plots at -0.00738. We have already decided that documents with lower scores are about sports and those with higher coordinates are about politics. The gap gives us a convenient division point between high and low scores.

41  SAS clustering procedures (PROC CLUSTER, PROC TREE) can be used to produce clusters and make a dendrogram (a tree-like diagram). Not surprisingly, the SAS clustering algorithm reproduces the visually obvious 2 clusters we saw in the graph. Notice that the clustering algorithm will work with 2 or more principal component dimensions; in more than 3 dimensions, visual assessment is not practical.

42  Optional topic: the nearest neighbor method. This is a simple, tried and true method. The training data have classes and multiple inputs. In the example here we have 2 classes, indicated by colors, and 2 predictors. A probe point is set into the feature space (here the two-dimensional space of features, or predictor variables), as seen in the blowup of the small region shown on the right. The smallest circle containing k points and centered at the probe point is inscribed. The default in Enterprise Miner is k = 16. The sample probability i/k = i/16, where i is the number of blue points, is used as the probability of blue in this example. This is used for decisions. Here is the plot of predicted classes for the training data; the plot on the right is obtained from the "Exported Data" using the explore and graph options.

43  Enterprise Miner also allows importing a "score" data set having possibly only the features (predictor variables), then connecting the MBR node and the new score data set to a Score node where the model is applied to the score data, which in this example is a grid. We see increasing clarity as we move to the right in this set of plots.

44  Optional topic: Fisher's linear discriminant function. For a given X (financial stress index), from which population did it most likely come? Example: X = 22 most likely came from the middle population (pays only part of the credit card bill) because the green curve is highest at X = 22. Note: equal variances are assumed here.

45  Unequal variances change the situation, adding a second green interval. So far the area under each curve has been 1, but if our overall population has 40%, 40%, and 20% in the three subpopulations respectively, the area under the right-hand (defaulters) curve should be half that of the others to reflect the unequal "priors." (Prior: without seeing X we give probabilities 0.4, 0.4, 0.2 to the three subpopulations, but these change to "posterior" probabilities when the value of X is observed.) Note the two green regions. We need a mechanism to find the posterior probabilities once X has been observed. The computation depends on the "Fisher linear discriminant" for each subpopulation when the variance is constant and the "Fisher quadratic discriminant" when variances differ.
46  Here the Fisher linear discriminant function is aX + b, where a = mu/sigma^2, b = -mu^2/(2 sigma^2), mu is the mean of the subpopulation to which the observation with predictor variable X is being compared, and sigma^2 is the common variance. Exponentiating these functions, one per subpopulation, gives three numbers proportional to the three probabilities, so dividing each by their sum gives the probabilities. The probability densities f(X) for the three subpopulations have the square-bracketed term [ ] above in common, so the exponentials outside the [ ] are proportional to the densities; we divide each (line 3 below, for X = 21) by their sum to get the probabilities (line 4). Priors enter the computations when the subpopulations are not equally likely.

47  Line 2 in the table above uses X = 21 in the three discriminant functions. The functions are linear in X, one for each of the three subpopulations. A similar linear form with more terms arises in bivariate and general multivariate situations as long as the variance matrix is the same in each subpopulation. Boundaries between classification regions are just cutoff points in the univariate credit card example. Bivariate cases are more interesting. We can plot three bivariate normal populations for a bivariate example. It appears that with a constant common variance matrix the boundaries are lines. Only the highest of the three subpopulation surfaces is plotted, for clarity.

48  The boundaries can be shown to be linear, and here are the regions they define. The data are made up, but we will think of the two predictors as scaled debt and income scores of some sort and the target classes as defaulters, those who pay some but not all (e.g., the minimum), and those who pay the debt off each month. The plots above use theoretical normal distributions. We can also generate data from the three distributions; here we have 1000 observations from each of the three groups. For these data we can run a discriminant analysis with PROC DISCRIM, then score a gridded data set B to produce a scored data set C with the results:

    PROC DISCRIM data=a testdata=b testout=c;   /* a = training data, b = grid to score */
       CLASS group;
       VAR X1 X2;
    run;

49, 50  In summary, data mining is a collection of tools, and in some programs, like Enterprise Miner, a graphical user interface is used to access them. The tools must work fast on large data sets. Because the data sets are large, a true model comparison can be made by reserving a validation data set on which to compare results.