Exploring Multivariate Data
• Supervised Classification: forests, neural networks
• Clustering: k-means, model-based, self-organizing maps

Supervised Classification
• Build a model to predict the class (group) for future data
• Examples: spam filters, face recognition, ...

Preparation
• Many data mining methods do not have inbuilt cross-validation techniques
• Error on new data may be higher than error on the data the model was trained on
• Hold out a sample to use for assessing the model

Plot the Data
[Scatterplot matrix of the eight fatty acids: palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, eicosenoic]
• No obvious clustering
• Some variables strongly associated, e.g. palmitic and palmitoleic

Plot the Data
[Parallel coordinate plot of the fatty acids, colored by area: South Apulia, Sicily, North Apulia, Calabria]
• Areas S. Apulia, N. Apulia, Calabria are different from each other
• Area Sicily overlaps with all

Trees
• Sequentially split the data to get subsets that have "pure" classes
• Training error: 14/217 = 0.065
• Test error: 18/106 = 0.170
[Tree diagram: splits on linoleic < 950.5, palmitoleic >= 95.5, stearic >= 258, linolenic >= 37, oleic >= 7724; leaves labeled Sicily, Calabria, South Apulia, North Apulia, Sicily, South Apulia]

Random Forests
A random forest is a classifier built from multiple trees, generated by randomly sampling cases and variables. Forests are computationally intensive but retain some of the interpretability of trees. Several parameters control the algorithm, and random forests output numerous diagnostics.
http://www.math.usu.edu/~adele/forests/index.htm is a good site for more information

Random Forests
• Inputs: number of variables, number of trees
• Diagnostics returned: error rate, variable importance, ...

Random Forests
• Training error: 11/217 = 0.051
• Test error: 13/106 = 0.123
• Importance of variables: linoleic, palmitoleic, oleic, ...
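The random forest workflow on these slides (hold out a test set, fit, check training/test error and variable importance) can be sketched in Python with scikit-learn. The olive oil data is not bundled here, so synthetic stand-in data with the same shape (8 predictors, 217 training and 106 test cases) is used; `n_estimators` and `max_features` play the roles of the "number of trees" and "number of variables" inputs mentioned above.

```python
# Sketch of the slide workflow: hold out a test set, fit a random forest,
# then inspect error rates and variable importance.
# Synthetic stand-in data; the olive oil data is not available here.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 323                                  # 217 training + 106 test, as on the slides
X = rng.normal(size=(n, 8))              # stand-in for the 8 fatty-acid variables
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # stand-in for the area label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=106, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,    # number of trees
    max_features=2,      # variables sampled at each split
    random_state=0)
forest.fit(X_train, y_train)

train_err = 1 - forest.score(X_train, y_train)
test_err = 1 - forest.score(X_test, y_test)
importance = forest.feature_importances_   # one value per variable, sums to 1
```

On the real data you would report `train_err`, `test_err`, and rank the variables by `importance`, as the slides do.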
Neural Networks
[Scatterplot of palmitoleic (50–250) against linoleic (600–1400), colored by area: South Apulia, Sicily, North Apulia, Calabria]

Feed-forward Neural Networks
Neural networks are loosely based on the neuron systems in organisms, where dendrites pass information along a network based on a chemical threshold. As the level of a chemical builds up in a neuron, it approaches a threshold at which it fires off the chemical signal to the next neuron.

Feed-forward neural networks (FFNN) were developed from this concept: combining small components is a way to build a model from predictors to response. They actually generalize linear regression functions. A simple network model, as produced by the nnet code in S-Plus, may be represented by the equation:

ŷ = f(x) = φ(α + Σ_{h=1}^{s} w_h φ(α_h + Σ_{i=1}^{p} w_{ih} x_i))

where x is the vector of explanatory variable values (p of them), y is the target value, s is the number of nodes in the single hidden layer, and φ is a fixed function, usually a linear or logistic function. This model has a single hidden layer and univariate output values.
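The equation above can be evaluated directly, which makes the structure concrete. This is a minimal numpy sketch with a logistic φ; the weights are arbitrary illustrative values, not a fitted model.

```python
# Direct evaluation of the single-hidden-layer FFNN equation above,
# with a logistic phi. Weights are arbitrary illustrative values.
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def ffnn_predict(x, alpha, w, alpha_h, W, phi=logistic):
    """y-hat = phi(alpha + sum_h w_h * phi(alpha_h + sum_i w_ih * x_i)).

    x: (p,) inputs; alpha: output bias; w: (s,) hidden-to-output weights;
    alpha_h: (s,) hidden biases; W: (s, p) input-to-hidden weights.
    """
    hidden = phi(alpha_h + W @ x)    # one activation per hidden node
    return phi(alpha + w @ hidden)   # combine hidden nodes into the output

# p = 3 explanatory variables, s = 2 hidden nodes
x = np.array([0.5, -1.0, 2.0])
W = np.array([[0.1, 0.2, -0.3],
              [0.4, -0.5, 0.6]])
alpha_h = np.array([0.0, 0.1])
w = np.array([1.0, -1.0])
alpha = 0.2
y_hat = ffnn_predict(x, alpha, w, alpha_h, W)
```

With a linear φ and no hidden layer this collapses to ordinary linear regression, which is the sense in which the model generalizes it.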
The response variable can also be multivariate. A simple linear regression model, y = w_0 + Σ_{j=1}^{p} w_j x_j, can be represented as a feed-forward neural network.

Neural Networks
• Choose the number of nodes in the hidden layer, the amount of smoothing of the boundary, ...
• Fit by minimizing a loss (or error) function
• Difficult to fit, and possible to overfit
• Depends on the initial random start

Neural Networks
• After several starts, the best minimum value is 9.77 (Save this model!!!)
• Training error: 2/217 = 0.009
• Test error: 12/106 = 0.113

Your turn
For the music data, using type as the response, and variables lvar, lave, lmax, lfener, lfreq as explanatory variables:
• Fit a random forest
• Fit a neural network
Report the training errors and, for the forest, which variables are the most important. Also predict the 5 unlabeled tracks as either Classical or Rock.

Cluster analysis
• Group cases together according to a measure of similarity
• Examples: market segmentation, gene function, ...
• What are the clusters in these datasets?

Cluster analysis
• To define "similar", we need a distance measure
• Euclidean distance between A and B:
  d(A, B) = √((A1 − B1)² + (A2 − B2)² + ⋯ + (Ap − Bp)²)

Model-based clustering
• Fit a mixture of Gaussians
• Candidate variance models include EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV, VVV

Plot the Data
[Scatterplot matrix of glucose, insulin, sspg]
• No obvious clusters
• A concentration of points, and two strings of points

Model-based Clustering
[Plot of BIC against number of components (2–8) for each variance model: EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV, VVV]
• The highest BIC value corresponds to a model with 3 clusters and the unconstrained (VVV) variance model

Model-based Clustering
[Scatterplot matrix of glucose, insulin, sspg, colored by cluster]
• The three clusters correspond to one with low values on all variables (green), one high on glucose and insulin (blue), and one high on sspg (orange)

Self-organizing maps
• Fit k-means to the data
• Constrain the means to lie on a 2D grid
• Fitting can be tricky! Check the sum of squared differences between points and means after fitting.

Self-organizing Maps
[SOM grid for glucose, insulin, sspg]
• Suggests 3 connected clusters: one with low values on all variables, one with high values on glucose and insulin, one with high values on sspg

Your turn
For the music data, use model-based clustering and self-organizing maps to group the music clips.
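The BIC-based model selection described above can be sketched with scikit-learn's `GaussianMixture`. Its `covariance_type` options map only roughly onto a subset of the mclust model family ("spherical" ~ EII/VII, "diag" ~ VVI, "tied" ~ EEE, "full" ~ VVV), and note the sign convention differs: sklearn's `bic()` is smaller-is-better, whereas mclust's BIC is maximized. Synthetic data stands in for the glucose/insulin/sspg variables.

```python
# Sketch of BIC-based selection of a Gaussian mixture model, as on the
# slides, using scikit-learn. covariance_type maps roughly onto a subset
# of the mclust models: "spherical"~EII/VII, "diag"~VVI, "tied"~EEE,
# "full"~VVV. sklearn's bic() is smaller-is-better (opposite of mclust).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Synthetic stand-in for the diabetes data (glucose, insulin, sspg):
# three groups, one tight and two more spread out.
X = np.vstack([
    rng.normal([0, 0, 0], 0.5, size=(80, 3)),
    rng.normal([4, 4, 0], 1.0, size=(60, 3)),
    rng.normal([0, 0, 4], 1.0, size=(60, 3)),
])

best = None
for k in range(2, 9):                    # number of components, 2..8
    for cov in ["spherical", "diag", "tied", "full"]:
        gm = GaussianMixture(n_components=k, covariance_type=cov,
                             n_init=3, random_state=0).fit(X)
        bic = gm.bic(X)
        if best is None or bic < best[0]:
            best = (bic, k, cov, gm)

best_bic, best_k, best_cov, best_model = best
labels = best_model.predict(X)           # cluster assignment per case
```

On the real data, inspecting `labels` against the original variables is what lets you describe the clusters (low on everything, high on glucose and insulin, high on sspg).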