Sugata Sen Roy
Department of Statistics, University of Calcutta

What is multivariate analysis ?
It is the simultaneous study of m related variables X = (x_1, x_2, …, x_m). It
• gives a complete profile of the unit under study
• allows the inter-relationships between the variables to come into play

Examples
• (gender, age, nationality) of an individual
• (min. temp., max. temp., rainfall, humidity) on a day
• (GDP, life expectancy, literacy rate) of a country
• (blood pressure, blood sugar, haemoglobin, cholesterol, creatinine, albumin, …) of an individual

Example : National track records for men (times in seconds for 100m–400m, minutes for 800m–marathon)

Country          100m    200m    400m    800m   1500m   5000m   10000m  Marathon
Argentina        10.39   20.81   46.84   1.81   3.70    14.04   29.36   137.72
Australia        10.31   20.06   44.84   1.74   3.57    13.28   27.66   128.30
Austria          10.44   20.81   46.82   1.79   3.60    13.26   27.72   135.90
Belgium          10.34   20.68   45.04   1.73   3.60    13.22   27.45   129.95
Bermuda          10.28   20.58   45.91   1.80   3.75    14.68   30.55   146.62
Brazil           10.22   20.43   45.21   1.73   3.66    13.62   28.62   133.13
Burma            10.64   21.52   48.30   1.80   3.85    14.45   30.28   139.95
Canada           10.17   20.22   45.68   1.76   3.63    13.55   28.09   130.15
Chile            10.34   20.80   46.20   1.79   3.71    13.61   29.30   134.03
China            10.51   21.04   47.30   1.81   3.73    13.90   29.13   133.53
Colombia         10.43   21.05   46.10   1.82   3.74    13.49   28.88   131.35
Cook Is.         12.18   23.20   52.94   2.02   4.24    16.70   35.38   164.70
Costa Rica       10.94   21.90   48.66   1.87   3.84    14.03   28.81   136.58
Czechoslovakia   10.35   20.65   45.64   1.76   3.58    13.42   28.19   134.32
Denmark          10.56   20.52   45.89   1.78   3.61    13.50   28.11   130.78
Dom. Rep.        10.14   20.65   46.80   1.82   3.82    14.91   31.45   154.12
Finland          10.43   20.69   45.49   1.74   3.61    13.27   27.52   130.87
France           10.11   20.38   45.28   1.73   3.57    13.34   27.97   132.30
GDR              10.12   20.33   44.87   1.73   3.56    13.17   27.42   129.92
FRG              10.16   20.37   44.50   1.73   3.53    13.21   27.61   132.23
GB               10.11   20.21   44.93   1.70   3.51    13.01   27.51   129.13
Greece           10.22   20.71   46.56   1.78   3.64    14.59   28.45   134.60
Guatemala        10.98   21.82   48.40   1.89   3.80    14.16   30.11   139.33
Hungary          10.26   20.62   46.02   1.77   3.62    13.49   28.44   132.58
India            10.60   21.42   45.73   1.76   3.73    13.77   28.81   131.98

Multivariate analysis techniques
• can be extensions of the univariate techniques to the simultaneous study of several variables
• can be techniques which are peculiar to the case of several variables only

Three broad aspects
• Grouping of Individuals
• Dimension Reduction
• Cause-Effect Relationships

Grouping of Individuals
Q : Can the individuals be grouped according to similarity ?
• Cluster Analysis
• Discriminant Analysis and Classification

Dimension Reduction
Q : Is it necessary to look at all the variables, or can a smaller set of variables capture the same information ?
• Principal Component Analysis
• Factor Analysis
• Canonical Correlations
• Multidimensional Scaling

Cause-Effect Relationships
Q : Does a set of variables have an effect on another set of variables ? If so, how are they related ?
➢ Multivariate Linear Models
➢ Multivariate Regression
➢ MANOVA
➢ MANCOVA

Cluster Analysis
Given a group of individuals, each with a set of characteristics, how can we form groups according to their similarity ?
➢ easy for a single characteristic, or perhaps two
➢ difficult for large individual profiles
How would you arrange a pile of hats on the shelves ? And how would you group a class of children ?

What is clustering ?
An automated statistical method which segregates multi-dimensional data according to similarity and dissimilarity. It leads to
➢ either a tree with closer individuals on nearer branches,
➢ or homogeneous groups separated as far apart as possible.

Types of Clustering
Hierarchical Clustering
• Here the individuals are clustered in a hierarchically arranged tree, with similar members closer to each other than dissimilar members.
• Agglomerative or Divisive

Example : US Health Data, 1986
States : ME NH VT MA RI CT NY NJ PA
Variables : no.
of deaths due to (1) accidents, (2) cardiovascular diseases, (3) cancer, (4) pulmonary diseases, (5) pneumonia, (6) diabetes, (7) liver diseases

[Figure : dendrogram of the nine states (NJ, MA, ME, PA, NY, CT, VT, RI, NH) from hierarchical clustering; height axis from 10 to 70]

Partitioning
The individuals are grouped such that
➢ the members within a cluster are as similar among themselves as possible,
➢ the members of a cluster are as different from members of other clusters as possible.

Overlapping Clusters
Clusters which are not disjoint, so that an individual may belong to more than one cluster.

Partitioning of the US Health Data
• NH, VT, CT : low cardiovascular, cancer, diabetes, liver; high pulmonary diseases, accidents
• NY, PA : high cardiovascular, pneumonia, liver; low pulmonary diseases
• ME, MA, RI, NJ : high cancer; intermediate for all others

Distance
X = (x_1, x_2, …, x_m) : vector of m variables or characteristics
X_i, i = 1, …, n : observations corresponding to n individuals
We need to define how far the individuals are from each other, i.e. their distance from one another.

Distance between observations
Common distance measures between two individuals, say r and s :
City-block :  d_rs = Σ_{j=1}^{m} | x_rj − x_sj |
Euclidean :   d_rs = [ Σ_{j=1}^{m} ( x_rj − x_sj )² ]^{1/2}

Distance matrix
D (n×n) is symmetric with zeros on the diagonal and d_rs as the (r, s)-th entry; its lower triangle is

      0
      d_12   0
      d_13   d_23   0
      …      …      …     …
      d_1n   d_2n   d_3n  …   0

Distance between Clusters
Suppose one cluster (C1) has n1 members and another cluster (C2) has n2 members. What is the distance between the two clusters ?
Single Linkage :    δ_12 = min{ d_rs : r ∈ C1, s ∈ C2 }
Complete Linkage :  δ_12 = max{ d_rs : r ∈ C1, s ∈ C2 }
Average Linkage :   δ_12 = (1 / n1 n2) Σ_{r=1}^{n1} Σ_{s=1}^{n2} d_rs

Very often scaled distances are preferred :
Mahalanobis Distance :  d_rs² = ( X_r − X_s )' S⁻¹ ( X_r − X_s ), where S is the variance-covariance matrix.

Hierarchical Clustering
Start with n clusters of one member each. Find the closest two members and link them. Then find the next closest pair (which can be two individual members, or an individual and the first-formed group). Continue till all the members are linked.

Breakfast Cereal Data

Cereal Brand        Protein (gms)  Carbohydrates (gms)  Fat (gms)  Calories (per oz.)  Vitamin A (%)
Life                      6              19                 1            110                  0
Grape Nuts                3              23                 0            100                 25
Super Sugar Crisp         2              26                 0            110                 25
Special K                 6              21                 0            110                 25
Rice Krispies             2              25                 0            110                 25
Raisin Bran               3              28                 1            120                 25
Product 19                2              24                 0            110                100
Wheaties                  3              23                 1            110                 25
Total                     3              23                 1            110                100
Puffed Rice               1              13                 0             50                  0
Sugar Corn Pops           1              26                 0            110                 25
Sugar Smacks              2              25                 0            110                 25

[Figure : dendrogram of the twelve cereals from hierarchical clustering; height axis from 0 to 60]

Partitioning
Decide on the number of groups. Segregate the observations according to their closeness into these groups.
Criterion : since the total variability T = W + B, where
➢ W = within-cluster variability
➢ B = between-cluster variability,
segregate into the groups so as to minimize W/B or W/T.

IRIS Data (21 observations)

Obs   Sepal Length   Sepal Width   Petal Length   Petal Width
 1        5.1            3.5           1.4            0.2
 2        7.0            3.2           4.7            1.4
 3        6.3            3.3           6.0            2.5
 4        7.8            2.7           5.1            1.9
 5        6.4            3.2           4.5            1.5
 6        4.9            3.0           1.4            0.2
 7        7.1            3.0           5.9            2.1
 8        4.7            3.2           1.3            0.2
 9        6.9            3.1           4.9            1.5
10        5.5            2.3           4.0            1.3
11        6.3            2.9           5.6            1.8
12        6.5            3.0           5.8            2.2
13        7.6            3.0           6.6            2.1
14        4.6            3.1           1.5            0.2
15        5.0            3.6           1.4            0.2
16        6.5            2.8           4.6            1.5
17        6.9            2.5           5.5            1.7
18        5.4            3.9           1.7            0.4
19        7.3            2.9           6.3            1.8
20        4.6            3.4           1.4            0.3
21        5.7            2.8           4.5            1.3

Results
➢ Cluster sizes : 8, 7, 6
➢ Cluster membership (obs 1–21) : 2 3 1 1 3 2 1 2 3 3 1 1 1 2 2 3 1 2 1 2 3

Means and within-cluster sums of squares

                      Cluster 1   Cluster 2   Cluster 3
Sepal Length             6.97        4.90        6.33
Sepal Width              2.91        3.38        2.90
Petal Length             5.85        1.44        4.53
Petal Width              2.01        0.24        1.42
Within Sum of Squares    4.75        1.24        2.99

Partitioning
• Obs 3, 4, 7, 11, 12, 13, 17, 19 : Iris virginica
• Obs 1, 6, 8, 14, 15, 18, 20 : Iris setosa
• Obs 2, 5, 9, 10, 16, 21 : Iris versicolor

Looking for the elbow
[Figure : the "elbow" plot used to choose the number of clusters]

What after the clusters are formed ?
Suppose we have well-formed clusters.
➢ Identify what characterizes a particular cluster and how different it is from the other clusters : Discriminant Analysis
➢ Place a new entrant into one of the groups : Classification

Example
A new entrant, Ohio (OH), with characteristics (33.2, 443.1, 198.8, 27.4, 18.0, 18.9, 10.2), has to be classified into one of the 3 groups.
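The agglomerative procedure described earlier (a Euclidean distance matrix plus repeated merging of the closest clusters under a linkage rule) can be sketched in code. The lecture points to R's hclust() for real use; the following stdlib-only Python sketch is purely an illustration, run on four cereal profiles taken from the breakfast-cereal example, with single linkage.

```python
import math

# Profiles (protein, carbs, fat, calories, vitamin A) for four cereals,
# taken from the breakfast-cereal table in the lecture.
data = {
    "Life":        [6, 19, 1, 110, 0],
    "Grape Nuts":  [3, 23, 0, 100, 25],
    "Special K":   [6, 21, 0, 110, 25],
    "Raisin Bran": [3, 28, 1, 120, 25],
}

def euclidean(x, y):
    """Euclidean distance d_rs = sqrt( sum_j (x_rj - x_sj)^2 )."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_linkage(c1, c2):
    """delta_12 = min{ d_rs : r in C1, s in C2 }."""
    return min(euclidean(data[r], data[s]) for r in c1 for s in c2)

# Agglomerative clustering: start with n singleton clusters and
# repeatedly merge the two closest clusters until one remains.
clusters = [[name] for name in data]
merges = []
while len(clusters) > 1:
    i, j = min(
        ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
        key=lambda ab: single_linkage(clusters[ab[0]], clusters[ab[1]]),
    )
    height = round(single_linkage(clusters[i], clusters[j]), 2)
    merges.append((clusters[i], clusters[j], height))
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]

for left, right, height in merges:
    print(left, "+", right, "at height", height)
```

Complete or average linkage would change only the single_linkage() function; the merge loop is the same. (A real analysis would also consider scaling the variables first, since calories dominate these raw Euclidean distances.)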
Problem :
➢ cardiovascular rate between Gr 1 and Gr 3
➢ accident rate close to Gr 2
➢ cancer rate close to Gr 1
➢ pulmonary rate close to Gr 3
➢ pneumonia rate very low – closest is Gr 3
➢ diabetes rate very high – closest are Gr 2 and Gr 3
➢ liver rate very low – closest is Gr 1

How to classify ?
Let there be two groups, A and B, characterized by probability density functions f_A(x) and f_B(x). Let
p_A = prior probability of coming from A and p_B = prior probability of coming from B.
Probabilities of misclassification :
➢ an individual from A is classified in B : P(B|A)
➢ an individual from B is classified in A : P(A|B)

Cost of Misclassification
Costs of misclassification : c(B|A) and c(A|B)
Correct classifications : P(A|A) p_A and P(B|B) p_B
Misclassifications : P(B|A) p_A and P(A|B) p_B

Minimize the Expected Cost of Misclassification
ECM = c(B|A) P(B|A) p_A + c(A|B) P(A|B) p_B
Minimize the ECM to get the classification rule :
Classify in A if
    f_A(x) / f_B(x)  ≥  [ c(A|B) p_B ] / [ c(B|A) p_A ]
Classify in B otherwise.

Special Cases
For equal misclassification costs : classify in A if  f_A(x) / f_B(x) ≥ p_B / p_A
For equal misclassification costs and equal prior probabilities : classify in A if  f_A(x) / f_B(x) ≥ 1

Apparent Error Rate
AER = ( # misclassified from A to B + # misclassified from B to A ) / ( total # of observations )

Normal populations
Suppose under A, x ~ N(μ_A, Σ_A) and under B, x ~ N(μ_B, Σ_B). Assume Σ_A = Σ_B = Σ. Then the Linear Discriminant rule is :
Allocate x to A if
    (μ_A − μ_B)' Σ⁻¹ x  ≥  (1/2)(μ_A − μ_B)' Σ⁻¹ (μ_A + μ_B) + ln{ [ c(A|B) p_B ] / [ c(B|A) p_A ] }
Allocate x to B otherwise.

Writing
    C = (1/2)(μ_A − μ_B)' Σ⁻¹ (μ_A + μ_B) + ln{ [ c(A|B) p_B ] / [ c(B|A) p_A ] },
allocate x to A if (μ_A − μ_B)' Σ⁻¹ x ≥ C, and to B if (μ_A − μ_B)' Σ⁻¹ x < C.
If Σ_A ≠ Σ_B, we get a Quadratic Discriminant function instead.
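The linear discriminant rule for two normal populations with a common covariance matrix can be made concrete with a small numerical sketch. The group means and covariance below are made up for illustration (they are not estimates from any dataset in the lecture), and equal costs and priors are assumed, so the log term in the cutoff C vanishes.

```python
# Illustrative (made-up) parameters for two 2-dimensional normal
# groups A and B sharing a common covariance matrix Sigma.
mu_A = [5.0, 3.4]        # hypothetical mean vector of group A
mu_B = [6.5, 2.9]        # hypothetical mean vector of group B
Sigma = [[0.40, 0.10],
         [0.10, 0.20]]

def inv2(m):
    """Inverse of a 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[ m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det,  m[0][0] / det]]

def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

Sinv = inv2(Sigma)
d = [mu_A[0] - mu_B[0], mu_A[1] - mu_B[1]]   # mu_A - mu_B
# Sinv is symmetric, so dot(w, x) = (mu_A - mu_B)' Sigma^{-1} x.
w = matvec(Sinv, d)
mid = [(mu_A[0] + mu_B[0]) / 2, (mu_A[1] + mu_B[1]) / 2]

# Equal costs and priors: ln{c(A|B)p_B / c(B|A)p_A} = 0, so the
# cutoff is C = (1/2)(mu_A - mu_B)' Sigma^{-1} (mu_A + mu_B).
C = dot(w, mid)

def allocate(x):
    """Allocate x to A if (mu_A - mu_B)' Sigma^{-1} x >= C, else to B."""
    return "A" if dot(w, x) >= C else "B"

print(allocate([5.1, 3.5]))   # a point near mu_A: prints A
print(allocate([6.4, 2.8]))   # a point near mu_B: prints B
```

With unequal costs or priors, the only change is adding ln{ c(A|B) p_B / [c(B|A) p_A] } to C; with Σ_A ≠ Σ_B the boundary becomes quadratic in x rather than linear.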
R packages
• Agglomerative clustering : hclust() from stats and agnes() from cluster
• Divisive clustering : diana() from cluster
• Partitioning : kmeans() from stats and pam() from cluster
• Several other specialized packages exist.

References
Johnson, R.A. & Wichern, D.W. (2002) : Applied Multivariate Statistical Analysis, 5th ed., PHI Learning, New Delhi.
Seber, G.A.F. (2004) : Multivariate Observations, John Wiley, New York.
Anderson, T.W. (1958) : An Introduction to Multivariate Statistical Analysis, John Wiley, New York.
Muirhead, R.J. (1982) : Aspects of Multivariate Statistical Theory, John Wiley, New York.
Rencher, A.C. & Christensen, W.F. (2012) : Methods of Multivariate Analysis, 3rd ed., John Wiley, New York.
Kendall, M.G. (1975) : Multivariate Analysis, Griffin, London.
Everitt, B.S. (1993) : Cluster Analysis, 3rd ed., Edward Arnold, London.
Hand, D.J. (1981) : Discrimination and Classification, John Wiley, New York.