Multivariate data and multivariate analysis

Multivariate data
Multivariate data are values recorded for several random variables on a number of units. The data form an n x q matrix:

              Variable 1  ...  Variable q
  Unit 1      x_11        ...  x_1q
  ...         ...              ...
  Unit n      x_n1        ...  x_nq

n - number of units
q - number of variables recorded on each unit
x_ij - value of the jth variable for the ith unit

Multivariate analysis
Multivariate statistical analysis is the simultaneous statistical analysis of a collection of variables (e.g. handling apples, oranges, and pears at the same time). It improves on separate univariate analyses of each variable by using information about the relationships between the variables. The general aim of most multivariate analyses is to uncover, display, or extract any "signal" in the data in the presence of noise and to discover what the data have to tell us.

Multivariate analysis
- Exploratory methods allow the detection of possibly unanticipated patterns in the data, opening up a wide range of competing explanations. These methods are characterized by an emphasis on graphical display and visualization of the data and by the lack of an associated probabilistic model that would allow formal inference (these are the methods discussed in this course).
- Statistical inference methods are used when the individuals in a multivariate data set have been sampled from some population and the investigator wishes to test a well-defined hypothesis about the parameters of that population's probability density function (usually the multivariate normal).

History of the development of multivariate analysis
- Late 19th century: Francis Galton and Karl Pearson quantified the relationship between offspring and parental characteristics and developed the correlation coefficient.
- Early 20th century: Charles Spearman introduced factor analysis while investigating correlated intelligence quotient (IQ) tests. Over the next two decades, Spearman's work was extended by Hotelling and by Thurstone.
- In the 1920s, Fisher's introduction of the analysis of variance was followed by its multivariate generalization, multivariate analysis of variance, based on work by Bartlett and Roy.

History of the development of multivariate analysis
- In the early years, the computational aids available to take the burden of the vast amounts of arithmetic involved in applying multivariate methods were very limited.
- In the early years of the 21st century, the wide availability of cheap and extremely powerful personal computers and laptops, together with flexible statistical software, has meant that all the methods of multivariate analysis can be applied routinely, even to very large data sets.
- The application of multivariate techniques to large data sets has been named "data mining" (the nontrivial extraction of implicit, previously unknown, and potentially useful information from data).

Example of multivariate data

individual  sex     age  IQ   depression  health     weight
1           Male    21   120  Yes         Very good  150
2           Male    43   NA   No          Very good  160
3           Male    22   135  No          Average    135
4           Male    86   150  No          Very poor  140
5           Male    60   92   Yes         Good       110
6           Female  16   130  Yes         Good       110
7           Female  NA   150  Yes         Very good  120
8           Female  43   NA   Yes         Average    120
9           Female  22   84   No          Average    105
10          Female  80   70   No          Good       100

NA - Not Available
Number of units n = 10
Number of variables q = 7
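This data set can be entered into R as a data frame so that each variable is stored with an appropriate type and NA marks the missing values. A minimal sketch (the object name hypo and the ordering of the health levels are illustrative choices):

hypo <- data.frame(
  sex = factor(c(rep("Male", 5), rep("Female", 5))),
  age = c(21, 43, 22, 86, 60, 16, NA, 43, 22, 80),
  IQ = c(120, NA, 135, 150, 92, 130, 150, NA, 84, 70),
  depression = factor(c("Yes", "No", "No", "No", "Yes",
                        "Yes", "Yes", "Yes", "No", "No")),
  health = factor(c("Very good", "Very good", "Average", "Very poor", "Good",
                    "Good", "Very good", "Average", "Average", "Good"),
                  levels = c("Very poor", "Poor", "Average", "Good", "Very good")),
  weight = c(150, 160, 135, 140, 110, 110, 120, 120, 105, 100))
summary(hypo)     # quick check of the entered values and the NAs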
Example of multivariate data
Types of measurements
Nominal - unordered categorical variables (e.g. treatment allocation, the sex of the respondent, hair color).
Ordinal - there is an ordering, but no implication of equal distance between the different points of the scale (e.g. educational level: no schooling, primary, secondary, or tertiary education).
Interval - there are equal differences between successive points on the scale, but the position of zero is arbitrary (e.g. temperature measured on the Celsius or Fahrenheit scale).
Ratio - the highest level of measurement: relative magnitudes of scores as well as the differences between them are meaningful, and the position of zero is fixed (e.g. absolute temperature (Kelvin), age, weight, and length).

Example of multivariate data
Missing values
Missing values are observations and measurements that should have been recorded but, for one reason or another, were not (e.g. non-response in sample surveys, dropouts in longitudinal data).
How to deal with missing values?
1. Complete-case analysis - omit any case with a missing value on any of the variables (not recommended, because it may lead to misleading conclusions and inferences).
2. Available-case analysis - exploit the incomplete information by using all the cases available to estimate each quantity of interest (difficulties arise when the data are not missing completely at random).
3. Multiple imputation - a Monte Carlo technique in which the missing values are replaced by m > 1 simulated versions (typically 3 < m < 10); generally the most appropriate approach.

Example of multivariate data
Covariance
The covariance of two random variables is a measure of their linear dependence:

  Cov(X_i, X_j) = E[(X_i - μ_i)(X_j - μ_j)],   where μ_i = E(X_i), μ_j = E(X_j), and E denotes expectation.

If the two variables are independent of each other, their covariance is equal to zero. Larger values of the covariance indicate a greater degree of linear dependence between the two variables.

Example of multivariate data
Covariance
If i = j, the covariance of a variable with itself is simply its variance. The variance of variable X_i is

  σ_i² = E[(X_i - μ_i)²].

In a multivariate data set with q observed variables there are q variances and q(q-1)/2 covariances.
Covariance depends on the scales on which the two variables are measured.

Example of multivariate data
Covariance matrix

        | σ_1²  σ_12  ...  σ_1q |
  Σ  =  | σ_21  σ_2²  ...  σ_2q |
        | ...   ...   ...  ...  |
        | σ_q1  σ_q2  ...  σ_q² |

σ_ij - covariance of X_i and X_j
σ_i² - variance of variable X_i
σ_ij = σ_ji

Example of multivariate data
Measure data
Measurements of chest, waist, and hips on a sample of 20 men and women. R data "Measure":

  Males (observations 1-10)
  chest  34  37  38  36  38  43  40  38  40  41
  waist  30  32  30  33  29  32  33  30  30  32
  hips   32  37  36  39  33  38  42  40  37  39

  Females (observations 11-20)
  chest  36  36  34  33  36  37  34  36  38  35
  waist  24  25  24  22  26  26  25  26  28  23
  hips   35  37  37  34  37  37  38  37  40  35
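If the "Measure" data are not already available as an R object, they can be entered directly from the table above; a minimal sketch (the object and variable names match those used in the commands that follow):

chest <- c(34, 37, 38, 36, 38, 43, 40, 38, 40, 41,
           36, 36, 34, 33, 36, 37, 34, 36, 38, 35)
waist <- c(30, 32, 30, 33, 29, 32, 33, 30, 30, 32,
           24, 25, 24, 22, 26, 26, 25, 26, 28, 23)
hips  <- c(32, 37, 36, 39, 33, 38, 42, 40, 37, 39,
           35, 37, 37, 34, 37, 37, 38, 37, 40, 35)
gender <- factor(rep(c("male", "female"), each = 10))   # first 10 males, last 10 females
Measure <- data.frame(chest, waist, hips, gender)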
Example of multivariate data
Covariance
Calculate the covariance matrix for the numerical variables chest, waist, and hips in the "Measure" data. We remove the categorical variable "gender" (column 4):

> cov(Measure[,1:3])
          chest     waist     hips
chest  6.631579  6.368421 3.052632
waist  6.368421 12.526316 3.684211
hips   3.052632  3.684211 5.894737

Example of multivariate data
Covariance
Covariance matrix for gender "female" in the "Measure" data:

> cov(subset(Measure, Measure$gender=="female")[,c(1:3)])
         chest    waist     hips
chest 2.277778 2.166667 1.500000
waist 2.166667 2.988889 2.633333
hips  1.500000 2.633333 2.900000

Covariance matrix for gender "male" in the "Measure" data:

> cov(subset(Measure, Measure$gender=="male")[,c(1:3)])
          chest     waist     hips
chest 6.7222222 0.9444444 3.944444
waist 0.9444444 2.1000000 3.077778
hips  3.9444444 3.0777778 9.344444

Example of multivariate data
Correlation
Correlation is independent of the scales on which the two variables are measured. The correlation coefficient ρ_ij is the covariance divided by the product of the standard deviations of the two variables:

  ρ_ij = σ_ij / (σ_i σ_j),   where σ_i = √(σ_i²) is the standard deviation of X_i.

The correlation coefficient lies between -1 and +1 and gives a measure of the linear relationship between the variables X_i and X_j. It is positive if high values of X_i are associated with high values of X_j and negative if high values of X_i are associated with low values of X_j.

Example of multivariate data
Correlation
Correlation matrix for the "Measure" data:

> cor(Measure[,1:3])
          chest     waist      hips
chest 1.0000000 0.6987336 0.4882404
waist 0.6987336 1.0000000 0.4287465
hips  0.4882404 0.4287465 1.0000000

Example of multivariate data
Distances
The distance between units in the data is of considerable importance in some multivariate techniques. The most commonly used measure is the Euclidean distance

  d_ij = √( Σ_{k=1}^{q} (x_ik - x_jk)² ),

where x_ik and x_jk, k = 1, ..., q, are the variable values for units i and j.
When the variables in a multivariate data set are on different scales, they should be standardized before the distances are calculated (e.g. divide each variable by its standard deviation).

Example of multivariate data
Distances
The distance matrix for the first 12 observations in the "Measure" data after standardization:

> dist(scale(Measure[,1:3], center = FALSE))
      1    2    3    4    5    6    7    8    9   10   11
2  0.17
3  0.15 0.08
4  0.22 0.07 0.14
5  0.11 0.15 0.09 0.22
6  0.29 0.16 0.16 0.19 0.21
7  0.32 0.16 0.20 0.13 0.28 0.14
8  0.23 0.11 0.11 0.12 0.19 0.16 0.13
9  0.21 0.10 0.06 0.16 0.12 0.11 0.17 0.09
10 0.27 0.12 0.13 0.14 0.20 0.06 0.09 0.11 0.09
11 0.23 0.28 0.22 0.33 0.19 0.34 0.38 0.25 0.24 0.32
12 0.22 0.24 0.18 0.28 0.18 0.30 0.32 0.20 0.20 0.28 0.06

Example of multivariate data
Multivariate normal density function
The multivariate normal density function for two variables x_1 and x_2 is

  f(x_1, x_2; μ_1, μ_2, σ_1², σ_2², ρ) = (2π σ_1 σ_2 √(1 - ρ²))⁻¹
      exp{ -[ ((x_1 - μ_1)/σ_1)² - 2ρ ((x_1 - μ_1)/σ_1)((x_2 - μ_2)/σ_2) + ((x_2 - μ_2)/σ_2)² ] / (2(1 - ρ²)) }

μ_1 and μ_2 - population means of the two variables
σ_1² and σ_2² - population variances
ρ - population correlation between the two variables X_1 and X_2
Linear combinations of the variables are themselves normally distributed.

Example of multivariate data
Multivariate normal density function
Methods for assessing multivariate normality:
- normal probability plots of each variable separately;
- converting each multivariate observation to a single number before plotting; e.g. each q-dimensional observation x_i can be converted into a generalized (squared) distance d_i², a measure of the distance of that observation from the mean vector x̄ of the complete sample:

  d_i² = (x_i - x̄)ᵀ S⁻¹ (x_i - x̄),

where S is the sample covariance matrix. If the observations come from a multivariate normal distribution, these distances have approximately a chi-squared distribution with q degrees of freedom, denoted χ²_q (a small example is sketched below).
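As a small illustration, the sketch below applies the generalized distance d_i² to the three measurement variables of the "Measure" data (q = 3, so the reference distribution is a chi-squared with 3 degrees of freedom); the object names x, cm, S, and d2 are arbitrary local choices:

x <- Measure[, 1:3]
cm <- colMeans(x)                 # sample mean vector
S <- cov(x)                       # sample covariance matrix
d2 <- apply(x, 1, function(row) t(row - cm) %*% solve(S) %*% (row - cm))
# plot the ordered distances against chi-squared quantiles (q = 3 degrees of freedom)
qqplot(qchisq(ppoints(nrow(x)), df = 3), d2,
       xlab = expression(chi[3]^2 ~ "Quantile"),
       ylab = "Ordered distances")
abline(a = 0, b = 1)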
Example of multivariate data
Air pollution data
Air pollution in 41 cities in the USA. R data "USairpollution".
Variables:
SO2: SO2 content of air in micrograms per cubic meter
temp: average annual temperature in degrees Fahrenheit
manu: number of manufacturing enterprises employing 20 or more workers
popul: population size (1970 census) in thousands
wind: average annual wind speed in miles per hour
precip: average annual precipitation in inches
predays: average number of days with precipitation per year

Multivariate normal density function
Read the "USairpollution" data with the first column as row names:

> USairpollution=read.csv("E:/Multivariate_analysis/Data/USairpollution.csv", header=T, row.names=1)

Normal probability plots for the "manu" and "popul" variables in the "USairpollution" data:

> qqnorm(USairpollution$manu, main="manu")
> qqline(USairpollution$manu)
> qqnorm(USairpollution$popul, main="popul")
> qqline(USairpollution$popul)

[Figure: normal probability plots (sample quantiles against theoretical quantiles) for manu and popul.]

Multivariate normal density function
Normal probability plots for each variable separately in the "USairpollution" data:

layout(matrix(1:8, nc=2))
sapply(colnames(USairpollution), function(x) {
  qqnorm(USairpollution[[x]], main=x)
  qqline(USairpollution[[x]])
})

[Figure: normal probability plots for SO2, temp, manu, popul, wind, precip, and predays.]

The plots for SO2 concentration and precipitation both deviate considerably from linearity, and the plots for manufacturing and population show evidence of a number of outliers.

Multivariate normal density function
Chi-square plot:

> x=USairpollution
> cm=colMeans(x)
> S=cov(x)
> d=apply(x, 1, function(x) t(x-cm) %*% solve(S) %*% (x-cm))
> plot(qc <- qchisq((1:nrow(x)-1/2)/nrow(x), df=6), sd <- sort(d),
       xlab=expression(paste(chi[6]^2, "Quantile")),
       ylab="Ordered distances", xlim=range(qc)*c(1,1.1))
> oups=which(rank(abs(qc-sd), ties="random") > nrow(x)-3)
> text(qc[oups], sd[oups]-1.5, names(oups))
> abline(a=0, b=1)

Multivariate normal density function
Plotting the ordered distances against the corresponding quantiles of the appropriate chi-square distribution should give a straight line through the origin. The chi-square plot is also useful for detecting outliers (e.g. Chicago, Phoenix, Providence).

[Figure: chi-square plot of the ordered generalized distances for the USairpollution data, with Chicago, Phoenix, and Providence flagged as outliers.]
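The same generalized distances can also be obtained with the built-in mahalanobis() function; the sketch below flags cities whose distance exceeds the 97.5% quantile of a chi-squared distribution with 7 degrees of freedom (one per variable); both the cutoff and the use of df = 7 here are illustrative choices:

> d2 <- mahalanobis(USairpollution, colMeans(USairpollution), cov(USairpollution))
> sort(d2, decreasing = TRUE)[1:3]          # three largest generalized distances
> names(which(d2 > qchisq(0.975, df = 7)))  # cities beyond the 97.5% cutoff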