732A37 Multivariate Statistical Methods, Autumn Semester 2015

Computer lab 1: Examining multivariate data

The first step in any data analysis is a careful examination of the raw data. With multivariate data this can be rather tedious, but it is absolutely necessary for a successful analysis.

Learning objectives

After reading the recommended text and completing the computer lab, the student shall be able to:
- give brief descriptions of a multivariate data set
- study the structure of, and relationships between, the variables
- examine possible extreme values or even outliers
- understand and use different multivariate residuals

Recommended reading

Chapters 1-2 in Johnson-Wichern

Assignment 1: Description of each variable

Consider the data set in Table 1.9, National track records for women. For 55 different countries we have the national records in 1984 for seven variables (100, 200, 400, 800, 1500, 3000 m and marathon). Use SAS or some other software for the following analyses:

a) Describe the seven variables with mean values, standard deviations etc.

b) Illustrate the variables with different graphs, e.g. box plots or dot plots. Can you recognize any apparent extreme values? Are the variables normally distributed?

Assignment 2: Relationships between the variables

a) Compute the covariance and correlation matrices for the seven variables (save the matrices for future use). Can you see any interesting structure in the matrices?

b) Go further and study scatterplots between each pair of variables (matrix plot). Are there any apparent extreme values?

c) Also use three-dimensional scatterplots for the analysis. If necessary, rotate the axes for a better view.

Assignment 3: Examine possible extreme values

a) Look at the scatterplots above (both 2D and 3D). Which 3-4 countries are most extreme, and why?

To have a measure of “extremism”, we want to define a distance between an observation and the mean vector. Such a distance can be termed a multivariate residual for the observation in question.
b) The simplest multivariate residual is the squared Euclidean distance between the observation and the mean vector, i.e. $(x - \bar{x})'(x - \bar{x})$. Compute the squared Euclidean distances for all 55 countries by using matrix operations. First center the raw data by subtracting the means only, to get $x - \bar{x}$ for each country. Copy these columns to a matrix and compute a suitable matrix, which has the squared distances on its diagonal. Copy this diagonal to a column. Which are the five most extreme countries?

c) The variables have very different scales, so the distances above can be dominated by a few variables. To avoid this we can use the squared distance $(x - \bar{x})' V^{-1} (x - \bar{x})$, where $V$ is a diagonal matrix with the variances on the diagonal. The effect is that, for each variable, the squared difference is divided by its variance, which gives a scale-independent distance. This measure is simple to compute: standardize the raw data with both means and standard deviations, and then compute the squared Euclidean distance for the standardized data. Carry out these computations and conclude which countries are the most extreme ones.

d) The most common statistical distance is the Mahalanobis distance $(x - \bar{x})' C^{-1} (x - \bar{x})$, where $C$ is the sample covariance matrix for the data. This measure also uses the relationships (covariances) between the variables, and not only the variances. Compute the Mahalanobis distances by appropriate matrix operations and note the most extreme cases.

e) Compare the results in b)-d). Some of the countries are in the upper end with all measures, and perhaps they can be classified as extreme, at least one of them. But notice also that different distance measures give rather different results; take e.g. the Swedish “rank” with the three measures.

To hand in

Solutions to the three assignments. No later than Wednesday 11 November.
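As a possible starting point for Assignment 3 b)-d), the three squared distances can be sketched with matrix operations in Python with NumPy, one option under "some other software". The data matrix `X`, the function name `squared_distances` and the variable names are illustrative assumptions; the numbers are made up and are not the records from Table 1.9.

```python
import numpy as np

# Made-up placeholder data (5 observations, 3 variables).
# These are NOT the Table 1.9 records; replace X with the real 55 x 7 matrix.
X = np.array([
    [11.0, 22.5, 150.0],
    [11.5, 23.0, 158.0],
    [12.0, 24.5, 171.0],
    [11.2, 22.9, 149.0],
    [12.4, 25.1, 191.0],
])


def squared_distances(X):
    """Return three squared distances of each row of X from the mean vector."""
    # Centered data: row i is (x_i - xbar)'.
    D = X - X.mean(axis=0)

    # b) Squared Euclidean distances: the diagonal of D D'.
    d2_euclid = np.diag(D @ D.T)

    # c) Variance-scaled distances: V^{-1} is diagonal with 1 / s_j^2.
    V_inv = np.diag(1.0 / X.var(axis=0, ddof=1))
    d2_scaled = np.diag(D @ V_inv @ D.T)

    # d) Mahalanobis distances: C is the sample covariance (divisor n - 1).
    #    solve(C, D') gives C^{-1} D' without forming the inverse explicitly.
    C = np.cov(X, rowvar=False)
    d2_mahal = np.diag(D @ np.linalg.solve(C, D.T))

    return d2_euclid, d2_scaled, d2_mahal


d2_euclid, d2_scaled, d2_mahal = squared_distances(X)

# Row indices ordered from most to least extreme under each measure.
for name, d2 in [("Euclidean", d2_euclid),
                 ("scaled", d2_scaled),
                 ("Mahalanobis", d2_mahal)]:
    print(name, np.argsort(d2)[::-1])
```

Forming the full n x n product only to read off its diagonal is wasteful for large n, but it mirrors the wording of the exercise and is harmless for 55 countries. A useful sanity check is that the scaled and the Mahalanobis distances each sum to (n - 1)p, where n is the number of observations and p the number of variables.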