Computer lab 1: Examining multivariate data

732A37 Multivariate Statistical Methods, Autumn Semester 2015
The first step in any data analysis is a careful examination of the raw data. With
multivariate data this can be a rather tedious task, but it is absolutely necessary for a
successful analysis.
Learning objectives
After reading the recommended text and completing the computer lab the student shall be
able to:
- give brief descriptions of a multivariate data set
- study the structure/relationships between the variables
- examine possible extreme values or even outliers
- understand and use different multivariate residuals
Recommended reading
Chapters 1-2 in Johnson-Wichern
Assignment 1: Description of each variable
Consider the data set in Table 1.9, National track records for women. For 55 different
countries we have the national records as of 1984 in seven events (100 m, 200 m, 400 m,
800 m, 1500 m, 3000 m and marathon).
Use SAS or some other software for the following analyses:
a) Describe the seven variables with mean values, standard deviations etc.
b) Illustrate the variables with different graphs, e.g. box plots or dot plots. Can you
identify any apparent extreme values? Are the variables normally distributed?
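
As a minimal sketch of part a), assuming Python with NumPy is the "other software", the summary statistics can be computed column by column. The array `X`, the labels in `names` and all numbers are made up for illustration; they are not the Table 1.9 records. The 1.5 * IQR rule is the usual box-plot convention for flagging points, not a formal outlier test.

```python
import numpy as np

# Made-up toy data (NOT the Table 1.9 records): rows = countries, columns = events.
X = np.array([
    [11.0, 22.0, 50.0],
    [11.5, 23.0, 52.0],
    [12.0, 24.5, 55.0],
    [11.2, 22.4, 51.0],
    [13.0, 27.0, 60.0],
])
names = ["100m", "200m", "400m"]  # hypothetical column labels

mean = X.mean(axis=0)
std = X.std(axis=0, ddof=1)  # sample standard deviation (divisor n - 1)
q1, median, q3 = np.percentile(X, [25, 50, 75], axis=0)

# Box-plot rule of thumb: flag values more than 1.5 * IQR outside the quartiles.
iqr = q3 - q1
outside = (X < q1 - 1.5 * iqr) | (X > q3 + 1.5 * iqr)

for j, name in enumerate(names):
    flagged = np.flatnonzero(outside[:, j]).tolist()  # row indices of flagged values
    print(f"{name}: mean={mean[j]:.2f} sd={std[j]:.2f} "
          f"median={median[j]:.2f} flagged rows={flagged}")
```

With the real data, `X` would hold the 55 x 7 table, and the flagged rows give a first list of candidates to inspect in the plots.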
Assignment 2: Relationships between the variables
a) Compute the covariance and correlation matrices for the seven variables (save the
matrices for future use). Can you see any interesting structure in the matrices?
b) Go further and study scatterplots of each pair of variables (matrix plot).
Are there any apparent extreme values?
c) Also use three-dimensional scatterplots for the analysis. If necessary, rotate the
axes for a better view.
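
A minimal sketch of part a) in Python with NumPy, again on made-up numbers rather than the Table 1.9 records. It computes the sample covariance and correlation matrices with `np.cov` and `np.corrcoef`, and repeats the computation with explicit matrix operations as a cross-check. The variable names are illustrative assumptions.

```python
import numpy as np

# Made-up toy data (NOT the Table 1.9 records): rows = countries, columns = events.
X = np.array([
    [11.0, 22.1, 50.0],
    [11.5, 23.0, 53.5],
    [12.0, 24.6, 54.0],
    [11.2, 22.3, 52.5],
    [13.0, 27.0, 60.0],
    [11.8, 23.2, 51.0],
])

# rowvar=False: each column is a variable, each row an observation.
S = np.cov(X, rowvar=False)        # sample covariance matrix (divisor n - 1)
R = np.corrcoef(X, rowvar=False)   # correlation matrix

# The same matrices from explicit matrix operations.
n = X.shape[0]
Xc = X - X.mean(axis=0)            # centred data
S_manual = Xc.T @ Xc / (n - 1)
d = np.sqrt(np.diag(S_manual))     # standard deviations
R_manual = S_manual / np.outer(d, d)

print(np.round(S, 3))
print(np.round(R, 3))
```

Keeping `S` and `R` in variables (or writing them to disk with `np.save`) is one way to "save the matrices for future use".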
Assignment 3: Examine possible extreme values
a) Look at the scatterplots above (both 2D and 3D). Which 3-4 countries are most
extreme, and why?
To have a measure of “extremeness”, we want to define a distance between an
observation and the mean vector. Such a distance can be termed a multivariate
residual for the observation in question.
b) The simplest multivariate residual is the squared Euclidean distance between the
observation and the mean vector, i.e. $(x - \bar{x})'(x - \bar{x})$.
Compute the squared Euclidean distances for all 55 countries, by utilizing matrix
operations. First center the raw data by subtracting the means only, to get
$x - \bar{x}$ for each country. Copy these columns to a matrix and compute a suitable
matrix product that has the squared distances in its diagonal. Copy this diagonal to a
column. Which are the five most extreme countries?
c) The different variables have very different scales, so the distances above can be
dominated by a few variables. To avoid this we can use the squared distance
$(x - \bar{x})'V^{-1}(x - \bar{x})$, where V is a diagonal matrix with the variances in the
diagonal. The effect is that for each variable the squared difference is divided by
its variance, which gives a scale-independent distance.
It is simple to compute this measure by standardizing the raw data with both
means and standard deviations, and then computing the squared Euclidean distance for
the standardized data. Carry out these computations and conclude which countries are
the most extreme ones.
d) The most common statistical distance is the (squared) Mahalanobis distance
$(x - \bar{x})'C^{-1}(x - \bar{x})$, where C is the sample covariance matrix for the data.
With this measure we also utilize the relationships (covariances) between the variables
and not only the variances.
Compute the Mahalanobis distances by appropriate matrix operations and note
the most extreme cases.
e) Compare the results in b)-d). Some of the countries are in the upper end with all
measures and perhaps they can be classified as extreme, at least one of them. But
notice also that different distance measures give rather different results; compare,
for example, Sweden's rank under the three measures.
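
The matrix operations in b)-d) can be sketched as follows in Python with NumPy. This is an illustration under stated assumptions, not a prescribed solution. The data are made up (not the Table 1.9 records), and `countries`, `X` and the helper `ranks` are hypothetical names. In each case the squared distances sit on the diagonal of a product of the centred data matrix with itself, with a weight matrix in the middle for c) and d).

```python
import numpy as np

# Made-up toy data (NOT the Table 1.9 records): rows = countries, columns = events.
countries = ["A", "B", "C", "D", "E", "F"]  # hypothetical labels
X = np.array([
    [11.0, 22.1, 50.0],
    [11.5, 23.0, 53.5],
    [12.0, 24.6, 54.0],
    [11.2, 22.3, 52.5],
    [13.0, 27.0, 60.0],
    [11.8, 23.2, 51.0],
])

n = X.shape[0]
Xc = X - X.mean(axis=0)           # centred data: each row is (x - xbar)'

# b) squared Euclidean distances: diagonal of Xc Xc'
d2_euclid = np.diag(Xc @ Xc.T)

# c) variance-scaled distances: V is diagonal with the sample variances
S = Xc.T @ Xc / (n - 1)           # sample covariance matrix C
V_inv = np.diag(1.0 / np.diag(S))
d2_scaled = np.diag(Xc @ V_inv @ Xc.T)

# d) squared Mahalanobis distances: full inverse covariance matrix
C_inv = np.linalg.inv(S)
d2_mahal = np.diag(Xc @ C_inv @ Xc.T)

def ranks(d2):
    """Rank observations by distance, 1 = most extreme."""
    order = np.argsort(-d2)
    r = np.empty(len(d2), dtype=int)
    r[order] = np.arange(1, len(d2) + 1)
    return r

# e) compare the rank of each country under the three measures
for name, rb, rc, rd in zip(countries, ranks(d2_euclid),
                            ranks(d2_scaled), ranks(d2_mahal)):
    print(f"{name}: Euclidean={rb} scaled={rc} Mahalanobis={rd}")
```

A useful sanity check is that the distances in c) and in d) each sum to p(n - 1), where p is the number of variables. Forming the full n x n product only to read off its diagonal is wasteful for large n, but it mirrors the "copy the diagonal to a column" step described in b).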
To hand in
Solutions to the three assignments.
No later than Wednesday 11 November