Homework 5 – Due March 25, 2011

1. Consider the Iris dataset. For each of the three classes, evaluate quantitatively the
feasibility of the assumption of the multivariate normal distribution. [5 points]
(a) For any class for which the assumption is suspect, try to find a multivariate Box-Cox transform that will transform the variables to be from a multivariate normal
distribution. Since there is a range of parameters λi that can transform the data
in each dimension, you may consider those λi that are closest to 1. [13 points]
(b) Evaluate whether there is a difference between the means of the sepal/petal
lengths for the I. versicolor and the I. virginica classes. Also use Bonferroni's method
(α = 0.05) and Benjamini and Hochberg's method for controlling the false discovery rate at the 5% level to determine which of the petal or sepal lengths and widths
distinguish the two classes. [12 points]
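The multiple-testing step in part (b) can be sketched in R as follows. This is only an illustration under stated assumptions: it uses the `iris` data frame that ships with R, runs one Welch two-sample t-test per variable (the choice of test is an assumption, not something the assignment prescribes), and adjusts the four p-values with `p.adjust`.

```r
# Sketch: per-variable two-sample tests for versicolor vs. virginica,
# followed by Bonferroni and Benjamini-Hochberg adjustments.
vars <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
versicolor <- iris[iris$Species == "versicolor", vars]
virginica  <- iris[iris$Species == "virginica", vars]

# Raw p-value for each variable (Welch t-test: an assumed choice of test)
p_raw <- sapply(vars, function(v) t.test(versicolor[[v]], virginica[[v]])$p.value)

# Adjusted p-values; compare each against the 0.05 level
p_bonf <- p.adjust(p_raw, method = "bonferroni")
p_bh   <- p.adjust(p_raw, method = "BH")

result <- data.frame(p_raw, p_bonf, p_bh,
                     reject_bonf = p_bonf < 0.05,
                     reject_bh   = p_bh < 0.05)
print(result)
```

Comparing adjusted p-values with 0.05 is equivalent to applying the Bonferroni or Benjamini-Hochberg thresholds to the raw p-values.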
2. Our next goal will be a limited investigation of the performance of the practical
simulation-based approach discussed in class to determining multivariate normality
of a sample. Our basic approach in this assignment will be to investigate performance
in terms of distinguishing samples from fatter-tailed distributions from the normal,
and to see how well this works as the “fat” parameter is tamped down. To do this, we will
first need some further background.
(a) The Multivariate t-distribution with ν degrees of freedom. The p-variate t-distribution
with degrees of freedom ν (see Lange, et al., 1989) with mean parameter µ and
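dispersion Σ is denoted by $t_p(\mu, \Sigma, \nu)$ and has density
$$f_\nu(x \mid \mu, \Sigma) \propto |\Sigma|^{-\frac{1}{2}} \left\{\nu + (x - \mu)' \Sigma^{-1} (x - \mu)\right\}^{-\frac{\nu + p}{2}}, \qquad x \in \mathbb{R}^p.$$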
A random vector V has the $t_p(\mu, \Sigma, \nu)$ density if, analogous to the univariate
case, $V = \mu + \Sigma^{\frac{1}{2}} Z / \sqrt{U/\nu}$, where $Z \sim N_p(0, I_p)$ is independent of $U \sim \chi^2_\nu$. This
representation makes it possible to obtain realizations from the multivariate t
distribution.
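A minimal sketch of a sampler built on this representation is given below. The function name `rmvt_sketch` is hypothetical and does not come from any package. The Cholesky factor is one assumed choice of square root of Σ, and the normal vector is divided by sqrt(U/ν), the scaling under which Σ is the dispersion in the density stated above.

```r
# Draw n realizations from t_p(mu, Sigma, nu) using
# V = mu + Sigma^{1/2} Z / sqrt(U / nu), with Z ~ N_p(0, I_p) and U ~ chi^2_nu.
# rmvt_sketch is a hypothetical name, not a function from a package.
rmvt_sketch <- function(n, mu, Sigma, nu) {
  p <- length(mu)
  A <- chol(Sigma)                      # upper triangular, t(A) %*% A = Sigma
  Z <- matrix(rnorm(n * p), nrow = n)   # each row is a draw from N_p(0, I_p)
  U <- rchisq(n, df = nu)               # one chi-square draw per row
  X <- (Z %*% A) / sqrt(U / nu)         # row i is divided by sqrt(U[i] / nu)
  sweep(X, 2, mu, "+")                  # add the mean vector to every row
}

set.seed(1)
X <- rmvt_sketch(100, mu = rep(1, 5), Sigma = diag(5), nu = 10)
dim(X)
```

Each row of the returned matrix is one realization. Contributed functions such as `rmvt` in `mvtnorm` may parameterize the location and scale differently, so check their documentation before comparing results.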
The above provides for an approach to simulating random vector-valued realizations from the multivariate t-distribution. You may use the above to develop your
own function, or use the functions (e.g., rmvt) available in contributed R packages
(e.g., mvtnorm). Our strategy therefore will be as follows: generate 100 samples
each from $t_p(\mu, \Sigma, \nu)$ for ν = 1, 2, 3, 10, 30, 50, 100, 1000 and p = 5 and evaluate
our simulation-based approach on each of these datasets. We will assume that
$\mu = (1, 1, 1, 1, 1)'$ while Σ is randomly generated from the Wishart distribution
$W_p(\nu, I_p)$, where ν = 6. To do so, we can make use of the following result:
(b) Generating realizations from the Wishart distribution using Bartlett’s decomposition. Bartlett (1939) provided the following decomposition of a realization from
the Wishart distribution $W_p(\nu, I_p)$. Let $B = ((b_{ij}))$ be the lower triangular matrix of independent random variables with entries $b_{ii} = \sqrt{X_i}$, with $X_i \sim \chi^2_{\nu - i + 1}$,
and $b_{ij} \sim N(0, 1)$ for $i > j$. Then $W = BB'$ is a realization from $W_p(\nu, I_p)$.
Since $W \sim W_p(\nu, I_p)$ implies that $TWT' \sim W_p(\nu, \Sigma)$, where $\Sigma = TT'$,
the above provides an easy approach to simulating realizations from the Wishart
distribution.
Stat 501, Spring 2011 – Maitra
The above provides a practical approach to generating samples from the Wishart
distribution. Once again, you may use the above to write your own R function,
or use functions (such as rwishart or rWishart) from contributed R packages
(such as mixAK or bayesm).
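Bartlett's decomposition as described above might be coded along the following lines. This is a sketch rather than a reference implementation: `rwishart_bartlett` is a hypothetical name, and T is taken to be the lower-triangular Cholesky factor of Σ, which is one assumed choice among the matrices satisfying Σ = TT'.

```r
# One draw from W_p(nu, Sigma) via Bartlett's decomposition.
# rwishart_bartlett is a hypothetical name, not a function from a package.
rwishart_bartlett <- function(nu, Sigma) {
  p <- nrow(Sigma)
  stopifnot(nu > p - 1)                    # every chi-square df must be positive
  B <- matrix(0, p, p)
  # b_ii = sqrt(X_i), with X_i ~ chi^2_{nu - i + 1}
  diag(B) <- sqrt(rchisq(p, df = nu - seq_len(p) + 1))
  # b_ij ~ N(0, 1) for i > j (strictly lower triangle)
  B[lower.tri(B)] <- rnorm(p * (p - 1) / 2)
  W <- B %*% t(B)                          # W ~ W_p(nu, I_p)
  Tmat <- t(chol(Sigma))                   # lower triangular, Tmat %*% t(Tmat) = Sigma
  Tmat %*% W %*% t(Tmat)                   # ~ W_p(nu, Sigma)
}

set.seed(1)
Sigma <- rwishart_bartlett(nu = 6, Sigma = diag(5))   # dispersion matrix for p = 5
```

A quick sanity check is that the average of many draws should be close to νΣ, the mean of the Wishart distribution.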
For each of the simulated datasets, use the practical simulation-based approach to
decide if the dataset is normally distributed or not. Then use the function
mvnorm.etest (from the energy package in R) with R = 99,999 to test multivariate normality by an
alternative approach. Summarize the results for
the different values of ν in a table. (The mvnorm.etest function uses an alternative
approach to testing for multivariate normality using energy functions proposed by
Székely and Rizzo, 2005; its exact details are outside the purview of the syllabus for
this class.) [50 points]
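The table for the energy-test half of this comparison could be assembled roughly as follows. Everything numeric here is an assumption chosen so the sketch runs quickly: 50 observations per dataset, 20 datasets per ν, only three values of ν, R = 199 replicates, and an identity Σ as a placeholder for the Wishart draw. The helper `draw_t` is hypothetical, and the call to `mvnorm.etest` is skipped when the `energy` package is not installed.

```r
# Sketch: proportion of simulated datasets for which the energy test
# rejects multivariate normality at the 5% level, by nu.
set.seed(501)
p <- 5; n <- 50; n_datasets <- 20; R_reps <- 199   # assumed, kept small for speed
nus <- c(1, 10, 1000)
mu <- rep(1, p)
Sigma <- diag(p)   # placeholder; the assignment draws Sigma from a Wishart

# Hypothetical helper: n draws from t_p(mu, Sigma, nu)
draw_t <- function(n, mu, Sigma, nu) {
  Z <- matrix(rnorm(n * length(mu)), nrow = n) %*% chol(Sigma)
  sweep(Z / sqrt(rchisq(n, df = nu) / nu), 2, mu, "+")
}

have_energy <- requireNamespace("energy", quietly = TRUE)
reject_rate <- sapply(nus, function(nu) {
  if (!have_energy) return(NA_real_)   # package missing: leave the cell empty
  pvals <- replicate(n_datasets,
    energy::mvnorm.etest(draw_t(n, mu, Sigma, nu), R = R_reps)$p.value)
  mean(pvals < 0.05)
})

summary_table <- data.frame(nu = nus, reject_rate = reject_rate)
print(summary_table)
```

A second column holding the decisions from the simulation-based approach discussed in class would sit alongside `reject_rate` in the final table.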
3. The file image.dat, available in the datasets section of the class website, consists of
19 attributes on 30 observations for each of seven image types. These twenty
attributes (region type plus the 19 attributes) are provided in the file. The first row
of the file contains the names of the region types and the attributes. The fourth
attribute of the file (REGION.PIXEL.COUNT) is redundant. So please remove this
column from your calculations. It is absolutely important that you do this. Answer
the following questions:
(a) From the discussion of the first mid-term exam, we have identified one pair of
image types that stand out from each other and one pair of image types that are
hard to distinguish from each other. For each of these cases, provide the correlation
matrix. Since it is not always easy to read through hundreds of pairs of observations, I have provided a graphical aid to display these correlations. The function is
called displaycorr: note, however, that you will need to remove columns in which
the variable takes the same value throughout before you proceed. Remove, or collapse
the effects of, the highly (positively or negatively) correlated variables. Use the
resulting summarized variables to test for significant differences between the means in
each of the two pairs of groups that you have identified. If the means are significantly
different, identify, after controlling for false discoveries, which of the variables are
significantly different in each case. [20 points]
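The preprocessing described in part (a), dropping constant columns and then locating highly correlated variables, can be sketched as below. The data frame is synthetic because image.dat is not reproduced here. The column names and the 0.9 cutoff are assumptions made for illustration only.

```r
# Synthetic stand-in for one pair of image types (hypothetical columns)
set.seed(1)
n <- 60
x1 <- rnorm(n)
dat <- data.frame(a = x1,
                  b = x1 + rnorm(n, sd = 0.05),   # nearly a copy of a
                  c = rnorm(n),
                  const = rep(3, n))              # takes a single value

# Drop columns that take a single value; their correlations are undefined
keep <- sapply(dat, function(col) length(unique(col)) > 1)
dat2 <- dat[, keep, drop = FALSE]

R <- cor(dat2)

# List variable pairs whose absolute correlation exceeds the cutoff
cutoff <- 0.9   # assumed threshold
idx <- which(abs(R) > cutoff & upper.tri(R), arr.ind = TRUE)
high_pairs <- data.frame(var1 = rownames(R)[idx[, "row"]],
                         var2 = colnames(R)[idx[, "col"]],
                         r = R[idx])
print(high_pairs)
```

Variables flagged in `high_pairs` are candidates to remove or to collapse (for example by averaging) before the means are tested.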