Homework 5 – Due March 25, 2011

1. Consider the Iris dataset. For each of the three classes, evaluate quantitatively the feasibility of the assumption of the multivariate normal distribution. [5 points]

(a) For any class for which the assumption is suspect, try to find a multivariate Box-Cox transform that will transform the variables to be from a multivariate normal distribution. Since there is a range of parameters $\lambda_i$ that can transform the data in each dimension, you may consider those $\lambda_i$ that are closest to 1. [13 points]

(b) Evaluate whether there is a difference between the means of the sepal/petal lengths for the I. versicolor and the I. virginica classes. Also use Bonferroni’s method ($\alpha = 0.05$) and Benjamini and Hochberg’s method for controlling the false discovery rate at the 5% level to determine which of the petal or sepal lengths and widths distinguish the two classes. [12 points]

2. Our next goal will be a limited investigation of the performance of the practical simulation-based approach discussed in class for determining multivariate normality of a sample. Our basic approach in this assignment will be to investigate how well it distinguishes samples from fatter-tailed distributions from normal samples, and to see how it performs as the “fat” parameter is tamped down. To do this, we need some additional background.

(a) The multivariate t-distribution with $\nu$ degrees of freedom. The p-variate t-distribution with $\nu$ degrees of freedom (see Lange, et al., 1989), mean parameter $\mu$ and dispersion $\Sigma$ is denoted by $t_p(\mu, \Sigma, \nu)$ and has density

$$f_\nu(x \mid \mu, \Sigma) \propto |\Sigma|^{-1/2} \{\nu + (x - \mu)' \Sigma^{-1} (x - \mu)\}^{-(\nu + p)/2}, \quad x \in \mathbb{R}^p.$$

A random vector $V$ has the $t_p(\mu, \Sigma, \nu)$ density if, analogous to the univariate case, $V = \mu + \Sigma^{1/2} Z / \sqrt{U/\nu}$, where $Z \sim N_p(0, I_p)$ is independent of $U \sim \chi^2_\nu$. This representation makes it possible to obtain realizations from the multivariate t-distribution.
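The normal/chi-square representation in 2(a) can be sketched in base R as follows. This is only an illustration under stated assumptions: the function name rmvt_sketch is hypothetical, the Cholesky factor of $\Sigma$ stands in for $\Sigma^{1/2}$, and the chi-square draw is divided by $\nu$, the usual scaling under which the representation matches the density $f_\nu$ given above.

```r
# Sketch: n draws from t_p(mu, Sigma, nu) via V = mu + Sigma^{1/2} Z / sqrt(U / nu).
# `rmvt_sketch` is a hypothetical name, not a function from any package.
rmvt_sketch <- function(n, mu, Sigma, nu) {
  p <- length(mu)
  # chol() returns upper-triangular R with t(R) %*% R = Sigma,
  # so each row of Z %*% R has covariance Sigma.
  R <- chol(Sigma)
  Z <- matrix(rnorm(n * p), nrow = n, ncol = p)  # rows are N_p(0, I_p) draws
  U <- rchisq(n, df = nu)                        # chi-square draws, independent of Z
  X <- (Z %*% R) / sqrt(U / nu)                  # row i is divided by sqrt(U[i] / nu)
  sweep(X, 2, mu, "+")                           # add mu to every row
}

set.seed(501)
V <- rmvt_sketch(100, mu = rep(1, 5), Sigma = diag(5), nu = 10)
dim(V)
```

Each row of V is one realization. For large $\nu$ the scale factor $\sqrt{U/\nu}$ concentrates near 1, so the draws approach those of $N_p(\mu, \Sigma)$.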
The above provides an approach to simulating random vector-valued realizations from the multivariate t-distribution. You may use it to develop your own function, or use the functions (e.g., rmvt) available in contributed R packages (e.g., mvtnorm). Our strategy therefore will be as follows: generate 100 samples each from $t_p(\mu, \Sigma, \nu)$ for $\nu = 1, 2, 3, 10, 30, 50, 100, 1000$ and $p = 5$, and evaluate our simulation-based approach on each of these datasets. We will assume that $\mu = (1, 1, 1, 1, 1)'$, while $\Sigma$ is randomly generated from the Wishart distribution $W_p(\nu, I_p)$ with $\nu = 6$ (here $\nu$ denotes the Wishart degrees of freedom, not the degrees of freedom of the t-distribution). To do so, we can make use of the following result:

Stat 501, Spring 2011 – Maitra

(b) Generating realizations from the Wishart distribution using Bartlett’s decomposition. Bartlett (1939) provided the following decomposition of a realization from the Wishart distribution $W_p(\nu, I_p)$. Let $B = ((b_{ij}))$ be the lower triangular matrix of independent random variables with entries $b_{ii} = \sqrt{X_i}$, with $X_i \sim \chi^2_{\nu - i + 1}$, and $b_{ij} \sim N(0, 1)$ for $i > j$. Then $W = BB'$ is a realization from $W_p(\nu, I_p)$. Since $W \sim W_p(\nu, I_p)$ implies that $TWT' \sim W_p(\nu, \Sigma)$, where $\Sigma = TT'$, the above provides an easy, practical approach to simulating realizations from the Wishart distribution. Once again, you may use it to write your own R function, or use functions (such as rwishart or rWishart) from contributed R packages (such as mixAK or bayesm).

For each of the simulated datasets, use the practical simulation-based approach to decide whether the dataset is normally distributed. Then use the function mvnorm.etest (from the energy package in R) with R = 99,999 to test multivariate normality by an alternative approach. Summarize the results for the different values of $\nu$ in a table.
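Bartlett’s decomposition from 2(b) can be sketched in base R as follows. The function name rwishart_bartlett is hypothetical, and taking $T$ to be the lower-triangular Cholesky factor of $\Sigma$ is an assumption made for the illustration; any $T$ with $TT' = \Sigma$ would serve.

```r
# Sketch of Bartlett's decomposition; `rwishart_bartlett` is a hypothetical name.
rwishart_bartlett <- function(nu, Sigma) {
  p <- nrow(Sigma)
  stopifnot(nu > p - 1)                    # every chi-square below needs positive df
  B <- matrix(0, p, p)
  B[lower.tri(B)] <- rnorm(p * (p - 1) / 2)             # b_ij ~ N(0, 1) for i > j
  diag(B) <- sqrt(rchisq(p, df = nu - seq_len(p) + 1))  # b_ii^2 ~ chi^2_{nu - i + 1}
  W <- B %*% t(B)                          # W ~ W_p(nu, I_p)
  Tmat <- t(chol(Sigma))                   # lower-triangular T with T T' = Sigma
  Tmat %*% W %*% t(Tmat)                   # T W T' ~ W_p(nu, Sigma)
}

set.seed(501)
Sigma <- rwishart_bartlett(nu = 6, Sigma = diag(5))
round(Sigma, 2)
```

With the scale matrix set to $I_5$ and $\nu = 6$, one call yields a random dispersion matrix of the kind described in the simulation strategy above.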
(The mvnorm.etest function takes an alternative approach to testing for multivariate normality, based on the energy functions proposed by Székely and Rizzo, 2005; its exact details are outside the purview of the syllabus for this class.) [50 points]

3. The file image.dat, available in the datasets section of the class website, consists of 19 attributes on 30 observations for each of seven image types. These twenty attributes (region type plus the 19 attributes) are provided in the file. The first row of the file contains the names of the region types and the attributes. The fourth attribute of the file (REGION.PIXEL.COUNT) is redundant, so please remove this column from your calculations. It is absolutely important that you do this. Answer the following questions:

(a) From the discussion of the first mid-term exam, we have identified one pair of image types that stand out from each other and one pair of image types that are hard to distinguish from each other. For each of these cases, provide the correlation matrix. Since it is not always easy to read through hundreds of pairs of observations, I have provided a graphical aid to display these correlations. The function is called displaycorr; note, however, that before you proceed you will need to remove any column in which the variable takes the same value for every observation. Remove, or collapse the effects of, the highly (positively or negatively) correlated variables. Use the resulting summarized variables to test for significant differences between the means in each of the two pairs of groups that you have identified. If the means are significantly different, identify, after controlling for false discoveries, which of the variables are significantly different in each case. [20 points]
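The Bonferroni and Benjamini-Hochberg adjustments named in questions 1(b) and 3(a) are both available through p.adjust in base R. The sketch below applies them to made-up p-values; the numbers and the names v1 to v4 are purely illustrative assumptions and are not derived from the Iris or image data. The step-up rule is also written out by hand for comparison.

```r
# Hypothetical per-variable p-values (illustrative numbers only).
pvals <- c(v1 = 0.001, v2 = 0.012, v3 = 0.030, v4 = 0.200)
alpha <- 0.05

# Decisions from the adjusted p-values
bonf_reject <- p.adjust(pvals, method = "bonferroni") <= alpha
bh_reject   <- p.adjust(pvals, method = "BH") <= alpha

# The same BH decision written as the step-up rule:
# find the largest k with p_(k) <= (k / m) * alpha, then reject the k smallest.
m <- length(pvals)
ord <- order(pvals)
below <- pvals[ord] <= seq_len(m) / m * alpha
k <- if (any(below)) max(which(below)) else 0
bh_manual <- rep(FALSE, m)
bh_manual[ord[seq_len(k)]] <- TRUE

bonf_reject
bh_reject
bh_manual
```

With these illustrative values, Bonferroni rejects for the first two variables and Benjamini-Hochberg for the first three, which shows the less conservative behaviour of false discovery rate control.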
