Q1a) Based on the simulation-based approach, I conclude that the versicolor and virginica types satisfy the assumption of multivariate normality, while the setosa type does not. A multivariate Box-Cox transformation with powers c(0.42, 1.27, 0.73, 0.02) can be applied to the setosa type.

Q1b) Evaluate whether there is a difference between the means of the sepal/petal lengths.

Hotelling's two sample T2-test
T.2 = 111.7179, df1 = 2, df2 = 97, p-value < 2.2e-16

So there is a significant difference between the means of the sepal/petal lengths for the I. versicolor and I. virginica classes.

The simultaneous tests of hypotheses after controlling the false discovery rate at q = 0.05:

Test      1          2          3          4
p-value   1.725e-07  0.001819   < 2.2e-16  < 2.2e-16

Order             (1)        (2)        (3)        (4)
p-value           < 2.2e-16  < 2.2e-16  1.725e-07  0.001819
Threshold i*q/4   0.05/4     0.10/4     0.15/4     0.05

So both the Bonferroni method (comparing each p-value with 0.05/4) and the B-H method reject all four tests.

Q2a) To assess multivariate normality with the simulation-based approach, I first simulated samples of size 100 from the multivariate t-distribution for each of the following degrees of freedom: 1, 2, 3, 10, 30, 50, 100, and 1000. For each of these samples, I ran the simulation-based test for multivariate normality, as well as the test for multivariate normality based on the function mvnorm.etest in the energy package. The results are below.

df     Simulation-based   p-value from mvnorm.etest
1      1.68e-20           < 2.2e-16
2      3.87e-12           < 2.2e-16
3      3.35e-17           < 2.2e-16
10     0.0358             0.00068
30     0.4459             0.3526
50     0.2675             0.1835
100    0.9994             0.5225
1000   0.9999             0.5077

The results of the simulation-based test and mvnorm.etest are consistent. Both the q-value from the simulation-based approach and the p-value from the energy test tend to increase as the degrees of freedom go up. This suggests that the multivariate t-distribution gets closer to the multivariate normal distribution as the degrees of freedom increase.
At the 0.05 level, multivariate normality is not rejected when the degrees of freedom are 30, 50, 100, and 1000. At 10 degrees of freedom both tests still give values below 0.05 (0.0358 and 0.00068), so normality is rejected there as well.

Q3) The pair that is easy to separate is path and grass, while the pair that is hard to distinguish is brickface and window. The correlation heat maps for these two pairs are shown below.

[Figure: correlation heat maps for PATH/GRASS and BRICKFACE/WINDOW]

After checking the correlations, I deleted variables 10, 11, 13, and 16 for the PATH/GRASS pair, and variables 5, 9, 11, 12, 13, 18, and 19 for the BRICKFACE/WINDOW pair. I then used the remaining variables to test for a difference between the means within each pair of groups.

PATH/GRASS

Hotelling's two sample T2-test
T.2 = 1312.365, df1 = 14, df2 = 45, p-value < 2.2e-16

So the means of PATH and GRASS are significantly different.

The simultaneous tests of hypotheses after controlling the false discovery rate at q = 0.05:

Test      1        2          3          4          5          6          7
p-value   0.3342   0.00331    0.1715     0.02745    0.0003443  0.8157     0.0006125

Test      8        9          10         11         12         13         14
p-value   0.05669  < 2.2e-16  1.617e-09  < 2.2e-16  < 2.2e-16  9.875e-10  < 2.2e-16

The results from the B-H procedure show that all variables except REGION.CENTROID.COL, SHORT.LINE.DENSITY.5, VEDGE.SD, and HEDGE.SD are significantly different between PATH and GRASS.

BRICKFACE/WINDOW

Hotelling's two sample T2-test
T.2 = 50.1939, df1 = 10, df2 = 49, p-value < 2.2e-16

So the means of BRICKFACE and WINDOW are significantly different.

The simultaneous tests of hypotheses after controlling the false discovery rate at q = 0.05:

Test      1          2        3        4        5        6
p-value   1.108e-05  0.6133   0.3215   0.5557   0.9395   0.6392

Test      7          8          9          10
p-value   0.03679    1.857e-08  4.739e-06  0.1188

The results from the B-H procedure show that the variables REGION.CENTROID.COL, EXRED.MEAN, and HUE.MEAN are significantly different between BRICKFACE and WINDOW.
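
As a cross-check on the B-H decisions reported above, the step-up rule can be sketched in a few lines of Python. This is only an illustrative sketch, not necessarily the code used to produce the results: the function name benjamini_hochberg is hypothetical, and p-values printed as "< 2.2e-16" are entered as 2.2e-16, which is an assumption (an upper bound on the true value). The example uses the 14 PATH/GRASS p-values.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the B-H step-up rule rejects."""
    m = len(p_values)
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k (1-based) whose ordered p-value satisfies p_(k) <= k*q/m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


# PATH/GRASS p-values for tests 1..14; "< 2.2e-16" is entered as 2.2e-16
# (assumption: the reported bound is used in place of the exact value).
path_grass_p = [0.3342, 0.00331, 0.1715, 0.02745, 0.0003443, 0.8157, 0.0006125,
                0.05669, 2.2e-16, 1.617e-09, 2.2e-16, 2.2e-16, 9.875e-10, 2.2e-16]

reject = benjamini_hochberg(path_grass_p, q=0.05)
not_rejected = [i + 1 for i, r in enumerate(reject) if not r]
print(not_rejected)  # test numbers that B-H does not reject: [1, 3, 6, 8]
```

Under these assumptions the rule leaves four of the 14 tests unrejected, which matches the count of four non-significant variables stated for PATH/GRASS.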