Q1a)
From the simulation-based approach, I conclude that the versicolor and virginica types
satisfy the assumption of multivariate normality, while the setosa type does not. We
can apply the multivariate Box-Cox transformation with powers c(0.42, 1.27, 0.73, 0.02) to the setosa data.
Q1b)
Evaluate whether there is a difference between the means of the sepal/petal lengths.
Hotelling's two sample T2-test
T.2 = 111.7179, df1 = 2, df2 = 97, p-value < 2.2e-16
So there is a significant difference between the means of the sepal/petal lengths for the
I. versicolor and I. virginica classes.
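The degrees of freedom reported above are consistent with the usual F reference distribution for the two-sample Hotelling test with pooled covariance. The sketch below is only a check of that arithmetic. The function name is hypothetical, and the group sizes (50 flowers per class) are an assumption, since the answer does not state them. Whether the reported T.2 is the raw statistic or its F-scaled form is also not stated, so the scale factor is returned without being applied.

```python
def hotelling_two_sample_df(n1, n2, p):
    """Return (df1, df2, scale) for the F form of the two-sample Hotelling test.

    With pooled covariance, scale * T^2 follows F(df1, df2) under the null,
    where df1 = p, df2 = n1 + n2 - p - 1 and
    scale = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)).
    """
    if n1 + n2 - p - 1 <= 0:
        raise ValueError("need n1 + n2 > p + 1")
    df1 = p
    df2 = n1 + n2 - p - 1
    scale = df2 / (p * (n1 + n2 - 2))
    return df1, df2, scale


# Assumed sizes: 50 versicolor and 50 virginica flowers, 2 variables.
df1, df2, scale = hotelling_two_sample_df(50, 50, 2)
print(df1, df2)  # prints "2 97", matching df1 = 2, df2 = 97 above
```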
The simultaneous tests of hypotheses after controlling for false discovery rates at q = 0.05:
Test      1           2           3           4
p-value   1.725e-07   0.001819    < 2.2e-16   < 2.2e-16

Order                     (1)         (2)         (3)         (4)
p-value                   < 2.2e-16   < 2.2e-16   1.725e-07   0.001819
Threshold (i x 0.05/4)    0.05/4      0.1/4       0.15/4      0.05
So both the Bonferroni method (comparing each p-value with 0.05/4) and the B-H method
reject all four tests.
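The comparison above can be reproduced with a minimal sketch of the two procedures. The function names are illustrative, and the two p-values reported as "< 2.2e-16" are replaced by the upper bound 2.2e-16, which is an assumption that cannot change the decisions here.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Step-up B-H: reject the k smallest p-values, where k is the largest
    rank i with p_(i) <= i * q / m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k = rank
    reject = [False] * m
    for idx in order[:k]:
        reject[idx] = True
    return reject


def bonferroni(p_values, alpha=0.05):
    """Reject each test whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# Tests 1-4 from the table; 2.2e-16 stands in for "< 2.2e-16" (assumption).
p = [1.725e-07, 0.001819, 2.2e-16, 2.2e-16]
print(benjamini_hochberg(p))  # [True, True, True, True]
print(bonferroni(p))          # [True, True, True, True]
```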
Q2a)
To assess the multivariate normality of a sample with the simulation-based approach, I first
simulated samples of size 100 from the multivariate t-distribution for each of the following
degrees of freedom: 1, 2, 3, 10, 30, 50, 100, and 1000. For each of these samples, I ran the
simulation-based test for multivariate normality, as well as the test for multivariate normality
based on the function mvnorm.etest in the energy package. The results are below.
df      Simulation-based   p-value from mvnorm.etest
1       1.68e-20           < 2.2e-16
2       3.87e-12           < 2.2e-16
3       3.35e-17           < 2.2e-16
10      0.0358             0.00068
30      0.4459             0.3526
50      0.2675             0.1835
100     0.9994             0.5225
1000    0.9999             0.5077
The results of the simulation-based test and mvnorm.etest are consistent. Both the values
from the simulation-based approach and the p-values from the energy test tend to increase
as the degrees of freedom go up, although not strictly (both dip at df = 50). This suggests that
the multivariate t-distribution gets closer to the multivariate normal distribution as the degrees
of freedom increase. At the 0.05 level, multivariate normality is not rejected when the degrees of
freedom are 30, 50, 100, and 1000. For df = 10 both tests still reject (0.0358 and 0.00068), as
they do for df = 1, 2, and 3.
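The multivariate t samples described above can be drawn by dividing a standard normal vector by the square root of an independent chi-square variable over its degrees of freedom. The sketch below uses only the Python standard library. The dimension (4), the identity scale matrix, and the seed are assumptions made for illustration, since the answer does not state them, and the normality tests themselves are not reproduced.

```python
import math
import random


def sample_multivariate_t(n, dim, df, rng):
    """Draw n vectors from a multivariate t-distribution with zero location
    and identity scale matrix."""
    rows = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # chi-square(df) is Gamma(shape=df/2, scale=2)
        w = rng.gammavariate(df / 2.0, 2.0)
        s = math.sqrt(w / df)
        rows.append([zi / s for zi in z])
    return rows


rng = random.Random(1)  # fixed seed so the sketch is repeatable
samples = {
    df: sample_multivariate_t(100, 4, df, rng)
    for df in (1, 2, 3, 10, 30, 50, 100, 1000)
}
print(len(samples[10]), len(samples[10][0]))  # 100 4
```

As df grows, w / df concentrates near 1, so each draw approaches the underlying normal vector, which matches the pattern in the table.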
Q3)
The pair that is easy to separate is path and grass, while the pair that is hard to distinguish is
brickface and window. Correlation heat maps were plotted for these two pairs.
[Figures: correlation heat maps for PATH/GRASS and BRICKFACE/WINDOW]
After checking the correlations, I deleted variables 10, 11, 13, and 16 for the PATH/GRASS pair,
and variables 5, 9, 11, 12, 13, 18, and 19 for the BRICKFACE/WINDOW pair. I then used the remaining
variables to test for differences between the group means in each pair.
PATH/GRASS
Hotelling's two sample T2-test
T.2 = 1312.365, df1 = 14, df2 = 45, p-value < 2.2e-16
So the means of path and grass are significantly different.
The simultaneous tests of hypotheses after controlling for false discovery rates at q = 0.05:
Test      1           2           3           4           5           6           7
p-value   0.3342      0.00331     0.1715      0.02745     0.0003443   0.8157      0.0006125

Test      8           9           10          11          12          13          14
p-value   0.05669     < 2.2e-16   1.617e-09   < 2.2e-16   < 2.2e-16   9.875e-10   < 2.2e-16
The results from the B-H procedure show that all variables except REGION.CENTROID.COL,
SHORT.LINE.DENSITY.5, VEDGE.SD, and HEDGE.SD differ significantly
between PATH and GRASS.
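An equivalent way to read the B-H result is through adjusted p-values: a test is rejected at q = 0.05 exactly when its adjusted value is at most 0.05. The sketch below applies this to the 14 PATH/GRASS p-values. The function name is illustrative, 2.2e-16 stands in for the values reported as "< 2.2e-16" (an assumption), and the mapping from test numbers to variable names is not given in the answer, so only test numbers are printed.

```python
def bh_adjust(p_values):
    """B-H adjusted p-values: for the i-th smallest p-value, the minimum of
    m * p_(j) / j over all ranks j >= i."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Tests 1-14 for PATH/GRASS; 2.2e-16 replaces "< 2.2e-16" (assumption).
p = [0.3342, 0.00331, 0.1715, 0.02745, 0.0003443, 0.8157, 0.0006125,
     0.05669, 2.2e-16, 1.617e-09, 2.2e-16, 2.2e-16, 9.875e-10, 2.2e-16]
adj = bh_adjust(p)
not_rejected = [i + 1 for i, a in enumerate(adj) if a > 0.05]
print(not_rejected)  # [1, 3, 6, 8]
```

Four tests are not rejected, which agrees with the four variables listed above.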
BRICKFACE/WINDOW
Hotelling's two sample T2-test
T.2 = 50.1939, df1 = 10, df2 = 49, p-value < 2.2e-16
The means of BRICKFACE and WINDOW are significantly different.
The simultaneous tests of hypotheses after controlling for false discovery rates at q = 0.05:
Test      1           2           3           4           5           6
p-value   1.108e-05   0.6133      0.3215      0.5557      0.9395      0.6392

Test      7           8           9           10
p-value   0.03679     1.857e-08   4.739e-06   0.1188
The results from the B-H procedure show that the variables REGION.CENTROID.COL, EXRED.MEAN,
and HUE.MEAN differ significantly between BRICKFACE and WINDOW.