Stat 557
Fall 2000
Assignment 5 Solutions
Problem 1
(a) Given the level of factor B, factors A and C are conditionally independent. So the AC
and ABC interactions should not be included. At each level of factor C, the odds ratio
for the other two factors is 2/3, so factors A and B are not conditionally independent.
At each level of factor A, the odds ratio for the other two factors is 0.25, so factors B
and C are not conditionally independent. Hence the model is
$$\log(m_{ijk}) = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^{AB}_{ij} + \lambda^{BC}_{jk}$$
(b) The relative distribution across the joint categories of factors A and B is the same for
each level of factor C. Hence, factor C is independent of the joint distribution of factors
A and B. Consequently, the model should not have AC, BC, or ABC interactions.
Notice that A and B are conditionally independent at each level of C, so the model
should not have the AB interaction. We end up with the complete independence model
$$\log(m_{ijk}) = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k$$
(c) At each level of factor C, the odds ratio is 6. This indicates that the 3-factor interaction
should not be in the model. No conditional independence can be found, as the odds
ratios in all conditional 2x2 tables differ from 1. Consequently, the model is
$$\log(m_{ijk}) = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^{AB}_{ij} + \lambda^{BC}_{jk} + \lambda^{AC}_{ik}$$
(d) Following the reasoning used in part (c), the model is
$$\log(m_{ijk}) = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^{AB}_{ij} + \lambda^{BC}_{jk} + \lambda^{AC}_{ik}$$
(e) The means in the subtable when k=2 are each 3 times as large as the means in the
corresponding cells of the subtable when k=1. This implies that the joint distribution
of factors A and B is independent of the level of factor C. Consequently, the loglinear
model should not have AC, BC, or ABC interactions. Notice that factors A and
B are not conditionally independent given the level of factor C, so the model must
include the AB interaction. We have
$$\log(m_{ijk}) = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^{AB}_{ij}$$
(f) We can find the following by checking the subtables:
(1) Given the level of C and the level of D, A and B are conditionally independent.
(2) Given the level of A and the level of B, C and D are conditionally independent.
(3) Given the level of A and the level of D, B and C are conditionally independent.
(4) Given the level of B and the level of D, A and C are conditionally independent.
From (1), the AB, ABC, ABD, and ABCD interactions should be excluded. From (2),
the CD, ACD, and BCD interactions should be excluded. From (3) and (4), the BC and AC
interactions should be excluded. Note that A and D are not conditionally independent given the
levels of B and C. Also, B and D are not conditionally independent given the levels of
A and C. Consequently, the model must include the AD and BD interactions. We have
$$\log(m_{ijkl}) = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^D_l + \lambda^{AD}_{il} + \lambda^{BD}_{jl}$$
Problem 2.
There are many correct answers. Here is one set.

(a)
k = 1    j = 1   j = 2
i = 1      10      20
i = 2      20      40

k = 2    j = 1   j = 2
i = 1      20      40
i = 2      40      80

(b)
i = 1    k = 1   k = 2
j = 1      10      20
j = 2      30      40

i = 2    k = 1   k = 2
j = 1      20      40
j = 2      60      80

(c)
k = 1    j = 1   j = 2
i = 1      10      20
i = 2      20      40

k = 2    j = 1   j = 2
i = 1      10      30
i = 2      40     120
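The odds-ratio checks used throughout Problem 1 can be applied to any set of 2x2 subtables. The sketch below is only an illustration: it takes the last pair of tables listed in Problem 2 and compares the conditional odds ratios at each level of k with the odds ratio in the table collapsed over k. The function name `odds_ratio` is ours, not part of the assignment.

```python
def odds_ratio(table):
    """Odds ratio (a*d)/(b*c) for a 2x2 table given as [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

# Last pair of tables in Problem 2: rows i = 1, 2; columns j = 1, 2.
table_k1 = [[10, 20], [20, 40]]
table_k2 = [[10, 30], [40, 120]]

# Collapse over k by adding the two subtables cell by cell.
collapsed = [[table_k1[r][c] + table_k2[r][c] for c in range(2)] for r in range(2)]

print("conditional odds ratio, k = 1:", odds_ratio(table_k1))
print("conditional odds ratio, k = 2:", odds_ratio(table_k2))
print("odds ratio after collapsing over k:", round(odds_ratio(collapsed), 4))
```

A conditional odds ratio of 1 at every level of the third factor is the pattern that Problem 1 reads as conditional independence.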
Problem 3.
Model (a)
(i)
$$\log(m_{ijkl}) = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^D_l + \lambda^{AB}_{ij} + \lambda^{AD}_{il} + \lambda^{BC}_{jk} + \lambda^{BD}_{jl} + \lambda^{CD}_{kl} + \lambda^{ABD}_{ijl} + \lambda^{BCD}_{jkl}$$
(ii) df = 6
(iii) $\{Y_{ij+l}\}$, $\{Y_{+jkl}\}$
(iv) Given the time of day (D) and the number of vehicles involved in the accident (B), the
involvement of alcohol (C) has no association with the general direction of the road (A).
Model (b)
(i)
$$\log(m_{ijkl}) = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^D_l + \lambda^{AB}_{ij} + \lambda^{AD}_{il} + \lambda^{BC}_{jk} + \lambda^{CD}_{kl}$$
(ii) df = 12
(iii) $\{Y_{ij++}\}$, $\{Y_{i++l}\}$, $\{Y_{+jk+}\}$, and $\{Y_{++kl}\}$.
(iv) Given the time of day (D) and the number of vehicles involved in the accident (B),
involvement of alcohol (C) has no association with the general direction of the road (A).
Given the general direction of the road (A) and the level of alcohol involvement (C), the
number of vehicles involved in the accident (B) has no association with the time of day (D).
The association between the general direction of the road (A) and the number of vehicles
involved in the accident (B) is the same for any time of day (D) and each level of alcohol
involvement (C).
The associations between the accident time (D) and the general direction of the road (A)
are consistent across the number of vehicles involved in the accident (B) and the status of
alcohol involvement (C).
The association between the number of vehicles involved (B) and the status of alcohol
involvement (C) is consistent across time of day (D) and the general direction of the road (A).
The association between status of alcohol involvement (C) and the time of the day (D) is
consistent across direction of the road (A) and the number of vehicles involved (B).
Model (c)
(i)
$$\log(m_{ijkl}) = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^D_l + \lambda^{AB}_{ij} + \lambda^{AC}_{ik} + \lambda^{AD}_{il} + \lambda^{BC}_{jk} + \lambda^{BD}_{jl} + \lambda^{CD}_{kl} + \lambda^{ABC}_{ijk} + \lambda^{BCD}_{jkl}$$
(ii) df = 6
(iii) $\{Y_{ijk+}\}$, $\{Y_{+jkl}\}$, and $\{Y_{i++l}\}$.
(iv) All of the two-factor interactions are in the model, so no two factors are conditionally
independent given the levels of the other two factors. Since there is no three factor
interaction involving factors A and D, however, the association between the time of day when the
accidents occurred (D) and the general direction of the road (A) is not affected by either the
number of vehicles involved in the accident (B) or the status of alcohol involvement (C). All
other two factor associations change across the levels of at least one other factor.
Model (d)
(i)
$$\log(m_{ijkl}) = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^D_l + \lambda^{AB}_{ij} + \lambda^{AC}_{ik} + \lambda^{AD}_{il} + \lambda^{BC}_{jk} + \lambda^{BD}_{jl} + \lambda^{CD}_{kl} + \lambda^{BCD}_{jkl}$$
(ii) df = 7
(iii) $\{Y_{+jkl}\}$, $\{Y_{ij++}\}$, $\{Y_{i+k+}\}$, and $\{Y_{i++l}\}$.
(iv) Since all of the possible two factor interactions are in this model, there is no conditional
independence between any pair of factors given the levels of the other two factors. Three two
factor interactions are not involved in any higher order interaction, however, which implies
the following:
Associations between the general direction of the road (A) and the number of vehicles
involved in the accident (B) are homogeneous across time of day (D) and the status of alcohol
involvement (C).
Associations between the status of alcohol involvement (C) and the general direction of the
road (A) are homogeneous with respect to the number of vehicles involved in the accident
(B) and the time of day (D).
Associations between the accident time (D) and the general direction of the road (A) are
consistent across the number of vehicles involved in the accident (B) and the status of
alcohol involvement (C).
Model (e)
(i)
$$\log(m_{ijkl}) = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^D_l + \lambda^{BC}_{jk} + \lambda^{BD}_{jl} + \lambda^{CD}_{kl} + \lambda^{BCD}_{jkl}$$
(ii) df = 11
(iii) $\{Y_{+jkl}\}$, $\{Y_{i+++}\}$.
(iv) Since the general direction of the road, factor (A), is not involved in any two factor
interaction, this model implies that the general direction of the road has no association with any
of the other factors. Consequently, general direction of the road (A) is independent of the joint
distribution of the number of vehicles involved (B), the involvement of alcohol (C), and the
time of day when an accident occurs (D). This also implies the weaker condition in which
direction of the road (A) is conditionally independent of each of the other factors given any
set of levels of the remaining two factors.
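The degrees of freedom reported for models (a) through (e) can be checked by counting parameters. The sketch below assumes the factor levels implied by the S-PLUS output in Problem 4 (two levels each for A, B, and C, and three levels for D); the assignment's data table is not reproduced here, so treat those level counts as an assumption. Each model is specified by its generating terms, and the helper names are ours.

```python
from itertools import combinations
from math import prod

# Assumed numbers of levels (inferred from the Problem 4 coefficient table).
LEVELS = {"A": 2, "B": 2, "C": 2, "D": 3}

def hierarchical_terms(generators):
    """All main effects and interactions implied by the generating terms."""
    terms = set()
    for gen in generators:
        for size in range(1, len(gen) + 1):
            terms.update(combinations(gen, size))
    return terms

def residual_df(generators, levels=LEVELS):
    """Number of cells minus number of free parameters (intercept included)."""
    n_cells = prod(levels.values())
    n_params = 1 + sum(
        prod(levels[f] - 1 for f in term) for term in hierarchical_terms(generators)
    )
    return n_cells - n_params

MODELS = {
    "a": ["ABD", "BCD"],
    "b": ["AB", "AD", "BC", "CD"],
    "c": ["ABC", "BCD", "AD"],
    "d": ["BCD", "AB", "AC", "AD"],
    "e": ["BCD", "A"],
}

for name, gens in MODELS.items():
    print(f"Model ({name}): df = {residual_df(gens)}")
```

The generating terms are also the margins listed in part (iii) of each model, since the sufficient statistics of a hierarchical loglinear model are the marginal tables for its highest-order terms.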
Problem 4
The best model we found is
$$\log(m_{ijkl}) = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^D_l + \lambda^{AB}_{ij} + \lambda^{BC}_{jk} + \lambda^{BD}_{jl} + \lambda^{CD}_{kl}$$
(i) This model was obtained by starting with the complete independence model as the initial
model. Then, the S-PLUS function "step" was used to perform a stepwise search. This
identified the final model shown above. It had the best AIC value of the models this search
examined. It was also obtained from a backward elimination procedure. Goodness of fit
tests are
$X^2$ = 12.49, df = 12, p-value = 0.41.
$G^2$ = 12.02, df = 12, p-value = 0.44.
This model appears to provide an adequate summary of the accident data.
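The p-values quoted for the goodness of fit tests can be reproduced without statistical software, because the chi-square upper tail probability has a closed form when the degrees of freedom are even. The sketch below uses that identity; the function name `chi2_sf_even_df` is ours, and it is not the S-PLUS routine used to produce the solution.

```python
from math import exp

def chi2_sf_even_df(x, df):
    """Upper tail P(chi-square_df > x) for even df.

    Uses the identity P(X > x) = exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term = 1.0   # (x/2)^k / k! for k = 0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return exp(-half) * total

print("Pearson   :", round(chi2_sf_even_df(12.49, 12), 2))
print("Deviance  :", round(chi2_sf_even_df(12.02, 12), 2))
```

For odd degrees of freedom a general incomplete gamma routine would be needed instead.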
(ii) Maximum likelihood estimates of the terms in the model are
Coefficients:
                    Value  Std. Error     t value
(Intercept)     3.4119085   0.1448269  23.5585341
A               0.2613648   0.1516147   1.7238752
B               2.2184195   0.1537919  14.4248116
C              -2.2568801   0.2691621  -8.3848361
DAfternoon     -0.9464736   0.2096743  -4.5140183
DEvening       -0.7403832   0.1937870  -3.8206028
CDAfternoon     1.2564227   0.2890848   4.3462084
CDEvening       2.1460464   0.2741173   7.8289334
B:C            -1.2642096   0.2292428  -5.5147190
BDAfternoon     0.1397284   0.2202773   0.6343295
BDEvening      -0.7889321   0.2113118  -3.7334982
A:B            -0.5047110   0.1659459  -3.0414199
These estimates were obtained from S-PLUS under the constraints that any main effect term
or any interaction term is zero when any factor involved is at its lowest level. There was an
error in typing the data table in the assignment. This answer was obtained from the data
stored in the file accidents.dat. Slightly different answers would be obtained from the table
of counts printed on the assignment.
Since there are no significant two factor interactions between general direction of the road
(A) and either involvement with alcohol (C) or time of day (D), the model suggests that
general direction of the road (A) is essentially conditionally independent of involvement with
alcohol (C) given the levels of the other two factors, and general direction of the road (A) is
conditionally independent of time of day (D) given the levels of the other two factors.
The model also suggests that the odds that alcohol is involved in an accident are about
exp(1.2564) = 3.5 times greater in the morning than in the afternoon and about exp(2.146) =
8.5 times greater in the morning than in the evening. This would make sense if the morning
period includes hours just after midnight when bars close and patrons who have consumed
the most alcohol drive home. For this model, these estimated odds ratios are consistent
across the levels of the other two factors.
The model also suggests that the number of vehicles involved in an accident has some
association with each of the other factors. The odds of a multiple vehicle accident are about
65...
The estimates of the intercept and main effect terms provide information on relative
frequencies of levels of individual factors relative to a baseline. This information is often not
of great interest and I did not expect you to discuss it, but we will consider it here anyway.
From the estimate of the intercept we have exp(3.4119) = 30.3 as the estimated mean
count when each factor is at its lowest level (north-south roads, single vehicle accident, in
the morning with alcohol involved). Then, $\exp(\hat\lambda + \hat\lambda^A)$ = exp(3.41191 + 0.26136) = 39.4
is the estimated mean number of single vehicle accidents on east-west roads in the morning
with alcohol involved. Similarly, $\exp(\hat\lambda + \hat\lambda^B)$ = exp(3.41191 + 2.21842) = 278.8 is the
estimated mean number of multi-vehicle accidents on north-south roads in the morning with
alcohol involved, and $\exp(\hat\lambda + \hat\lambda^C)$ = exp(3.41191 - 2.25688) = 3.2 is the estimated mean
number of single vehicle accidents on north-south roads in the morning where alcohol is not
involved. Finally, $\exp(\hat\lambda + \hat\lambda^D_{\text{Afternoon}})$ = exp(3.41191 - 0.94647) = 11.8 is the estimated
mean number of single vehicle accidents on north-south roads in the afternoon with alcohol
involved, and $\exp(\hat\lambda + \hat\lambda^D_{\text{Evening}})$ = exp(3.41191 - 0.74038) = 14.5 is the estimated mean
number of single vehicle accidents on north-south roads in the evening with alcohol involved.
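The arithmetic behind these interpretations is just exponentiation of sums of estimated terms. The sketch below copies the relevant estimates from the coefficient table for Problem 4 and recomputes the odds ratios and baseline-type means; the dictionary keys mirror the S-PLUS labels, and the helper `fitted_mean` is a name introduced here for illustration.

```python
from math import exp

# Estimates copied from the Problem 4 coefficient table (baseline constraints).
coef = {
    "(Intercept)": 3.4119085,
    "A": 0.2613648,
    "B": 2.2184195,
    "C": -2.2568801,
    "DAfternoon": -0.9464736,
    "DEvening": -0.7403832,
    "CDAfternoon": 1.2564227,
    "CDEvening": 2.1460464,
}

def fitted_mean(*terms):
    """Estimated mean count when only the named terms are switched on."""
    return exp(coef["(Intercept)"] + sum(coef[t] for t in terms))

# Odds ratios for alcohol involvement, morning relative to the later periods.
print("morning vs afternoon:", round(exp(coef["CDAfternoon"]), 2))
print("morning vs evening  :", round(exp(coef["CDEvening"]), 2))

# Estimated means with a single factor moved away from its lowest level.
for label, terms in [("baseline", ()), ("A", ("A",)), ("B", ("B",)),
                     ("C", ("C",)), ("D afternoon", ("DAfternoon",)),
                     ("D evening", ("DEvening",))]:
    print(f"{label:12s} {fitted_mean(*terms):7.1f}")
```

Means for cells with more than one factor away from its lowest level would also need the corresponding interaction estimates, which are left out of this sketch.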
Problem 5
Use the complete independence model as the initial model. Then use the S-PLUS function
"step" or "stepAIC" to do a stepwise search. Starting with the complete independence
model this yields the model
$$\log(m_{ijkl}) = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^D_l + \lambda^{AC}_{ik} + \lambda^{BC}_{jk} + \lambda^{BD}_{jl} + \lambda^{CD}_{kl} + \lambda^{AB}_{ij} + \lambda^{AD}_{il} + \lambda^{ABC}_{ijk}$$
The estimated $\lambda$-terms are
                    Value  Std. Error      t value
(Intercept)    1.23856236   0.2738499   4.52277797
A2            -0.53135358   0.3967743  -1.33918337
A3            -0.79826947   0.5090301  -1.56821649
C2             0.55744228   0.3170151   1.75840950
C3            -0.35571045   0.3906370  -0.91059072
C4             2.46629784   0.2732102   9.02710623
B2             1.75449872   0.2564484   6.84152730
B3            -0.25997611   0.3267763  -0.79557831
D2            -1.17962762   0.2901775  -4.06519269
D3            -0.43383283   0.2541672  -1.70687959
D4             0.72702975   0.2068656   3.51450266
D5             0.57192314   0.2145256   2.66599054
C2B2           0.13288424   0.2882659   0.46097803
C3B2           0.23444276   0.3678694   0.63729895
C4B2          -1.93985744   0.2451301  -7.91358353
C2B3           0.52887171   0.3669414   1.44129733
C3B3           0.49383403   0.4636653   1.06506581
C4B3          -0.19847702   0.3185657  -0.62303314
A2C2          -0.09550353   0.4866494  -0.19624709
A3C2          -0.06550752   0.6244122  -0.10491069
A2C3           1.18190582   0.5204080   2.27111372
A3C3           0.27328994   0.7268818   0.37597578
A2C4           0.18737090   0.4018535   0.46626668
A3C4           0.05051041   0.5199221   0.09714996
A2D2           0.28111119   0.1377156   2.04124439
A3D2           0.13012669   0.1628727   0.79894735
A2D3           0.17595853   0.1346599   1.30668833
A3D3          -0.54236657   0.1774174  -3.05700935
A2D4          -0.19090204   0.1124046  -1.69834741
A3D4          -0.78503175   0.1391720  -5.64073030
A2D5          -0.53179558   0.1146129  -4.63992667
A3D5          -1.17205666   0.1465214  -7.99921668
B2D2           0.65167996   0.1686581   3.86391125
B3D2          -0.25580425   0.1949826  -1.31193394
B2D3           1.07186414   0.1891964   5.66535139
B3D3           0.55791147   0.2072840   2.69153119
B2D4           0.84456071   0.1407807   5.99912456
B3D4           0.19642261   0.1554210   1.26381018
B2D5           1.58544418   0.1584673  10.00486364
B3D5           0.84686839   0.1733061   4.88654803
A2B2          -0.33361991   0.4031492  -0.82753453
A3B2          -1.28461938   0.5540023  -2.31879799
A2B3           0.98358699   0.4747678   2.07172221
A3B3           0.65035587   0.6266494   1.03783053
C2D2           0.66128121   0.2774680   2.38326984
C3D2           1.03126012   0.2845323   3.62440383
C4D2           0.84301367   0.2661616   3.16730050
C2D3          -0.15440082   0.2218129  -0.69608592
C3D3          -0.51080721   0.2437382  -2.09572068
C4D3          -0.31153251   0.2122377  -1.46784711
C2D4           0.16633346   0.1948750   0.85353922
C3D4           0.05404871   0.2082427   0.25954670
C4D4           0.07198334   0.1852515   0.38857090
C2D5          -0.24952236   0.1908319  -1.30755037
C3D5          -0.63160746   0.2079394  -3.03745928
C4D5          -0.37131448   0.1815955  -2.04473383
A2C2B2         0.88002265   0.5023210   1.75191277
A3C2B2         1.19904452   0.6741340   1.77864411
A2C3B2         0.25138975   0.5381025   0.46717818
A3C3B2         1.72247616   0.7717152   2.23201007
A2C4B2         0.91234459   0.4209736   2.16722523
A3C4B2         1.68109250   0.5767804   2.91461446
A2C2B3        -0.14988308   0.5838218  -0.25672745
A3C2B3         0.05845829   0.7607387   0.07684412
A2C3B3        -0.52928578   0.6391600  -0.82809585
A3C3B3         0.40421695   0.8729021   0.46307248
A2C4B3        -0.21715966   0.4950065  -0.43870065
A3C4B3         0.32171412   0.6506797   0.49442777
Check of the fit of the model:
$X^2$ = 155.3, df = 112, p-value = 0.0043.
$G^2$ = 160.8, df = 112, p-value = 0.0017.
These tests appear to reject the fit of the model, but the p-values may not be reliable. We
found that 32% of the expected counts are smaller than 5 and 5% of the expected counts
are smaller than 1. Consequently, the large sample chi-square approximation to the null
distribution of these test statistics may not be accurate. We examined the Pearson residuals
and the deviance residuals to determine if there are combinations of levels of factors where
the model fits poorly. None of these residuals was larger than 3 or smaller than -3. The data
exhibit no gross departures from this model. The AIC value for this model is 296.8. This was
the model that most students selected using a mindless adherence to the strategy of
minimizing the AIC value. A few students continued the search by adding the most highly
significant term they could find, in this case the $\lambda^{ABD}_{ijl}$ term. This interaction is
significant at the .015 level, but none of the standardized values of these interaction
parameters exceeded 1.91 in absolute value. This interaction seems to be of limited
importance. For this model, $X^2$ = 122.22 and $G^2$ = 130.26, both on 96 degrees of freedom,
with p-values .04 and .01, respectively. The AIC value is 298.26. A few more students
continued to search and added the AxCxD interaction. This interaction was significant at the
0.13 level, and none of the standardized values of these interaction parameters exceeded 1.96
in absolute value. This interaction also seems to be of limited importance. For this model,
$X^2$ = 56.33 and $G^2$ = 63.66, both on 48 degrees of freedom, with p-values .22 and .08,
respectively. The AIC value is 326.66. A case could be made for any of these models. They
all provide about the same description of the data. Although we might have selected one of
the larger models, we will only interpret the estimates shown above for the smallest of the
three models.
Present the estimates for the CxD interaction parameters in a 4x5 table (rows are levels
of C, columns are levels of D):
         l = 1   l = 2   l = 3   l = 4   l = 5
k = 1      0       0       0       0       0
k = 2      0     0.66   -0.15    0.17   -0.25
k = 3      0     1.03   -0.51    0.05   -0.63
k = 4      0     0.84   -0.31    0.07   -0.37
This table indicates that incomes tend to be highest in the suburbs of Copenhagen and lowest
in the countryside and the other three largest cities. This is consistent across levels of
marital status and alcohol consumption.
The estimates of the BxD interaction parameters shown in the following table indicate that
Copenhagen has the lowest proportion of married people while the rural areas have the
highest proportions of married people. Copenhagen and its suburbs also have relatively low
levels of unmarried people. Consequently, Copenhagen has a higher level of widowed people.
This is consistent across levels of income and alcohol consumption.
         l = 1   l = 2   l = 3   l = 4   l = 5
j = 1      0       0       0       0       0
j = 2      0     0.65    1.07    0.84    1.59
j = 3      0    -0.26    0.56    0.20    0.85
The estimates of the AxD interaction parameters shown in the following table indicate that
Copenhagen and its suburbs have relatively high levels of alcohol consumption, and alcohol
consumption is lowest in small cities and rural areas. This is consistent across levels of
income and marital status.
         l = 1   l = 2   l = 3   l = 4   l = 5
i = 1      0       0       0       0       0
i = 2      0     0.28    0.18   -0.19   -0.53
i = 3      0     0.13   -0.54   -0.79   -1.17
The other two factor interactions are involved in a three factor interaction. To examine
patterns, add the estimated $\lambda$-terms for a 2-factor interaction to the corresponding
estimates of the $\lambda$-terms for the 3-factor interaction at each level of the third factor.
For example, the estimates of the $\lambda^{AC}_{ik}$ terms shown in the following table provide
information on the associations between income and alcohol consumption for widows and
widowers. There is some tendency for alcohol consumption among widows and widowers to be
higher for larger incomes, with the highest level of moderate alcohol consumption among
widows and widowers in the 100-150 income category.
         k = 1   k = 2   k = 3   k = 4
i = 1      0       0       0       0
i = 2      0    -0.10    1.18    0.19
i = 3      0    -0.06    0.27    0.05
To examine the associations between alcohol consumption and income for married people,
add the estimates of the $\lambda^{ABC}_{i2k}$ terms to the corresponding $\lambda^{AC}_{ik}$ estimates in the
table for widows and widowers to obtain the following table.
         k = 1   k = 2   k = 3   k = 4
i = 1      0       0       0       0
i = 2      0     0.78    1.43    1.10
i = 3      0     1.14    1.99    1.73
This table indicates that the lowest income group has the lowest alcohol consumption among
married adults.
To examine the associations between alcohol consumption and income for unmarried people,
add the estimates of the $\lambda^{ABC}_{i3k}$ terms to the corresponding estimates of $\lambda^{AC}_{ik}$. The
pattern in the resulting table, shown below, is similar to the pattern for the widows and
widowers.
         k = 1   k = 2   k = 3   k = 4
i = 1      0       0       0       0
i = 2      0    -0.25    0.65   -0.03
i = 3      0     0.00    0.67    0.37
You can perform a corresponding analysis to examine how associations between alcohol
consumption and marital status change across income categories.
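The conditional AxC tables are obtained by plain addition of estimates from the coefficient table. The following sketch redoes that addition with the S-PLUS estimates keyed by (level of A, level of C); the dictionary layout and the function name `conditional_ac` are our own choices, and levels j = 1, 2, 3 of B are taken to be widowed, married, and unmarried, as in the discussion above.

```python
# AC estimates keyed by (i, k); terms with i = 1 or k = 1 are zero by constraint.
AC = {
    (2, 2): -0.09550353, (2, 3): 1.18190582, (2, 4): 0.18737090,
    (3, 2): -0.06550752, (3, 3): 0.27328994, (3, 4): 0.05051041,
}

# ABC estimates keyed first by the level j of B, then by (i, k).
ABC = {
    2: {(2, 2): 0.88002265, (2, 3): 0.25138975, (2, 4): 0.91234459,
        (3, 2): 1.19904452, (3, 3): 1.72247616, (3, 4): 1.68109250},
    3: {(2, 2): -0.14988308, (2, 3): -0.52928578, (2, 4): -0.21715966,
        (3, 2): 0.05845829, (3, 3): 0.40421695, (3, 4): 0.32171412},
}

def conditional_ac(j):
    """AC association terms at level j of B (j = 1 is the baseline level)."""
    if j == 1:
        return dict(AC)
    return {cell: AC[cell] + ABC[j][cell] for cell in AC}

for j in (1, 2, 3):
    table = conditional_ac(j)
    print(f"B level {j}")
    for i in (2, 3):
        print("  i =", i, [round(table[(i, k)], 2) for k in (2, 3, 4)])
```

Swapping the roles of B and C in the same way gives the tables needed to see how the association between alcohol consumption and marital status changes across income categories.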
Problem 6
As in most sample size determinations, these questions were somewhat ill posed. You needed
to supply some additional information by making some reasonable assumptions.
(a) Let $X_1$ and $X_2$ be the numbers of voters who support a certain candidate in the
September and October samples, respectively. Let $N_1$ and $N_2$ be the survey sample
sizes in September and October, respectively. Let $\pi_1$ and $\pi_2$ be the true proportions
of people supporting that candidate in September and October, respectively. Let
$p_i = X_i/N_i$ ($i = 1, 2$).
When $N_1$ and $N_2$ are large, we have approximately
$$p_1 \sim N\left(\pi_1, \frac{\pi_1(1-\pi_1)}{N_1}\right) \quad\text{and}\quad p_2 \sim N\left(\pi_2, \frac{\pi_2(1-\pi_2)}{N_2}\right).$$
Then,
$$p_2 - p_1 \sim N\left(\pi_2 - \pi_1, \frac{\pi_1(1-\pi_1)}{N_1} + \frac{\pi_2(1-\pi_2)}{N_2}\right).$$
Then, under $H_a$,
$$T \equiv \frac{p_2 - p_1 - (\pi_2 - \pi_1)}{\sqrt{\frac{\pi_1(1-\pi_1)}{N_1} + \frac{\pi_2(1-\pi_2)}{N_2}}} \sim N(0, 1).$$
The test problem is $H_0: \pi_1 = \pi_2$ versus $H_a: \pi_2 > \pi_1$. Under $H_0$, with $\pi$ denoting
the common proportion,
$$p_2 - p_1 \sim N\left(0, \left(\frac{1}{N_1} + \frac{1}{N_2}\right)\pi(1-\pi)\right).$$
The mle of $\pi$ under $H_0$ is
$$p = \frac{X_1 + X_2}{N_1 + N_2} = \frac{N_1 p_1 + N_2 p_2}{N_1 + N_2}.$$
So the test for $H_0$ at significance level $\alpha$ is: "reject $H_0$ if
$\dfrac{p_2 - p_1}{\sqrt{(1/N_1 + 1/N_2)\,p(1-p)}} > Z_\alpha$".
To ensure that this test has power of at least $1-\beta$, the following must hold.
$$\Pr\left(\frac{p_2 - p_1}{\sqrt{(1/N_1 + 1/N_2)\,p(1-p)}} > Z_\alpha \,\Big|\, H_a\right) \ge 1 - \beta$$
This is equivalent to
$$\Pr\left(T > \frac{Z_\alpha\sqrt{(1/N_1 + 1/N_2)\,p(1-p)} - (\pi_2 - \pi_1)}{\sqrt{\frac{\pi_1(1-\pi_1)}{N_1} + \frac{\pi_2(1-\pi_2)}{N_2}}} \,\Big|\, H_a\right) \ge 1 - \beta$$
i.e.
$$\frac{Z_\alpha\sqrt{(1/N_1 + 1/N_2)\,p(1-p)} - (\pi_2 - \pi_1)}{\sqrt{\frac{\pi_1(1-\pi_1)}{N_1} + \frac{\pi_2(1-\pi_2)}{N_2}}} \le Z_{1-\beta} = -Z_\beta$$
That is
$$\sqrt{\left(\frac{1}{N_1} + \frac{1}{N_2}\right)p(1-p)}\; Z_\alpha + \sqrt{\frac{\pi_1(1-\pi_1)}{N_1} + \frac{\pi_2(1-\pi_2)}{N_2}}\; Z_\beta - (\pi_2 - \pi_1) \le 0$$
i.e.
$$\sqrt{\left(\frac{1}{N_1} + \frac{1}{N_2}\right)\frac{N_1 p_1 + N_2 p_2}{N_1 + N_2}\left(1 - \frac{N_1 p_1 + N_2 p_2}{N_1 + N_2}\right)}\; Z_\alpha + \sqrt{\frac{\pi_1(1-\pi_1)}{N_1} + \frac{\pi_2(1-\pi_2)}{N_2}}\; Z_\beta - (\pi_2 - \pi_1) \le 0$$
Notice that $\pi_i$ can be estimated by $p_i$ ($i = 1, 2$). So the above inequality is
approximately equivalent to
$$\sqrt{\left(\frac{1}{N_1} + \frac{1}{N_2}\right)\frac{N_1 p_1 + N_2 p_2}{N_1 + N_2}\left(1 - \frac{N_1 p_1 + N_2 p_2}{N_1 + N_2}\right)}\; Z_\alpha + \sqrt{\frac{p_1(1-p_1)}{N_1} + \frac{p_2(1-p_2)}{N_2}}\; Z_\beta - (p_2 - p_1) \le 0$$
For this problem, $N_1 = 1600$, $p_1 = 0.48$, $p_2 = 0.51$, $\alpha = 0.05$, and $\beta = 0.10$. You may
try to solve the inequality directly, but it is easier to show that the left hand side of the
inequality is monotonically decreasing with respect to $N_2$. Unfortunately, even if we
pick $N_2$ = 300000000 (greater than the adult population of the USA), the left hand side of
the above inequality is 0.00656, still bigger than 0. Thus, the sample size needed to
achieve the desired power is not achievable if the survey is conducted in the USA.
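The claim that no October sample size achieves the desired power is easy to check numerically. The sketch below evaluates the left hand side of the final inequality, with $p_i$ substituted for $\pi_i$, at several values of $N_2$; the function name and argument defaults are ours, and the normal quantiles come from Python's standard library rather than from S-PLUS.

```python
from math import sqrt
from statistics import NormalDist

def power_condition_lhs(n2, n1=1600, p1=0.48, p2=0.51, alpha=0.05, beta=0.10):
    """Left hand side of the power inequality; power >= 1 - beta requires <= 0."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(1 - beta)
    pooled = (n1 * p1 + n2 * p2) / (n1 + n2)          # mle of pi under H0
    null_sd = sqrt((1 / n1 + 1 / n2) * pooled * (1 - pooled))
    alt_sd = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return z_alpha * null_sd + z_beta * alt_sd - (p2 - p1)

for n2 in (1_600, 10_000, 1_000_000, 300_000_000):
    print(f"N2 = {n2:>11,d}   LHS = {power_condition_lhs(n2):.5f}")
```

The values decrease as $N_2$ grows but level off above zero, because the September sample of 1600 alone contributes too much variance for a difference of 0.03 to be detected with 90% power.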
Many students considered the result from the first survey as a fixed, non-random result
and computed the sample size needed for a one sample test. A few other students
computed the sample size needed for a two sample test, but in this process they computed
a new sample size for the first sample. This makes no sense because the first sample
has already been completed.
(b) The half length of a 95% confidence interval for $\pi_2$ is $Z_{0.025}\sqrt{p_2(1-p_2)/N_2}$. From
$$Z_{0.025}\sqrt{p_2(1-p_2)/N_2} = 0.01$$
we have
$$N_2 = Z_{0.025}^2\, p_2(1-p_2)/0.01^2 = 9599.8$$
Thus, the sample size for the October survey must be about 9600.
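As a quick check of the arithmetic in part (b), the sketch below solves the half-length equation for the sample size using $p_2 = 0.51$ and a half length of 0.01. The function name `ci_sample_size` is ours.

```python
from math import ceil
from statistics import NormalDist

def ci_sample_size(p, half_length, conf_level=0.95):
    """Sample size giving the requested half length for a Wald interval."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    return z ** 2 * p * (1 - p) / half_length ** 2

n = ci_sample_size(0.51, 0.01)
print(round(n, 1), "->", ceil(n))
```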
(c) Let $Y_{11}$ be the number of people intending to vote for this candidate before and after
the debate; let $Y_{22}$ be the number of people intending to vote against this candidate
before and after the debate; let $Y_{12}$ be the number of people intending to vote for
this candidate before the debate but changing their mind after the debate; let $Y_{21}$ be
the number of people intending to vote against this candidate before the debate but
changing their mind after the debate. Let the sample proportion be $p_{ij} = Y_{ij}/n$ and
let $\pi_{ij}$ denote the corresponding true proportion of the population. The test problem
"$H_0: \pi_{1+} = \pi_{+1}$ versus $H_a: \pi_{+1} > \pi_{1+}$" is equivalent to "$H_0: \pi_{21} = \pi_{12}$ versus
$H_a: \pi_{21} > \pi_{12}$".
Under $H_a$,
$$\mathrm{var}(p_{+1} - p_{1+}) = \mathrm{var}(p_{21} - p_{12}) = \mathrm{var}(p_{21}) + \mathrm{var}(p_{12}) - 2\,\mathrm{cov}(p_{21}, p_{12}) = \frac{\pi_{21}(1-\pi_{21}) + \pi_{12}(1-\pi_{12}) + 2\pi_{21}\pi_{12}}{n}$$
Hence asymptotically
$$p_{21} - p_{12} \sim N\left(\pi_{21} - \pi_{12}, \frac{\pi_{21}(1-\pi_{21}) + \pi_{12}(1-\pi_{12}) + 2\pi_{21}\pi_{12}}{n}\right)$$
and
$$T \equiv \frac{p_{21} - p_{12} - (\pi_{21} - \pi_{12})}{\sqrt{(\pi_{21}(1-\pi_{21}) + \pi_{12}(1-\pi_{12}) + 2\pi_{21}\pi_{12})/n}} \sim N(0, 1)$$
Under $H_0$,
$$p_{21} - p_{12} \sim N(0, 2\pi_{21}/n)$$
and the mle of $\pi_{21}$ is
$$\hat\pi_{21} = \frac{p_{21} + p_{12}}{2}.$$
Thus, under $H_0$, $\dfrac{p_{21} - p_{12}}{\sqrt{(p_{21} + p_{12})/n}}$ has a large sample standard normal distribution. The
test of $H_0$ at significance level $\alpha$ is
$$\text{reject } H_0 \text{ if } \frac{p_{21} - p_{12}}{\sqrt{(p_{21} + p_{12})/n}} > Z_\alpha.$$
To enable the test to have power $1-\beta$, we must have
$$\Pr\left(\frac{p_{21} - p_{12}}{\sqrt{(p_{21} + p_{12})/n}} > Z_\alpha \,\Big|\, H_a\right) = 1 - \beta.$$
That is
$$\Pr\left(T > \frac{Z_\alpha\sqrt{(p_{21} + p_{12})/n} - (\pi_{21} - \pi_{12})}{\sqrt{(\pi_{21}(1-\pi_{21}) + \pi_{12}(1-\pi_{12}) + 2\pi_{21}\pi_{12})/n}} \,\Big|\, H_a\right) = 1 - \beta$$
and we must have
$$\frac{Z_\alpha\sqrt{(p_{21} + p_{12})/n} - (\pi_{21} - \pi_{12})}{\sqrt{(\pi_{21}(1-\pi_{21}) + \pi_{12}(1-\pi_{12}) + 2\pi_{21}\pi_{12})/n}} = Z_{1-\beta} = -Z_\beta.$$
Consequently,
$$n = \frac{\left[Z_\alpha\sqrt{p_{21} + p_{12}} + Z_\beta\sqrt{\pi_{21}(1-\pi_{21}) + \pi_{12}(1-\pi_{12}) + 2\pi_{21}\pi_{12}}\right]^2}{(\pi_{21} - \pi_{12})^2}$$
Now substitute $p_{21}$ and $p_{12}$ for $\pi_{21}$ and $\pi_{12}$, respectively, to obtain
$$n = \frac{\left[Z_\alpha\sqrt{p_{21} + p_{12}} + Z_\beta\sqrt{p_{21}(1-p_{21}) + p_{12}(1-p_{12}) + 2p_{21}p_{12}}\right]^2}{(p_{21} - p_{12})^2}$$
Now we know that
$$p_{11} + p_{12} = 0.48, \quad p_{11} + p_{21} = 0.51, \quad \alpha = 0.05, \quad \beta = 0.10.$$
Then, $p_{12} = 0.48 - p_{11}$ and $p_{21} = 0.51 - p_{11}$, where $0 \le p_{11} \le 0.48$, and
$$n = \frac{\left[\sqrt{0.99 - 2p_{11}}\; Z_{0.05} + \sqrt{0.9891 - 2p_{11}}\; Z_{0.10}\right]^2}{0.03^2}$$
Hence n is related to $p_{11}$. Obviously n is a decreasing function of $p_{11}$ for $0 \le p_{11} \le 0.48$.
So the maximum value of n = 9417 is for the unlikely situation with $p_{11} = 0$ where
almost everyone changes their mind. At a more likely value of $p_{11} = 0.45$, for example,
n = 853. At $p_{11} = 0.48$ we would need a sample size of n = 282. Examine the required
sample sizes at several likely values for $\pi_{11}$ and make a decision based on what the
experts think $\pi_{11}$ is likely to be.
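Because the required sample size in part (c) depends on the unknown proportion of voters who support the candidate both before and after the debate, it helps to tabulate n over a range of values. The sketch below implements the final formula with the sample proportions in place of the true proportions; the function name and the grid of $p_{11}$ values are illustrative choices, not part of the original solution.

```python
from math import ceil, sqrt
from statistics import NormalDist

def paired_sample_size(p11, p1_plus=0.48, p_plus1=0.51, alpha=0.05, beta=0.10):
    """Sample size for the one-sided test of pi_21 = pi_12 on paired responses."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(1 - beta)
    p12 = p1_plus - p11          # for before, against after
    p21 = p_plus1 - p11          # against before, for after
    null_sd = sqrt(p21 + p12)
    alt_sd = sqrt(p21 * (1 - p21) + p12 * (1 - p12) + 2 * p21 * p12)
    return ((z_alpha * null_sd + z_beta * alt_sd) / (p21 - p12)) ** 2

for p11 in (0.0, 0.30, 0.40, 0.45, 0.48):
    print(f"p11 = {p11:.2f}   n = {ceil(paired_sample_size(p11))}")
```

The steep drop in n as $p_{11}$ approaches 0.48 shows how much the paired design gains when few respondents change their minds.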
Problem 7
The test problem is "$H_0$: the two treatments provide the same results" versus "$H_A$: the two
treatments provide different results".
The degrees of freedom for the Pearson test of the null hypothesis are df = 2.
$\alpha$ = 0.05, power = 0.9.
$P_0$ = (0.2, 0.2, 0.6, 0.3, 0.3, 0.4),
$P_A$ = (0.25, 0.25, 0.5, 0.25, 0.25, 0.5).
Using the program "chopow.ssc", we get n = 159.