Assignment 1 Solutions, Stat 557, Fall 2000

Problem 1
(a) population: Adults in the State of New York with telephones
unit of response: An adult from the population
response variable: a binary (yes, no) nominal variable
explanatory variables: age (interval variable) and sex (nominal variable)
(b) population: Some population of patients with a certain liver disease. The population
is not clearly identified.
unit of response: A patient from the population.
response variable: disease severity measured as an ordinal variable on a five-point
scale.
explanatory variables: time
Alternatively, you could view the repeated measurements on the same subject as a
set of responses and not list time as an explanatory variable. In either case, an appropriate analysis of the data would have to account for correlations among repeated
measurements taken on an individual patient.
(c) population: Iowa State students
unit of response: A student and his or her parents.
response variable: There are six potentially correlated binary (nominal) response
variables corresponding to right (or left) handedness and right (or left) footedness
for the student and his or her parents. Alternatively, you could think of this as a
single nominal variable with 2^6 = 64 categories.
explanatory variables: none are mentioned.
Alternatively, you could take right (or left) handedness and right (or left) footedness
of the student as a pair of binary response variables and use the information from the
parents as explanatory variables. The appropriate point of view would be dictated by
the objectives of the study.
(d) population: Schools in Iowa with sixth grade classes.
unit of response: This is a two-stage sampling scheme in which schools are sampled
from the population of schools in Iowa with sixth grade students. Then
one class is randomly selected from the sixth grade classes in each of the selected
schools. An entire sixth grade class is the unit of response. Since students in a class
are influenced by the same teacher and they largely share the same educational
background, two students from the same class will tend to respond in a more
similar way than two students from different schools. To use sixth grade students
as the units of response, correlations among responses from classmates would have
to be taken into account. This could be difficult because it would require a model
that would allow for different types or strengths of relationships between different
pairs of classmates.
response variable: For each of the 25 animals there are two ordinal response variables
(each with five categories) corresponding to attitudes toward the animal before
and after the visit to the wildlife center.
explanatory variables: none are mentioned
Alternatively, you could consider the post visit attitudes as the response variables and
the pre-visit responses as explanatory variables. This would depend on the objectives
of the study, but they were not addressed in this problem.
(e) population: All orders for automobile parts filled by the warehouse center in a specific
time period.
unit of response: An order, which may contain more than one line.
response variable: An interval variable corresponding to the number of errors made
in filling an order.
explanatory variables: The number of lines in an order could be used as an explanatory variable. There may also be information on the types of errors or types of
items ordered.
(f) population: The population from which the 200,000 potential customers were selected
was not identified.
unit of response: A person on the mailing list.
response variable: There are two binary (nominal) response variables. One is
whether or not a person responded to the mailing. The other is whether or not
a respondent used the new credit card within one year. You could think of this
as a single nominal response variable with three categories (no initial response,
responded and did not use the credit card, responded and used the credit card).
explanatory variables: sex (nominal), age (interval), income (interval), marital status (nominal), home ownership (nominal), credit history (nominal).
Problem 2
Let $\pi_1$ and $\pi_2$ be the success rate of the ACS program and the success rate of the psychologist's new program, respectively. Then the testing problem is $H_0: \pi_1 = \pi_2$ versus $H_A: \pi_1 < \pi_2$.
The alternative is one-sided because the objective of the study is to demonstrate that
the new program is better. Here, $\alpha = 0.05$, $\beta = 0.20$, $z_\alpha = 1.64485$, $z_\beta = 0.84162$, $\pi_1 = 0.17$,
$\pi_2 = 0.24$.
\[ p = (\pi_1 + \pi_2)/2 \]
\[ r = \sqrt{\frac{2p(1-p)}{\pi_1(1-\pi_1) + \pi_2(1-\pi_2)}} = 1.0038 \]
\[ n = \frac{(z_\beta + z_\alpha r)^2 \left(\pi_1(1-\pi_1) + \pi_2(1-\pi_2)\right)}{(\pi_1 - \pi_2)^2} = 410.2 \]
Then, 411 smokers are needed in each program.
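As an arithmetic check of the sample-size calculation above, here is a minimal Python sketch (Python is used only for illustration; it is not the S-PLUS code used for these solutions, and the function name `sample_size_per_group` is hypothetical). It takes the normal quantiles from the standard library's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

def sample_size_per_group(pi1, pi2, alpha=0.05, beta=0.20):
    """Per-group n for a one-sided test of H0: pi1 = pi2 against HA: pi1 < pi2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # upper alpha quantile
    z_beta = NormalDist().inv_cdf(1 - beta)     # upper beta quantile
    p = (pi1 + pi2) / 2                         # average success rate
    var_sum = pi1 * (1 - pi1) + pi2 * (1 - pi2)
    r = math.sqrt(2 * p * (1 - p) / var_sum)
    n = (z_beta + z_alpha * r) ** 2 * var_sum / (pi1 - pi2) ** 2
    return r, n

r, n = sample_size_per_group(0.17, 0.24)
print(round(r, 4), round(n, 1), math.ceil(n))
```

Rounding n up to the next integer gives the number of smokers required in each program.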
Problem 3
(a) The observed proportions of SIDS cases in the nine weight categories are 0.00358, 0.00726,
0.01327, 0.00423, 0.00321, 0.00208, 0.00156, 0.00155, 0.00097, respectively. Using binomial
models for the number of SIDS cases, conditioning on the number of births in each weight
category, the standard errors for the observed proportions are 0.00358, 0.00417, 0.00381, 0.00121,
0.00053, 0.00029, 0.00027, 0.00043, 0.00069, respectively. The SIDS incidence rates in the two
lowest weight categories are poorly estimated, and it is not clear if SIDS incidence rates
really increase across the first three weight categories. Given the accuracy of the observed
proportions, there appears to be a decreasing trend in the incidence rates of SIDS cases as
birth weight increases. The analysis in problem 4 does not reject the fit of a monotonically
decreasing power (or exponential) curve to this trend.
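The proportions and binomial standard errors in part (a) can be recomputed with a short Python sketch. The counts are the ones entered in the S-PLUS code for problem 3; the variable names are illustrative. The standard error used is sqrt(p(1-p)/n), and agreement with the listed values should be expected only up to rounding in the last digit.

```python
import math

# SIDS cases and births in the nine birth weight categories (lightest to heaviest)
sids = [1, 3, 12, 12, 37, 52, 34, 13, 2]
births = [279, 413, 904, 2838, 11509, 24941, 21832, 8408, 2061]

# observed proportion and binomial standard error in each category
props = [x / n for x, n in zip(sids, births)]
ses = [math.sqrt(p * (1 - p) / n) for p, n in zip(props, births)]

for i, (p, se) in enumerate(zip(props, ses), start=1):
    print(f"category {i}: proportion = {p:.5f}, std. error = {se:.5f}")
```

In the lightest category the standard error is about as large as the proportion itself, which is why the rates in the lowest weight categories are described as poorly estimated.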
(b) Without combining any birth weight categories, $X^2 = 71.25$ with d.f. = 8, and the chi-square
approximation to the distribution of the Pearson statistic, when the null hypothesis
is true, yields p-value < 0.0001. SIDS rates are not the same for all birth weight categories.
Since two of the estimated expected counts are smaller than one and four of the estimated
expected counts are smaller than 5, the large sample chi-square approximation to the distributions
of the Pearson and deviance statistics may not provide accurate p-values. One
option would be to combine the results for the three smallest birth weight categories. This
yields a Pearson statistic that also rejects the hypothesis of equal SIDS rates across all birth
weight categories.
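The Pearson statistic in part (b) can be reproduced with the following Python sketch, which treats the data as a 9 x 2 table (SIDS, non-SIDS) with expected counts computed from the pooled SIDS rate. The helper names are hypothetical. For the p-value it uses the closed form of the chi-square upper tail that holds for even degrees of freedom, which avoids any non-standard library.

```python
import math

sids = [1, 3, 12, 12, 37, 52, 34, 13, 2]
births = [279, 413, 904, 2838, 11509, 24941, 21832, 8408, 2061]

def pearson_homogeneity(successes, totals):
    """Pearson X^2 for equal success probabilities across groups (r x 2 table)."""
    p_pooled = sum(successes) / sum(totals)
    x2 = 0.0
    expected_successes = []
    for x, n in zip(successes, totals):
        e_yes = n * p_pooled          # expected SIDS count under H0
        e_no = n * (1 - p_pooled)     # expected non-SIDS count under H0
        expected_successes.append(e_yes)
        x2 += (x - e_yes) ** 2 / e_yes + ((n - x) - e_no) ** 2 / e_no
    return x2, expected_successes

def chi2_sf_even_df(x, df):
    """P(chi-square_df > x) for even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!."""
    if df % 2 != 0:
        raise ValueError("this closed form only applies to even df")
    half = x / 2
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))

x2, expected = pearson_homogeneity(sids, births)
df = len(sids) - 1
p_value = chi2_sf_even_df(x2, df)
print(round(x2, 2), df, p_value)
print("expected SIDS counts:", [round(e, 2) for e in expected])
```

Printing the expected SIDS counts also shows which cells fall below 1 and below 5, the reason given above for doubting the chi-square approximation.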
(c) Let $\pi_{i1}$ denote the joint probability of a birth in the ith birth weight category and a SIDS
case. Let $\pi_{i2}$ denote the joint probability of a birth in the ith birth weight category and
a non-SIDS case. Let $X_{i1}$ and $X_{i2}$ denote the corresponding random counts. The joint
distribution of $X_{11}, X_{12}, \ldots, X_{91}, X_{92}$ is a multinomial distribution with total sample size
$n = \sum_{i=1}^{9}\sum_{j=1}^{2} X_{ij}$. The probability function is
\[
P(X_{11}=x_{11}, X_{12}=x_{12}, \ldots, X_{91}=x_{91}, X_{92}=x_{92})
  = \frac{n!}{\prod_{i=1}^{9}\prod_{j=1}^{2} x_{ij}!}\,\prod_{i=1}^{9}\prod_{j=1}^{2}\pi_{ij}^{x_{ij}}.
\]
Let $n_i = X_{i1} + X_{i2}$ be the number of infants in the ith birth weight category. Then, writing
$\sum_{S}$ for the sum over all counts $x_{ij}$, $i = 1, \ldots, 8$, $j = 1, 2$, with $\sum_{i=1}^{8}\sum_{j=1}^{2} x_{ij} = n - n_9$,
\begin{align*}
P(X_{91}=x_{91} \mid X_{91}+X_{92}=n_9)
&= \frac{P(X_{91}=x_{91} \text{ and } X_{91}+X_{92}=n_9)}{P(X_{91}+X_{92}=n_9)} \\
&= \frac{P(X_{91}=x_{91} \text{ and } X_{92}=n_9-x_{91})}{\sum_{k=0}^{n_9} P(X_{91}=k,\; X_{92}=n_9-k)} \\
&= \frac{\sum_{S} P(X_{11}=x_{11}, \ldots, X_{82}=x_{82},\; X_{91}=x_{91},\; X_{92}=n_9-x_{91})}
        {\sum_{k=0}^{n_9}\sum_{S} P(X_{11}=x_{11}, \ldots, X_{82}=x_{82},\; X_{91}=k,\; X_{92}=n_9-k)} \\
&= \frac{\dfrac{\pi_{91}^{x_{91}}\pi_{92}^{n_9-x_{91}}}{x_{91}!\,(n_9-x_{91})!}
         \sum_{S}\dfrac{n!}{\prod_{i=1}^{8}\prod_{j=1}^{2}x_{ij}!}\prod_{i=1}^{8}\prod_{j=1}^{2}\pi_{ij}^{x_{ij}}}
        {\sum_{k=0}^{n_9}\dfrac{\pi_{91}^{k}\pi_{92}^{n_9-k}}{k!\,(n_9-k)!}
         \sum_{S}\dfrac{n!}{\prod_{i=1}^{8}\prod_{j=1}^{2}x_{ij}!}\prod_{i=1}^{8}\prod_{j=1}^{2}\pi_{ij}^{x_{ij}}} \\
&= \frac{\dfrac{n_9!}{x_{91}!\,(n_9-x_{91})!}\,\pi_{91}^{x_{91}}\pi_{92}^{n_9-x_{91}}}
        {\sum_{k=0}^{n_9}\dfrac{n_9!}{k!\,(n_9-k)!}\,\pi_{91}^{k}\pi_{92}^{n_9-k}} \\
&= \frac{n_9!}{x_{91}!\,(n_9-x_{91})!}\;\frac{\pi_{91}^{x_{91}}\pi_{92}^{n_9-x_{91}}}{(\pi_{91}+\pi_{92})^{n_9}} \\
&= \frac{n_9!}{x_{91}!\,(n_9-x_{91})!}
   \left(\frac{\pi_{91}}{\pi_{91}+\pi_{92}}\right)^{x_{91}}
   \left(1-\frac{\pi_{91}}{\pi_{91}+\pi_{92}}\right)^{n_9-x_{91}}.
\end{align*}
The sum over $S$ is the same in the numerator and in every term of the denominator, so it cancels.
Hence if you condition on the number of infants in the 9th (heaviest weight) category, then
$X_{91}$, the number of SIDS cases in that category, has a Binomial($n_9$, $\pi_{91}/(\pi_{91}+\pi_{92})$) distribution.
Note that $\pi_{91}/(\pi_{91}+\pi_{92})$ is the conditional probability that a baby in the heaviest birth weight
category becomes a SIDS case.
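The conditioning argument in part (c) can be checked numerically on a toy case. The Python sketch below is an illustration under stated assumptions: it uses only two weight categories (four multinomial cells) instead of nine, and the cell probabilities and sample sizes are made-up values. It computes the conditional distribution of the last SIDS count by brute-force enumeration and compares it with the binomial probabilities.

```python
import math
from itertools import product

def multinomial_pmf(counts, probs):
    """Multinomial probability of the given cell counts."""
    coef = math.factorial(sum(counts))
    for c in counts:
        coef //= math.factorial(c)
    return coef * math.prod(p ** c for p, c in zip(probs, counts))

# Toy values (assumptions): cells are (pi_11, pi_12, pi_21, pi_22)
probs = (0.1, 0.3, 0.2, 0.4)
n_total, n_last = 6, 3   # total sample size and size of the last category

def conditional_pmf(x_last):
    """P(X_21 = x_last | X_21 + X_22 = n_last), by enumerating all tables."""
    num = den = 0.0
    for x11, x12, x21, x22 in product(range(n_total + 1), repeat=4):
        if x11 + x12 + x21 + x22 != n_total or x21 + x22 != n_last:
            continue
        pr = multinomial_pmf((x11, x12, x21, x22), probs)
        den += pr
        if x21 == x_last:
            num += pr
    return num / den

def binomial_pmf(x, n, p):
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

p_cond = probs[2] / (probs[2] + probs[3])
for x in range(n_last + 1):
    print(x, conditional_pmf(x), binomial_pmf(x, n_last, p_cond))
```

The two printed columns should agree up to floating-point error, matching the binomial form derived above.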
(d) The formula is $p \pm z_{.025}\sqrt{p(1-p)/n}$. Here, $n = 2061$ and $p = 2/2061$. This yields: lower
limit = -0.00037, upper limit = 0.00231. The large sample normal approximation to the binomial distribution is not a good approximation because the expected number of SIDS cases
is too small. Note that this is not the result produced by the prop.test function in S-PLUS.
(e) lower limit = 0.000118, upper limit = 0.0035.
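Both intervals can be reproduced with a standard-library Python sketch. The large-sample interval follows the formula in part (d). For part (e), the sketch does not use the F-distribution form from the S-PLUS function; as a swapped-in equivalent it solves the two binomial tail equations that define the "exact" interval by bisection. All function names are illustrative.

```python
import math
from statistics import NormalDist

def wald_interval(x, n, level=0.95):
    """Large-sample interval p +/- z * sqrt(p(1-p)/n)."""
    p = x / n
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k + 1))

def bisect_root(f, lo, hi, iters=200):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    for _ in range(iters):
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_interval(x, n, level=0.95):
    """Interval from the tail equations P(X >= x) = a and P(X <= x) = a."""
    a = (1 - level) / 2
    lower = 0.0 if x == 0 else bisect_root(
        lambda p: 1 - binom_cdf(x - 1, n, p) - a, 0.0, 1.0)
    upper = 1.0 if x == n else bisect_root(
        lambda p: binom_cdf(x, n, p) - a, 0.0, 1.0)
    return lower, upper

wald_lo, wald_hi = wald_interval(2, 2061)
exact_lo, exact_hi = exact_interval(2, 2061)
print(wald_lo, wald_hi)
print(exact_lo, exact_hi)
```

The negative lower limit of the large-sample interval is one visible symptom of the poor normal approximation noted in part (d).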
Problem 4
(a) Note that
\[ \frac{\pi_{i+1}}{\pi_i} = \frac{\exp(\beta_0 + \beta_1 (i+1))}{\exp(\beta_0 + \beta_1 i)} = \exp(\beta_1). \]
Then, $\beta_1 = \log(\pi_{i+1}/\pi_i)$ is the log of the relative risk of SIDS for adjacent birth weight
categories. Since $\exp(\hat\beta_1) = 0.72$ and $1/0.72 = 1.39$, the relative risk of SIDS increases by
about 40 percent when the birth falls into the next lower birth weight category.
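The arithmetic behind the 0.72 and 1.39 figures is short enough to check directly. This Python sketch takes the slope estimate -0.329635 from the S-PLUS code for problem 4; the variable names are illustrative.

```python
import math

beta1_hat = -0.329635   # slope estimate reported in the S-PLUS code for problem 4

# relative risk when moving to the next heavier category
rr_next_heavier = math.exp(beta1_hat)
# relative risk when moving to the next lighter category
rr_next_lighter = 1 / rr_next_heavier

print(round(rr_next_heavier, 2), round(rr_next_lighter, 2))
```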
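(b) The log-likelihood function is
\begin{align*}
l &= \log\left( \prod_{i=1}^{9} \frac{n_i!}{Y_i!\,(n_i-Y_i)!}\, \pi_i^{Y_i} (1-\pi_i)^{n_i-Y_i} \right) \\
  &= \sum_{i=1}^{9} \log\left(\frac{n_i!}{Y_i!\,(n_i-Y_i)!}\right)
     + \sum_{i=1}^{9} Y_i \log \pi_i + \sum_{i=1}^{9} (n_i-Y_i)\log(1-\pi_i) \\
  &= \sum_{i=1}^{9} \log\left(\frac{n_i!}{Y_i!\,(n_i-Y_i)!}\right)
     + \sum_{i=1}^{9} Y_i(\beta_0+\beta_1 i)
     + \sum_{i=1}^{9} (n_i-Y_i)\log\left(1-\exp(\beta_0+\beta_1 i)\right).
\end{align*}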
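The log-likelihood in part (b) can be evaluated directly. The following Python sketch is illustrative: the function names are hypothetical, and it uses `math.lgamma` for the log binomial coefficients and the parameter estimates reported in the S-PLUS code for problem 4.

```python
import math

sids = [1, 3, 12, 12, 37, 52, 34, 13, 2]
births = [279, 413, 904, 2838, 11509, 24941, 21832, 8408, 2061]
cats = range(1, 10)   # birth weight category index i = 1, ..., 9

def log_binom_coef(n, y):
    """log( n! / (y! (n-y)!) ) computed through log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)

def log_likelihood(beta0, beta1):
    """Log-likelihood for independent binomials with pi_i = exp(beta0 + beta1*i)."""
    total = 0.0
    for i, y, n in zip(cats, sids, births):
        eta = beta0 + beta1 * i
        if eta >= 0:
            return float("-inf")   # pi_i = exp(eta) must stay below 1
        total += log_binom_coef(n, y) + y * eta + (n - y) * math.log1p(-math.exp(eta))
    return total

ll_hat = log_likelihood(-4.10393, -0.329635)
print(ll_hat)
```

Evaluating the function at nearby parameter values, for example `log_likelihood(-4.0, -0.329635)`, gives a smaller value, as expected if the reported estimates maximize the likelihood.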
(c) The likelihood equations are
\[ \frac{\partial l}{\partial \beta_0} = \sum_{i=1}^{9} Y_i - \sum_{i=1}^{9} (n_i - Y_i)\,\frac{\exp(\beta_0+\beta_1 i)}{1-\exp(\beta_0+\beta_1 i)} = 0 \]
\[ \frac{\partial l}{\partial \beta_1} = \sum_{i=1}^{9} i\,Y_i - \sum_{i=1}^{9} i\,(n_i - Y_i)\,\frac{\exp(\beta_0+\beta_1 i)}{1-\exp(\beta_0+\beta_1 i)} = 0. \]
The solution to these equations must be obtained numerically with an iterative algorithm.
There is no convenient algebraic formula for the solution.
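One possible iterative algorithm is Fisher scoring, sketched below in Python. This is an assumption about how the equations could be solved, not a record of how the estimates in these solutions were obtained. The step-halving safeguard, the starting values (equal rates in all categories), and all function names are choices made for this illustration.

```python
import math

sids = [1, 3, 12, 12, 37, 52, 34, 13, 2]
births = [279, 413, 904, 2838, 11509, 24941, 21832, 8408, 2061]
cats = range(1, 10)

def kernel_loglik(b0, b1):
    """Log-likelihood without the constant binomial-coefficient term."""
    return sum(y * (b0 + b1 * i) + (n - y) * math.log1p(-math.exp(b0 + b1 * i))
               for i, y, n in zip(cats, sids, births))

def score_and_information(b0, b1):
    """Score vector and expected (Fisher) information at (b0, b1)."""
    s0 = s1 = a11 = a12 = a22 = 0.0
    for i, y, n in zip(cats, sids, births):
        pi = math.exp(b0 + b1 * i)
        odds = pi / (1 - pi)
        s0 += y - (n - y) * odds
        s1 += i * (y - (n - y) * odds)
        a11 += n * odds
        a12 += i * n * odds
        a22 += i * i * n * odds
    return (s0, s1), (a11, a12, a22)

def fit_log_linear(max_iter=100, tol=1e-10):
    b0, b1 = math.log(sum(sids) / sum(births)), 0.0   # start: equal rates
    ll = kernel_loglik(b0, b1)
    for _ in range(max_iter):
        (s0, s1), (a11, a12, a22) = score_and_information(b0, b1)
        det = a11 * a22 - a12 ** 2
        d0 = (a22 * s0 - a12 * s1) / det   # scoring direction: information^{-1} * score
        d1 = (a11 * s1 - a12 * s0) / det
        step = 1.0
        while step > 1e-12:
            nb0, nb1 = b0 + step * d0, b1 + step * d1
            # keep every pi_i below 1 and never decrease the likelihood
            if max(nb0 + nb1 * i for i in cats) < 0:
                new_ll = kernel_loglik(nb0, nb1)
                if new_ll >= ll:
                    break
            step /= 2
        else:
            return b0, b1   # no improving step found: treat as converged
        change = abs(nb0 - b0) + abs(nb1 - b1)
        b0, b1, ll = nb0, nb1, new_ll
        if change < tol:
            break
    return b0, b1

beta0_hat, beta1_hat = fit_log_linear()
print(beta0_hat, beta1_hat)
```

The result can be compared with the values -4.10393 and -0.329635 entered in the S-PLUS code for problem 4.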
(d) The data are consistent with the proposed model.
$X^2 = 10.79$ with 9 - 2 = 7 degrees of freedom and p-value = 0.148.
$G^2 = 9.45$ with 9 - 2 = 7 degrees of freedom and p-value = 0.222.
Under the alternative model, you must estimate a different incidence rate for each of the
nine binomial distributions. Under the null hypothesis you must estimate two parameters,
$\beta_0$ and $\beta_1$. Hence, the difference in the dimensions of the parameter spaces is 9 - 2 = 7.
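The two goodness-of-fit statistics in part (d) can be recomputed from the fitted model. The Python sketch below mirrors the S-PLUS calculation (expected SIDS and non-SIDS counts in each category) and uses the parameter estimates entered in that code. The p-value helper is an assumption of this sketch: it uses the closed-form series for the chi-square upper tail with odd degrees of freedom, so no external library is needed.

```python
import math

sids = [1, 3, 12, 12, 37, 52, 34, 13, 2]
births = [279, 413, 904, 2838, 11509, 24941, 21832, 8408, 2061]
beta0_hat, beta1_hat = -4.10393, -0.329635   # estimates from the S-PLUS code

def chi2_sf_odd_df(x, df):
    """P(chi-square_df > x) for odd df, using the finite series for odd df."""
    term, series = 1.0, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        series += term               # adds x^(r-1) / (1*3*...*(2r-1))
        term *= x / (2 * r + 1)
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * series)

x2 = g2 = 0.0
for i, (y, n) in enumerate(zip(sids, births), start=1):
    pi_hat = math.exp(beta0_hat + beta1_hat * i)
    # SIDS and non-SIDS cells for birth weight category i
    for observed, expected in ((y, n * pi_hat), (n - y, n * (1 - pi_hat))):
        x2 += (observed - expected) ** 2 / expected
        g2 += 2 * observed * math.log(observed / expected)

df = len(sids) - 2   # nine binomial rates minus two model parameters
p_x2 = chi2_sf_odd_df(x2, df)
p_g2 = chi2_sf_odd_df(g2, df)
print(round(x2, 2), round(p_x2, 3))
print(round(g2, 2), round(p_g2, 3))
```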
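(e) The second partial derivatives of the log-likelihood function are
\[ \frac{\partial^2 l}{\partial \beta_0^2} = -\sum_{i=1}^{9} (n_i - Y_i)\,\frac{\exp(\beta_0+\beta_1 i)}{(1-\exp(\beta_0+\beta_1 i))^2} \]
\[ \frac{\partial^2 l}{\partial \beta_1^2} = -\sum_{i=1}^{9} i^2 (n_i - Y_i)\,\frac{\exp(\beta_0+\beta_1 i)}{(1-\exp(\beta_0+\beta_1 i))^2} \]
\[ \frac{\partial^2 l}{\partial \beta_0\,\partial \beta_1} = -\sum_{i=1}^{9} i\,(n_i - Y_i)\,\frac{\exp(\beta_0+\beta_1 i)}{(1-\exp(\beta_0+\beta_1 i))^2} \]
Since
\[ E(Y_i) = n_i \pi_i = n_i \exp(\beta_0+\beta_1 i), \]
the negative of the expectations of the second partial derivatives are
\[ -E\left(\frac{\partial^2 l}{\partial \beta_0^2}\right) = \sum_{i=1}^{9} n_i\,\frac{\exp(\beta_0+\beta_1 i)}{1-\exp(\beta_0+\beta_1 i)} \]
\[ -E\left(\frac{\partial^2 l}{\partial \beta_1^2}\right) = \sum_{i=1}^{9} i^2 n_i\,\frac{\exp(\beta_0+\beta_1 i)}{1-\exp(\beta_0+\beta_1 i)} \]
\[ -E\left(\frac{\partial^2 l}{\partial \beta_0\,\partial \beta_1}\right) = \sum_{i=1}^{9} i\,n_i\,\frac{\exp(\beta_0+\beta_1 i)}{1-\exp(\beta_0+\beta_1 i)} \]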
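Two small steps connect the likelihood equations to these expressions. They are written out below with the shorthand $\eta_i = \beta_0 + \beta_1 i$, a symbol introduced only for this note.

```latex
\frac{d}{d\eta_i}\left[\frac{e^{\eta_i}}{1-e^{\eta_i}}\right]
  = \frac{e^{\eta_i}\,(1-e^{\eta_i}) + e^{\eta_i}\,e^{\eta_i}}{(1-e^{\eta_i})^{2}}
  = \frac{e^{\eta_i}}{(1-e^{\eta_i})^{2}},
\qquad
E(n_i - Y_i) = n_i\,(1-e^{\eta_i}).
```

The first identity gives the squared denominator in the second partial derivatives. The second cancels one factor of $1-e^{\eta_i}$ when expectations are taken, which leaves $n_i e^{\eta_i}/(1-e^{\eta_i})$ in each sum.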
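(f) The Fisher information matrix,
\[
I = -\begin{pmatrix}
E\left(\dfrac{\partial^2 l}{\partial \beta_0^2}\right) & E\left(\dfrac{\partial^2 l}{\partial \beta_0\,\partial \beta_1}\right) \\
E\left(\dfrac{\partial^2 l}{\partial \beta_0\,\partial \beta_1}\right) & E\left(\dfrac{\partial^2 l}{\partial \beta_1^2}\right)
\end{pmatrix},
\]
is obtained from the negative of the expectations of the second partial derivatives. By
substituting mle's for unknown parameters, we have an estimate
\[
\hat{I} = \begin{pmatrix} 166.48 & 950.31 \\ 950.31 & 5780.38 \end{pmatrix}.
\]
The inverse of this matrix,
\[
\hat{I}^{-1} = \begin{pmatrix} 0.0977 & -0.0161 \\ -0.0161 & 0.00281 \end{pmatrix},
\]
provides an estimate of the covariance matrix for the large sample normal approximation to
the distribution of the mle's.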
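The estimated information matrix and its inverse can be recomputed with a short Python sketch that follows the same steps as the S-PLUS code for part (f): form the fitted odds in each category, accumulate the three sums, and invert the 2 x 2 matrix. The parameter estimates are the ones entered in that code, and the variable names are illustrative.

```python
import math

births = [279, 413, 904, 2838, 11509, 24941, 21832, 8408, 2061]
beta0_hat, beta1_hat = -4.10393, -0.329635   # estimates from the S-PLUS code

a11 = a12 = a22 = 0.0
for i, n in enumerate(births, start=1):
    pi_hat = math.exp(beta0_hat + beta1_hat * i)
    odds = pi_hat / (1 - pi_hat)   # exp(eta) / (1 - exp(eta))
    a11 += n * odds
    a12 += i * n * odds
    a22 += i * i * n * odds

info = [[a11, a12], [a12, a22]]

# inverse of a symmetric 2 x 2 matrix
det = a11 * a22 - a12 ** 2
info_inv = [[a22 / det, -a12 / det], [-a12 / det, a11 / det]]

print([[round(v, 2) for v in row] for row in info])
print([[round(v, 5) for v in row] for row in info_inv])
```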
(g)
\[ s_{\hat\beta_0} = \sqrt{0.0977} = 0.313 \]
\[ s_{\hat\beta_1} = \sqrt{0.00281} = 0.053 \]
In this case, it is easy to obtain formulas for the expectations of the second partial derivatives
of the log-likelihood. In situations where the expectations are not easy to derive, the values
of the second partial derivatives evaluated at the mle's of the parameters are used instead
of the expectations of those derivatives. This is often called the "local" estimator of the
information matrix.
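To illustrate the "local" estimator described above, the Python sketch below evaluates the negative second partial derivatives at the parameter estimates, without taking expectations, and converts the result into standard errors. It is an illustration only: the estimates are those entered in the S-PLUS code, and the variable names are hypothetical. For these data the resulting standard errors should be close to the ones based on the expected information.

```python
import math

sids = [1, 3, 12, 12, 37, 52, 34, 13, 2]
births = [279, 413, 904, 2838, 11509, 24941, 21832, 8408, 2061]
beta0_hat, beta1_hat = -4.10393, -0.329635   # estimates from the S-PLUS code

h11 = h12 = h22 = 0.0
for i, (y, n) in enumerate(zip(sids, births), start=1):
    e = math.exp(beta0_hat + beta1_hat * i)
    # ith term of the negative second derivative with respect to beta0
    w = (n - y) * e / (1 - e) ** 2
    h11 += w
    h12 += i * w
    h22 += i * i * w

# standard errors from the inverse of the 2 x 2 "local" information matrix
det = h11 * h22 - h12 ** 2
se_beta0 = math.sqrt(h22 / det)
se_beta1 = math.sqrt(h11 / det)
print(round(se_beta0, 3), round(se_beta1, 3))
```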
#-----------------------------------------------------------------------;
#Xiao-Hu Liu used the following S-PLUS code to obtain the solutions to assignment 1.
#------------;
# problem 2  ;
#------------;
alpha <- 0.05
beta <- 0.2
pi1 <- 0.17
pi2 <- 0.24
p <- (pi1+pi2)/2
z.alpha <- qnorm(1-alpha)
z.beta <- qnorm(1-beta)
temp <- pi1*(1-pi1)+pi2*(1-pi2)
r <- sqrt(2*p*(1-p)/temp)
n <- (z.beta+z.alpha*r)^2*temp/(pi1-pi2)^2
#------------;
# problem 3  ;
#------------;
cat <- 1:9
x <- c(1,3,12,12,37,52,34,13,2)
n <- c(279, 413, 904, 2838, 11509, 24941, 21832, 8408, 2061)
#Merge the data;
cbind(cat, x, n)
#part (a);
p <- x/n
plot(cat, p)
# a decreasing trend, a power (or exponential) curve may fit
# the scatter plot;
#part (b);
#method 1: Chi-square test for two-way contingency table;
x1 <- n-x
X <- cbind(x, x1)
chisq.test(X)
#method 2: use prop.test to test equal proportions of several
#          independent binomial distributions;
prop.test(x, n)
#part (d);
x9 <- x[9]; n9 <- n[9];
prop.test(x9, n9)$conf.int
#part (e);
#The following function uses F-distribution to construct a
#100*a% "exact" confidence interval for the probability of
#success in a binomial distribution;
binci.f <- function(x, n, a)
{
  # x = observed number of successes;
  # n = number of trials;
  # a = level of confidence (e.g. 0.95);
  p <- x/n
  a2 <- 1-((1-a)/2)
  if (x > 0) f1 <- qf(a2,2*(n-x+1),2*x) else f1 <- 1
  plower <- x/(x + (n-x+1)*f1)
  if (n > x) f2 <- qf(a2,2*(x+1),2*(n-x)) else f2 <- 1
  pupper <- (x+1)*f2/((n-x)+(x+1)*f2)
  cbind(plower=plower, pupper=pupper)
}
binci.f(x9, n9, 0.95)
#------------;
# problem 4  ;
#------------;
#The data were entered into S-PLUS in problem 3;
#part (d);
beta0 <- -4.10393
beta1 <- -0.329635
pi.hat <- exp(beta0+beta1*cat)
m <- n*pi.hat
m1 <- n-m
M <- cbind(m, m1)
X.sqr <- sum((X-M)^2/M)
df <- 7
p.value1 <- 1-pchisq(X.sqr, df)
G.sqr <- 2*sum(X*log(X/M))
p.value2 <- 1-pchisq(G.sqr, df)
#part (f);
o.r <- pi.hat/(1-pi.hat)
a11 <- sum(n*o.r)
a12 <- sum(cat*n*o.r)
a22 <- sum(cat^2*n*o.r)
I <- matrix(c(a11, a12, a12, a22), 2, 2)
I.inv <- solve(I)
#part (g);
std <- sqrt(diag(I.inv))