Lab 1 Worksheet

Applied Statistics II - Categorical Data Analysis
Data analysis using Genstat - Exercise 1 One and two-way tables
Analysis 1.1 Confidence Interval for Proportion
The data is in a file called EXER1_1.prn, which holds HIV screening results from
a clinic in Mombasa from 1993 to 1998. There are 998 cases in the file. The
columns in the file are the ID (qrsid), the day, month and year of the test, and the
HIV status. The column named HIV is 0 for a negative screen and 1 for a positive one.
We wish to
(a) estimate the proportion with HIV in 1993,
(b) test the significance of the difference of the proportion from 0.5 and
(c) compute a 95% confidence interval for the proportion.
The data is first read in and year and hiv defined as factors:
units [997]
"Analysis of some data from file exer1_1.txt"
open 'exer1_1.txt';channel=2;filetype=input
read [channel=2] id,day,month,year,hiv
groups [redefine=yes] year,hiv
Next the data under consideration is restricted to that for 1993:
rest id,year,hiv ;condition=(year.eq.93)
(In the line above, rest is an abbreviation of restrict, and year.eq.93 selects the
units for which year equals 93, i.e. 1993.)
The 1993 data is tabulated by HIV status in a table called hiv93. This table is printed,
and the Pearson chi-square test is used to assess whether there is evidence that the
proportion is not 0.5:
tabulate [classification=hiv;counts=hiv93;margins=yes] id
print hiv93
chisquare hiv93 …chi-sqd test of independence
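As a cross-check outside Genstat, the Pearson statistic for a one-way table tested against equal proportions can be sketched in Python. This is a minimal illustration, not the Genstat procedure itself: the function names are hypothetical, the counts are made up for illustration (they are not the worksheet data), and the p-value shortcut is valid for 1 degree of freedom only.

```python
import math

def chisq_equal_proportions(counts):
    """Pearson goodness-of-fit statistic for a one-way table.

    Null hypothesis: every cell is equally likely, so for two cells
    this tests whether the proportion is 0.5.
    """
    n = sum(counts)
    expected = n / len(counts)          # same expected count in each cell
    stat = sum((c - expected) ** 2 / expected for c in counts)
    df = len(counts) - 1
    return stat, df

def p_value_1df(stat):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(stat / 2))

# Hypothetical counts (NOT the worksheet data): 60 negative, 40 positive.
stat, df = chisq_equal_proportions([60, 40])
p = p_value_1df(stat)
print(f"chi-square = {stat:.2f} on {df} df, p = {p:.4f}")
```

Replacing the illustrative counts with the two cells of hiv93 gives a figure to compare against the Genstat output.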
Compute p̂ the estimate of the proportion with HIV from the table and use the
normal approximation to the binomial to calculate a 95% confidence interval as:
$$\hat{p} \pm 1.96\sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}$$
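The interval is the estimate plus or minus 1.96 standard errors, where the standard error is the square root of p̂(1 − p̂)/n. A minimal Python sketch of that arithmetic follows. The function name `wald_ci` and the counts are assumptions chosen for illustration; they are not the worksheet data.

```python
import math

def wald_ci(positives, n, z=1.96):
    """Normal-approximation confidence interval for a binomial proportion."""
    p_hat = positives / n                       # estimated proportion
    se = math.sqrt(p_hat * (1 - p_hat) / n)     # estimated standard error
    return p_hat, p_hat - z * se, p_hat + z * se

# Hypothetical counts (NOT the worksheet data): 40 positives out of 100.
p_hat, lower, upper = wald_ci(40, 100)
print(f"p-hat = {p_hat:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

The same steps can be done by hand or with a calculator using the counts printed in hiv93.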
Repeat for 1997. Note that in Genstat the effect of successive restrict statements is
cumulative. You can restore the full data set by executing an unconditional restrict
statement. I.e.:
rest id,year,hiv
(This resets the restriction, returning to the original data set with no condition.)
Analysis 1.2 Comparing two proportions: Chi-Squared test of Association
This example uses the procedure Chisquare, which can test for independence in a
one-way table, as above, or in an R x C table, and uses Pearson's goodness of fit
criterion. Another, likelihood-based, method will be introduced later. For small counts
the procedure FEXACT2X2 performs an exact analysis of a 2 x 2 table.
Form the two-way table of HIV incidence in 1993 and 1997 and test whether the
proportion with HIV is the same for both years. This is the same as testing for an
association between year and HIV:
rest id,year,hiv
rest id,year,hiv ;condition=(year.eq.93.or.year.eq.97)
tabulate [classification=year,hiv;counts=hiv9397] id
print hiv9397
chisquare hiv9397
This gives the following table, with zeros for the omitted years. The subsequent
execution of the chisquare directive does not work correctly, as it is computed for the
whole table, zeros included, and hence assumes 5 degrees of freedom.
Table hiv9397

        hiv      0      1
year
  93           165    225
  94             0      0
  95             0      0
  96             0      0
  97           102    102
  98             0      0
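The 5 degrees of freedom come from applying (R − 1)(C − 1) to all six year rows, including the four that are entirely zero. A short Python sketch, using the counts from the table above and hypothetical helper names, shows how dropping the empty rows brings this down to the 1 degree of freedom appropriate for a 2 x 2 comparison.

```python
def drop_empty_rows(table):
    """Keep only rows that contain at least one observation."""
    return [row for row in table if sum(row) > 0]

def degrees_of_freedom(table):
    """(rows - 1) * (columns - 1) for a two-way table."""
    return (len(table) - 1) * (len(table[0]) - 1)

# Counts from table hiv9397: rows are years 93..98, columns are hiv = 0, 1.
full = [[165, 225], [0, 0], [0, 0], [0, 0], [102, 102], [0, 0]]
reduced = drop_empty_rows(full)

print("df for the 6 x 2 table:", degrees_of_freedom(full))
print("df after dropping empty rows:", degrees_of_freedom(reduced))
```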
There are two alternatives, either (i) we select the appropriate data from this table and
enter it in a table and compute the chisquare value or (ii) the data can be extracted
from the table above using Genstat programming. Both methods are illustrated.
(i) Use the following Genstat code.
factor [nvalues=4;levels=2;labels=!t('93','97')] year9397
table [classification=year9397,hiv; values=165,225,102,102] hi9397
11,12,21,22…positions
print hi9397
chisquare hi9397
This gives the following output
Table hi9397

            hiv        0        1
year9397
      93           165.0    225.0
      97           102.0    102.0

Pearson chi-square value is 3.20 with 1 df.
Probability level (under null hypothesis) p = 0.074
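The reported statistic can be checked independently. The sketch below is a Python cross-check, not part of the Genstat session: it computes the Pearson chi-square for the 2 x 2 counts given above from the usual expected counts (row total times column total divided by the grand total). The function name is hypothetical, and the p-value line uses a shortcut that holds only for 1 degree of freedom.

```python
import math

def pearson_chisq(table):
    """Pearson chi-square statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Counts from the worksheet: rows are years 93 and 97, columns are hiv = 0, 1.
table = [[165, 225], [102, 102]]
stat, df = pearson_chisq(table)
p = math.erfc(math.sqrt(stat / 2))   # upper tail, valid for 1 df only
print(f"Pearson chi-square = {stat:.2f} with {df} df, p = {p:.3f}")
```

The result should agree with the Genstat output above (3.20 with 1 df, p = 0.074) to the printed precision.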
(ii) Extract the data from the 6 x 2 table above using Genstat code:
scalar d[1...12]
" The following line transfers the 12 values in the 6 x 2 table hiv9397 to the
twelve scalars d[1] to d[12] "
equate hiv9397;!p(d[1...12])
print d[1...12]
" The following lines create a 2 x 2 table h9397 classified by the factors year9397
(which has been defined to have two levels labelled '93' and '97') and hiv.
The appropriate values are transferred from the scalars (d[1], d[2], d[9] and
d[10]) to this table and the chisquare test is performed on it. "
factor [nvalues=4;levels=2;labels=!t('93','97')] year9397
table [classification=year9397,hiv] h9397
equate !(d[1],d[2],d[9],d[10]);h9397
print h9397
chisquare h9397
This gives the following output
 d[1]   d[2]   d[3]   d[4]   d[5]   d[6]   d[7]   d[8]   d[9]  d[10]  d[11]  d[12]
  165    225      0      0      0      0      0      0    102    102      0      0

Table h9397

            hiv        0        1
year9397
      93           165.0    225.0
      97           102.0    102.0

Pearson chi-square value is 3.20 with 1 df.
Probability level (under null hypothesis) p = 0.074
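The choice of d[1], d[2], d[9] and d[10] follows from the order in which the 6 x 2 table is unpacked. The printed scalars show the values taken row by row, with hiv varying fastest. The Python sketch below reproduces that index arithmetic under this assumption; the helper `position` is hypothetical and is not a Genstat function.

```python
years = [93, 94, 95, 96, 97, 98]
hiv_levels = [0, 1]

# Counts from table hiv9397: one row per year, columns hiv = 0, 1.
table = [[165, 225], [0, 0], [0, 0], [0, 0], [102, 102], [0, 0]]

# Unpack row by row, as the printed values of d[1]...d[12] suggest.
d = [cell for row in table for cell in row]

def position(year, hiv):
    """1-based position of cell (year, hiv) in the row-by-row list."""
    return years.index(year) * len(hiv_levels) + hiv_levels.index(hiv) + 1

wanted = [(93, 0), (93, 1), (97, 0), (97, 1)]
positions = [position(y, h) for y, h in wanted]
values = [d[p - 1] for p in positions]
print("positions:", positions)
print("values:", values)
```

For the 1994 and 1996 comparison the same rule gives the positions of the cells to extract, with `wanted` changed to the corresponding years.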
Repeat this analysis to compare the prevalence of HIV in the years 1994 and 1996.
Applied Statistics II - Categorical Data Analysis
Data analysis using Genstat - Exercise 1 One and two-way tables
ANSWER SHEET - TO BE SUBMITTED FOR GRADING
Name of student___________________
Date______________
Analysis 1.1 Confidence Interval for Proportion
Year      p̂            95% CI for p
1993      ________     ________________
1997      ________     ________________
Analysis 1.2 Comparing two proportions: Chi-Squared test of Association
Analysis for 1993 and 1997
From the output, complete the following:
Year      # screened     HIV prevalence
1993      __________     ______________
1997      __________     ______________
Pearson χ² = ______
p-value = ______
What do you conclude regarding whether there has been a change in prevalence from
1993 to 1997: ________________________________________________________
Compute (by hand) the Z-statistic for comparing the 1993 and 1997 prevalences, and
compare to your χ² result above:
Z = ________
Z² = ________
χ² = ________
Analysis for 1994 and 1996
From the output, complete the following:
Year      # screened     HIV prevalence
1994      __________     ______________
1996      __________     ______________
Pearson χ² = ______
p-value = ______
What do you conclude regarding whether there has been a change in prevalence from
1994 to 1996: ________________________________________________________