Applied Statistics II - Categorical Data Analysis
Data analysis using Genstat - Exercise 1
One and two-way tables

Analysis 1.1 Confidence Interval for Proportion

The data is in a file called EXER1_1.prn. This is a data file of HIV screening results at a clinic in Mombasa from 1993 to 1998. There are 998 cases in the file. The columns in the file are the ID (qrsid), the day, month and year of the test, and the HIV status. The column named HIV is 0 for negative screens and 1 for positive.

We wish to (a) estimate the proportion with HIV in 1993, (b) test whether the proportion differs significantly from 0.5 and (c) compute a 95% confidence interval for the proportion.

The data is first read in, and year and hiv are defined as factors:

    units [997]
    "Analysis of some data from file exer1_1.txt"
    open 'exer1_1.txt';channel=2;filetype=input
    read [channel=2] id,day,month,year,hiv
    groups [redefine=yes] year,hiv

Next, the data under consideration is restricted to that for 1993:

    rest id,year,hiv ;condition=(year.eq.93)

(In the line above, rest is short for restrict, and year.eq.93 selects the units for which year equals 93, i.e. 1993.)

The 1993 data is tabulated by HIV status in a table called hiv93. This table is printed, and the Pearson chi-square test is used to assess whether there is evidence that the proportion is not 0.5:

    tabulate [classification=hiv;counts=hiv93;margins=yes] id
    print hiv93
    chisquare hiv93     "chi-sqd test of independence"

Compute p̂, the estimate of the proportion with HIV, from the table and use the normal approximation to the binomial to calculate a 95% confidence interval as:

    \hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}

where n is the number of people screened in that year.

Repeat for 1997.

Note that in Genstat the effect of successive restrict statements is cumulative. You can restore the full data set by executing an unconditional restrict statement.
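As a cross-check outside Genstat, the estimate, the normal-approximation interval and the one-way Pearson statistic of Analysis 1.1 can be sketched in Python. This is only a minimal sketch and not part of the course's Genstat code: the function name proportion_summary is hypothetical, and the 1993 counts (165 negative, 225 positive) are assumed to be the ones shown in the tables of Analysis 1.2 in this handout.

```python
import math

def proportion_summary(positives, n, p0=0.5, z=1.96):
    """Estimate a proportion, its normal-approximation CI, and the
    one-way Pearson chi-square statistic against the null value p0.
    (Hypothetical helper for illustration; not a Genstat procedure.)"""
    p_hat = positives / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    ci = (p_hat - z * se, p_hat + z * se)
    # Pearson goodness of fit for a two-cell table with expected counts
    # n*p0 (positive) and n*(1 - p0) (negative); 1 degree of freedom.
    exp_pos, exp_neg = n * p0, n * (1 - p0)
    chi2 = ((positives - exp_pos) ** 2 / exp_pos
            + ((n - positives) - exp_neg) ** 2 / exp_neg)
    # Upper tail of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return p_hat, ci, chi2, p_value

# Assumed 1993 counts: 165 negative screens, 225 positive screens.
p_hat, ci, chi2, p_value = proportion_summary(positives=225, n=165 + 225)
print(f"p_hat = {p_hat:.4f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"chi-square = {chi2:.2f} on 1 df, p = {p_value:.4f}")
```

Calling the same function with the 1997 counts repeats the calculation for that year. The values should agree with the Genstat output up to rounding.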
That is:

    rest id,year,hiv     "resetting it back to the original data set with no condition"

Analysis 1.2 Comparing two proportions: Chi-Squared test of Association

This example uses the procedure Chisquare, which can test for independence in a one-way table as above or in an R x C table, and which uses Pearson's goodness of fit criterion. Another, likelihood-based, method will be introduced later. For small counts the procedure FEXACT2X2 performs an exact analysis of a 2 x 2 table.

Form the two-way table of HIV incidence in 1993 and 1997 and test whether the proportion with HIV is the same for both years. This is the same as testing for an association between year and HIV:

    rest id,year,hiv
    rest id,year,hiv ;condition=(year.eq.93.or.year.eq.97)
    tabulate [classification=year,hiv;counts=hiv9397] id
    print hiv9397
    chisquare hiv9397

This gives the following table, with zeros for the omitted years. The subsequent chisquare statement does not work correctly, because the statistic is computed for the whole 6 x 2 table, zeros included, and hence assumes (6 - 1) x (2 - 1) = 5 degrees of freedom.

    Table hiv9397

         hiv      0      1
    year
      93        165    225
      94          0      0
      95          0      0
      96          0      0
      97        102    102
      98          0      0

There are two alternatives: either (i) we select the appropriate counts from this table, enter them in a new table and compute the chi-square value, or (ii) we extract the counts from the table above using Genstat programming. Both methods are illustrated.

(i) Use the following Genstat code.

    factor [nvalues=4;levels=2;labels=!t('93','97')] year9397
    table [classification=year9397,hiv; values=165,225,102,102] hi9397     "values fill cell positions 11,12,21,22"
    print hi9397
    chisquare hi9397

This gives the following output:

    Table hi9397

             hiv        0        1
    year9397
          93        165.0    225.0
          97        102.0    102.0

    Pearson chi-square value is 3.20 with 1 df.
    Probability level (under null hypothesis) p = 0.074

(ii) Extract the data from the 6 x 2 table above using Genstat code:

    scalar d[1...12]
    "The following line transfers the 12 values in the 6 x 2 table hiv9397
     to the twelve scalars d[1] to d[12]"
    equate hiv9397;!p(d[1...12])
    print d[1...12]
    "The following lines create a 2 x 2 table h9397 classified by the factor
     year9397 (which has been defined to have two levels labelled '93' and '97')
     and by hiv. The appropriate values are transferred from the scalars
     (d[1], d[2], d[9] and d[10]) to this table and the chisquare test is
     performed on it."
    factor [nvalues=4;levels=2;labels=!t('93','97')] year9397
    table [classification=year9397,hiv] h9397
    equate !(d[1],d[2],d[9],d[10]);h9397
    print h9397
    chisquare h9397

This gives the following output:

    d[1]     165
    d[2]     225
    d[3]       0
    d[4]       0
    d[5]       0
    d[6]       0
    d[7]       0
    d[8]       0
    d[9]     102
    d[10]    102
    d[11]      0
    d[12]      0

    Table h9397

             hiv        0        1
    year9397
          93        165.0    225.0
          97        102.0    102.0

    Pearson chi-square value is 3.20 with 1 df.
    Probability level (under null hypothesis) p = 0.074

Repeat this analysis to compare the prevalence of HIV in the years 1994 and 1996.
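The Pearson statistic for the 2 x 2 table can also be cross-checked outside Genstat, together with the pooled two-proportion Z-statistic, whose square equals the Pearson chi-square for a 2 x 2 table. The following is a minimal Python sketch under stated assumptions: it uses the 1993 and 1997 counts from the table above, and pearson_2x2 and two_proportion_z are hypothetical helper names, not Genstat procedures.

```python
import math

def pearson_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table
    [[a, b], [c, d]], computed from observed and expected counts."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

def two_proportion_z(x1, n1, x2, n2):
    """Z-statistic for H0: p1 = p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Assumed counts: rows are years 93 and 97, columns are hiv = 0 and hiv = 1.
chi2 = pearson_2x2(165, 225, 102, 102)
z = two_proportion_z(225, 165 + 225, 102, 102 + 102)
p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square upper tail, 1 df
print(f"chi-square = {chi2:.2f} on 1 df, p = {p_value:.3f}")
print(f"Z = {z:.3f}, Z^2 = {z * z:.2f}")
```

The same two functions can be reused for the 1994 and 1996 comparison once those counts have been tabulated in Genstat.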
Applied Statistics II - Categorical Data Analysis
Data analysis using Genstat - Exercise 1
One and two-way tables

ANSWER SHEET - TO BE SUBMITTED FOR GRADING

Name of student___________________ Date______________

Analysis 1.1 Confidence Interval for Proportion

    Year     p̂           95% CI for p
    1993     ______      ______________
    1997     ______      ______________

Analysis 1.2 Comparing two proportions: Chi-Squared test of Association

Analysis for 1993 and 1997

From the output, complete the following:

    Year     # screened     HIV prevalence
    1993     ______         ______
    1997     ______         ______

Pearson χ² = ______ p-value = ______

What do you conclude regarding whether there has been a change in prevalence from 1993 to 1997:
________________________________________________________

Compute (by hand) the Z-statistic for comparing the 1993 and 1997 prevalences, and compare to your χ² result above:

Z = ________ Z² = ________ χ² = ________

Analysis for 1994 and 1996

From the output, complete the following:

    Year     # screened     HIV prevalence
    1994     ______         ______
    1996     ______         ______

Pearson χ² = ______ p-value = ______

What do you conclude regarding whether there has been a change in prevalence from 1994 to 1996:
________________________________________________________