STP 420 SUMMER 2005
STP 420
INTRODUCTION TO APPLIED STATISTICS
NOTES
PART 3 – TOPICS IN INFERENCE
CHAPTER 9
INFERENCE FOR TWO-WAY TABLES
9.1 Inference for two-way tables
                    Freq. binge drinker
Gender              Yes         No      Total
Men                1630       5550       7180
Female             1684       8232       9916
Total              3314      13782      17096
This is a 2 × 2 table: it has 2 rows and 2 columns.
The table presents two categorical variables, each with two categories.
Each count sits in a cell of the table.
A 3 × 2 table has 3 rows and 2 columns.
An r × c table has r rows and c columns.
Describing relations in two-way tables
Joint distribution – distribution formed by expressing the count in each cell as a percent
of the grand total.
Conditional distributions – distributions formed by expressing the count in each cell as a
percent of its row total. The same applies when the count is expressed as a percent of its
column total.
Expected cell count = (row total × column total) / n
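As a quick check of the expected-count formula, the sketch below recomputes the expected counts for the binge-drinking table in plain Python. The variable names are illustrative and are not part of the notes.

```python
# Observed counts from the binge-drinking table: rows = gender, columns = (Yes, No).
observed = {
    "Men": [1630, 5550],
    "Female": [1684, 8232],
}

row_totals = {g: sum(counts) for g, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
n = sum(row_totals.values())

# Expected cell count = (row total x column total) / n
expected = {
    g: [row_totals[g] * c / n for c in col_totals]
    for g in observed
}

for g, cells in expected.items():
    print(g, [round(e, 2) for e in cells])
```

For men who are frequent binge drinkers this gives 7180 × 3314 / 17096, which is about 1391.82.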
Two-way table for Frequent Binge drinkers and Gender
                                Freq. binge drinker
Gender                          Yes          No       Total
Men      Count                 1630        5550        7180
         Expected           1391.82     5788.18
         % of grand total     9.53%      32.46%      42.00%
         % of column total   49.19%      40.27%
         % of row total      22.70%      77.30%     100.00%
Female   Count                 1684        8232        9916
         Expected           1922.18     7993.82
         % of grand total     9.85%      48.15%      58.00%
         % of column total   50.81%      59.73%
         % of row total      16.98%      83.02%     100.00%
Total    Count                 3314       13782       17096
         % of grand total    19.38%      80.62%     100.00%
         % of column total  100.00%     100.00%
Marginal Distribution for Frequent Binge drinker
Freq. binge drinker       Yes        No     Total
Frequency                3314     13782     17096
Percent                19.38%    80.62%   100.00%
Marginal Distribution for Gender
Gender                   Male    Female     Total
Frequency                7180      9916     17096
Percent                42.00%    58.00%   100.00%
Conditional distribution of Frequent Binge drinkers given Gender is Male.
Gender = male
Freq. binge drinkers      Yes        No     Total
Frequency                1630      5550      7180
Percent                22.70%    77.30%   100.00%
Conditional distribution of Frequent Binge drinkers given Gender is Female.
Gender = female
Freq. binge drinkers      Yes        No     Total
Frequency                1684      8232      9916
Percent                16.98%    83.02%   100.00%
Conditional distribution of Gender given frequent Binge drinker is Yes.
Freq. binge d. = yes
Gender                   Male    Female     Total
Frequency                1630      1684      3314
Percent                49.19%    50.81%   100.00%
Conditional distribution of Gender given frequent Binge drinker is No.
Freq. binge d. = no
Gender                   Male    Female     Total
Frequency                5550      8232     13782
Percent                40.27%    59.73%   100.00%
Joint distribution of frequent Binge drinkers and Gender
                    Freq. binge drinker
Gender              Yes               No                Total
Men                 1630 (9.53%)      5550 (32.46%)     7180 (42.00%)
Female              1684 (9.85%)      8232 (48.15%)     9916 (58.00%)
Total               3314 (19.38%)     13782 (80.62%)    17096 (100.00%)
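The joint, marginal, and conditional percentages tabulated above can all be reproduced from the four cell counts. The following is a minimal sketch in plain Python; the helper `pct` and the variable names are made up for this illustration.

```python
# Observed counts: rows = gender (Men, Female), columns = frequent binge drinker (Yes, No).
counts = [[1630, 5550],
          [1684, 8232]]

n = sum(sum(row) for row in counts)
row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]

def pct(x, total):
    """Express x as a percent of total, rounded to two decimals."""
    return round(100 * x / total, 2)

# Joint distribution: each cell as a percent of the grand total.
joint = [[pct(x, n) for x in row] for row in counts]

# Marginal distributions: row and column totals as percents of the grand total.
marginal_gender = [pct(t, n) for t in row_totals]
marginal_binge = [pct(t, n) for t in col_totals]

# Conditional distribution of binge drinking given gender: percents of row totals.
binge_given_gender = [[pct(x, rt) for x in row] for row, rt in zip(counts, row_totals)]

# Conditional distribution of gender given binge status: percents of column totals.
gender_given_binge = [[pct(x, ct) for x, ct in zip(row, col_totals)] for row in counts]

print(joint)
print(marginal_gender, marginal_binge)
print(binge_given_gender)
print(gender_given_binge)
```

Each row of `binge_given_gender` sums to 100%, and each column of `gender_given_binge` sums to 100%, which is a quick way to tell the two kinds of conditional distribution apart.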
Simpson’s paradox – an association or comparison that holds for all of several groups
can reverse direction when the data are combined to form a single group.
- reversal of direction by aggregation of data
Example of three-way table – presenting information on three variables, one two-way
table for each level (value) of the third variable.
Good condition
              Hosp. A    Hosp. B
Died                6          8
Survived          594        592
Total             600        600

Poor condition
              Hosp. A    Hosp. B
Died               57          8
Survived         1443        192
Total            1500        200
Condition variable – good and poor
Hospital variable – A and B
Survival variable – Died and survived
Aggregation of data – adding up across one variable (elimination of one variable)
Eg. eliminating condition (ignoring condition)
              Died    Survived    Total
Hosp. A         63        2037     2100
Hosp. B         16         784      800
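The reversal in the hospital example is easiest to see by computing death rates before and after aggregation. The sketch below does this in plain Python, using the counts from the tables above; the data layout and function name are illustrative choices.

```python
# (died, total) for each hospital, split by patient condition.
data = {
    "Good": {"A": (6, 600), "B": (8, 600)},
    "Poor": {"A": (57, 1500), "B": (8, 200)},
}

def death_rate(died, total):
    return died / total

# Death rates within each condition.
for condition, hospitals in data.items():
    rates = {h: death_rate(*v) for h, v in hospitals.items()}
    print(condition, {h: f"{r:.2%}" for h, r in rates.items()})

# Aggregating over condition adds the counts for each hospital.
combined = {
    h: (sum(data[c][h][0] for c in data), sum(data[c][h][1] for c in data))
    for h in ("A", "B")
}
combined_rates = {h: death_rate(*v) for h, v in combined.items()}
print("Combined", {h: f"{r:.2%}" for h, r in combined_rates.items()})
```

Hospital A has the lower death rate within each condition but the higher rate once condition is ignored, because it treats far more patients in poor condition (1500 of its 2100, against 200 of 800 for Hospital B).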
9.2 Inferences for Two-Way Tables
The hypothesis: no association
H0 : There is no association between the row variable and the column variable
Expected cell counts – calculated under the assumption that H0 is true.
The chi-square test
The chi-square statistic is a measure of how much the observed cell counts in a two-way
table diverge from the expected cell counts. The recipe for the statistic is
$$X^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$$
where “observed” represents an observed sample count, “expected” represents the
expected count for the same cell, and the sum is over all r × c cells in the table.
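To make the recipe concrete, the sketch below applies it to the binge-drinking table from 9.1, using the expected counts as printed there (rounded to two decimals). The list layout is just one convenient way to pair up the cells.

```python
# Observed counts and the matching expected counts (rounded as printed in the notes),
# listed cell by cell: Men/Yes, Men/No, Female/Yes, Female/No.
observed = [1630, 5550, 1684, 8232]
expected = [1391.82, 5788.18, 1922.18, 7993.82]

# Each cell contributes (observed - expected)^2 / expected; X^2 is the sum over all cells.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
x2 = sum(contributions)

print([round(c, 2) for c in contributions])
print(round(x2, 2))
```

The statistic comes out at about 87.17. The largest contribution comes from the Men/Yes cell, where the expected count is smallest.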
Chi-square distribution (χ²) – approximate distribution of the X² statistic above when H0 is true
- density curve (area under the curve is 1)
- a family of infinitely many curves, each identified by its degrees of freedom (similar to
the t distributions)
- right skewed
- begins at 0 and extends infinitely to the right
Chi-square Test for Two-way Tables
The null hypothesis H0 is that there is no association between the row and column
variables in a two-way table. The alternative is that these variables are related.
If H0 is true, the chi-square statistic X² has approximately a χ² distribution with
(r – 1)(c – 1) degrees of freedom.
The P-value for the chi-square test is
P(χ² ≥ X²)
where χ² is a random variable having the χ²(df) distribution with df = (r – 1)(c – 1).
The test is always one-tailed (right-tailed): any departure from H0, in either direction,
makes X² larger, so only large values of X² count as evidence against the hypothesis of
no association between the two variables.
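For a 2 × 2 table (df = 1) the right-tail P-value can be computed without tables, because a χ²(1) variable is the square of a N(0, 1) variable. The sketch below relies on that fact and on Python's `math.erfc`; it covers only df = 1, and the function name is made up for this example. Other degrees of freedom need a table or statistical software.

```python
import math

def chi2_pvalue_df1(x2):
    """Right-tail probability P(chi-square >= x2) for 1 degree of freedom.

    A chi-square(1) variable is the square of a N(0, 1) variable, so
    P(chi2 >= x) = P(|Z| >= sqrt(x)) = erfc(sqrt(x / 2)).
    """
    if x2 < 0:
        raise ValueError("the chi-square statistic cannot be negative")
    return math.erfc(math.sqrt(x2 / 2))

# 3.841 is the usual 0.05 critical value for 1 degree of freedom.
print(chi2_pvalue_df1(3.841))

# X^2 of about 87.17 is what the binge-drinking counts and expected counts in 9.1 give.
print(chi2_pvalue_df1(87.17))
```

The first call returns a value very close to 0.05. The second is far smaller than any conventional significance level, so the binge-drinking data give strong evidence of an association between gender and frequent binge drinking.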
The chi-square test and the z test
In a 2 × 2 table, the comparison of the proportions of successes in two populations can be
done by either the chi-square test or the two-sided two-sample z test for proportions. The
two tests give the same P-value.
The square of the z statistic equals the chi-square statistic X².
Eg. the χ²(1) critical values equal the squares of the N(0, 1) critical values (1.96² ≈ 3.84).
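The agreement between the two tests can be checked numerically on the binge-drinking table. The sketch below assumes the z statistic uses the pooled sample proportion in its standard error, which is the version that matches the chi-square test; variable names are illustrative.

```python
import math

# Binge-drinking table: successes ("Yes") and sample sizes for men and women.
x1, n1 = 1630, 7180
x2, n2 = 1684, 9916

p1, p2 = x1 / n1, x2 / n2
p_pooled = (x1 + x2) / (n1 + n2)

# Two-sample z statistic with the pooled standard error.
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se

# Chi-square statistic from observed and expected counts of the same 2 x 2 table.
observed = [[x1, n1 - x1], [x2, n2 - x2]]
n = n1 + n2
col_totals = [x1 + x2, n - (x1 + x2)]
chi_sq = sum(
    (obs - row_total * col_total / n) ** 2 / (row_total * col_total / n)
    for row, row_total in zip(observed, (n1, n2))
    for obs, col_total in zip(row, col_totals)
)

print(round(z, 3), round(z ** 2, 3), round(chi_sq, 3))
```

The last two printed values agree, which is the identity z² = X² for a 2 × 2 table.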
Beyond the basics – meta-analysis
Meta-analysis – collection of statistical techniques designed to combine information
from different but similar studies.
Relative risk – ratio of two proportions where the second is usually a reference/control.
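As an illustration of relative risk, the sketch below uses the binge-drinking counts from 9.1. Treating women as the reference group is an assumption made here for the example; the notes do not specify one.

```python
# Proportion of frequent binge drinkers in each group (counts from the table in 9.1).
p_men = 1630 / 7180
p_women = 1684 / 9916

# Relative risk: ratio of the two proportions, with women as the reference group
# (the choice of reference group is an assumption made for illustration).
relative_risk = p_men / p_women

print(round(relative_risk, 2))
```

A relative risk above 1 means the proportion is higher in the first group than in the reference group.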
9.3 Formulas and Models for Two-Way Tables*
Computations for Two-Way Tables
1. Calculate descriptive statistics that convey the important information in the table.
Usually these will be column or row percents.
2. Find the expected counts and use these to compute the X² statistic.
3. Use chi-square critical values from Table F to find the approximate P-value.
4. Draw a conclusion about the association between the row and column variables.
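Steps 1 and 2 above, plus the degrees of freedom needed for step 3, can be sketched for a general r × c table as follows. The function name and return layout are hypothetical, and row percents are chosen arbitrarily as the descriptive statistic. The example call uses the hospital counts aggregated over condition from 9.1.

```python
def chi_square_summary(table):
    """Return row percents, expected counts, X^2, and degrees of freedom
    for an r x c table of counts given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Step 1: descriptive statistics (here, row percents).
    row_percents = [[100 * x / rt for x in row] for row, rt in zip(table, row_totals)]

    # Step 2: expected counts and the X^2 statistic.
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
    x2 = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )

    # Degrees of freedom needed to look up the P-value (step 3).
    df = (len(table) - 1) * (len(col_totals) - 1)
    return row_percents, expected, x2, df


# Hospital data aggregated over patient condition: rows = Hosp. A, Hosp. B;
# columns = Died, Survived.
row_percents, expected, x2, df = chi_square_summary([[63, 2037], [16, 784]])
print(df, round(x2, 2))
```

Steps 3 and 4 are left to the reader: compare the printed X² with the Table F critical values for the printed degrees of freedom, then state the conclusion in terms of the row and column variables.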
Computing expected cell counts
Computing the chi-square statistic
Models for two-way tables
Each unit/subject must only be counted once.
Will compare two population proportions.
Model for Comparing Several Populations using Two-Way Tables
Select independent SRSs from each of c populations, sizes n1, n2, … , nc. Classify each
individual in a sample according to a categorical response variable with r possible values.
There are c different probability distributions, one for each population.
The null hypothesis is that the distributions of the response variable are the same in all c
populations. The alternative hypothesis says that these c distributions are not all the same.
Joint distribution – probability distribution of all the r × c possible outcomes in an r × c
two-way table.
Marginal distributions – overall distribution for each of the two categorical variables
(either summing over rows to give the marginal distribution for the column variable or
summing over columns to give the marginal distribution for the row variable).
Model for Examining Independence in Two-Way Tables
Select an SRS of size n from a population. Measure two categorical variables for each
individual.
The null hypothesis is that the row and column variables are independent.
The alternative hypothesis is that the row and column variables are dependent.
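Under this independence model the expected-count formula from 9.1 follows in one step. The derivation below is a sketch; the symbols $p_{ij}$, $p_{i\cdot}$, and $p_{\cdot j}$ for the joint and marginal probabilities are notation introduced here, not taken from the notes.

```latex
\begin{align*}
H_0:\quad & p_{ij} = p_{i\cdot}\, p_{\cdot j} \quad \text{for every cell } (i, j) \\
\text{estimates:}\quad & \hat{p}_{i\cdot} = \frac{\text{row total}_i}{n}, \qquad
  \hat{p}_{\cdot j} = \frac{\text{column total}_j}{n} \\
\text{expected count}_{ij} &= n\, \hat{p}_{i\cdot}\, \hat{p}_{\cdot j}
  = \frac{\text{row total}_i \times \text{column total}_j}{n}
\end{align*}
```

Independence means each joint probability is the product of its two marginal probabilities. Replacing the marginals with their sample estimates and multiplying by n gives the expected cell count.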