1129 Chi-Square Distributions

Chi-Square
Distributions
Recap
• Analyze data and test hypothesis
• Type of test depends on:
• Data available
• Question we need to answer
• What do we use to examine patterns between categorical
variables?
• Gender
• Location
• Preferences
t-distribution
[Figure: t-distribution density plot with curves for df = 4 and df = 100; x-axis X from -5.0 to 5.0, y-axis Density from 0.0 to 0.4]
F-distribution
[Figure: F-distribution density plot for df1 = 6, df2 = 10; x-axis X from 0 to 6, y-axis Density from 0.0 to 0.7]
Chi-square distribution
[Figure: chi-square density plot with curves for df = 2, df = 4, and df = 10; x-axis X from 0 to 30, y-axis Density from 0.0 to 0.5]
Cumulative Probability
χ² distribution
• Goodness of fit
• Test for homogeneity
• Test for independence
Goodness of Fit
• Testing one categorical variable from a single population
• Example:
• A manufacturer of baseball cards claims
• 30% of all cards feature rookies
• 60% feature veterans
• 10% feature all-stars
Reference: http://stattrek.com/Lesson3/ChiSquare.aspx
χ² Assumptions
• Data is collected from a simple random sample (SRS)
• Population is at least 10 times larger than sample
• Variable is categorical
• Expected value for each level of the variable is at least 5
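The two numeric assumptions can be checked before running a test. The sketch below is only an illustration: the function names are hypothetical, the sample size of 100 and the population size of 5000 are assumed values, and the proportions are the manufacturer's claimed 30% / 60% / 10%.

```python
def expected_counts_ok(sample_size, proportions, minimum=5):
    """Return (expected_counts, ok); ok is True when every expected count >= minimum."""
    expected = [sample_size * p for p in proportions]
    return expected, all(e >= minimum for e in expected)


def population_large_enough(sample_size, population_size, ratio=10):
    """True when the population is at least `ratio` times larger than the sample."""
    return population_size >= ratio * sample_size


# Claimed proportions from the baseball card example.
# Sample size (100) and population size (5000) are assumed for illustration.
expected, counts_ok = expected_counts_ok(100, [0.3, 0.6, 0.1])
size_ok = population_large_enough(100, 5000)

print("Expected counts:", [round(e, 1) for e in expected])
print("All expected counts >= 5:", counts_ok)
print("Population >= 10 x sample:", size_ok)
```

The other two assumptions (simple random sampling, categorical variable) concern study design and cannot be verified from the counts alone.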
Steps in the Process
• State the hypothesis
• Form an analysis plan
• Analyze sample data
• Interpret results
χ² Goodness of Fit
• State the hypothesis
• Null: The data are consistent with a specified distribution
• Alternative: The data are not consistent with a specified
distribution
• At least one of the expected values is not accurate
• Baseball card example
• 𝐻0 : 𝑃𝑅 = 0.3, 𝑃𝑉 = 0.6, 𝑃𝐴𝑆 = 0.1
• π»π‘Ž : At least one of the probabilities in inaccurate
χ² Goodness of Fit
• Analysis Plan
• Specify the significance level
• Determine the test method
• Goodness of fit
• Independence
• Homogeneity
χ² Goodness of Fit
• Analyze the sample data
• Find the degrees of freedom
• d.f. = k − 1, where k = the number of levels of the categorical variable
• Determine the expected frequency counts
• Expected frequency (E) = sample size x hypothesized proportion
• 𝐸𝑖 = 𝑛 x 𝑝𝑖
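• Determine the test statistic
• $\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$, where $O_i$ is the observed count and $E_i$ the expected count for level $i$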
• Interpret the results
Goodness of fit example
Problem
• Acme Toy Company prints baseball cards. The company claims
that 30% of the cards are rookies, 60% veterans, and 10% are
All-Stars. The cards are sold in packages of 100.
• Suppose a randomly-selected package of cards has 50 rookies,
45 veterans, and 5 All-Stars. Is this consistent with Acme's
claim? Use a 0.05 level of significance.
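The analysis steps for this problem can be sketched in a few lines of Python. This is a minimal illustration using only the counts and proportions stated in the problem; the variable names are my own.

```python
# Observed counts in the sampled package and Acme's claimed proportions.
observed = {"rookies": 50, "veterans": 45, "all_stars": 5}
claimed = {"rookies": 0.30, "veterans": 0.60, "all_stars": 0.10}

n = sum(observed.values())   # sample size
df = len(observed) - 1       # d.f. = k - 1

# Expected frequency for each level: E_i = n x p_i
expected = {level: n * p for level, p in claimed.items()}

# Test statistic: sum over levels of (O_i - E_i)^2 / E_i
chi_square = sum(
    (observed[level] - expected[level]) ** 2 / expected[level]
    for level in observed
)

print("n =", n, " d.f. =", df)
print("Expected counts:", {k: round(v, 1) for k, v in expected.items()})
print("Chi-square statistic:", round(chi_square, 3))
```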
Using Excel to find χ²
• Determine 𝐸𝑖
• Create 2 columns: n and p, and enter appropriate values
• In the 3rd column: 𝐸𝑖 = 𝑛 x 𝑝𝑖
n      p      E_i
100    0.6    60
100    0.3    30
100    0.1    10
Another G of F problem
• Poisson Distribution
• Automobiles leaving the paint department of an assembly
plant are subjected to a detailed examination of all exterior
painted surfaces.
• For the most recent 380 automobiles produced, the number
of blemishes per car is summarized below.
Blemishes   # of cars
0           242
1           94
2           38
3           4
4           2
• Level of significance: α = 0.05
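One possible way to work this problem is sketched below. It rests on assumptions that the slide does not state: the Poisson mean is estimated from the sample, the sparse classes "3" and "4" are pooled into "3 or more" so that every expected count is at least 5, and one extra degree of freedom is deducted for the estimated mean (so d.f. = k − 1 − 1 rather than k − 1). Treat it as an illustration, not as the intended solution.

```python
import math

blemishes = [0, 1, 2, 3, 4]
cars = [242, 94, 38, 4, 2]

n = sum(cars)

# Assumption: the Poisson mean is estimated by the sample mean.
lam = sum(b * c for b, c in zip(blemishes, cars)) / n


def poisson_pmf(k, mean):
    """Poisson probability of exactly k events."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


# Assumption: pool "3" and "4" into "3 or more" (the upper tail of the Poisson).
probs = [poisson_pmf(k, lam) for k in range(3)]
probs.append(1.0 - sum(probs))
observed = cars[:3] + [sum(cars[3:])]
expected = [n * p for p in probs]

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Assumption: one degree of freedom is lost for the estimated mean.
df = len(observed) - 1 - 1

# For d.f. = 2 only, the upper-tail probability is exactly exp(-x / 2).
p_value = math.exp(-chi_square / 2)

print("Estimated mean:", lam)
print("Expected counts:", [round(e, 2) for e in expected])
print("Chi-square:", round(chi_square, 3), " d.f.:", df)
print("p-value:", round(p_value, 4), " reject H0 at 0.05:", p_value < 0.05)
```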
Using Excel to find χ²
• Determine χ²
• Add a 4th column to the spreadsheet: 𝑂𝑖
• In the 5th column, calculate each element of the χ² statistic
• =(D2-C2)^2/C2
• Sum the values of the 5th column
n      p      E_i    O_i    (O_i − E_i)²/E_i
100    0.6    60     45     3.75
100    0.3    30     50     13.333
100    0.1    10     5      2.50
                     Sum    19.583
• This is the χ² value, the test statistic
• Use the χ² calculator to find the value of p, and interpret the
test results.
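The p-value lookup can also be done without a calculator. The sketch below uses a closed-form expression for the upper tail of the chi-square distribution that holds only for even degrees of freedom, which covers this example (d.f. = 2). The function name is hypothetical, and the statistic 19.583 and significance level 0.05 are taken from the slides.

```python
import math


def chi_square_upper_tail(x, df):
    """P(X > x) for a chi-square variable with a positive, even df.

    Uses exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!, valid for even df only.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles positive even degrees of freedom")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term = 1.0    # current (x/2)^j / j!, starting at j = 0
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


chi_square = 19.583   # test statistic from the spreadsheet
df = 2                # k - 1 with k = 3 levels
alpha = 0.05          # significance level from the problem statement

p_value = chi_square_upper_tail(chi_square, df)
reject_h0 = p_value < alpha

print("p-value:", p_value)
print("Reject H0 at alpha = 0.05:", reject_h0)
```

Because the p-value is far below 0.05, the sample is not consistent with Acme's claimed proportions.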
Test for homogeneity
• Single categorical variable from 2 populations
• Test if frequency counts are distributed identically across both
populations
• Example: Survey of TV viewing audiences. Do viewing
preferences of men and women differ significantly?
• We make the same assumptions we did for the goodness of fit
test
• Data is collected from a simple random sample (SRS)
• Population is at least 10 times larger than sample
• Variable is categorical
• Expected value for each level of the variable is at least 5
• We use the same approach to testing
State the hypothesis
• Data collected from r populations
• Categorical variable has c levels
• Null hypothesis is that each population has the same
proportion of observations, i.e.:
H0: $P_{\text{level 1, pop 1}} = P_{\text{level 1, pop 2}} = \dots = P_{\text{level 1, pop r}}$
H0: $P_{\text{level 2, pop 1}} = P_{\text{level 2, pop 2}} = \dots = P_{\text{level 2, pop r}}$
…
H0: $P_{\text{level c, pop 1}} = P_{\text{level c, pop 2}} = \dots = P_{\text{level c, pop r}}$
• Alternative hypothesis: at least one of the null statements is
false
Analyze the sample data
• Find
• Degrees of freedom
• Expected frequency counts
• Test statistic (χ²)
• p-value or critical value
Analyze the sample data
• Degrees of freedom
• d.f. = (r − 1) × (c − 1)
• Where
• r = number of populations
• c = number of levels of the categorical variable
Analyze the sample data
• Expected frequency counts
• Computed separately for each population at each level of the
categorical variable
• $E_{r,c} = \frac{n_r \times n_c}{n}$
• Where:
• $E_{r,c}$ = expected frequency count for population r at level c of the categorical variable
• $n_r$ = number of observations from population r
• $n_c$ = number of observations at category/treatment level c
• $n$ = total sample size
Analyze the sample data
• Determine the test statistic
$\chi^2 = \sum_{r,c} \frac{(O_{r,c} - E_{r,c})^2}{E_{r,c}}$
• Determine the p-value or critical value
Test for homogeneity
Problem
• In a study of the television viewing habits of children, a
developmental psychologist selects a random sample of 300
fifth graders - 100 boys and 200 girls. Each child is asked which
of the following TV programs they like best.
         Family Guy   South Park   The Simpsons   Total
Boys     50           30           20             100
Girls    50           80           70             200
Total    100          110          90             300
State the hypotheses
• Null hypothesis: The proportion of boys who prefer Family Guy is
identical to the proportion of girls. Similarly, for the other programs.
Thus:
H0: Pboys who like Family Guy = Pgirls who like Family Guy
H0: Pboys who like South Park = Pgirls who like South Park
H0: Pboys who like The Simpsons = Pgirls who like The Simpsons
• Alternative hypothesis: At least one of the null hypothesis
statements is false.
Analysis plan
• Compute
• Degrees of freedom
• Expected frequency counts
• Chi-square test statistic
• Degrees of freedom
d.f. = (r − 1) × (c − 1)
• Where:
• r = number of populations
• c = number of categories/treatment levels
• In this case
d.f. = (2 − 1) × (3 − 1) = 2
Analysis plan
Compute the expected frequency counts
Er,c = (nr * nc) / n
E1,1 = (100 * 100) / 300 = 10000/300 = 33.3
E1,2 = (100 * 110) / 300 = 11000/300 = 36.7
E1,3 = (100 * 90) / 300 = 9000/300 = 30.0
E2,1 = (200 * 100) / 300 = 20000/300 = 66.7
E2,2 = (200 * 110) / 300 = 22000/300 = 73.3
E2,3 = (200 * 90) / 300 = 18000/300 = 60.0
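The same expected counts can be derived directly from the row and column totals of the observed table. The following is a minimal sketch; the variable names are my own.

```python
# Observed counts: rows are Boys, Girls; columns are
# Family Guy, South Park, The Simpsons.
observed = [
    [50, 30, 20],
    [50, 80, 70],
]

row_totals = [sum(row) for row in observed]          # n_r
col_totals = [sum(col) for col in zip(*observed)]    # n_c
n = sum(row_totals)                                  # total sample size

# E_{r,c} = (n_r * n_c) / n
expected = [[nr * nc / n for nc in col_totals] for nr in row_totals]

for label, row in zip(("Boys", "Girls"), expected):
    print(label, [round(e, 1) for e in row])

# The test assumes every expected count is at least 5.
print("All expected counts >= 5:", all(e >= 5 for row in expected for e in row))
```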
         Family Guy   South Park   The Simpsons   Total
Boys     50           30           20             100
Girls    50           80           70             200
Total    100          110          90             300
Analysis plan
• Determine the test statistic
$\chi^2 = \sum_{r,c} \frac{(O_{r,c} - E_{r,c})^2}{E_{r,c}}$
Observed counts, with expected counts in parentheses:
         Family Guy   South Park   The Simpsons   Total
Boys     50 (33.3)    30 (36.7)    20 (30.0)      100
Girls    50 (66.7)    80 (73.3)    70 (60.0)      200
Total    100          110          90             300
Analysis plan
• p-value
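• With χ² ≈ 19.32 (19.39 when the rounded expected counts are used) and d.f. = 2, use the Chi-Square Distribution Calculator to find the cumulative probability P(χ² ≤ 19.32) ≈ 0.9999
• The p-value is therefore P(χ² > 19.32) ≈ 0.0001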
• Interpret the results
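The test statistic and p-value for the TV example can be reproduced as sketched below. The p-value line relies on the fact that, for d.f. = 2 only, the chi-square upper tail equals exp(−x/2); for other degrees of freedom a statistics library or calculator would be needed. Variable names are my own.

```python
import math

labels = ("Boys", "Girls")
observed = [
    [50, 30, 20],   # Boys: Family Guy, South Park, The Simpsons
    [50, 80, 70],   # Girls
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_square = 0.0
for r, row in enumerate(observed):
    for c, o in enumerate(row):
        e = row_totals[r] * col_totals[c] / n      # expected count E_{r,c}
        contribution = (o - e) ** 2 / e            # this cell's share of chi-square
        chi_square += contribution
        print(f"{labels[r]:5s} column {c + 1}: O={o:3d}  E={e:5.1f}  term={contribution:.3f}")

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Exact upper-tail probability, valid only because df == 2 here.
p_value = math.exp(-chi_square / 2)

print("Chi-square:", round(chi_square, 2), " d.f.:", df)
print("p-value:", p_value)
```

The p-value is far below common significance levels such as 0.05, so the null hypothesis that boys and girls share the same viewing preferences is rejected.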
Test for independence
• Almost identical to test for homogeneity
• Test for homogeneity: Single categorical variable from 2
populations
• Test for independence: 2 categorical variables from a single
population
• Determine if there is a significant association between the 2
variables
• Example
• Voters are classified by gender and by party affiliation (D,R,I).
• Use the χ² test to determine if gender is related to voting preference
(are the variables independent?)
Test for independence
• Same assumptions
• Same approach to testing
• Hypotheses
• Suppose variable A has r levels and variable B has c levels. The
null hypothesis states that knowing the level of A does not help
you predict the level of B. The variables are independent.
• H0: Variables A and B are independent
• Ha: Variables A and B are not independent
• Knowing A will help you predict B
• Note: Relationship does not have to be causal to show
dependence
Test for independence
Problem
• A public opinion poll surveyed a simple random sample of
1000 voters. Respondents were classified by gender (male or
female) and by voting preference (Republican, Democrat, or
Independent). Do men's preferences differ significantly from
women's?
         Republican   Democrat   Independent   Total
Men      200          150        50            400
Women    250          300        50            600
Total    450          450        100           1000
Test for independence
• Hypotheses
• H0: Gender and voting preferences are independent.
• Ha: Gender and voting preferences are not independent.
• Analyze sample data
• Degrees of freedom
• Expected frequency counts
• Chi-square statistic
• p-value or critical value
• Interpret the results
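The analysis steps for the voter example can be wrapped in a small function, as sketched below. The function name and the returned dictionary are hypothetical, the 0.05 significance level is an assumption (the problem does not state one), and the p-value uses the exp(−x/2) shortcut that is exact only for d.f. = 2.

```python
import math


def independence_test(table, alpha=0.05):
    """Chi-square test for independence on a 2 x 3 (or 3 x 2) table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Expected counts under independence: E_{r,c} = n_r * n_c / n
    expected = [[nr * nc / n for nc in col_totals] for nr in row_totals]

    chi_square = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )

    df = (len(table) - 1) * (len(table[0]) - 1)
    if df != 2:
        raise ValueError("this sketch computes the p-value only for d.f. = 2")
    p_value = math.exp(-chi_square / 2)

    return {
        "df": df,
        "expected": expected,
        "chi_square": chi_square,
        "p_value": p_value,
        "reject_h0": p_value < alpha,
    }


# Rows: Men, Women; columns: Republican, Democrat, Independent.
voters = [
    [200, 150, 50],
    [250, 300, 50],
]

result = independence_test(voters)
print("Expected counts:", result["expected"])
print("Chi-square:", round(result["chi_square"], 2), " d.f.:", result["df"])
print("p-value:", round(result["p_value"], 4))
print("Reject independence at alpha = 0.05:", result["reject_h0"])
```

A rejection here indicates an association between gender and voting preference; as noted earlier, it does not show that the relationship is causal.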