990408: Analysis of Variance

Analysis of Variance
Lecture 19, Dec 4, 2006
A. Introduction
1. When we observe an association in sample data between a dichotomous variable and an
interval-level variable, we use a difference-between-means test to decide whether those
variables are associated in the population from which our sample was drawn or whether the
observed sample association is due to some random process
2. What do we do when we have sample data and want to know whether we can infer from
them that a multi-category independent variable is associated with an interval-level variable
in the population the sample came from?
3. For example, you want to know if some of the faculty who teach required courses are
tougher graders than others. You find data showing the 203 grades given by 9 faculty members
in required courses.
a. how can we tell if some of these faculty (a multi-category nominal variable)
tend to be tougher grading than others?
b. we could use a difference in means test to compare the grades of each possible pair
of professors, testing H0 that µ1 = µ2.
c. with 9 categories, this would require 36 difference-in-means tests
d. it would be difficult to draw any general conclusions from 36 tests
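The pair count in (c) is just the number of ways to choose 2 professors out of 9; a one-line sketch in Python confirms it:

```python
from math import comb

# Number of unordered pairs among 9 professors: C(9, 2)
pairs = comb(9, 2)
print(pairs)  # 36
```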
B. One-way ANOVA is a more straightforward way to assess an association between a multi-category
variable and an interval-level variable, and it allows us to directly test the H0 that µ1 = µ2
= µ3 = … = µg (where g = number of categories in the categorical variable)
1. Illustration with just three faculty—Professors H, M, and L.
a. What if H’s grades = 4.0, 3.9, 3.8, 3.8, 3.7, 3.7, 3.6, 3.6, 3.5, 3.5, 3.5, 3.4,
3.4, 3.3, 3.3, 3.2, 3.2, 3.1, and 3.0, and M and L had the same distributions?
(1) Given these data, could we reject H0 that µh = µm = µl? <no> Why not?
(a) their means would be identical
(b) their dispersions would be identical
b. What if H gave every student a grade between 3.7 and 4.0 with a mean of 3.85; M
gave every student a grade between 3.3 and 3.6 with a mean of 3.45, and L gave every
student a grade between 2.9 and 3.2 with a mean of 3.15?
(1) Based just on these data, would you conclude that these professors differ in
their propensities to give high or low grades? In other words, do you suspect
that we could reject the H0 that µh = µm = µl? <yes> Why?
(2) their means differ, which is inconsistent with H0 that µh = µm = µl
(3) their distributions do not overlap, suggesting that they differ in how tough their
grading is, so we would suspect that µh ≠ µm ≠ µl
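To make the first illustration concrete, here is a quick check in Python that Professor H's listed grades average 3.5; since M and L share the same distribution, all three group means coincide and there is nothing for a between-groups test to detect:

```python
# Professor H's 19 grades from the illustration; M and L have identical distributions
grades = [4.0, 3.9, 3.8, 3.8, 3.7, 3.7, 3.6, 3.6, 3.5, 3.5,
          3.5, 3.4, 3.4, 3.3, 3.3, 3.2, 3.2, 3.1, 3.0]

mean_grade = sum(grades) / len(grades)
print(round(mean_grade, 2))  # 3.5: identical distributions give identical means
```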
2. These two extremes illustrate two sources of variation in Y (the interval variable)
a. variation of the category (professors') means (Ȳh, Ȳm, Ȳl) around the grand mean Ȳ
b. variation of each Yi around its category-specific mean (Ȳh, Ȳm, Ȳl)
3. How can we measure these different sources of grade dispersion?
a. We can measure the dispersion of each professor's mean grade around the grand mean
(Ȳ) with a version of the formula for the variance: Σ(Ȳp − Ȳ)² / (g − 1)
b. We can measure the dispersion of the grades for each professor around that professor's
mean grade (Ȳp) and then pool these dispersions for all the professors with a variant of the
formula for the variance: Σg [Σ(Yp − Ȳp)² / (np − 1)]
4. The sum of these two sources of variation in Y = the total variance in Y:
Σ(Ȳp − Ȳ)² / (g − 1) + Σg [Σ(Yp − Ȳp)² / (np − 1)] = Σ(Yi − Ȳ)² / (n − 1)
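The decomposition is exact for the sums of squares themselves (before each is divided by its df): TSS = BSS + WSS, where BSS weights each group mean's squared deviation by that group's size. A minimal sketch in Python with made-up grades (the data are hypothetical, not from the lecture):

```python
# Hypothetical grades grouped by professor
groups = {
    "H": [3.9, 3.8, 3.7],
    "M": [3.5, 3.4, 3.3],
    "L": [3.1, 3.0, 2.9],
}

all_y = [y for ys in groups.values() for y in ys]
grand_mean = sum(all_y) / len(all_y)

# Between SS: each group mean's deviation from the grand mean, weighted by group size
bss = sum(len(ys) * (sum(ys) / len(ys) - grand_mean) ** 2 for ys in groups.values())

# Within SS: each grade's deviation from its own group mean
wss = sum((y - sum(ys) / len(ys)) ** 2 for ys in groups.values() for y in ys)

# Total SS: each grade's deviation from the grand mean
tss = sum((y - grand_mean) ** 2 for y in all_y)

print(abs(tss - (bss + wss)) < 1e-9)  # True: TSS = BSS + WSS
```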
5. We use these two estimates of variation in Y, the between variance and the within
variance, in statistical inference regarding the association between X and Y
a. the between variance refers to the variation in Y between the means of Y for each
category of X around Ȳ. Because knowing X accounts for this portion of the variation
in Y, it is called the explained variance
b. the within variance is how much Y varies within each category of X around that
category's mean of Y, pooled over the categories of X. Here we are pooling
estimates of variation for each value of X, and since knowing X does not explain this
variation in Y we also call it the unexplained variance.
6. These estimates of the variation in Y are actually sums of squares because they don’t
take into account the number of cases (actually, degrees of freedom) on which each is based.
Now let’s turn them into estimates of the variance in Y.
a. the between sum of squares estimate is based on the number of categories of X
so it has g – 1 df.
b. the within sum of squares estimate is based on the size of the sample minus the
number of df we have already used; so it has n – g df.
c. notice that the between and within df = the df for the sample variance:
n - 1 = (n – g) + (g – 1)
7. Formulae for variance estimates when g = 3 (H, M, L)
a. total variance = Σ(Yi − Ȳ)² / (n − 1)
b. between variance = Σ(Ȳg − Ȳ)² / (g − 1)
c. pooled estimate of within variance = Σg [Σ(Yi − Ȳg)²] / (n − g)
8. The quotient from dividing a sum of squares by its df is called the mean
squared deviation (or mean square)
9.
Sum of squared devs in Y        df        Mean squared deviation
(1) TSS = Σ(Yi − Ȳ)²           n − 1     Σ(Yi − Ȳ)² / (n − 1)
(2) BSS = Σ(Ȳg − Ȳ)²           g − 1     Σ(Ȳg − Ȳ)² / (g − 1)
(3) WSS = Σ(Yi − Ȳg)²          n − g     Σ(Yi − Ȳg)² / (n − g)
C. The logic of one-way ANOVA
1. ANOVA compares two independent estimates of the variation in Y to assess
whether their ratio is more consistent with a H0 that X and Y are not
related in the population or Ha that they are related in the population
2. If X and Y are not related, then we would expect σ²y.x1 = σ²y.x2 = σ²y.x3 = … = σ²y.xg,
and the pooled estimate of the within-X variance in Y will approximately equal
the total variance in Y, while the between variance will approach 0
3. In contrast, according to Ha, enough of the variation in Y will be explained by X
that the between variance will be large compared to the within variance
4. We compare the between and within estimates of the variance in Y to see if the
sample-level association between X and Y is more consistent with H0 or Ha
5. The statistical test for the association between X and Y is a ratio of the estimates of the
between to the within variance in Y
a. if the population means of Y for each category of X were identical, then any variation
in sample values of Y within these categories would result from random variation of Y
around Ȳx
b. WSS estimates the variation in Y that X cannot explain. If X and Y are unrelated,
then the estimate of the unexplained variance in Y (Σ(Yi − Ȳx)²) should not exceed the
estimate of the explained variance, and the ratio of the between to the within mean
square should be close to 1.
c. The greater the explained (between) variance relative to the unexplained (within)
variance, the less likely the association between X and Y in the sample data stemmed
from some random process, and the more likely it reflects a real association between X
and Y in the population from which the sample was drawn.
d. The stronger the association between X and Y in the sample, the larger BSS
will be relative to WSS. And the larger the ratio of BSS to WSS, the less likely
that the association observed in the sample stemmed from some random process
and the more likely it reflects an association between X and Y in the population
D. The F distribution and the F test of statistical significance
1. The F distribution is the sampling distribution of the ratio of two independent
estimates of the same variance
a. this is what we do when we use F to test whether a multiple regression
equation is statistically significant
2. Table D shows the probability of getting any particular ratio, given its number of
categories and the number of cases
3. As Table D shows, F has a distinct sampling distribution for each combination of df1 (the
between estimate) and df2 (the within estimate)
4. Test statistic for F test = the ratio of between mean square to within mean square:
[Σ(Ȳg − Ȳ)² / (g − 1)] / [Σ(Yi − Ȳg)² / (n − g)]
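Putting the pieces together, the F ratio can be computed directly from grouped data. A minimal sketch in Python (the grades are hypothetical):

```python
def f_ratio(groups):
    """F = between mean square / within mean square for a list of samples."""
    g = len(groups)
    n = sum(len(ys) for ys in groups)
    all_y = [y for ys in groups for y in ys]
    grand = sum(all_y) / n
    means = [sum(ys) / len(ys) for ys in groups]
    # Between mean square: BSS / (g - 1), weighting each group mean by group size
    bms = sum(len(ys) * (m - grand) ** 2 for ys, m in zip(groups, means)) / (g - 1)
    # Within mean square: WSS / (n - g)
    wms = sum((y - m) ** 2 for ys, m in zip(groups, means) for y in ys) / (n - g)
    return bms / wms

# Three professors with clearly separated grade distributions give a large F
print(round(f_ratio([[3.8, 3.9, 4.0], [3.4, 3.5, 3.6], [3.0, 3.1, 3.2]]), 1))  # 48.0
```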
5. Illustration of decomposition of variation in Y by revisiting our earlier examples, in which
the professor accounted for none of the variation in grades or all of the variation in grades around
her/his mean grade (Ȳg).
a. when the means and variances for each professor were equal, X doesn't explain
any of the variance in grades and the between mean squared deviation is very small
compared to the within mean squared deviation
b. when the 3 professors differed in their mean grades and their distributions did not
overlap, X explained much of the variation in grading, and the ratio of between mean
squares to within mean squares would exceed 1 and probably be large
E. Example
1. Hypotheses
a. H0: xa = xb = xc so X and Y are not associated
If H0 is true, the BSS  WSS, and the ratio of BSS:WSS will  1
b. Ha: µs not equal; X and Y are associated
2. Alpha level: We’ll take a 5% risk of concluding that two variables are unrelated when
they are in fact related in the population
a. Our text simplifies finding the cutoff point for the region of rejection by presenting 3
tables: one for α = .05, one for α = .01, and one for α = .001
b. For α = .05, we’ll use the table on p. 871. These sampling distributions give the
probability of getting any particular F ratio if H0 is true.
c. We can’t determine the cutoff for the region of rejection until we know the sample
size and the number of categories of X.
3. Data
a. To find the value that marks off the critical region, we must calculate the dfs
associated with the BSS and the WSS.
(1) the dfs associated with the BSS (df1 = g – 1) appears across top of table.
(2) the dfs associated with the WSS (df2 = n - g) is on left side of table.
b. we’ll reject H0 if F > 3.88 [draw rejection region]
c. estimate of between mean square = 30/2
d. estimate of within mean square = 16/12
4. Calculate test statistic—F ratio
F = (30/2)/(16/12) = 15/1.333 = 11.25
5. Decision: 11.25 > 3.88 so we reject H0 that X and Y are not related with a 5% chance of a
type I error
If H0 were true, the means for the three categories should be closer together. Instead, the
between-category mean square was much larger than within-category mean square, and the F
ratio > 1
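The arithmetic of the decision can be verified in a few lines; the sums of squares 30 and 16, their dfs, and the critical value 3.88 are taken from the example above:

```python
between_ms = 30 / 2   # BSS = 30 with g - 1 = 2 df
within_ms = 16 / 12   # WSS = 16 with n - g = 12 df

f = between_ms / within_ms
print(round(f, 2))    # 11.25
print(f > 3.88)       # True: F falls in the rejection region, so reject H0
```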
F. ANOVA can be applied whenever you are comparing means across multiple
independent groups
1. Difference between two means
a. The number of df associated with BSS will be 1, so you will look in the first column
of the F table, in the row for the WSS df, to find the cutoff for the region of rejection.
b. If you look closely at this column, you will see that these are the same
numbers associated with the df in the t-distribution table
c. F(1, df2) = t(df2)²
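The relationship in (c) can be checked numerically: for two groups, the pooled-variance t statistic squared equals the one-way ANOVA F with df1 = 1. A minimal sketch in Python with hypothetical data:

```python
a = [3.8, 3.9, 4.0, 4.1]   # hypothetical grades, group 1
b = [3.0, 3.1, 3.2, 3.3]   # hypothetical grades, group 2

na, nb = len(a), len(b)
ma, mb = sum(a) / na, sum(b) / nb
ssa = sum((y - ma) ** 2 for y in a)
ssb = sum((y - mb) ** 2 for y in b)

# Pooled-variance two-sample t statistic
sp2 = (ssa + ssb) / (na + nb - 2)
t = (ma - mb) / (sp2 * (1 / na + 1 / nb)) ** 0.5

# One-way ANOVA F for the same two groups (df1 = 1)
grand = (sum(a) + sum(b)) / (na + nb)
bss = na * (ma - grand) ** 2 + nb * (mb - grand) ** 2
wss = ssa + ssb
f = (bss / 1) / (wss / (na + nb - 2))

print(abs(f - t ** 2) < 1e-9)  # True: F(1, df2) equals t(df2) squared
```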
2. We can use the F test more generally to compare any two or more variances—we
can use it to test for homoskedasticity (σ2x1 = σ2x2 = σ2x3 = σ2x4 = σ2x5 = . . . σ2xg )
3. We use the F test when assessing whether a multiple regression model
significantly explains variation in Y
G. Assessing the strength of the association between a categorical independent variable and an
interval-level dependent variable
1. a related measure, eta squared (η²), assesses the strength of the association between a
categorical variable with more than two categories and an interval-level variable. Eta
squared (η²) is also called the correlation ratio. Not used much, but you should just
know it exists
2. η² = between SS / total SS: the ratio of the amount of variance in the dep. var. that is due
to X to the total amount of variance in the dep. var.
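As a quick numerical illustration, η² = BSS/TSS can be computed directly from grouped data (the grades here are hypothetical):

```python
# Hypothetical grades grouped by professor
groups = {
    "H": [3.8, 3.9, 4.0],
    "M": [3.4, 3.5, 3.6],
    "L": [3.0, 3.1, 3.2],
}

all_y = [y for ys in groups.values() for y in ys]
grand = sum(all_y) / len(all_y)

tss = sum((y - grand) ** 2 for y in all_y)
bss = sum(len(ys) * (sum(ys) / len(ys) - grand) ** 2 for ys in groups.values())

eta_sq = bss / tss  # proportion of variance in grades explained by professor
print(round(eta_sq, 3))  # 0.941
```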