INDEPENDENCE & PROBABILITY
A. Before beginning this section, it is important to prevent a common
misunderstanding. If this section is not followed carefully, it is easy to
conclude, mistakenly, that it is always the case that
    Pr( A & B ) = Pr( A ) * Pr( B )
(i.e., that the probability that the events A and B jointly occur equals
the product of the probability that event A occurs times the probability
that event B occurs). In fact, for any given sample, estimates of these
probabilities are unlikely to yield this equality exactly. So, for the
record, please keep the following in mind:
1. It is always true that
    Pr( A & B ) = Pr( A | B ) * Pr( B )
(i.e., that the probability that the events A and B jointly occur equals the
product of the probability that event A occurs given that the event B
has occurred times the probability that event B occurs).
2. In the special case when A and B are independent events, however,
    Pr( A & B ) = Pr( A ) * Pr( B ) .
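The two rules above can be checked numerically. The following is a minimal sketch: the function names and the three probabilities are hypothetical, chosen only so that A becomes more likely once B has occurred.

```python
from fractions import Fraction

def joint_from_conditional(p_a_given_b, p_b):
    """General multiplication rule: Pr(A & B) = Pr(A | B) * Pr(B)."""
    return p_a_given_b * p_b

def is_independent(p_a_and_b, p_a, p_b):
    """True only in the special case Pr(A & B) = Pr(A) * Pr(B)."""
    return p_a_and_b == p_a * p_b

# Hypothetical values, invented for illustration only.
p_a = Fraction(1, 2)
p_b = Fraction(1, 2)
p_a_given_b = Fraction(4, 5)   # A is much more likely once B has occurred

p_a_and_b = joint_from_conditional(p_a_given_b, p_b)
print(p_a_and_b, p_a * p_b, is_independent(p_a_and_b, p_a, p_b))
```

Exact fractions are used so that the equality test is not disturbed by floating-point rounding.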
B. That said, let us now begin with an imaginary population. We draw two
units (call them Joe and Mary) from the population at random and evaluate
them according to some characteristic (e.g., age). (Note that we are
assigning the names Joe and Mary irrespective of the units' actual names
or genders.)
1. At this point we might ask ourselves, "What is the probability that Joe
is older than the mean age for the population?" On the one hand, the
age distribution might be skewed to the right (i.e., there may be many
more younger people than older people). If couples are having fewer
children, the distribution might be more symmetric (i.e., fewer babies
may be "balanced" by the number of older people from large birth cohorts
who have died off). On the other hand, if the population is of people
who abstain from sex altogether (e.g., a religious community like that
of the Shakers), the age distribution might be skewed to the left. In
the first case, the probability would be less than .5 (because the
population's mean age, μA, would be larger than the population's
median). In the last case, it would be greater than .5 (because μA
would be smaller than the population's median). Because the median is
the value of a variable such that .5 of the distribution has higher
values (and .5 has lower values) on the variable, it is only in the
second case (when the distribution is symmetric and mean = median) that we
may conclude the following:
    Pr( J ) = .5 , where J is the event that Joe's age > μA .
a. The probability that a single event occurs is sometimes referred to
as the MARGINAL PROBABILITY of the event. It is the probability of
an event irrespective of whether or not any other events may have
occurred.
b. Note that this use of the Pr(·) notation is somewhat different from
our earlier usage. The use of J here (and M, A, B, etc. later on)
corresponds to a specific event, not unlike the discrete attributes
of nominal-level variables. Prior references to Pr( X > k ) allow
X to take values along a continuum, not unlike the attributes of
ratio-level variables.
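The link between skew and Pr( J ) can be seen in a small computation. This is only a sketch: the three toy populations below are invented for illustration and are not data from these notes.

```python
from statistics import mean

def prob_above_mean(ages):
    """Share of a finite population whose age exceeds the population mean."""
    mu = mean(ages)
    return sum(1 for age in ages if age > mu) / len(ages)

# Hypothetical toy populations (ages in years), invented for illustration.
right_skewed = [5, 8, 10, 12, 15, 20, 25, 30, 60, 85]     # many young, few old
symmetric = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
left_skewed = [15, 40, 70, 75, 80, 85, 88, 90, 92, 95]    # few young, many old

for name, ages in [("right-skewed", right_skewed),
                   ("symmetric", symmetric),
                   ("left-skewed", left_skewed)]:
    print(name, prob_above_mean(ages))
```

In these toy populations the share above the mean is below .5, equal to .5, and above .5 respectively, matching the three cases described above.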
2. O.K. Then let's assume that Joe and Mary were randomly sampled from a
population with a symmetric age distribution. A JOINT PROBABILITY is
the probability that two or more events occur. Accordingly, we might
ask, "What is the joint probability that both Joe's and Mary's ages are
greater than the population's mean age?" To help in answering this
question, we can use a 2 x 2 table to depict the four relevant
combinations of events regarding Joe's and Mary's ages:

                        Joe
                   above   below
    Mary  above
          below

If Joe and Mary were sampled at random, then the probability that Mary's
age is greater than the population mean is independent of (i.e.,
unaffected by) whether or not Joe's age is greater than the mean. Note
that whenever two events are independent (as the values of units of
analysis on variables are when the units are sampled at random), the joint
probability of the events equals the product of the marginal
probabilities of the events. Using probability notation,
    Pr( J & M ) = Pr( J ) * Pr( M ) = .5 * .5 = .25 .
Accordingly, the joint probability represented by each cell in the table
equals .25 . Thus "the probability that EITHER Joe's OR Mary's (but NOT
BOTH'S) ages are greater than the mean" can be calculated as follows:
    Pr( J & not-M ) + Pr( not-J & M ) = .25 + .25 = .5 ,
where not-J and not-M are the events that Joe's and Mary's ages,
respectively, are below the mean.
NOTICE that this ONLY HOLDS WHEN Joe's and Mary's ages are independent
RANDOM events, as they would not be were Mary and Joe married to each
other, for example.
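The four cells of the Joe-and-Mary table can be enumerated directly. The sketch below assumes, as the text does, a symmetric age distribution and independent draws; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

p_above = Fraction(1, 2)   # Pr(J) = Pr(M) = .5 under a symmetric distribution

# Joint probability of each (Joe, Mary) cell, assuming independence.
cells = {}
for joe, mary in product(["above", "below"], repeat=2):
    p_joe = p_above if joe == "above" else 1 - p_above
    p_mary = p_above if mary == "above" else 1 - p_above
    cells[(joe, mary)] = p_joe * p_mary

both = cells[("above", "above")]
exactly_one = cells[("above", "below")] + cells[("below", "above")]
print(both, exactly_one, sum(cells.values()))
```

If Joe and Mary were married to each other, the cell probabilities could no longer be obtained by multiplying marginals, and this enumeration would not apply.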
3. DEFINITION: Two random variables are statistically independent if the
"conditional distribution" of one is the same within each level of the
other variable.
Thus, if Joe has one level of the age variable (e.g., being older than
the mean age), the probabilities are .5 that Mary (when selected) will
be younger than the mean age and .5 that Mary will be older than this.
If Joe's level on the age variable is "younger than the mean age," these
probabilities remain the same (i.e., the conditional distribution of
Mary's age is the same no matter what Joe's age level is). Got it?
C. Knowing a STATISTICAL DEPENDENCE when you see one
Let's now consider an example that shifts our thinking from probability
distributions for two units of analysis, to empirical distributions for two
groups (or clusters) of these units. In examining people's general anxiety
about life, you might hypothesize that the more religious people are, the
less anxious they are. This can be shown graphically as follows:
[Figure: two superimposed distributions of anxiety along an axis running
from "Not anxious" to "Very anxious," one labeled "religious" and the other
"not religious."]
Since the distribution of anxiety is different for different religiosity
levels, anxiety and religiosity are statistically dependent.
NOTE: If you assume that each of these superimposed distributions is
normal, you might wish to test whether the mean anxiety scores of their
corresponding subpopulations are significantly different. This is what is
done when you use a t-test, which we shall discuss at length later in the
semester.
D. Why random sampling is so important
1. Random sampling is a data-collection strategy for ensuring statistical
independence of observations among units of analysis (like Mary and
Joe).
As a consequence, any dependence found in one's data matrix can
only have originated in one of two ways:
a. On the one hand, the dependence may be due to nonrepresentative
peculiarities among the units of analysis randomly selected into the
sample.
This is what statisticians mean when they speak of results'
being due to sampling error. Always recognizing this possibility,
but armed with the central limit theorem, the statistician estimates
the probability that this has occurred. If this probability is
sufficiently small (commonly, less than .05), the dependence may have
an alternative, "nonerroneous" source.
b. On the other hand, if one's units of analysis are representative of
the population from which they were sampled (i.e., if sampling error
is not the culprit), then any statistical dependencies detected in
one's sample must reflect dependencies among variables in this larger
population. Accordingly, given that their random sampling allows
them to assume observations among subjects to be statistically
independent, statisticians generally speak of independence and
dependence among variables (not subjects).
2. Statisticians tend to think of statistical independence and dependence
in terms of the "random variables" of a data matrix. For example,
consider the following matrix with data on "n" units of analysis and "k"
variables:

                 Var1  Var2  Var3  ...  Vark
    person #1      2     4     2   ...    3
    person #2      1     5     7   ...    2
    person #3      3     2     4   ...    3
      ...         ...   ...   ...  ...   ...

Each number in this matrix is the value taken by a RANDOM VARIABLE.
Imagine that you have not yet collected your data and that you have just
made plans to identify the attributes of "n" persons on "k" variables.
Before collecting your data you can think of yourself as having an empty
data matrix with n-times-k place-holders for values that will be set
only once you have randomly selected the persons to be included in your
sample. These n-times-k place-holders are the random variables that you
have at your disposal. Note how the values taken by a "variable" (i.e.,
the values down a column of the matrix) are the values taken by the
"random variables" associated with it.
Because random sampling ensures that data on persons (or, more
generally, on units of analysis) are statistically independent, any
dependencies found among one's data are most likely due to statistical
DEPENDENCE among one's variables and not among one's subjects! This is
important because researchers are generally NOT interested in particular
units of analysis (except in that they are representative of their
populations-of-interest). Instead, the researcher's objective is to
understand associations (or dependencies) among variables. When units
of analysis have been sampled at random, not only does the central limit
theorem hold but you have ensured that associations found in your data
are ones among variables. I cannot overemphasize the importance of this
conclusion.
E. Thus far it has been claimed that when two variables are statistically
independent,
    Pr( A & B ) = Pr( A ) * Pr( B ) .
If A & B are two persons' values on the same variable (e.g., age) then this
equality holds whenever these persons have been RANDOMLY sampled.
F. Now let us imagine that we also know that Joe (the first person sampled) is
Catholic and that the Catholics in our population are (on the average)
younger than other subjects. (Say, they have lots of babies.) Then,
given that Joe is Catholic, the probability that Mary is older than Joe is
greater than the probability that she is not, assuming that Mary is
selected at random from our population. Let's take this example more
seriously:
Each of the following is a CONDITIONAL PROBABILITY (i.e., a probability of
an event's occurrence given your knowledge that another event[s] has
occurred). Taken together, they make up the conditional distribution
of respondents' above- or below-mean ages given that they have Roman
Catholic religious affiliation:
    Pr( X > μA | X = Catholic ) = .4
    Pr( X < μA | X = Catholic ) = .6
ASIDE: The "|" in these expressions is read as "given." For example, the
first equality is read, "Given (|) that X is Catholic, the probability is .4
that X has an age greater than (>) the average age in the population (μA)."
We can illustrate this using a VENN DIAGRAM:
[Venn diagram omitted.]
If you know that X = Catholic, then you are only looking at a subpopulation
of the respondents. The conditional probability refers to what proportion of
the Catholics are old vs. young. Finding the joint probability that someone is
BOTH old and Catholic requires that we know something more than is depicted
in the diagram. If you are drawing a sample from a population of
Catholics, then the probability of sampling a Catholic is one. In this
case you would be "given" that each unit of analysis is Catholic, leaving
the joint probability of being both old and Catholic equal to the
conditional probability of being old given Catholic affiliation.
Instead, let's assume that we are studying residents of Antwerp and that
70% of the people in this Belgian city are Catholic. I.e., we have the
following marginal distribution of religion:
    Pr( X = Catholic ) = .7
    Pr( X ≠ Catholic ) = .3
Now, let us consider the probability that someone is BOTH old and Catholic:
1. If we make the unlikely assumption that age is symmetrically distributed
among Antwerp's residents, we can conclude (given the discussion at the
outset of this section) that the marginal distribution of age is as follows:
    Pr( X > μA ) = .5
    Pr( X < μA ) = .5
2. Blindly applying the formula for statistically independent variables, we
calculate that
    Pr( X > μA ) * Pr( X = Catholic ) = .5 * .7 = .35 .
But this is NOT the probability that a respondent is an old Catholic,
since we know that if X = Catholic, then the probability that X > μA
equals .4 ! I.e., it is NOT .5, as Pr( X > μA ) is in the population.
So we need to change our formula slightly:
    Pr( X = old and Catholic ) = Pr( X = Catholic ) * Pr( X = old | X = Catholic )
                               = .7 * .4 = .28
(which is clearly less than .35).
We can illustrate these probabilities in a table:

Table 1: Table of Joint and Marginal Probabilities of Religion
and Age of Residents of Antwerp, Belgium.

                      Religion
                 Catholic   Other   Total
    Age  Old       .28       .22     .50
         Young     .42       .08     .50
         Total     .70       .30    1.00

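Every cell of a table like Table 1 follows from three numbers given in the text: Pr( Catholic ) = .7, Pr( old | Catholic ) = .4, and Pr( old ) = .5. The sketch below shows one way to fill in the cells; the function name and dictionary keys are hypothetical.

```python
from fractions import Fraction as F

def joint_table(p_catholic, p_old_given_catholic, p_old):
    """Joint probabilities of age (old/young) by religion (Catholic/other)."""
    old_cath = p_catholic * p_old_given_catholic   # Pr(C) * Pr(O | C)
    old_other = p_old - old_cath                   # rest of the "old" row
    young_cath = p_catholic - old_cath             # rest of the Catholic column
    young_other = 1 - old_cath - old_other - young_cath
    return {
        ("old", "catholic"): old_cath,
        ("old", "other"): old_other,
        ("young", "catholic"): young_cath,
        ("young", "other"): young_other,
    }

table = joint_table(p_catholic=F(7, 10), p_old_given_catholic=F(2, 5), p_old=F(1, 2))
for cell, p in table.items():
    print(cell, float(p))
```

The four joint probabilities sum to one, and each row or column sum reproduces the corresponding marginal probability.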
COMMENTS:
a. CONDITIONAL PROBABILITIES are calculated as follows:
    Pr( X > μA | X = Catholic ) = Pr( Old & Catholic ) / Pr( Catholic )
                                = .28 / .70 = .4
(Recall that if you know that X = Catholic, then you are only
looking at a subpopulation of 70% of the total. To find the
conditional probability of being old given being Catholic, you need
only ask, "What proportion of these 70% are old?")
Thus, the relation among conditional, marginal, and joint
probabilities is
    Conditional = Joint / Marginal .
NOTE: Be sure you can distinguish among "joint," "conditional," and
"marginal distributions"!!!
b. Writing O for the event of being old and C for the event of being
Catholic,
    Pr( O & C ) = Pr( O ) * Pr( C | O ) = .5 * (.28)/.5 = .28
                = Pr( C ) * Pr( O | C ) = .7 * (.28)/.7 = .28
BUT, note that
    Pr( O & C ) = Pr( C ) * Pr( O | C ) ≠ Pr( C ) * Pr( O ) !!!
c. When Pr( O | C ) ≠ Pr( O ) , then the two events, O & C, are not
statistically independent:
RECALL that statistical independence requires that the CONDITIONAL
DISTRIBUTION of one variable (e.g., Pr( O | C ) ) equals its MARGINAL
DISTRIBUTION (e.g., Pr( O ) ) for all levels of a second variable
(e.g., C).
d. Finally, when one variable is independent of the other, the other is
statistically independent of it. That is,
    Pr( C | O ) = Pr( C )   implies   Pr( O | C ) = Pr( O ) .
This follows, since by definition,
    Pr( O & C ) = Pr( O ) * Pr( C | O ) = Pr( C ) * Pr( O | C ) .
And if Pr( C | O ) = Pr( C ) , then
    Pr( O ) * Pr( C ) = Pr( C ) * Pr( O | C ) ,
so that (dividing both sides by Pr( C ))
    Pr( O | C ) = Pr( O ) !!!
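Comments c and d can be checked on two small joint tables: the Antwerp probabilities used above (.28 for old Catholics, with marginals .5 and .7) and a table built to be independent with the same marginals. This is a sketch with hypothetical names; the keys "OC", "On", "yC", "yn" stand for old/young crossed with Catholic/non-Catholic.

```python
from fractions import Fraction as F

def summarize(joint):
    """Marginal and conditional probabilities from a 2x2 joint table."""
    p_o = joint["OC"] + joint["On"]          # Pr(O): old row total
    p_c = joint["OC"] + joint["yC"]          # Pr(C): Catholic column total
    return {
        "Pr(O)": p_o,
        "Pr(C)": p_c,
        "Pr(O|C)": joint["OC"] / p_c,        # conditional = joint / marginal
        "Pr(C|O)": joint["OC"] / p_o,
    }

antwerp = {"OC": F(28, 100), "On": F(22, 100), "yC": F(42, 100), "yn": F(8, 100)}
independent = {"OC": F(35, 100), "On": F(15, 100), "yC": F(35, 100), "yn": F(15, 100)}

for name, table in [("antwerp", antwerp), ("independent", independent)]:
    s = summarize(table)
    print(name, s["Pr(O|C)"] == s["Pr(O)"], s["Pr(C|O)"] == s["Pr(C)"])
```

In both tables the two comparisons agree with each other: either both conditionals equal their marginals or neither does, which is the symmetry stated in comment d.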
G. NOW, THE BIG QUESTION: What is a knowledge of marginal, joint, and
conditional distributions good for?
1. Imagine that you wish to draw a multistage cluster sample of residents
from a city of 60,000. The sample is to be drawn in two stages and with
probability proportional to size (i.e., in a manner ensuring that each
resident has the same probability of being included in the sample). A
sample of 300 is to be obtained by randomly sampling 10 residents within
each of 30 randomly sampled blocks.
2. In the first stage of your sampling, you wish to sample each block (B)
with a probability that ensures that each of its residents (R) has the
same chance of being included in the final sample. Assuming that these
probabilities can be correctly assigned (e.g., based on census records)
and given your motivation to ensure that each of the city's 60,000
residents has the same probability of being included in a sample of size
300, this joint probability of sampling a resident's block and a
resident within that block would be
    Pr( R & B ) = 300 / 60,000 = .005 .
3. Now, imagine that a block with 100 residents is selected. Given that 10
residents are to be sampled from each sampled block, the probability of
the event that one of these 100 residents is included in the final
sample is
    Pr( R | B ) = 10 / 100 = .1 .
We can generalize this result to refer to any block with an arbitrary
number (say, k) of residents, to yield the conditional probability of
the event that one of a block's "k" residents is included in the final
sample:
    Pr( R | B ) = 10 / k .
4. Now let's consider the "probability proportionate to size" question:
Given the probability at which you wish each resident to be selected and
the probability at which a resident is randomly sampled from a block of
size k, at what probability must a block of this size be sampled? Here
we are asking for the marginal probability at which the 30 blocks are to
be sampled. The question's answer follows directly from our knowledge
of the relation among marginal, joint, and conditional distributions:
    Pr( B ) = Pr( R & B ) / Pr( R | B ) = .005 / ( 10 / k ) = k / 2000 .
5. Final comments regarding multistage cluster sampling: Once you have
found the probability at which blocks are to be sampled, there is
sampling software to help you in sampling blocks "weighted" according to
these probabilities. Also note that before doing this, you must ensure
that each block has at least 10 residents (you can't sample 10 if there
are only 6) and no more than 2000 residents (it doesn't make sense to
sample a block at a probability greater than one). If you are unclear on
the latter point, try calculating Pr( B ) for any k > 2000 , and you'll see
what I mean.
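The block-sampling arithmetic can be sketched in a few lines, using the city size (60,000), sample size (300) and per-block sample (10) given above. The function names are hypothetical, and the bounds check simply encodes the "at least 10, no more than 2000 residents" condition.

```python
from fractions import Fraction

CITY_SIZE = 60_000
SAMPLE_SIZE = 300
PER_BLOCK = 10

# Pr(R & B): every resident should be included with the same probability.
P_JOINT = Fraction(SAMPLE_SIZE, CITY_SIZE)

def p_resident_given_block(k):
    """Pr(R | B): 10 of the block's k residents are sampled."""
    return Fraction(PER_BLOCK, k)

def p_block(k):
    """Pr(B) = Pr(R & B) / Pr(R | B) for a block with k residents."""
    if k < PER_BLOCK:
        raise ValueError("cannot sample 10 residents from a smaller block")
    p = P_JOINT / p_resident_given_block(k)
    if p > 1:
        raise ValueError("block too large: sampling probability exceeds one")
    return p

for k in (10, 100, 2000):
    print(k, float(p_block(k)), float(p_block(k) * p_resident_given_block(k)))
```

Whatever the block size, the product Pr( B ) * Pr( R | B ) comes back to .005, which is what "probability proportional to size" is meant to guarantee.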
H. Marginal, joint, and conditional probabilities are also fundamental to
understanding the chi-square statistic. Let's assume that we have obtained
a random sample of 800 Antwerp residents, and that our data are as depicted
in Table 2. Based on these data we should be able to address research
questions such as, "Are Catholics younger (or older) than other residents?"

Table 2: Catholic and Non-Catholic Residents by Age.*

                      Religion
                 Catholic   Other   Total
    Age  Old       224       176     400
         Young     336        64     400
         Total     560       240     800

* Hypothetical data.
1. We can begin answering this question by determining whether Catholics'
ages are different from what one would EXPECT ON THE BASIS OF THE
MARGINAL AGE DISTRIBUTION OF ALL RESIDENTS. We can estimate the
marginal probabilities associated with being old and Catholic as
follows:
    Pr^( O ) = 400 / 800 = .5 ,
    Pr^( C ) = 560 / 800 = .7 ,
where Pr^( O ) is an estimator of Pr( O ), the probability that an Antwerp
resident is old, and where Pr^( C ) is an estimator of Pr( C ), the
probability that an Antwerp resident is Catholic. (The hat, ^, marks an
estimate calculated from the sample.)
2. Joint probabilities are obtained using numbers within the cells of the
table. For example,
    Pr^( O & C ) = 224 / 800 = .28 .
3. Now to the question at hand: "Is this joint probability different from
what one would expect to find by chance alone (i.e., from what one would
expect if age and religion were in fact unrelated among Antwerp
residents)?" We can answer this in parts:
a. We know that if age and religion are unrelated (a.k.a., independent),
    Pr( O & C ) = Pr( O ) * Pr( C ) .
b. Given this, we can now estimate how many old Catholics one would
expect to find in this sample (of size n = 800 ) if age and
religion were unrelated among all Antwerp residents:
    fe = Pr^( O ) * Pr^( C ) * n = .5 * .7 * 800 = 280 .
1) Note that the joint probability assumed here (i.e., given the
assumption of independence) equals .35 ( .5 * .7 ), whereas the
joint probability estimated from Table 2 is considerably smaller
than this (namely, .28 as calculated above).
2) As a direct consequence, this expected frequency (fe), 280, is 56
larger than the observed frequency (fo) of 224 in Table 2.
3) Given one expected frequency (e.g., fe = 280 ), we can
immediately determine all the other EXPECTED CELL FREQUENCIES by
ensuring that cell frequencies add up to the table's marginal
frequencies:

Table 3: Expected Frequencies of Catholic and Non-Catholic Residents by Age.

                      Religion
                 Catholic   Other   Total
    Age  Old       280       120     400
         Young     280       120     400
         Total     560       240     800

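Expected frequencies can also be computed for all cells at once as (row total * column total) / n. In the sketch below, the observed counts are the ones implied by the figures in the text (224 old Catholics, 400 old residents, 560 Catholics, 800 residents in all); the function name is hypothetical.

```python
def expected_frequencies(observed):
    """Expected counts under independence: row total * column total / n."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    return [[r * c / n for c in col_totals] for r in row_totals]

# Rows: old, young. Columns: Catholic, other.
# Only 224 and the marginals are stated in the text; the rest follow from them.
observed = [[224, 176],
            [336, 64]]

expected = expected_frequencies(observed)
print(expected)
print(expected[0][0] - observed[0][0])   # how far fe exceeds fo for old Catholics
```

The expected table keeps the same row and column totals as the observed one, which is why a single expected frequency fixes the other three in a 2 x 2 table.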
4. O.K. Now we know both the expected and observed cell frequencies
associated with our sample. BUT how different do these frequencies have
to be to be SIGNIFICANTLY DIFFERENT? This question is answered with the
CHI-SQUARE statistic.
Calculating chi-square requires that for each cell of your table, you
first subtract the expected from the observed cell size, then square the
difference, and divide this by the expected cell size. The sum of all
these CONTRIBUTIONS TO CHI-SQUARE has a specific probability
distribution. In particular, the sum is distributed as chi-square with
the number of degrees of freedom in your table. The formula for
chi-square is as follows:
    Chi-square = χ² = Σ ( fo - fe )² / fe ,
where the sum is taken over all cells of the table (from 1 to # cells).
a. There is only one degree of freedom in a 2x2 table. As mentioned
previously, this is because if you know the frequency in 1 cell, you
can determine the frequencies of all other cells based on both this
frequency and the marginal frequencies of the table. In general, you
can decide how many degrees of freedom you have by using the formula
    ( r - 1 ) * ( c - 1 ) ,
where r = the number of rows in the table and c = the number of
columns in the table. If you added a third variable with "d"
attributes to the table, the resulting 3-dimensional table would have
( r - 1 ) * ( c - 1 ) * ( d - 1 ) degrees of freedom. This can be
generalized further.
b. O.K., so we calculate it and discover that chi-square = 74.67 . Now
what? Well, we want to know if Catholics are younger than we would
expect on the basis of the marginals alone. To find out, we look at
Table C from among the tables handed out in class. There it
indicates that in a table with 1 degree of freedom,
    Pr( χ²₁ > 10.827 ) = .001 .
1) Whereas in the standard normal table (Table A) the body of the
table contains probabilities and values of the z-statistic head
the rows and columns, in your chi-square table (Table C) the body
of the table contains values of the chi-square statistic, columns
are headed by probabilities, and rows are headed by "degrees of
freedom" for your table.
2) Because our chi-square of 74.67 is (considerably) larger than
10.827, the probability is less than .001 that sampling error alone
would make our observed frequencies differ this much from the
frequencies we would have expected if age and religion were
statistically independent. That is, it is very unlikely
that the chi-square of 74.67 reflects a chance occurrence due to
peculiarities of our sample.
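The chi-square of 74.67 can be reproduced from the observed counts. The sketch below assumes the cell counts implied by the text (224 old Catholics with marginals 400, 560 and 800). In place of Table C it uses the fact that a chi-square variable with 1 degree of freedom is a squared standard normal, so its upper-tail probability is erfc( sqrt( x / 2 ) ). The function names are hypothetical.

```python
import math

def chi_square(observed):
    """Pearson chi-square: sum over cells of (fo - fe)^2 / fe."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = 0.0
    for i, row in enumerate(observed):
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / n
            total += (fo - fe) ** 2 / fe
    return total

def degrees_of_freedom(observed):
    """(r - 1) * (c - 1) for a two-variable table."""
    return (len(observed) - 1) * (len(observed[0]) - 1)

def tail_prob_1df(x):
    """Pr(chi-square with 1 df > x); valid for 1 degree of freedom only."""
    return math.erfc(math.sqrt(x / 2))

observed = [[224, 176],   # old:   Catholic, other
            [336, 64]]    # young: Catholic, other
x2 = chi_square(observed)
print(round(x2, 2), degrees_of_freedom(observed), tail_prob_1df(x2) < 0.001)
```

For tables with more than one degree of freedom the tail function above does not apply; a printed table or a statistics library would be needed instead.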
I. STATISTICAL SIGNIFICANCE versus THEORETICAL CONFIRMATION
1. Once you know that a table contains a statistically significant
association, you still do not know if the association is in the
hypothesized direction. For example, chi-square could equal 74.67
because Catholics are significantly older than non-Catholics. So it is
only by returning to the table that we can definitively establish
whether Catholics are significantly younger than non-Catholics. In the
above case we can conclude that they are, because the observed number of
old Catholics (fo = 224) is smaller than the number expected under
independence (fe = 280).
2. One of the most common mistakes made by social scientists is to
conclude that a statistically significant finding supports their theory
despite the fact that (without their having noticed it) the direction of
the significant association is opposite to that suggested in their
theory. Please be careful not to mistake statistical significance for
theoretical confirmation.
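A direction check is easy to automate alongside the significance test. The sketch below compares the share of old residents among Catholics and among others, using counts implied by the figures in the text (224 old among 560 Catholics; 400 old among 800 residents, hence 176 old among 240 others). The labels are hypothetical.

```python
def proportion_old(old, young):
    """Estimated conditional probability of being old within one group."""
    return old / (old + young)

catholic_old = proportion_old(old=224, young=336)   # 224 / 560
other_old = proportion_old(old=176, young=64)       # 176 / 240

hypothesized = "catholics_younger"
found = "catholics_younger" if catholic_old < other_old else "catholics_not_younger"

print(round(catholic_old, 2), round(other_old, 2), found == hypothesized)
```

Had the two rows of counts been swapped, chi-square would be unchanged but the direction check would fail, which is the situation item 2 warns about.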
J. STATISTICAL SIGNIFICANCE versus SUBSTANTIVE IMPORTANCE
IMAGINE that our data on the Belgian city are as follows:

Table 4: Catholic and Non-Catholic Residents by Age.*

                      Religion
                 Catholic   Other   Total
    Age  Old     [cell frequencies not recoverable]
         Young
         Total     560       240     800

* Hypothetical data.
The table shows Catholic residents seven percentage points (7%) more likely
to be young than residents with other religious affiliations.
1. Is this statistically significant? Yes, it is (at α = .05 ). The
calculated value of chi-square equals 4.02 , which (since it is larger
than χ²₁ at the .05 level, 3.841 [see Table C]) indicates that a result
this strong or stronger would occur by chance in fewer than one same-sized
random sample in twenty.
2. Is the result SUBSTANTIVELY IMPORTANT? What do you think?
a. THERE IS NO STATISTICAL BASIS FOR DECIDING how much of a difference
is substantively important. You must consult your theory, your
colleagues, and your common sense.
b. Using the .05 significance level and considering only a difference of
10% or more to be substantively important, the findings in Table 4 are
statistically significant, but not substantively important. Because
statisticians (like you) must decide what is substantively important,
their statistics are unable to "speak for themselves."
K. RELATING CHI-SQUARE BACK TO THE CONCEPT OF "STATISTICAL INDEPENDENCE":
Let's see just what we have done here. We have two events, O = the event
of being "old" and C = the event of being Catholic. When we collect our
data, they are in this form:

Table 5: Dummy Table of Catholics and Old Residents.

                  C       not-C    Total
    O             a         b      a + b
    not-O         c         d      c + d
    Total       a + c     b + d      n

The question is, then, whether or not O and C are statistically
independent. To decide this, we need only find out if
    Pr( O & C ) = Pr( O ) * Pr( C )
(i.e., if the joint probability of O and C equals the product of the
marginal probabilities of the events O and C).
1. The marginal probabilities of O and C are estimated as follows:
    Pr^( O ) = ( a + b ) / n ,    Pr^( C ) = ( a + c ) / n .
2. The joint probability of O and C is estimated as follows:
    Pr^( O & C ) = a / n .
3. Thus, a good test of statistical independence should evaluate whether
    Pr( O & C ) = Pr( O ) * Pr( C ) .
In terms of our dummy table, it should evaluate the extent that
    Pr^( O & C ) = Pr^( O ) * Pr^( C )
or that
    a / n = [ ( a + b ) / n ] * [ ( a + c ) / n ] .
RECALL that in calculating chi-square, we sum up ( fo - fe )² / fe over
all cells. For the a-cell in Table 5, the observed frequency is fo = a and
the expected frequency is
    fe = [ ( a + b ) / n ] * [ ( a + c ) / n ] * n .
So, the contribution to chi-square of the a-cell is
    { a - [ ( a + b ) / n ] * [ ( a + c ) / n ] * n }²
    ------------------------------------------------
          [ ( a + b ) / n ] * [ ( a + c ) / n ] * n
Multiplying by ( 1/n² ) / ( 1/n² ) we get
        { a / n - [ ( a + b ) / n ] * [ ( a + c ) / n ] }²
    n * --------------------------------------------------
              [ ( a + b ) / n ] * [ ( a + c ) / n ]
This is the CONTRIBUTION of cell "a" to chi-square in a two-variable
table.
A few COMMENTS about this "contribution" are in order:
a. The magnitude of the contribution meets one important criterion of a
good test of statistical independence, since it equals zero when
    Pr^( O & C ) = a / n = [ ( a + b ) / n ] * [ ( a + c ) / n ] = Pr^( O ) * Pr^( C ) .
b. In a three-variable table this contribution would look like this:
    [formula not recoverable]
This can be generalized further.
c. NOTICE how the 'n' in the formula for the contribution suggests that
the larger one's sample size, the more likely a statistically
significant difference between Pr^( O ) * Pr^( C ) and Pr^( O & C )
will be detected.
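The two forms of the a-cell contribution (one in counts, one in estimated probabilities multiplied by n) can be compared numerically. This is a sketch with hypothetical function names; the first set of counts is the one implied by the Antwerp example (a = 224, b = 176, c = 336, d = 64), and the second is a table that is exactly independent.

```python
import math

def contribution_from_counts(a, b, c, d):
    """(fo - fe)^2 / fe for the a-cell, with fe = (a+b)/n * (a+c)/n * n."""
    n = a + b + c + d
    fe = (a + b) / n * (a + c) / n * n
    return (a - fe) ** 2 / fe

def contribution_from_probs(a, b, c, d):
    """n * (Pr(O&C) - Pr(O)Pr(C))^2 / (Pr(O)Pr(C)), using sample estimates."""
    n = a + b + c + d
    p_o, p_c, p_oc = (a + b) / n, (a + c) / n, a / n
    return n * (p_oc - p_o * p_c) ** 2 / (p_o * p_c)

dependent = (224, 176, 336, 64)
independent = (280, 120, 280, 120)

print(contribution_from_counts(*dependent), contribution_from_probs(*dependent))
print(contribution_from_counts(*independent), contribution_from_probs(*independent))
```

The two functions agree up to floating-point rounding, and both return (essentially) zero when the estimated joint probability equals the product of the estimated marginals.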
L. About the chi-square distribution
1. Chi-square and sample size
a. We have a 2 x 2 table:
[2 x 2 table with n = 20 and chi-square = .95 ; cell frequencies not
recoverable.]
We now can estimate the probability this happened by chance. From
Table C, we find that Pr( χ²₁ > 1.074 ) = .30 . (Note that 1.074 is
as close to our chi-square of .95 as we can find in Table C.) Thus
(using a bit of mental interpolation), the probability of getting a
chi-square as large as .95 by chance is about .34 . That is, one
would expect a chi-square this large or larger in about one out of
every three tables this size. (I.e., it is VERY probable.)
b. Now, all other things remaining equal, imagine that we have a sample ten times as large: [table missing in source; n = 200, chi-square = 9.5]
From Table C we find that $\Pr(\chi^2_1 > 10.827) = .001$ and $\Pr(\chi^2_1 > 6.635) = .01$. Again using a bit of mental interpolation, we can conclude that $\Pr(\chi^2_1 > 9.5) = .004$ (or so). That is, the probability of getting a chi-square this large by chance is about 4 in 1,000 samples (i.e., NOT probable at all).
NOTE that this chi-square is exactly 10 times as large as the first one and is based on a sample exactly 10 times larger. (This is no coincidence.) If two tables have the same relative cell sizes, but one is based on a sample k times as large, the chi-square for the larger table will be k times that of the smaller. (An algebraic proof of this statement is given on page 58.)
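The tail probabilities read from Table C in items a and b can also be computed directly. For one degree of freedom, Pr(chi-square > x) equals Pr(|Z| > sqrt(x)), which the Python standard library can evaluate through the complementary error function. The sketch below relies only on that identity; the helper name chi2_1df_tail is hypothetical.

```python
import math


def chi2_1df_tail(x):
    """Pr(chi-square with 1 df > x) = Pr(|Z| > sqrt(x)) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))


# Table C entries quoted in the text, plus the two chi-squares of interest.
for x in (0.95, 1.074, 6.635, 9.5, 10.827):
    print(f"Pr(chi2_1 > {x}) = {chi2_1df_tail(x):.4f}")
```

Computed this way, the probabilities for .95 and 9.5 come out near .33 and .002. Straight-line interpolation between widely spaced table entries tends to overstate a tail probability, because the tail of this distribution is convex, so the interpolated figures in the text are best read as rough approximations.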
c. We can make use of this insight by addressing a new question: How large a sample would we need for the same relative cell sizes to be detected as statistically significant at the .05 level?
We know the following:
20 is the original sample size.
.95 is the chi-square for this sample.
3.841 is the size of chi-square we need to detect.
Since $20 \times k = n$ and $.95 \times k = 3.841$, then $k = 4.04$ and $n = 81$.
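The arithmetic in item c can be written out as a short calculation. This is a minimal sketch; the function name required_sample_size is hypothetical, and rounding n up to the next whole unit is an assumption that matches the n = 81 given above.

```python
import math


def required_sample_size(n_original, chi2_original, chi2_needed):
    """Scale factor k and sample size n at which chi-square reaches chi2_needed,
    assuming the relative cell sizes stay the same."""
    k = chi2_needed / chi2_original
    n = math.ceil(n_original * k)  # round up to a whole number of units
    return k, n


k, n = required_sample_size(20, 0.95, 3.841)
print(f"k = {k:.2f}, n = {n}")
```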
2. Chi-square should only be used when $f_e \geq 5$ in all cells of a 2x2 table. When one's table is larger than 2x2, 75% of the table's cells should have $f_e \geq 5$ and all cells should have $f_e > 1$.¹ If $f_e < 5$ for too many cells, then Fisher's exact test (for 2x2 tables) or an extension of this test (neither covered in this course) should be used. Alternatively, larger expected cell sizes can be ensured by dropping or collapsing the table's categories, or by collecting more data.
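The rule of thumb above can be applied mechanically once the expected frequencies are in hand. The sketch below assumes a two-way table given as a list of rows of counts; the function names and the example counts are hypothetical.

```python
def expected_counts(table):
    """Expected frequencies fe = (row total * column total) / n for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_is_appropriate(table):
    """Apply the rule of thumb from the text to a two-way table of counts."""
    fe = [value for row in expected_counts(table) for value in row]
    if len(table) == 2 and len(table[0]) == 2:
        # 2x2 table: every expected frequency must be at least 5.
        return all(value >= 5 for value in fe)
    # Larger table: at least 75% of cells with fe >= 5, and every fe > 1.
    share_at_least_5 = sum(value >= 5 for value in fe) / len(fe)
    return share_at_least_5 >= 0.75 and all(value > 1 for value in fe)


# Hypothetical counts, not from the notes.
small = [[3, 2], [4, 11]]
larger = [[12, 9, 4], [15, 20, 6], [8, 14, 12]]

print(expected_counts(small))
print(chi_square_is_appropriate(small), chi_square_is_appropriate(larger))
```

In the assumed 2x2 table the smallest expected frequency is 1.75, so the check fails and Fisher's exact test would be the safer choice; the assumed 3x3 table passes.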
3. Trivia
a. The shapes of chi-square distributions change with different degrees of freedom. When the degrees of freedom get larger than 20 or so, chi-square takes on the shape of a normal distribution.
b. The MEAN of a chi-square distribution equals its degrees of freedom; its VARIANCE equals twice its degrees of freedom (i.e., $\mathrm{Var}(\chi^2_{df}) = 2 \times df$).
c. If $Z \sim N(0,1)$, then $Z^2 \sim$ chi-square with one degree of freedom. You can verify this by comparing Table A with Table C. For example,
\[ \Pr(|Z| > 1.96) = .05 = \Pr(\chi^2_1 \equiv Z^2 > 3.84 = [1.96]^2) . \]
(NOTE: "$\equiv$" means "equals by definition.") ALSO the sum of two squared standard normal random variables is distributed as a chi-square random variable with TWO degrees of freedom. This can be generalized further to sums of more squared standard normal random variables.
¹ Alan Agresti and Barbara Finlay. 1986. Statistical Methods for the Social Sciences, 2nd edition. San Francisco, CA: Dellen, p. 207.
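Items b and c of the Trivia can be checked by simulation rather than by comparing tables. The sketch below builds chi-square draws as sums of squared standard normal draws and then looks at the share of one-degree-of-freedom draws above 3.84 and at the mean and variance of the two-degree-of-freedom draws. The function name, the seed, and the number of draws are arbitrary assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable


def simulate_chi_square(df, draws=20000):
    """Each draw is the sum of df squared standard normal variables."""
    return [
        sum(random.gauss(0.0, 1.0) ** 2 for _ in range(df))
        for _ in range(draws)
    ]


one_df = simulate_chi_square(1)
two_df = simulate_chi_square(2)

# Share of Z^2 values above 3.84; should be close to .05.
share_above_3_84 = sum(x > 3.84 for x in one_df) / len(one_df)

# Mean and variance with two degrees of freedom; should be close to 2 and 4.
mean_2 = statistics.mean(two_df)
var_2 = statistics.pvariance(two_df)

print(share_above_3_84, mean_2, var_2)
```

The simulated values only approximate the theoretical ones, and they move closer as the number of draws grows.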
M. One final example:
You have a contingency table with four variables:
Religious affiliation: Catholic, Protestant, Jewish, Buddhist.
Gender: Male, Female.
Political affiliation: Republican, Democrat, Independent, Other.
Ethnicity: Black, White, Other.
We calculate a chi-square for the table and it equals 26. This value of chi-square does NOT provide significant evidence (at $\alpha = .05$) that the data in this table vary significantly from what you would expect by chance.
Drawing this conclusion begins by determining that the appropriate degrees of freedom for this table equal 18 ( = [4-1] * [2-1] * [4-1] * [3-1] ). Then consulting Table C, we find that $\chi^2_{18,\,.10} = 25.989$ and $\chi^2_{18,\,.05} = 28.869$. Accordingly, one would expect a $\chi^2_{18}$ this large in about 1 in 10 samples.
N. GENERAL CONCLUSIONS
1. Normal distribution
a. Indicates the probabilities of random fluctuations around a population parameter.
b. The larger one's sample (n), the more closely the sampling distribution of each unbiased point-estimate-statistic will approximate a normal distribution with a mean of the population parameter (the one that the statistic estimates) and with a variance that is inversely proportional to the sample size.
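The claim that the variance of a sampling distribution is inversely proportional to the sample size can be illustrated by simulation. The sketch below assumes a uniform(0, 1) population and uses the sample mean as the point estimate; the population, the sample sizes, the seed, and the function name are all arbitrary choices made for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable


def variance_of_sample_means(n, replications=4000):
    """Variance, across replications, of the mean of n uniform(0, 1) draws."""
    means = [
        sum(random.random() for _ in range(n)) / n
        for _ in range(replications)
    ]
    return statistics.pvariance(means)


var_n25 = variance_of_sample_means(25)
var_n100 = variance_of_sample_means(100)

# Quadrupling the sample size should cut the variance to about one quarter.
ratio = var_n25 / var_n100
print(var_n25, var_n100, ratio)
```

With these settings the ratio should land near 100 / 25 = 4, up to simulation noise.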
2. Chi-square distribution
a. Indicates the probabilities of SQUARED random fluctuations around a population parameter. (Recall that $Z^2 \sim$ chi-square with one degree of freedom, that the sum of two squared standard normal random variables is distributed as a chi-square random variable with two degrees of freedom, etc.) In this sense, the chi-square distribution allows you to determine whether you have a "normal" amount of variation. Differently put, chi-square measures the degree to which your actual data vary from what you would expect knowing only your marginal distributions.
b. The larger one's sample (n), the larger the value of chi-square for any given set of joint probabilities. In fact, all things equal, increasing one's sample size by a factor of "k" will increase chi-square by exactly this factor. That is, whereas point-estimate-statistics approach population parameters' values for increasingly larger samples, chi-square-statistics approach infinity (provided the joint probabilities depart at all from independence).
These two distributions are the most important in all of statistical theory. As do all probability distributions, they allow us to make probabilistic statements about interrelations among variables when these variables measure attributes of randomly sampled subjects (or other units of analysis).