Sampling and Inference

The Quality of Data and Measures
March 23, 2006
1
Why we talk about sampling
• General citizen education
• Understand data you’ll be using
• Understand how to draw a sample, if you
need to
• Make statistical inferences
2
Why do we sample?
[Figure: cost and benefit curves as sample size N grows: the benefit (precision) curve flattens out while the cost (hassle factor) curve keeps rising]
3
How do we sample?
• Simple random sample
– Variant: systematic sample with a random start
• Stratified
• Cluster
4
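As a concrete illustration, here is a minimal sketch in Python (the sampling frame and sample size are made up) of the first two designs: a simple random sample and a systematic sample with a random start.

import random

population = list(range(10_000))   # hypothetical sampling frame of 10,000 units
n = 100                            # desired sample size

# Simple random sample: every subset of size n is equally likely
srs = random.sample(population, n)

# Systematic sample with a random start: take every k-th unit,
# beginning at a random position within the first interval
k = len(population) // n           # sampling interval
start = random.randrange(k)
systematic = population[start::k][:n]

print(len(srs), len(systematic))

Stratified and cluster designs start the same way, but draw within strata or within sampled clusters rather than from the whole list.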
Stratification
• Divide the population into subsamples (strata), based on known characteristics (race, sex, religiosity, continent, department)
• Benefit: preserve or enhance variability
5
Cluster sampling
[Figure: nested cluster structure: blocks contain household (HH) units, which contain individuals]
6
Effects of samples
• Obvious: influences marginals
• Less obvious
– Allows effective use of time and effort
– Effect on multivariate techniques
• Sampling on the independent variable: greater precision in regression estimates
• Sampling on the dependent variable: bias
7
Sampling on Independent
Variable
[Figure: scatterplots of y against x illustrating sampling on the independent variable]
8
Sampling on Dependent Variable
[Figure: scatterplots of y against x illustrating sampling on the dependent variable]
9
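The two figures above can be reproduced with a small simulation. The sketch below uses made-up data with a true slope of 2: selecting cases by extreme values of x leaves the estimated slope essentially unchanged, while selecting cases with large y pulls it toward zero.

import numpy as np

rng = np.random.default_rng(0)
n = 1_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)        # true slope = 2

def slope(x, y):
    # OLS slope of y on x
    return np.polyfit(x, y, 1)[0]

print("full sample:         ", round(slope(x, y), 2))              # about 2.0

# Sampling on the independent variable: keep only extreme x
keep = np.abs(x) > 1
print("sampled on x (|x|>1):", round(slope(x[keep], y[keep]), 2))   # still about 2.0

# Sampling on the dependent variable: keep only high y
keep = y > 1
print("sampled on y (y>1):  ", round(slope(x[keep], y[keep]), 2))   # biased toward 0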
Sampling
Consequences for Statistical
Inference
10
Statistical Inference:
Learning About the Unknown From the
Known
• Reasoning forward: distributions of sample
means, when the population mean, s.d., and
n are known.
• Reasoning backward: learning about the population mean when only the sample mean, s.d., and n are known.
11
Reasoning Forward
12
Exponential Distribution
Example
[Figure: histogram of the skewed income variable (inc), 0 to 1,000,000: Mean = 250,000; Median = 125,000; s.d. = 283,474; Min = 0; Max = 1,000,000]
13
Consider 10 random samples, of
n = 100 apiece
Sample    Mean
1         253,396.9
2         198,789.6
3         271,074.2
4         238,928.7
5         280,657.3
6         241,369.8
7         249,036.7
8         226,422.7
9         210,593.4
10        212,137.3

[Figure: histogram of the population income variable (inc), as on the previous slide]
14
Consider 10,000 samples of n =
100
[Figure: histogram of the 10,000 sample means: N = 10,000; Mean = 249,993; s.d. = 28,559; Skewness = 0.060; Kurtosis = 2.92]
15
Consider 1,000 samples of
various sizes
[Figure: histograms of 1,000 sample means at each sample size]
n = 10:    Mean = 250,105; s.d. = 90,891; Skew = 0.38;  Kurt = 3.13
n = 100:   Mean = 250,498; s.d. = 28,297; Skew = 0.02;  Kurt = 2.90
n = 1,000: Mean = 249,938; s.d. = 9,376;  Skew = -0.50; Kurt = 6.80
16
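A short simulation reproduces the pattern in these panels. The sketch below is only a rough stand-in: it uses an exponential population with mean 250,000, capped at 1,000,000, since the exact population behind the slides is not given.

import numpy as np

rng = np.random.default_rng(42)

def draw_sample(n):
    # stand-in for the skewed income variable: exponential with mean 250,000,
    # capped at 1,000,000 (the slides' exact population is not given)
    return np.minimum(rng.exponential(scale=250_000, size=n), 1_000_000)

for n in (10, 100, 1_000):
    means = np.array([draw_sample(n).mean() for _ in range(1_000)])
    print(f"n={n:>5}: mean of sample means = {means.mean():,.0f}, "
          f"s.d. of sample means = {means.std(ddof=1):,.0f}")

The spread of the sample means shrinks roughly in proportion to 1/√n, which is what the three histograms show.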
Difference of means example
State 1: Mean = 250,000   [histogram of inc, 0 to 1,000,000]
State 2: Mean = 300,000   [histogram of inc2, 0 to 1,000,000]
17
Take 1,000 samples of 10, of
each state, and compare them
First 10 samples
Sample    State 1        State 2
1         311,410    <   365,224
2         184,571    <   243,062
3         468,574    >   438,336
4         253,374    <   557,909
5         220,934    >   189,674
6         270,400    <   284,309
7         127,115    <   210,970
8         253,885    <   333,208
9         152,678    <   314,882
10        222,725    >   152,312
18
1,000 samples of 10
[Scatterplot of the 1,000 paired sample means, (mean) inc2 against (mean) inc]
State 2 > State 1: 673 times
19
1,000 samples of 100
[Scatterplot of the 1,000 paired sample means, (mean) inc2 against (mean) inc]
State 2 > State 1: 909 times
20
1,000 samples of 1,000
[Scatterplot of the 1,000 paired sample means, (mean) inc2 against (mean) inc]
State 2 > State 1: 1,000 times
21
Another way of looking at it:
The distribution of Inc2 – Inc1
[Figure: histograms of the 1,000 differences in sample means (inc2 - inc) at each sample size]
n = 10:    Mean = 51,845; s.d. = 124,815
n = 100:   Mean = 49,704; s.d. = 38,774
n = 1,000: Mean = 49,816; s.d. = 13,932
22
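The same exercise in code, again with stand-in skewed populations whose means differ by 50,000, since the actual state distributions are not reproduced here:

import numpy as np

rng = np.random.default_rng(7)

# stand-ins for the two skewed state income distributions (means 250,000 and 300,000)
def state1(n): return rng.exponential(scale=250_000, size=n)
def state2(n): return rng.exponential(scale=300_000, size=n)

for n in (10, 100, 1_000):
    diffs = np.array([state2(n).mean() - state1(n).mean() for _ in range(1_000)])
    print(f"n={n:>5}: State 2 > State 1 in {(diffs > 0).sum():>4} of 1,000 samples; "
          f"mean diff = {diffs.mean():,.0f}, s.d. = {diffs.std(ddof=1):,.0f}")

As n grows, the sampling distribution of the difference tightens around 50,000 and the share of samples in which State 2 comes out ahead approaches 100%, matching the counts on the preceding slides.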
Play with some simulations
• http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html
• http://www.kuleuven.ac.be/ucs/java/index.htm
23
Reasoning Backward
When you know n, X̄, and s, but want to say something about μ
24
Central Limit Theorem
As the sample size n increases, the distribution of the mean X̄ of a random sample taken from practically any population approaches a normal distribution, with mean μ and standard deviation σ/√n.
25
Calculating Standard Errors
In general:

std. err. = s / √n
26
Most important standard errors
Mean:                        s / √n
Proportion:                  √( p(1 − p) / n )
Diff. of 2 means:            √( s1²/n1 + s2²/n2 )
Diff. of 2 proportions:      √( p1(1 − p1)/n1 + p2(1 − p2)/n2 )
Diff. of 2 means (paired):   sd / √n
Regression (slope) coeff.:   ( s.e.r. / √(n − 1) ) * ( 1 / sx )
27
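These formulas translate directly into code. The small helper functions below are a convenience sketch (not part of the lecture):

from math import sqrt

def se_mean(s, n):                        # mean
    return s / sqrt(n)

def se_proportion(p, n):                  # proportion
    return sqrt(p * (1 - p) / n)

def se_diff_means(s1, n1, s2, n2):        # difference of 2 means
    return sqrt(s1**2 / n1 + s2**2 / n2)

def se_diff_proportions(p1, n1, p2, n2):  # difference of 2 proportions
    return sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

def se_paired_diff(s_d, n):               # difference of 2 means, paired data
    return s_d / sqrt(n)

def se_slope(ser, s_x, n):                # bivariate regression slope
    return (ser / sqrt(n - 1)) * (1 / s_x)

With the lecture's numbers, se_mean(2196, 15) gives about 567, se_proportion(.37, 1000) about .015, and se_slope(13.8, 8.14, 14) about 0.47.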
Using Standard Errors, we can
construct “confidence intervals”
• Confidence interval (ci): an interval between two numbers, constructed so that there is a specified level of confidence that the population parameter lies within it
• ci = sample estimate ± multiple * sample standard error
28
Constructing Confidence Intervals
• Let’s say we draw a sample of tuitions from 15
private universities. Can we estimate what the
average of all private university tuitions is?
• N = 15
• Average = 29,735
• s.d. = 2,196
• s.e. = s / √n = 2,196 / √15 ≈ 567
29
N = 15; avg. = 29,735; s.d. = 2,196; s.e. = s/√n = 567
The Picture
[Figure: normal curve centered at 29,735; ±1 s.e.: 29,168 to 30,302 (68%); ±2 s.e.: 28,601 to 30,869 (95%); ±3 s.e.: 99%]
30
Confidence Intervals for Tuition
Example
• 68% confidence interval = 29,735 ± 567 = [29,168 to 30,302]
• 95% confidence interval = 29,735 ± 2*567 = [28,601 to 30,869]
• 99% confidence interval = 29,735 ± 3*567 = [28,034 to 31,436]
31
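The same intervals can be computed directly from the summary numbers on the slide. The sketch below also adds a t-based 95% interval (not on the slide), which is slightly wider because n is only 15.

from math import sqrt
from scipy import stats

n, mean, sd = 15, 29_735, 2_196
se = sd / sqrt(n)                                # about 567

for mult, label in [(1, "68%"), (2, "95%"), (3, "99%")]:
    print(f"{label}: [{mean - mult * se:,.0f} to {mean + mult * se:,.0f}]")

# with n = 15, a t multiplier (df = 14) is a bit more appropriate than 2
t95 = stats.t.ppf(0.975, df=n - 1)               # about 2.14
print(f"t-based 95%: [{mean - t95 * se:,.0f} to {mean + t95 * se:,.0f}]")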
What if someone (ahead of time) had
said, “I think the average tuition of
major research universities is $25k”?
• Note that $25,000 is well out of the 99%
confidence interval, [28,034 to 31,436]
• Q: How far away is the $25k estimate from
the sample mean?
– A: Do it in z-scores: (29,735-25,000)/567 =
8.35
32
Constructing confidence intervals of
proportions
• Let us say we drew a sample of 1,000 adults and asked
them if they approved of the way George Bush was
handling his job as president. (March 13-16, 2006 Gallup
Poll) Can we estimate the % of all American adults who
approve?
• N = 1,000
• p = .37
• s.e. = √( p(1 − p) / n ) = √( .37(1 − .37) / 1000 ) ≈ 0.02
33
N = 1,000; p = .37; s.e. = √(p(1 − p)/n) ≈ .02
The Picture
[Figure: normal curve centered at .37; ±1 s.e.: .35 to .39 (68%); ±2 s.e.: .33 to .41 (95%); ±3 s.e.: 99%]
34
Confidence Intervals for Bush
approval example
• 68% confidence interval = .37 ± .02 = [.35 to .39]
• 95% confidence interval = .37 ± 2*.02 = [.33 to .41]
• 99% confidence interval = .37 ± 3*.02 = [.31 to .43]
35
What Gallup said about the
confidence interval
• Results are based on telephone interviews
with 1,000 national adults, aged 18 and
older, conducted March 13-16, 2006. For
results based on the total sample of national
adults, one can say with 95% confidence
that the maximum margin of sampling error
is ±3 percentage points [because the actual
standard error is 1.5%, not 2%].
36
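Gallup's ±3 points can be reproduced from the proportion formula. A quick check in code, using the slide's n and p:

from math import sqrt

n, p = 1_000, 0.37

se = sqrt(p * (1 - p) / n)                  # about 0.015
print(f"s.e. at p = .37:       {se:.4f}")
print(f"95% margin at p = .37: ±{1.96 * se:.3f}")

# Gallup's "maximum margin of sampling error" assumes the worst case, p = .5
se_max = sqrt(0.5 * 0.5 / n)
print(f"95% maximum margin:    ±{1.96 * se_max:.3f}")   # about ±0.03, i.e. ±3 points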
What if someone (ahead of time) had
said, “I think Americans are equally
divided in how they think about
Bush.”
• Note that 50% is well out of the 99%
confidence interval, [31% to 43%]
• Q: How far away is the 50% estimate from
the sample proportion?
– A: Do it in z-scores: (.37-.5)/.02 = -6.5 [-8.7 if we divide by the exact s.e. of 0.015]
37
Constructing confidence intervals of
differences of means
• Let’s say we draw a sample of tuitions from 15
private and public universities. Can we estimate
what the difference in average tuitions is between
the two types of universities?
• N = 15 in both cases
• Average = 29,735 (private); 5,498 (public); diff = 24,238
• s.d. = 2,196 (private); 1,894 (public)
• s.e. = √( s1²/n1 + s2²/n2 ) = √( 4,822,416/15 + 3,587,236/15 ) ≈ 749
38
N = 15 in each group; diff = 24,238; s.e. = 749
The Picture
[Figure: normal curve centered at 24,238; ±1 s.e.: 23,489 to 24,987 (68%); ±2 s.e.: 22,740 to 25,736 (95%); ±3 s.e.: 99%]
39
Confidence Intervals for difference
of tuition means example
• 68% confidence interval = 24,238 ± 749 = [23,489 to 24,987]
• 95% confidence interval = 24,238 ± 2*749 = [22,740 to 25,736]
• 99% confidence interval = 24,238 ± 3*749 = [21,991 to 26,485]
40
What if someone (ahead of time) had
said, “Private universities are no
more expensive than public
universities”
• Note that $0 is well out of the 99%
confidence interval, [$21,991 to $26,485]
• Q: How far away is the $0 estimate from the sample difference?
– A: Do it in z-scores: (24,238-0)/749 = 32.4
41
Constructing confidence intervals of
difference of proportions
• Let us say we drew a sample of 1,000 adults and asked them if they approved of the way George Bush was handling his job as president (March 13-16, 2006 Gallup Poll). We focus on the 600 who are either independents or Democrats. Can we estimate whether independents and Democrats view Bush differently?
• N = 300 ind.; 300 Dem.
• p = .29 (ind.); .10 (Dem.); diff = .19
• s.e. = √( p1(1 − p1)/n1 + p2(1 − p2)/n2 ) = √( .29(1 − .29)/300 + .10(1 − .10)/300 ) ≈ .03
42
diff. p. = .19; s.e. = .03
The Picture
[Figure: normal curve centered at .19; ±1 s.e.: .16 to .22 (68%); ±2 s.e.: .13 to .25 (95%); ±3 s.e.: 99%]
43
Confidence Intervals for Bush
Ind/Dem approval example
• 68% confidence interval = .19 ± .03 = [.16 to .22]
• 95% confidence interval = .19 ± 2*.03 = [.13 to .25]
• 99% confidence interval = .19 ± 3*.03 = [.10 to .28]
44
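The same arithmetic in code, using the slide's numbers:

from math import sqrt

p1, n1 = 0.29, 300      # independents approving
p2, n2 = 0.10, 300      # Democrats approving

diff = p1 - p2
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)      # about 0.031

for mult, label in [(1, "68%"), (2, "95%"), (3, "99%")]:
    print(f"{label}: [{diff - mult * se:.2f} to {diff + mult * se:.2f}]")

# z against "no difference"; about 6.1 with the unrounded s.e.
# (the next slide's 6.33 uses the rounded s.e. of .03)
print(f"z = {diff / se:.2f}")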
What if someone (ahead of time) had
said, “I think Democrats and
Independents are equally
unsupportive of Bush”?
• Note that 0% is well out of the 99%
confidence interval, [10% to 28%]
• Q: How far away is the 0% estimate from the sample difference in proportions?
– A: Do it in z-scores: (.19-0)/.03 = 6.33
45
Constructing confidence intervals of
differences of means in a paired
sample
• Let’s say we draw a sample of tuitions from 15
private universities, in 2003 and again in 2004.
Can we estimate what the difference in average
tuitions is between the two years?
• N = 15
• Averages = 28,102 (2003); 29,735 (2004); diff = 1,632
• s.d. = 2,196 (2004); 1,894 (2003); 886 (diff)
• s.e. = sd / √n = 886 / √15 ≈ 229
46
N = 15; diff = 1,632; s.e. = 229
The Picture
[Figure: normal curve centered at 1,632; ±1 s.e.: 1,403 to 1,861 (68%); ±2 s.e.: 1,174 to 2,090 (95%); ±3 s.e.: 99%]
47
Confidence Intervals for second
difference of tuition means example
• 68% confidence interval = 1,632 ± 229 = [1,403 to 1,861]
• 95% confidence interval = 1,632 ± 2*229 = [1,174 to 2,090]
• 99% confidence interval = 1,632 ± 3*229 = [945 to 2,319]
48
What if someone (ahead of time) had
said, “Private university tuitions did
not grow from 2003 to 2004”
• Note that $0 is well out of the 99%
confidence interval, [$1,174 to $2,090]
• Q: How far away is the $0 estimate from the sample difference?
– A: Do it in z-scores: (1,632-0)/229 = 7.13
49
The Stata output
. gen difftuition=tuition2004-tuition2003
. ttest diff=0 in 1/15

One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
difftu~n |      15      1631.6    228.6886     885.707    1141.112    2122.088
------------------------------------------------------------------------------
    mean = mean(difftuition)                                      t =   7.1346
Ho: mean = 0                                     degrees of freedom =       14

   Ha: mean < 0               Ha: mean != 0                 Ha: mean > 0
 Pr(T < t) = 1.0000       Pr(|T| > |t|) = 0.0000         Pr(T > t) = 0.0000
50
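The same test can be reproduced from the summary statistics alone. The Python sketch below (scipy, with the Stata numbers above as inputs) recovers the t statistic and the 95% interval, and shows why Stata's interval is a bit wider than the ±2 s.e. interval on the previous slide: it uses the t multiplier of about 2.14 for 14 degrees of freedom.

from math import sqrt
from scipy import stats

# summary statistics from the Stata output above
n, mean_diff, sd_diff = 15, 1631.6, 885.707

se = sd_diff / sqrt(n)                      # 228.69
t = mean_diff / se                          # 7.13
df = n - 1

t_crit = stats.t.ppf(0.975, df)             # about 2.14
print(f"t = {t:.4f} on {df} df")
print(f"95% CI: [{mean_diff - t_crit * se:.3f} to {mean_diff + t_crit * se:.3f}]")
print(f"two-sided p = {2 * stats.t.sf(abs(t), df):.4f}")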
Constructing confidence intervals of
regression coefficients
• Let's look at the relationship between the seat loss suffered by the President's party at midterm and the President's Gallup poll rating

[Figure: scatterplot of midterm House seat loss (1938-2002) against the President's November Gallup approval rating, with fitted regression line]

Slope = 1.97; N = 14; s.e.r. = 13.8; sx = 8.14
s.e. of slope = ( s.e.r. / √(n − 1) ) * ( 1 / sx ) = ( 13.8 / √13 ) * ( 1 / 8.14 ) ≈ 0.47
51
N = 14; slope = 1.97; s.e. = 0.47
The Picture
[Figure: normal curve centered at 1.97; ±1 s.e.: 1.50 to 2.44 (68%); ±2 s.e.: 1.03 to 2.91 (95%); ±3 s.e.: 99%]
52
Confidence Intervals for regression
example
• 68% confidence interval = 1.97 ± 0.47 = [1.50 to 2.44]
• 95% confidence interval = 1.97 ± 2*0.47 = [1.03 to 2.91]
• 99% confidence interval = 1.97 ± 3*0.47 = [0.56 to 3.38]
53
What if someone (ahead of time) had
said, “There is no relationship
between the president’s popularity
and how his party’s House members
do at midterm”?
• Note that 0 is well out of the 99% confidence interval, [0.56 to 3.38]
• Q: How far away is the 0 estimate from the sample slope?
– A: Do it in z-scores: (1.97-0)/0.47 = 4.19
54
The Stata output
. reg loss gallup if year>1948

      Source |       SS       df       MS              Number of obs =      14
-------------+------------------------------           F(  1,    12) =   17.53
       Model |  3332.58872     1  3332.58872           Prob > F      =  0.0013
    Residual |  2280.83985    12  190.069988           R-squared     =  0.5937
-------------+------------------------------           Adj R-squared =  0.5598
       Total |  5613.42857    13  431.802198           Root MSE      =  13.787

------------------------------------------------------------------------------
        loss |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gallup |    1.96812   .4700211     4.19   0.001     .9440315    2.992208
       _cons |  -127.4281   25.54753    -4.99   0.000    -183.0914   -71.76486
------------------------------------------------------------------------------
55
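The slope standard error in this output is exactly what the formula on the earlier slide gives. The sketch below checks that equivalence on synthetic data (the actual seat-loss series is not reproduced here), comparing the formula with scipy.stats.linregress.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# synthetic stand-in data (the lecture's seat-loss/approval series is not reproduced here)
n = 14
x = rng.normal(50, 8, size=n)                         # "approval"
y = -130 + 2.0 * x + rng.normal(0, 14, size=n)        # "seat loss"

res = stats.linregress(x, y)

# formula from the earlier slide: s.e.(slope) = (s.e.r. / sqrt(n - 1)) * (1 / s_x)
resid = y - (res.intercept + res.slope * x)
ser = np.sqrt((resid ** 2).sum() / (n - 2))           # root mean squared error (Root MSE)
s_x = x.std(ddof=1)

print(f"linregress stderr: {res.stderr:.4f}")
print(f"formula:           {ser / np.sqrt(n - 1) / s_x:.4f}")   # identical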
z vs. t
56
If n is sufficiently large, we know the
distribution of sample means/coeffs. will
obey the normal curve
[Figure: normal curve with 68%, 95%, and 99% regions at ±1, ±2, and ±3 standard errors around the mean]
57
Therefore….
• When the sample size is large (i.e., > 150),
convert the difference into z units and
consult a z table
Z = (H1 - H0) / s.e.
61
Reading a z table
62
63
Therefore….
• When the sample size is small (i.e., <150),
convert the difference into t units and
consult a t table
t = (H1 - H0) / s.e.
64
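In code, the only difference between the two recipes is which reference distribution supplies the tail probability. A sketch, using the regression example's z of 4.19:

from scipy import stats

z = 4.19          # (H1 - H0) / s.e., e.g. the regression slope example
n = 14

# large sample: consult the normal (z) distribution
p_z = 2 * stats.norm.sf(abs(z))

# small sample: consult the t distribution (df = n - 2 = 12 for a bivariate regression)
p_t = 2 * stats.t.sf(abs(z), df=n - 2)

print(f"two-sided p from the z table: {p_z:.5f}")
print(f"two-sided p from the t table: {p_t:.5f}")   # larger, because of the fatter tails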
t
(when the sample is small)
[Figure: the t-distribution, with fatter tails, overlaid on the z (normal) distribution]
65
Reading a t table
66
A word about standard errors and
collinearity
• The problem: if X1 and X2 are highly
correlated, then it will be difficult to
precisely estimate the effect of either one of
these variables on Y
67
How does having another collinear
independent variable affect standard
errors?
s.e.(b1) = √( 1 / (N − n − 1) ) * [ SY √(1 − R²Y) ] / [ SX1 √(1 − R²X1) ]

where N is the number of observations, n the number of independent variables, R²Y the R² of the regression of Y on all the independent variables, and R²X1 the R² of the "auxiliary regression" of X1 on all the other independent variables
68
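A small simulation shows the effect. The sketch below (synthetic data) computes the auxiliary R² of X1 on X2 and the standard error of X1's coefficient as the two predictors become more correlated:

import numpy as np

rng = np.random.default_rng(3)
N = 1_000

def slope_se(y, X):
    # OLS standard errors of the slope coefficients in y = a + Xb + e
    n_obs = len(y)
    Xd = np.column_stack([np.ones(n_obs), X])
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ b
    sigma2 = (resid ** 2).sum() / (n_obs - Xd.shape[1])
    cov = sigma2 * np.linalg.inv(Xd.T @ Xd)
    return np.sqrt(np.diag(cov))[1:]                 # drop the intercept

for r in (0.0, 0.5, 0.9, 0.99):
    x1 = rng.normal(size=N)
    x2 = r * x1 + np.sqrt(1 - r ** 2) * rng.normal(size=N)
    y = x1 + x2 + rng.normal(size=N)

    aux_r2 = np.corrcoef(x1, x2)[0, 1] ** 2          # auxiliary regression of x1 on x2
    se_b1 = slope_se(y, np.column_stack([x1, x2]))[0]
    print(f"corr(x1, x2) = {r:.2f}: auxiliary R2 = {aux_r2:.2f}, s.e.(b1) = {se_b1:.3f}")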
Example: Effect of party, ideology,
and religiosity on feelings toward
Quincy Bush
                 Bush Feelings   Conserv.   Repub.   Relig.
Bush Feelings        1.0
Conserv.             .39           1.0
Repub.               .57           .46       1.0
Religious            .16           .18       .06      1.0
69
Regression table
            (1)        (2)        (3)        (4)
Intercept   32.7       32.9       32.6       29.3
            (0.85)     (1.08)     (1.20)     (1.31)
Repub.      6.73       5.86       6.64       5.88
            (0.244)    (0.27)     (0.241)    (0.27)
Conserv.    ---        2.11       ---        1.87
                       (0.30)                (0.30)
Relig.      ---        ---        7.92       5.78
                                  (1.18)     (1.19)
N           1575       1575       1575       1575
R²          .32        .35        .35        .36

(standard errors in parentheses)
70