Statistics and Inference
Paul Schrimpf
January 14, 2016
Contents

1 Properties of estimators
  1.1 Unbiasedness
  1.2 Variance of estimators

2 Finite sample inference

3 Asymptotics
  3.1 Convergence in probability
  3.2 Convergence in distribution

4 Hypothesis testing
  4.1 Examples
  4.2 Asymptotic interpretation of p-values
References
• Wooldridge (2013) appendix C
• Abbring (2001) section 2.4
• Diez, Barr, and Cetinkaya-Rundel (2012) chapters 4, 5, & 6
• Grinstead and Snell (2003) chapters 8 & 9
Statistics
• Interested in parameters of population, e.g.
– Moments, functions of moments
– Conditional expectation functions
– Distribution functions
• Learn about parameters by observing sample from the population
• Use sample to form estimates of parameters
1 Properties of estimators

1.1 Unbiasedness
Unbiasedness
• Setup:
– Sample {y1 , ..., yn }
– Parameter θ
– Estimator W , some function of sample
• The bias of W is Bias(W ) = E[W ] − θ
• W is unbiased if E[W ] = θ
• Examples:
  – Sample mean: ȳ = (1/n)∑_{i=1}^n yi is unbiased
  – Sample variance: sy² = (1/n)∑_{i=1}^n (yi − ȳ)² is biased
An example of a parameter is the population mean, E[Y]. An estimator for the population mean is the
sample mean, ȳ = (1/n)∑_{i=1}^n yi. Another (less useful) estimator of the population mean is just the first
observation, y1. Samples are random, and estimators are functions of a sample, so estimators are random
variables. Therefore, it makes sense to think about the distribution of an estimator, the mean of an
estimator, etc.
For an estimator to be useful, it should be related to a parameter in some nice way. One thing we might
want to know about an estimator is whether, on average, it equals the parameter we want to estimate. If it
does, the estimator is unbiased.
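A minimal R sketch, in the spirit of the demonstrations later in these notes, illustrating both claims; the N(0,1) population, the sample size, and the number of replications are arbitrary choices:

## Simulate the bias of the sample mean and of the 1/n sample variance.
## Samples are N(0,1), so E[Y] = 0 and Var(Y) = 1.
set.seed(42)
n <- 10          # sample size
reps <- 10000    # number of simulated samples
ybar <- numeric(reps)
s2 <- numeric(reps)
for (r in 1:reps) {
  y <- rnorm(n)
  ybar[r] <- mean(y)
  s2[r] <- mean((y - mean(y))^2)   # divides by n, not n - 1
}
mean(ybar)   # close to 0: the sample mean is unbiased
mean(s2)     # close to (n - 1)/n = 0.9, not 1: the 1/n sample variance is biased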
1.2 Variance of estimators
Bias is not the only property we should want our estimators to have. Both the sample mean and just the first
observation are unbiased estimators of the population mean. However, we should expect (and it is true)
that the sample mean is more likely to be close to the population mean than just the first observation. The
variance and mean square error of an estimator are two ways of quantifying how variable an estimator is.
Variance of estimators
• The variance of W is Var(W)
• The standard error of W is √Var(W)
• If W1 and W2 are two unbiased estimators, then W1 is relatively efficient if Var(W1) ≤ Var(W2)
• The mean square error of W is MSE(W) = E[(W − θ)²]
  – MSE(W) = Bias(W)² + Var(W)
• Prefer estimators that are relatively efficient and/or have low MSE
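A small R sketch comparing the two unbiased estimators of the population mean discussed above, the sample mean and the first observation alone; the N(2,1) population and the simulation sizes are arbitrary choices:

## Compare bias and MSE of the sample mean versus the first observation.
## Population is N(2, 1), so the true mean is 2 and Var(Y) = 1.
set.seed(1)
n <- 25
reps <- 10000
w1 <- replicate(reps, mean(rnorm(n, mean = 2)))   # sample mean
w2 <- replicate(reps, rnorm(1, mean = 2))         # first observation only
c(bias.w1 = mean(w1) - 2, bias.w2 = mean(w2) - 2)         # both roughly zero
c(mse.w1 = mean((w1 - 2)^2), mse.w2 = mean((w2 - 2)^2))   # about 1/n = 0.04 versus about 1
## The sample mean is relatively efficient: same (zero) bias, much smaller MSE.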
Variance of sample mean
• Sample mean: ȳ = (1/n)∑_{i=1}^n yi
• Variance of sample mean:

      Var(ȳ) = Var( (1/n)∑_{i=1}^n yi )
             = Cov( (1/n)∑_{i=1}^n yi , (1/n)∑_{j=1}^n yj )
             = (1/n²) ∑_{i=1}^n ∑_{j=1}^n Cov(yi, yj)

  assume independent observations

             = (1/n²) ∑_{i=1}^n Var(yi)
             = (1/n) Var(Y)

• Standard error of sample mean: SE(ȳ) = √(Var(Y)/n)
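A quick simulation check of the formula Var(ȳ) = Var(Y)/n; the uniform population, the sample size, and the number of replications are arbitrary choices:

## Check Var(ybar) = Var(Y)/n by simulation.
## Y is uniform on [0,1], so Var(Y) = 1/12.
set.seed(7)
n <- 30
reps <- 20000
ybar <- replicate(reps, mean(runif(n)))
var(ybar)     # simulated variance of the sample mean
(1/12) / n    # theoretical Var(Y)/n
sd(ybar)      # simulated standard error, close to sqrt(Var(Y)/n)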
2 Finite sample inference
Knowing estimators’ variances and MSEs is useful for choosing which estimator to use. Once we have
chosen an estimator and calculated it for a given sample, we would like to know something about how
“close” our estimate is to the parameter we’re interested in. Statistical inference is the tool for doing this.
Estimators are functions of a sample of random variables, so estimators themselves are random variables. If
we draw different samples, or repeat our experiment, our estimator will take on different values. Therefore,
we can think about describing the distribution of an estimator.
Inference
• Inference: (roughly) how close are our sample estimates to the population parameters
• Example: n independent observations distributed Bernoulli(p), Xi = 1 if success, else 0
  – Joint distribution f(x1, ..., xn) = ∏_{i=1}^n p^{xi}(1 − p)^{1−xi} = p^{∑ xi}(1 − p)^{n − ∑ xi}
  – Could use to compute how surprising the sample x̄ that we observe is if the true p = p0,
    e.g. compute P( |(1/n)∑_{i=1}^n Xi − p0| ≥ |x̄ − p0| )
• Example: xi normally distributed, can do inference on sample mean using t-tests
• Problem: usually do not know distribution of sample and/or distribution of estimator intractable
In a few cases, such as when our sample consists of Bernoulli or normal random variables, we can
calculate the exact distribution of some estimators. Unfortunately, most of the time, calculating the exact
distribution of an estimator is either very complicated or requires too many strong assumptions.
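For the Bernoulli example, the exact finite-sample p-value can be computed directly from the binomial distribution. A minimal R sketch; the sample size, p0, and the observed number of successes below are invented for illustration:

## Exact probability of a sample mean at least as far from p0 as the observed x-bar,
## computed from the Binomial(n, p0) distribution of the number of successes.
n <- 50
p0 <- 0.5
successes <- 32                             # observed number of successes (made up)
xbar <- successes / n
k <- 0:n
far <- abs(k / n - p0) >= abs(xbar - p0)    # outcomes as or more extreme than observed
sum(dbinom(k[far], size = n, prob = p0))    # exact two-sided p-value under p = p0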
3 Asymptotics
When we cannot calculate the exact finite sample distribution of an estimator, we must try to approximate
the distribution somehow. Fortunately, in large samples, most estimators have a distribution that can be
calculated. We can use this large sample distribution to approximate the finite sample distribution of an
estimator.
Asymptotic inference
• Idea: use limit of distribution of estimator as N→∞ to approximate finite sample distribution
of estimator
• Notation:
– Sequence of samples of increasing size n, Sn = {y1 , ..., yn }
– Estimator for each sample Wn
• References:
– Wooldridge (2013) appendix C
– Menzel (2009) especially VI-IX
3.1 Convergence in probability
Loosely speaking you can think of probability limits as the large sample version of expectations.
Convergence in probability
• Wn converges in probability to θ if for every ε > 0,

      lim_{n→∞} P(|Wn − θ| > ε) = 0,

  denoted plim Wn = θ or Wn →p θ
• Shown using a law of large numbers: if y1, ..., yn are i.i.d. with mean µ, or if y1, ..., yn have finite
  expectations and ∑_{i=1}^∞ Var(yi)/i² is finite, then ȳ →p µ
• Properties:
  – plim g(Wn) = g(plim Wn) if g is continuous (continuous mapping theorem (CMT))
  – If Wn →p ω and Zn →p ζ, then (Slutsky’s lemma)
    * Wn + Zn →p ω + ζ
    * Wn Zn →p ωζ
    * Wn/Zn →p ω/ζ (for ζ ≠ 0)
• Wn is a consistent estimate of θ if Wn →p θ
Probability limits are much easier to work with than expectations, mostly because of the continuous
mapping theorem and the second and third parts of Slutsky’s lemma. These properties of probability limits
do not apply to expectations.
Demonstration of LLN
n <- seq(1, 200)   ## sample sizes
sims <- 10         ## number of simulations for each sample size
rand <- function(N) { return(runif(N)) }
## simulation
sampleMean <- matrix(0, nrow = length(n), ncol = sims)
for (i in 1:length(n)) {
  N <- n[i]
  x <- matrix(rand(N * sims), nrow = N, ncol = sims)
  sampleMean[i, ] <- colMeans(x)
}
## plot sample means
library(ggplot2)
library(reshape)
df <- data.frame(sampleMean)
df$n <- n
df <- melt(df, id = "n")
llnPlot <- ggplot(data = df, aes(x = n, y = value, colour = variable)) +
  geom_line() + scale_colour_brewer(palette = "Paired") +
  scale_y_continuous(name = "sample mean") +
  guides(colour = FALSE)
llnPlot
Demonstration of LLN
[Figure: sample means of uniform(0,1) draws plotted against sample size n from 1 to 200; each line is one simulation, and the lines settle toward 0.5 as n grows.]
Convergence in probability: examples
• ȳ →p E[y] by LLN
• Sample variance:

      σ̂² = (1/n)∑_{i=1}^n (yi − ȳ)²
         = (1/n)∑_{i=1}^n yi²  −  ȳ²

  where (1/n)∑_{i=1}^n yi² →p E[y²] by LLN and ȳ² →p E[y]² by LLN and CMT, so

      σ̂² →p E[y²] − E[y]² = Var(y)

• Mean divided by variance:

      ( (1/n)∑_{i=1}^n yi ) / ( (1/n)∑_{i=1}^n (yi − ȳ)² ) = ȳ/σ̂² →p E[y]/Var(y)

  by the above and Slutsky’s lemma
Convergence in probability: examples
• Sample correlation:

      ρ̂ = ( (1/n)∑_{i=1}^n (xi − x̄)(yi − ȳ) ) / √( (1/n)∑_{i=1}^n (xi − x̄)² · (1/n)∑_{i=1}^n (yi − ȳ)² )

  – (1/n)∑_{i=1}^n (xi − x̄)² →p Var(X) and (1/n)∑_{i=1}^n (yi − ȳ)² →p Var(Y) as above
  – (1/n)∑_{i=1}^n (xi − x̄)² · (1/n)∑_{i=1}^n (yi − ȳ)² →p Var(X)Var(Y) by Slutsky’s lemma
  – √( (1/n)∑_{i=1}^n (xi − x̄)² · (1/n)∑_{i=1}^n (yi − ȳ)² ) →p √(Var(X)Var(Y)) by CMT
  – (1/n)∑_{i=1}^n (xi − x̄)(yi − ȳ) →p Cov(X, Y) by similar reasoning as for the sample variance
  – ρ̂ →p Corr(X, Y) by Slutsky’s lemma
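A small simulation in the same spirit as the LLN demonstration above, illustrating the consistency of the sample correlation; the joint normal design and the true correlation of 0.6 are arbitrary choices:

## Consistency of the sample correlation.
## x and y are constructed so that Corr(x, y) = 0.6 in the population.
set.seed(3)
rho <- 0.6
for (n in c(10, 100, 1000, 10000)) {
  x <- rnorm(n)
  y <- rho * x + sqrt(1 - rho^2) * rnorm(n)
  cat("n =", n, " sample correlation =", round(cor(x, y), 3), "\n")
}
## As n grows, the sample correlation settles near the population value 0.6.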
3.2 Convergence in distribution
Convergence in probability tells us that an estimator will eventually be very close to some number (hopefully
the parameter we’re trying to estimate), but it does not tell us about the distribution of an estimator.
From the law of large numbers, we know that ȳn − µ converges to the constant 0. The central limit
theorem tells us that if we scale up the difference by √n, then √n(ȳn − µ) converges to a normal distribution
with variance Var(y) = σ².
Convergence in distribution
• Let Fn be the CDF of Wn and W be a random variable with CDF F
• Wn converges in distribution to W, written Wn →d W, if lim_{n→∞} Fn(x) = F(x) for all x where F
  is continuous
• Central limit theorem: let {y1, ..., yn} be i.i.d. with mean µ and variance σ²; then Zn = √n (ȳn − µ)/σ
  converges in distribution to a standard normal random variable
  – As with the LLN, the i.i.d. condition can be relaxed if additional moment conditions are
    added
  – Demonstration
• Properties:
  – If Wn →d W, then g(Wn) →d g(W) for continuous g (continuous mapping theorem (CMT))
  – Slutsky’s theorem: if Wn →d W and Zn →p c, then (i) Wn + Zn →d W + c, (ii) Wn Zn →d cW,
    and (iii) Wn/Zn →d W/c (for c ≠ 0)
The central limit theorem can be paraphrased as “scaled and centered sample means converge in
distribution to normal random variables.” “Scaling” refers to the fact that we multiply by √n. “Centering”
refers to the subtraction of the population mean. If an estimator converges in distribution to a normal
random variable, then we often say that the estimator is asymptotically normal.
Demonstration of CLT
N <- c(1, 2, 5, 10, 20, 50, 100)
simulations <- 5000
means <- matrix(0, nrow = simulations, ncol = length(N))
for (i in 1:length(N)) {
  n <- N[i]
  dat <- matrix(rbeta(n * simulations,
                      shape1 = 0.5, shape2 = 0.5),
                nrow = simulations, ncol = n)
  means[, i] <- (apply(dat, 1, mean) - 0.5) * sqrt(n)
}
# Plotting
df <- data.frame(means)
colnames(df) <- N   # label each column by its sample size
df <- melt(df)
cltPlot <- ggplot(data = df, aes(x = value, fill = variable)) +
  geom_histogram(alpha = 0.2, position = "identity") +
  scale_x_continuous(name = expression(sqrt(n) * (bar(x) - mu))) +
  scale_fill_brewer(type = "div", palette = "RdYlGn",
                    name = "N", labels = N)
cltPlot
Demonstration of CLT
[Figure: histograms of the scaled sample mean √n(x̄ − µ) for Beta(0.5, 0.5) samples of size N = 1, 2, 5, 10, 20, 50, 100; as N grows, the histograms approach a normal shape.]
CLT examples
• Mean: √n (ȳ − µ)/σ →d N(0, 1) by CLT
• What if σ is estimated instead of known? Consider the statistic

      √n (ȳ − µ)/σ̂

  – σ̂ →p σ from before
  – √n(ȳ − µ) →d N(0, σ²) by CLT
  – Slutsky’s theorem with Wn = √n(ȳ − µ), Zn = σ̂ gives

      √n (ȳ − µ)/σ̂ →d N(0, 1)
CLT examples
• Variance:
  – σ̂² = (1/n)∑_{i=1}^n (yi − ȳ)²
       = (1/n)∑_{i=1}^n (yi − µ + µ − ȳ)²
       = [ (1/n)∑_{i=1}^n (yi − µ)² ] − (ȳ − µ)²
  – So

      √n(σ̂² − σ²) = √n [ (1/n)∑_{i=1}^n (yi − µ)² − σ² ] − √n(ȳ − µ)(ȳ − µ)

  – √n(ȳ − µ) →d N(0, σ²) and (ȳ − µ) →p 0, so √n(ȳ − µ)(ȳ − µ) →d 0
  – CLT implies √n [ (1/n)∑_{i=1}^n (yi − µ)² − σ² ] →d N(0, V)
  – Therefore, by Slutsky’s theorem, √n(σ̂² − σ²) →d N(0, V)

4 Hypothesis testing
The main way we will use the central limit theorem is to use the asymptotic distribution of an estimator
to approximate the estimator’s finite sample distribution. We do this so that we can make (approximate)
statements about how uncertain we are about a given estimate.
Hypothesis testing is one way we make statements about how random an estimate is. The logic of hypothesis
testing is slightly awkward. To understand it, keep in mind that we are trying to learn about a
parameter. Parameters are fixed characteristics of a population. They are not random. Therefore, it makes
no sense to talk about the probability that a parameter is a certain value or in a certain range. What is
random is the sample that we observe and the estimate that we calculate from the sample. Therefore, we
can calculate the probability that our estimate takes a value in some range. The probability that an
estimate takes on a value depends on the true population parameter. Therefore, we generally quantify
the randomness of an estimate by making statements like: assuming the true parameter is θ0, the
probability of getting an estimate as far or farther from θ0 than the one we observe is p.
Hypothesis testing
• Null hypothesis, H0
• p-value: if the null hypothesis is true, the probability of a sample as or more extreme than what
  is observed
• From 325: if yi ∼ N(µ, σ²), then √n (ȳ − µ)/σ̂ ∼ t(n − 1)
4.1 Examples
Example: testing mean
• From 325:
  – If yi ∼ N(µ, σ²), then √n (ȳ − µ)/σ̂ ∼ t distribution with n − 1 degrees of freedom
  – p-value for H0: µ = µ0 vs Ha: µ ≠ µ0 is

        p = P( |(1/n)∑_{i=1}^n yi − µ0| / sy ≥ |ȳ − µ0| / σ̂ )
          = 2 Ft( −|ȳ − µ0| / σ̂ ; n − 1 )

    where Ft( · ; n − 1) is the CDF of the t distribution with n − 1 degrees of freedom
• What if distribution of yi unknown?
  – Use CLT to approximate distribution of t-statistic
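A minimal R sketch of this exact t-based p-value, checked against R's built-in t.test; the simulated data and µ0 are invented for illustration:

## Exact t-test p-value for H0: mu = mu0, assuming normally distributed data.
set.seed(5)
y <- rnorm(20, mean = 1.3, sd = 1)
mu0 <- 1
n <- length(y)
tstat <- sqrt(n) * (mean(y) - mu0) / sd(y)   # sd() divides by n - 1
2 * pt(-abs(tstat), df = n - 1)              # two-sided p-value
t.test(y, mu = mu0)$p.value                  # same value from t.test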
Example: testing one mean
• From earlier slide,

      √n (ȳ − µ)/σ̂ →d N(0, 1)

• So,

      P( |(1/n)∑_{i=1}^n yi − µ0| / sy ≥ |ȳ − µ0| / σ̂ ) → 2 Φ( −|ȳ − µ0| / σ̂ )

  where Φ is the standard normal CDF
Data on deworming treatment and school participation from Miguel and Kremer (2003)
                      Treatment   Control
  Sample size             873        352
  Infection                 0.32       0.54
  School attendance         0.808      0.684
Example: testing one mean
• Test H0: E[infected|control] = 0.5 against Ha: E[infected|control] ≠ 0.5
  – Let Ii = 1 if infected, else 0
  – What is the sample variance of Ii?

        σ̂I² = (1/n)∑_{i=1}^n (Ii − Ī)² = (1/n)∑_{i=1}^n Ii² − Ī²
            = (1/n)∑_{i=1}^n Ii − Ī²          (Ii² = Ii)
            = Ī − Ī² = Ī(1 − Ī)

  – Test statistic

        z = √n (Ī − 0.5) / √(Ī(1 − Ī)) = √352 (0.54 − 0.5) / √(0.54(1 − 0.54)) = 1.506

  – P-value = 2(1 − Φ(z)) = 0.132
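For reference, this z statistic and p-value can be reproduced in a few lines of R using the control-group numbers from the table above (a sketch, not code from the original notes):

## One-sample z test of H0: E[infected | control] = 0.5.
n <- 352
Ibar <- 0.54
z <- sqrt(n) * (Ibar - 0.5) / sqrt(Ibar * (1 - Ibar))
z                    # about 1.506
2 * (1 - pnorm(z))   # two-sided p-value, about 0.132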
Example: testing two means
• Often want to test difference between two means
  – H0: treatment has no effect ∼ H0: treatment mean = control mean
• Let µT = treatment mean, µC = control mean
• H0: µT = µC
• CLT implies

      √nT (ĪT − µT) →d N(0, σT²) and √nC (ĪC − µC) →d N(0, σC²),

  i.e. ĪT is approximately N(µT, σT²/nT)
• Sum of two normals is normal
• If treatment and control groups are independent, then under H0, ĪT − ĪC is approximately
  N(0, σT²/nT + σC²/nC); more formally, we could show

      (ĪT − ĪC) / √(σ̂T²/nT + σ̂C²/nC) →d N(0, 1)
Example: testing two means
• Test H0: E[infected|control] = E[infected|treatment] against Ha: E[infected|control] >
  E[infected|treatment]
  – Test statistic:

        z = (ĪT − ĪC) / √(σ̂T²/nT + σ̂C²/nC)
          = (0.32 − 0.54) / √(0.32(1 − 0.32)/873 + 0.54(1 − 0.54)/352) = −7.12

  – P-value = Φ(z) = 5 × 10⁻¹³
• Test H0: E[school|control] = E[school|treatment] against Ha: E[school|control] < E[school|treatment]
  – Test statistic:

        z = (S̄T − S̄C) / √(σ̂T²/nT + σ̂C²/nC)
          = (0.808 − 0.684) / √(0.808(1 − 0.808)/873 + 0.684(1 − 0.684)/352) = 4.41

  – P-value = 1 − Φ(z) = 5 × 10⁻⁶
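The same check for the two-sample statistics, using the sample sizes and means from the table (again a sketch rather than code from the original notes):

## Two-sample z tests using the treatment/control summaries above.
nT <- 873; nC <- 352
## Infection rates
IT <- 0.32; IC <- 0.54
zI <- (IT - IC) / sqrt(IT * (1 - IT) / nT + IC * (1 - IC) / nC)
c(z = zI, p = pnorm(zI))         # z about -7.12, one-sided p about 5e-13
## School attendance
ST <- 0.808; SC <- 0.684
zS <- (ST - SC) / sqrt(ST * (1 - ST) / nT + SC * (1 - SC) / nC)
c(z = zS, p = 1 - pnorm(zS))     # z about 4.41, one-sided p about 5e-6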
The RAND health insurance experiment randomly assigned 3985 Americans to one of 14 different insurance plans from 1974 to 1982. All insurance plans included catastrophic coverage: families' total healthcare spending beyond a proportion of their income was completely covered. The least generous plan in the
experiment only had catastrophic coverage. The most generous plan offered comprehensive care for free.
There were also plans that had a deductible of $150 per person or $450 per family, after which everything
was covered. Finally, there were plans with coinsurance that required families to pay between 25% and
50% of the cost of health care out of pocket, up to the catastrophic limit. You can read more about the
RAND health insurance experiment in chapter 1 of Angrist and Pischke (2014) or in Aron-Dine, Einav,
and Finkelstein (2013).
Example: RAND health insurance experiment
[Figures summarizing the RAND health insurance experiment; not reproduced here.]
p-value pitfalls
• Hypothesis tests and p-values are a tool for quantifying uncertainty, but not the only tool
• Significance thresholds often over-emphasized
• Difference in significance is not necessarily a significant difference
– A being statistically significantly different from 0 and B not being statistically significantly different
  from 0 does not imply that A − B is statistically significant
• When testing many hypotheses, likely to find significant results by chance alone
– Researcher degrees of freedom and garden of forking paths can introduce many tests
inadvertently
These ideas are frequently discussed by the statistician Andrew Gelman on his blog.
http://andrewgelman.com/2005/06/14/the_difference/
http://andrewgelman.com/2011/09/09/the-difference-between-significant-and-not-significant/
http://andrewgelman.com/2016/01/04/plausibility-vs-probability-prior-distributions-garden-fork
http://andrewgelman.com/2013/11/06/marginally-significant/
4.2 Asymptotic interpretation of p-values
When we calculate p-values using the asymptotic distribution of an estimator, then in a given sample, we
know that our p-value is not exactly correct. However, we also know that for a sufficiently large sample,
the error in the p-value will be arbitrarily small. Exactly how large of a sample we need for the error to be
negligible depends on the true distribution of our sample and the number of parameters we estimated to
calculate the p-value.
Asymptotic interpretation of p-values
• CLT: under H0,

      zn = √n (ȳ − µ)/σ̂ →d N(0, 1),

  or equivalently,

      lim_{n→∞} P(zn ≤ x) = Φ(x)

• P-values are only correct asymptotically (for large samples)
• Small finite sample p-values will be somewhat off
• With exact p-values, under H0, p-values are distributed U[0, 1]
• With asymptotic p-values, under H0, p-values are only approximately U[0, 1] in large samples
Convergence of P(zn ≤ x)
[Figure: simulated CDF of zn for N = 2, 5, 10, 15, 20 together with the limiting standard normal CDF (N = ∞); the finite-sample CDFs approach Φ as N grows.]

Convergence of p-values
[Figure: actual size versus nominal size of the asymptotic test for N = 2, 5, 10, 15, 20 and N = ∞; the actual size approaches the 45-degree line as N grows.]
Code for graphs
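The code used to produce these graphs is not included in this extract. A minimal sketch in the spirit of the earlier demonstrations, simulating the finite-sample distribution of zn and the actual size of the asymptotic test (the exponential population, the sample sizes, and all object names are our own choices):

## Simulate asymptotic p-values under H0: mu = 1 for several sample sizes and
## compare their rejection rates (actual size) to the nominal size.
set.seed(10)
Ns <- c(2, 5, 10, 20)
sims <- 5000
pvals <- sapply(Ns, function(n) {
  z <- replicate(sims, {
    y <- rexp(n, rate = 1)              # true mean is 1, so H0 holds
    sqrt(n) * (mean(y) - 1) / sd(y)     # the statistic zn
  })
  2 * (1 - pnorm(abs(z)))               # asymptotic two-sided p-values
})
## Actual rejection rate at a 5% nominal size; approaches 0.05 as n grows
colMeans(pvals < 0.05)
## Actual size over a grid of nominal sizes (one column per sample size)
nominal <- seq(0.1, 0.9, by = 0.2)
apply(pvals, 2, function(p) sapply(nominal, function(a) mean(p < a)))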
References
Abbring, Jaap. 2001. “An Introduction to Econometrics: Lecture notes.” URL http://jabbring.home.xs4all.nl/courses/b44old/lect210.pdf.

Angrist, Joshua D and Jörn-Steffen Pischke. 2014. Mastering ’Metrics: The Path from Cause to Effect. Princeton University Press.

Aron-Dine, Aviva, Liran Einav, and Amy Finkelstein. 2013. “The RAND Health Insurance Experiment, Three Decades Later.” Journal of Economic Perspectives 27 (1):197–222. URL http://www.aeaweb.org/articles.php?doi=10.1257/jep.27.1.197.

Diez, David M, Christopher D Barr, and Mine Cetinkaya-Rundel. 2012. OpenIntro Statistics. OpenIntro. URL http://www.openintro.org/stat/textbook.php.

Grinstead, Charles M and J. Laurie Snell. 2003. Introduction to Probability. American Mathematical Society. URL http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/book.html.

Menzel, Konrad. 2009. “14.30 Introduction to Statistical Methods in Economics.” Massachusetts Institute of Technology: MIT OpenCourseWare. URL http://ocw.mit.edu/courses/economics/14-30-introduction-to-statistical-methods-in-economics-spring-2009/lecture-notes/.

Miguel, E. and M. Kremer. 2003. “Worms: identifying impacts on education and health in the presence of treatment externalities.” Econometrica 72 (1):159–217. URL http://onlinelibrary.wiley.com/doi/10.1111/j.1468-0262.2004.00481.x/abstract.

Wooldridge, J.M. 2013. Introductory econometrics: A modern approach. South-Western.