Empirical Research Methods in
Computer Science
Lecture 2, Part 1
October 19, 2005
Noah Smith
Some tips
Perl scripts can be named encode instead of encode.pl
encode foo ≢ encode < foo
chmod u+x encode
Instead of making us run java Encode, write a shell script:
    #!/bin/sh
    cd `dirname $0`
    java Encode
Check that it works on (say) ugrad10.
Assignment 1
If you didn’t turn in a first version yesterday, don’t bother – just turn in the final version.
Final version due Tuesday 10/25, 8pm
We will post a few exercises soon.
Questions?
Today
Standard error
Bootstrap for standard error
Confidence intervals
Hypothesis testing
Notation
P is a population
$S = [s_1, s_2, \ldots, s_n]$ is a sample from P
Let $X = [x_1, x_2, \ldots, x_n]$ be some numerical measurement on the $s_i$, distributed over P according to unknown F
We may use Y, Z for other measurements.
Mean
What does mean mean?
$\mu_x$ is the population mean of x (depends on F)
$\mu_x$ is in general unknown
How do we estimate the mean?
Sample mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Gzip compression rate usually < 1, but not always
[Figure: gzip compression rate]
Accuracy
How good an estimate is the sample mean?
Standard error (se) of a statistic:
We picked one S from P.
How much would $\bar{x}$ vary across other samples from P?
There is some “true” se value!
Extreme cases
n → ∞
n = 1
Standard error (of the sample mean)
Known: $\mathrm{se}(\bar{x}) = \frac{\sigma_x}{\sqrt{n}}$, where $\sigma_x$ is the true standard deviation of x under F
“Standard error” = standard deviation of a statistic
[Figure: gzip compression rate]
Central Limit Theorem
The sampling distribution of the sample mean approaches a normal distribution as n increases.
$\bar{x} \sim N\left(\mu_x, \frac{\sigma_x^2}{n}\right)$ (approximately, for large n)
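The claim above can be checked with a small simulation. The following is only a sketch under stated assumptions: the population is a stand-in (Uniform(0, 1), not the gzip data), and the helper name `sample_means`, the seed, n = 25, and the 2000 trials are all illustrative choices.

```python
import math
import random
import statistics

def sample_means(draw, n, trials, rng):
    """Return `trials` sample means, each computed from n independent draws."""
    return [sum(draw(rng) for _ in range(n)) / n for _ in range(trials)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 25

# Stand-in population (an assumption): Uniform(0, 1), mean 0.5, sd sqrt(1/12).
means = sample_means(lambda r: r.random(), n, trials=2000, rng=rng)

predicted_se = math.sqrt(1 / 12) / math.sqrt(n)  # sigma_x / sqrt(n)
observed_sd = statistics.stdev(means)            # spread of the simulated sample means

print(f"predicted se = {predicted_se:.4f}, observed sd of means = {observed_sd:.4f}")
```

The two printed numbers should be close to each other, and a histogram of `means` should look roughly bell-shaped even though the underlying population is flat.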
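How to estimate $\sigma_x$?
“Plug-in principle”
$\hat{\sigma}_x = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
Therefore: $\widehat{\mathrm{se}}(\bar{x}) = \frac{\hat{\sigma}_x}{\sqrt{n}} = \sqrt{\frac{1}{n^2} \sum_{i=1}^{n} (x_i - \bar{x})^2}$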
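A minimal sketch of the plug-in estimate in Python. The function name `plugin_se_mean` and the five data values are illustrative assumptions; note the divisor n (plug-in) rather than n - 1.

```python
import math

def plugin_se_mean(xs):
    """Plug-in estimate of the standard error of the sample mean."""
    n = len(xs)
    mean = sum(xs) / n
    # Plug-in variance: average squared deviation, dividing by n (not n - 1).
    var_hat = sum((x - mean) ** 2 for x in xs) / n
    # se = sigma_hat / sqrt(n)
    return math.sqrt(var_hat / n)

# Hypothetical measurements, for illustration only.
xs = [3.0, 2.8, 3.7, 3.4, 3.5]
se = plugin_se_mean(xs)
print(f"mean = {sum(xs) / len(xs):.2f}, se = {se:.4f}")
```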
Plug-in principle
We don’t have (and can’t get) P
We don’t know F, the true distribution over X
We do have S (the sample)
We do know $\hat{F}$, the sample distribution over X
Good and Bad News
We have a formula to estimate the standard error of the sample mean!
We have a formula to estimate only the standard error of the sample mean!
variance
median
trimmed mean
ratio of means of x and y
correlation between x and y
Bootstrap world
unknown distribution F → observed random sample X → statistic of interest $\hat{\theta} = s(X)$
empirical distribution $\hat{F}$ → bootstrap random sample $X^*$ → bootstrap replication $\hat{\theta}^* = s(X^*)$
bootstrap replications → statistics about the estimate (e.g., standard error)
Bootstrap sample
X = [3.0, 2.8, 3.7, 3.4, 3.5]
X* could be:
[2.8, 3.4, 3.7, 3.4, 3.5]
[3.5, 3.0, 3.4, 2.8, 3.7]
[3.5, 3.5, 3.4, 3.0, 2.8]
...
Draw n elements with replacement.
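Drawing one bootstrap sample takes a few lines of Python. This is a sketch: the helper name `bootstrap_sample` and the seed are illustrative assumptions, while X is the sample from the slide above.

```python
import random

def bootstrap_sample(xs, rng):
    """Draw len(xs) elements from xs uniformly at random, with replacement."""
    return [rng.choice(xs) for _ in xs]

rng = random.Random(0)  # seed chosen arbitrarily, only for repeatability
X = [3.0, 2.8, 3.7, 3.4, 3.5]
X_star = bootstrap_sample(X, rng)
print(X_star)
```

Because the draws are with replacement, some values of X typically appear more than once in X* and others not at all.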
Reflection
Imagine doing this with a pencil and paper.
The bootstrap was born in 1979.
Typically, sampling is costly and computation is cheap.
In (empirical) CS, sampling isn’t even necessarily all that costly.
Bootstrap estimate of se
Let s( · ) be a function for computing an estimate
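True value of the standard error: $\mathrm{se}_F(\hat{\theta})$
Ideal bootstrap estimate: $\mathrm{se}_{\hat{F}}(\hat{\theta}^*)$
Bootstrap estimate with B bootstrap samples: $\widehat{\mathrm{se}}_B$
Bootstrap estimate of se
$\lim_{B \to \infty} \widehat{\mathrm{se}}_B = \mathrm{se}_{\hat{F}}(\hat{\theta}^*)$
$\widehat{\mathrm{se}}_B = \sqrt{\frac{1}{B-1} \sum_{i=1}^{B} \left(\hat{\theta}^*[i] - \bar{\theta}^*\right)^2}$, where $\hat{\theta}^*[i] = s(X^{*i})$ is the i-th bootstrap replication and $\bar{\theta}^* = \frac{1}{B} \sum_{i=1}^{B} \hat{\theta}^*[i]$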
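The bootstrap estimate with B samples can be sketched as follows. The function name `bootstrap_se`, the seed, and the choice of B = 200 are assumptions made for illustration; `stat` plays the role of s(·), and the data are the five values from the bootstrap sample example.

```python
import math
import random
import statistics

def bootstrap_se(xs, stat, B, rng):
    """Bootstrap estimate of the standard error of stat(xs) from B replications."""
    n = len(xs)
    reps = []
    for _ in range(B):
        resample = [rng.choice(xs) for _ in range(n)]  # n draws with replacement
        reps.append(stat(resample))                    # one bootstrap replication
    mean_rep = sum(reps) / B
    # Standard deviation of the replications, with divisor B - 1.
    return math.sqrt(sum((t - mean_rep) ** 2 for t in reps) / (B - 1))

rng = random.Random(0)  # arbitrary seed for repeatability
X = [3.0, 2.8, 3.7, 3.4, 3.5]
se_mean = bootstrap_se(X, statistics.mean, B=200, rng=rng)
se_median = bootstrap_se(X, statistics.median, B=200, rng=rng)
print(f"se(mean) ~ {se_mean:.3f}, se(median) ~ {se_median:.3f}")
```

The same function handles the median, for which no simple closed-form standard error was given; only the `stat` argument changes.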
Bootstrap, intuitively
We don’t know F.
We would like lots of samples from P, but we only have one (S).
We approximate F by $\hat{F}$
Plug-in principle!
Easy to generate lots of “samples” from $\hat{F}$
[Figures: bootstrap results for mean compression with B = 25, B = 50, and B = 200]
Correlation (another statistic)
Population P, sample S
Two values, $x_i$ and $y_i$, for each element $s_i$ of the sample
Correlation coefficient: ρ
Sample correlation coefficient: $r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$
Example: gzip compression r = 0.9616
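For reference, the sample correlation coefficient can be computed directly from its definition. This is a sketch: the function name `sample_correlation` and the small data set are made up for illustration and are not the gzip measurements.

```python
import math

def sample_correlation(xs, ys):
    """Sample correlation coefficient r of paired measurements."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
r = sample_correlation(xs, ys)
print(f"r = {r:.4f}")
```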
Accuracy of r
No general closed form for se(r).
If we assume x and y are bivariate Gaussian: $\mathrm{se}_{\mathrm{normal}}(r) = \frac{1 - r^2}{\sqrt{n - 3}}$
[Figure: $\mathrm{se}_{\mathrm{normal}}$ plotted against r and n, for n from 10 to 100]
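To see how the normal-theory formula behaves, it can be tabulated for a few values of r and n. This is a sketch: the function name `se_normal` and the grid of values are illustrative choices, and the formula applies only under the bivariate Gaussian assumption.

```python
import math

def se_normal(r, n):
    """Normal-theory standard error of r: (1 - r^2) / sqrt(n - 3)."""
    if n <= 3:
        raise ValueError("n must be greater than 3")
    return (1 - r * r) / math.sqrt(n - 3)

# The estimate shrinks as n grows and as |r| approaches 1.
for n in (10, 50, 100):
    row = [round(se_normal(r, n), 3) for r in (-0.5, 0.0, 0.5, 0.9)]
    print(n, row)
```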
Normality
Why assume the data are Gaussian?
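Alternative: bootstrap estimate of the standard error of r
$\widehat{\mathrm{se}}_B(r) = \sqrt{\frac{1}{B-1} \sum_{i=1}^{B} \left(r^*[i] - \bar{r}^*\right)^2}$, where $\bar{r}^* = \frac{1}{B} \sum_{i=1}^{B} r^*[i]$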
Example: gzip compression r = 0.9616
$\mathrm{se}_{\mathrm{normal}}(r) = 0.0024$
$\widehat{\mathrm{se}}_{200}(r) = 0.0298$
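A sketch of the bootstrap standard error for r. The key detail is that (x, y) pairs are resampled together, so each bootstrap sample keeps the pairing. The data, seed, and function names (`corr`, `bootstrap_se_corr`) are hypothetical; these are not the gzip measurements, so the numbers will not match the slide.

```python
import math
import random

def corr(pairs):
    """Sample correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
    syy = sum((y - y_bar) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

def bootstrap_se_corr(pairs, B, rng):
    """Bootstrap standard error of r, resampling whole (x, y) pairs."""
    n = len(pairs)
    # Caveat: a resample made of n copies of one pair has zero variance and
    # would divide by zero; that is vanishingly unlikely for this data.
    reps = [corr([rng.choice(pairs) for _ in range(n)]) for _ in range(B)]
    mean_rep = sum(reps) / B
    return math.sqrt(sum((t - mean_rep) ** 2 for t in reps) / (B - 1))

# Hypothetical paired data, for illustration only.
pairs = list(zip(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    [2.1, 1.4, 3.9, 3.2, 6.0, 4.8, 6.1, 8.9, 7.5, 9.6],
))
rng = random.Random(0)  # arbitrary seed for repeatability
r_hat = corr(pairs)
se_r = bootstrap_se_corr(pairs, B=200, rng=rng)
print(f"r = {r_hat:.4f}, bootstrap se = {se_r:.4f}")
```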
se bootstrap advice
Plot the data.
Runtime?
Efron and Tibshirani:
B = 25 is informative
B = 50 is often enough
Seldom need B > 200 (for se)
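One way to get a feel for this advice is to repeat the whole bootstrap several times for each B and look at how much the resulting se estimates move from run to run. The sketch below does that. The data, the seed, the 40 repetitions, and the helper name `bootstrap_se` are all assumptions for illustration.

```python
import random
import statistics

def bootstrap_se(xs, stat, B, rng):
    """Bootstrap standard error of stat(xs) from B replications."""
    reps = [stat(rng.choices(xs, k=len(xs))) for _ in range(B)]
    return statistics.stdev(reps)  # divisor B - 1

rng = random.Random(1)  # arbitrary seed for repeatability
X = [3.0, 2.8, 3.7, 3.4, 3.5]  # hypothetical sample

spread = {}
for B in (25, 50, 200):
    # Repeat the entire bootstrap 40 times and record each se estimate.
    runs = [bootstrap_se(X, statistics.mean, B, rng) for _ in range(40)]
    spread[B] = statistics.stdev(runs)  # run-to-run variability of the estimate
    print(f"B = {B:3d}: mean se = {statistics.mean(runs):.4f}, spread = {spread[B]:.4f}")
```

The average estimate should be similar for all three values of B, while the run-to-run spread shrinks as B grows.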
Summary so far
A parameter is a “true fact” about the distribution F; a statistic is computed from the sample
We don’t know F
For some parameter θ, we want:
an estimate, $\hat{\theta}$ (“θ hat”)
accuracy of that estimate (e.g., standard error)
For the mean, μ, we have a closed form for the standard error
For other θ, the bootstrap will help!