October 1999
Marie desJardins (mariedj@cs.umbc.edu)
CMSC 601
April 9, 2012
Material adapted from slides by Tom Dietterich, with permission
Given a set of measurements of a value, how certain can we be of the value?
Given a set of measurements of two values, how certain can we be that the two values are different?
Given a measured outcome, along with several condition or treatment values, how can we remove the effect of unwanted conditions or treatments on the outcome?
Here are 37 measurements of the CPU time required to compute C(10000, 500):
0.27 0.25 0.23 0.24 0.26 0.24 0.26 0.25 0.24 0.25
0.25 0.24 0.25 0.24 0.25 0.26 0.24 0.25 0.25 0.25
0.25 0.25 0.24 0.25 0.24 0.25 0.25 0.24 0.25 0.25
0.24 0.25 0.24 0.24 0.25 0.25 0.26
What is the “true” CPU cost of this computation?
Before doing any calculations, look at the data
Kernel density: place a small Gaussian distribution ("kernel") around each data point, and sum them
Useful for visualization; also often used as a regression technique
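As a rough illustration (not from the original slides), here is a minimal kernel density sketch in Python; the array holds the 37 CPU times from the earlier slide, and the bandwidth h = 0.005 is an arbitrary choice:

```python
import numpy as np

# The 37 CPU-time measurements (seconds) from the earlier slide.
times = np.array([
    0.27, 0.25, 0.23, 0.24, 0.26, 0.24, 0.26, 0.25, 0.24, 0.25,
    0.25, 0.24, 0.25, 0.24, 0.25, 0.26, 0.24, 0.25, 0.25, 0.25,
    0.25, 0.25, 0.24, 0.25, 0.24, 0.25, 0.25, 0.24, 0.25, 0.25,
    0.24, 0.25, 0.24, 0.24, 0.25, 0.25, 0.26,
])

def kernel_density(xs, data, h=0.005):
    """Place a small Gaussian 'kernel' of width h around each data
    point and sum them, normalized so the result integrates to 1."""
    bumps = np.exp(-0.5 * ((xs[:, None] - data[None, :]) / h) ** 2)
    return bumps.sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

xs = np.linspace(0.22, 0.28, 200)
density = kernel_density(xs, times)  # plot xs vs. density to visualize
```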
The data seem reasonably close to a normal (Gaussian, or bell curve) distribution
Given this assumption, we can compute a sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = 0.248$
How certain can we be that this is the true value?
Confidence interval [min, max]: if we repeatedly drew samples of size n = 37 and computed the sample means, 95% of the time the sample mean would lie between min and max
We can simulate this process algorithmically
Draw 1000 random subsamples (with replacement) from the original 37 points
This process makes no assumption about a Gaussian distribution!
Sort the means of these subsamples
Choose the 26th and 975th values as min and max of a 95% confidence interval (this range includes 95% of the sample means!)
Result: The resampled confidence interval is [0.245, 0.251]
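A minimal sketch of this resampling procedure in Python (the helper name `bootstrap_ci` is mine; it reuses the `times` array defined in the earlier sketch):

```python
import numpy as np

def bootstrap_ci(data, statistic=np.mean, n_resamples=1000, seed=0):
    """95% bootstrap CI: resample with replacement, compute the
    statistic on each resample, sort, and take the 26th and 975th
    of the 1000 sorted values."""
    rng = np.random.default_rng(seed)
    values = np.sort([
        statistic(rng.choice(data, size=len(data), replace=True))
        for _ in range(n_resamples)
    ])
    return values[25], values[974]  # 26th and 975th values (1-indexed)

low, high = bootstrap_ci(times)  # the slides report roughly [0.245, 0.251]
```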
The Central Limit Theorem says that the distribution of the sample means is approximately normal for large n, regardless of how the original data is distributed
If the original data is normally distributed with mean μ and standard deviation σ, then the sample means will be normally distributed with mean μ and standard deviation σ' = σ/√n (but we don't know the original μ and σ...):
$p(\bar{x}) = \frac{1}{\sqrt{2\pi}\,\sigma'}\, e^{-\frac{1}{2}\left(\frac{\bar{x} - \mu}{\sigma'}\right)^2}$
Note that it isn’t important to remember this formula, since Matlab, R, etc. will do this for you. But it is very important to understand why you are computing it!
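For example, in Python (a rough sketch: it substitutes the sample standard deviation for the unknown σ, and reuses the `times` array from the earlier sketch):

```python
import numpy as np
from scipy import stats

n = len(times)
sigma_prime = times.std(ddof=1) / np.sqrt(n)  # estimated std. dev. of the mean
low, high = stats.norm.interval(0.95, loc=times.mean(), scale=sigma_prime)
```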
Instead of assuming a normal distribution, we can use a t distribution (sometimes called a “Student’s t distribution”), which has three parameters: μ, σ, and the degrees of freedom (d.f. = n-1)
The probability density function looks somewhat like a normal distribution, but with heavier tails; as n increases, it approaches the normal distribution
Using the central limit theorem, this distribution yields slightly wider (more conservative) confidence limits than a normal distribution would:
We can use the mathematical formula for the t distribution to compute a p (typically, p = 0.95) confidence interval:
The 0.025 t-value, $t_{0.025}$, is the value such that the probability that $\mu - \mu' < t_{0.025}$ is 0.975
The 95% confidence interval is then $[\mu' - t_{0.025}, \mu' + t_{0.025}]$
For the CPU example, $t_{0.025}$ is 0.028, so the distributional confidence interval is [0.220, 0.276], wider than the bootstrapped CI of [0.245, 0.251]
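A sketch of the t-based interval in Python; `stats.t.interval` handles the degrees of freedom, and the numbers it produces for these data need not match the slide's values exactly:

```python
import numpy as np
from scipy import stats

n = len(times)
se = times.std(ddof=1) / np.sqrt(n)  # standard error of the mean
low, high = stats.t.interval(0.95, n - 1, loc=times.mean(), scale=se)
```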
The bootstrap method can be used to compute other sample statistics for which the distribution method isn’t appropriate:
median
mode
variance
Because the tails and outlying values may not be well represented in a sample, the bootstrap method is not as useful for statistics involving the “ends” of the distribution:
minimum
maximum
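If the bootstrap helper sketched earlier takes the statistic as a parameter (as the version above does), covering these cases is one line each:

```python
import numpy as np

median_low, median_high = bootstrap_ci(times, statistic=np.median)
var_low, var_high = bootstrap_ci(times, statistic=np.var)
```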
In CS, we often want to know how often something occurs:
How many times does a process complete successfully?
How many times do we correctly predict membership in a class?
How many times do we find the top search result?
Again, the sample rate θ’ is what we have observed, but we would like to know the “true” rate θ
Suppose we have observed 100 predictions of a decision tree, and 88 of them were correct
Draw many (say, 1000) samples of size n , with replacement, from the n observed predictions (here, n =100), and compute the sample classification rate
Sort the sample rates $\theta_i$ in increasing order
Choose the 26th and 975th values as the ends of the confidence interval: here, the confidence interval is [0.81, 0.94]
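A sketch of this in Python, treating the 100 predictions as a 0/1 array (the encoding is mine; any representation with a computable mean works):

```python
import numpy as np

# 88 correct (1) and 12 incorrect (0) predictions; order is irrelevant.
outcomes = np.array([1] * 88 + [0] * 12)

rng = np.random.default_rng(0)
rates = np.sort([
    rng.choice(outcomes, size=len(outcomes), replace=True).mean()
    for _ in range(1000)
])
low, high = rates[25], rates[974]  # the slides report roughly [0.81, 0.94]
```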
If we assume that the classifier is a “biased coin” with probability θ of coming up heads, then we can use the binomial distribution to analytically compute the confidence interval
This requires a small correction because the binomial distribution is actually discrete, but we want a continuous estimate
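One standard analytic construction is the Clopper-Pearson "exact" interval, which inverts the binomial distribution via the beta distribution; a sketch (this particular construction is my choice, not necessarily the one the slides used):

```python
from scipy import stats

k, n, alpha = 88, 100, 0.05
lower = stats.beta.ppf(alpha / 2, k, n - k + 1)      # defined as 0 when k == 0
upper = stats.beta.ppf(1 - alpha / 2, k + 1, n - k)  # defined as 1 when k == n
```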
Consider the CPU measurements of the earlier example, and suppose we have performed the same computation on a different machine, yielding these CPU times:
0.21 0.20 0.20 0.19 0.20 0.19 0.18 0.20 0.19 0.19
0.19 0.19 0.20 0.18 0.19 0.20 0.22 0.20 0.20 0.20
0.19 0.20 0.18 0.19 0.19 0.20 0.20 0.22 0.18 0.29
0.21 0.23 0.20
These times certainly seem faster than the first machine, which yielded a distributional confidence interval of [0.220, 0.276] – but how can we be sure?
Visually, the second machine (Shark3) is much faster than the first (Darwin):
Bootstrap testing:
Repeat many times:
Draw a bootstrap sample from each of the machines, and compute the sample means
If Shark3 is faster than Darwin more than 95% of the time, we can be 95% confident that it really is faster
We can also compute a 95% bootstrap confidence interval on the difference between the means; this turns out to be [0.0461, 0.0553]
If the samples are drawn from t distributions, then the difference between the sample means also has a t distribution
Confidence interval on this difference: [0.0463, 0.0555]
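A sketch of the bootstrap comparison in Python, with the Shark3 times in a new array `times2` and the Darwin times in the `times` array from earlier:

```python
import numpy as np

times2 = np.array([
    0.21, 0.20, 0.20, 0.19, 0.20, 0.19, 0.18, 0.20, 0.19, 0.19,
    0.19, 0.19, 0.20, 0.18, 0.19, 0.20, 0.22, 0.20, 0.20, 0.20,
    0.19, 0.20, 0.18, 0.19, 0.19, 0.20, 0.20, 0.22, 0.18, 0.29,
    0.21, 0.23, 0.20,
])

rng = np.random.default_rng(0)
diffs = np.sort([
    rng.choice(times, size=len(times), replace=True).mean()
    - rng.choice(times2, size=len(times2), replace=True).mean()
    for _ in range(1000)
])
shark3_wins = (diffs > 0).mean()   # fraction of resamples where Shark3 is faster
low, high = diffs[25], diffs[974]  # the slides report roughly [0.0461, 0.0553]
```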
Is the true difference zero, or more than zero?
Use classical statistical rejection testing
Null hypothesis: The two machines have the same speed
(i.e., μ, the difference in means, is equal to zero)
Can we reject this hypothesis, based on the observed data?
If the null hypothesis were true, what is the probability we would have observed this data?
We can measure this probability using the t distribution
In this case, the computed t value $= (\mu_1 - \mu_2)/\sigma' = 21.69$
The probability of seeing this t value, if μ were actually zero, is vanishingly small: the 99.999% confidence interval (under the null hypothesis) is [-4.59, 4.59], so the probability of this t value is (much) less than 0.00001
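scipy computes this test directly; a sketch (I use Welch's variant, which does not assume equal variances, a defensible default):

```python
from scipy import stats

t_stat, p_value = stats.ttest_ind(times, times2, equal_var=False)
# A tiny p_value means the observed difference would be wildly
# improbable if the two machines really had the same speed.
```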
Suppose we had a set of 10 different benchmark programs that we ran on the two machines, yielding these CPU times:
Obviously, we don’t want to just compare the means, since the programs have such different running times
CPU1 seems to be systematically faster (offset to the left) than CPU2
CPU1 is always faster than CPU2 (i.e., above the diagonal line that corresponds to equal speed)
The correlation of program "difficulty" (and the consistently faster speed of CPU1) is even more obvious in this line plot, ordered by program number:
If the differences are in the same "units," we can subtract the CPU times for the "paired" tests and assume a t distribution on these differences
The probability of observing a sample mean difference as large as 0.2779, given a null hypothesis of the machines having the same speed, is 0.0000466, so we can reject the null hypothesis
If we have no prior belief about which machine is faster, we should use a "two-tailed test"
The probability of observing a sample mean difference this large in either direction is 0.0000932: slightly larger, but still sufficiently improbable that we can be sure the machines have different speeds
Note that we can also use a bootstrap analysis on this type of paired data
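With one time per benchmark per machine in two equal-length arrays, scipy's paired t-test applies directly; a sketch with hypothetical placeholder data, since the slide's table is not reproduced in this text:

```python
import numpy as np
from scipy import stats

# Hypothetical placeholder times (seconds), one per benchmark program.
cpu1 = np.array([1.2, 5.0, 0.9, 12.1, 3.3, 7.8, 0.5, 2.2, 9.4, 4.1])
cpu2 = np.array([1.4, 5.6, 1.0, 13.0, 3.7, 8.5, 0.6, 2.5, 10.2, 4.6])

t_stat, p_two_tailed = stats.ttest_rel(cpu1, cpu2)  # null: mean difference == 0
p_one_tailed = p_two_tailed / 2  # only valid if the direction was predicted
                                 # in advance and the sign of t_stat matches
```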
If we don’t pair the data (just compare the overall mean, not the differences for paired tests):
Distributional analysis doesn’t let us reject the null hypothesis
Bootstrap analysis doesn’t let us reject the null hypothesis
I mentioned before that the paired t-test is appropriate if the measurements are in the same "units"
If the magnitude of the difference is not important, or not meaningful, we still can compare performance
Look at the sign of the difference (here, CPU1 is faster 10 out of 10 times; but in another case, it might only be faster 9 out of 10 times)
Use the binomial distribution (flip a coin to get the sign) to compute a confidence interval for the probability that CPU1 is faster
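The sign test is one call in scipy (a sketch; `binomtest` is the modern name, older versions used `binom_test`):

```python
from scipy import stats

# CPU1 faster in 10 of 10 paired benchmarks; null hypothesis: each
# machine wins a given benchmark with probability 0.5 (a coin flip).
result = stats.binomtest(10, n=10, p=0.5)
print(result.pvalue)                                # two-sided p-value
print(result.proportion_ci(confidence_level=0.95))  # CI on the win rate
```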
Regression analysis
Cross-validation
Human subjects analysis and user study design
Analysis of Variance (ANOVA)
For your particular investigation, you need to know which of these topics are relevant, and to learn about them!
Make sure you understand the nuances before you design your experiments...
...and definitely before you analyze your experimental data!
Designing the statistical methods (and hypotheses) after the fact is not valid!
You can often find a hypothesis and an associated statistical method ex post facto – i.e., design an experiment to fit the data instead of the other way around
In the worst case, doing this is downright unethical
In the best case, it shows a lack of clear research objectives and may not be reproducible or meaningful