
Statistical Methods for Computer Science

Marie desJardins (mariedj@cs.umbc.edu)

CMSC 601

April 9, 2012

Material adapted from slides by Tom Dietterich, with permission


Statistical Analysis of Data

- Given a set of measurements of a value, how certain can we be of the value?

- Given a set of measurements of two values, how certain can we be that the two values are different?

- Given a measured outcome, along with several condition or treatment values, how can we remove the effect of unwanted conditions or treatments on the outcome?


Measuring CPU Time

- Here are 37 measurements of the CPU time required to compute C(10000, 500):

  0.27 0.25 0.23 0.24 0.26 0.24 0.26 0.25 0.24 0.25
  0.25 0.24 0.25 0.24 0.25 0.26 0.24 0.25 0.25 0.25
  0.25 0.25 0.24 0.25 0.24 0.25 0.25 0.24 0.25 0.25
  0.24 0.25 0.24 0.24 0.25 0.25 0.26

- What is the “true” CPU cost of this computation?

- Before doing any calculations, look at the distribution of the data (next slide)


CPU Data Distribution

[Figure: histogram of the 37 CPU time measurements]


Kernel Density Estimate

- Kernel density: place a small Gaussian distribution (“kernel”) around each data point, and sum them

- Useful for visualization; also often used as a regression technique
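A minimal Python sketch of the idea (the 0.005 bandwidth here is an arbitrary choice; a library routine such as scipy.stats.gaussian_kde picks one automatically):

    import numpy as np

    # The 37 CPU-time measurements from the earlier slide
    cpu_times = np.array([
        0.27, 0.25, 0.23, 0.24, 0.26, 0.24, 0.26, 0.25, 0.24, 0.25,
        0.25, 0.24, 0.25, 0.24, 0.25, 0.26, 0.24, 0.25, 0.25, 0.25,
        0.25, 0.25, 0.24, 0.25, 0.24, 0.25, 0.25, 0.24, 0.25, 0.25,
        0.24, 0.25, 0.24, 0.24, 0.25, 0.25, 0.26])

    def kde(xs, data, bandwidth=0.005):
        # Place a Gaussian of width `bandwidth` on each data point and sum,
        # normalizing so that the result integrates to 1
        kernels = np.exp(-0.5 * ((xs[:, None] - data[None, :]) / bandwidth) ** 2)
        return kernels.sum(axis=1) / (len(data) * bandwidth * np.sqrt(2 * np.pi))

    xs = np.linspace(0.22, 0.28, 200)
    density = kde(xs, cpu_times)   # plot xs vs. density to see the smooth curve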


Sample Mean


The data seems to be reasonably close to a normal (Gaussian, or bell curve) distribution

Given this assumption, we can compute a sample mean:

  \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = 0.248

How certain can we be that this is the true value?

Confidence interval [min, max]:
- If we repeatedly drew samples of size n = 37 and computed the sample means...
- ...95% of the time, this value would lie between min and max


Confidence Intervals via Resampling

We can simulate this process algorithmically:

1. Draw 1000 random subsamples (with replacement) from the original 37 points
   - This process makes no assumption about a Gaussian distribution!
2. Sort the means of these subsamples
3. Choose the 26th and 975th values as the min and max of a 95% confidence interval (this range includes 95% of the sample means!)

- Result: The resampled confidence interval is [0.245, 0.251]
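A sketch of this procedure in Python, reusing cpu_times from the earlier kernel-density sketch (the exact endpoints depend on the random seed):

    import numpy as np

    rng = np.random.default_rng(0)
    n = len(cpu_times)              # cpu_times: the 37 measurements, as before

    # 1000 resamples of size n, drawn with replacement; record each sample mean
    means = np.sort([rng.choice(cpu_times, size=n, replace=True).mean()
                     for _ in range(1000)])

    # The 26th and 975th sorted means (0-based: indices 25 and 974) bracket
    # the middle 95% of the resampled means
    print(means[25], means[974])    # roughly 0.245 and 0.251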


Confidence Intervals via Distributional Theory

- The Central Limit Theorem says that the distribution of the sample means is (approximately) normally distributed, regardless of the distribution of the original data

- If the original data is normally distributed with mean μ and standard deviation σ, then the sample means will be normally distributed with mean μ and standard deviation σ' = σ/√n (but we don’t know the original μ and σ...):

  p(\bar{x}) = \frac{1}{\sqrt{2\pi}\,\sigma'} \, e^{-\frac{1}{2}\left(\frac{\bar{x}-\mu}{\sigma'}\right)^2}

- Note that it isn’t important to remember this formula, since Matlab, R, etc. will do this for you. But it is very important to understand why you are computing it!


t Distribution

- Instead of assuming a normal distribution, we can use a t distribution (sometimes called a “Student’s t distribution”), which has three parameters: μ, σ, and the degrees of freedom (d.f. = n−1)

- The probability density function looks somewhat like a normal distribution, but with a lower peak and longer tails; as n increases, it approaches the normal distribution

- This distribution yields slightly wider (more conservative) confidence limits than the normal distribution of the central limit theorem:


Distributional Confidence Intervals

- We can use the mathematical formula for the t distribution to compute a p (typically, p = 0.95) confidence interval:
  - The 0.025 t-value, t_{0.025}, is the value such that the probability that μ − μ' < t_{0.025} is 0.975
  - The 95% confidence interval is then [μ' − t_{0.025}, μ' + t_{0.025}]

- For the CPU example, t_{0.025} is 0.028, so the distributional confidence interval is [0.220, 0.276] – wider than the bootstrapped CI of [0.245, 0.251]
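For reference, a minimal sketch of how a statistics library computes a t-based interval on the mean (here scipy, assuming the usual scale σ' = s/√n, with cpu_times as defined earlier):

    from scipy import stats

    mean = cpu_times.mean()              # the 37 measurements, as before
    sem = stats.sem(cpu_times)           # sigma' = s / sqrt(n)

    # 95% t-based confidence interval on the mean, with n-1 degrees of freedom
    lo, hi = stats.t.interval(0.95, df=len(cpu_times) - 1, loc=mean, scale=sem)
    print(lo, hi)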


Bootstrap Computations of Other Statistics

- The bootstrap method can be used to compute other sample statistics for which the distributional method isn’t appropriate:
  - median
  - mode
  - variance

- Because the tails and outlying values may not be well represented in a sample, the bootstrap method is not as useful for statistics involving the “ends” of the distribution:
  - minimum
  - maximum


Measuring the Number of Occurrences of Events

- In CS, we often want to know how often something occurs:
  - How many times does a process complete successfully?
  - How many times do we correctly predict membership in a class?
  - How many times do we find the top search result?

- Again, the sample rate θ' is what we have observed, but we would like to know the “true” rate θ


Bootstrap Confidence Intervals for Rates

- Suppose we have observed 100 predictions of a decision tree, and 88 of them were correct

1. Draw many (say, 1000) samples of size n, with replacement, from the n observed predictions (here, n = 100), and compute the sample classification rate for each
2. Sort the sample rates θᵢ in increasing order
3. Choose the 26th and 975th values as the ends of the confidence interval: here, the confidence interval is [0.81, 0.94]
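The same bootstrap recipe, sketched in Python on a 0/1 vector of outcomes (endpoints will vary slightly with the seed):

    import numpy as np

    rng = np.random.default_rng(0)
    outcomes = np.array([1] * 88 + [0] * 12)     # 88 correct out of 100

    rates = np.sort([rng.choice(outcomes, size=100, replace=True).mean()
                     for _ in range(1000)])
    print(rates[25], rates[974])                 # roughly 0.81 and 0.94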


Binomial Distributional Confidence

- If we assume that the classifier is a “biased coin” with probability θ of coming up heads, then we can use the binomial distribution to analytically compute the confidence interval

- This requires a small correction, because the binomial distribution is actually discrete but we want a continuous estimate
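One common version of that correction, sketched below: a normal approximation to the binomial, widened by a 1/(2n) continuity correction (exact methods such as Clopper–Pearson are standard alternatives):

    import numpy as np
    from scipy import stats

    def binomial_ci(successes, n, conf=0.95):
        # Normal approximation to the binomial, widened by 1/(2n) to
        # account for the discreteness of the distribution
        p = successes / n
        z = stats.norm.ppf(1 - (1 - conf) / 2)
        margin = z * np.sqrt(p * (1 - p) / n) + 1 / (2 * n)
        return max(0.0, p - margin), min(1.0, p + margin)

    print(binomial_ci(88, 100))   # close to the bootstrap interval above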


Comparing Two Measurements

- Consider the CPU measurements of the earlier example, and suppose we have performed the same computation on a different machine, yielding these CPU times:

  0.21 0.20 0.20 0.19 0.20 0.19 0.18 0.20 0.19 0.19
  0.19 0.19 0.20 0.18 0.19 0.20 0.22 0.20 0.20 0.20
  0.19 0.20 0.18 0.19 0.19 0.20 0.20 0.22 0.18 0.29
  0.21 0.23 0.20

- These times certainly seem faster than the first machine, which yielded a distributional confidence interval of [0.220, 0.276] – but how can we be sure?


Kernel Density Comparison

- Visually, the second machine (Shark3) is much faster than the first (Darwin):


Difference Estimation


- Bootstrap testing:
  - Repeat many times: draw a bootstrap sample from each of the machines, and compute the two sample means
  - If Shark3 is faster than Darwin more than 95% of the time, we can be 95% confident that it really is faster

- We can also compute a 95% bootstrap confidence interval on the difference between the means – this turns out to be [0.0461, 0.0553]

- If the samples are drawn from t distributions, then the difference between the sample means also has a t distribution
  - Confidence interval on this difference: [0.0463, 0.0555]
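A sketch of the bootstrap version in Python (darwin is the cpu_times array from the earlier sketch; shark3 is the 33 new measurements):

    import numpy as np

    rng = np.random.default_rng(0)
    darwin = cpu_times                           # 37 measurements, as before
    shark3 = np.array([
        0.21, 0.20, 0.20, 0.19, 0.20, 0.19, 0.18, 0.20, 0.19, 0.19,
        0.19, 0.19, 0.20, 0.18, 0.19, 0.20, 0.22, 0.20, 0.20, 0.20,
        0.19, 0.20, 0.18, 0.19, 0.19, 0.20, 0.20, 0.22, 0.18, 0.29,
        0.21, 0.23, 0.20])

    # Resample each machine's data (with replacement), take the difference
    # of the resampled means, and repeat many times
    diffs = np.sort([rng.choice(darwin, len(darwin)).mean()
                     - rng.choice(shark3, len(shark3)).mean()
                     for _ in range(1000)])
    print(diffs[25], diffs[974])   # roughly the [0.046, 0.055] interval above
    print((diffs > 0).mean())      # fraction of resamples where Darwin is slower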


Hypothesis Testing


- Is the true difference zero, or more than zero?

- Use classical statistical rejection testing
  - Null hypothesis: The two machines have the same speed (i.e., μ, the true difference in means, is equal to zero)
  - Can we reject this hypothesis, based on the observed data?
  - If the null hypothesis were true, what is the probability that we would have observed this data?

- We can measure this probability using the t distribution
  - In this case, the computed t value = (μ₁ – μ₂) / σ' = 21.69
  - The probability of seeing this t value if μ were actually zero is vanishingly small: the 99.999% confidence interval (under the null hypothesis) is [−4.59, 4.59], so the probability of this t value is (much) less than 0.00001
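In practice this is one library call; a sketch with scipy, using the darwin and shark3 arrays from the earlier sketch (scipy assumes equal variances by default, so the exact t value may differ slightly from the 21.69 above; pass equal_var=False for Welch's variant):

    from scipy import stats

    # Two-sample t-test of the null hypothesis that the machines are equally fast
    t_stat, p_value = stats.ttest_ind(darwin, shark3)
    print(t_stat, p_value)         # a very large t; p far below 0.00001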


Paired Differences

- Suppose we had a set of 10 different benchmark programs that we ran on the two machines, yielding these CPU times: [table of per-program times omitted]

- Obviously, we don’t want to just compare the means, since the programs have such different running times


Kernel Density Visualization

- CPU1 seems to be systematically faster (offset to the left) than CPU2


Scatterplot Visualization

- CPU1 is always faster than CPU2 (i.e., above the diagonal line that corresponds to equal speed)


Sequential Visualization

- The correlation of program “difficulty” (and the consistently faster speed of CPU1) is even more obvious in this line plot, ordered by program number:


Distribution Analysis I


- If the differences are in the same “units,” we can subtract the CPU times for the “paired” tests and assume a t distribution on these differences

- The probability of observing a sample mean difference as large as 0.02779, given the null hypothesis that the machines have the same speed, is 0.0000466 – so we can reject the null hypothesis

- If we have no prior belief about which machine is faster, we should use a “two-tailed test”
  - The probability of observing a sample mean difference this large in either direction is 0.0000932 – slightly larger, but still sufficiently improbable that we can be sure the machines have different speeds

- Note that we can also use a bootstrap analysis on this type of paired data
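A sketch of the paired test in Python; the per-program times below are hypothetical placeholders, since the benchmark table itself isn't reproduced here:

    import numpy as np
    from scipy import stats

    # Hypothetical paired benchmark times (placeholders, not the slide's data)
    cpu1 = np.array([1.2, 5.4, 0.8, 12.1, 3.3, 7.9, 2.2, 9.5, 0.5, 4.1])
    cpu2 = np.array([1.3, 5.6, 0.9, 12.5, 3.5, 8.2, 2.4, 9.9, 0.6, 4.3])

    # Paired t-test on the per-program differences; scipy reports the
    # two-tailed p-value by default
    t_stat, p_two = stats.ttest_rel(cpu1, cpu2)
    # One-tailed p-value for "CPU1 is faster" (i.e., cpu1 - cpu2 < 0)
    p_one = p_two / 2 if t_stat < 0 else 1 - p_two / 2
    print(t_stat, p_two, p_one)

Running stats.ttest_ind on the same two arrays, ignoring the pairing, illustrates the next slide's point: the large between-program variance swamps the small, consistent per-program differences.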


Paired vs. Non-Paired

- If we don’t pair the data (i.e., we just compare the overall means, not the differences for paired tests):
  - Distributional analysis doesn’t let us reject the null hypothesis
  - Bootstrap analysis doesn’t let us reject the null hypothesis


Sign Tests

- I mentioned before that the paired t-test is appropriate if the measurements are in the same “units”

- If the magnitude of the difference is not important, or not meaningful, we can still compare performance:
  - Look at the sign of the difference (here, CPU1 is faster 10 out of 10 times; but in another case, it might only be faster 9 out of 10 times)
  - Use the binomial distribution (flip a coin to get the sign) to compute a confidence interval for the probability that CPU1 is faster
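A sketch of the 9-out-of-10 case using scipy's exact binomial test (binomtest; older scipy versions expose it as binom_test):

    from scipy import stats

    # Under the null hypothesis, CPU1 "wins" each benchmark with probability
    # 0.5, so the number of wins out of 10 is Binomial(10, 0.5)
    result = stats.binomtest(9, n=10, p=0.5, alternative='greater')
    print(result.pvalue)            # P(9 or more heads in 10 fair flips), ~0.011
    print(result.proportion_ci(0.95))   # CI on the win rate (recent scipy)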


Other Important Topics

- Regression analysis
- Cross-validation
- Human subjects analysis and user study design
- Analysis of Variance (ANOVA)

- For your particular investigation, you need to know which of these topics are relevant, and to learn about them!


Statistically Valid Experimental Design

- Make sure you understand the nuances before you design your experiments...

- ...and definitely before you analyze your experimental data!

- Designing the statistical methods (and hypotheses) after the fact is not valid!
  - You can often find a hypothesis and an associated statistical method ex post facto – i.e., design an experiment to fit the data instead of the other way around
  - In the worst case, doing this is downright unethical
  - In the best case, it shows a lack of clear research objectives, and the results may not be reproducible or meaningful
