October 1999
Marie desJardins (mariedj@cs.umbc.edu)
CMSC 601
April 9, 2012
Material adapted from slides by Tom Dietterich, with permission
Given a set of measurements of a value, how certain can we be of the value?
Given a set of measurements of two values, how certain can we be that the two values are different?
Given a measured outcome, along with several condition or treatment values, how can we remove the effect of unwanted conditions or treatments on the outcome?
Here are 37 measurements of the CPU time required to compute C(10000, 500):
0.27 0.25 0.23 0.24 0.26 0.24 0.26 0.25 0.24 0.25
0.25 0.24 0.25 0.24 0.25 0.26 0.24 0.25 0.25 0.25
0.25 0.25 0.24 0.25 0.24 0.25 0.25 0.24 0.25 0.25
0.24 0.25 0.24 0.24 0.25 0.25 0.26
What is the “true” CPU cost of this computation?
Before doing any calculations, look at the data
Kernel density: place a small Gaussian distribution ("kernel") around each data point, and sum them
Useful for visualization; also often used as a regression technique
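As a rough illustration (not from the original slides), here is a minimal kernel density sketch in Python; the array holds the 37 CPU times from the earlier slide, and the bandwidth h = 0.005 is an arbitrary choice:

```python
import numpy as np

# The 37 CPU-time measurements (seconds) from the earlier slide.
times = np.array([
    0.27, 0.25, 0.23, 0.24, 0.26, 0.24, 0.26, 0.25, 0.24, 0.25,
    0.25, 0.24, 0.25, 0.24, 0.25, 0.26, 0.24, 0.25, 0.25, 0.25,
    0.25, 0.25, 0.24, 0.25, 0.24, 0.25, 0.25, 0.24, 0.25, 0.25,
    0.24, 0.25, 0.24, 0.24, 0.25, 0.25, 0.26,
])

def kernel_density(xs, data, h=0.005):
    """Place a small Gaussian 'kernel' of width h around each data
    point and sum them, normalized so the result integrates to 1."""
    bumps = np.exp(-0.5 * ((xs[:, None] - data[None, :]) / h) ** 2)
    return bumps.sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

xs = np.linspace(0.22, 0.28, 200)
density = kernel_density(xs, times)  # plot xs vs. density to visualize
```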
The data seem reasonably close to a normal (Gaussian, or bell curve) distribution
Given this assumption, we can compute a sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = 0.248$
How certain can we be that this is the true value?
Confidence interval [min, max]: if we repeatedly drew samples of size n = 37 and computed the sample means, 95% of the time the sample mean would lie between min and max
We can simulate this process algorithmically
Draw 1000 random subsamples (with replacement) from the original 37 points
This process makes no assumption about a Gaussian distribution!
Sort the means of these subsamples
Choose the 26th and 975th values as min and max of a 95% confidence interval (this range includes 95% of the sample means!)
Result: The resampled confidence interval is [0.245, 0.251]
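A minimal sketch of this resampling procedure in Python (the helper name `bootstrap_ci` is mine; it reuses the `times` array defined in the earlier sketch):

```python
import numpy as np

def bootstrap_ci(data, statistic=np.mean, n_resamples=1000, seed=0):
    """95% bootstrap CI: resample with replacement, compute the
    statistic on each resample, sort, and take the 26th and 975th
    of the 1000 sorted values."""
    rng = np.random.default_rng(seed)
    values = np.sort([
        statistic(rng.choice(data, size=len(data), replace=True))
        for _ in range(n_resamples)
    ])
    return values[25], values[974]  # 26th and 975th values (1-indexed)

low, high = bootstrap_ci(times)  # the slides report roughly [0.245, 0.251]
```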
The Central Limit Theorem says that the distribution of the sample means is approximately normal for large n, regardless of how the original data is distributed
If the original data is normally distributed with mean μ and standard deviation σ, then the sample means will be normally distributed with mean μ and standard deviation σ' = σ/√n (but we don't know the original μ and σ...):
$p(\bar{x}) = \frac{1}{\sqrt{2\pi}\,\sigma'}\, e^{-\frac{1}{2}\left(\frac{\bar{x} - \mu}{\sigma'}\right)^2}$
Note that it isn’t important to remember this formula, since Matlab, R, etc. will do this for you. But it is very important to understand why you are computing it!
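For example, in Python (a rough sketch: it substitutes the sample standard deviation for the unknown σ, and reuses the `times` array from the earlier sketch):

```python
import numpy as np
from scipy import stats

n = len(times)
sigma_prime = times.std(ddof=1) / np.sqrt(n)  # estimated std. dev. of the mean
low, high = stats.norm.interval(0.95, loc=times.mean(), scale=sigma_prime)
```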
Instead of assuming a normal distribution, we can use a t distribution (sometimes called a “Student’s t distribution”), which has three parameters: μ, σ, and the degrees of freedom (d.f. = n-1)
The probability density function looks somewhat like a normal distribution, but with heavier tails; as n increases, it approaches the normal distribution
Using the central limit theorem, this distribution yields slightly wider (more conservative) confidence limits than a normal distribution would:
We can use the mathematical formula for the t distribution to compute a p (typically, p = 0.95) confidence interval:
The 0.025 t-value, $t_{0.025}$, is the value such that the probability that $\mu - \mu' < t_{0.025}$ is 0.975
The 95% confidence interval is then $[\mu' - t_{0.025}, \mu' + t_{0.025}]$
For the CPU example, $t_{0.025}$ is 0.028, so the distributional confidence interval is [0.220, 0.276], wider than the bootstrapped CI of [0.245, 0.251]
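A sketch of the t-based interval in Python; `stats.t.interval` handles the degrees of freedom, and the numbers it produces for these data need not match the slide's values exactly:

```python
import numpy as np
from scipy import stats

n = len(times)
se = times.std(ddof=1) / np.sqrt(n)  # standard error of the mean
low, high = stats.t.interval(0.95, n - 1, loc=times.mean(), scale=se)
```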
The bootstrap method can be used to compute other sample statistics for which the distribution method isn’t appropriate:
median
mode
variance
Because the tails and outlying values may not be well represented in a sample, the bootstrap method is not as useful for statistics involving the “ends” of the distribution:
minimum
maximum
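If the bootstrap helper sketched earlier takes the statistic as a parameter (as the version above does), covering these cases is one line each:

```python
import numpy as np

median_low, median_high = bootstrap_ci(times, statistic=np.median)
var_low, var_high = bootstrap_ci(times, statistic=np.var)
```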
In CS, we often want to know how often something occurs:
How many times does a process complete successfully?
How many times do we correctly predict membership in a class?
How many times do we find the top search result?
Again, the sample rate θ’ is what we have observed, but we would like to know the “true” rate θ
Suppose we have observed 100 predictions of a decision tree, and 88 of them were correct
Draw many (say, 1000) samples of size n , with replacement, from the n observed predictions (here, n =100), and compute the sample classification rate
Sort the sample rates $\theta_i$ in increasing order
Choose the 26th and 975th values as the ends of the confidence interval: here, the confidence interval is [0.81, 0.94]
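A sketch of this in Python, treating the 100 predictions as a 0/1 array (the encoding is mine; any representation with a computable mean works):

```python
import numpy as np

# 88 correct (1) and 12 incorrect (0) predictions; order is irrelevant.
outcomes = np.array([1] * 88 + [0] * 12)

rng = np.random.default_rng(0)
rates = np.sort([
    rng.choice(outcomes, size=len(outcomes), replace=True).mean()
    for _ in range(1000)
])
low, high = rates[25], rates[974]  # the slides report roughly [0.81, 0.94]
```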
If we assume that the classifier is a “biased coin” with probability θ of coming up heads, then we can use the binomial distribution to analytically compute the confidence interval
This requires a small correction because the binomial distribution is actually discrete, but we want a continuous estimate
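One standard analytic construction is the Clopper-Pearson "exact" interval, which inverts the binomial distribution via the beta distribution; a sketch (this particular construction is my choice, not necessarily the one the slides used):

```python
from scipy import stats

k, n, alpha = 88, 100, 0.05
lower = stats.beta.ppf(alpha / 2, k, n - k + 1)      # defined as 0 when k == 0
upper = stats.beta.ppf(1 - alpha / 2, k + 1, n - k)  # defined as 1 when k == n
```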
Consider the CPU measurements of the earlier example, and suppose we have performed the same computation on a different machine, yielding these CPU times:
0.21 0.20 0.20 0.19 0.20 0.19 0.18 0.20 0.19 0.19
0.19 0.19 0.20 0.18 0.19 0.20 0.22 0.20 0.20 0.20
0.19 0.20 0.18 0.19 0.19 0.20 0.20 0.22 0.18 0.29
0.21 0.23 0.20
These times certainly seem faster than the first machine, which yielded a distributional confidence interval of [0.220, 0.276] – but how can we be sure?
Visually, the second machine (Shark3) is much faster than the first (Darwin):
Bootstrap testing:
Repeat many times:
Draw a bootstrap sample from each of the machines, and compute the sample means
If Shark3 is faster than Darwin more than 95% of the time, we can be 95% confident that it really is faster
We can also compute a 95% bootstrap confidence interval on the difference between the means; this turns out to be [0.0461, 0.0553]
If the samples are drawn from t distributions, then the difference between the sample means also has a t distribution
Confidence interval on this difference: [0.0463, 0.0555]
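A sketch of the bootstrap comparison in Python, with the Shark3 times in a new array `times2` and the Darwin times in the `times` array from earlier:

```python
import numpy as np

times2 = np.array([
    0.21, 0.20, 0.20, 0.19, 0.20, 0.19, 0.18, 0.20, 0.19, 0.19,
    0.19, 0.19, 0.20, 0.18, 0.19, 0.20, 0.22, 0.20, 0.20, 0.20,
    0.19, 0.20, 0.18, 0.19, 0.19, 0.20, 0.20, 0.22, 0.18, 0.29,
    0.21, 0.23, 0.20,
])

rng = np.random.default_rng(0)
diffs = np.sort([
    rng.choice(times, size=len(times), replace=True).mean()
    - rng.choice(times2, size=len(times2), replace=True).mean()
    for _ in range(1000)
])
shark3_wins = (diffs > 0).mean()   # fraction of resamples where Shark3 is faster
low, high = diffs[25], diffs[974]  # the slides report roughly [0.0461, 0.0553]
```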
Is the true difference zero, or more than zero?
Use classical statistical rejection testing
Null hypothesis: The two machines have the same speed
(i.e., μ, the difference in means, is equal to zero)
Can we reject this hypothesis, based on the observed data?
If the null hypothesis were true, what is the probability we would have observed this data?
We can measure this probability using the t distribution
In this case, the computed t value $= (\mu_1 - \mu_2)/\sigma' = 21.69$
The probability of seeing this t value, if μ were actually zero, is vanishingly small: the 99.999% confidence interval (under the null hypothesis) is [-4.59, 4.59], so the probability of this t value is (much) less than 0.00001
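scipy computes this test directly; a sketch (I use Welch's variant, which does not assume equal variances, a defensible default):

```python
from scipy import stats

t_stat, p_value = stats.ttest_ind(times, times2, equal_var=False)
# A tiny p_value means the observed difference would be wildly
# improbable if the two machines really had the same speed.
```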
Suppose we had a set of 10 different benchmark programs that we ran on the two machines, yielding these CPU times:
Obviously, we don’t want to just compare the means, since the programs have such different running times
CPU1 seems to be systematically faster (offset to the left) than CPU2
CPU1 is always faster than CPU2 (i.e., above the diagonal line that corresponds to equal speed)
The correlation of program "difficulty" (and the consistently faster speed of CPU1) is even more obvious in this line plot, ordered by program number:
If the differences are in the same "units," we can subtract the CPU times for the "paired" tests and assume a t distribution on these differences
The probability of observing a sample mean difference as large as 0.2779, given a null hypothesis of the machines having the same speed, is 0.0000466, so we can reject the null hypothesis
If we have no prior belief about which machine is faster, we should use a "two-tailed test"
The probability of observing a sample mean difference this large in either direction is 0.0000932: slightly larger, but still sufficiently improbable that we can be sure the machines have different speeds
Note that we can also use a bootstrap analysis on this type of paired data
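With one time per benchmark per machine in two equal-length arrays, scipy's paired t-test applies directly; a sketch with hypothetical placeholder data, since the slide's table is not reproduced in this text:

```python
import numpy as np
from scipy import stats

# Hypothetical placeholder times (seconds), one per benchmark program.
cpu1 = np.array([1.2, 5.0, 0.9, 12.1, 3.3, 7.8, 0.5, 2.2, 9.4, 4.1])
cpu2 = np.array([1.4, 5.6, 1.0, 13.0, 3.7, 8.5, 0.6, 2.5, 10.2, 4.6])

t_stat, p_two_tailed = stats.ttest_rel(cpu1, cpu2)  # null: mean difference == 0
p_one_tailed = p_two_tailed / 2  # only valid if the direction was predicted
                                 # in advance and the sign of t_stat matches
```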
If we don’t pair the data (just compare the overall mean, not the differences for paired tests):
Distributional analysis doesn’t let us reject the null hypothesis
Bootstrap analysis doesn’t let us reject the null hypothesis
I mentioned before that the paired t-test is appropriate if the measurements are in the same "units"
If the magnitude of the difference is not important, or not meaningful, we still can compare performance
Look at the sign of the difference (here, CPU1 is faster 10 out of 10 times; but in another case, it might only be faster 9 out of 10 times)
Use the binomial distribution (flip a coin to get the sign) to compute a confidence interval for the probability that CPU1 is faster
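The sign test is one call in scipy (a sketch; `binomtest` is the modern name, older versions used `binom_test`):

```python
from scipy import stats

# CPU1 faster in 10 of 10 paired benchmarks; null hypothesis: each
# machine wins a given benchmark with probability 0.5 (a coin flip).
result = stats.binomtest(10, n=10, p=0.5)
print(result.pvalue)                                # two-sided p-value
print(result.proportion_ci(confidence_level=0.95))  # CI on the win rate
```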
Regression analysis
Cross-validation
Human subjects analysis and user study design
Analysis of Variance (ANOVA)
For your particular investigation, you need to know which of these topics are relevant, and to learn about them!
Make sure you understand the nuances before you design your experiments...
...and definitely before you analyze your experimental data!
Designing the statistical methods (and hypotheses) after the fact is not valid!
You can often find a hypothesis and an associated statistical method ex post facto – i.e., design an experiment to fit the data instead of the other way around
In the worst case, doing this is downright unethical
In the best case, it shows a lack of clear research objectives and may not be reproducible or meaningful