Statistical Inference Lecturer: Prof. Duane S. Boning 1 Agenda 1. Review: Probability Distributions & Random Variables 2. Sampling: Key distributions arising in sampling • Chi-square, t, and F distributions 3. Estimation: Reasoning about the population based on a sample 4. Some basic confidence intervals • • • Estimate of mean with variance known Estimate of mean with variance not known Estimate of variance 5. Hypothesis tests 2 Discrete Distribution: Bernoulli • Bernoulli trial: an experiment with two outcomes • Probability mass function (pmf): f(x) ¾ ¼ p 1-p 0 1 x 3 Discrete Distribution: Binomial • Repeated random Bernoulli trials • n is the number of trials • p is the probability of “success” on any one trial • x is the number of successes in n trials 4 Binomial Distribution Binomial Distribution 1 .2 1 0 .8 0.25 0 .6 S eries1 0 .4 0 .2 29 27 25 23 21 19 17 15 9 13 7 11 5 3 1 0 0.15 0.1 0.05 29 27 25 23 21 19 17 15 13 11 9 7 5 3 0 1 Probability 0.2 Number of "Successes" 5 Discrete Distribution: Poisson • Mean: • Variance: • Example applications: – # misprints on page(s) of a book – # transistors which fail on first day of operation • Poisson is a good approximation to Binomial when n is large and p is small (< 0.1) 6 Poisson Distributions Poisson Distribution 0.2 0.18 =5 0.16 Probability 0.14 0.12 0.1 c 0.08 Poisson Distribution 0.06 0.1 0.04 0.09 =20 0.02 0.08 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 0.07 Poisson Distribution 0.08 Probability Events per unit 0.06 0.05 c 0.04 0.03 0.07 =30 0.06 0.02 0.01 Probability 0.05 0 1 0.04 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 Events per unit 0.03 0.02 0.01 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 Events per unit 7 Continuous Distributions • Uniform Distribution • Normal Distribution – Unit (Standard) Normal Distribution 8 Continuous Distribution: Uniform • probability density function (pdf)† a b x a b x • cumulative distribution function* (cdf) 1 †also sometimes called a probability distribution function *also sometimes called a cumulative density function 9 Standard Questions You Should Be Able To Answer (For a Known cdf or pdf) • Probability x less than or equal to some value • Probability x sits within some range 1 a b x a b x 10 Continuous Distribution: Normal (Gaussian) • pdf 1 • cdf 0.99865 0.977 0.84 0.5 0.00135 0.16 0.0227 0 11 Continuous Distribution: Unit Normal • Normalization • Mean • Variance • pdf • cdf 12 Using the Unit Normal pdf and cdf • We often want to talk about “percentage points” of the distribution – portion in the tails 1 0.9 0.5 0.1 13 Philosophy The field of statistics is about reasoning in the face of uncertainty, based on evidence from observed data • Beliefs: – Distribution or model form – Distribution/model parameters • Evidence: – Finite set of observations or data drawn from a population • Models: – Seek to explain data 14 Moments of the Population vs. Sample Statistics Population Sample • Mean • Variance • Standard Deviation • Covariance • Correlation Coefficient 15 Sampling and Estimation • Sampling: act of making observations from populations • Random sampling: when each observation is identically and independently distributed (IID) • Statistic: a function of sample data; a value that can be computed from data (contains no unknowns) – average, median, standard deviation 16 SticiGui: Statistics Tools for Internet and Classroom Instruction with a Graphical User Interface http://stat-www.berkeley.edu/~stark/SticiGui Sampling Demo 17 Population vs. Sampling Distribution Population (probability density function) n = 20 Sample Mean (statistic) n = 10 Sample Mean (sampling distribution) n=2 n=1 18 Sampling and Estimation, cont. • • • • Sampling Random sampling Statistic A statistic is a random variable, which itself has a sampling distribution – I.e., if we take multiple random samples, the value for the statistic will be different for each set of samples, but will be governed by the same sampling distribution • If we know the appropriate sampling distribution, we can reason about the population based on the observed value of a statistic – E.g. we calculate a sample mean from a random sample; in what range do we think the actual (population) mean really sits? 19 Sampling and Estimation – An Example • Suppose we know that the thickness of a part is normally distributed with std. dev. of 10: • We sample n = 50 random parts and compute the mean part thickness: • First question: What is distribution of • Second question: can we use knowledge of distribution to reason about the actual (population) mean given observed (sample) mean? 20 Estimation and Confidence Intervals • Point Estimation: – Find best values for parameters of a distribution – Should be • Unbiased: expected value of estimate should be true value • Minimum variance: should be estimator with smallest variance • Interval Estimation: – Give bounds that contain actual value with a given probability – Must know sampling distribution! Confidence Interval Demo 21 Confidence Intervals: Variance Known • We know , e.g. from historical data • Estimate mean in some interval to (1- )100% confidence • • Remember the unit normal percentage points Apply to the sampling distribution for the sample mean 1 0.9 0.5 0.1 22 Example, Cont’d • Second question: can we use knowledge of distribution to reason about the actual (population) mean given observed (sample) mean? n = 50 95% confidence interval, = 0.05 ~95% of distribution lies within +/- 2 of mean 23 Reasoning & Sampling Distributions • Example shows that we need to know our sampling distribution in order to reason about the sample and population parameters • Other important sampling distributions: – Student-t • Use instead of normal distribution when we don’t know actual variation or – Chi-square • Use when we are asking about variances – F • Use when we are asking about ratios of variances 24 Sampling: The Chi-Square Distribution • Typical use: find distribution of variance when mean is known • Ex: So if we calculate s2, we can use knowledge of chi-square distribution to put bounds on where we believe the actual (population) variance sits 0.1 0.09 0.08 0.07 0.06 0.05 0.04 0.03 0.02 0.01 0 0 5 10 15 20 25 30 25 Sampling: The Student-t Distribution • Typical use: Find distribution of average when • For k ! 1, tk ! N(0,1) • Consider xi ~ N( , 2) . Then is NOT known • This is just the “normalized” distance from mean (normalized to our estimate of the sample variance) 26 Back to our Example • Suppose we do not know either the variance or the mean in our parts population: • We take our sample of size n = 50, and calculate • Best estimate of population mean and variance (std.dev.)? • If had to pick a range where would be 95% of time? Have to use the appropriate sampling distribution: In this case – the t-distribution (rather than normal distribution) 27 Confidence Intervals: Variance Unknown • Case where we don’t know variance a priori • Now we have to estimate not only the mean based on our data, but also estimate the variance • Our estimate of the mean to some interval with (1- )100% confidence becomes Note that the t distribution is slightly wider than the normal distribution, so that our confidence interval on the true mean is not as tight as when we know the variance. 28 Example, Cont’d • Third question: can we use knowledge of distribution to reason about the actual (population) mean given observed (sample) mean – even though we weren’t told ? n = 50 t distribution is slightly wider than gaussian distribution 95% confidence interval 29 Once More to Our Example • Fourth question: how about a confidence interval on our estimate of the variance of the thickness of our parts, based on our 50 observations? 30 Confidence Intervals: Estimate of Variance • The appropriate sampling distribution is the Chi-square • Because 2 is asymmetric, c.i. bounds not symmetric. 31 Example, Cont’d • Fourth question: for our example (where we observed sT2 = 102.3) with n = 50 samples, what is the 95% confidence interval for the population variance? 0.045 0.04 0.035 0.03 0.025 0.02 0.015 0.01 0.005 0 0 10 20 30 40 50 60 70 80 90 100 32 Sampling: The F Distribution • Typical use: compare the spread of two populations • Example: – x ~ N( x, – y ~ N( y, – Then from which we sample x1, x2, …, xn 2 ) from which we sample y , y , …, y y 1 2 m 2 x) or 33 Concept of the F Distribution • Assume we have a normally distributed population • We generate two different random samples from the population • In each case, we calculate a sample variance si2 • What range will the ratio of these two variances take? ) F distribution • Purely by chance (due to sampling) we get a range of ratios even though drawing from same population Example: • Assume x ~ N(0,1) • Take samples of size n = 20 • Calculate s12 and s22 and take ratio • 95% confidence interval on ratio Large range in ratio! 34 Hypothesis Testing • A statistical hypothesis is a statement about the parameters of a probability distribution • H0 is the “null hypothesis” – E.g. – Would indicate that the machine is working correctly • H1 is the “alternative hypothesis” – E.g. – Indicates an undesirable change (mean shift) in the machine operation (perhaps a worn tool) • In general, we formulate our hypothesis, generate a random sample, compute a statistic, and then seek to reject H0 or fail to reject (accept) H0 based on probabilities associated with the statistic and level of confidence we select 35 Which Population is Sample x From? • Two error probabilities in decision: – Type I error: “false alarm” – Type II error: “miss” – Power of test (“correct alarm”) Consider H0 the “normal” condition • Control charts are hypothesis tests: Consider H1 an “alarm” condition Set decision point (and sample size) based on acceptable , risks – Is my process “in control” or has a significant change occurred? 36 Summary 1. 2. Review: Probability Distributions & Random Variables Sampling: Key distributions arising in sampling • 3. 4. Estimation: Reasoning about the population based on a sample Some basic confidence intervals • • • 5. Chi-square, t, and F distributions Estimate of mean with variance known Estimate of mean with variance not known Estimate of variance Hypothesis tests Next Time: 1. Are effects (one or more variables) significant? ) ANOVA (Analysis of Variance) 2. How do we model the effect of some variable(s)? ) Regression modeling 37 MIT OpenCourseWare http://ocw.mit.edu 2.854 / 2.853 Introduction to Manufacturing Systems Fall 2010 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.