Statistics: Statistical Inference
Krishna V. Palem
Kenneth and Audrey Kennedy Professor of Computing
Department of Computer Science, Rice University

Sampling distribution of $\bar{X}$
[Figure: samples 1, 2, 3, 4, ..., k are drawn from the population; their means $\bar{x}_1, \bar{x}_2, \bar{x}_3, \bar{x}_4, \ldots, \bar{x}_k$ form the sampling distribution.]

Central Limit Theorem
(4) The mean of the sampling distribution of $\bar{X}$ is equal to the population mean, i.e. $\mu_{\bar{X}} = \mu$.
(5) The standard deviation of the sampling distribution of $\bar{X}$ is the population standard deviation divided by the square root of the sample size, i.e. $\sigma_{\bar{X}} = \sigma / \sqrt{n}$.

Sampling distribution of $\bar{X}$ for a Normal population
[Figure: histograms of $\bar{X}$ for four sample sizes N]
- N = 1: $\bar{X}$ = 1.41, SD = 0.145
- N = 5: $\bar{X}$ = 1.40, SD = 0.065
- N = 10: $\bar{X}$ = 1.40, SD = 0.047
- N = 50: $\bar{X}$ = 1.40, SD = 0.020

Sampling distribution of $\bar{X}$ for a non-Normal population
[Figure: histograms of $\bar{X}$ for four sample sizes N]
- N = 1: $\bar{X}$ = 1.40, SD = 0.147
- N = 5: $\bar{X}$ = 1.40, SD = 0.066
- N = 50: $\bar{X}$ = 1.41, SD = 0.021
- N = 100: $\bar{X}$ = 1.41, SD = 0.015

Computer simulation of the sampling distribution of the sample mean
- Pick any probability distribution and specify a mean and standard deviation.
- Tell the computer to randomly generate 1000 observations from that probability distribution. The computer is more likely to produce values that have high probability.
- Plot the "observed" values in a histogram.
- Next, tell the computer to randomly generate 1000 averages-of-2 (randomly pick 2 observations and take their average) from that probability distribution.
- Plot the "observed" averages in a histogram.
- Repeat for averages-of-10 and averages-of-100.
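The simulation steps above can be sketched in a few lines of Python. This is a minimal sketch, not the code used to produce the slides: the function names (`simulate_sample_means`, `uniform_draw`, `exponential_draw`), the fixed seed, and the choice of rate 1 for the exponential distribution are assumptions made for illustration. Plotting is left out; the summary statistics stand in for the histograms.

```python
import random
import statistics


def simulate_sample_means(draw, sample_size, n_means=1000, seed=0):
    """Return n_means averages, each taken over sample_size draws.

    `draw` is a function that takes a random.Random instance and returns
    one observation from the chosen probability distribution.
    """
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatable runs
    return [
        statistics.fmean([draw(rng) for _ in range(sample_size)])
        for _ in range(n_means)
    ]


def uniform_draw(rng):
    """One observation from the uniform distribution on [0, 1)."""
    return rng.random()


def exponential_draw(rng):
    """One observation from the exponential distribution with rate 1 (assumed)."""
    return rng.expovariate(1.0)


if __name__ == "__main__":
    for name, draw in (("uniform", uniform_draw), ("exponential", exponential_draw)):
        for n in (1, 2, 10, 100):
            means = simulate_sample_means(draw, n)
            print(
                f"{name:12s} averages-of-{n:<3d} "
                f"mean={statistics.fmean(means):.3f} "
                f"SD={statistics.stdev(means):.3f}"
            )
```

As the number of observations per average grows, the printed SD should shrink roughly like $\sigma / \sqrt{n}$ while the mean stays near the population mean, which is what items (4) and (5) of the Central Limit Theorem slide state.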

[Figures: histograms from the simulation]
- Uniform distribution on [0,1]: average of 1 sample (original distribution)
- Uniform distribution: 1000 averages of 2 samples
- Uniform distribution: 1000 averages of 5 samples
- Uniform distribution: 1000 averages of 100 samples
- Exponential distribution: average of 1 sample (original distribution)
- Exponential distribution: 1000 averages of 2 samples
- Exponential distribution: 1000 averages of 5 samples
- Exponential distribution: 1000 averages of 100 samples

Contents
- Summary of Statistics Learnt so Far
- Statistical Inference
- Central Limit Theorem and its implications
- Estimation theory
- Interval Estimation
- What is Confidence Interval?
- Tutorial

Estimation Theory
- In statistics, estimation refers to the process by which one makes inferences about a population, based on information obtained from a sample.
- Statisticians use sample statistics to estimate population parameters.
- For example, sample means are used to estimate population means, and sample proportions are used to estimate population proportions.

Two Types of Estimates
- Point estimate. A point estimate of a population parameter is a single value of a statistic.
- For example, the sample mean $\bar{x}$ is a point estimate of the population mean μ.
- When we estimate the mean μ by $\bar{x}$, the probability that we are exactly correct is close to zero, i.e. $P(\bar{x} = \mu) \approx 0$, assuming the population is heterogeneous and the sample size n is much smaller than the population size N.
- Hence, we are not very "confident" about estimates made using point estimates.

Two Types of Estimates (contd.)
- How can we be more confident about our estimates? We want the probability of being correct to be well above zero.
- We can increase our confidence by using a less precise estimate: an interval instead of a point.
- Interval estimate. An interval estimate is defined by two numbers, between which a population parameter is said to lie.
- For example, $a < \mu < b$ is an interval estimate of the population mean μ.
- It indicates that the population mean is greater than a but less than b.

History of Interval Estimation
- Neyman (1937) identified interval estimation ("estimation by interval") as distinct from point estimation ("estimation by unique estimate").
- He was the first to recognize and formulate interval estimation.
- Quoting results in the form of an estimate plus or minus a standard deviation is a form of interval estimation.
- His paper on this, titled "On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection", was given at the Royal Statistical Society on 19 June 1934.
- You can download the paper from: http://stevereads.com/papers_to_read/on_the_two_different_aspects_of_the_representative_method.pdf

What is an Interval Estimate?
- In statistics, interval estimation is the use of sample data to calculate an interval of possible (or probable) values of an unknown population parameter, in contrast to point estimation, which gives a single number.
- Interval estimate. An interval estimate is defined by two numbers, between which a population parameter is said to lie.
- For example, $a < \mu < b$ is an interval estimate of the population mean μ. It indicates that the population mean is greater than a but less than b.
- We use $\bar{x}$ to estimate this interval.
- Interval estimates provide both a "best estimate" of a parameter and an indication of the precision with which the parameter is known.

Types of Interval Estimation
The most prevalent forms of interval estimation are:
- confidence intervals, a frequentist method
- credible intervals, a Bayesian method
Other common approaches to interval estimation, which are encompassed by statistical theory, are:
- tolerance intervals
- prediction intervals, used mainly in regression analysis
Of these, confidence intervals are the most common and widely used, and hence will be covered in more detail in this class.

What is a Confidence Interval?
- In statistics, a confidence interval (CI) is an interval estimate of a population parameter.
- Instead of estimating the parameter by a single value, an interval likely to include the parameter is given.
- Confidence intervals are used to indicate the reliability of an estimate.
- How likely the interval is to contain the parameter is determined by the confidence level.
- Increasing the desired confidence level will widen the confidence interval.
- Confidence intervals, and interval estimates more generally, have applications across the whole range of quantitative studies.

Example of Confidence Interval
- For example, a confidence interval can be used to describe how reliable some opinion survey results are.
- In a survey of election voting intentions, the result might be that 40% of respondents intend to vote for a certain party.
- A 95% confidence interval for the proportion in the whole population having the same intention on the survey date might be 36% to 44%.
- From the same survey data one may calculate a narrower interval at the lower 90% confidence level, for instance 38% to 42%.
- All other things being equal, a survey result with a narrow confidence interval at a high confidence level is more desirable.

[Video on Confidence Interval]

Example
- In the whole of Houston, what percentage of adults do you think will want to watch a movie sometime in the next 10 days? Assume a variance of 0.0625 for the whole population.
- Choose a random sample of 10 adults and ask their opinion. Will this be anywhere close to the actual percentage?
- Let X be the random variable denoting the percentage of adults attending the movies out of the sample, and let $X_i$ be its value from the i-th sample.
- How can we be sure to be closer to the actual mean? Take a very large number of samples.

Example (contd.)
- But taking a large number of samples is generally not feasible. We want to arrive at an estimate based on fewer samples.
- For example, if you take only 1 sample of 10 people and find that 5 of the 10 people would like to go to a movie, then you can say: "We are pretty sure that 50% of the adult population would want to go to a movie in the next 10 days."
- Isn't this ambiguous? How sure is pretty sure? We need to be more definitive.

Example (contd.)
- We use a confidence interval to remove the ambiguity.
- What if we want to be 100% sure? The only statement we can make with 100% certainty is that between 0% and 100% of the adult population would want to watch a movie in the next 10 days.
- What if we want to be 50% sure? Such a statement carries little weight, since you are wrong half the time.
- Then, what kind of statements make sense?
- 90% sure, 95% sure, 98% sure, or 99% sure. These are called confidence levels.

Calculating Confidence Level
- The general norm is to vary the interval by multiples of σ and compute the confidence level. Here σ denotes the standard deviation of the sampling distribution of $\bar{x}$.
- σ is varied equally on either side of the mean.
- The probability that μ lies in the interval $[\bar{x} - \sigma, \bar{x} + \sigma]$ can be calculated as
  $P(\mu \in [\bar{x} - \sigma, \bar{x} + \sigma]) = P(\bar{x} - \sigma \le \mu \le \bar{x} + \sigma)$
  $P(\mu \in [\bar{x} - \sigma, \bar{x} + \sigma]) = P(\mu - \sigma \le \bar{x} \le \mu + \sigma)$
- Assuming a Normal distribution, we get $P(\mu \in [\bar{x} - \sigma, \bar{x} + \sigma]) \approx 0.6827$.
- What if we increase the width of the interval from 2σ to 4σ? $P(\mu \in [\bar{x} - 2\sigma, \bar{x} + 2\sigma]) \approx 0.9545$.
- Source for calculations: http://www.analyzemath.com/statistics/normal_calculator.html

Confidence Level Table
Some of the most commonly used confidence levels in statistics are given in the table below:

Confidence level    Number of σs away from the mean
90%                 1.64
95%                 1.96
98%                 2.33
99%                 2.575

Less than 90% is generally not considered a strong enough confidence level to make a statement.

Example (contd.)
- Let us continue with computing the confidence interval for our movie example.
- Assume that we took a random sample of 10 adults. Among them, 5 adults said that they would like to go to a movie in the next 10 days.
- Hence we get mean $\bar{x} = 0.5$ (denoting 50%) and, since $\mathrm{Var}(\bar{x}) = \sigma^2 / n$, standard deviation $\sqrt{0.0625 / 10} \approx 0.0791$.
- Say we want to be 95% confident about our estimate.

Example (contd.)
- From the table we can see that we have to be 1.96σ away from the mean.
- Hence, we need to be $1.96 \times 0.0791 \approx 0.155$ away from the mean.
- Summarizing, we can now say with 95% confidence that the mean of the actual population lies in $[0.5 - 0.155, 0.5 + 0.155] = [0.345, 0.655]$, that is, between 34.5% and 65.5% of the total population.
- What if you want to be 98% confident?

Graphical Representation of Confidence Intervals
[Figure: example plot of a normal distribution (bell curve). Each colored band has a width of one standard deviation.]
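
The relationship between the number of standard deviations on either side of the mean and the confidence level can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module and assumes a Normal distribution throughout; the helper names `confidence_level` and `sigmas_for_level` are made up for illustration.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1).
STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def confidence_level(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return STANDARD_NORMAL.cdf(k) - STANDARD_NORMAL.cdf(-k)


def sigmas_for_level(level):
    """Number of standard deviations on each side of the mean that covers `level`."""
    # A two-sided interval with coverage `level` leaves (1 - level) / 2 in each tail.
    return STANDARD_NORMAL.inv_cdf(0.5 + level / 2.0)


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(f"within {k} sigma: {confidence_level(k):.4f}")
    for level in (0.90, 0.95, 0.98, 0.99):
        print(f"{level:.0%} confidence: {sigmas_for_level(level):.3f} sigma")
```

Going from one to two standard deviations on each side raises the coverage from roughly 68% to roughly 95%, and the second loop recovers multipliers close to the commonly tabulated 1.64, 1.96, 2.33 and 2.575.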

Confidence Interval for μ when σ is known
- A 95% confidence interval for μ if σ is known is given by: $\bar{x} \pm 1.96 \, \sigma / \sqrt{n}$
- 95% of the $\bar{x}$'s lie between $\mu - 1.96 \, \sigma / \sqrt{n}$ and $\mu + 1.96 \, \sigma / \sqrt{n}$.
[Figure: overlay plot of the normal density of $\bar{X}$, with the central 95% region between $\mu - 1.96 \, \sigma / \sqrt{n}$ and $\mu + 1.96 \, \sigma / \sqrt{n}$ marked.]

Rationale for Confidence Interval
- From the sampling distribution of $\bar{X}$, conclude that μ and $\bar{x}$ are within 1.96 standard errors ($\sigma / \sqrt{n}$) of each other 95% of the time.
- Otherwise stated, 95% of the intervals contain μ.
- So the interval $\bar{x} \pm 1.96 \, \sigma / \sqrt{n}$ can be taken as an interval that typically would include μ.

Example
- A random sample of 80 tablets had an average potency of 15 mg. Assume σ is known to be 4 mg.
- $\bar{x} = 15$, $\sigma = 4$, $n = 80$
- A 95% confidence interval for μ is $15 \pm 1.96 \times 4 / \sqrt{80} = (14.12, 15.88)$.
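
The tablet example can be reproduced with a short Python sketch. It assumes, as the slides do, that the sample mean is normally distributed and that σ is known. The function name `mean_confidence_interval` is hypothetical, and the multiplier is taken from the standard library's `NormalDist` rather than from a table, so it is 1.95996 rather than the rounded 1.96.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sigma, n, level=0.95):
    """Confidence interval for the population mean when sigma is known.

    Returns (low, high) = sample_mean -/+ z * sigma / sqrt(n), where z is the
    two-sided standard normal multiplier for the requested confidence level.
    """
    if sigma < 0 or n <= 0 or not 0.0 < level < 1.0:
        raise ValueError("need sigma >= 0, n > 0 and 0 < level < 1")
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    margin = z * sigma / math.sqrt(n)  # z standard errors on each side
    return sample_mean - margin, sample_mean + margin


if __name__ == "__main__":
    # Tablet example: x-bar = 15 mg, sigma = 4 mg, n = 80.
    low, high = mean_confidence_interval(15.0, 4.0, 80)
    print(f"95% CI: ({low:.2f}, {high:.2f})")  # prints 95% CI: (14.12, 15.88)
```

Passing a different `level`, such as 0.98, widens the interval, in line with the earlier point that a higher confidence level gives a wider confidence interval.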