Chapter 6 Statistical Inference

From now on, we will use probability theory only to find answers to the questions arising from specific problems we are working on. In this chapter we want to draw inferences about some characteristic of an underlying population - e.g. the average height of a person. Instead of measuring this characteristic for each individual, we draw a sample, i.e. choose a "suitable" subset of the population and measure the characteristic only for those individuals. Using probabilistic arguments we can then extend the information we got from that sample and make an estimate of the characteristic for the whole population. Probability theory gives us the means to find those estimates and to measure how "probable" our estimates are.

Of course, choosing the sample is crucial. We will demand two properties from a sample:
• the sample should be representative - taking only basketball players into the sample would change our estimate of a person's height drastically.
• with a large number in the sample we should come close to the "true" value of the characteristic.

The three main areas of statistics are
• estimation of parameters: point or interval estimates: "my best guess for value x is ...", "my guess is that value x is in interval (a, b)"
• evaluation of plausibility of values: hypothesis testing
• prediction of future (individual) values

6.1 Parameter Estimation

Statistics are all around us - scores in sports, prices at the grocers, weather reports (and how often they turn out to be close to the actual weather), taxes, evaluations ... The most basic form of statistics are descriptive statistics. But - what exactly is a statistic? Here is the formal definition:

Definition 6.1.1 (Statistic) Any function W(x1, ..., xk) of observed values x1, ..., xk is called a statistic.

Some statistics you already know are:
Mean (Average): X̄ = (1/n) Σi Xi
Minimum: X(1) (parentheses indicate that the values are sorted)
Maximum: X(n)
Range: X(n) − X(1)
Mode: value(s) that appear(s) most often
Median: "middle value" - that value for which one half of the data is larger, the other half is smaller. If n is odd, the median is X((n+1)/2); if n is even, the median is the average of the two middle values: 0.5 · X(n/2) + 0.5 · X(n/2+1).

For this section it is important to distinguish between xi and Xi properly. If not stated otherwise, any capital letter denotes some random variable, a small letter describes a realization of this random variable, i.e. what we have observed. xi therefore is a real number, Xi is a function that assigns a real number to an event from the sample space.

Definition 6.1.2 (Estimator) Let X1, ..., Xk be k i.i.d. random variables with distribution Fθ with (unknown) parameter θ. A statistic Θ̂ = Θ̂(X1, ..., Xk) used to estimate the value of θ is called an estimator of θ. θ̂ = Θ̂(x1, ..., xk) is called an estimate of θ.

Desirable properties of estimates:
• Unbiasedness, i.e. the expected value of the estimator is the true parameter: E[Θ̂] = θ.
• Efficiency: for two estimators Θ̂1 and Θ̂2 of the same parameter θ, Θ̂1 is said to be more efficient than Θ̂2 if Var[Θ̂1] < Var[Θ̂2].
• Consistency: if we have a larger sample size n, we want the estimate θ̂ to be closer to the true parameter θ: limn→∞ P(|Θ̂ − θ| > ε) = 0.

(The accompanying diagrams illustrate these properties: estimates from an unbiased estimator scatter around the true value, a more efficient estimator scatters less than a less efficient one, and a consistent estimator scatters more tightly around the true value for n = 10000 than for n = 100.)

Example 6.1.1 Let X1, ..., Xn be n i.i.d. random variables with E[Xi] = µ. Then X̄ = (1/n) Σi=1..n Xi is an unbiased estimator of µ, because

E[X̄] = (1/n) Σi=1..n E[Xi] = (1/n) · n · µ = µ.
OK - so once we have an estimator, we can decide whether it has these properties. But how do we find estimators?

6.1.1 Maximum Likelihood Estimation

Situation: we have n data values x1, ..., xn. The assumption is that these data values are realizations of n i.i.d. random variables X1, ..., Xn with distribution Fθ. Unfortunately the value of θ is unknown.

(Figure: observed values x1, x2, x3, ... on the x-axis, together with three candidate densities fθ for θ = 0, θ = −1.8 and θ = 1.)

By changing the value of θ we can "move the density function fθ around" - in the diagram, the third density function fits the data best.

Principle: since we do not know the true value θ of the distribution, we take that value θ̂ that most likely produced the observed values, i.e. we maximize something like

P(X1 = x1 ∩ X2 = x2 ∩ ... ∩ Xn = xn) = (Xi are independent!) = P(X1 = x1) · P(X2 = x2) · ... · P(Xn = xn) = Πi=1..n P(Xi = xi)   (*)

This is not quite the right way to write the probability if X1, ..., Xn are continuous variables. (Remember: P(X = x) = 0 for a continuous variable X; this is still valid.) We use the above "probability" just as a plausibility argument. To get around the problem that P(X = x) = 0 for a continuous variable, we will write (*) as:

Πi=1..n pθ(xi)  for discrete Xi    and    Πi=1..n fθ(xi)  for continuous Xi,

where pθ is the probability mass function of the discrete Xi (all Xi have the same, since they are identically distributed) and fθ is the density function of the continuous Xi. Both these functions depend on θ. In fact, we can read the above expressions as a function of θ. This function, which we will denote by L(θ), is called the Likelihood function of X1, ..., Xn.

The goal is now to find a value θ̂ that maximizes the Likelihood function. (This is what "moves" the density to the right spot, so that it fits the observed values well.) How do we get a maximum of L(θ)? - the usual way we maximize a function: differentiate it and set the derivative to zero!
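Before carrying the differentiation out by hand, here is a small numerical sketch of the principle (the data values are hypothetical, and the Poisson model is just one convenient choice - its closed-form ML solution appears in Example 6.1.3): instead of differentiating, we can simply evaluate the log-likelihood on a grid of candidate values for θ and pick the best one.

```python
import math

# Hypothetical sample, assumed to be realizations of i.i.d. Poisson(lam) variables.
data = [3, 1, 4, 2, 2, 5, 3, 0, 2, 4]

def log_likelihood(lam, xs):
    """log L(lam) = sum of log Poisson pmfs: -lam + x*ln(lam) - ln(x!)."""
    return sum(-lam + x * math.log(lam) - math.lgamma(x + 1) for x in xs)

# Crude numerical maximization: evaluate log L on a fine grid of candidates.
candidates = [k / 1000 for k in range(1, 10000)]  # 0.001 .. 9.999
lam_hat = max(candidates, key=lambda lam: log_likelihood(lam, data))

# For the Poisson model the ML estimate turns out to be the sample mean,
# so the grid search lands on x-bar = 2.6.
print(lam_hat)  # → 2.6
```

The grid search is of course much cruder than calculus, but it makes the principle visible: we are literally sliding the density around until it fits the observed values best.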
(After that, we ought to check with the second derivative whether we've actually found a maximum, but we won't do that unless we've found more than one possible value for θ̂.)

Most of the time it is difficult to find a derivative of L(θ) - instead we use another trick, and find a maximum of log L(θ), the Log-Likelihood function. Note: though its name is "log", we use the natural logarithm ln.

The plan to find an ML-estimator is:
1. Find the Likelihood function L(θ).
2. Take the natural log of the Likelihood function, log L(θ).
3. Differentiate the log-Likelihood function with respect to θ.
4. Set the derivative to zero.
5. Solve for θ.

Example 6.1.2 Roll a Die
A die is rolled until its face shows a 6. Repeating this experiment 100 times gave the following results (k = number of rolls until the first 6, #runs = number of runs in which this k was observed; a histogram of these counts accompanied the table):

k:     1   2   3   4   5   6   7   8   9   11  14  15  16  17  20  21  27  29
#runs: 18  20  8   9   9   5   8   3   5   3   3   3   1   1   1   1   1   1

We know that k, the number of rolls until a 6 shows up, has a geometric distribution Geop. For a fair die, p is 1/6. The geometric distribution has probability mass function p(k) = (1 − p)^(k−1) · p. What is the ML-estimate p̂ for p?

1. Likelihood function L(p): since we have observed 100 outcomes k1, ..., k100, the likelihood function is

L(p) = Πi=1..100 p(ki) = Πi=1..100 (1 − p)^(ki−1) · p = p^100 · Πi=1..100 (1 − p)^(ki−1) = p^100 · (1 − p)^(Σi ki − 100).

2. Log of the Likelihood function, log L(p):

log L(p) = log( p^100 · (1 − p)^(Σi ki − 100) ) = log p^100 + log (1 − p)^(Σi ki − 100) = 100 log p + (Σi=1..100 ki − 100) log(1 − p).

3. Differentiate the log-Likelihood with respect to p:

d/dp log L(p) = 100 · (1/p) + (Σi ki − 100) · (−1/(1 − p))
             = (1/(p(1 − p))) · ( 100(1 − p) − (Σi ki − 100) p )
             = (1/(p(1 − p))) · ( 100 − p Σi=1..100 ki ).

4. Set the derivative to zero.
For the estimate p̂ the derivative must be zero:

d/dp log L(p̂) = 0  ⇐⇒  (1/(p̂(1 − p̂))) · ( 100 − p̂ Σi=1..100 ki ) = 0

5. Solve for p̂:

(1/(p̂(1 − p̂))) · ( 100 − p̂ Σi ki ) = 0  ⇐⇒  100 − p̂ Σi ki = 0  ⇐⇒  p̂ = 100 / Σi=1..100 ki.

In total, we have an estimate p̂ = 100/568 = 0.1761.

Example 6.1.3 Red Cars in the Parking Lot
The values 3, 2, 3, 3, 4, 1, 4, 2, 4, 3 have been observed while counting the numbers of red cars pulling into parking lot #22 between 8:30 - 8:40 am, Mo to Fr, during two weeks. The assumption is that these values are realizations of ten independent Poisson variables with (the same) rate λ. What is the Maximum Likelihood estimate of λ?

The probability mass function of a Poisson distribution is pλ(x) = e^(−λ) · λ^x / x!. We have ten values xi; this gives a Likelihood function:

L(λ) = Πi=1..10 e^(−λ) · λ^(xi) / xi! = e^(−10λ) · λ^(Σi xi) · Πi=1..10 (1/xi!)

The log-Likelihood then is

log L(λ) = −10λ + ln(λ) · Σi=1..10 xi − Σi ln(xi!).

Differentiating the log-Likelihood with respect to λ gives:

d/dλ log L(λ) = −10 + (1/λ) · Σi=1..10 xi.

Setting it to zero:

(1/λ̂) · Σi=1..10 xi = 10  ⇐⇒  λ̂ = (1/10) Σi=1..10 xi = 29/10 = 2.9.

This gives us an estimate for λ - and since λ is also the expected value of the Poisson distribution, we can say that on average the number of red cars pulling into the parking lot each morning between 8:30 and 8:40 am is 2.9.

ML-estimators for µ and σ² of a Normal distribution
Let X1, ..., Xn be n independent, identically distributed normal variables with E[Xi] = µ and Var[Xi] = σ². µ and σ² are unknown.
The normal density function fµ,σ² is

fµ,σ²(x) = 1/√(2πσ²) · e^(−(x−µ)²/(2σ²)).

Since we have n independent variables, the Likelihood function is a product of n densities:

L(µ, σ²) = Πi=1..n 1/√(2πσ²) · e^(−(xi−µ)²/(2σ²)) = (2πσ²)^(−n/2) · e^(−Σi=1..n (xi−µ)²/(2σ²)).

Log-Likelihood:

log L(µ, σ²) = −(n/2) ln(2πσ²) − (1/(2σ²)) Σi=1..n (xi − µ)².

Since we now have two parameters, µ and σ², we need two partial derivatives of the log-Likelihood:

∂/∂µ log L(µ, σ²) = −(1/(2σ²)) Σi=1..n 2(xi − µ)·(−1) = (1/σ²) Σi=1..n (xi − µ)

∂/∂σ² log L(µ, σ²) = −n/(2σ²) + (1/(2(σ²)²)) Σi=1..n (xi − µ)².

We must now find values for µ and σ² that yield zeros for both derivatives at the same time. Setting ∂/∂µ log L(µ, σ²) = 0 gives

µ̂ = (1/n) Σi=1..n xi,

and plugging this value into the derivative for σ² and setting ∂/∂σ² log L(µ̂, σ²) = 0 gives

σ̂² = (1/n) Σi=1..n (xi − µ̂)².

6.2 Confidence intervals

The previous section has provided a way to compute point estimates for parameters. Based on that, our next question is: how good is this point estimate? Or: how close is the estimate to the true value of the parameter? Instead of just looking at the point estimate, we will now try to compute an interval around the estimated parameter value in which the true parameter is "likely" to fall. An interval like that is called a confidence interval.

Definition 6.2.1 (Confidence Interval) Let θ̂ be an estimate of θ. If P(|θ̂ − θ| < e) > α, we say that the interval (θ̂ − e, θ̂ + e) is an α · 100% confidence interval of θ (cf. fig. 6.1). Usually, α is a value near 1, such as 0.9, 0.95, 0.99, 0.999, etc.

Note:
• for any given set of values x1, ..., xn the value of θ̂ is fixed, as is the interval (θ̂ − e, θ̂ + e).
• The true value θ is either within the confidence interval or not.

Figure 6.1: The probability that x̄ falls into an e-interval around µ is α. (Figure: density of x̄ around µ with P(µ − e < x̄ < µ + e) > α shaded; the corresponding confidence interval for µ is drawn below.)
Vice versa, we know that for all of those x̄, µ is within an e-interval around x̄. That's the idea of a confidence interval.

!!DON'T DO!! A lot of people are tempted to reformulate the above probability to:

P(θ̂ − e < θ < θ̂ + e) > α

Though it looks ok, it's not. Repeat: IT IS NOT OK. θ is a fixed value - therefore it does not have a probability to fall into some interval. The only probability that we have here is P(θ − e < θ̂ < θ + e) > α; we can therefore say that θ̂ has a probability of at least α to fall into an e-interval around θ. Unfortunately, that doesn't help at all, since we do not know θ!

How do we compute confidence intervals, then? That's different for each estimator. First, we look at estimates of a mean of a distribution:

6.2.1 Large sample C.I. for µ

Situation: we have a large set of observed values (n > 30, usually). The assumption is that these values are realizations of n i.i.d. random variables X1, ..., Xn with E[Xi] = µ and Var[Xi] = σ².

We already know from the previous section that X̄ is an unbiased ML-estimator for µ. But we know more! The CLT tells us that in exactly this situation X̄ is an approximately normally distributed random variable with E[X̄] = µ and Var[X̄] = σ²/n. We therefore can find the boundary e by using the standard normal distribution.
Remember: if X̄ ∼ N(µ, σ²/n), then Z := (X̄ − µ)/(σ/√n) ∼ N(0, 1) = Φ:

P(|X̄ − µ| ≤ e) ≥ α
⇐⇒ P( |X̄ − µ|/(σ/√n) ≤ e/(σ/√n) ) ≥ α    (standardization)
⇐⇒ P( |Z| ≤ e/(σ/√n) ) ≥ α
⇐⇒ P( −e/(σ/√n) ≤ Z ≤ e/(σ/√n) ) ≥ α
⇐⇒ Φ( e/(σ/√n) ) − Φ( −e/(σ/√n) ) ≥ α
⇐⇒ Φ( e/(σ/√n) ) − ( 1 − Φ( e/(σ/√n) ) ) ≥ α
⇐⇒ 2Φ( e/(σ/√n) ) − 1 ≥ α
⇐⇒ Φ( e/(σ/√n) ) ≥ (1 + α)/2
⇐⇒ e/(σ/√n) ≥ Φ⁻¹( (1 + α)/2 )
⇐⇒ e ≥ Φ⁻¹( (1 + α)/2 ) · σ/√n,   where we set z := Φ⁻¹( (1 + α)/2 ).

This computation gives an α · 100% confidence interval around µ as:

( X̄ − z · σ/√n , X̄ + z · σ/√n )

Now we can do an example:

Example 6.2.1 Suppose we want to find a 95% confidence interval for the mean salary of an ISU employee. A random sample of 100 ISU employees gives us a sample mean salary of x̄ = $21543. Suppose the standard deviation of salaries is known to be $3000. By using the above expression, we get a 95% confidence interval as:

21543 ± Φ⁻¹( (1 + 0.95)/2 ) · 3000/√100 = 21543 ± Φ⁻¹(0.975) · 300

How do we read Φ⁻¹(0.975) from the standard normal table? We look for which z the probability Φ(z) ≥ 0.975. This gives us z = 1.96; the 95% confidence interval is then 21543 ± 588, i.e. if we repeat this study 100 times (with 100 different employees each time), we can say: in 95 out of 100 studies, the true parameter µ falls into a $588 range around x̄.

Critical values for z, depending on α, are:

α                   0.90  0.95  0.98  0.99
z = Φ⁻¹((1+α)/2)    1.65  1.96  2.33  2.58

Problem: usually we do not know σ.
Slight generalization: use s = √( (1/(n−1)) Σi=1..n (Xi − X̄)² ) instead of σ! An α · 100% confidence interval for µ is given as

( X̄ − z · s/√n , X̄ + z · s/√n )

where z = Φ⁻¹((1+α)/2).

Example 6.2.2 Suppose we want to analyze some complicated queueing system, for which we have no formulas and theory. We are interested in the mean queue length of the system after reaching steady state. The only thing possible for us is to run simulations of this system and look at the queue length at some large time t, e.g. t = 1000 hrs.
After 50 simulations, we have got data:
X1 = number in queue at time 1000 hrs in 1st simulation
X2 = number in queue at time 1000 hrs in 2nd simulation
...
X50 = number in queue at time 1000 hrs in 50th simulation

Our observations yield an average queue length of x̄ = 21.5 and s = √( (1/(n−1)) Σi=1..n (xi − x̄)² ) = 15. A 90% confidence interval is given as

( x̄ − z · s/√n , x̄ + z · s/√n ) = ( 21.5 − 1.65 · 15/√50 , 21.5 + 1.65 · 15/√50 ) = (17.9998, 25.0002).

Example 6.2.3 The graphs show a set of 80 experiments. The values from each experiment are shown in one of the green framed boxes. Each experiment consists of simulating 20 values from a standard normal distribution (these are drawn as the small blue lines). For each of the experiments, the average of the 20 values is computed (that's x̄) as well as a confidence interval for µ - for parts a) and b) it's the 95% confidence interval, for part c) it is the 90% confidence interval, for part d) it is the 99% confidence interval. The upper and the lower confidence bound together with the sample mean are drawn in red next to the sampled observations.

(Figure: four panels - a) 95% confidence intervals, b) 95% confidence intervals, c) 90% confidence intervals, d) 99% confidence intervals.)

There are several things to see from this diagram. First of all, in this example we know the "true" value of the parameter µ - since the observations are sampled from a standard normal distribution, µ = 0. The true parameter is represented by the straight horizontal line through 0. We see that each sample yields a different confidence interval; all of them are centered around the sample mean. The different sizes of the intervals tell us another thing: in computing these confidence intervals, we had to use the estimate s instead of the true standard deviation σ = 1. Each sample gave a slightly different standard deviation.
Overall, though, the intervals are not very different in length between parts a) and b). The intervals in c) tend to be slightly smaller - these are 90% confidence intervals - whereas the intervals in part d) are on average larger than the first ones; they are 99% confidence intervals.

Almost all the confidence intervals contain 0 - but not all. And that is what we expect. For a 90% confidence interval we expect that in 10 out of 100 times the confidence interval does not contain the true parameter. When we check that, we see that in part c) 4 out of the 20 confidence intervals don't contain the true parameter µ - that's 20%; on average we would expect 10% of the confidence intervals not to contain µ.

Official use of Confidence Intervals: In an average of 90 out of 100 times the 90% confidence interval of θ does contain the true value of θ.

6.2.2 Large sample confidence intervals for a proportion p

Let p be a proportion of a large population or a probability. In order to get an estimate for this proportion, we can take a sample of n individuals from the population and check each one of them, whether or not they fulfill the criterion to be in that proportion of interest. Mathematically, this corresponds to a Bernoulli-n-sequence, where we are only interested in the number of "successes", X, which in our case corresponds to the number of individuals that qualify for the interesting subgroup. X then has a Binomial distribution with parameters n and p.

Now think: for a Binomial variable X, the expected value is E[X] = n · p. Therefore we get an estimate p̂ for p as p̂ = X/n. Furthermore, we even have a distribution for p̂ for large n: since X is, using the CLT, approximately a normal variable with E[X] = np and Var[X] = np(1 − p), we get that for large n, p̂ is approximately normally distributed with E[p̂] = p and Var[p̂] = p(1 − p)/n. BTW: this tells us that p̂ is an unbiased estimator of p.
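These two facts - E[p̂] = p and Var[p̂] = p(1 − p)/n - can be illustrated with a small simulation (the numbers n, p and the number of runs here are hypothetical choices, not from the text):

```python
import random
import statistics

random.seed(1)
n, p, runs = 400, 0.3, 5000  # hypothetical sample size, true proportion, repetitions

def binomial(n, p):
    """Number of successes in n Bernoulli(p) trials."""
    return sum(random.random() < p for _ in range(n))

# Repeat the whole sampling experiment many times and collect the estimates p-hat.
p_hats = [binomial(n, p) / n for _ in range(runs)]

print(statistics.mean(p_hats))      # should be close to p = 0.3
print(statistics.variance(p_hats))  # should be close to p(1-p)/n = 0.000525
```

Each run plays the role of one sample of n individuals; across many runs the estimates p̂ scatter around the true p with roughly the variance the CLT predicts.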
Prepared with the distribution of p̂, we can set up an α · 100% confidence interval as

(p̂ − e, p̂ + e)

where e is some positive real number with P(|p̂ − p| ≤ e) ≥ α. We can derive the expression for e in the same way as in the previous section, and we come up with:

e = z · √( p(1 − p)/n )

where z = Φ⁻¹((1+α)/2). We also run into the problem that e in this form is not ready for use, since we do not know the value of p. In this situation we have different options: we can either find a value that maximizes the value p(1 − p), or we can substitute an appropriate value for p.

6.2.2.1 Conservative Method: replace p(1 − p) by something that's guaranteed to be at least as large. The function p(1 − p) has a maximum at p = 0.5; p(1 − p) is then 0.25.

The conservative α · 100% confidence interval for p is

p̂ ± z · 1/(2√n)

where z = Φ⁻¹((1+α)/2).

6.2.2.2 Substitution Method: substitute p̂ for p, then:

The α · 100% confidence interval for p by substitution is

p̂ ± z · √( p̂(1 − p̂)/n )

where z = Φ⁻¹((1+α)/2).

Where is the difference between the two methods?
• for large n there is almost no difference at all
• if p̂ is close to 0.5, there is also almost no difference

Besides that, conservative confidence intervals (as the name says) are larger than confidence intervals found by substitution. However, they are at the same time easier to compute.

Example 6.2.4 Complicated queueing system, continued
Suppose that now we are interested in the large-t probability p that a server is available. Doing 100 simulations has shown that in 60 of them a server was available at time t = 1000 hrs. What is a 95% confidence interval for this probability?
If 60 out of 100 simulations showed a free server, we can use p̂ = 60/100 = 0.6 as an estimate for p. For a 95% confidence interval, z = Φ⁻¹(0.975) = 1.96. The conservative confidence interval is:

p̂ ± z · 1/(2√n) = 0.6 ± 1.96 · 1/(2√100) = 0.6 ± 0.098.

For the confidence interval using substitution we get:
p̂ ± z · √( p̂(1 − p̂)/n ) = 0.6 ± 1.96 · √( (0.6 · 0.4)/100 ) = 0.6 ± 0.096.

Example 6.2.5 Batting Average
In the 2002 season the baseball player Sammy Sosa had a batting average of 0.288. (The batting average is the ratio of the number of hits and the times at bat.) Sammy Sosa was at bat 555 times in the 2002 season. Could the "true" batting average still be 0.300? Compute a 95% confidence interval for the true batting average.

The Conservative Method gives:

0.288 ± 1.96 · 1/(2√555) = 0.288 ± 0.042

The Substitution Method gives:

0.288 ± 1.96 · √( 0.288 · (1 − 0.288)/555 ) = 0.288 ± 0.038

The substitution method gives a slightly smaller confidence interval, but both intervals contain 0.3. There is not enough evidence to allow the conclusion that the true average is not 0.3.

Confidence intervals give a way to measure the precision we get from simulations intended to evaluate probabilities. But besides that, they also give us a way to plan how large a sample size has to be to get a desired precision.

Example 6.2.6 Suppose we want to estimate the fraction of records in the 2000 IRS data base that have a taxable income over $35 K. We want to get a 98% confidence interval and wish to estimate the quantity to within 0.01. This means that our boundary e needs to be smaller than 0.01 (we'll choose a conservative confidence interval for ease of computation):

e ≤ 0.01
⇐⇒ z · 1/(2√n) ≤ 0.01           (for α = 0.98, z is 2.33)
⇐⇒ 2.33 · 1/(2√n) ≤ 0.01
⇐⇒ √n ≥ 2.33/(2 · 0.01) = 116.5
⇒ n ≥ 13573

6.2.3 Related C.I. Methods

Related to the previous confidence intervals are confidence intervals for the difference between two means, µ1 − µ2, or the difference between two proportions, p1 − p2.
Confidence intervals for these differences are given as:

Large-n confidence interval for µ1 − µ2 (based on independent X̄1 and X̄2):

X̄1 − X̄2 ± z · √( s1²/n1 + s2²/n2 )

Large-n confidence interval for p1 − p2 (based on independent p̂1 and p̂2):

p̂1 − p̂2 ± z · (1/2) · √( 1/n1 + 1/n2 )   (conservative)

or

p̂1 − p̂2 ± z · √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 )   (substitution)

Why? The argument in both cases is very similar - we will only discuss the confidence interval for the difference between means. X̄1 − X̄2 is approximately normal, since X̄1 and X̄2 are approximately normal, with (X̄1, X̄2 are independent):

E[X̄1 − X̄2] = E[X̄1] − E[X̄2] = µ1 − µ2
Var[X̄1 − X̄2] = Var[X̄1] + (−1)² Var[X̄2] = σ1²/n1 + σ2²/n2

Then we can use the same arguments as before and get a C.I. for µ1 − µ2 as shown above.

Example 6.2.7 Assume we have two parts of the IRS database: East Coast and West Coast. We want to compare the mean taxable income reported from the two regions in 2000.

                        East Coast     West Coast
# of sampled records:   n1 = 1000      n2 = 2000
mean taxable income:    x̄1 = $37200   x̄2 = $42000
standard deviation:     s1 = $10100    s2 = $15600

We can, for example, compute a 2-sided 95% confidence interval for µ1 − µ2 = difference in mean taxable income as reported in 2000 tax returns between East and West Coast:

37200 − 42000 ± 1.96 · √( 10100²/1000 + 15600²/2000 ) = −4800 ± 927

Note: this shows pretty conclusively that the mean West Coast taxable income is higher than the mean East Coast taxable income (in the reports from 2000). The interval contains only negative numbers - if it contained 0, the message wouldn't be so clear.

One-sided intervals
Idea: use only one of the end points of x̄ ± z · s/√n. This yields confidence intervals for µ of the form (−∞, #) (upper bound) or (##, ∞) (lower bound). However, now we need to adjust z to the new situation.
Instead of worrying about two tails of the normal distribution, for a one-sided confidence interval we use only one tail.

Figure 6.2: One-sided (upper bounded) confidence interval for µ (in red); the tail probability above x̄ + e is at most 1 − α.

Example 6.2.8 Complicated queueing system, continued
What is a 95% upper confidence bound for µ, the mean length of the queue? x̄ + z · s/√n is the upper confidence bound. Instead of z = Φ⁻¹((α+1)/2) we use z = Φ⁻¹(α) (see fig. 6.2). This gives 21.5 + 1.65 · 15/√50 = 25.0 as the upper confidence bound. Therefore the one-sided, upper bounded confidence interval is (−∞, 25.0).

Critical values z = Φ⁻¹(α) for the one-sided confidence interval are:

α            0.90  0.95  0.98  0.99
z = Φ⁻¹(α)   1.29  1.65  2.06  2.33

Example 6.2.9 Two different digital communication systems send 100 large messages each, and we determine how many are corrupted in transmission: p̂1 = 0.05 and p̂2 = 0.10. What's the difference in the corruption rates? Find a 98% confidence interval:

0.05 − 0.10 ± 2.33 · √( (0.05 · 0.95)/100 + (0.10 · 0.90)/100 ) = −0.05 ± 0.086

This calculation tells us that, based on these sample sizes, we don't even have a solid idea about the sign of p1 − p2, i.e. we can't tell which of the pi is larger.

So far we have only considered large sample confidence intervals. The problem with smaller sample sizes is that the normal approximation of the CLT doesn't work well, especially if the standard deviation σ is unknown and must be estimated. What you need to know is that there exist different methods to compute C.I.s for smaller sample sizes.

6.3 Hypothesis Testing

Example 6.3.1 Tea Tasting Lady
It is claimed that a certain lady is able to tell, by tasting a cup of tea with milk, whether the milk was put in first or the tea was put in first. To put the claim to the test, the lady is given 10 cups of tea to taste and is asked to state in each case whether the milk went in first or the tea went in first.
To guard against deliberate or accidental communication of information, before pouring each cup of tea a coin is tossed to decide whether the milk goes in first or the tea goes in first. The person who brings the cup of tea to the lady does not know the outcome of the coin toss. Either the lady has some skill (she can tell, to some extent, the difference) or she has not, in which case she is simply guessing.

Suppose the lady tested 10 cups of tea in this manner and got 9 of them right. This looks rather suspicious; the lady seems to have some skill. But how can we check it? We start with the sceptical assumption that the lady does not have any skill. If the lady has no skill at all, the probability that she gives a correct answer for any single cup of tea is 1/2. The number of cups she gets right therefore has a Binomial distribution with parameters n = 10 and p = 0.5.

(Figure: probability mass function p(x) of the Bin(10, 0.5) distribution, with the observed value x = 9 marked.)

Events that are as unlikely or less likely than the observed one are that the lady got all 10 cups right, or - very different, but nevertheless very rare - that she got only 1 cup or none right (note, this would be evidence of some "anti-skill", but it would certainly be evidence against her guessing). The total probability of these events is (remember, the binomial probability mass function is p(x) = (n choose x) · p^x · (1 − p)^(n−x)):

p(0) + p(1) + p(9) + p(10) = 0.5^10 + 10 · 0.5^10 + 10 · 0.5^10 + 0.5^10 = 0.021,

i.e. what we have just observed is a fairly rare event under the assumption that the lady is only guessing. This suggests that the lady may have some skill in detecting which was poured first into the cup.

Jargon: 0.021 is called the p-value for testing the hypothesis p = 0.5. The fact that the p-value is small is evidence against the hypothesis.

Hypothesis testing is a formal procedure to check whether or not some - previously made - assumption can be rejected based on the data.
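The p-value computation from the tea-tasting example can be reproduced in a few lines (a sketch; all numbers follow the example above):

```python
from math import comb

# Two-sided p-value for the tea-tasting lady: probability, under pure
# guessing (Bin(10, 0.5)), of a result at least as extreme as 9 correct.
n, p = 10, 0.5

def pmf(x):
    """Binomial probability mass function p(x) = C(n,x) p^x (1-p)^(n-x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Outcomes as unlikely or less likely than the observed 9: 0, 1, 9, 10.
p_value = pmf(0) + pmf(1) + pmf(9) + pmf(10)
print(round(p_value, 3))  # → 0.021
```

The exact value is 22/1024 ≈ 0.0215, matching the 0.021 in the example.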
We are going to abstract the main elements of the previous example and cook up a standard series of steps for hypothesis testing:

Example 6.3.2 University CC administrators have historical records that indicate that between August and Oct 2002 the mean time between hits on the ISU homepage was 2 min. They suspect that in fact the mean time between hits has decreased (i.e. traffic is up) - sampling 50 inter-arrival times from records for November 2002 gives x̄ = 1.7 min and s = 1.9 min. Is this strong evidence for an increase in traffic?

Formal Procedure (with its application to the example):

1. State a "null hypothesis" of the form H0: function of parameter(s) = #, meant to embody a status quo / pre-data view.
   Here: H0: µ = 2.0 min between hits.

2. State an "alternative hypothesis" of the form Ha: function of parameter(s) >, < or ≠ #, meant to identify a departure from H0.
   Here: Ha: µ < 2 (traffic is up).

3. State test criteria - these consist of a test statistic, a "reference distribution" giving the behavior of the test statistic if H0 is true, and the kinds of values of the test statistic that count as evidence against H0.
   Here: the test statistic will be Z = (X̄ − 2.0)/(s/√n). The reference density will be standard normal; large negative values of Z count as evidence against H0 in favor of Ha.

4. Show computations.
   Here: the sample gives z = (1.7 − 2.0)/(1.9/√50) = −1.12.

5. Report and interpret a p-value = "observed level of significance, with which H0 can be rejected". This is the probability of an observed value of the test statistic at least as extreme as the one at hand. The smaller this value is, the less likely it is that H0 is true.
   Here: the p-value is P(Z ≤ −1.12) = Φ(−1.12) = 0.1314. This value is not terribly small - the evidence of a decrease in mean time between hits is somewhat weak.

Note aside: a 90% confidence interval for µ is
x̄ ± 1.65 · s/√n = 1.7 ± 0.44

This interval contains the hypothesized value of µ = 2.0.

There are four basic hypothesis tests of this form, testing a mean, a proportion, or differences between two means or two proportions. Depending on the hypothesis, the test statistic will be different. Here's an overview of the tests we are going to use (in each case the reference distribution of Z is the standard normal):

H0: µ = #           Z = (X̄ − #)/(s/√n)
H0: p = #           Z = (p̂ − #)/√( #(1 − #)/n )
H0: µ1 − µ2 = #     Z = (X̄1 − X̄2 − #)/√( s1²/n1 + s2²/n2 )
H0: p1 − p2 = #     Z = (p̂1 − p̂2 − #)/√( p̂(1 − p̂)·(1/n1 + 1/n2) ),  where p̂ = (n1 p̂1 + n2 p̂2)/(n1 + n2)

Example 6.3.3 Tax fraud
Historically, IRS taxpayer compliance audits have revealed that about 5% of individuals do things on their tax returns that invite criminal prosecution. A sample of n = 1000 tax returns produces p̂ = 0.061 as an estimate of the fraction of fraudulent returns. Does this provide a clear signal of change in taxpayer behavior?

1. State null hypothesis: H0: p = 0.05
2. Alternative hypothesis: Ha: p ≠ 0.05
3. Test statistic:
   Z = (p̂ − 0.05)/√( 0.05 · 0.95/n )
   Z has, under the null hypothesis, a standard normal distribution; large values of Z - positive and negative - will count as evidence against H0.
4. Computation: z = (0.061 − 0.05)/√( 0.05 · 0.95/1000 ) = 1.59
5. p-value: P(|Z| ≥ 1.59) = P(Z ≤ −1.59) + P(Z ≥ 1.59) = 0.11
This is not a very small value; we therefore have only very weak evidence against H0.

Example 6.3.4 Life time of disk drives
n1 = 30 and n2 = 40 disk drives of 2 different designs were tested under conditions of "accelerated" stress and times to failure recorded:

Standard Design   New Design
n1 = 30           n2 = 40
x̄1 = 1205 hr     x̄2 = 1400 hr
s1 = 1000 hr      s2 = 900 hr

Does this provide conclusive evidence that the new design has a larger mean time to failure under "accelerated" stress conditions?

1.
State null hypothesis: H0: µ1 = µ2 (µ1 − µ2 = 0)
2. Alternative hypothesis: Ha: µ1 < µ2 (µ1 − µ2 < 0)
3. Test statistic:
   Z = (x̄1 − x̄2 − 0)/√( s1²/n1 + s2²/n2 )
   Z has, under the null hypothesis, a standard normal distribution; we will consider large negative values of Z as evidence against H0.
4. Computation: z = (1205 − 1400 − 0)/√( 1000²/30 + 900²/40 ) = −0.84
5. p-value: P(Z < −0.84) = 0.2005
This is not a very small value; we therefore have only very weak evidence against H0.

Example 6.3.5 Queueing systems
We have 2 very complicated queueing systems. We'd like to know whether there is a difference in the large-t probabilities of there being an available server. We do simulations for each system (each with a different random seed), and look whether at time t = 2000 there is a server available:

                        System 1         System 2
runs:                   n1 = 1000        n2 = 500
server available
at time t = 2000:       p̂1 = 551/1000   p̂2 = 303/500

How strong is the evidence of a difference between the t = 2000 availability of a server for the two systems?

1.
State the null hypothesis: H0: p1 = p2 (p1 − p2 = 0)
2. Alternative hypothesis: Ha: p1 ≠ p2 (p1 − p2 ≠ 0)
3. Preliminary: note that, if there were no difference between the two systems, a plausible estimate of the availability of a server would be the pooled proportion

   p̂ = (n1 p̂1 + n2 p̂2)/(n1 + n2) = (551 + 303)/(1000 + 500) = 0.569

   The test statistic is

   Z = (p̂1 − p̂2 − 0) / √(p̂(1 − p̂) · (1/n1 + 1/n2))

   Under the null hypothesis Z has a standard normal distribution; we will consider large values of Z, positive and negative, as evidence against H0.
4. Computation: z = (0.551 − 0.606) / √(0.569 · (1 − 0.569) · (1/1000 + 1/500)) = −2.03
5. p-value: P(|Z| > 2.03) = 0.04

This is fairly strong evidence of a real difference in the t = 2000 availabilities of a server between the two systems.

6.4 Regression

A statistical investigation only rarely focuses on the distribution of a single variable. We are often interested in comparisons among several variables, in changes in a variable over time, or in relationships among several variables. The idea of regression is that we have a vector X1, . . . , Xk of explanatory variables and try to approximate the behavior of Y by finding a function g(X1, . . . , Xk) such that Y ≈ g(X1, . . . , Xk). The simplest possible version is:

6.4.1 Simple Linear Regression (SLR)

Situation: k = 1 and Y is approximately linearly related to X, i.e. g(x) = b0 + b1 x.

Notes:
• A scatterplot of Y vs. X should show the linear relationship.
• The linear relationship may be true only after a transformation of X and/or Y, i.e. one needs to find the "right" scale for the variables. E.g. if y ≈ c · x^b, this is nonlinear in x, but it implies that

   ln y ≈ b · ln x + ln c,

so with x' := ln x and y' := ln y, i.e. on a log scale for both the x- and y-axis, one gets a linear relationship.

Example 6.4.1 (Mileage vs. Weight) Measurements on 38 1978–79 model automobiles. Gas mileage in miles per gallon as measured by Consumers' Union on a test track; weight as reported by the automobile manufacturer.
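The transformation idea above can be checked numerically: if the data follow a power law exactly, the points become collinear after taking logs. A minimal sketch (the constants c and b and all variable names are made up for illustration):

```python
import math

# Hypothetical power-law data y = c * x^b (here c = 30, b = -1), mimicking the
# mileage-vs-weight situation: nonlinear in x, but linear after a log transform.
c, b = 30.0, -1.0
xs = [1.5, 2.0, 2.5, 3.0, 3.5]
ys = [c * x**b for x in xs]

# On the transformed scale x' = ln x, y' = ln y the points satisfy
# y' = ln c + b * x', so every successive slope is (approximately) b.
xp = [math.log(x) for x in xs]
yp = [math.log(y) for y in ys]
slopes = [(yp[i + 1] - yp[i]) / (xp[i + 1] - xp[i]) for i in range(len(xs) - 1)]
print(slopes)  # each slope is b = -1, up to floating-point rounding
```

On the original scale the successive slopes (yi+1 − yi)/(xi+1 − xi) would all differ, which is exactly what the scatterplot of mpg versus weight shows.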
A scatterplot of mpg versus weight shows an inversely proportional relationship:

[Figure: scatterplot of MPG versus Weight, decreasing and nonlinear.]

Transform weight by x ↦ 1/x to weight⁻¹. A scatterplot of mpg versus weight⁻¹ reveals a linear relationship:

[Figure: scatterplot of MPG versus 1/Weight, approximately linear.]

Example 6.4.2 (Olympics – long jump) Results for the long jump for all Olympic games between 1900 and 1996 are:

year  long jump (in m)    year  long jump (in m)
1900  7.19                1960  8.12
1904  7.34                1964  8.07
1908  7.48                1968  8.90
1912  7.60                1972  8.24
1920  7.15                1976  8.34
1924  7.45                1980  8.54
1928  7.74                1984  8.54
1932  7.64                1988  8.72
1936  8.06                1992  8.67
1948  7.82                1996  8.50
1952  7.57
1956  7.83

[Figure: scatterplot of long jump versus year (coded as year − 1900), showing a roughly linear increasing trend.]

The plot shows that it is perhaps reasonable to say that y ≈ β0 + β1 x. The first issue to be dealt with in this context is: if we accept that y ≈ β0 + β1 x, how do we derive empirical values of β0, β1 from n data points (x, y)? The standard answer is the "least squares" principle:

[Figure: scatterplot with a candidate line y = b0 + b1 x and the vertical distances from the points to the line.]

In comparing lines that might be drawn through the plot we look at

Q(b0, b1) = Σᵢ₌₁ⁿ (yi − (b0 + b1 xi))²,

i.e. we look at the sum of squared vertical distances from the points to the line and attempt to minimize this sum of squares:

d/db0 Q(b0, b1) = −2 Σᵢ₌₁ⁿ (yi − (b0 + b1 xi))
d/db1 Q(b0, b1) = −2 Σᵢ₌₁ⁿ xi (yi − (b0 + b1 xi))

Setting the derivatives to zero gives the normal equations:

n b0 + b1 Σ xi = Σ yi
b0 Σ xi + b1 Σ xi² = Σ xi yi

The least squares solutions for b0 and b1 are:

b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = (Σ xi yi − (1/n) Σ xi · Σ yi) / (Σ xi² − (1/n)(Σ xi)²)   (slope)
b0 = ȳ − x̄ b1 = (1/n) Σ yi − b1 · (1/n) Σ xi   (y-intercept at x = 0)

These solutions produce the "best fitting line".
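The closed-form solutions above can be turned directly into a few lines of code; a minimal sketch (the function name and the small made-up data set are ours), using points that lie exactly on y = 1 + 2x so the fit must recover that line:

```python
def least_squares(xs, ys):
    """Slope b1 and intercept b0 from the normal equations, via the raw sums."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b1 = (sxy - sx * sy / n) / (sxx - sx * sx / n)  # slope
    b0 = sy / n - b1 * sx / n                       # intercept: ybar - b1*xbar
    return b0, b1

# Points on the exact line y = 1 + 2x: the fit recovers intercept 1, slope 2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x for x in xs]
print(least_squares(xs, ys))  # → (1.0, 2.0)
```

Because the minimized criterion Q is a sum of squared vertical distances, the fitted line always passes through the point (x̄, ȳ).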
Example 6.4.3 (Olympics – long jump, continued) Let X := year (coded as year − 1900) and Y := long jump. With n = 22 data points:

Σ xi = 1100,  Σ xi² = 74608,  Σ yi = 175.518,  Σ yi² = 1406.109,  Σ xi yi = 9079.584

The parameters for the best fitting line are:

b1 = (9079.584 − 1100 · 175.518/22) / (74608 − 1100²/22) = 0.0155 (in m per year)
b0 = 175.518/22 − (1100/22) · 0.0155 = 7.2037 (in m)

The regression equation is: long jump = 7.204 + 0.016 · year (in m).

In addition, it is useful to be able to judge how well the line describes the data, i.e. how "linear looking" a plot really is. There are a couple of means of doing this:

6.4.1.1 The sample correlation r

This is what we would get for the theoretical correlation ρ if we had random variables X and Y and their distribution:

r := Σ(xi − x̄)(yi − ȳ) / √(Σ(xi − x̄)² · Σ(yi − ȳ)²)
   = (Σ xi yi − (1/n) Σ xi · Σ yi) / √((Σ xi² − (1/n)(Σ xi)²) · (Σ yi² − (1/n)(Σ yi)²))

The numerator is the numerator of b1, and one factor under the root in the denominator is the denominator of b1. Because of its connection to ρ, the sample correlation r fulfills (it's not obvious to see, and we won't prove it):

• −1 ≤ r ≤ 1
• r = ±1 exactly when all (x, y) data pairs fall on a single straight line.
• r has the same sign as b1.

Example 6.4.4 (Olympics – long jump, continued)

r = (9079.584 − 1100 · 175.518/22) / √((74608 − 1100²/22) · (1406.109 − 175.518²/22)) = 0.8997

A second measure for goodness of fit:

6.4.1.2 Coefficient of determination R²

This is based on a comparison of the "variation accounted for" by the line versus the "raw variation" of y. The idea is that

SST = Σᵢ₌₁ⁿ (yi − ȳ)² = Σ yi² − (1/n)(Σ yi)²   (Total Sum of Squares)

is a measure for the variability of y. (It equals (n − 1) · s_y².)

[Figure: scatterplot with a horizontal line at ȳ and the vertical distances from the points to ȳ.]

After fitting the line ŷ = b0 + b1 x, one doesn't predict y as ȳ anymore and suffer the errors of prediction above, but rather only the errors ei := yi − ŷi.
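As a numerical check, r and b1 can be recomputed from the five sums quoted in Example 6.4.3; a small sketch (variable names are our own) that also illustrates the sign agreement between r and b1:

```python
import math

# Summary statistics of the long jump data (n = 22, x = year - 1900).
n, sx, sxx, sy, syy, sxy = 22, 1100.0, 74608.0, 175.518, 1406.109, 9079.584

num = sxy - sx * sy / n                 # shared numerator of r and b1
r = num / math.sqrt((sxx - sx**2 / n) * (syy - sy**2 / n))
b1 = num / (sxx - sx**2 / n)            # slope of the least squares line
print(round(r, 4), round(b1, 4))        # → 0.8997 0.0155
```

Since r and b1 share the numerator Σ(xi − x̄)(yi − ȳ) and both denominators are positive, the third property listed above (same sign) is immediate.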
So, after fitting the line,

SSE = Σᵢ₌₁ⁿ ei² = Σᵢ₌₁ⁿ (yi − ŷi)²   (Error Sum of Squares)

is a measure for the remaining (residual) error variation.

[Figure: scatterplot with the fitted line y = b0 + b1 x and the vertical distances from the points to the line.]

The fact is that SST ≥ SSE. So SSR := SST − SSE ≥ 0. SSR is taken as a measure of "variation accounted for" in the fitting of the line. The coefficient of determination R² is defined as:

R² = SSR / SST

Obviously 0 ≤ R² ≤ 1, and the closer R² is to 1, the better the linear fit.

Example 6.4.5 (Olympics – long jump, continued)
SST = Σ yi² − (1/n)(Σ yi)² = 1406.109 − 175.518²/22 = 5.81. What about SSE and SSR?

y      x   ŷ      y − ŷ    (y − ŷ)²
7.185   0  7.204  −0.019   0.000
7.341   4  7.266   0.075   0.006
7.480   8  7.328   0.152   0.023
7.601  12  7.390   0.211   0.045
7.150  20  7.513  −0.363   0.132
7.445  24  7.575  −0.130   0.017
7.741  28  7.637   0.104   0.011
7.639  32  7.699  −0.060   0.004
8.060  36  7.761   0.299   0.089
7.823  48  7.947  −0.124   0.015
7.569  52  8.009  −0.440   0.194
7.830  56  8.071  −0.241   0.058
8.122  60  8.133  −0.011   0.000
8.071  64  8.195  −0.124   0.015
8.903  68  8.257   0.646   0.417
8.242  72  8.319  −0.077   0.006
8.344  76  8.381  −0.037   0.001
8.541  80  8.443   0.098   0.010
8.541  84  8.505   0.036   0.001
8.720  88  8.567   0.153   0.024
8.670  92  8.629   0.041   0.002
8.500  96  8.691  −0.191   0.036
                  SSE =    1.107

So SSR = SST − SSE = 5.810 − 1.107 = 4.703 and R² = SSR/SST = 0.8095.

Connection between R² and r: R² = SSR/SST is the squared sample correlation of y and ŷ. If, and only if, we use a linear function of x to predict y, i.e. ŷ = b0 + b1 x, the correlation between ŷ and x is ±1. Then, and only then, R² is equal to the squared sample correlation between y and x:

R² = r² if and only if ŷ = b0 + b1 x

Example 6.4.6 (Olympics – long jump, continued) R² = 0.8095 = (0.8997)² = r².

It is possible to go beyond simply fitting a line and summarizing the goodness of fit in terms of r and R² to doing inference, i.e. making confidence intervals, predictions, . . .
based on the line fitting. But for that, we need a probability model.

6.4.2 Simple Linear Regression Model

In words: for input x the output y is normally distributed with mean β0 + β1 x = µ_{y|x} and standard deviation σ. In symbols:

yi = β0 + β1 xi + εi,   with εi i.i.d. normal N(0, σ²).

β0, β1, and σ² are the parameters of the model and have to be estimated from the data (the data pairs (xi, yi)).

[Figure: fitted line with the normal density of y given x drawn around it.]

How do we get estimates for β0, β1, and σ²?

Point estimates: β̂0 = b0 and β̂1 = b1 from the least squares fit (which gives β̂0 and β̂1 the name least squares estimates). And σ²? σ² measures the variation around the "true" line β0 + β1 x. We don't know that line, but only b0 + b1 x; should we base the estimation of σ² on this line? The "right" estimator for σ² turns out to be:

σ̂² = 1/(n − 2) · Σᵢ₌₁ⁿ (yi − ŷi)² = SSE / (n − 2)

Example 6.4.7 (Olympics – long jump, continued)

β̂0 = b0 = 7.2037 (in m)
β̂1 = b1 = 0.0155 (in m per year)
σ̂² = SSE/(n − 2) = 1.107/20 = 0.055

Overall, we assume a linear regression model of the form y = 7.2037 + 0.0155x + ε, with ε ∼ N(0, σ̂² = 0.055).
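The three point estimates of this model can be collected into one short routine; a sketch (the function name fit_slr and the four data points are our own, hypothetical choices, scattered around y = 2 + 0.5x):

```python
def fit_slr(xs, ys):
    """Least squares estimates b0, b1 plus the variance estimate SSE/(n-2)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
         / sum((x - xbar) ** 2 for x in xs)
    b0 = ybar - b1 * xbar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return b0, b1, sse / (n - 2)  # note: divide SSE by n - 2, not n

# Four hypothetical points scattered around the line y = 2 + 0.5x.
xs = [0.0, 2.0, 4.0, 6.0]
ys = [2.1, 2.9, 4.1, 4.9]
b0, b1, sigma2 = fit_slr(xs, ys)
print(b0, b1, sigma2)  # ≈ 2.06, 0.48, 0.016
```

With n = 4 points, σ̂² averages the squared residuals over n − 2 = 2 degrees of freedom, mirroring the SSE/(n − 2) formula above.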