MATH2411 Applied Statistics Tutorial Notes 4

Warm-up (Distribution of Sample Mean)

The random variable $X$, representing the number of cherries in a cherry puff, has the following probability distribution:

| $x$ | 4 | 5 | 6 | 7 |
| --- | --- | --- | --- | --- |
| $P(X = x)$ | 0.2 | 0.4 | 0.3 | 0.1 |

(a) Find the expectation $E(X)$ and the variance $Var(X)$.

\[ E(X) = 4(0.2) + 5(0.4) + 6(0.3) + 7(0.1) = 0.8 + 2 + 1.8 + 0.7 = 5.3 \]

\[ Var(X) = E(X^2) - (E(X))^2 = 4^2(0.2) + 5^2(0.4) + 6^2(0.3) + 7^2(0.1) - 5.3^2 = (3.2 + 10 + 10.8 + 4.9) - 28.09 = 0.81 \]

(b) Suppose 36 cherry puffs are to be randomly selected, and let $\bar{X}$ denote the sample mean (the average number of cherries in the 36 puffs). Find the mean $E(\bar{X})$ and the variance $Var(\bar{X})$.

\[ E(\bar{X}) = E(X) = 5.3, \qquad Var(\bar{X}) = \frac{1}{36}\,Var(X) = \frac{1}{36}(0.81) = \frac{9}{400} \]

(c) Find the probability that the average number of cherries in 36 cherry puffs will be less than 5.5.

Since $n = 36$ is large, the Central Limit Theorem says that $\bar{X}$ is approximately normal, with standard deviation

\[ \sigma_{\bar{X}} = \sqrt{\frac{9}{400}} = \frac{3}{20} = 0.15 . \]

\[ P(\bar{X} \le 5.5) = P\left(\frac{\bar{X} - 5.3}{0.15} \le \frac{5.5 - 5.3}{0.15}\right) = P\left(Z \le \frac{4}{3}\right) \approx 0.9082 \]

| Null Hypothesis | Condition | Test Statistic | Alternative Hypothesis | Rejection Criteria |
| --- | --- | --- | --- | --- |
| $H_0: \mu_X = \mu_0$ | $\sigma_X$ is known | $z_0 = \dfrac{\bar{x} - \mu_0}{\sigma_X/\sqrt{n}}$ | $H_1: \mu_X \neq \mu_0$ | $\lvert z_0 \rvert > z_{\alpha/2}$ |
| | | | $H_1: \mu_X > \mu_0$ | $z_0 > z_{\alpha}$ |
| | | | $H_1: \mu_X < \mu_0$ | $z_0 < -z_{\alpha}$ |
| $H_0: \mu_X = \mu_0$ | $\sigma_X$ is unknown | $t_0 = \dfrac{\bar{x} - \mu_0}{s_X/\sqrt{n}}$ | $H_1: \mu_X \neq \mu_0$ | $\lvert t_0 \rvert > t_{n-1,\alpha/2}$ |
| | | | $H_1: \mu_X > \mu_0$ | $t_0 > t_{n-1,\alpha}$ |
| | | | $H_1: \mu_X < \mu_0$ | $t_0 < -t_{n-1,\alpha}$ |
| $H_0: \sigma_X^2 = \sigma_0^2$ | $\mu_X$ is unknown | $\chi_0^2 = \dfrac{(n-1)s_X^2}{\sigma_0^2}$ | $H_1: \sigma_X^2 > \sigma_0^2$ | $\chi_0^2 > \chi^2_{n-1,\alpha}$ |
| | | | $H_1: \sigma_X^2 \neq \sigma_0^2$ | $\chi_0^2 > \chi^2_{n-1,\alpha/2}$ or $\chi_0^2 < \chi^2_{n-1,1-\alpha/2}$ |
| | | | $H_1: \sigma_X^2 < \sigma_0^2$ | $\chi_0^2 < \chi^2_{n-1,1-\alpha}$ |

Example 1 (Test for population mean)

The breaking strength of a fiber used in manufacturing cloth is required to be at least 160 psi. Past experience has indicated that the standard deviation of the breaking strength is 3 psi. A random sample of 40 specimens from a certain batch is tested and the average breaking strength is found to be 159.8 psi. For $\alpha = 0.05$, should this batch be judged acceptable or not?

Let $X$ be the random variable of breaking strength.
Then $\sigma_X = 3$, $n = 40$, $\bar{x} = 159.8$, $\alpha = 0.05$ and $z_\alpha = z_{0.05} = 1.645$. We test

\[ H_0: \mu_X = 160 \quad \text{against} \quad H_1: \mu_X < 160 . \]

\[ \therefore\ z_0 = \frac{159.8 - 160}{3/\sqrt{40}} = -0.421637 > -1.645 = -z_{0.05}, \]

hence we do not reject $H_0$ at the 0.05 significance level, based on the given observations. There is not enough evidence that the mean breaking strength is below 160 psi, so the batch is judged acceptable.

Example 2 (Test for population standard deviation)

A soft-drink dispensing machine is said to be out of control if the standard deviation of the contents exceeds 15 ml. If a random sample of 25 drinks from this machine has a sample standard deviation of 20.3 ml, does this indicate at the 0.05 level of significance that the machine is out of control? Assume that the contents are normally distributed.

Let $X$ be the amount of one drink from the machine. Then $s_X = 20.3$, $n = 25$, $\alpha = 0.05$ and $\chi^2_{n-1,\alpha} = \chi^2_{24,0.05} = 36.415$. We test

\[ H_0: \sigma_X = 15 \ (\sigma_X^2 = 225) \quad \text{against} \quad H_1: \sigma_X > 15 \ (\sigma_X^2 > 225) . \]

\[ \chi_0^2 = \frac{24 \cdot 20.3^2}{225} = 43.956 > 36.415 = \chi^2_{24,0.05} . \]

Hence, we have strong enough evidence, at the 0.05 significance level, to reject $H_0$ based on the given observations.

Exercise 1

Assume that the yield of alfalfa (in tons per acre) has a normal distribution with mean 1.5 and variance 0.09. It is hoped that a new fertilizer will increase the average yield. We shall use the one-sided right test with $H_0: \mu_X = 1.5$, where $\mu_X$ is the population mean of the yield with the new fertilizer. Assume that the population is still normal and that the variance continues to equal 0.09 with the new fertilizer. Determine the unknown sample size $n$ and critical value $c$ so that the Type I error probability is 0.05 and the power of the test at $\mu_X = 1.7$ is 0.95.
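The two requirements in Exercise 1 (Type I error 0.05 at $\mu_X = 1.5$, power 0.95 at $\mu_X = 1.7$) can also be solved numerically as a cross-check on the hand computation. The following is only a minimal sketch using Python's standard-library `statistics.NormalDist`; the helper name `size_and_cutoff` and its parameter names are hypothetical and not part of the notes.

```python
from statistics import NormalDist


def size_and_cutoff(mu0, mu1, sigma, alpha, power):
    """Return (n_exact, c) for a right-sided z-test of H0: mu = mu0.

    Hypothetical helper (not from the notes). It solves
        (c - mu0) / (sigma / sqrt(n)) =  z_alpha
        (c - mu1) / (sigma / sqrt(n)) = -z_power
    for the unrounded sample size n and the critical value c.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)  # upper alpha point, about 1.645
    z_power = std_normal.inv_cdf(power)      # about 1.645 when power = 0.95
    root_n = (z_alpha + z_power) * sigma / (mu1 - mu0)
    c = mu0 + z_alpha * sigma / root_n
    return root_n ** 2, c


# The variance is 0.09, so sigma = 0.3.
n_exact, c = size_and_cutoff(mu0=1.5, mu1=1.7, sigma=0.3, alpha=0.05, power=0.95)
print(round(n_exact, 2), round(c, 4))  # roughly 24.35 and 1.6
```

The unrounded sample size is about 24.35; it differs slightly from a table-based hand computation because `inv_cdf` uses 1.64485 rather than 1.645. Rounding to the nearest integer gives $n = 24$, while rounding up to 25 would be the conservative choice if both error probabilities must not exceed their targets.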
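The hypotheses and the two requirements are

\[ H_0: \mu_X = 1.5 \quad \text{and} \quad H_1: \mu_X = 1.7, \]

\[ \text{Type I error} \Rightarrow \alpha = P(\bar{X} > c \mid H_0), \qquad \text{power} \Rightarrow 1 - \beta = P(\bar{X} > c \mid H_1). \]

Type I error:

\[ P\left(\frac{\bar{X} - 1.5}{0.3/\sqrt{n}} > \frac{c - 1.5}{0.3/\sqrt{n}}\right) = 0.05 = P(Z > 1.645) \ \Rightarrow\ \frac{c - 1.5}{0.3/\sqrt{n}} = 1.645 \]

Power:

\[ P\left(\frac{\bar{X} - 1.7}{0.3/\sqrt{n}} > \frac{c - 1.7}{0.3/\sqrt{n}}\right) = 0.95 = P(Z > -1.645) \ \Rightarrow\ \frac{c - 1.7}{0.3/\sqrt{n}} = -1.645 \]

So,

\[ c - 1.5 = 1.645\,\frac{0.3}{\sqrt{n}} = -(c - 1.7) \ \Rightarrow\ c = \frac{1.5 + 1.7}{2} = 1.6 \]

and hence

\[ \sqrt{n} = \frac{1.645(0.3)}{1.6 - 1.5} = 4.935 \ \Rightarrow\ n = 4.935^2 = 24.354225 \approx 24 . \]

$\therefore\ c = 1.6$ and $n = 24$.

Let $x_1, x_2, \ldots, x_n$ be given fixed points. Let $Y_1, Y_2, \ldots, Y_n$ be the response values at $x_1, x_2, \ldots, x_n$ respectively. Under the linear assumption, we have $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where the random errors $\epsilon_i \sim N(0, \sigma^2)$ are assumed to be independent. For the observed paired data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, with $\bar{x}$ and $\bar{y}$ being the means of the $x_i$ and $y_i$ respectively, we define the following:

(1)
\[ S_{XX} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} \]
\[ S_{YY} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} \]
\[ S_{XY} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} \]

(2) $b = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \dfrac{S_{XY}}{S_{XX}}$ is the Least Square Estimate of $\beta_1$, and $a = \bar{y} - b\bar{x}$ is the Least Square Estimate of $\beta_0$.

(3) $\hat{y} = a + bx$ is the fitted regression line, where $\hat{y}_i$ is the fitted value of $Y_i$.

(4) $e_i = y_i - \hat{y}_i$ is the residual of $Y_i$.

(5) $s^2 = \dfrac{\sum_{i=1}^{n} e_i^2}{n - 2} = \dfrac{S_{YY} - b\,S_{XY}}{n - 2}$ is called the mean square error (MSE).

(6) $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$ is the Least Square Estimator of $\beta_1$, and $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}$ is the Least Square Estimator of $\beta_0$.

(7) If $\sigma$ is unknown, then the $100(1 - \alpha)\%$ confidence intervals for $\beta_1$ and $\beta_0$ are

\[ b \pm t_{n-2,\alpha/2}\, s \sqrt{\frac{1}{S_{XX}}} \quad \text{and} \quad a \pm t_{n-2,\alpha/2}\, s \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n\,S_{XX}}} \quad \text{respectively.} \]

(8) Given $x_{new}$, the value $\hat{y}_{new} = a + b\,x_{new}$ is called the predictor of $y_{new}$, and its $100(1 - \alpha)\%$ prediction interval is

\[ \hat{y}_{new} \pm t_{n-2,\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{XX}}} . \]

| Null Hypothesis | Condition | Test Statistic | Alternative Hypothesis | Rejection Criteria |
| --- | --- | --- | --- | --- |
| $H_0: \beta_1 = b_1$ | $\sigma$ is unknown | $t_0 = \dfrac{b - b_1}{s/\sqrt{S_{XX}}}$ | $H_1: \beta_1 \neq b_1$ | $\lvert t_0 \rvert > t_{n-2,\alpha/2}$ |
| | | | $H_1: \beta_1 > b_1$ | $t_0 > t_{n-2,\alpha}$ |
| | | | $H_1: \beta_1 < b_1$ | $t_0 < -t_{n-2,\alpha}$ |
| $H_0: \beta_0 = b_0$ | $\sigma$ is unknown | $t_0 = \dfrac{a - b_0}{s\sqrt{\dfrac{\sum_{i=1}^{n} x_i^2}{n\,S_{XX}}}}$ | $H_1: \beta_0 \neq b_0$ | $\lvert t_0 \rvert > t_{n-2,\alpha/2}$ |
| | | | $H_1: \beta_0 > b_0$ | $t_0 > t_{n-2,\alpha}$ |
| | | | $H_1: \beta_0 < b_0$ | $t_0 < -t_{n-2,\alpha}$ |

Example 3

A chemical engineer is investigating the effect of process operating temperature on product yield. The experiment results are listed below.

| Temperature ($x$) | 100 | 110 | 120 | 130 | 140 | 150 | 160 | 170 | 180 | 190 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Yield ($y$) | 45 | 51 | 54 | 61 | 66 | 70 | 74 | 78 | 85 | 89 |

(a) Plot the given data points on the coordinate paper.

Refer to the scanned answer at the link http://docs.wixstatic.com/ugd/68f78b_9f525f2ffee046cc98aeebc1b0705471.pdf

(b) Find the least square fitted linear regression line equation.

Let $y = a + bx$ be the line required. Then $\bar{x} = 145$, $\bar{y} = 67.3$,

\[ S_{XX} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = 2(5^2 + 15^2 + 25^2 + 35^2 + 45^2) = 8250 \]

\[ S_{XY} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \left(\sum_{i=1}^{n} x_i y_i\right) - n\,\bar{x}\,\bar{y} = 101570 - (145)(673) = 3985 \]

\[ \therefore\ b = \frac{S_{XY}}{S_{XX}} = \frac{3985}{8250} = \frac{797}{1650} = 0.48\dot{3}\dot{0} \]

and

\[ a = \bar{y} - b\bar{x} = \frac{673}{10} - \frac{797}{1650}(145) = \frac{-452}{165} = -2.7\dot{3}\dot{9} . \]

So the line equation required is: $\hat{y} = -2.739 + 0.4830x$.

(c) Use the result in (b) to predict the yield $\hat{y}_{new}$ at the temperature $x_{new} = 122^\circ$C.

\[ \hat{y}_{new} = -2.739 + 0.4830(122) = 56.19 \]

Exercise 2

In the 2013-2014 academic year, a sample of 9 students is drawn from the MATH 2411 class, and their percentage scores in the midterm ($x$) and the final examination ($y$) are as follows:

| $x$ | 77 | 50 | 71 | 72 | 81 | 94 | 96 | 99 | 67 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $y$ | 82 | 66 | 78 | 34 | 47 | 85 | 99 | 99 | 68 |

(a) Find the fitted least square regression line.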
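The computational formulas for $S_{XX}$, $S_{XY}$, $b$ and $a$ can be applied to the Exercise 2 scores directly. The sketch below is illustrative only: the function name `least_squares` and the variable names are hypothetical, and the data are the nine (midterm, final) pairs from the table, read as $(77, 82), (50, 66), \ldots, (67, 68)$.

```python
def least_squares(xs, ys):
    """Return (a, b) for the fitted line y-hat = a + b x.

    Hypothetical helper (not from the notes); it uses the
    computational forms of S_XX and S_XY.
    """
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    b = sxy / sxx                  # slope estimate S_XY / S_XX
    a = sum_y / n - b * sum_x / n  # intercept estimate y-bar - b x-bar
    return a, b


midterm = [77, 50, 71, 72, 81, 94, 96, 99, 67]
final = [82, 66, 78, 34, 47, 85, 99, 99, 68]

a, b = least_squares(midterm, final)
print(f"y-hat = {a:.3f} + {b:.4f} x")
```

The same helper works for any paired data set, for example the temperature and yield values of Example 3.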
Here, the sample size is $n = 9$, so we have:

\[ S_{XX} = \sum_{i=1}^{9} x_i^2 - \frac{\left(\sum_{i=1}^{9} x_i\right)^2}{9} = 57557 - \frac{707^2}{9} = \frac{18164}{9} \]

\[ S_{XY} = \sum_{i=1}^{9} x_i y_i - \frac{\left(\sum_{i=1}^{9} x_i\right)\left(\sum_{i=1}^{9} y_i\right)}{9} = 53258 - \frac{707(658)}{9} = \frac{14116}{9} \]

\[ b = \frac{S_{XY}}{S_{XX}} = \frac{14116}{18164} = \frac{3529}{4541} = 0.7771 \]

So we have:

\[ a = \bar{y} - b\bar{x} = \frac{658}{9} - \frac{3529}{4541}\left(\frac{707}{9}\right) = \frac{54775}{4541} = 12.062 \]

$\therefore$ the fitted least square regression line is: $\hat{y} = 12.062 + 0.7771x$.

(b) Evaluate the Mean Square Error (MSE) $s^2$.

\[ S_{YY} = \sum_{i=1}^{9} y_i^2 - \frac{\left(\sum_{i=1}^{9} y_i\right)^2}{9} = 51980 - \frac{658^2}{9} = \frac{34856}{9} \]

so

\[ s^2 = \frac{S_{YY} - b\,S_{XY}}{n - 2} = \frac{\dfrac{34856}{9} - \dfrac{3529}{4541}\left(\dfrac{14116}{9}\right)}{9 - 2} = \frac{12051748}{31787} = 379.1407808 . \]

(c) Given that Billy got 85 in the midterm, use the result in (a) to estimate his final exam score.

Billy's estimated final score is $\hat{y}_{new} = 12.062 + 0.7771(85) = 78.1155$.

(d) Find the 80% prediction interval for Billy's final exam score.

\[ \sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{XX}}} = \sqrt{1 + \frac{1}{9} + \frac{\left(85 - \frac{707}{9}\right)^2}{\frac{18164}{9}}} = \sqrt{\frac{20556}{18164}} \approx 1.063808749 \]

\[
\begin{aligned}
\text{80\% prediction interval} &= \hat{y}_{new} \pm t_{n-2,\frac{1-80\%}{2}}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{XX}}} \\
&= 78.1155 \pm t_{9-2,\,0.1}\, \sqrt{379.141}\,(1.063808749) \\
&= 78.1155 \pm 1.415\,(19.47)\,(1.06381) \\
&= 78.12 \pm 29.31 = [48.81,\ 107.43]
\end{aligned}
\]

P.S. If the score percentage cannot exceed 100 (e.g. no bonus marks), then the prediction interval would be $[48.81, \min(100, 107.43)] = [48.81, 100]$. It would be $[\max(0, 48.81), \min(100, 107.43)] = [48.81, 100]$ if, in addition, there is no negative score.

(e) Plot the given data points and the fitted regression line on the coordinate paper given.

For the plot, please refer to the past link at https://drive.google.com/file/d/0B-frzO-qxjYjcmp1MU0wb0l4b2M/view
Also, the EQUATION of the fitted least square regression line is required.

Example 4

A researcher wants to investigate the relationship between driving experience and the monthly auto insurance premium. A random sample of 100 auto drivers insured with a company and having similar auto insurance policies was selected. The following table summarizes their driving experience $x$ (in years) and the monthly auto insurance premium $y$ (in dollars).

| Variable | Mean | Standard Deviation |
| --- | --- | --- |
| $x$ | 11.25 | 7.4 |
| $y$ | 69 | 14.8 |

It is also given that $S_{XY} = -7774.6$.

(a) Find the least square regression line for predicting the monthly auto insurance premium from the years of driving experience.

\[ S_{XX} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = (n - 1)\,\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = (100 - 1)(7.4^2) = 5421.24 \]

Therefore,

\[ b = \frac{S_{XY}}{S_{XX}} = \frac{-7774.6}{5421.24} = -1.434099948 \]

and hence

\[ a = \bar{y} - b\bar{x} = (69) - \left(\frac{-7774.6}{5421.24}\right)(11.25) = 85.13362441 . \]

$\therefore$ the least square regression line is $\hat{y} = 85.13 - 1.434x$.

(b) Predict the monthly auto insurance premium for a driver with 10 years of driving experience. Round your answer to the nearest dollar.

Monthly auto insurance premium predicted $= 85.13 - 1.434(10) = 70.79 = 71$, correct to the nearest dollar.

Exercise 3

A study was made on the amount of converted sugar in a certain process at various temperatures. The data were coded and recorded as follows:

| Temperature, $x$ | 1.0 | 1.1 | 1.2 | 1.3 | 1.4 | 1.5 | 1.6 | 1.7 | 1.8 | 1.9 | 2.0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Converted sugar, $y$ | 8.1 | 7.8 | 8.5 | 9.8 | 9.5 | 8.9 | 8.6 | 10.2 | 9.3 | 9.2 | 10.5 |

(a) Find the least square linear regression line.

From the data, $\bar{x} = 1.5$ with sample variance $s_X^2 = 0.11$, and $\bar{y} = 9.1\dot{2}\dot{7}$ with sample variance $s_Y^2 = 0.720\dot{1}\dot{8}$.

\[ S_{XX} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = (n - 1)\,s_X^2 = 10(0.11) = 1.1 \]

\[ S_{XY} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \left(\sum_{i=1}^{n} x_i y_i\right) - n\,\bar{x}\,\bar{y} = 152.59 - (1.5)(100.4) = 1.99 \]

\[ \therefore\ b = \frac{S_{XY}}{S_{XX}} = \frac{1.99}{1.1} = 1.8\dot{0}\dot{9} \]

and

\[ a = \bar{y} - b\bar{x} = \frac{100.4}{11} - \frac{199}{110}(1.5) = 6.41\dot{3}\dot{6} . \]

So the line equation required is: $\hat{y} = 6.4136 + 1.8091x$.

(b) Predict the amount of converted sugar produced when the coded temperature is 1.75.

Amount of converted sugar predicted, $\hat{y}_{new} = 6.41\dot{3}\dot{6} + 1.8\dot{0}\dot{9}\,(1.75) = 9.579\dot{5}\dot{4}$.

(c) Evaluate $s^2$.

\[ S_{YY} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} = 923.58 - \frac{100.4^2}{11} = 7.20\dot{1}\dot{8} \]

\[ s^2 = \frac{S_{YY} - b\,S_{XY}}{n - 2} = \frac{7.20\dot{1}\dot{8} - 1.8\dot{0}\dot{9}\,(1.99)}{11 - 2} = \frac{39619}{99000} = 0.400\dot{1}\dot{9} \]

(d) Construct a 95% confidence interval for $\beta_0$.

\[
\begin{aligned}
\text{95\% confidence interval} &= a \pm t_{n-2,\frac{1-95\%}{2}}\, s \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n\,S_{XX}}} \\
&= 6.41\dot{3}\dot{6} \pm t_{11-2,\,0.025}\, \sqrt{0.400\dot{1}\dot{9}}\, \sqrt{\frac{25.85}{11\,(1.1)}} \\
&= 6.4136 \pm 2.262\,(0.6326)\,(1.4616) \\
&= 6.4136 \pm 2.0915 = [4.322,\ 8.505]
\end{aligned}
\]

(e) Construct a 95% confidence interval for $\beta_1$.

\[
\begin{aligned}
\text{95\% confidence interval} &= b \pm t_{n-2,\frac{1-95\%}{2}}\, s \sqrt{\frac{1}{S_{XX}}} \\
&= 1.8\dot{0}\dot{9} \pm t_{11-2,\,0.025}\, \sqrt{0.400\dot{1}\dot{9}}\, \sqrt{\frac{1}{1.1}} \\
&= 1.8\dot{0}\dot{9} \pm 2.262\,(0.6326)\,(0.9535) \\
&= 1.8091 \pm 1.3644 = [0.445,\ 3.173]
\end{aligned}
\]

(Answers will be available at http://ihome.ust.hk/~makittylee)
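The confidence-interval formulas for $\beta_0$ and $\beta_1$ used in Exercise 3 can be checked numerically. This is a minimal sketch under stated assumptions: the function name `coefficient_intervals` is hypothetical, and the critical value 2.262 for $t_{9,\,0.025}$ is taken from the table lookup in the notes rather than computed.

```python
from math import sqrt


def coefficient_intervals(xs, ys, t_crit):
    """Return the confidence intervals (for beta_0, for beta_1).

    Hypothetical helper (not from the notes). t_crit is the value
    t_{n-2, alpha/2}, supplied by the caller from a t table.
    """
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sxx = sum_x2 - sum_x ** 2 / n
    syy = sum(y * y for y in ys) - sum_y ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    b = sxy / sxx
    a = sum_y / n - b * sum_x / n
    s = sqrt((syy - b * sxy) / (n - 2))  # square root of the MSE
    half_b0 = t_crit * s * sqrt(sum_x2 / (n * sxx))
    half_b1 = t_crit * s * sqrt(1 / sxx)
    return (a - half_b0, a + half_b0), (b - half_b1, b + half_b1)


temperature = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0]
sugar = [8.1, 7.8, 8.5, 9.8, 9.5, 8.9, 8.6, 10.2, 9.3, 9.2, 10.5]

ci_b0, ci_b1 = coefficient_intervals(temperature, sugar, t_crit=2.262)
print([round(v, 3) for v in ci_b0])  # interval for beta_0
print([round(v, 3) for v in ci_b1])  # interval for beta_1
```

Passing the critical value as an argument keeps the sketch free of third-party dependencies; the same function can be reused at another confidence level by looking up a different $t_{n-2,\alpha/2}$.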