CDA COLLEGE (LIMASSOL) 2014 - 2015
MTH 221 (Y2S4) STATISTICS II FOR BUSINESS ADMINISTRATION

Syllabus:
1. Review: Location and Dispersion
2, 3. Sampling, Estimation and Confidence Intervals
4, 5. Hypothesis Testing - Distribution-free tests
6, 7. Review - Midterm
8, 9. Correlation and Regression
10, 11. Time Series and Forecasting
12. Introduction to the Analysis of Variance
13. Practice

Assessment: 50% Homework and 50% Final Exam; Passing Mark 50

Textbooks:
1. (MTH 121, Semester 2 / MTH 221, Semester 4) Sanders, Eng, Merph: "STATISTICS, A FRESH APPROACH", McGraw-Hill
2. (MTH 121, Semester 2 / MTH 221, Semester 4) Francis, M.: ADVANCED LEVEL STATISTICS, Stanley Thornes Publishers
3. (MTH 121, Semester 2 / MTH 221, Semester 4) Hamburg, M.: BASIC STATISTICS, Harcourt Brace Jovanovich

WEEK 1: REVIEW - STATISTICAL MEASURES

Measures of Location:
Average (mean): X̄ = (1/n) Σ_{i=1}^{n} x_i, or, for grouped data with k classes, X̄ = (1/n) Σ_{j=1}^{k} f_j x_j, where n is the sample size, k the number of classes, f_j the class frequency and x_j the midpoint of the class.
Median (the middle value): m = L + [((n + 1)/2 − F)/f] w, where f is the median-class frequency, L the lower boundary of the median class, F the cumulative frequency of the previous class and w the class width.
Similarly, the quartiles: Q1 = L + [((n + 1)/4 − F)/f] w;  Q3 = L + [(3(n + 1)/4 − F)/f] w
Mode: the value with the highest frequency.
Methods of calculation: single values, frequencies, discrete, continuous.

Examples - Exercises
1. Keith records the amount of rainfall, in mm, at his school, each day for a week. The results are given below.
2.8  5.6  2.3  9.4  0.0  0.5  1.8
(a) Calculate the mean, the median and the mode of the amount of rainfall during the 7 days.
Keith realizes that he has transposed two of his figures. The number 9.4 should have been 4.9 and the number 0.5 should have been 5.0. Keith corrects these figures.
(b) State, giving your reason, the effect this will have on the mean.
2. The following table summarises the birth weights of a random sample of 100 babies born in a clinic over a year.
Weight (kg)   Frequency
1.5 - 1.9     2
2.0 - 2.4     9
2.5 - 2.9     12
3.0 - 3.4     18
3.5 - 3.9     22
4.0 - 4.4     17
4.5 - 4.9     13
5.0+          7

(a) Write down the upper class boundary of the second class.
(b) Represent these data by a histogram.
(c) Calculate estimates of the median and the quartiles of these birth weights.
(d) Comment on the skewness of these data.

Measures of Dispersion:
Range: x_max − x_min
Variance: s² = (1/(n − 1)) Σ_{j=1}^{k} f_j (x_j − X̄)², or equivalently s² = (1/(n − 1)) [Σ_{j=1}^{k} f_j x_j² − n X̄²]
Standard deviation: s = √s²
Interquartile range: IR = Q3 − Q1
Properties; appropriateness. Computer; calculator.

Applications
1. The data below represent the cost of electricity during July 2004 for a random sample of 50 one-bedroom apartments in a city:
96 157 141 95 108 171 185 149 163 119
202 90 206 150 183 178 116 175 154 151
147 172 123 130 114 102 111 128 143 135
153 148 144 187 191 197 213 168 166 137
127 130 109 139 129 82 165 167 149 158
a. After constructing the stem-and-leaf plot, form a frequency distribution whose class intervals have upper class limits $99, $119, and so on.
b. Draw the ogive and estimate the median and the quartiles from your graph.
c. Estimate the mean, the range, the standard deviation and the interquartile range algebraically.

WEEKS 2 and 3: SAMPLING, ESTIMATION AND CONFIDENCE INTERVALS
"Knowledge is happiness..."
THE PROBLEM: Here we have one of the fundamental problems in Statistics: from a relatively small sample, we try to make inferences about the whole population, either with a specified value (Point Estimation), or with an interval which intends to cover the true parameter value with a prespecified confidence (Interval Estimation).
In Probability Theory, we calculate the probability of an event before the experiment, given of course the values of the parameters. Notation: f(x; θ) or f(x | θ), where X is the vector of observations and θ is the vector of parameters.
In Statistical Theory, we estimate the values of the parameters θ after the experiment, based on the observed data X.
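Before moving on, the Week 1 interpolation formulas for grouped data can be sketched in code. This is a minimal sketch, not part of the course pack: the function `grouped_quantile` and the class boundaries 1.45, 1.95, ..., 5.45 kg for the birth-weight table are assumptions made for illustration (weights recorded to 1 d.p.; the open last class is closed at 5.45 arbitrarily).

```python
def grouped_quantile(boundaries, freqs, position):
    """Interpolation formula m = L + ((position - F) / f) * w from Week 1,
    where F is the cumulative frequency below the class containing `position`."""
    cum = 0
    for L, U, f in zip(boundaries, boundaries[1:], freqs):
        if cum + f >= position:
            return L + (position - cum) / f * (U - L)
        cum += f
    return boundaries[-1]

# Birth-weight table; boundaries 1.45, 1.95, ..., 5.45 kg are an assumption.
bounds = [1.45 + 0.5 * i for i in range(9)]
freqs = [2, 9, 12, 18, 22, 17, 13, 7]
n = sum(freqs)
median = grouped_quantile(bounds, freqs, (n + 1) / 2)  # about 3.67 kg
```

The same call with positions (n + 1)/4 and 3(n + 1)/4 gives the quartiles.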
SAMPLING
"You don't have to eat the whole cheesecake to realize that it turned sour"
The Problem: One of the fundamental objectives of statistical science is to make inferences about the whole population, examining a small portion of it!
Simple Random Sample: A collection of Independent Identically Distributed (i.i.d.) random variables X1, X2, ..., Xn; or, we can consider it as a small group of size n, randomly taken from the population of size N. To draw the sample, we can use random numbers, drawn from the computer, a calculator, or from statistical tables.
In order to understand some of the important aspects of sampling, we consider a very simple special example: Consider the "population" of 4 families in a small village with numbers of children: {3, 1, 0, 8}
Population size: N = 4; Variable: X = the number of children in a family.
Note that for this population
μ = (1/N) Σ X = (3 + 1 + 0 + 8)/4 = 3,
σ² = (1/N) Σ (X − μ)² = [(3 − 3)² + (1 − 3)² + (0 − 3)² + (8 − 3)²]/4 = (0 + 4 + 9 + 25)/4 = 9.5
If we take a sample of size n = 2, without replacement, the number of possible samples is C(4, 2) = 6, and the sampling distribution is the set of all values of X̄, along with the corresponding probabilities, for all possible samples:

Sample   X̄     Pr
(3,1)    2     1/6
(3,0)    1.5   1/6
(3,8)    5.5   1/6
(1,0)    0.5   1/6
(1,8)    4.5   1/6
(0,8)    4     1/6
Totals   18    1.0

Note that the median of the population is 2 and the mode does not exist! Now:
E(X̄) = Σ_{all X̄} X̄ P(X̄) = (2 + 1.5 + ... + 4)/6 = 3 (= μ)
Var(X̄) = (1/6)[(2 − 3)² + (1.5 − 3)² + ... + (4 − 3)²] = 3.167 (verify!)
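The "verify!" above can be done by brute-force enumeration. A minimal sketch (not part of the course pack) that lists all six samples and confirms E(X̄) = μ and Var(X̄) = 3.167:

```python
from itertools import combinations

# The 4-family example: verify E(Xbar) = mu and Var(Xbar) = 3.167.
population = [3, 1, 0, 8]
N = len(population)
mu = sum(population) / N                                       # 3.0
sigma2 = sum((x - mu) ** 2 for x in population) / N            # 9.5

samples = list(combinations(population, 2))                    # the 6 equally likely samples
means = [sum(s) / 2 for s in samples]
e_xbar = sum(means) / len(means)                               # 3.0 = mu
var_xbar = sum((m - e_xbar) ** 2 for m in means) / len(means)  # 19/6 = 3.167
```

The same run also confirms the finite-population formula used just below: Var(X̄) = (σ²/n)(N − n)/(N − 1) = (9.5/2)(2/3) = 3.167.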
Note that Var(X̄) = (σ²/n) × (N − n)/(N − 1) = (9.5/2) × (4 − 2)/(4 − 1) = 3.167

Infinite Population
Starting from the population parameters E(X) = μ, Var(X) = σ² and the proportion π, we have the sample statistics:
Sample size: n
Sample average: X̄ = (1/n) Σ_{i=1}^{n} X_i
Proportion: p = X/n, where X is the number of "successes" in the sample (Binomial distribution)
Variance: s² = (1/(n − 1)) Σ_{i=1}^{n} (X_i − X̄)²
Total: T = Σ_{i=1}^{n} X_i = nX̄
with the properties:
E(X̄) = μ, E(T) = nμ, E(p) = π, and
Var(X̄) = σ²/n, Var(T) = nσ², Var(p) = π(1 − π)/n

INTERVAL ESTIMATION
Point estimation (estimating a parameter with a single value) gives us some idea about the value of the parameter, but important information about the precision of this estimate is missing! An estimate of the population mean μ is given by μ̂ = X̄; for the variance Var(X) = σ², the estimate is σ̂² = s²; for the proportion π, it is π̂ = p; and for the correlation coefficient ρ, it is ρ̂ = r. This procedure is useful, but almost always off target. A more realistic and "safe" procedure is often entertained through confidence intervals.
Given a sample X1, X2, ..., Xn, we define a P% Confidence Interval (CI) for the parameter θ as a random interval [T1, T2] such that Pr(T1 ≤ θ ≤ T2) = P% (central), where T1 and T2 are the values of statistics obtained from the sample. As a general principle, the C.I. for θ is, approximately, θ̂ ± z S.E.(θ̂).
Some commonly used intervals:
1. An exact P% C.I. for μ when σ² is known (or approximate, when σ² is unknown but n is large, n > 100) is obtained by solving for μ the inequality:
−z ≤ (X̄ − μ)√n / σ ≤ z,   to find   μ = X̄ ± z σ/√n,
where of course X̄ = (1/n) Σ_{i=1}^{n} X_i and z is the appropriate value from a standardized normal distribution, corresponding to the prespecified degree of "confidence" (probability) P. The random interval covers the true but unknown value of μ with probability P%, considering all possible samples of the same size from the population!
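The z-interval just derived, and the approximate proportion interval given in the next subsection, can be sketched as two small functions. This is an illustrative sketch, not course material; z = 1.96 (the 95% value) is assumed as the default.

```python
from math import sqrt

def mean_ci(xbar, sigma, n, z=1.96):
    """mu = xbar +/- z * sigma / sqrt(n); z = 1.96 gives a 95% interval."""
    half = z * sigma / sqrt(n)
    return xbar - half, xbar + half

def prop_ci(x, n, z=1.96):
    """Approximate interval for a proportion: p +/- z * sqrt(p(1-p)/n), p = x/n."""
    p = x / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half

lo, hi = mean_ci(100, 10, 25)   # (96.08, 103.92)
```

For example, 40 successes in 100 trials give approximately (0.304, 0.496) for π.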
Care should be taken, since this is different from an interval about individual values of N(μ, σ²): X = μ ± zσ, which is much wider!
2. A P% C.I. for μ when σ² is unknown and n is small (n ≤ 100) is given by
μ = X̄ ± t(n−1) s/√n,
where s² = (1/(n − 1)) Σ_{i=1}^{n} (X_i − X̄)² is the sample variance and t(n−1) is the appropriate value from Student's t-distribution with (n − 1) degrees of freedom!
3. Approximate confidence intervals, for the intensity rate λ of a Poisson distribution, or for the proportion π of a Binomial:
For Poisson: λ = x ± z√x, where x is the observed value (number of events in a rather wide time interval); then divide accordingly for the required interval!
For Binomial: π = p ± z√(p(1 − p)/n), where p = x/n, the sample proportion.
This interval is approximate for two reasons: (i) the Binomial distribution is approximated by the corresponding Normal, and (ii) π, the theoretical proportion, is replaced by p, the sample proportion!
Aside: Note that an interval closer to the exact C.I. for π (overcoming the second approximation) may be obtained by solving for π the relevant quadratic inequality
−z ≤ (p − π)√n / √(π(1 − π)) ≤ z,
and this interval turns out to be:
π = [2np + z² ± z√(z² + 4np(1 − p))] / [2(n + z²)]
4. A P% C.I. for the population variance σ² is given by
[(n − 1)s²/u₂ , (n − 1)s²/u₁],
where s² = (1/(n − 1)) Σ (X_i − X̄)² is the sample variance and (u₁, u₂) are the corresponding points from a χ²(n−1) distribution with (n − 1) degrees of freedom, covering P% central probability!

Examples - Exercises
1. A large bag of coins contains 20c, 50c and 100c (1 euro) coins, in the ratio 3:2:1.
(a) Find the mean and the variance for the value of coins in this population.
A random sample of two coins is taken and their values X1 and X2 are recorded.
(b) List all possible samples.
(c) Find the sampling distribution for the average X̄ = (X1 + X2)/2
2. The following grouped frequency distribution summarizes the time, to the nearest minute, spent waiting by a sample of patients in a doctor's surgery.
Waiting time (to nearest minute)   Number of patients
3 or less                          6
4 - 6                              15
7 - 8                              27
9                                  49
10                                 52
11 - 12                            29
13 - 15                            13
16 or more                         9

The average of the times was 9.63 minutes and the standard deviation was 3.05 minutes. (Take the missing limits to be 0 and 20.)
a) Using interpolation, estimate the median and semi-interquartile range of these data.
For a normal distribution, the ratio of the semi-interquartile range to the standard deviation would be approximately 0.67.
b) Calculate the corresponding value for the above data. Comment on your result.
For a normal distribution, 90% of times would be expected to lie in the interval (mean ± 1.645 standard deviations).
c) Find the theoretical limits for these data.
d) Find the 90% C.I. for the mean μ and the variance σ² of the population from which this sample has been drawn.
e) Find also a 95% C.I. for the proportion π of patients who wait longer than 15 minutes.

WEEKS 4 and 5: HYPOTHESIS TESTING
"Doubt is the root of progress"
The Problem: Why testing? First, there should be a claim for the value of a parameter (μ, σ, π, λ, and so on) or a model, and then the evidence from the observed data should be left to decide about the claim! A test is a statistical procedure which decides, with some confidence, whether a statement (hypothesis) about a parameter value, or the whole model, is valid!
The characteristics or the "ingredients" of a test
The supposed claims are formulated as follows:
(a) Null Hypothesis H0: This statement should specify completely the value of the parameter of interest, e.g. μ = 50; vs:
(b) Alternative H1: This is a complementary statement to H0, in some direction, e.g. μ > 50 or μ < 50 (one-tailed), or μ ≠ 50 (two-tailed tests).
Test statistic: e.g. Z = (X̄ − μ)√n / σ ~ N(0, 1). This is (what used to be called) a statistic from the sample, which serves as the criterion for the decision to accept or reject H0. (In fact this is rather a pivotal quantity, since, by definition, a statistic involves only functions of the sample!)
Significance level: α = Pr(we observe what we did, under H0, or even worse in the direction of H1) = (p-value). Often this is predetermined at 5% or at 1%, but it can take any value!
Critical value; critical region: The value, or rather the set of values, of the test statistic Z or X̄ which lead to rejection of H0.
The Decision Rule: This is a rule which is based on a specified value T* of the test statistic and is of the form:
Reject H0 if T_obs ≥ T* (or T_obs ≤ T*); accept H0 otherwise.
Relation with confidence intervals: If the assumed value of μ lies inside the C.I., then we accept H0; if, however, it is outside the interval, then we reject the null hypothesis in favor of the alternative.
Example: Two friends, Roger and Marcos, play a best-of-five series (at most 5 games) of tennis to decide whether Roger is better (Roger's claim) or they are equally competent! Let the parameter of interest be π = Pr(Roger wins any game);
H0: π = 0.5 (Marcos's claim) vs H1: π > 0.5 (Roger's claim)
The test statistic naturally is the final score (outcome of the series), with the corresponding distribution (under H0) as in the table:

Final score   3-0    3-1    3-2    2-3    1-3    0-3
Probability   4/32   6/32   6/32   6/32   6/32   4/32

Suppose that the series ends 3-0 for Roger. Obviously there is evidence in favor of Roger. The observed significance level is:
S.L. = P(score 3-0, or better for Roger) = 4/32 = 1/8 = 0.125
PRINCIPLE: The rationale behind a statistical test is the following: "If a small value of the significance level α is observed, or equivalently an extreme value of the test statistic T is realized, then we are faced with two options: (a) either H0 is true and we just observed a quite rare event, or (b) H0 is false, and that is the reason we have observed the realized outcome"! Therefore, following the above rationale, we reject the null hypothesis when we observe a rather improbable outcome!
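The score distribution in the tennis example can be checked by enumeration. A minimal sketch (illustrative, not course material): each of the 2^5 equally likely length-5 win/lose strings is truncated when one player reaches 3 wins, and the truncated outcomes are tallied with weight 1/32 each.

```python
from itertools import product

def score_distribution():
    """Best-of-five under H0 (fair games): enumerate all 2**5 equally likely
    sequences (R = Roger wins, M = Marcos wins), stopping each at 3 wins."""
    counts = {}
    for seq in product("RM", repeat=5):
        r = m = 0
        for g in seq:
            if g == "R":
                r += 1
            else:
                m += 1
            if 3 in (r, m):          # series decided; ignore remaining games
                break
        counts[(r, m)] = counts.get((r, m), 0) + 1
    return {score: c / 32 for score, c in counts.items()}

dist = score_distribution()   # {(3,0): 4/32, (3,1): 6/32, (3,2): 6/32, ...}
```

This reproduces the table above, including S.L. = Pr(3-0) = 4/32 = 0.125.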
It should be emphasized that when we decide under uncertainty, we take a risk of making an error, as follows:

                     UNKNOWN STATE OF NATURE
DECISION       H0 is true                      H1 is true
Accept H0      Correct decision, Pr = 1 − α    Type II error, Pr = β
Reject H0      Type I error, Pr = α = SL       Correct decision, Pr = 1 − β = power

We can explain all these using an easy and realistic example: At a backgammon tournament there are (among others, of course) the two players Adam and Daniel, in short A and D. Adam claims that he is a better player than Daniel, whereas Daniel modestly argues that there is nothing between them. A test may be formulated as follows:
Let π = Pr(Adam winning any game)
Null Hypothesis H0: π = 1/2 (Daniel's claim)
Alternative Hypothesis H1: π > 1/2 (Adam's claim)
If a decision is to be taken on the basis of a series of, say, 10 games, then the test statistic is naturally X = number of wins for Adam, X ~ Bin(n, π), n = 10.
Now suppose that the series ended with the score 9-1 for Adam, so X_obs = 9. The observed significance level:
Pr(X ≥ 9) = Σ_{x=9}^{10} C(10, x) (0.5)^x (1 − 0.5)^{10−x} = 0.0107
Taking into account the above quite unlikely result, we reject H0 in favor of H1.
Note however that in real situations the decision rule is traditionally:
Reject H0 if X ≥ 6; accept H0 otherwise.
In this case the critical region is X = 6, 7, ..., 10.
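The binomial tail probabilities used here (Pr(X ≥ 9) ≈ 0.0107, and Pr(X ≥ 6) ≈ 0.3770 for the traditional critical region, computed just below) can be checked with a short sketch; the helper function is an illustration, not course material.

```python
from math import comb

def binom_upper_tail(n, k, p=0.5):
    """Pr(X >= k) for X ~ Bin(n, p)."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(k, n + 1))

sl9 = binom_upper_tail(10, 9)   # 11/1024 = 0.0107
sl6 = binom_upper_tail(10, 6)   # 386/1024 = 0.3770
```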
Following along the same lines, we get the significance level:
Pr(X ≥ 6) = Σ_{x=6}^{10} C(10, x) (0.5)^x (1 − 0.5)^{10−x} = 0.3770

Basic Parametric Tests
(a) Tests for the mean μ: H0: μ = μ0
(i) Known variance, or large sample size (n > 50, replacing of course the unknown σ with the sample standard deviation s, which is an estimate of σ):
Test statistic: Z = (X̄ − μ)√n / σ ~ N(0, 1)
(ii) Unknown variance and small sample size:
T = (X̄ − μ)√n / s ~ t(n−1)
(iii) Tests for two samples (sizes n, m): H0: μx = μy
Z = [(Ȳ − X̄) − (μy − μx)] / √(σx²/n + σy²/m) ~ N(0, 1) (known variances), or
T = [(Ȳ − X̄) − (μy − μx)] / [s √(1/n + 1/m)] ~ t(n+m−2),
where s² = [(n − 1)sx² + (m − 1)sy²] / (n + m − 2) is the pooled estimate of the common population variance.
(iv) Paired t test: When a natural pairing exists between the observations in the two samples X1, X2, ..., Xn and Y1, Y2, ..., Yn, the most efficient test is the paired one, and the set-up is the following:
H0: μx = μy, or δ = μx − μy = 0
Assumptions: Di = Xi − Yi (differences), D ~ N(δ, σ²), independent
D̄ = (1/n) Σ_{i=1}^{n} Di, and s_d² = (1/(n − 1)) Σ_{i=1}^{n} (Di − D̄)²
Test statistic: T = (D̄ − δ)√n / s_d ~ t(n−1)

Examples - Exercises
1. A student takes a multiple choice test. The test is made up of 10 questions, each with 5 possible answers. The student gets 4 questions correct. Her teacher claims she was guessing the answers. Using a one-tailed test, at the 5% level of significance, test whether or not there is evidence to reject the teacher's claim. State your hypotheses clearly.
2. A random sample of 10 mustard plants had the following heights, in mm, after 4 days' growth:
5.0, 4.5, 4.8, 5.2, 4.3, 5.1, 5.2, 4.9, 5.1, 5.0
Those grown previously had a mean height of 5.1 mm after 4 days. Using a 5% significance level, test whether or not the mean height of these plants is less than that of those grown previously. (You may assume that the heights above are normally distributed.)

DISTRIBUTION FREE METHODS
Non-parametric methods
(i) One sample with size n: The assumption of normality is waived, but symmetry is needed!
As a general principle, these tests are obtained by applying the corresponding parametric tests to the ranks!
(a) Sign test: The test is based only on the signs of the differences of the observed values X1, ..., Xn from the assumed median η0.
Null Hypothesis H0: η (population median) = η0 vs. H1: η > η0, η < η0 or η ≠ η0
First we obtain the differences d_i = x_i − η0 (ignoring any 0's and adjusting the sample size accordingly!). The test statistic is U = # of +'s (or −'s), whichever is smaller, among the differences d_i. Now, under the null hypothesis H0: δ (median of the differences) = 0, U ~ Bin(n, 1/2);
S.L. = Pr(U ≤ u_obs) or Pr(U ≥ u_obs)
For a two-tailed test, just double the observed S.L. you have obtained from the one-tailed test.
For large samples (say, n > 20), U ~ N(μ, σ²) approximately, where μ = n/2, σ² = n/4, and we proceed accordingly with the test statistic:
Z = (u − n/2) / √(n/4) = (2u − n)/√n ~ N(0, 1)
(normal approximation to the Binomial!)
(b) Wilcoxon signed rank test (one sample with size n): This test is based on the ranks of the signed differences of the observed values X1, ..., Xn from the assumed median η0.
Null hypothesis H0: η (population median) = η0 vs. H1: η > η0, η < η0 or η ≠ η0
Obtain the differences d_i = x_i − η0, but bear in mind to ignore 0's, adjusting the sample size accordingly, and then assign ranks to the resulting differences, considered as one set, ignoring the signs. When there are ties in the values, take the average rank of the tied observations. Eventually you get two totals:
T+ : the sum of the ranks for the positive differences, and
T− : the sum of the ranks for the negative differences.
Check, however: since their sum is the sum of the first n natural numbers, we must always have T+ + T− = n(n + 1)/2
Test statistic: T = T+ or T− (whichever is smaller); compare the value with the critical value in the appropriate table for the test!
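Both statistics can be computed with a small sketch. This is illustrative code, not course material; it reuses the mustard-plant heights from exercise 2 above purely as sample data, with an assumed median η0 = 5.1 (the exercise itself asks for a test of the mean).

```python
def sign_and_wilcoxon(data, eta0):
    """Sign-test U and Wilcoxon T+, T-; zero differences are dropped and tied
    magnitudes get average ranks. Differences are rounded to 10 d.p. so that
    equal magnitudes compare equal in floating point."""
    diffs = [round(x - eta0, 10) for x in data]
    diffs = [d for d in diffs if d != 0]
    n = len(diffs)
    plus = sum(d > 0 for d in diffs)
    u = min(plus, n - plus)                      # sign-test statistic
    abs_sorted = sorted(abs(d) for d in diffs)
    def avg_rank(v):                             # average rank of tied magnitudes
        first = abs_sorted.index(v) + 1
        return first + (abs_sorted.count(v) - 1) / 2
    t_plus = sum(avg_rank(abs(d)) for d in diffs if d > 0)
    t_minus = sum(avg_rank(abs(d)) for d in diffs if d < 0)
    return u, t_plus, t_minus, n

# Mustard-plant heights, assumed median 5.1 (two zero differences are dropped)
u, tp, tm, n = sign_and_wilcoxon([5.0, 4.5, 4.8, 5.2, 4.3, 5.1, 5.2, 4.9, 5.1, 5.0], 5.1)
```

For these data u = 2 with n = 8, T+ = 5, T− = 31, and the check T+ + T− = 8·9/2 = 36 holds.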
This is a better test than the sign test, since it takes into account the ranks, and hence the relative magnitudes of the differences, and not only the signs! There is, however, the (not unusual!) case where we observe a lot of (say) positive differences, but very few, absolutely big, negative differences. In this situation the sign test will turn out to be significant, but the Wilcoxon test will not!
For large samples (say, n > 20), T ~ N(μ, σ²) approximately, where μ = n(n + 1)/4, σ² = n(n + 1)(2n + 1)/24, and based on these we can proceed as in normal tests.

CONTINGENCY TABLE (m × n)
Test for Association or Independence (Karl Pearson, 1857 - 1936):
(The values in the cells are frequencies. If the values are percentages, then, before proceeding to the test, we have to transform them to frequencies.)
H0: No association between the two factors; or: the two classifications A and B are independent (m rows and n columns).

Cl.A \ Cl.B   B1    B2    ...   Bn    Total
A1            o11   o12   ...   o1n   r1
A2            o21   o22   ...   o2n   r2
...           ...   ...   ...   ...   ...
Am            om1   om2   ...   omn   rm
Totals        c1    c2    ...   cn    T

Expected frequency in each cell (i-th row, j-th column), under H0:
e_ij = (row total) × (column total) / (grand total) = r_i c_j / T
For the test to be "good" and reliable (according to Pearson), each expected frequency should always be at least 5; otherwise we have to combine adjacent or similar classes, in order to achieve expected frequencies of at least 5 everywhere.
Test statistic:
D = Σ_{i,j} (o_ij − e_ij)² / e_ij ~ χ² with (m − 1)(n − 1) degrees of freedom
For a 2 × 2 contingency table we often use Yates' continuity correction, and the appropriate test statistic becomes
D = Σ_{i,j} (|o_ij − e_ij| − 0.5)² / e_ij ~ χ²(1)
The correction (−0.5) drives the observed value D away from the critical region. So if the uncorrected value is not significant, the corrected one will also be insignificant!

Examples - Exercises
1. Manuel is planning to buy a new machine to squeeze oranges in his cafe and he has two models, at the same price, on trial.
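The expected-frequency and test-statistic computation above can be sketched in a few lines. Illustrative code, not course material; the 2 × 2 counts are those of the gender/driving-test survey in exercise 2 below.

```python
def chi_square_stat(table, yates=False):
    """D = sum over cells of (|o - e| - c)^2 / e, with c = 0.5 under Yates'
    correction and c = 0 otherwise; e = row total * column total / grand total."""
    m, n = len(table), len(table[0])
    row = [sum(r) for r in table]
    col = [sum(table[i][j] for i in range(m)) for j in range(n)]
    total = sum(row)
    c = 0.5 if yates else 0.0
    d = 0.0
    for i in range(m):
        for j in range(n):
            e = row[i] * col[j] / total
            d += (abs(table[i][j] - e) - c) ** 2 / e
    return d

d_plain = chi_square_stat([[23, 27], [32, 18]])              # about 3.27
d_yates = chi_square_stat([[23, 27], [32, 18]], yates=True)  # about 2.59
```

As claimed above, the corrected value is smaller than the uncorrected one.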
The manufacturers of machine B claim that their machine produces more juice from an orange than machine A. To test this claim, Manuel takes a random sample of 8 oranges, cuts them in half and puts one half in machine A and the other half in machine B. The amount of juice, in ml, produced by each machine is given in the table below.

Orange      1    2    3    4    5    6    7    8
Machine A   60   58   55   53   52   51   54   56
Machine B   61   60   58   52   55   50   52   58

Stating your hypotheses clearly, test, at the 10% level of significance, whether or not the mean amount of juice produced by machine B is more than the mean amount produced by machine A. Use both parametric and non-parametric tests and compare!
2. A survey in a college was commissioned to investigate whether or not there was any association between gender and passing a driving test. A group of 50 male and 50 female students were asked whether they passed or failed their driving test at the first attempt. All students asked had taken the test. The results were as follows:

         Pass   Fail
Male     23     27
Female   32     18

Stating your hypotheses clearly, test, at the 10% level of significance, whether there is any evidence of an association between gender and passing a driving test at the first attempt.

WEEKS 8, 9: CORRELATION - REGRESSION
"Nothing is isolated!"
The Problem: Given a set of bivariate observations, we attempt to estimate the "best" algebraic relationship between X and Y.
Correlation from the Sample:
- We are interested in measuring linear association between the response (dependent) variable Y and the explanatory (independent) variable X.
- A first indication of the existence of this linear association is revealed if we draw the scatter diagram of X and Y.
[Scatter diagrams: other possible shapes]
Correlation: meaning, linear association between X and Y; cause and effect.
A numerical measure of how strong this linear association between X and Y is, is the sample product moment correlation coefficient (pmcc)
r = s_xy / (s_x s_y),
where s_xy = (1/(n − 1)) [Σ_{i=1}^{n} X_i Y_i − n X̄ Ȳ] is the sample covariance and s_x, s_y are the sample standard deviations of X and Y.
Note that always −1 ≤ r ≤ 1, with r = ±1 for perfect correlation, positive or negative! (See the shapes above.)
Another way to calculate the pmcc:
r = SXY / √(SXX · SYY)
(can be obtained easily from an advanced calculator!), where
SXY = Σ (X_i − X̄)(Y_i − Ȳ) = Σ X_i Y_i − n X̄ Ȳ,
SXX = Σ (X_i − X̄)² = Σ X_i² − n X̄² = (n − 1) s_x²,
SYY = Σ (Y_i − Ȳ)² = Σ Y_i² − n Ȳ² = (n − 1) s_y²
Note that:
(i) The numerical existence of significant correlation does not necessarily imply linear association between the two variables, unless of course some natural explanation exists! Spurious (unexplained, nonsense) correlation sometimes occurs!
(ii) On the other hand, a value of the pmcc close to 0 implies no linear association, but non-linear association might still be present! (Examples are shown above on the scatter diagrams.)
(iii) For tests, or inferences on the true but unknown value of ρ, we can use the result:
T = r √(n − 2) / √(1 − r²) ~ t(n−2)
The product moment correlation coefficient (pmcc), moreover, is invariant under scale and location transformations, i.e. it remains exactly the same!

SPEARMAN'S RANK CORRELATION COEFFICIENT
The corresponding non-parametric, rank correlation coefficient was developed by Charles Spearman (1904) and, basically, it is the p.m.c.c. of the corresponding ranks. It can also be evaluated through Spearman's formula
r_s = 1 − 6 Σ_{i=1}^{n} d_i² / [n(n² − 1)],
where d_i is the difference between corresponding ranks of the two variables. When there are ties in the ranks, we take the average rank of the tied observations.
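Both coefficients can be sketched directly from the summation forms above; Spearman's is computed here as the pmcc of the ranks, the more accurate of the two routes mentioned below. Illustrative code, not course material.

```python
from math import sqrt

def pmcc(xs, ys):
    """r = SXY / sqrt(SXX * SYY), using the summation forms given above."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar
    sxx = sum(x * x for x in xs) - n * xbar * xbar
    syy = sum(y * y for y in ys) - n * ybar * ybar
    return sxy / sqrt(sxx * syy)

def ranks(v):
    """Ranks of the values, with average rank for ties."""
    s = sorted(v)
    return [(s.index(x) + 1 + s.index(x) + s.count(x)) / 2 for x in v]

def spearman(xs, ys):
    """Spearman's coefficient as the pmcc of the corresponding ranks."""
    return pmcc(ranks(xs), ranks(ys))
```

A perfectly linear relation gives r = 1; any strictly increasing relation, linear or not, gives r_s = 1, illustrating comparison (ii) below.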
Note that the proof (see appendix A7) of the formula requires no ties, so the two ways of calculation, (i) from the pmcc of the ranks and (ii) from the above formula of Spearman, might give slightly different values; more accurate is the first one, obtained using a calculator! Testing is performed through the use of appropriate tables of the rank correlation coefficient.
Comparisons:
(i) The rank correlation coefficient may be used when only rankings are available, or the data are qualitative.
(ii) The big difference from the p.m.c.c. is that Spearman's coefficient measures agreement of ranks, but not necessarily linear association.
(iii) This coefficient (Spearman's rank) does not rely on the Bivariate Normal.
(iv) Numerically, the two coefficients are often close, but in special cases their values might be quite far apart!

LINEAR REGRESSION
"It is our opinion of the situation at one stage, but this must change, if we find, at a later stage, that the facts are against it"
Glancing at the scatter diagram, we may suspect a rather strong linear association of the response variable Y on the explanatory variable X (a p.m.c.c. close to r = ±1 reveals that!). In this case, we use regression methods to establish the "suspected" linear relationship. For example, for the set of points given below, we draw the scatter diagram:
(X, Y) points: (1, 49), (3, 51), (4, 52), (6, 52), (6, 53), (7, 53), (8, 54), (11, 56), (12, 56), (14, 57), (14, 58), (17, 59), (18, 59), (20, 60), (20, 61)
The line of "best" fit is obtained through the statistical regression model:
Y_i = α + β X_i + ε_i, where ε_i is the unobserved error.
The vertical random errors satisfy the conditions: (i) ε_i ~ N(0, σ²), and (ii) they are independent (i = 1, 2, ..., n), where α is the Y-intercept and β is the slope (gradient) of the straight line.
Variables: Explanatory, independent: X. Response, dependent: Y.
Principle of regression: Applying least squares methods, we obtain the estimates of the parameters α and β, which turn out to be the statistics a and b, by minimizing the sum of squared vertical errors with respect to α and β:
SSE = Σ_{i=1}^{n} ε_i² = Σ_{i=1}^{n} (Y_i − α − β X_i)²
Solving the resulting two regression equations (see appendix A4), the "best" estimates of the unknown parameters α and β turn out to be:
The slope: b = s_xy / s_x², i.e. the sample covariance divided by the sample variance of X, or equivalently b = SXY/SXX, where SXY and SXX are defined as in the previous chapter (Correlation):
SXY = Σ (X_i − X̄)(Y_i − Ȳ) = Σ X_i Y_i − n X̄ Ȳ,   SXX = Σ (X_i − X̄)² = Σ X_i² − n X̄²
The Y-intercept: a = Ȳ − b X̄
The variance of the errors is estimated by s² = SSE/(n − 2)
Another form of the fitted line is the model: Y − Ȳ = b(X − X̄)
From Analytic Geometry, this form indicates that the line passes through the point G(X̄, Ȳ), the center of gravity of the data, and has gradient m = b.
Natural interpretation of the parameters of the regression model:
a: the estimated value of the response variable Y when the explanatory variable X = 0.
b: the estimated change of the response variable Y for a unit increase of the explanatory variable X.
Note that predictions may be obtained from the fitted line; but this is quite risky outside the range of observations (extrapolation), as in any other hard science!
To examine the goodness of fit of the entertained model, we can use the residuals:
e_i = y_i − ŷ_i,   i = 1, 2, ..., n,
and the coefficient of determination R² = 1 − SSE/SYY. This is the proportion of total variation explained by the regression line. For this simple model we have R² = r². In other words, the coefficient of determination is just the square of the p.m.c.c.
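The least-squares estimates above can be sketched and applied to the 15 points of the scatter-diagram example. Illustrative code, not course material; note the residual-sum property Σe_i = 0 mentioned just below.

```python
def fit_line(xs, ys):
    """Least-squares estimates: b = SXY / SXX, a = ybar - b * xbar."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar
    sxx = sum(x * x for x in xs) - n * xbar * xbar
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b

# The 15 (X, Y) points from the scatter-diagram example above
xs = [1, 3, 4, 6, 6, 7, 8, 11, 12, 14, 14, 17, 18, 20, 20]
ys = [49, 51, 52, 52, 53, 53, 54, 56, 56, 57, 58, 59, 59, 60, 61]
a, b = fit_line(xs, ys)
residual_sum = sum(y - (a + b * x) for x, y in zip(xs, ys))  # ~ 0
```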
For an adequate fit, the coefficient of determination R² should be high (close to one)! Also, the residuals, plotted against the explanatory variable X, should theoretically be scattered randomly, above and below the X axis, within a narrow horizontal band. Note also that Σ_i e_i = 0, always!

Multiple Regression
The general linear model is explained here through the powerful tools of Linear Algebra, and we have a look at it from the elegant perspective of three-dimensional geometry!
For a multiple regression situation, we entertain the matrix model y = Xβ + ε, ε ~ MVN(0, σ² I_n); in matrix form,

[y1]   [x11 x12 ... x1k] [β1]   [ε1]
[y2] = [x21 x22 ... x2k] [β2] + [ε2]
[..]   [...          ..] [..]   [..]
[yn]   [xn1 xn2 ... xnk] [βk]   [εn]

with the solution being b = (XᵀX)⁻¹ Xᵀ y, and E(b) = β; also Var(b) = (XᵀX)⁻¹ σ².
The predicted value for a given explanatory vector x_i is:
ŷ_i = x_iᵀ b = x_iᵀ (XᵀX)⁻¹ Xᵀ y, with Var(ŷ_i) = x_iᵀ (XᵀX)⁻¹ x_i σ²,
and the estimate of σ² is s² = (y − ŷ)ᵀ(y − ŷ) / (n − k).

Application: Matrix approach to the simple model
The simple model may be formulated in a multiple regression context as follows: y = Xβ + ε, where

    [y1]       [1 x1]       [α]       [ε1]
y = [y2],  X = [1 x2],  β = [β],  ε = [ε2]
    [..]       [... ]                 [..]
    [yn]       [1 xn]                 [εn]

In this context, it turns out that

XᵀX = [n    nx̄  ]        (XᵀX)⁻¹ = (1/(n SXX)) [Σx_i²   −nx̄]
      [nx̄   Σx_i²]                              [−nx̄     n  ]

so the variance-covariance matrix of the estimated coefficients (intercept and slope) is given by:

Var[a; b] = (σ²/(n SXX)) [Σx_i²   −nx̄]
                         [−nx̄     n  ]

A P% confidence region for the parameter vector β, or a test, may be constructed from the distributional result of the pivotal quantity
P = [(b − β)ᵀ XᵀX (b − β) / 2] / [(y − Xb)ᵀ(y − Xb) / (n − 2)] ~ F(2, n − 2)
In fact the confidence region will be an ellipsoid which covers the true, but unknown, value of the vector β with probability P%.
How good is the model?
The goodness of the model is measured by (i) the size of the sum of squared errors SSE and, better, (ii) the value of the coefficient of determination:
R² = SSR/SST = 1 − SSE/SST
However, both these measures can be improved artificially by introducing more independent variables X (more columns in the X matrix). As we know from geometry, the more explanatory variables we introduce, the closer the hyperplane comes to the observed vector, and naturally the question arises where to stop. An adjusted coefficient of determination has been developed which takes the extra variables into account; it is adjusted for the loss of degrees of freedom when we introduce new parameters:
R̄² = 1 − [SSE/(n − k)] / [SST/(n − 1)] = 1 − (1 − R²)(n − 1)/(n − k) = 1 − MSE/MST
Other indicators of the goodness of the model are: (i) plots of the residuals, (ii) significance of the estimated coefficients, (iii) tests for normality of the residuals.

Examples - Exercises
1. The amount of blood expelled from the heart with each ventricular contraction is known as the stroke volume. Medical researchers studying the relationship between age (in years) and stroke volume (ml of blood) obtained the following data from a random sample of patients.

Age (x)            25  30  35  40  45  50  55  60  65  70
Stroke volume (y)  76  77  74  71  72  70  68  67  64  62

(You may use Σx² = 25 025, Σy² = 54 835, Σxy = 34 120)
Draw a scatter diagram to represent these data.
a) Calculate the product moment correlation coefficient between x and y.
b) Interpret your result.
c) Find the equation of the regression line of y on x in the form y = a + bx.
d) From your line, estimate the stroke volume of a patient at the age of 75.
2. A teacher took a random sample of 8 children from a class. For each child the teacher recorded the length of their left foot, f cm, and their height, h cm. The results are given in the table below.

f   23   26   23   22   27   24   20   21
h   135  144  134  136  140  134  130  132

(You may use Σf = 186, Σh = 1085, Σfh = 25 291, Sff = 39.5, Shh = 139.875)
(a) Calculate Sfh.
(b) Find the equation of the regression line of h on f in the form h = a + bf. Give the value of a and the value of b correct to 3 significant figures.
(c) Use your equation to estimate the height of a child with a left foot length of 25 cm.
(d) Comment on the reliability of your estimate in part (c), giving a reason for your answer.
The left foot length of the teacher is 25 cm.
(e) Give a reason why the equation in part (b) should not be used to estimate the teacher's height.

WEEKS 10, 11: TIME SERIES
"Standing on the past, we live the present, hoping for the future"
The Problem: A set of measurements taken at consecutive time points constitutes a time series. A number of ways to analyze the series, estimate the parameters, and forecast future values are considered!
The Model:
Y_t = Trend + Seasonal variation + Short-term (non-random) variation + Random variation
(E: Expansion; R: Recession)
Very often, one of the problems in analyzing time series is how to estimate, and remove, seasonal variation, to get a clear picture of the trend. Here are some commonly used techniques:
(i) Regression with dummies:
Y_t = a + bt + cQ2 + dQ3 + fQ4 + ε_t,
where Q2, Q3, Q4 are dummy variables, taking the values 1 or 0 depending on whether we are in the second, third, or fourth quarter, thus reflecting any existing seasonal component! However, the residuals are serially correlated, so for this model we can estimate the first-order coefficient of serial correlation by:
r = Σ_{t=2}^{n} e_t e_{t−1} / Σ_{t=1}^{n} e_t²,
where e_1, e_2, ..., e_n are the residuals from the above multiple regression.
The hypothesis H0: ρ = 0 vs H1: ρ ≠ 0 is tested using the Durbin-Watson test statistic:
D = Σ_{t=2}^{n} (e_t − e_{t−1})² / Σ_{t=1}^{n} e_t² ≈ 2 − 2r
(ii) The Moving Average Model
The second model often used is the Moving Average (M.A.) of appropriate order (usually the period of the process), to estimate the trend.
If s (the period of seasonality) is even, then we need a further MA of order two to center the estimates, so that these estimates correspond to the observed values. In mathematical terms, the first term of the MA with period s is

  Y*_{1+s/2} = (1/s) Σ_{i=1}^{s} Y_i,

or, more explicitly: for most series the observations are taken quarterly, the period is four, and the moving average centered at Y_t is

  M(Y_t) = (0.5 Y_{t-2} + Y_{t-1} + Y_t + Y_{t+1} + 0.5 Y_{t+2}) / 4,   t = 3, 4, …, n-2

The second step is to calculate the seasonal component by taking either
(i) the difference U_t = Y_t - M(Y_t) (Additive Model), or
(ii) the ratio U_t = Y_t / M(Y_t) (Multiplicative Model),
and then averaging over all observations at each of the four phases, i.e.

  s_0 = (u_t + u_{t+4} + u_{t+8} + …)/n_0;    s_1 = (u_{t+1} + u_{t+5} + u_{t+9} + …)/n_1;
  s_2 = (u_{t+2} + u_{t+6} + u_{t+10} + …)/n_2;    s_3 = (u_{t+3} + u_{t+7} + u_{t+11} + …)/n_3;

where n_i is the number of observations in the i-th phase.

Finally we deseasonalize the series by subtracting from (dividing, for the multiplicative model) each observation the corresponding seasonal effect, obtaining

  Y_t - s_0, Y_{t+1} - s_1, Y_{t+2} - s_2, Y_{t+3} - s_3, Y_{t+4} - s_0, …

A linear regression is fitted to the deseasonalized series to obtain forecasts of the trend, which at the last stage should be corrected by the seasonal effect! For the most commonly used additive model, after estimating the trend and the seasonal component (which, in practice, is estimated by subtracting the estimated trend from the observed series and then averaging for each quarter or repeated point), we can make predictions based on algebraic or graphical estimates:

  Predicted Y_t = Trend estimate + Seasonal component

Random variation cannot be predicted. Non random variation? This is non-regular or cyclical variation about the trend. Autoregressive models may also apply!
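The centered moving average and the additive seasonal components described above can be sketched as follows. This is a minimal illustration on a synthetic quarterly series (linear trend plus an exact period-4 seasonal pattern summing to zero); the function names are ours, not from any library:

```python
def centered_ma4(y):
    # Centered MA of order 4: M(Y_t) = (0.5 y[t-2] + y[t-1] + y[t] + y[t+1] + 0.5 y[t+2]) / 4.
    # Undefined for the first two and last two observations.
    m = [None] * len(y)
    for t in range(2, len(y) - 2):
        m[t] = (0.5 * y[t - 2] + y[t - 1] + y[t] + y[t + 1] + 0.5 * y[t + 2]) / 4
    return m

def additive_seasonals(y):
    # u_t = Y_t - M(Y_t), then average the u_t over each of the four quarterly phases.
    m = centered_ma4(y)
    u = [yt - mt if mt is not None else None for yt, mt in zip(y, m)]
    seasonals = []
    for phase in range(4):
        vals = [v for v in u[phase::4] if v is not None]
        seasonals.append(sum(vals) / len(vals))
    return seasonals

# Synthetic quarterly series: trend 10 + 0.5 t plus seasonal effects (2, -1, -3, 2).
season = [2.0, -1.0, -3.0, 2.0]
y = [10 + 0.5 * t + season[t % 4] for t in range(16)]
s = additive_seasonals(y)
```

Because the trend is exactly linear and the seasonal pattern exactly periodic, the centered MA recovers the trend and `s` reproduces the seasonal effects; on real data the estimates are only approximate.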
Very simple linear regression models: Since the main components of most time series are the trend and the seasonal component, it is possible to fit simple models which take these two main effects into account, for example:

  Y_t = α + β Y_{t-s} + ε_t,   where ε_t ~ N(0, σ^2), independent,

and s is the period of seasonality. α and β are the parameters to be estimated from the data. Often we take s = 4 for quarterly data, or s = 12 for monthly data!

Examples – Exercises

1. Trend, seasonal variation and random variation are three terms used in the analysis of time series. Explain what they mean.
In order to compete with its larger rivals, a small cinema shows only first-rate films, with one performance each evening and a change of program every five weeks. The table below gives the weekly attendances at this cinema, in hundreds, during a 20 week period beginning in September.

Week:                    1    2    3    4    5    6    7    8    9   10
Attendance (hundreds): 5.5  9.2 10.1  7.5  6.8 11.7 17.4 17.2 15.0 13.1

Week:                   11   12   13   14   15   16   17   18   19   20
Attendance (hundreds):16.3 23.0 25.1 21.2 19.1 25.0 30.4 31.6 29.0 28.3

(i) Plot these data on a graph.
(ii) Calculate an appropriate moving average in order to smooth the series, and plot the values on the graph.
(iii) Discuss, with reasons, what you think might happen to attendances in the next 20 weeks, and predict the attendances in the 21st and 22nd weeks.

2. The earnings of a corporation during the seventies are given below in $000:

Year:       1971  1972  1973  1974  1975  1976  1977  1978
Quarter 1:   300   330   495   550   590   610   700   820
Quarter 2:   460   545   680   870   990  1050  1230  1410
Quarter 3:   345   440   545   660   830   920  1060  1250
Quarter 4:   910  1040  1285  1580  1730  2040  2320  2730

(i) Plot these data on a graph.
(ii) Calculate an appropriate moving average in order to smooth the series, and plot the values on the graph.
(iii) Estimate the trend by least squares on the MA.
(iv) Provide predictions for all four quarters of 1979.

WEEK 12 EXPERIMENTAL DESIGN

"We only observe the effects!
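For Exercise 1 above, the program changes every five weeks, so a simple moving average of order 5 is the appropriate smoother (odd order, so no centering step is needed). A minimal sketch:

```python
# Weekly cinema attendances, in hundreds, from Exercise 1.
attendance = [5.5, 9.2, 10.1, 7.5, 6.8, 11.7, 17.4, 17.2, 15.0, 13.1,
              16.3, 23.0, 25.1, 21.2, 19.1, 25.0, 30.4, 31.6, 29.0, 28.3]

def moving_average(y, order):
    # Each smoothed value is the mean of `order` consecutive observations;
    # for odd order it is centered on the middle observation.
    return [sum(y[t:t + order]) / order for t in range(len(y) - order + 1)]

ma5 = moving_average(attendance, 5)
# The order-5 MA removes the five-week program cycle; the smoothed series
# rises steadily, suggesting attendances will keep growing.
```

The first smoothed value is the mean of weeks 1-5, (5.5 + 9.2 + 10.1 + 7.5 + 6.8)/5 = 7.82, centered on week 3.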
What about the cause?"

The Problem: To isolate the factors of interest from other experimental noise.

13.1 Completely Randomized Design – One factor:

The model:   X_ij = μ + τ_j + ε_ij,   with ε_ij ~ N(0, σ^2), independent;

where
X_ij: the ith observation on the jth column (treatment)
μ: the overall effect
τ_j: the effect of treatment j
ε_ij: the error,
and i = 1, 2, …, n_j and j = 1, 2, …, m.

The set-up:

Replication \ Level:    1        2        3      …    m
        1              X_11     X_12     X_13    …   X_1m
        2              X_21     X_22     X_23    …   X_2m
        3              X_31     X_32     X_33    …   X_3m
        …               …        …        …      …    …
                      X_{n_1 1} X_{n_2 2} X_{n_3 3} … X_{n_m m}
Totals:                C_1      C_2      C_3     …   C_m

where N = Σ_{j=1}^{m} n_j is the total number of observations, T = Σ_{i,j} X_ij is the grand total, and C_j = Σ_{i=1}^{n_j} X_ij, so the jth column average is X̄_j = C_j / n_j and the grand average is X̄ = T/N.

The sums of squares:

Total:   SST = Σ_{i,j} (X_ij - X̄)^2 = Σ_{i,j} X_ij^2 - T^2/N

Between columns:   SSC = Σ_j n_j (X̄_j - X̄)^2 = Σ_j C_j^2/n_j - T^2/N

The within or error sum of squares:   SSW or SSE = Σ_{i,j} (X_ij - X̄_j)^2 = Σ_{i,j} X_ij^2 - Σ_j C_j^2/n_j

Note that, to avoid confusion of the denominators, as a general principle, each total squared is divided by the corresponding number of observations contributing to that total!

Analysis of Variance Table

Source of variation     Sum of Squares          Degrees of Freedom   Mean Square = SS/d.f.   F-Ratio
Columns (Treatments)    SSC                     m-1                  MSC = SSC/(m-1)         MSC/MSE ~ F(m-1, N-m)
Residual (Error)        SSE (by subtraction)    N-m                  MSE = SSE/(N-m)
Total (Corrected)       SST                     N-1

The Null Hypothesis H_0: no difference between treatments, i.e. τ_j = 0 for all j.

Reasoning: If there is no difference between treatments, then the model becomes X_ij = μ + ε_ij. So the two mean squares MSC and MSE are independent estimates of the same variance σ^2 of the assumed model. Their ratio is distributed like an F random variable with the corresponding degrees of freedom shown above.
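The one-way ANOVA computations above can be sketched numerically. The sketch below uses made-up data for three treatments; the shortcut formulas for SST, SSC and SSE are exactly those given in the notes:

```python
# Completely randomized design: three treatments (columns) with made-up observations.
groups = [[1, 2, 3], [2, 3, 4], [6, 7, 8]]

m = len(groups)                          # number of treatments
N = sum(len(g) for g in groups)          # total number of observations
T = sum(sum(g) for g in groups)          # grand total
sum_sq = sum(x * x for g in groups for x in g)

SST = sum_sq - T ** 2 / N                                    # total (corrected)
SSC = sum(sum(g) ** 2 / len(g) for g in groups) - T ** 2 / N  # between columns
SSE = SST - SSC                                              # within, by subtraction

MSC = SSC / (m - 1)
MSE = SSE / (N - m)
F = MSC / MSE      # compare with tables of F(m-1, N-m)
```

For these data F = 21 on (2, 6) degrees of freedom, far above the 5% critical value of F(2, 6), so H_0 of equal treatment effects would be rejected.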
13.2 Randomized Block Design – Two factors:

The model:   X_ij = μ + β_i + τ_j + ε_ij,   with assumptions ε_ij ~ N(0, σ^2), independent, and no interaction between the two factors;

where
X_ij: the observation in the ith row (block) and the jth column (treatment)
μ: the overall effect
β_i: the effect of the ith block (row)
τ_j: the effect of the jth treatment (column)
ε_ij: the error,
and i = 1, 2, …, n and j = 1, 2, …, m.

The set-up:

Block \ Treatment:   1     2     3    …    m     Totals
      1             X_11  X_12  X_13  …   X_1m   R_1
      2             X_21  X_22  X_23  …   X_2m   R_2
      3             X_31  X_32  X_33  …   X_3m   R_3
      …              …     …     …    …    …      …
      n             X_n1  X_n2  X_n3  …   X_nm   R_n
Totals:             C_1   C_2   C_3   …   C_m    T

where N = mn is the total number of observations, the grand total is T = Σ_{i,j} X_ij and X̄.. = T/N is the grand average; R_i = Σ_{j=1}^{m} X_ij and C_j = Σ_{i=1}^{n} X_ij, with X̄_i. = R_i/m the ith row average and X̄_.j = C_j/n the jth column average.

The sums of squares:

Total:   SST = Σ_{i,j} (X_ij - X̄..)^2 = Σ_{i,j} X_ij^2 - T^2/N

Between columns:   SSC = Σ_{i,j} (X̄_.j - X̄..)^2 = Σ_j C_j^2/n - T^2/N

Between rows:   SSR = Σ_{i,j} (X̄_i. - X̄..)^2 = Σ_i R_i^2/m - T^2/N

The within or error:   SSE = Σ_{i,j} (X_ij - X̄_i. - X̄_.j + X̄..)^2

Analysis of Variance Table

Source of variation   Sum of Squares   Degrees of Freedom   Mean Square = SS/d.f.    F-Ratio
Between Rows          SSR              n-1                  MSR = SSR/(n-1)          MSR/MSE ~ F(n-1, N-n-m+1)
Between Columns       SSC              m-1                  MSC = SSC/(m-1)          MSC/MSE ~ F(m-1, N-n-m+1)
Residual (Error)      SSE              N-n-m+1              MSE = SSE/(N-n-m+1)
Total (Corrected)     SST              N-1

The Null Hypothesis H_0: no difference between rows or between columns, i.e. β_i = 0 and τ_j = 0 for all i, j.

Reasoning: If there is no difference between rows or columns, then the model becomes X_ij = μ + ε_ij. So the mean squares MSR and MSC are independent estimates of the same variance σ^2. However, regardless of the row or column effects, an independent estimate of the variance σ^2 is the MSE, since any such differences are subtracted out of this sum of squares.

Exercises

A factory manufactures batches of an electronic component. Each component is manufactured in one of three shifts.
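A matching sketch for the randomized block design, again with made-up data (rows are blocks, columns are treatments), using the shortcut formulas from the notes and SSE obtained by subtraction:

```python
# Randomized block design: 3 blocks (rows) x 3 treatments (columns), made-up data.
X = [[1, 2, 3],
     [2, 3, 5],
     [3, 5, 6]]

n, m = len(X), len(X[0])
N = n * m
T = sum(sum(row) for row in X)                   # grand total
sum_sq = sum(x * x for row in X for x in row)
R = [sum(row) for row in X]                      # row (block) totals
C = [sum(col) for col in zip(*X)]                # column (treatment) totals

SST = sum_sq - T ** 2 / N
SSR = sum(r * r for r in R) / m - T ** 2 / N     # between rows
SSC = sum(c * c for c in C) / n - T ** 2 / N     # between columns
SSE = SST - SSR - SSC                            # residual, d.f. = (n-1)(m-1) = N-n-m+1

MSE = SSE / ((n - 1) * (m - 1))
F_rows = (SSR / (n - 1)) / MSE      # compare with F(n-1, N-n-m+1)
F_cols = (SSC / (m - 1)) / MSE      # compare with F(m-1, N-n-m+1)
```

Note the decomposition SST = SSR + SSC + SSE, which is why SSE can be found by subtraction.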
A component may have one of two types of defect, D_1 or D_2, at the end of the manufacturing process. A production manager believes that the type of defect is dependent upon the shift that manufactured the component. He examines 200 randomly selected defective components and classifies them by defect type and shift. The results are shown in the table below.

                 Defect type
Shift            D_1    D_2
First shift       45     18
Second shift      55     20
Third shift       50     12

Stating your hypotheses, test, at the 10% level of significance, whether or not there is evidence to support the manager's belief. Show your working clearly.

WEEKS 13 and 14: Review
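The contingency-table test in this exercise can be checked numerically with a chi-squared test of independence. A minimal sketch; the 10% critical value for chi-squared with (3-1)(2-1) = 2 degrees of freedom is 4.605:

```python
# Observed defect counts: rows = shifts, columns = defect types D1, D2.
observed = [[45, 18],
            [55, 20],
            [50, 12]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)   # 200 components in total

# Expected counts under H0 (defect type independent of shift): E_ij = R_i * C_j / N
expected = [[r * c / grand for c in col_totals] for r in row_totals]

chi2 = sum((o - e) ** 2 / e
           for orow, erow in zip(observed, expected)
           for o, e in zip(orow, erow))

# chi2 comes to about 1.59, well below the 10% critical value 4.605,
# so there is no evidence at the 10% level to support the manager's belief.
```

Note how each expected count uses the row and column totals, mirroring the general principle in the ANOVA section: every total is divided by the number of observations contributing to it.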