CDA COLLEGE (LIMASSOL) 2015 - 2016
MBA 603 QUANTITATIVE METHODS
Fall Term

Syllabus
1. Introductory techniques & methods
2. Descriptive Statistics
3. Probability Theory
4. Sampling and Sampling Distributions
5. Hypothesis Testing
6. One Sample t-tests
7. Revision
8. Mid-term Examination
9. Two sample t-tests
10. Random digits - Degrees of Freedom - Contingency Tables
11. Simple Linear Regression
12. Multiple regression
13. Time Series Analysis & Forecasting
Revision for the final exam

► Coursework 50%, Final Exam 50%; Pass 60%

Textbook: Statistics, D. Freedman, R. Pisani, R. Purves; Norton, 2000
Reference books:
1. The Elementary Forms of Statistical Reason, R. P. Cuzzort and James S. Vrettos; St. Martin's (1996)
2. Strategic Management and Business Policy, Glueck/Jauch; McGraw-Hill
3. Management Policy, Stanford M J; Prentice Hall
4. Business Policy, Thomas R E; Philip Allan

WEEK 1: Introductory Techniques and Methods
"A picture is worth a thousand words"

The Problem: Presenting a set of numerical information is mostly an art, and the ways of presentation vary considerably, depending on the skill and imagination of the presenter.

STATISTICS: from the Latin word "Status", meaning Situation or State (Greek: Κατάσταση, Κράτος)

The use of quantitative techniques in business: objective, reliable, accurate
The role of quantitative techniques in business: basis for judgment and model building
Variables (qualitative, quantitative, discrete, continuous)
Review of: ∑ notation, equations; derivatives and integration

Sequences and Series
Definition: $\sum_{r=1}^{n} g(r) = g(1) + g(2) + \dots + g(n)$
Maclaurin's theorem: $f(x) = \sum_{r=0}^{\infty} \frac{f^{(r)}(0)}{r!} x^{r} = f(0) + \frac{f'(0)}{1!} x + \frac{f''(0)}{2!} x^{2} + \dots$
Arithmetic and geometric series, along with Maclaurin's theorem above, lead to the following:

$S_1 = \sum_{x=1}^{n} x = \frac{n(n+1)}{2}$;  $S_2 = \sum_{x=1}^{n} x^{2} = \frac{n(n+1)(2n+1)}{6}$;  $S_3 = \sum_{x=1}^{n} x^{3} = \frac{[n(n+1)]^{2}}{4}$

Geometric (finite): $a + ar + ar^{2} + \dots + ar^{n-1} = \sum_{k=1}^{n} ar^{k-1} = \frac{a(1-r^{n})}{1-r}$
Geometric (infinite): $1 + x + x^{2} + x^{3} + \dots + x^{n} + \dots = \sum_{r=0}^{\infty} x^{r} = \frac{1}{1-x}$, provided $-1 < x < 1$
Exponential: $e^{x} = 1 + x + \frac{x^{2}}{2!} + \frac{x^{3}}{3!} + \dots + \frac{x^{n}}{n!} + \dots = \sum_{n=0}^{\infty} \frac{x^{n}}{n!}$, $x \in R$

Examples and exercises
Evaluate the following sums:
(i) $\sum_{k=1}^{4} (3k\,[\ldots]\,1)$, (ii) $\sum_{k=0}^{10} 2^{k}$, (iii) $\sum_{r=1}^{n} (r^{2}\,[\ldots]\,2r\,[\ldots]\,1)$ (n = 5, 10, 100), (iv) $\sum_{x=0}^{n} 1000(0.9)^{x}$ (for n = 1, 10, ∞)

The exponential function
$e^{x} = 1 + x + \frac{x^{2}}{2!} + \frac{x^{3}}{3!} + \dots + \frac{x^{n}}{n!} + \dots$, $x \in R$, or $e^{x} = \sum_{n=0}^{\infty} \frac{x^{n}}{n!}$, or $e^{x} = \lim_{n \to \infty} \left(1 + \frac{x}{n}\right)^{n}$

Review of Logarithms
Fundamental statements: If $y = b^{x}$, then $x = \log_b y$, $b > 0$.
The natural logarithm is defined from the equation $x = e^{y}$; then $y = \ln x$.
Properties: $\ln 1 = 0$, $\ln e = 1$, $\ln(xy) = \ln x + \ln y$, $\ln(1/x) = -\ln x$, $\ln(x/y) = \ln x - \ln y$

Differentiation:
$\frac{dy}{dx}$ or $y' = \lim_{\delta x \to 0} \frac{\delta y}{\delta x}$
If $y = ax^{n}$, then $\frac{dy}{dx} = nax^{n-1}$ for any real n (special case: if y = constant, then $y' = 0$)
If $y = \sin x$, then $y' = \cos x$
If $y = \cos x$, then $y' = -\sin x$
If $y = \tan x$, then $y' = \sec^{2} x$
Note that if $y = e^{ax}$, then $y' = ae^{ax}$; also if $y = a^{x}$, then $y' = a^{x} \ln a$
If $y = \ln x$, then $y' = 1/x$
Naturally, $\frac{d}{dx}(e^{x}) = e^{x}$ and $\int e^{x}\,dx = e^{x} + c$
Sum Rule: $\frac{d(u \pm v)}{dx} = \frac{du}{dx} \pm \frac{dv}{dx}$
Product Rule: $(uv)' = u'v + uv'$
Quotient Rule: $\left(\frac{u}{v}\right)' = \frac{u'v - uv'}{v^{2}}$
Chain Rule: $\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$
Equation of tangent: $y - y_1 = m(x - x_1)$, where $m = \left.\frac{dy}{dx}\right|_{x = x_1}$ is the gradient of the curve at the given point.

Examples and exercises
1. Find dy/dx: (i) $y = 2x^{3}$, (ii) $y = 4x^{4} - x^{2} + 1$, (iii) $y = x(x - 2)(x + 1)$, (iv) $y = x^{2}/(1 - x)$, (v) $y = 2\cos 3x + 4\tan 5x$
2. Find the second derivative: (i) $y = 7x^{3} - 6x^{5}$, (ii) $y = (x^{2} - 3)/x$, (iii) $y = \sin x$
3. Obtain the equation of the tangent to each of the functions in question 1, at x = 1.

Integration: The general principle is that integration is the inverse operation of differentiation, i.e.
1. $\int ax^{n}\,dx = \frac{1}{n+1} ax^{n+1} + c$, provided $n \neq -1$
2. $\int x^{-1}\,dx = \int \frac{1}{x}\,dx = \ln x + c$
3. All formulae which can be deduced from the known derivatives.

Area under the curve y = f(x), between the lines x = a and x = b and the OX-axis:
$A = \int_{a}^{b} f(x)\,dx$ (sum of rectangles)
Volume generated by revolution of the part of the curve between the lines x = a and x = b about the OX-axis:
$V = \pi \int_{a}^{b} y^{2}\,dx$ (sum of cylinders)

Examples and exercises
1. Evaluate the following integrals:
(i) $\int_{[\ldots]}^{[\ldots]} (2x\,[\ldots]\,3)\,dx$, (ii) $\int_{[\ldots]}^{[\ldots]} (x\,[\ldots]\,1)\,dx$, (iii) $\int_{[\ldots]}^{[\ldots]} (t\,[\ldots]\,2)(t\,[\ldots]\,1)\,dt$, (iv) $\int_{[\ldots]}^{[\ldots]} \cos x\,dx$, (v) $\int_{[\ldots]}^{[\ldots]} \frac{1}{(x\,[\ldots]\,2)^{2}}\,dx$
2. Find the area between the graph of f and the x-axis:
(i) $f(x) = 2 + x^{3}$, $x \in [0, 1]$, (ii) $f(x) = x^{2}(3 + x)$, $x \in [0, 8]$, (iii) $f(x) = x\,[\ldots]\,1$, $x \in [3, 8]$
3. Sketch the region bounded by the curves and find the volume generated if the region is rotated 360° about OX:
(i) $y = x$, $y = x^{2}$, (ii) $y = 6x - x^{2}$, $y = 2x$, (iii) $y = x^{2} + 2/x$, $y = 5$

WEEK 2: Graphical Representation
Pie chart; Histogram; Cumulative Frequency Polygon (Ogive): advantages and disadvantages
General principles of presentation (simple, self-explanatory, accurate, etc.)

The Histogram
Principle: The area of each rectangular block is proportional to the corresponding frequency, so that the Y-axis represents the density = frequency/width.
[Figure: example histogram]
(i) The number of classes chosen from raw data is arbitrary. (Often about $\sqrt{n}$ classes is adequate!)
(ii) Connecting the midpoints of the top sides of the rectangles, we get the frequency polygon.

Stem and leaf plot (an example)
Marks (out of 100) of 30 students:
50, 55, 72, 51, 70, 63, 32, 44, 46, 68
85, 22, 25, 57, 74, 85, 35, 84, 53, 48
35, 53, 72, 61, 44, 64, 65, 45, 55, 53

Stem │ Leaf (4|5 means 45)   Freq.
2    │ 2 5                   2
3    │ 5 2 5                 3
4    │ 4 4 5 6 8             5
5    │ 0 5 3 1 7 3 5 3       8
6    │ 1 3 4 5 8             5
7    │ 2 2 0 4               4
8    │ 5 5 4                 3

(i) Each entry on the leaf represents the last digit of an observation.
(ii) The plot gives the shape of the histogram if we turn it ninety degrees anticlockwise.
(iii) Most importantly, nothing is lost from the raw data. (As opposed to the more elegant histogram!)
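The stem-and-leaf construction above can be sketched in a few lines of Python. This is only an illustration using the 30 marks listed in the notes; the function name `stem_and_leaf` is hypothetical, and the leaves are sorted within each stem, whereas the hand-drawn table keeps them unsorted.

```python
from collections import defaultdict

# Marks (out of 100) of the 30 students, as listed in the notes.
marks = [50, 55, 72, 51, 70, 63, 32, 44, 46, 68,
         85, 22, 25, 57, 74, 85, 35, 84, 53, 48,
         35, 53, 72, 61, 44, 64, 65, 45, 55, 53]

def stem_and_leaf(values):
    """Group two-digit values by tens digit (stem) and keep the units digits (leaves)."""
    plot = defaultdict(list)
    for v in values:
        plot[v // 10].append(v % 10)
    # Sort stems, and sort the leaves within each stem (an illustrative choice).
    return {stem: sorted(leaves) for stem, leaves in sorted(plot.items())}

plot = stem_and_leaf(marks)
for stem, leaves in plot.items():
    # One row per stem: stem | leaves, followed by the class frequency.
    print(f"{stem} | {' '.join(map(str, leaves))}   ({len(leaves)})")
```

The frequencies printed in brackets are the heights a histogram with classes 20-29, 30-39, ... would have, which is why no information is lost relative to the raw data.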
Bar Chart
- Used for a qualitative variable
- Area of block represents frequency

Pie Chart
- For each sector, the corresponding angle is $360^{\circ} \times \frac{f}{N}$, where f is the frequency of the class and N is the total frequency: $\sum_{i=1}^{k} f_i = N$

Cumulative frequency polygon or curve (ogive)
Each point (X, F) represents the pair:
X: upper boundary of the class.
F: number of observations ≤ X (absolute, or relative %)

EXAMPLES – EXERCISES
1. The results of the Euro-elections in May 2014 were the following:

Party        Votes      Colour
(a) ΔΗΣΥ     37.75%     Blue
(b) ΑΚΕΛ     26.98%     Red
(c) ΔΗΚΟ     10.83%     Yellow
(d) ΕΔΕΚ      7.68%     Green
(e) Others   16.76%     Cyan
Total       100.00%

After calculating the corresponding angles (to the nearest degree), construct the pie chart with the corresponding colours.

2. The results of an Economics examination at a college are shown below. Construct a stem and leaf diagram to represent these data:
31 54 80 58 73 50 69 65 84 49 67 47 70 78 77 67 55 78 62 59 54 41 69 65 41 96 80 89 54 68

3. The following table shows the time, to the nearest second, recorded for the telephonist to answer the calls received during a particular day. For the third class, find the class characteristics (boundaries; limits; midpoint; width; frequency; cumulative frequency). Represent these data by a histogram and a cumulative frequency polygon.

Time to answer (nearest second)   Number of calls
10-19                             20
20-24                             20
25-29                             15
30                                14
31-34                             16
35-39                             10
40-59                             10

WEEK 3: Descriptive Statistics
Measures of Location: Minimum, Maximum
Average: $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_i$, or $\bar{X} = \frac{1}{n} \sum_{j=1}^{k} x_j f_j$
Median: $m = L + \frac{(n+1)/2 - F}{f}\, w$ (for continuous variables)
where n is the sample size, k is the number of classes, f is the median class frequency, $x_j$ is the midpoint of the j-th class, L is the lower boundary of the median class, F is the cumulative frequency of the previous class and w is the class width.
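The grouped-data formulas for the average and the median above can be sketched as follows. This is a minimal illustration, not a prescribed method: the function names and the three-class frequency table are hypothetical, and the median position is taken as (n + 1)/2, as in the notes.

```python
def grouped_mean(midpoints, freqs):
    """Average from a frequency table: (1/n) * sum of x_j * f_j."""
    n = sum(freqs)
    return sum(x * f for x, f in zip(midpoints, freqs)) / n

def grouped_median(lower_bounds, widths, freqs):
    """Median by interpolation: m = L + ((n + 1)/2 - F) / f * w."""
    n = sum(freqs)
    position = (n + 1) / 2          # position of the median observation
    cumulative = 0                  # F: cumulative frequency of previous classes
    for L, w, f in zip(lower_bounds, widths, freqs):
        if f > 0 and cumulative + f >= position:
            return L + (position - cumulative) / f * w
        cumulative += f
    raise ValueError("frequency table is empty")

# Hypothetical table: classes 0-10, 10-20, 20-30 with frequencies 2, 5, 2.
lower_bounds = [0, 10, 20]
widths = [10, 10, 10]
midpoints = [5, 15, 25]
freqs = [2, 5, 2]

mean = grouped_mean(midpoints, freqs)
median = grouped_median(lower_bounds, widths, freqs)
print(mean, median)
```

Here n = 9, so the median position is 5; it falls in the second class (F = 2, f = 5), giving m = 10 + (5 - 2)/5 × 10 = 16.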
The quartiles are obtained similarly:
$Q_1 = L + \frac{(n+1)/4 - F}{f}\, w$;  $Q_3 = L + \frac{3(n+1)/4 - F}{f}\, w$
where L, F, f and w now refer to the class containing the quartile.
Mode is the value with the highest frequency.
Methods of calculation: single values, frequencies, discrete, continuous

Measures of Dispersion:
Understanding Dispersion: Consider the marks achieved in four Mathematics tests by two students, Andy and Brian.

Test #     Andy   Brian
1          15     10
2          16     20
3          14     18
4          15     12
Average    15     15

Do you see any differences between the performances of the two students? Andy is more predictable; Brian has potential, but no consistency! It is precisely this characteristic which is expressed by dispersion. We quantify it by the following measures:

Range: $R = X_{max} - X_{min}$
Interquartile range: $IQR = Q_3 - Q_1$; Semi IQR = $(Q_3 - Q_1)/2$
Variance: $\sigma^{2}$ or $s^{2}$; Standard Deviation: $\sigma$ or $s$
For a sample of size n:
$s^{2} = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^{2}$ or $s^{2} = \frac{1}{n-1} \left( \sum_{i=1}^{n} X_i^{2} - n\bar{X}^{2} \right)$ (for single values)
$s^{2} = \frac{1}{n-1} \sum_{j=1}^{k} (X_j - \bar{X})^{2} f_j$ or $s^{2} = \frac{1}{n-1} \left( \sum_{j=1}^{k} X_j^{2} f_j - n\bar{X}^{2} \right)$ (for a frequency table)

Applications
1. The high blood pressure for a sample of eleven pensioners is given below:
12.4, 14.7, 10.2, 16.3, 13.9, 12.2, 10.7, 11.8, 12.6, 11.9, 11.5
(a) Estimate the median and the quartiles.
(b) Estimate the mean, the standard deviation and the interquartile range.

2. Delegates to a National Congress had their ages recorded in years. The table below summarises these data.

Age      Freq.
18-24    20
25-31    35
32-38    25
39-45    18
46-52    12
53-59    7
60-73    3

(i) Construct the histogram and the cumulative frequency polygon.
(ii) Calculate the appropriate measures of location and dispersion, justifying your choices.

3. The following table summarises the birth weights of a random sample of 100 babies born in a clinic over a year.

Weight in kg   Frequency
1.5-1.9        2
2.0-2.4        9
2.5-2.9        12
3.0-3.4        18
3.5-3.9        22
4.0-4.4        17
4.5-4.9        13
5.0+           7

(a) Write down the upper class boundary of the third class.
(b) Calculate estimates of the median and the quartiles of these birth weights.
(c) Calculate an estimate of the mean.
(d) Comment on the skewness of these data.

WEEK 4: Probability Theory
The notion of probability arises when we have a random experiment: one which can be performed over and over again, theoretically under identical conditions, but whose outcome we cannot predict! For example, tossing a coin, observing the price of a stock on a particular day, measuring the weight of a randomly selected person, counting the number of bank customers in a queue at a particular moment of the day, and so on.
For such an experiment, we may talk about:
- Events (Ενδεχόμενα): the term is often used before the experiment (subject to probabilistic analysis)
- Outcomes (Γεγονότα): the term is often used after the experiment (subject to statistical analysis)

Special Events
A Simple Event can occur in a single way (e.g. getting the pair [5, 6] in this order when we toss two dice, one red and one blue, i.e. 5 red, 6 blue).
A Composite Event can occur in more than one way (e.g. obtaining a sum of 8 with two dice, i.e. 2-6, 6-2, 3-5, 5-3, 4-4).
The Sample Space Ω (or S) is the set of all simple events.
[Figure: Venn diagram of the sample space Ω]
The Certain event "Ω" is the one which always occurs!
The Impossible event "Ø" never occurs!
Complement of an event A: "non-occurrence of A" is denoted by $\bar{A}$ (or $A^{c}$ or $A'$). In the diagram the complement of A is the set of points outside A, but inside the frame Ω, i.e. the blue, purple, red and white points.

Recall from set theory:
Union $A \cup B$ means the set of all elements x belonging to either the set A, or the set B, or both, with each element taken only once.
Intersection A∩B or AB means the set of all elements x belonging to both sets A and B.

Definition of Probability: For any event E, a subset of the sample space Ω,
$P(E) = \frac{\#\text{ of favorable equiprobable simple events in } E}{\#\text{ of possible equiprobable simple events in } \Omega}$ (or $P(E) = \lim_{n \to \infty} \frac{f}{n}$, the limiting relative frequency),
where f is the number of occurrences of the event E out of a total of n trials.

Kolmogorov's Axioms (basic 3 out of 7)
For any events E, A or B belonging to the sample space Ω ($E \subseteq \Omega$) we have:
(i) $0 \le P(E) \le 1$
(ii) $P(\Omega) = 1$
(iii) If events A and B are mutually exclusive, i.e. A∩B = Ø, then $P(A \cup B) = P(A) + P(B)$ (Addition Law)

Independence: Events A and B are said to be independent if the occurrence of one does not affect the probability of occurrence of the other; mathematically,
P(A∩B) = P(AB) = P(A) P(B)

Some Theorems (may be proved straight from the axioms):
P(Ø) = 0
$P(A \cup B) = P(A) + P(B) - P(AB)$, where the notation "AB" is the same as "A∩B"
$P(\bar{A}) = 1 - P(A)$ (Complement rule)
$P(\bar{A}B) = P(B) - P(AB)$, using the Venn diagram
De Morgan's Laws: $\overline{A \cup B} = \bar{A} \cap \bar{B}$ and $\overline{A \cap B} = \bar{A} \cup \bar{B}$

Some practical ways of calculating probabilities
(i) From the definition of probability and/or counting
(ii) Using the complement
(iii) Applying the theorems
(iv) Drawing the Venn diagram
(v) Tree diagram with the corresponding probability on each branch

Example: Consider the random experiment of tossing two fair dice (Blue; Green). Define the events:
B: "getting four with the Blue die"; G: "getting four with the Green die"; E: "at least a four on either die".
Find (a) P(Sum 10); (b) P(E).

Conditional probability (dependent events): "Event B occurs, given that event A has occurred".
Definition: $P(B|A) = \frac{P(AB)}{P(A)}$, or $P(A|B) = \frac{P(AB)}{P(B)}$
Another way to look at it: knowing that the first event A has occurred, we can restrict the sample space to those points covered by A.
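The two-dice example and the definition of conditional probability can be checked by enumerating the sample space. This is a small sketch under stated assumptions: "P(Sum 10)" is read as the probability that the sum equals 10, E is read as "a four on at least one die", and the helper `prob` is a hypothetical name.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equiprobable ordered pairs (blue, green).
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """Classical definition: favourable simple events / possible simple events."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def B(w):  # four with the Blue die
    return w[0] == 4

def G(w):  # four with the Green die
    return w[1] == 4

# (a) Assumption: "Sum 10" means the sum equals 10 (4-6, 5-5, 6-4).
p_sum10 = prob(lambda w: w[0] + w[1] == 10)

# (b) E: a four on at least one die, i.e. the union of B and G.
p_E = prob(lambda w: B(w) or G(w))

# Conditional probability P(G|B) = P(BG) / P(B).
p_G_given_B = prob(lambda w: B(w) and G(w)) / prob(B)

print(p_sum10, p_E, p_G_given_B)
```

The value of P(E) agrees with the theorem P(B ∪ G) = P(B) + P(G) − P(BG), and P(G|B) coincides with P(G), as expected for two dice that do not influence each other.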
If, of course, A and B are independent, then simply P(B|A) = P(B) and P(A|B) = P(A).
Use of tree or Venn diagrams helps a lot!

Multiplication Law (inverting conditional probability):
P(AB) = P(A) P(B|A) = P(B) P(A|B), or, for three events A, B, C, P(ABC) = P(A) P(B|A) P(C|AB)

Law of Total Probability:
$P(E) = P(AE) + P(\bar{A}E) = P(A)P(E|A) + P(\bar{A})P(E|\bar{A})$
"In words, event E can occur either with event A, or without A!"

Random variables: A r.v. X takes its values according to some probabilistic law. A random variable and its probability (density) function always obey the following two conditions:

        Discrete                          Continuous
(i)     $P(x) \ge 0$                      $f(x) \ge 0$ for all x
(ii)    $\sum_{all\ x} P(x) = 1$          $\int_{all\ x} f(x)\,dx = 1$

The Mean or Expected Value: $\mu = E[X] = \sum_{all\ x} xP(x)$ or $\int_{all\ x} xf(x)\,dx$
The Variance: $\sigma^{2} = Var(X) = E[(X - \mu)^{2}] = E[X^{2}] - \mu^{2}$
Useful note: $E(X^{2}) = \sigma^{2} + \mu^{2}$

NORMAL DISTRIBUTION
Notation: $X \sim N(\mu, \sigma^{2})$: $E(X) = \mu$, $Var(X) = \sigma^{2}$
Shape of the distribution: symmetric, bell shaped
Standardize: $Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$; Tables: $P(Z \le z) = \Phi(z)$
Central Limit Theorem (CLT): the average $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_i \sim N(\mu, \sigma^{2}/n)$, approximately for large n

Combinatorial Analysis (Optional!)
Arrangements of n different objects (ordering): $A_n = n!$, e.g. for three letters A, B, C we get ABC, ACB, BAC, BCA, CAB, CBA (i.e. 3! = 1×2×3 = 6)
Cyclical: $A_n = (n-1)!$
With repetitions: $A_n = \frac{n!}{r!\,s!\,t!}$, where r, s, t are the numbers of similar objects (of course r + s + t = n)
Permutations: $P^{n}_{x} = \frac{n!}{(n-x)!}$ (choosing and ordering)
Combinations: notation $C^{n}_{x}$ or $\binom{n}{x} = \frac{n!}{x!\,(n-x)!}$ (just choosing, with no ordering)
e.g. for four different objects A, B, C, D we have $C^{4}_{2} = \binom{4}{2} = 6$: AB, AC, AD, BC, BD, CD.
For permutations, however, ordering is important, so $P^{4}_{2} = 12$: (AB, BA, AC, CA, AD, DA, BC, CB, BD, DB, CD, DC)
Note that $P^{n}_{x} = C^{n}_{x} A_{x}$ for all x ≤ n.
Application: Find the probability of getting at least 1 heart if we draw four cards out of 52 (there are 13 hearts in a full pack).

Examples – Exercises
1. A casino player bets €10 on Red.
If it comes up Red, he bets another €10. If the first number is Black, he bets €20 again on Red. Find the probability that a total of two red numbers come up. Find also his expected gain. Note that P(Red) = 1/2.

2. A committee of size 5 is to be selected at random from 3 women and 5 men.
a) Show that there are 56 ways of choosing the committee.
Let W represent the number of women on the committee.
b) Show that P(W = 2) = 15/28.
c) Find the probability distribution for W.
d) Find E(W).

3. Jam is packed into tins of advertised weight 1 kg. The weight of a randomly selected tin of jam is normally distributed about a target weight with a standard deviation of 12 g. If the target weight is 1 kg, find the probability that a randomly chosen tin weighs
i) less than 985 g,
ii) between 970 and 1015 g.

4. The random variable X (number of matches in a box) is roughly normally distributed with mean μ = 40 and standard deviation σ = 3. For the sample mean $\bar{X}$ of a random sample of 12 boxes, write down the mean and the standard deviation, and find the probability that the average $\bar{X}$ is between 37 and 43.

WEEK 5: Sampling and Sampling Distributions
"You don't have to eat the whole soup to realize that it turned sour"

The Problem: One of the fundamental objectives of statistical science is to make inferences about the whole population by examining a small portion of it!

Simple Random Sample: A random sample is a collection of independent identically distributed (i.i.d.) random variables $X_1, X_2, \dots, X_n$; equivalently, we can consider it as a small group of size n, randomly taken from the population of size N.
To draw the sample, we can use random numbers generated by a computer or a calculator, or taken from statistical tables.
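As an illustration of drawing a sample with computer-generated random numbers, the sketch below selects n units without replacement from a population numbered 1 to N. The function name, the population size 80, the sample size 10 and the seed are hypothetical choices made only for this example.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Draw `sample_size` distinct unit numbers from 1..population_size."""
    rng = random.Random(seed)  # a seeded generator makes the draw reproducible
    # random.sample selects without replacement, so every subset of the
    # given size has the same chance of being chosen.
    return sorted(rng.sample(range(1, population_size + 1), sample_size))

# Hypothetical use: choose 10 students out of 80 numbered 1 to 80.
sample = simple_random_sample(80, 10, seed=2015)
print(sample)
```

Fixing the seed is only a convenience for checking the work; in a real survey the seed would normally not be chosen by the analyst in advance of seeing the frame.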
In order to understand some of the important aspects of sampling, we consider a very simple example:
Consider a "population" of 4 families in a small village with numbers of children: {3, 1, 0, 8}
Population size: N = 4; variable of interest: X = the number of children in a family.
Note that for this population
$\mu = \frac{1}{N} \sum X = \frac{1}{4}(3 + 1 + 0 + 8) = 3$,
$\sigma^{2} = \frac{1}{N} \sum (X - \mu)^{2} = \frac{(3-3)^{2} + (1-3)^{2} + (0-3)^{2} + (8-3)^{2}}{4} = \frac{0 + 4 + 9 + 25}{4} = 9.5$

If we take a sample of size n = 2, without replacement, the number of possible samples is $\binom{4}{2} = 6$, and the sampling distribution is the set of all values of $\bar{X}$, along with the corresponding probabilities, for all possible samples:

Sample    Average $\bar{X}$    Probability
(3,1)     2                    1/6
(3,0)     1.5                  1/6
(3,8)     5.5                  1/6
(1,0)     0.5                  1/6
(1,8)     4.5                  1/6
(0,8)     4                    1/6
Totals    18                   1.0

Note that the median of the population is 2 and the mode does not exist! Now:
$E(\bar{X}) = \sum_{all\ \bar{X}} \bar{X} P(\bar{X}) = \frac{1}{6}(2 + 1.5 + \dots + 4) = 3\ (= \mu)$
$Var(\bar{X}) = \frac{1}{6}\left[(2-3)^{2} + (1.5-3)^{2} + \dots + (4-3)^{2}\right] = 3.167$

Methods of sampling
(i) Simple random sampling without replacement (SRSWOR). Each sample (subset) of the same size from the population has the same probability of being selected. Consequently each unit has the same chance of being chosen in the sample.
(ii) Stratified sampling, with prior knowledge (often the best scheme, but not fully random in the above sense). Choose from every stratum a SRSWOR! The size of the sample for each stratum should be proportional to the population stratum size, i.e.
$n_j = \frac{N_j}{N}\, n$, $j = 1, 2, \dots, k$
(where k is the number of strata, and $n_j$ and $N_j$ are the sizes of the sample and the population, respectively, in the j-th stratum)
(iii) Quota sampling: Selection is based on the characteristics of the population (gender, geographical areas, age, education level and so on), so that the sample is essentially a small picture of the population. This is a sort of stratified sample, but not a random one; some randomness may be introduced, but it remains biased. (This scheme is preferred for practical considerations and its significantly lower cost!)
(iv) Cluster sampling: Choose clusters (regions, classes, villages, etc.), and take either each chosen cluster as a whole or a random sample within each chosen cluster. This is a good scheme when clusters behave similarly; it is not a fully random sample, but it does have a component of randomness.
(v) Systematic sampling: If the sampling ratio is $k = N/n$, pick a random number r from 01 to k (using tables of random numbers), and then pick every k-th member. For example, if the population size is N = 120 and the sample size is n = 5, then k = 24. Choosing a two-digit random number from 01 to 24 from the tables, let us say r = 14, we get the sample corresponding to the items with numbers (14, 38, 62, 86, 110). This sampling scheme is easy, cheap and satisfactory, provided no pattern in the variable of interest exists along the series of observations in the population. For example, if soldiers in a parade are standing ranked in descending order of height, then systematic sampling is not a good way to estimate their mean height! (A random number like r = 1 or 2 will tend to grossly overestimate the true mean height, whereas r = 23 or 24 in the above example would underestimate the mean height of the group of soldiers.)

Central Limit Theorem: For a sample $(X_1, X_2, \dots, X_n)$ of considerable size from any distribution,
$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \approx N\left(\mu, \frac{\sigma^{2}}{n}\right)$.
This is why the Normal distribution is the most important distribution in the theory of Statistics.

Confidence intervals
Point estimation (estimating a parameter with a single value) gives us some idea about the value of the parameter, but important information about the precision of this estimate is missing!
An estimate of the population mean μ is given by $\hat{\mu} = \bar{X}$; for the variance $Var(X) = \sigma^{2}$ the estimate is $\hat{\sigma}^{2} = S^{2}$; for the proportion π it is $\hat{\pi} = p$; and for the correlation coefficient ρ it is $\hat{\rho} = r$.
This procedure is useful, but a single value will almost certainly miss the target.
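The systematic sampling scheme described in (v) above can be sketched as follows. The function name is hypothetical, and the random start r is passed in explicitly so that the example from the notes (N = 120, n = 5, r = 14) can be reproduced; in practice r would come from a table of random numbers or a random number generator.

```python
def systematic_sample(population_size, sample_size, start):
    """Pick every k-th unit, k = N // n, beginning at unit number `start`."""
    k = population_size // sample_size  # sampling ratio k = N/n
    if not 1 <= start <= k:
        raise ValueError("start must lie between 1 and k")
    return [start + i * k for i in range(sample_size)]

# Example from the notes: N = 120, n = 5, so k = 24, with random start r = 14.
units = systematic_sample(120, 5, 14)
print(units)
```

With heights ranked in descending order, a start of r = 1 picks the tallest soldier of each block of 24 and r = 24 the shortest, which is the bias the notes warn about.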
A more realistic and "safe" procedure is often provided by Confidence Intervals.
Given a random sample $X_1, X_2, \dots, X_n$, we define a P% Confidence Interval (CI) for the parameter θ as a random interval $[T_1, T_2]$ such that
$\Pr(T_1 \le \theta \le T_2) = P\%$
(central about T, even if the distribution is not symmetric), where $T_1$ and $T_2$ are the values of the statistic T obtained from the sample.
As a general principle, the C.I. for θ is approximately $\hat{\theta} \pm z\, S.E.(\hat{\theta})$, where z is the appropriate value from N(0,1).
Very commonly used intervals:
for μ: $\bar{X} \pm z\, \sigma/\sqrt{n}$
       $\bar{X} \pm t_{(n-1)}\, s/\sqrt{n}$ (when σ is not known, and the sample size n is small)
for π: $p \pm z \sqrt{\frac{p(1-p)}{n}}$

Applications
1. What is meant by a random sample? Here is an extract from a table of random numbers:
86 13 84 10 07 30 39 05 97 96 88 07 37 26 04 89 13 48 19 20 60 78 48 12 99 47 09 46 91 33 17 21 03 94 79 00 08 50 40 16 78 48 06 37 82 26 01 06 64 65 94 41 17 26 74 66 61 93 14 97
(a) Starting from the first line and the third column of the table, with the number 84, and reading across the table, select and write down 10 random numbers between 01 and 80.
(b) Explain how you could use these random numbers to select a sample of 10 students from 80 students.

2. A large civil engineering firm issues all new employees with a safety helmet. Five different sizes are available, numbered 1 to 5. A random sample of 90 employees required the following sizes:
2 2 4 2 2 3 4 3 4 3 3 3 2 1 3 2 4 3 2 5 4 3 2 2 2 4 4 3 3 3 5 3 5 5 4 4 4 2 3 4 5 2 5 3 3 2 2 3 4 3 3 3 3 2 4 3 2 4 3 4 4 3 4 2 2 3 2 3 4 4 4 3 4 4 2 2 3 2 3 3 2 2 2 2 4 2 3 3 2 3
Calculate an approximate 90% confidence interval for the proportion of employees requiring size 2.

3. A computer program is designed to take a random sample of size 36 from a normal population with mean μ and standard deviation σ = 5.1. The sample mean for one such sample was $\bar{X}$ = 26.3. Calculate a 95% confidence interval for μ based on this sample.

4. Prior to an election in the state of Texas, USA, the Opinion Research Centre wishes to take a random sample of voters so large that the probability is at most 0.02 that they will find the proportion supporting the Democratic candidate to be less than 0.5 when it is actually 0.55. Assuming that a continuity correction is unnecessary, calculate the size of the sample needed.

WEEKS 6 and 7: Tests
"Doubt is the root of progress"

The Problem: Why testing? First, there should be a claim about the value of a parameter (μ, σ, π, λ, and so on) or about a model, and then the evidence from the observed data should be left to decide about the claim!
A test is a statistical procedure which decides, with some confidence, whether a statement (hypothesis) about a parameter value, or about the whole model, is valid!

The characteristics or "ingredients" of a test
The supposed claims are formulated as follows:
Null Hypothesis $H_0$: This statement should specify completely the value of the parameter of interest, e.g. μ = 50, vs:
Alternative Hypothesis $H_1$: This is a complementary statement to $H_0$, in some direction, e.g. μ > 50 (upper tail) or μ < 50 (lower tail) for a one tailed test, or μ ≠ 50 for a two tailed test.
Test Statistic: e.g. $Z = \frac{(\bar{X} - \mu)\sqrt{n}}{\sigma} \sim N(0, 1)$. This is (or used to be called) a statistic from the sample. In fact it is rather a Pivotal Quantity, since by definition a statistic is a function of the sample only! Essentially it serves as a criterion for the decision: accept or reject $H_0$. Naturally we could also use as a test statistic
$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_i \sim N(\mu, \sigma^{2}/n)$.
Significance level: The observed significance level, often called the p-value, is Pr(we observe what we did, or something even more extreme in the direction of $H_1$, under $H_0$). The level α with which it is compared is often predetermined at 5% or at 1%, but it can take any value!
Critical value; Critical region: The value, or rather the set of values, of the test statistic Z or $\bar{X}$ which lead to rejection of $H_0$.
The Decision Rule: This is a rule based on a pre-specified value $T^{*}$ of the test statistic, and is of the form:
Reject $H_0$ if $T_{obs} > T^{*}$ (upper tail) or $T_{obs} < T^{*}$ (lower tail)
Accept $H_0$ otherwise
Relation with Confidence Intervals: For a two tailed test, if the assumed value of μ lies inside the C.I., then we accept $H_0$; if, however, it is outside the interval, then we reject the null hypothesis in favour of the alternative.

Example: For the purpose of placing the order for the purchase of shoes for the incoming soldiers, it has been suggested that the mean length of their foot is 26 cm. A statistician has been called to decide on the suspicion that, during the last ten years, this length has increased significantly! So the statistician decided to take a random sample of n = 50 measurements on this year's soldiers, and the results were $\bar{X}$ = 26.3 and $s^{2}$ = 1.44.
The test was formulated as $H_0$: μ = 26 vs $H_1$: μ > 26 (one tailed test).
The test statistic chosen is $Z = \frac{(\bar{X} - \mu)\sqrt{n}}{\sigma} \sim N(0, 1)$.
If we prespecify the significance level (S.L.) at α = 5%, then the critical value is Z = 1.645. The observed value of the test statistic (taking σ ≈ s, since this sample size is considered large) turns out to be $Z_{obs}$ = 1.77 > 1.645, which is significant, so we reject $H_0$ in favour of $H_1$.
On the other hand, the p-value is the (observed) S.L. = Pr(Z > 1.77) ≈ 0.04.
In the light of the above test, the suspicion of an increase in soldiers' foot length seems to be justified, and the order for new boots has to be adjusted accordingly with respect to the sizes!

PRINCIPLE: The rationale behind a statistical test is the following:
"If a small value of the observed significance level is obtained, or equivalently an extreme value of the test statistic T is realized, then we are faced with two options:
(a) Either $H_0$ is true and we just observed, by chance alone, a quite rare event, or
(b) $H_0$ is false, and that is the reason we have observed the realized outcome!"
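The arithmetic of the foot-length example can be reproduced with Python's standard library. This is a sketch of the calculation only, using the figures quoted in the notes and, as the notes do, the large-sample approximation σ ≈ s; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Figures quoted in the example.
mu0 = 26.0      # hypothesised mean under H0 (cm)
n = 50          # sample size
xbar = 26.3     # sample mean
s2 = 1.44       # sample variance
alpha = 0.05    # prespecified significance level

# Observed test statistic Z = (xbar - mu0) * sqrt(n) / sigma, with sigma ~ s.
z_obs = (xbar - mu0) * sqrt(n) / sqrt(s2)

# Upper-tail critical value and p-value from the standard normal distribution.
z_crit = NormalDist().inv_cdf(1 - alpha)
p_value = 1 - NormalDist().cdf(z_obs)

reject_h0 = z_obs > z_crit
print(round(z_obs, 2), round(z_crit, 3), round(p_value, 3), reject_h0)
```

Both routes lead to the same decision: the observed statistic exceeds the critical value, and equivalently the p-value falls below α.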
Test statistics:
For μ: $\frac{(\bar{X} - \mu)\sqrt{n}}{\sigma} \sim N(0, 1)$ or $\frac{(\bar{X} - \mu)\sqrt{n}}{s} \sim t_{(n-1)}$
For π: $Z = \frac{(p - \pi)\sqrt{n}}{\sqrt{\pi(1 - \pi)}} \sim N(0, 1)$ (approximately)
For $\sigma^{2}$: $\frac{(n-1)s^{2}}{\sigma^{2}} \sim \chi^{2}_{(n-1)}$

Two sample problem:
(a) Unpaired test
Tests for two samples (sizes n, m): $H_0: \mu_x = \mu_y$
$Z = \frac{(\bar{Y} - \bar{X}) - (\mu_y - \mu_x)}{\sqrt{\frac{\sigma_x^{2}}{n} + \frac{\sigma_y^{2}}{m}}} \sim N(0, 1)$ (known variances), or
$T = \frac{(\bar{Y} - \bar{X}) - (\mu_y - \mu_x)}{s\sqrt{\frac{1}{n} + \frac{1}{m}}} \sim t_{(n+m-2)}$,
where $s^{2} = \frac{(n-1)s_x^{2} + (m-1)s_y^{2}}{n+m-2}$ is the pooled estimate of the common, but unknown, population variance $\sigma^{2}$.

(b) Paired t test: When natural pairing exists between the observations in the two samples, the most efficient test is the paired one, and the set-up is the following:
$H_0: \mu_x = \mu_y$, or $\delta = \mu_x - \mu_y = 0$
Paired sample: $X_1, X_2, \dots, X_n$ and $Y_1, Y_2, \dots, Y_n$
Assumption: the differences $D_i = X_i - Y_i \sim N(\delta, \sigma^{2})$, independent
$\bar{D} = \frac{1}{n} \sum_{i=1}^{n} D_i$ and $s_d^{2} = \frac{1}{n-1} \sum_{i=1}^{n} (D_i - \bar{D})^{2}$
Test statistic: $T = \frac{(\bar{D} - \delta)\sqrt{n}}{s_d} \sim t_{(n-1)}$

Applications – revision
1. An experimental kit, used to illustrate basic statistical ideas to students, contains a fair six sided die and a biased six sided die, as well as other equipment. The probability of a 6 on any toss of the biased die is 0.25. One of these dice is tossed 120 times, and 29 of these tosses result in a 6. Assuming that these results are independent, test the null hypothesis that the die is the fair one against the alternative that it is the biased one.
Subsequently, it was established that the die tossed is definitely the fair one. Comment upon your test result in the light of this information.

2. The weight, X grams, of soup put in a tin by a machine is normally distributed with a mean of 160 g and a standard deviation of 5 g. A tin is selected at random.
(a) Find the probability that this tin contains more than 168 g.
The weight stated on the tin is changed to w grams.
(b) Find w such that P(X < w) = 0.01.

3. A firm is to buy a fleet of cars for its salesmen and wishes to choose between two alternative models, A and B.
It places an advertisement in a local paper offering four gallons of petrol to anyone who has bought a new car of either model in the last year. The offer is conditional on being willing to answer a questionnaire and to note how far the car goes, under typical driving conditions, on the free petrol supplied. The following data were obtained:

Miles driven on four gallons of petrol
Model A: 117 136 108 147
Model B: 98 124 96 117 115 126 109 91 108

(a) Test, at the 5% significance level, whether there is any difference between the mean petrol consumption of the two models.
(b) How can we improve the experimental design?

4. Weather records for Limassol lead the local weatherman to suggest that the high temperature for November 15 is a normal random variable with mean 22 °C and standard deviation 5 °C. Find the probability that on the next November 15 the high temperature will:
(a) be less than 21 °C
(b) be more than 26 °C
(c) lie between 20 and 25 °C.
What temperature would be exceeded with 90% probability?

5. Roastie's Coffee is sold in packets with a stated weight of 250 g. A supermarket manager claims that the mean weight of the packets is less than the stated weight. She weighs a random sample of 90 packets from their stock and finds that their weights have a mean of 248 g and a standard deviation of 5.4 g.
(a) Using a 5% level of significance, test whether or not the manager's claim is justified.
(b) Find the 98% confidence interval for the mean weight of a packet of coffee in the supermarket's stock.

6. Manuel is planning to buy a new machine to squeeze oranges in his cafe and he has two models, at the same price, on trial. The manufacturers of machine B claim that their machine produces more juice from an orange than machine A. To test this claim Manuel takes a random sample of 8 oranges, cuts them in half and puts one half in machine A and the other half in machine B. The amount of juice, in ml, produced by each machine is given in the table below.

Orange       1    2    3    4    5    6    7    8
Machine A    60   58   55   53   52   51   54   56
Machine B    61   60   58   52   55   50   52   58

Test, at the 10% level of significance, the manufacturer's claim.

7. A random sample of 10 tomato plants had the following heights, in mm, after 4 days' growth:
5.0, 4.5, 4.8, 5.2, 4.3, 5.1, 5.2, 4.9, 5.1, 5.0
Those grown previously had a mean height of μ = 5.1 mm. Using a 5% significance level, test whether or not the mean height of these plants is less than that of those grown previously. (Assume that the heights above are normally distributed.)

WEEK 10: Contingency Table
Test for Association or Independence
This is one of the most important and popular tests in statistical applications, since it detects possible associations between factors!
The values in the cells are frequencies. If the values are percentages, then, before proceeding to the test, we have to transform them to frequencies.
$H_0$: No association between the two factors, i.e. the two classifications A and B are independent (m rows and n columns).
The basic idea is that the estimated probability for row i is $p_i = \frac{r_i}{N}$ and the corresponding probability for column j is $p_j = \frac{c_j}{N}$. So the expected frequency (under $H_0$) in each cell (i-th row, j-th column) is
$e_{ij} = NP_{ij} = Np_i p_j = \frac{r_i c_j}{N}$ = (row total) × (column total) / (grand total).
For the test to be "good" and reliable (according to Pearson), each expected frequency should be at least 5; otherwise we have to combine adjacent or similar columns or rows, so that all expected frequencies are at least 5.
Test statistic: $D = \sum_{i,j} \frac{(o_{ij} - e_{ij})^{2}}{e_{ij}} \sim \chi^{2}_{(m-1)(n-1)}$

Exercises
1. A survey in a college was commissioned to investigate whether or not there was any association between gender and passing a driving test. A group of 50 male and 50 female students were asked whether they passed or failed their driving test at the first attempt. All students asked had taken the test.
The results were as follows:

          Pass  Fail
Male       23    27
Female     32    18

Stating your hypotheses clearly, test, at the 10% level of significance, whether there is any evidence of an association between gender and passing a driving test at the first attempt.

2. Research was carried out to investigate a possible connection between weekly alcohol consumption and development of Type 2 diabetes. The results are summarised in the table.

                                     Type 2 diabetes developed
Level of alcohol consumption (g)        Yes      No
Less than 5                              38     382
Between 5 and 30                         12     653
More than 30                             35     380

(a) Test, at the 1% level of significance, whether the development of Type 2 diabetes is independent of the average level of weekly alcohol consumption. Assume that the sample was random.
(b) A medical reviewer for a newspaper read the report and then stated that people should increase their weekly alcohol consumption in order to decrease their chance of developing Type 2 diabetes. Make two comments on his statement, referring to both the study and the sources of association, if any, identified when carrying out the test in part (a).

WEEK 11: Correlation - Regression (Simple & Multiple)

The Problem: Given a set of bivariate observations, we attempt to search for and estimate the “best” algebraic relationship between X and Y.

- We are interested in measuring linear association between the response (dependent) variable Y and the explanatory (independent) variable X.
- A first indication of the existence of this linear association is revealed if we draw the scatter diagram of Y against X.

[Figure: scatter diagrams showing linear association and other possible shapes]

Correlation: meaning linear association between X and Y; cause and effect.

A numerical measure of the strength of this linear association between X and Y is the sample product moment correlation coefficient (pmcc)

$r = \frac{s_{xy}}{s_x s_y}$, where $s_{xy} = \frac{1}{n-1}\left(\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}\right)$ is the sample covariance and $s_x$, $s_y$ are the sample standard deviations of X and Y.
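As an illustration of the definition just given, the pmcc can be computed in a few lines of Python. This is only a sketch: the function names are illustrative assumptions, and the data are the fifteen (X, Y) points used in the linear regression example of these notes, for which the notes quote r = 0.991. The second function uses the equivalent sums-of-squares form SXY / sqrt(SXX · SYY).

```python
import math


def pmcc_from_covariance(x, y):
    """r = s_xy / (s_x * s_y), using the (n - 1) divisor throughout."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = (sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar) / (n - 1)
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return s_xy / (s_x * s_y)


def pmcc_from_sums(x, y):
    """r = SXY / sqrt(SXX * SYY); the (n - 1) factors cancel."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# The fifteen points from the regression example in these notes.
X = [1, 3, 4, 6, 6, 7, 8, 11, 12, 14, 14, 17, 18, 20, 20]
Y = [49, 51, 52, 52, 53, 53, 54, 56, 56, 57, 58, 59, 59, 60, 61]

r1 = pmcc_from_covariance(X, Y)
r2 = pmcc_from_sums(X, Y)
print(round(r1, 3), round(r2, 3))
```

Both routes give the same value, which rounds to 0.991.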
Note that always −1 ≤ r ≤ 1, with r = ±1 for perfect correlation, positive or negative! (See the shapes above.)

Another way to calculate the pmcc (it can be obtained easily from an advanced calculator!):

$r = \frac{SXY}{\sqrt{SXX \cdot SYY}}$, where

$SXY = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y} = (n-1)\,s_{xy}$,

$SXX = \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 = (n-1)\,s_x^2$,

$SYY = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2 = (n-1)\,s_y^2$.

Note that:
(i) The numerical presence of significant correlation does not necessarily imply a genuine association between the two variables, unless of course some natural explanation exists! Spurious (unexplained, nonsense) correlation sometimes occurs!
(ii) On the other hand, a value of the pmcc close to zero implies no linear association, but a strong non-linear association may still be present! (Examples are shown above on the scatter diagrams.)

Linear Regression

“It is our opinion of a situation at one stage, but this must change, if we find, at a later stage, that the facts are against it!”

Glancing at the scatter diagram, we may suspect a rather strong linear dependence of the response variable Y on the explanatory variable X (a pmcc close to r = ±1 reveals that!). In this case, we use regression methods to establish the “suspected” linear relationship. For example, for the set of points given below, we draw the scatter diagram:

(X, Y) points: (1, 49), (3, 51), (4, 52), (6, 52), (6, 53), (7, 53), (8, 54), (11, 56), (12, 56), (14, 57), (14, 58), (17, 59), (18, 59), (20, 60), (20, 61)

r = 0.991

The line of “best” fit is obtained through the statistical regression model: $Y_i = \alpha + \beta X_i + \varepsilon_i$, where $\varepsilon_i$ is the unobservable error.
The vertical random errors satisfy the conditions: (i) $\varepsilon_i \sim N(0, \sigma^2)$, and (ii) they are independent (i = 1, 2, …, n). Here α is the Y-intercept and β is the slope (gradient) of the straight line.

Variables:
Explanatory, Independent: X
Response, Dependent: Y

Principle of regression: Applying the method of least squares, we obtain the estimates of the parameters α and β, which turn out to be the statistics a and b, by minimizing the sum of squared vertical errors with respect to α and β:

$SSE = \sum_{i=1}^{n} (Y_i - \alpha - \beta X_i)^2 = \sum_{i=1}^{n} \varepsilon_i^2$

Solving the resulting two regression equations (see appendix A4), the “best” estimates of the unknown parameters α and β turn out to be:

The slope: $b = SXY / SXX$, where

$SXY = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}$, $SXX = \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2$

The Y-intercept: $a = \bar{Y} - b\bar{X}$

The variance of the model: $\hat{\sigma}^2 = s^2 = \frac{SSE}{n-2}$

Natural interpretation of the parameters of the regression model:
a: the estimated value of the response variable Y when the explanatory variable X = 0.
b: the estimated change of the response variable Y for a unit increase of the explanatory variable X.

Note that predictions may be obtained from the fitted line. However, this is quite risky, in particular outside the range of observations (extrapolation), as in any other hard science!

How good is the model? The goodness of the model is measured by the value of the coefficient of determination:

$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$,

where SST = SYY is the total sum of squares and SSR = SST − SSE is the sum of squares explained by the regression.

Application

An agricultural researcher collected the following data showing the annual yield of wheat (in bushels per acre) and the annual rainfall (in inches).

Rainfall (x)         9.7   19.0    8.2   11.1    6.9   13.6   13.0   15.0
Yield of wheat (y)  28.0   35.1   23.8   25.6   20.1   30.2   28.5   33.7

(i) Plot the data on a scatter diagram.
(ii) Evaluate the product moment correlation coefficient between x and y.
(iii) Give an interpretation of the value obtained.
(iv) Find the regression of y on x in the form y = a + bx.
(v) Give an interpretation of the value of b.
(vi)
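Can we predict the yield for rainfall at the level of: (a) 16 inches? (b) 30 inches?

Multiple Regression

For a multiple regression situation, we entertain the matrix model

$y = X\beta + \varepsilon$, with $\varepsilon \sim MVN(0, \sigma^2 I_n)$.

In matrix form,

$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1k} \\ x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$

with the best solution being

$b = (X^T X)^{-1} X^T y$, with $E(b) = \beta$ and $Var(b) = (X^T X)^{-1} \sigma^2$.

The predicted value for a given explanatory vector $x_i$ is

$\hat{y}_i = x_i^T b = x_i^T (X^T X)^{-1} X^T y$, with $Var(\hat{y}_i) = x_i^T (X^T X)^{-1} x_i \, \sigma^2$,

and the estimate of $\sigma^2$ is

$s^2 = \frac{(y - \hat{y})^T (y - \hat{y})}{n - k}$.

Matrix approach to the simple model

The simple model may be formulated in a multiple regression context as follows: $y = X\beta + \varepsilon$, where

$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$, $X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}$, the parameter vector is $(\alpha, \beta)^T$, and $\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$.

In this context, it turns out that

$X^T X = \begin{pmatrix} n & n\bar{x} \\ n\bar{x} & \sum_{i=1}^{n} x_i^2 \end{pmatrix}$, and $(X^T X)^{-1} = \frac{1}{SXX} \begin{pmatrix} \frac{1}{n}\sum_{i=1}^{n} x_i^2 & -\bar{x} \\ -\bar{x} & 1 \end{pmatrix}$.

WEEK 12: Time Series and Forecasting

“Standing on the past, we live the present, hoping for the future”

The Problem: A set of measurements taken at consecutive time points constitutes a time series. We consider a number of ways to analyze the series, estimate the parameters, and forecast future values.

15.1. The Model:

$Y_t$ = Trend + Seasonal variation (main components) + Short-term (non-random) variation + Random variation

For practical reasons we present some of the popular methods with a particularly simple example: the data below show the sales (Y) of sandwiches at three shifts (Morning, Day, Evening) for three consecutive days at a particular store.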
Labelling the time as T, we have the following table:

Day     Mon  Mon  Mon  Tue  Tue  Tue  Wed  Wed  Wed
Shift    M    D    E    M    D    E    M    D    E
T        1    2    3    4    5    6    7    8    9
Y       26   53   50   34   64   60   42   73   71

The first thing to do is to plot the values to discover the basic characteristics of the series. From the plot we observe:
(i) Upward trend
(ii) Seasonal variation with period k = 3
(iii) Random variation

TECHNIQUES

(i) Regression with dummies:

$Y_t = a + bt + \sum_j c_j X_j + \varepsilon_t$,

where the X’s are dummy variables, taking the values 1 or 0 depending on whether we are in the second period, the third, and so on, thus reflecting any existing seasonal component!

Application for the first model, regression with dummies:

$Y_t = a + bt + c_2 X_2 + c_3 X_3 + \varepsilon_t$

Time: t = 1, 2, …, 9
Response: y = [26 53 … 71]
Matrix of explanatory variables and parameters:

$X^T = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\ 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 \end{pmatrix}$, $b = (X^T X)^{-1} X^T y$

Results (C is the correlation matrix for Y, t and the two X’s):

$C = \begin{pmatrix} 1 & .70 & .50 & .36 \\ .70 & 1 & 0 & .27 \\ .50 & 0 & 1 & -.50 \\ .36 & .27 & -.50 & 1 \end{pmatrix}$, $b = \begin{pmatrix} 21.33 \\ 3.17 \\ 26.17 \\ 20.00 \end{pmatrix}$, $Cov(b) = \begin{pmatrix} .99 & -.12 & -.41 & -.29 \\ -.12 & .03 & -.03 & -.06 \\ -.41 & -.03 & 1.07 & .58 \\ -.29 & -.06 & .58 & 1.16 \end{pmatrix}$

And finally forecasting:

$\hat{y} = \begin{pmatrix} \hat{y}_{10} \\ \hat{y}_{11} \\ \hat{y}_{12} \end{pmatrix} = F b = \begin{pmatrix} 1 & 10 & 0 & 0 \\ 1 & 11 & 1 & 0 \\ 1 & 12 & 0 & 1 \end{pmatrix} \begin{pmatrix} 21.333 \\ 3.167 \\ 26.167 \\ 20.000 \end{pmatrix} = \begin{pmatrix} 53.00 \\ 82.33 \\ 79.33 \end{pmatrix}$

(ii) The Moving Average Model

The second model often used is the Moving Average (M.A.) of appropriate order (usually the period of the process), to estimate the trend. If k (the period of seasonality) is even, then we need a further MA of order two to centre the estimates, so that these estimates correspond to the existing observed values. In our example we have the seven moving averages of order k = 3:

[43.00, 45.67, 49.33, 52.67, 55.33, 58.33, 62.00]
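The moving averages listed above can be recomputed with a short Python sketch. The function names `moving_average` and `centred_moving_average` are illustrative assumptions, not part of the notes; the second function shows one way to apply the extra order-2 averaging mentioned for an even period k.

```python
def moving_average(values, k):
    """Simple moving averages of order k, one per complete window."""
    if k < 1 or k > len(values):
        raise ValueError("k must be between 1 and len(values)")
    return [sum(values[i:i + k]) / k for i in range(len(values) - k + 1)]


def centred_moving_average(values, k):
    """Moving averages aligned with observed time points.

    For odd k the plain moving average is already centred on an
    observation. For even k, adjacent order-k averages are averaged
    again (an MA of order 2) to centre them.
    """
    ma = moving_average(values, k)
    if k % 2 == 1:
        return ma
    return moving_average(ma, 2)


# Sandwich sales for T = 1..9 (three shifts over three days).
sales = [26, 53, 50, 34, 64, 60, 42, 73, 71]

ma3 = moving_average(sales, 3)   # corresponds to T = 2..8
print([round(m, 2) for m in ma3])
```

Rounded to two decimals, the printed values agree with the seven moving averages quoted above. For quarterly data one would call `centred_moving_average(series, 4)` instead.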
A simple linear regression of the above smoothed values versus T = [2, 3, 4, 5, 6, 7, 8] gives the fitted line

$\hat{Y}_t = 36.56 + 3.15t$, with $R^2 = 99.8\%$.

Now the differences [Y − (a + bT)] for all points are calculated as

Shift        M       D      E       M       D      E       M       D      E
Difference  -13.71  10.13   3.98  -15.18   11.67   4.51  -16.64   11.20   6.05

To estimate the seasonal effects for the three shifts we calculate the averages of the corresponding differences, i.e.

$s_M = \frac{-13.71 - 15.18 - 16.64}{3} = -15.18$; $s_D = \frac{10.13 + 11.67 + 11.20}{3} = 11.00$; $s_E = \frac{3.98 + 4.51 + 6.05}{3} = 4.85$

Finally, the forecast for, let us say, T = 10, 11, 12 is trend + seasonal component:

$\hat{Y}_{10} = 36.56 + 3.15(10) - 15.18 = 68.11 - 15.18 = 52.93$; $\hat{Y}_{11} = 82.26$; $\hat{Y}_{12} = 79.26$

Box and Jenkins models: For stationary series (i.e. after removing the trend) we have a variety of models which cover identification, estimation and prediction, based on existing computer packages. These models proved to be very popular and efficient during the seventies, and they work reasonably well. One simple and characteristic model is the autoregressive model of order p:

$Y_t = \gamma + \sum_{j=1}^{p} \phi_j Y_{t-j} + \varepsilon_t$, where $\varepsilon_t \sim N(0, \sigma^2)$, independent.

The parameters γ and ϕ are estimated by the method of least squares!

Very simple linear regression models: Because the main components of most time series are the trend and the seasonal effect, it is possible to fit simple models which take into account these two main effects, for example:

$Y_t = \alpha + \beta Y_{t-k} + \varepsilon_t$, where $\varepsilon_t \sim N(0, \sigma^2)$, independent,

and k is the period of seasonality. α and β are the parameters to be estimated from the data. In our example we take k = 3 (k = 4 for quarterly data or k = 12 for monthly data)!
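The lag-k model just described amounts to a simple linear regression of each observation on the observation k periods earlier. The following Python sketch shows this for the sandwich data with k = 3. It assumes ordinary least squares through the slope formula b = SXY/SXX from the regression section; the helper names `fit_line` and `lagged_pairs` are illustrative, not from the notes.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def lagged_pairs(series, k):
    """Return (Y_{t-k} values, Y_t values); the first k points have no partner."""
    return series[:-k], series[k:]


sales = [26, 53, 50, 34, 64, 60, 42, 73, 71]   # sandwich sales, T = 1..9
k = 3                                           # period of seasonality

lagged, current = lagged_pairs(sales, k)        # explanatory and response values
a, b = fit_line(lagged, current)
print(round(a, 2), round(b, 2))

# Forecast for T = 10 uses the observation k periods earlier (T = 7).
forecast_10 = a + b * sales[10 - k - 1]         # list index is T - 1
print(round(forecast_10, 2))
```

The coefficients round to 6.37 and 1.07. Because the code keeps the unrounded coefficients, its forecasts can differ in the decimals from values worked out by hand with the rounded ones.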
So we regress $Y_t$ = [34, 64, 60, 42, 73, 71] on $Y_{t-3}$ = [26, 53, 50, 34, 64, 60], to obtain the fitted line:

$\hat{Y}_t = 6.37 + 1.07\,Y_{t-3}$, with $R^2 = 99.6\%$.

Hence the corresponding predicted values, for comparison, are

$\hat{Y}_{10} = 6.37 + 1.07\,Y_7 = 6.37 + 1.07(42) = 51.31$; $\hat{Y}_{11} = 84.48$; $\hat{Y}_{12} = 82.34$

Note that some differences in the decimals arise because the calculations were carried out directly on a calculator at full accuracy!

Examples – Exercises

1. Oil usage on a small farm is given in the table. Analyse the series fully and predict the usage for the first and second quarters of 2001.

             Year
Quarter   1997  1998  1999  2000
1          125   137   117   162
2           96   113   118   155
3           72    88    94   162
4          119   131   142   176

2. “HOPE” is a travel company which organizes package holidays that are sold through a number of travel agents. It decides to offer the travel agents a bonus if they can increase the number of holidays sold by 10% or more. The number of “HOPE” holidays sold by Ajay, a travel agent, is shown in the table.

Year   January–April   May–August   September–December
2007        145             98              121
2008        123             85              101
2009        118             76               74

(a) Calculate values of a suitable moving average.
(b) Plot the moving averages on the graph on page 6 and draw a trend line.
(c) (i) Estimate the seasonal effect for January–April, and hence forecast the number of holidays Ajay will sell during January–April 2010 if current trends continue.
(ii) Hence calculate how many holidays Ajay needs to sell during January–April 2010 to exceed current trends by at least 10%.
(d) Ajay argues that, if he sells 82 or more holidays during January–April 2010, he will have exceeded the September–December 2009 sales by more than 10% and so should qualify for a bonus. The company argues that, in order to qualify for a bonus, he will need to sell 130 or more holidays, as he sold 118 during January–April 2009. Suggest a suitable value for the number of holidays Ajay will need to sell during January–April 2010 in order to qualify for a bonus.
Explain why your value is fairer than either of the values suggested by Ajay and “HOPE”.

Revision for the Final Exam