“Teach A Level Maths” Vol. 2: A2 Core Modules
Least Squares Regression: y on x
© Christine Crisp
Statistics 1 (AQA, EDEXCEL, OCR)

"Certain images and/or photos on this presentation are the copyrighted property of JupiterImages and are being used with permission under license. These images and/or photos may not be copied or downloaded without permission from JupiterImages"

Least Squares Regression

We often want to know whether there is a relationship between one variable and another.
e.g. Does the number of driving accidents increase with the age of the driver?
e.g. Can we predict a student’s mark in a French exam if we know it in an English exam?
e.g. Is the weight of a baby at birth related to the height of the father?

You met sets of data like these at GCSE and you’ve drawn scatter diagrams and also drawn a line of best fit “by eye”. This line is called the regression line. In this presentation we will see how to calculate a regression line.

The data I’m going to use is a random sample from the Census at School database. I’ve chosen a random sample from the data for height and foot size of 99 children from the UK and I’ve used the Autograph software package to plot the data and do the calculations.

There is a demo showing you how to do this in “Autograph Resources”. To see the demo you need to select: 2D graphing; advanced; plotting statistics in 2D; Using Autograph to display statistical diagrams from Census at School data. The demo then starts automatically.

This is a scatter diagram of the data.

[Scatter diagram: “Foot length and height of UK children”; x-axis: Height (cm); y-axis: Foot length (cm)]

We will find the equation of the line that could be used to predict the foot length of a child whose height is known.

e.g. The length marked on the diagram (the vertical distance from a point to the line) is squared and added to the other squares. Points below the line result in negative “lengths”, so would cancel out those above if we didn’t square.

The line is positioned so that the sum of the squares of the distances of all the points from the line is as small as possible. This makes the line run through the middle of the points.

This line is called the least squares regression line of y on x. To find the equation of the regression line we need the values of the gradient and the intercept on the y-axis.

Calculating the gradient and intercept of the regression line

You don’t need to know how to derive the formulae for the gradient and intercept (although if you have already taken AS you may be interested to see how Calculus is used to get these formulae).

You will need to be able to do the following:
• Use calculator functions working with raw data.
• Use the formulae in your formula book if you are given summary data.

We’ll start with the calculator.

Calculators vary in the way they do things so you need to make sure that you can use your own calculator efficiently.
We’ll start with a very simple set of data so you can do the calculations as we go along.

Draw the following data on a scatter diagram (if you haven’t got squared paper just do a sketch).

x   1   2   3
y   5   3   1

Draw the regression line by eye.

What do you notice about its gradient?
ANS: It’s negative.

Estimate the values of the gradient and y-intercept.
ANS: The gradient is -2 and the intercept is 7.

Now enter the x and y data into your calculator. Select the regression option and you will find the two values -2 (the gradient) and 7 (the intercept).

It’s important to remember which letter is used for the gradient and which for the intercept, so you might want to make a note of them now.

The equation of any straight line is given by $y = mx + c$, but for the regression line it is usual to write
$y = a + bx$
where b is the gradient and a is the intercept on the y-axis. (This equation is in your formulae booklet.)

So, our regression line with gradient -2 and intercept 7 is
$y = 7 - 2x$

The gradient of the line is called the regression coefficient.

SUMMARY

Suppose we have a set of values of 2 variables, x and y.
To estimate a value of y for a given value of x, we need the least squares regression line of y on x.

The regression line always passes through the point $(\bar{x}, \bar{y})$, where $\bar{x}$ and $\bar{y}$ are the means of the x- and y-values respectively.

The equation of the line is of the form $y = a + bx$, where b is the gradient and a is the intercept on the y-axis.

To find the values of the gradient and intercept on my calculator I . . . (note down here what you need to do)

• The gradient is given by b and called the regression coefficient.
• The intercept is given by a.

e.g. For the height and foot length data,

[Scatter diagram with regression line: “Foot length and height of UK children”; x-axis: Height (cm); y-axis: Foot length (cm)]

the equation of the regression line shown is
$y = 1.98 + 0.14x$

To estimate the foot length of a child whose height is 130 cm, we substitute x = 130 in the equation:
$y = 1.98 + 0.14(130)$
$y = 20.2$ (3 s.f.)

Using a Regression Line

We need to watch out for the following when using regression lines:
• Although we can always find a regression line, it will have no meaning if the points are scattered widely from the line.
• There may be a relationship between 2 variables that is non-linear, so the regression line is inappropriate.
• The fact that we can find a regression line does not mean that a change in one variable causes a change in the other.

Exercise

1. Find the equation of the least squares regression line of y on x, for the following sets of data:

(a)
x   1   3   4   6   8   9   11  14
y   1   2   4   4   5   7   8   9

(b)
x   20  28  22  15  18  25  19  16  17  23
y   15  3   18  13  17  5   10  18  14  8

(Give the gradient and intercept to 2 d.p.)

2. Using the answer to 1(b), estimate the values of y for x = 12 and x = 21, giving your answers to 1 d.p. Are these values reliable? If not, why not?

Solutions:

1(a) $y = 0.55 + 0.64x$
(b) $y = 31.65 - 0.96x$

2.
Substituting x = 12 in $y = 31.65 - 0.96x$: $y = 31.65 - 0.96(12) = 20.1$
Substituting x = 21 in $y = 31.65 - 0.96x$: $y = 31.65 - 0.96(21) = 11.5$

The 1st answer is not reliable since 12 lies outside the range of x-values used to calculate the regression line. The 2nd gives a reasonable estimate.

Taking Exams

The problem with using a calculator to find the regression line and then directly writing down the answer is that one small error in entering the data could lose you several marks in an exam. To avoid this problem, always check the data carefully after entering it.

If you are given summary data instead of raw data, you will need to use the formulae, as it isn’t then possible to use the calculator regression function.

The formulae are in your formulae booklet but we’ll now see what the terms in the formulae mean.

Formulae for the regression line

I’ll use a simple data set again to illustrate the method (note that the middle y-value is now 4):

x   1   2   3
y   5   4   1

The gradient of the regression line for y on x is given by
$b = \frac{S_{xy}}{S_{xx}}$

$S_{xy}$ is closely related to the covariance (strictly, the covariance is $S_{xy}$ divided by n) and
$S_{xy} = \sum (x - \bar{x})(y - \bar{y}) = \sum xy - \frac{\sum x \sum y}{n}$

The formulae booklets give both these forms but the 1st form is usually inefficient. Can you see why?
ANS: In the 1st form we have to subtract the means from each observation and then multiply, instead of multiplying and subtracting once.

For our data:
$\sum xy = (1)(5) + (2)(4) + (3)(1) = 16$, $\sum x = 6$, $\sum y = 10$
$S_{xy} = 16 - \frac{(6)(10)}{3} = -4$
The gradient of the regression line for y on x is given by
$b = \frac{S_{xy}}{S_{xx}}$, where $S_{xy} = -4$ and
$S_{xx} = \sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}$

As before, we use the 2nd form:
$\sum x = 6$, so $(\sum x)^2 = 36$, and $\sum x^2 = 1 + 4 + 9 = 14$
$S_{xx} = 14 - \frac{36}{3} = 2$

So, with $S_{xy} = -4$ and $S_{xx} = 2$,
$b = \frac{S_{xy}}{S_{xx}} = \frac{-4}{2} = -2$

The equation of the line is $y = a + bx$ with $b = -2$.

We now use the fact that the regression line passes through the point $(\bar{x}, \bar{y})$, so these coordinates satisfy the equation:
$y = a + bx \Rightarrow \bar{y} = a + b\bar{x}$
So, $a = \bar{y} - b\bar{x}$
where $\bar{x} = \frac{6}{3} = 2$ and $\bar{y} = \frac{10}{3} = 3.3333$
$a = 3.3333 - (-2)(2) = 7.3333$
So, $y = 7.333 - 2x$

Now enter the data into your calculator and use the regression function to check the result.

Using Summary Data

• The equation of the regression line of y on x is $y = a + bx$
• The gradient of the line is called the regression coefficient and is given by
$b = \frac{S_{xy}}{S_{xx}}$, where $S_{xy} = \sum xy - \frac{\sum x \sum y}{n}$ and $S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$
(The 2nd formula given in your formulae booklet for b is not in the most convenient form. It’s best to work out $S_{xy}$ and $S_{xx}$ then divide them as above.)
• $(\bar{x}, \bar{y})$ satisfies the equation so,
$y = a + bx \Rightarrow \bar{y} = a + b\bar{x} \Rightarrow a = \bar{y} - b\bar{x}$

e.g.1 The following results are given for 10 pairs of observations relating 2 variables x and y:
$\sum x = 29$, $\sum y = 42$, $\sum x^2 = 397$, $\sum y^2 = 6728$, $\sum xy = 792$
Find the regression coefficient of y on x and the equation of the regression line of y on x.

Solution: The regression coefficient is b, the gradient of the regression line of y on x.
$b = \frac{S_{xy}}{S_{xx}}$
$S_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 792 - \frac{(29)(42)}{10} = 670.2$
$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 397 - \frac{29^2}{10} = 312.9$
$b = \frac{670.2}{312.9} = 2.1419$

$a = \bar{y} - b\bar{x}$, where $\bar{y} = \frac{\sum y}{n} = 4.2$ and $\bar{x} = \frac{\sum x}{n} = 2.9$
$a = 4.2 - (2.1419)(2.9) = -2.01$

The equation of the regression line of y on x is
$y = -2.01 + 2.14x$

Exercise

Find the regression coefficient of y on x and the equation of the regression line of y on x for each of the following sets of data:

1. $\sum x = 361$, $\sum y = 379$, $\sum x^2 = 10421$, $\sum y^2 = 11667$, $\sum xy = 10933$, $n = 13$
2. $\sum x = 42.2$, $\sum y = 46.4$, $\sum x^2 = 291.2$, $\sum y^2 = 290.52$, $\sum xy = 230.42$, $n = 8$

Solutions

1. $\sum x = 361$, $\sum y = 379$, $\sum x^2 = 10421$, $\sum y^2 = 11667$, $\sum xy = 10933$, $n = 13$

$S_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 10933 - \frac{(361)(379)}{13} = 408.46$
$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 10421 - \frac{361^2}{13} = 396.31$
$b = \frac{S_{xy}}{S_{xx}} = \frac{408.46}{396.31} = 1.03$ (regression coefficient of y on x)

$a = \bar{y} - b\bar{x}$, where $\bar{y} = \frac{\sum y}{n} = 29.15$ and $\bar{x} = \frac{\sum x}{n} = 27.769$
$a = 29.15 - (1.03)(27.77) = 0.53$
$y = 0.53 + 1.03x$

Your answers may be slightly different from mine as I stored each value as I calculated it and used the fully correct values rather than rounded ones when I did subsequent calculations. This is good practice but not essential at this stage.

2. $\sum x = 42.2$, $\sum y = 46.4$, $\sum x^2 = 291.2$, $\sum y^2 = 290.52$, $\sum xy = 230.42$, $n = 8$

Solution:
$S_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 230.42 - \frac{(42.2)(46.4)}{8} = -14.34$
$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 291.2 - \frac{42.2^2}{8} = 68.60$
$b = \frac{S_{xy}}{S_{xx}} = \frac{-14.34}{68.60} = -0.21$ (regression coefficient of y on x)

$a = \bar{y} - b\bar{x}$, where $\bar{y} = \frac{\sum y}{n} = 5.8$ and $\bar{x} = \frac{\sum x}{n} = 5.275$
$a = 5.8 - (-0.21)(5.28) = 6.90$
$y = 6.90 - 0.21x$

Explanatory and Response Variables

Suppose we have data showing that there is a strong linear relationship between the amount of fertilizer used on some plants and the yield from the plants.

The yield clearly depends on the amount of fertilizer, not the other way round. The yield is responding to the fertilizer.
In this example, the yield is called the response, or dependent, variable. The amount of fertilizer used is the explanatory, or independent, variable. It will have been controlled in the trial from which the data have been taken.

The following slides contain repeats of information on earlier slides, shown without colour, so that they can be printed and photocopied. For most purposes the slides can be printed as “Handouts” with up to 6 slides per sheet.

SUMMARY

Suppose we have a set of values of 2 variables, x and y.

To estimate a value of y for a given value of x, we need the least squares regression line of y on x.

The regression line always passes through the point $(\bar{x}, \bar{y})$, where $\bar{x}$ and $\bar{y}$ are the means of the x- and y-values respectively.

The equation of the line is of the form $y = a + bx$, where b is the gradient and a is the intercept on the y-axis.

To find the values of the gradient and intercept on my calculator I . . . (note down here what you need to do)

• The gradient is given by b and called the regression coefficient.
• The intercept is given by a.

e.g.

x   1   2   3
y   5   3   1

We can enter the x and y values into the calculator and get $a = 7$, $b = -2$.
The equation of the y on x regression line is $y = 7 - 2x$

Using summary data

• The equation of the regression line of y on x is $y = a + bx$
• The gradient of the line is called the regression coefficient and is given by
$b = \frac{S_{xy}}{S_{xx}}$, where $S_{xy} = \sum xy - \frac{\sum x \sum y}{n}$ and $S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$
(The 2nd formula given in your formula booklet for b is not in the most convenient form. It’s best to work out $S_{xy}$ and $S_{xx}$ then divide them as above.)
• $(\bar{x}, \bar{y})$ satisfies the equation so,
$y = a + bx \Rightarrow \bar{y} = a + b\bar{x} \Rightarrow a = \bar{y} - b\bar{x}$

e.g.1 The following results are given for 10 pairs of observations relating 2 variables x and y:
$\sum x = 29$, $\sum y = 42$, $\sum x^2 = 397$, $\sum y^2 = 6728$, $\sum xy = 792$
Find the regression coefficient of y on x and the equation of the regression line of y on x.

Solution: The regression coefficient is b, the gradient of the regression line of y on x.

$S_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 792 - \frac{(29)(42)}{10} = 670.2$
$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 397 - \frac{29^2}{10} = 312.9$
$b = \frac{S_{xy}}{S_{xx}} = \frac{670.2}{312.9} = 2.1419$

$a = \bar{y} - b\bar{x}$, where $\bar{y} = \frac{\sum y}{n} = 4.2$ and $\bar{x} = \frac{\sum x}{n} = 2.9$
$a = 4.2 - (2.1419)(2.9) = -2.01$

The equation of the regression line of y on x is
$y = -2.01 + 2.14x$

Explanatory and Response Variables

Suppose we have data showing that there is a strong linear relationship between the amount of fertilizer used on some plants and the yield from the plants.

The yield clearly depends on the amount of fertilizer, not the other way round. The yield is responding to the fertilizer.

In this example, the yield is called the response, or dependent, variable. The amount of fertilizer used is the explanatory, or independent, variable. It will have been controlled in the trial from which the data have been taken.
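
If you have access to a computer, the summary-data formulae for b and a can be checked numerically. What follows is only a minimal Python sketch and is not part of the original presentation; the function names (summary_sums, regression_from_summary, sum_of_squares) are made up for illustration. It works from the sums n, Σx, Σy, Σx² and Σxy, using the three-point data set x = 1, 2, 3 and y = 5, 4, 1 from the formulae slides, and then compares the sum of squared distances for the fitted line with that for another nearby line, y = 7 - 2x.

```python
def summary_sums(xs, ys):
    """Return n, sum of x, sum of y, sum of x^2 and sum of xy for paired raw data."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return n, sum_x, sum_y, sum_xx, sum_xy


def regression_from_summary(n, sum_x, sum_y, sum_xx, sum_xy):
    """Return (a, b) for the y-on-x regression line y = a + b*x."""
    s_xy = sum_xy - sum_x * sum_y / n   # S_xy = sum(xy) - sum(x)*sum(y)/n
    s_xx = sum_xx - sum_x ** 2 / n      # S_xx = sum(x^2) - (sum(x))^2/n
    if s_xx == 0:
        raise ValueError("all x-values are equal, so the gradient is undefined")
    b = s_xy / s_xx                     # gradient (regression coefficient)
    a = sum_y / n - b * sum_x / n       # a = y-bar - b * x-bar
    return a, b


def sum_of_squares(xs, ys, a, b):
    """Sum of squared vertical distances of the points from the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Three-point data set from the formulae slides.
xs, ys = [1, 2, 3], [5, 4, 1]
a, b = regression_from_summary(*summary_sums(xs, ys))
print(f"y = {a:.3f} + ({b:.3f})x")      # y = 7.333 + (-2.000)x

# The least squares line should give a smaller total than any other line.
ss_fit = sum_of_squares(xs, ys, a, b)
ss_other = sum_of_squares(xs, ys, 7, -2)
print(ss_fit < ss_other)                # True
```

The same function can be called directly with summary data, for example regression_from_summary(13, 361, 379, 10421, 10933) for question 1 of the summary-data exercise, which should agree with a = 0.53 and b = 1.03 to 2 d.p. In an exam you would still need to show the working by hand.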