35 Least Squares Regression

“Teach A Level Maths”
Vol. 2: A2 Core Modules
Least Squares Regression:
y on x
© Christine Crisp
Statistics 1
AQA
EDEXCEL
OCR
"Certain images and/or photos on this presentation are the copyrighted property of JupiterImages and are being used with
permission under license. These images and/or photos may not be copied or downloaded without permission from JupiterImages"
We often want to know whether there is a relationship
between one variable and another.
e.g. Does the number of driving accidents increase with
the age of the driver?
e.g. Can we predict a student’s mark in a French exam
if we know it in an English exam?
e.g. Is the weight of a baby at birth related to the
height of the father?
You met sets of data like these at GCSE and you’ve
drawn scatter diagrams and also drawn a line of best fit
“by eye”. This line is called the regression line.
In this presentation we will see how to calculate a
regression line.
The data I’m going to use is a random sample from the
Census at School database.
I’ve chosen a random sample from the data for height
and foot size of 99 children from the UK and I’ve used
the Autograph software package to plot the data and do
the calculations.
There is a demo showing you how to do this in “Autograph
Resources”.
To see the demo you need to select:
2D graphing; advanced; plotting statistics in 2D; Using
Autograph to display statistical diagrams from Census
at School data.
The demo then starts automatically.
This is a scatter diagram of the data.
[Figure: scatter diagram, "Foot length and height of UK children"; horizontal axis: Height (cm), vertical axis: Foot length (cm)]
We will find the equation of the line that could be used
to predict the foot length of a child whose height is
known.
[Figure: the same scatter diagram with a line drawn through the points and the length from one point to the line marked]
e.g. This length is squared and added to the other squares.
Points below the line result in negative “lengths”, so
would cancel out those above if we didn’t square.
The line is positioned so that the sum of the squares of
the distances of all the points from the line, measured
parallel to the y-axis, is as small as possible.
This makes the line run through the middle of the points.
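The idea of minimising the sum of squares can be checked numerically. The sketch below is only an illustration: it borrows the three-point data set from the formulae section later in this presentation, and the helper names `residuals` and `sum_of_squares` are assumed names, not anything from a calculator or formulae booklet. It compares a few candidate lines $y = a + bx$ and shows that the unsquared lengths can add up to zero even for a badly fitting line.

```python
# Three illustrative points (the small data set used later in the formulae section).
xs = [1, 2, 3]
ys = [5, 4, 1]


def residuals(a, b):
    """Signed lengths y - (a + b*x) from each point to the line y = a + bx."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def sum_of_squares(a, b):
    """Sum of the squared lengths for the line y = a + bx."""
    return sum(r ** 2 for r in residuals(a, b))


# Candidate lines as (intercept a, gradient b).
candidates = {
    "horizontal line through the mean of y": (10 / 3, 0),
    "y = 7 - 2x": (7, -2),
    "y = 22/3 - 2x (least squares line)": (22 / 3, -2),
}

for name, (a, b) in candidates.items():
    print(f"{name}: sum of lengths = {sum(residuals(a, b)):.3f}, "
          f"sum of squares = {sum_of_squares(a, b):.3f}")
```

The horizontal line has lengths that cancel out, yet its sum of squares is the largest of the three, which is why the lengths are squared before adding.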
This line is called the least squares regression line
of y on x.
To find the equation of the regression line we need the
values of the gradient and the intercept on the y-axis.
Calculating the gradient and intercept of the regression
line
You don’t need to know how to derive the formulae for
the gradient and intercept ( although if you have already
taken AS you may be interested to see how Calculus is
used to get these formulae ).
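For those who are interested, here is a brief sketch of that calculus argument. It is not needed for the exam. The name $S(a, b)$ for the sum of squares is an assumed label, and $S_{xy}$ and $S_{xx}$ are the quantities defined in the formulae section of this presentation.

```latex
S(a, b) = \sum (y - a - bx)^2

\frac{\partial S}{\partial a} = -2 \sum (y - a - bx) = 0
  \quad\Rightarrow\quad a = \bar{y} - b\bar{x}

\frac{\partial S}{\partial b} = -2 \sum x (y - a - bx) = 0
  \quad\Rightarrow\quad \sum xy - a \sum x - b \sum x^2 = 0

\text{Substituting } a = \bar{y} - b\bar{x}:
\quad \sum xy - \frac{\sum x \sum y}{n}
  = b \left( \sum x^2 - \frac{(\sum x)^2}{n} \right)
  \quad\Rightarrow\quad b = \frac{S_{xy}}{S_{xx}}
```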
You will need to be able to do the following:
• Use calculator functions working with raw data.
• Use the formulae in your formula book if you
are given summary data.
We’ll start with the calculator.
Calculators vary in the way they do things so you need to
make sure that you can use your own calculator
efficiently.
We’ll start with a very simple set of data so you can do
the calculations as we go along.
• Draw the following data on a scatter diagram ( if
you haven’t got squared paper just do a sketch ).
x   1   2   3
y   5   3   1
• Draw the regression line by eye.
[Figure: the three points plotted, with the line drawn through them]
What do you notice
about its gradient?
ANS: It’s negative.
Estimate the values of
the gradient and y-intercept.
The gradient is -2
and the intercept 7.
Now enter the x and y data into your calculator.
Select the regression option and you will find the two values
-2 ( the gradient ) and 7 ( the intercept ).
It’s important to remember which letter is used
for the gradient and which for the intercept,
so you might want to make a note of them now.
The equation of any straight line is given by $y = mx + c$
but for the regression line it is usual to write
$y = a + bx$
where b is the gradient and a is the intercept on the y-axis.
( This equation is in your formulae booklet. )
So, our regression line with gradient -2 and intercept 7 is
$y = 7 - 2x$
The gradient of the line is called the regression coefficient.
SUMMARY
Suppose we have a set of values of 2 variables, x and y.
• To estimate a value of y for a given value of x, we
need the least squares regression line of y on x.
• The regression line always passes through the point $(\bar{x}, \bar{y})$
where $\bar{x}$ and $\bar{y}$ are the means of the x- and y-values
respectively.
• The equation of the line is of the form
$y = a + bx$
where b is the gradient and a is the intercept on
the y-axis.
• To find the values of the gradient and intercept on my
calculator I . . . ( note down here what you need to do )
• The gradient is given by b and called
the regression coefficient.
• The intercept is given by a.
e.g. For the height and foot length data,
[Figure: scatter diagram, "Foot length and height of UK children", with the regression line drawn; horizontal axis: Height (cm), vertical axis: Foot length (cm)]
the equation of the regression line shown is
$y = 1.98 + 0.14x$
To estimate the foot length of a child whose height is 130
cm, we substitute x = 130 in the equation:
$y = 1.98 + 0.14(130)$
$y = 20.2$ ( 3 s.f. )
Using a Regression Line
We need to watch out for the following when using
regression lines:
• Although we can always find a regression line, it will
have no meaning if the points are scattered widely
from the line.
• There may be a relationship between 2 variables that
is non-linear so the regression line is inappropriate.
• The fact that we can find a regression line does not
mean that a change in one variable causes a change in
the other.
Exercise
1. Find the equation of the least squares regression line
of y on x, for the following sets of data:
(a)
x   1   3   4   6   8   9   11   14
y   1   2   4   4   5   7   8    9
(b)
x   20   28   22   15   18   25   19   16   17   23
y   15   3    18   13   17   5    10   18   14   8
( Give the gradient and intercept to 2 d.p. )
2. Using the answer to 1(b), estimate the values of y
for x = 12 and x = 21, giving your answers to 1 d.p.
Are these values reliable? If not, why not?
Solutions:
1(a) $y = 0.55 + 0.64x$
(b) $y = 31.65 - 0.96x$
2. Substituting x = 12 in $y = 31.65 - 0.96x$:
$y = 31.65 - 0.96(12)$
$y = 20.1$
Substituting x = 21 in $y = 31.65 - 0.96x$:
$y = 31.65 - 0.96(21)$
$y = 11.5$
The 1st answer is not reliable since 12 lies outside the
range of values used to calculate the regression line.
The 2nd gives a reasonable estimate.
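The reliability check in question 2 can be sketched in code. This is a minimal illustration, assuming the rounded line from the solution to 1(b); the function name `estimate` and the wording of the printed notes are assumptions made for the example.

```python
# x-values from question 1(b); they fix the range over which the line was fitted.
xs = [20, 28, 22, 15, 18, 25, 19, 16, 17, 23]

# Regression line from the solution to 1(b): y = 31.65 - 0.96x.
A, B = 31.65, -0.96


def estimate(x):
    """Return (estimated y to 1 d.p., whether x lies inside the observed range)."""
    inside = min(xs) <= x <= max(xs)
    return round(A + B * x, 1), inside


for x in (12, 21):
    y, inside = estimate(x)
    note = "inside the data range" if inside else "outside the data range, not reliable"
    print(f"x = {x}: y = {y} ({note})")
```

The x-values run from 15 to 28, so x = 12 is flagged as outside the range while x = 21 is not.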
Taking Exams
The problem with using a calculator to find the
regression line and then directly writing down the answer
is that one small error entering the data could mean that
in an exam you lose several marks.
To avoid this problem we always check the data
carefully after entering it.
If you are given summary data instead of raw data,
you will need to use the formulae as it isn’t then possible
to use the calculator regression function.
The formulae are in your formulae booklet but we’ll now
see what the terms in the formulae mean.
Formulae for the regression line
I’ll use a simple data set, similar to the earlier one ( the
middle y-value is now 4 rather than 3 ), to illustrate the
method.
x   1   2   3
y   5   4   1
The gradient of the regression line for y on x is given by
$b = \frac{S_{xy}}{S_{xx}}$
$S_{xy}$ is often loosely called the covariance ( strictly, the
covariance is $S_{xy}$ divided by n ) and
$S_{xy} = \sum (x - \bar{x})(y - \bar{y}) = \sum xy - \frac{\sum x \sum y}{n}$
The formulae booklets give both these forms but the 1st
form is usually inefficient. Can you see why?
ANS: In the 1st form we have to subtract the means
from each observation and then multiply, instead of
multiplying and subtracting once.
For our data set,
$\sum xy = 16$, $\sum x = 6$, $\sum y = 10$
so, using the 2nd form,
$S_{xy} = 16 - \frac{(6)(10)}{3} = -4$
Similarly,
$S_{xx} = \sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}$
As before, we use the 2nd form.
$\sum x^2 = 1 + 4 + 9 = 14$ and $(\sum x)^2 = 6^2 = 36$
$S_{xx} = 14 - \frac{36}{3}$
$S_{xx} = 2$
So the gradient is
$b = \frac{S_{xy}}{S_{xx}} = \frac{-4}{2} = -2$
The equation of the line is $y = a + bx$, with $b = -2$.
We now use the fact that the regression line passes
through the point $(\bar{x}, \bar{y})$ so these coordinates satisfy the
equation $y = a + bx$:
$\bar{y} = a + b\bar{x}$
So, $a = \bar{y} - b\bar{x}$
where $\bar{x} = \frac{6}{3} = 2$ and $\bar{y} = \frac{10}{3} = 3.3333$
$a = 3.3333 - (-2)(2) = 7.3333$
So,
$y = 7.333 - 2x$
Now enter the data into your calculator and use the
regression function to check the result.
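If you have a computer to hand, the same steps can be followed in a few lines of code. This is a minimal sketch rather than a replacement for your calculator's regression function. The function name `regression_y_on_x` is an assumed name, and it uses the 2nd forms of $S_{xy}$ and $S_{xx}$ together with $a = \bar{y} - b\bar{x}$.

```python
def regression_y_on_x(xs, ys):
    """Return (a, b) for the least squares regression line y = a + bx."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    s_xy = sum_xy - sum_x * sum_y / n   # 2nd form of S_xy
    s_xx = sum_x2 - sum_x ** 2 / n      # 2nd form of S_xx

    b = s_xy / s_xx                     # gradient (regression coefficient)
    a = sum_y / n - b * sum_x / n       # the line passes through (x-bar, y-bar)
    return a, b


# The simple data set used in this section.
a, b = regression_y_on_x([1, 2, 3], [5, 4, 1])
print(f"a = {a:.3f}, b = {b:.3f}")
```

Working from the sums keeps the code in step with the hand calculation, so each intermediate value can be compared with the working above.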
Using Summary Data
• The equation of the regression line of y on x is
$y = a + bx$
• The gradient of the line is called the regression
coefficient and is given by
$b = \frac{S_{xy}}{S_{xx}}$
where
$S_{xy} = \sum xy - \frac{\sum x \sum y}{n}$ and $S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$
( The 2nd formula given in your formulae booklet for b is
not in the most convenient form. It’s best to work out
$S_{xy}$ and $S_{xx}$ then divide them as above.)
• $(\bar{x}, \bar{y})$ satisfies the equation $y = a + bx$, so
$\bar{y} = a + b\bar{x}$
$a = \bar{y} - b\bar{x}$
e.g.1 The following results are given for 10 pairs of
observations relating 2 variables x and y:
$\sum x = 29$, $\sum y = 42$, $\sum x^2 = 397$, $\sum y^2 = 6728$, $\sum xy = 792$
Find the regression coefficient of y on x and the equation
of the regression line of y on x.
Solution: The regression coefficient is b, the gradient
of the regression line of y on x.
$b = \frac{S_{xy}}{S_{xx}}$
$S_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 792 - \frac{(29)(42)}{10} = 670.2$
$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 397 - \frac{29^2}{10} = 312.9$
$b = \frac{S_{xy}}{S_{xx}} = \frac{670.2}{312.9} = 2.1419$
e.g.1 The following results are given for 10 pairs of
observations relating 2 variables x and y:
 x  29  y  42
2
2
x

397
y

 6728  xy 792
Find the regression coefficient of y on x and the equation
of the regression line of y on x.
Solution:
b  2 1419
a  y - bx

y

y

a
4 2
x

x
 2 9
n
n
4  2 - ( 2  1419)(2  9)  -2 01
The equation of the regression line of y on x is
y  -2  01  2  14x
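The summary-data method can also be sketched in code, which is a handy way to check your working on the exercise that follows. The function name `regression_from_summary` and its parameter names are assumptions made for this illustration. Note that $\sum y^2$ is not needed for the line of y on x.

```python
def regression_from_summary(n, sum_x, sum_y, sum_x2, sum_xy):
    """Return (a, b) for the line y = a + bx, given only summary totals."""
    s_xy = sum_xy - sum_x * sum_y / n   # S_xy = sum(xy) - sum(x)sum(y)/n
    s_xx = sum_x2 - sum_x ** 2 / n      # S_xx = sum(x^2) - (sum(x))^2/n
    b = s_xy / s_xx                     # regression coefficient
    a = sum_y / n - b * sum_x / n       # a = y-bar - b * x-bar
    return a, b


# Summary data from e.g.1.
a, b = regression_from_summary(n=10, sum_x=29, sum_y=42, sum_x2=397, sum_xy=792)
print(f"b = {b:.4f}, a = {a:.2f}")
```

Because the code keeps full precision throughout, its answers may differ slightly in the last decimal place from hand working that uses rounded intermediate values.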
Exercise
Find the regression coefficient of y on x and the equation
of the regression line of y on x for each of the following
sets of data:
1. $\sum x = 361$, $\sum y = 379$, $\sum x^2 = 10421$, $\sum y^2 = 11667$, $\sum xy = 10933$, $n = 13$
2. $\sum x = 42.2$, $\sum y = 46.4$, $\sum x^2 = 291.2$, $\sum y^2 = 290.52$, $\sum xy = 230.42$, $n = 8$
Solutions
1. $\sum x = 361$, $\sum y = 379$, $\sum x^2 = 10421$, $\sum y^2 = 11667$, $\sum xy = 10933$, $n = 13$
$S_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 10933 - \frac{(361)(379)}{13} = 408.46$
$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 10421 - \frac{361^2}{13} = 396.31$
Regression coef. of y on x:
$b = \frac{S_{xy}}{S_{xx}} = \frac{408.46}{396.31} = 1.03$
$a = \bar{y} - b\bar{x}$
where $\bar{y} = \frac{\sum y}{n} = 29.15$ and $\bar{x} = \frac{\sum x}{n} = 27.769$
$a = 29.15 - (1.03)(27.77) = 0.53$
$y = 0.53 + 1.03x$
Your answers may be slightly different from mine as I stored
each value as I calculated it and used the fully correct values
rather than rounded ones when I did subsequent calculations.
This is good practice but not essential at this stage.
2. $\sum x = 42.2$, $\sum y = 46.4$, $\sum x^2 = 291.2$, $\sum y^2 = 290.52$, $\sum xy = 230.42$, $n = 8$
Solution:
$S_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 230.42 - \frac{(42.2)(46.4)}{8} = -14.34$
$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 291.2 - \frac{42.2^2}{8} = 68.60$
Regression coef. of y on x:
$b = \frac{S_{xy}}{S_{xx}} = \frac{-14.34}{68.60} = -0.21$
$a = \bar{y} - b\bar{x}$
where $\bar{y} = \frac{\sum y}{n} = 5.8$ and $\bar{x} = \frac{\sum x}{n} = 5.275$
$a = 5.8 - (-0.21)(5.28) = 6.90$
$y = 6.90 - 0.21x$
Explanatory and Response Variables
Suppose we have data showing that there is a strong linear
relationship between the amount of fertilizer used on some
plants and the yield from the plants.
The yield clearly depends on the amount of fertilizer, not
the other way round. The yield is responding to the
fertilizer.
In this example, the yield is called the response, or
dependent, variable.
The amount of fertilizer used is the explanatory, or
independent, variable. It will have been controlled in the
trial from which the data have been taken.
The following slides contain repeats of
information on earlier slides, shown without
colour, so that they can be printed and
photocopied.
For most purposes the slides can be printed
as “Handouts” with up to 6 slides per sheet.
SUMMARY
Suppose we have a set of values of 2 variables, x and y.
• To estimate a value of y for a given value of x, we
need the least squares regression line of y on x.
• The regression line always passes through the point $(\bar{x}, \bar{y})$
where $\bar{x}$ and $\bar{y}$ are the means of the x- and y-values
respectively.
• The equation of the line is of the form
$y = a + bx$
where b is the gradient and a is the intercept on
the y-axis.
• To find the values of the gradient and intercept on my
calculator I . . . ( note down here what you need to do )
• The gradient is given by b and called
the regression coefficient.
• The intercept is given by a.
e.g.
x   1   2   3
y   5   3   1
We can enter the x and y values into the calculator
and get
$a = 7$, $b = -2$
The equation of the y on x regression line is
$y = 7 - 2x$
Using summary data
• The equation of the regression line of y on x is
$y = a + bx$
• The gradient of the line is called the regression
coefficient and is given by
$b = \frac{S_{xy}}{S_{xx}}$
where
$S_{xy} = \sum xy - \frac{\sum x \sum y}{n}$ and $S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$
( The 2nd formula given in your formula booklet for b is not
in the most convenient form. It’s best to work out
$S_{xy}$ and $S_{xx}$ then divide them as above.)
• $(\bar{x}, \bar{y})$ satisfies the equation $y = a + bx$, so
$\bar{y} = a + b\bar{x}$
$a = \bar{y} - b\bar{x}$
e.g.1 The following results are given for 10 pairs of
observations relating 2 variables x and y:
$\sum x = 29$, $\sum y = 42$, $\sum x^2 = 397$, $\sum y^2 = 6728$, $\sum xy = 792$
Find the regression coefficient of y on x and the equation
of the regression line of y on x.
Solution: The regression coefficient is b, the gradient
of the regression line of y on x.
$S_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 792 - \frac{(29)(42)}{10} = 670.2$
$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 397 - \frac{29^2}{10} = 312.9$
$b = \frac{S_{xy}}{S_{xx}} = \frac{670.2}{312.9} = 2.1419$
$a = \bar{y} - b\bar{x}$
where $\bar{y} = \frac{\sum y}{n} = 4.2$ and $\bar{x} = \frac{\sum x}{n} = 2.9$
$a = 4.2 - (2.1419)(2.9) = -2.01$
The equation of the regression line of y on x is
$y = -2.01 + 2.14x$
Explanatory and Response Variables
Suppose we have data showing that there is a strong linear
relationship between the amount of fertilizer used on some
plants and the yield from the plants.
The yield clearly depends on the amount of fertilizer, not
the other way round. The yield is responding to the
fertilizer.
In this example, the yield is called the response, or
dependent, variable.
The amount of fertilizer used is the explanatory, or
independent, variable. It will have been controlled in the
trial from which the data have been taken.