Regression Analysis

The study of relationships between variables.
A very large part of statistical analysis revolves
around relationships between a response variable
Y and one or more predictor variables X.
In mathematics a relationship occurs when a
formula links two quantities:
Diameter and Circumference of a Circle

Diameter   Circumference
0.5        1.5708
1          3.1416
1.5        4.7124
2          6.2832
2.5        7.8540
3          9.4248
3.5        10.9956
4          12.5664
4.5        14.1372
5          15.7080
5.5        17.2788
6          18.8496
6.5        20.4204
7          21.9911
[Plot: Circumference against diameter - a straight line through the origin]
The circumference of a circle is c = πd.
Compound Interest
Capital C is invested at a constant interest rate r
(added annually)
Say C = 1000, r = 0.05 (5%)
The accumulated capital and interest after t years,
C(t), is:
C(0) = 1000
C(1) = 1000 + 0.05 * 1000 = 1050
C(2) = 1050 + 0.05 * 1050 = 1050 + 52.5 = 1102.5
…
A little bit of algebra gives the following:
C(t) = 1000 (1 + r)^t
A formula links C(t) with t.
This is a deterministic relationship - C(t) can be
exactly calculated for every value of t (an integer).
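A quick check in Python (a minimal sketch; the variable names are just illustrative) reproduces the worked values:

# Deterministic relationship: accumulated capital after t years,
# C(t) = 1000 * (1 + r)**t.
r = 0.05
for t in range(3):
    print(t, 1000 * (1 + r) ** t)   # 0 1000.0, 1 1050.0, 2 1102.5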
[Plot: Capital C(t) against t, t = 0 to 10 years]
Here we have some differences: the graph of C(t) does
not start at 0. Also, although this is not clearly
apparent from the graph, the relationship is a curve.
If we extend the time axis to 100 years we get:
[Plot: Capital C(t) against t, t = 0 to 100 years]
A relationship of the form y = a + bx is linear.
Some relationships, however, are stochastic: they involve a random component.
Westwood Company Example
Data on Lot Size and Number of Man-Hours

Production Run   Lot Size Xi   Man-Hours Yi
1                30            73
2                20            50
3                60            128
4                80            170
5                40            87
6                50            108
7                60            135
8                30            69
9                70            148
10               60            132
In this example, although the relationship is clear
from the graph, we do not have a theory that would
allow us to link the number of man-hours to lot size
(without some empirical observation).
It is in fact clear that such an exact relationship
cannot exist: we observe that for lot size 60 we had
three observations, and they all required a different
number of man-hours to process.
Nevertheless it would be useful to derive some
approximate formula to allow for costing and/or
scheduling of jobs.
Weight of a Car and MPG (USA 1991)
We might expect heavier cars to have lower fuel
efficiency.
[Plot: MPG against weight]
We would not expect, however, that the weight of a
car would give us a complete description of MPG
(but it may be an important factor).
In Example 3 (the Westwood data) we can actually apply a bit of "theory"
(common sense) to suggest that:
MH = job set-up time + lot size * time to process a unit + variation
variation - a random component due to a multitude of
factors affecting an individual job.
If the equation is to be useful in predicting
workload, the variation will need to be small.
The suggested relationship is linear:
Y = a + bX + error
The statistical task is to choose a and b in some
sensible way.
Least Squares
R  y  (a  bx )
i
i
Best a and b to minimise  R
i
2
i
Why Σ Ri²?
- Σ ensures that all points are involved in the choice of a and b.
- Σ Ri is not a good idea (see example).
- Σ |Ri| is too difficult mathematically.
- Σ Ri² is the next choice - also, as we shall see, the solution has some very nice properties.
More importantly, why Ri, the vertical distance?
In what follows it is assumed that there is an
asymmetry between Y and X.
The x's are regarded as fixed known constants; only the
y's involve a random component.
The x's (lot sizes) are treated as if the data were collected
as follows:
Decide on 10 lots of sizes x1 = 30, x2 = 20, …
Then observe the number of man-hours each
takes to process.
Weight of a car affects MPG, not the other way
round. The weight of the car is a fixed known
quantity.
In many examples where regression is used the
direction of the relationship is much less obvious
(examples: heights and weights of people, imports
and GNP) - it has to be assumed by the analyst.
In particular, in mathematics:
if y = a + bx then x = c + dy, where d = 1/b and c = -a/b;
however, in regression
y = a + bx + r, where the r are "errors" in y,
but
x = c + dy + s, where the s are "errors" in x.
If we are given only a and b, then c and d cannot be
calculated. We need to look at other aspects of the
data to compute them.
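A quick numerical illustration of this (a Python sketch on the Westwood data, using the least-squares slope formula derived below; names illustrative). The fitted slope of x on y is Sxy/Syy, not 1/b; the standard identity b·d = r² explains why the two nearly agree here, where r² is close to 1:

# Slope of y on x is b = Sxy/Sxx; slope of x on y is d = Sxy/Syy.
# In general d != 1/b; in fact b*d = r^2.
x = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
y = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar
sxx = sum(xi * xi for xi in x) - n * xbar * xbar
syy = sum(yi * yi for yi in y) - n * ybar * ybar
b = sxy / sxx        # 2.0
d = sxy / syy        # ~0.4978, not exactly 1/b = 0.5
print(b, d, b * d)   # b*d = r^2 ~ 0.9956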
Solution to Least Squares
SSE = Σ Ri²
SSE - Sum of Squares Error (also called
Residual sum of squares)
Task: choose a and b to minimise SSE
Algebraic theory tells us that this is achieved
by solving:
∂(SSE)/∂a = 0
∂(SSE)/∂b = 0

SSE = Σ Ri² = Σ (yi - a - bxi)²

∂(SSE)/∂a = Σ ∂/∂a (yi - a - bxi)²   (derivative of sum = sum of derivatives)

∂/∂a (yi - a - bxi)² = 2 (yi - a - bxi) ∂/∂a (yi - a - bxi)   (chain rule)

∂/∂a (yi - a - bxi) = -1
So
∂(SSE)/∂a = -2 Σ (yi - a - bxi) = -2 (Σ yi - na - b Σ xi)
Equating this to 0 gives:
na = Σ y - b Σ x
or
a = ȳ - b x̄    (Equation 1)
∂(SSE)/∂b = -2 Σ xi (yi - a - bxi), and setting this to 0 gives:
Σ xy - a Σ x - b Σ x² = 0    (Equation 2)
Eqs 1 and 2 are called the normal equations - we
shall use them again later.
Using 1 to get rid of a in 2 gives
Σ xy - (ȳ - b x̄) n x̄ - b Σ x² = 0
Σ xy - n x̄ ȳ = b (Σ x² - n x̄²)
or
b = (Σ xy - n x̄ ȳ) / (Σ x² - n x̄²)

Again:
b = (Σ xy - n x̄ ȳ) / (Σ x² - n x̄²)
a = ȳ - b x̄

So to compute a and b we need: Σ xy, Σ x, Σ y, Σ x² and n.
Example: Westwood Co
Σ xy = 61800, Σ x = 500, Σ y = 1100, Σ x² = 28400, n = 10

b = (61800 - 10 · 50 · 110) / (28400 - 10 · 2500) = 6800/3400 = 2
a = 110 - 2 · 50 = 10

Man-hours = 10 + 2 · Lot size.
Interpretation: it takes 10 man-hours to set up a job
and, on average, 2 man-hours to process a unit.
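In code the computation is only a few lines; a minimal Python sketch (names illustrative) reproduces a = 10 and b = 2 from the table:

# Least-squares fit for the Westwood data:
# b = (Σxy - n·x̄·ȳ) / (Σx² - n·x̄²),  a = ȳ - b·x̄.
x = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]        # lot sizes
y = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]  # man-hours
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b = (sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar) / (
    sum(xi * xi for xi in x) - n * xbar * xbar)
a = ybar - b * xbar
print(a, b)   # 10.0 2.0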
We can use the equation to predict the workload
required for a job of any size:
lot size 35: estimate 10 + 2 · 35 = 80 mh
lot size 60: estimate 10 + 2 · 60 = 130 mh
Note: we had three cases of lot size 60 in the data:
128, 135 and 132.
An estimate for size 60 based on those 3 alone would
be 395/3 = 131.67 -> regression is not just
interpolation.
Lot size 100: estimate 10 + 2 · 100 = 210 - extrapolation
(beyond the observed range of x).
Lot size 10000: estimate 10 + 2 · 10000 = 20010 ???
Do we believe this?!
In practice (i.e. except for exam questions)
regression estimates would be calculated using a
computer package.
MPG on Weight

Variable   Coefficient   s.e. of Coeff   t-ratio   prob
Constant   48.7393       1.976           24.7      < 0.0001
Weight     -0.00821362   0.0006738       -12.2     < 0.0001
Computers being computers tend to output a lot
more information and lots of spurious decimal
places. (Most of the output was deleted here.)
a = 48.74, b = -0.008214
b is very small - does this mean that Weight has
little effect on MPG?
NO!
b is small here because MPG is in miles per gallon
(20 - 40, say) while Weight is in lbs (1700 - 4200). If Weight
were measured in tons, b = -18.4 = 2240 · (-0.008214) (a
would stay the same); if y were YPG (yards per
gallon), both b and a would change (× 1760).
Predictions can be made as above, but note that according
to the fitted line no vehicle will ever exceed 48.7 (= a) MPG;
more seriously, a 3-tonne car can only go backwards - it will have negative MPG.
Notation:
yi = a + bxi + ri
y - dependent variable
x - independent variable
r - residual
a - constant (intercept): the estimated value at x = 0
b - slope: the change in y for a 1 unit change in x
ŷ (y-hat) - prediction, fit: the estimated value for some x,
ŷi = a + bxi
Hats are used to indicate an estimated value.
ri = yi - ŷi - the residual (at xi).
Fit summaries:
Error sum of squares = SSE = Σ ri²
Total sum of squares = SSTO = Σ (yi - ȳ)²
(the error sum of squares if b = 0)
Regression Sum of Squares = SSR = SSTO - SSE.
SSR is interpreted as the sum of squares explained
by the regression.
These lead to a basic indicator of the quality of the
regression
Coefficient of determination = r-squared = r² = R² = SSR/SSTO
= proportion of explained variation.
0 ≤ r² ≤ 1
For the Westwood data r² = 0.996 = 99.6%.
For the MPG data r² = 75.6%.
r² is invariant to scale changes in y or x; r² = corr(x, y)².
Although not perfect (no single summary can be), r²
is a primary indicator of the usefulness of a
regression.
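These summaries are quick to compute; a Python sketch continuing the fit above (same data; a = 10, b = 2):

# SSE, SSTO, SSR and r² for the Westwood fit.
x = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
y = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]
ybar = sum(y) / len(y)
resid = [yi - (10 + 2 * xi) for xi, yi in zip(x, y)]
sse = sum(r * r for r in resid)            # 60
ssto = sum((yi - ybar) ** 2 for yi in y)   # 13660
print(sse, ssto, (ssto - sse) / ssto)      # r² ~ 0.9956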
What is a good (high) r²?
No absolute answer - although r² = 0 -> b = 0 -> no
relationship, quite a small r² can be statistically
significant (with enough data).
r² will depend on the inherent variability in the y's.
In data with 'no problems':
- social surveys, education, medical, biological data: a good value of r² is around 0.4
  (e.g. A-level results to predict University performance: 0.4 - 0.5);
- bio-mechanical, industrial data: 0.6+ should be expected
  (e.g. r² between breaking and bending moment of planks: 0.6);
- engineering, scientific experiments, with accurate measurements and a good
  theoretical background for the model: 0.9+.
Care must be taken though, as one or two points
may severely affect r². It is quite easy to construct
examples where the 'relationship' between x and y is confined
to a single data point and r² is very close to 1.
Mean Squared Error
MSE = SSE / (n - 2) = s²
This is the variance of the residuals; (n - 2) is used so that the
estimate is unbiased.
s = √MSE is the standard deviation about the line.
It is a basis for computing the accuracy of estimates of
coefficients and predictions.
Properties of the regression line:
In the derivation of Equation 1 we had
∂(SSE)/∂a = -2 Σ (yi - a - bxi) = 0
⇒ Σ (yi - a - bxi) = Σ Ri = 0
The sum of the residuals (= vertical departures) = 0.
a is chosen in such a way that the line is central in
the vertical direction:
a = ȳ - b x̄; if b = 0 (y doesn't depend on x)
then a = the average of the y's.
This form also means that the point (x̄, ȳ), the
centroid of the data, lies on the estimated line.
The properties emerging from Equation 2 are more
subtle:
Σ xi Ri = 0  ⇒  Σ (xi - x̄) Ri = 0   (since Σ Ri = 0)
Imagine a weightless stick; at distances xi along it,
tie weights of magnitude Ri.
The stick is balanced around x̄: there is no tendency
for the line to twist.
Rewriting the formula for b is useful:
b = (Σ xy - n x̄ ȳ) / (Σ x² - n x̄²)
  = (Σ xy - x̄ Σ y) / (Σ x² - n x̄²)    (since n x̄ ȳ = x̄ Σ y)
  = Σ (xi - x̄) yi / (Σ x² - n x̄²)
  = Σ ki yi

b = k1 y1 + k2 y2 + …
b is a linear combination of the y's (it does not involve
squares of the y's); this makes the properties of b simple (for
testing, confidence intervals etc.).
Further, the k's have some nice properties:
ki = (xi - x̄) / SSX, where SSX = Σ x² - n x̄²
Σ ki = (1/SSX) Σ (xi - x̄) = 0, as Σ (xi - x̄) = 0.
The sum of the k's is 0.

Σ ki² = Σ (xi - x̄)² / SSX² = SSX/SSX² = 1/SSX

Σ ki xi = Σ ki (xi - x̄)    (adding x̄ changes nothing, since x̄ Σ ki = 0)
        = Σ (xi - x̄)(xi - x̄) / SSX = SSX/SSX = 1
The sum of the ki xi is 1.

These results make the calculation of the statistical
properties of b relatively simple.
Finally, a is also a linear combination of the y's.
A prediction ŷ = a + bx is also a LC of the y's.
A residual yi - ŷi is also a LC of the y's.
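These identities are easy to verify numerically; a small Python check on the Westwood data (names illustrative):

# Check Σk = 0, Σk·x = 1 and Σk·y = b = 2 for the Westwood data.
x = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
y = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]
xbar = sum(x) / len(x)
ssx = sum((xi - xbar) ** 2 for xi in x)        # 3400
k = [(xi - xbar) / ssx for xi in x]
print(sum(k))                                  # 0 (up to rounding)
print(sum(ki * xi for ki, xi in zip(k, x)))    # 1.0
print(sum(ki * yi for ki, yi in zip(k, y)))    # 2.0 = b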
Statistical Model for regression.
To be able to conduct tests (is the slope 0? is the
constant 0?), make confidence statements etc., we
need a statistical framework for the regression
model.
Errors are independent Normal with 0 mean:
Y = α + βx + e, where e ~ N(0, σ²)
or equivalently
Y ~ N(α + βx, σ²)
(α and β are the true, unknown constant and slope; a and b are their estimates.)
The assumption is not too serious: in practice the normal model is usually quite good for
'errors' in continuous variables.
There are some situations, however, where the
regression idea needs to be adapted (a different
algorithm is needed). Most commonly these
arise if Y is confined to a small number of
values:
Y - binary = (Yes or No) = (1 or 0)
Y - takes on a small number of values:
e.g. (none, some, lots) = ? (0, 1, 2) ????
goals in soccer: Y = 0, 1, 2, … but mostly
small values.
But with more than 5 realistic, ordered
categories I would probably use ordinary
regression.
The independence assumption is more frequently
violated, and a different approach might be
required.
Example: time series. Y - house prices, x - time. Suppose
the average growth is 5%. If, for random
(unaccounted-for) reasons, the year 2000 is below
average, it is quite likely that the growth will not
fully recover during 2001; this introduces a
dependence between the 'errors'. (The errors could
still be Normal, though.)
This problem will also arise if the relationship is
mis-specified
- the relationship between x and y is a curve and a
line is fitted - this will introduce sequential
dependence between the residuals (see examples
later).
The two cases require different approaches - in
time series we should account for (in fact take
advantage of) the correlation in the modelling; in
the second case we should specify the model
correctly and thus get rid of the problem.
A third assumption, a bit hidden, is that σ² is
constant across the range of the data - the variation
around the line is the same all along it.
In many situations the variance in y (and hence e)
may change with y.
In a real 'Westwood' example, processing smaller
lots will involve less variation than processing
large lots, in which there is more opportunity for
things to go right/wrong.
We shall examine some ways of correcting such a
problem.
We note at this stage that we shall need some way
of assessing whether our particular regression
suffers from any of these departures.
Statistical properties of the estimates.
Having a statistical model for the data allows us to
properly assess the estimation procedure.
The estimates a, b, ŷ and r are all linear
combinations of the y's. Linear combinations of
Normals are Normally distributed, so all the
estimates are Normally distributed - all that we
have to do is to compute the means and variances
of these estimates and we have a complete
specification of their distribution.
Reminder:
If L = k1 y1 + k2 y2 + … = Σ ki yi
then
E(L) = Σ ki E(yi)
V(L) = Σ ki² V(yi)   if the y's are independent
     = σ² Σ ki²      if the y's all have variance σ².

E(b) = Σ ki E(yi) = Σ ki (α + β xi) = α Σ ki + β Σ ki xi = α · 0 + β · 1 = β
b is an unbiased estimate of β - on average the
estimate equals the true value.
V(b) = Σ ki² V(yi) = σ² Σ ki² = σ²/SSX = σ² / (Σ x² - n x̄²)

In the estimate of V(b), σ² is replaced by its
estimate s².
Observe that the greater SSX is, the higher the
precision of the estimate of b. SSX measures the
spread of the x's (SSX = (n - 1)V(x)) - a wide base for x
provides a good estimate of the slope.
a = ȳ - b x̄ = (1/n) Σ yi - x̄ Σ ki yi
so
a = Σ (1/n - ki x̄) yi

Note that a is a prediction from the regression at x
= 0. We will evaluate the properties of a general
prediction.
y( x )  a  bx
 y  bx  bx  y  b( x  x )
h
h
h
h
E ( y)  E ( y )  ( x  x ) E (b)
   x  x   x     x 
h
h
Unbiased prediction.
h
V ( y)  V ( y  b( x  x ))
 V ( y )  ( x  x ) V (b)  2Cov ( y , b)
 (x  x) 


0
n
SSX
1 (x  x)
 ( 
)
n
SSX
h
2
h
2
2
2
h
2
2
h
because Cov( y , b)  0
1
1

Cov ( y , b)  Cov (  y ,  k y )   kV ( y )   k  0
n
n
n
i
i
i
i
i
i
(the 2nd equals requires y's independent, the third
V(y)=2)
 1 (x  x) 
V ( y)    

n
SSX 
2
2
h
The variance of a prediction at xh depends on the
variation in the slope and the number of data points, but
also on the distance of xh from the centre of the x's.
Means and individual predictions:
Scenario 1
The accountant of Westwood Co. wants to evaluate
the profit/loss on jobs with a lot size of 60.
Based on the data the estimate is 10 + 2 · 60 = 130.
This estimate is based on 'only' 10 points, so it
would be useful to know how accurate it is:
x̄ = 50, xh = 60, SSX = 28400 - 25000 = 3400, n = 10
and SSE = 60.
(SSE is computed by calculating the ri's and summing
their squares - a messy formula - or taken directly
from package output.)
So s² = 60/8 = 7.5.
V(ŷ) = 7.5 (1/10 + (60 - 50)²/3400) = 0.970588
SD(ŷ) = 0.9852
So the precision of the estimate is approx. ±2
(= 2 × 1, which is an approximation to
t(8, 0.05) × 0.9852).
The estimate is accurate to within 2 man-hours (about 2%) with 95%
confidence - pretty good for 'only' 10 points.
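The same arithmetic in a short Python sketch (values as computed above; names illustrative):

import math

# Variance and SD of the fitted mean at xh = 60 for the Westwood data:
# V(ŷ) = s²(1/n + (xh - x̄)²/SSX), with s² = SSE/(n - 2) = 7.5.
n, xbar, ssx, sse = 10, 50.0, 3400.0, 60.0
s2 = sse / (n - 2)
xh = 60
var_fit = s2 * (1 / n + (xh - xbar) ** 2 / ssx)
print(var_fit, math.sqrt(var_fit))   # 0.9706 0.9852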
We remarked earlier that we had an alternative
estimate at x = 60:
note: we had three cases of lot size 60 in the data -
128, 135 and 132.
An estimate for size 60 based on those 3 would
be 395/3 = 131.67. The standard error of this estimate
is 2.028, about twice that of the regression estimate.
There is no such thing as a free lunch… definitely
not in statistics. Where did the extra precision come
from?
There must be more information used to get ŷ for it
to have higher precision.
The extra information comes from the other points
in the data.
To use that information we have assumed y = a +
bx; this links all the points together.
It will not always be the case that you will observe
such an improvement - this is an artificial data set
for which the linear relationship is an excellent
explanation.
Scenario 2.
The sales manager of Westwood Co wants to
quote a price (in Mhs) for a specific job with lot
size 60.
How does this differ from Scenario 1? The
accountant wanted the average price of a job; here
we are talking about a specific job.
Unless the sales manager knows of some specific
reason why this job may require more (or less)
than the average man-hours,
the best estimate is the average:
ỹ = 10 + 2 · 60 = 130
but
V(ỹ) = V(y) + V(ŷ), and this is estimated by
s² + s² (1/n + (xh - x̄)²/SSX) = s² (1 + 1/n + (xh - x̄)²/SSX)
= 7.5 + 0.97 ≈ 8.5
The model assumption was that the variation about
the line is σ², which is estimated by s².
In Scenario 1 we were after the mean of Y at x = 60,
and the errors that arise are due to errors in the
estimates a, b, as these are based on samples.
In Scenario 2, to these we have to add the inherent
variation in Y - not all jobs of size 60 will take the
same time.
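The corresponding sketch for a single new job (same assumed values as before):

# Variance for predicting one new job at xh = 60: the inherent
# variation s² is added to the variance of the fitted mean.
n, xbar, ssx, s2 = 10, 50.0, 3400.0, 7.5
xh = 60
var_pred = s2 * (1 + 1 / n + (xh - xbar) ** 2 / ssx)
print(var_pred)   # ~8.47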
Testing.
As always, statistical testing parallels confidence
intervals; just the emphasis is the other way round.
The most common form of test in regression is b = 0
- i.e. is there any relationship between x and y at
all? The test takes the standard form:
t = (b - b0) / sd(b), distributed as t
with df = the df associated with s, i.e. n - 2
b is the observed value, b0 the hypothesised value (often
= 0); tests about a and about predictions can be conducted
in a similar way.
An excellent rule: if |t| < 2, do not reject the
hypothesis.
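For the Westwood slope this works out as follows (a sketch; sd(b) = √(s²/SSX) from the variance formula above):

import math

# t-test of b = 0 for the Westwood fit.
b, s2, ssx = 2.0, 7.5, 3400.0
sd_b = math.sqrt(s2 / ssx)   # ~0.047
t = b / sd_b
print(t, t ** 2)             # ~42.58; t² ~ 1813 equals the overall F below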
Traditionally F-tests are employed in regression
and virtually every package quotes them (even
though they are not really interesting)
f = (s1²/df1) / (s2²/df2), distributed as F(df1, df2)

s1²/df1 is an estimate of the variance if the hypothesis is
true (something larger if it is false);
s2²/df2 is an estimate of the variance regardless (it does not
depend on the hypothesis).
The "overall F" is a test of b = 0.
The numerator is an estimate of σ² if b = 0 (based on SSR);
the denominator is MSE, which estimates σ²
whatever b is.
This is compared to F(1, n - 2).
F(1, df) = (t(df))², and as the t-test is simpler and
nearly always quoted, the F-test is redundant here.
F-tests are important in analysis of variance
(several t-tests put together).
In multiple regression, where there is more than
one explanatory variable (x) and more than one b,
the F-test tests all b = 0 simultaneously - if F is not
significant the data are only suitable for the waste
bin.
Computer output
The Westwood Co example has been analysed
using three different packages to illustrate the output.
Data Desk
Line 1 = dependent variable
Line 2 = Data Desk allows regression on a subset of points defined by a selector
Line 3 = r² (adjusted = garbage)
Line 4 = estimate of s and df
Line 6 = SSR and the overall F-test
Line 7 = SSE, df, MSE
Line 9 = a, SD(a), t-value for a = 0, significance of a = 0
Line 10 = b, SD(b), t-value for b = 0, significance of b = 0
MINITAB

Regression Analysis

The regression equation is
C3 = 10.0 + 2.00 C2

Predictor        Coef      StDev     T       P
Constant         10.000    2.503     4.00    0.004
C2               2.00000   0.04697   42.58   0.000

S = 2.739   R-Sq = 99.6%   R-Sq(adj) = 99.5%

Analysis of Variance
Source           DF   SS      MS      F         P
Regression       1    13600   13600   1813.33   0.000
Residual Error   8    60      7
Total            9    13660

C3 = man-hours, C2 = lot size.
Exactly the same stuff, but in a different order.
Splus

*** Linear Model ***
Call: lm(formula = V3 ~ V2, data = new.data, na.action = na.omit)
Residuals:
 Min   1Q   Median   3Q    Max
 -3    -2   -0.5     1.5   5
Coefficients:
             Value     Std. Error   t value   Pr(>|t|)
(Intercept)  10.0000   2.5029       3.9953    0.0040
V2           2.0000    0.0470       42.5833   0.0000
Residual standard error: 2.739 on 8 degrees of freedom
Multiple R-Squared: 0.9956
F-statistic: 1813 on 1 and 8 degrees of freedom,
the p-value is 1.02e-010

V3 = man-hours, V2 = lot size.
Lines 3-5 give a basic summary of the residuals, followed by
the coefficients (estimate and its standard error),
r² (no adjusted r² - this is a statistician's package)
and the good old F-test.
Splus also outputs some graphs etc. - we discuss such
things shortly.
Checking the assumptions.
The procedures for validating a regression model
are graphical; no tests are (usually) conducted.
Graphs are used to identify problems with model
specification (is it linear?), outliers - odd
observations that do not fit, observations that have a
large impact on the estimates (influence),
heteroscedasticity - non-constant variance, in English -
and the independence and normality of the residuals.
The basic tool is the scatter plot of something
against something (possibly also against a third something,
if the package supports 3D plots - Data Desk and S-plus do).
The first plot is Y against X, usually produced
before a regression is fitted, to see if the relationship
is linear.
We did this with Westwood and it seemed OK.
Another look at the MPG plot: is it a curve, or my bad drawing?
You wouldn't try a straight-line fit to the data
below:
The MPG data is probably most typical of real data
sets - it is not really clear whether it is a line or not.
If there is no strong theoretical reason for believing
the relationship to be non-linear, start with a line
and look for any problems. Even in
Example 3, if we had only half the data (below 50 or
above 50), we would probably fit lines to start with.
Fitting a line to data removes the basic trend
(increase or decrease); it may then be easier to see what
is going on.
The residual plot
r = y - ŷ
The plot of r against x - or, in practice, against ŷ - is a basic tool
for checking that all is OK.
However, as computing power has grown, we can get fancier:
r is scale dependent, so we cannot say whether the residuals are big
or small.
To get round this:
standardised residuals = r/s
These are approximately N(0,1) in theory, so
anything above 2.5 or below -2.5 is suspicious,
anything larger in magnitude than 3 doubtful, and anything larger than 4
ridiculous -> the model does not fit for this point.
But that is not quite correct: the variances of the
observed residuals are not actually equal; they get
smaller further away from the centre of the data. With
computers being good at arithmetic, most packages offer
studentised residuals; these take account of the
changing variance and are more accurately N(0,1).
Internally and externally studentised residuals
differ in the way the variance is estimated.
These are refinements that are helpful but not
strictly necessary; using old-fashioned standardised
residuals is adequate, but you might as well use the
better tools if they are available.
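A minimal sketch of the old-fashioned version for the Westwood fit (names illustrative):

import math

# Standardised residuals r/s for the Westwood fit (a = 10, b = 2, s² = 7.5).
x = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
y = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]
s = math.sqrt(7.5)
std_res = [(yi - (10 + 2 * xi)) / s for xi, yi in zip(x, y)]
print([round(v, 2) for v in std_res])   # largest magnitude ~1.8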
Raw residuals, Westwood
[Residual plot]
No obvious problems, unless the 4+ is rather big;
s = √7.5 = 2+, so it is probably not suspicious when
divided by s.
ESR (externally studentised residuals), Westwood
[Residual plot]
We now know that it is about 2.5 in standard units -
no great concern; the refinement is practically
irrelevant.
[Residual plot: MPG data]
The curvature hypothesis is not really borne out by
this plot.
There is some tendency for positive residuals at the edges
and negative ones in the centre (which might indicate a
concave curve), but not really enough to worry
about. Also there is a car with a lousy MPG for its
weight (a large negative residual) - the Ford Mustang:
sporty, light and a big engine - but the residual is not
really big enough to be worrying.
Before going further we need to look at some
residual plots with obvious problems, so that we can
see what to look for.
Real data sets often have lots of problems, or subtle
problems, and it is difficult to decide unambiguously
what is going on. The simplest way is to
illustrate on artificial data sets that each have one
specific problem.
In the examples below, modified Westwood Co data
are used.
Outlier - suppose there was a typo in the recording
of the data: a 40 became a 90.
Or…
due to poor workmanship the lot had to be
processed again (the total workload being recorded).
Either way this observation is unlike the others.
[Scatter plot and residual plot]
Dependent variable is: outlier
No Selector
R squared = 85.8%, s = 15.56 with 10 - 2 = 8 degrees of freedom

Variable   Coefficient
Constant   22.3529
Size       1.85294
r² is lower but still OK.
The line is affected: b decreases, a increases.
Remedy: this observation is NOT part of this
model. Remove it from the data set. Seek an
explanation of why it is different.
Curvature
ynew = 30 + sqrt(y - 30)
Economies of scale: with larger lot sizes it takes
less time to process a unit.
[Scatter plot and residual plot]
The scatter plot gives hardly any indication of problems.
The residuals tend to be the same sign at the edges of
the data (here negative - a convex relationship);
the residual signs here are:
- (- +) + + + (+ + -) -
r² = 98.2%: no indication of problems there either.
The coefficients are not comparable with the original ones.
Remedy: get the form of the relationship right.
Transform the X's, the Y's or both - see later.
Example: the USA population.
Note:
non-independence of residuals will produce a
similar plot. A curved relationship will induce
non-independence.
Heteroscedasticity - changes in variance.
Ynew = Y + random · (Y)
Variance of Y proportional to Y².
Diagnostic: a triangular shape to the residual plot.
Effects: a sub-optimal line. There is no bias, but as
(here) the small (X, Y) points are more accurate, they
should be given more say in determining the line.
Tests and expressions for variances are invalid.
Remedy: correct the model - weighted regression
(a sketch follows). Transformations may help; a curved
model is often more appropriate.
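A minimal weighted least-squares sketch (an illustration, not a prescription; the weight choice w = 1/x² below is just one assumption about how the variance grows):

# Weighted least squares: choose a, b to minimise Σ wi(yi - a - b·xi)².
# Solving the weighted normal equations gives:
def wls(x, y, w):
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    b = (swxy - swx * swy / sw) / (swxx - swx * swx / sw)
    a = (swy - b * swx) / sw
    return a, b

# Illustrative use on the Westwood data, downweighting the larger lots:
x = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
y = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]
print(wls(x, y, [1 / xi ** 2 for xi in x]))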
These are the most common features of
problematic residual plots. Residual analysis is
often more informative than the regression itself:
it identifies the odd cases - the Mustang in the MPG data.
The rule is: no patterns in residual plots?
Unevenly spaced integer x's: NO problem.
Rounded Y: regression almost perfect (we do not have
exact normality).
Some case studies.