What is regression?
(Source: Roger Hadgraft, roger.hadgraft@eng.monash.edu.au)
Fitting models to data sets
Linear regression is most common
“Line of best fit”
Example
Look at the data
[Figure: chart of fuel efficiency (0 to 25) against mass (800 to 1600 kg)]
Chart | Add trendline
[Figure: fuel efficiency against mass (kg) with the fitted trendline y = 0.0126x - 0.8763, $R^2 = 0.3427$]
Some maths
Assume we have paired data [x(i), y(i)]
Simplest model is:
Y(i) = b1.x(i) + b0
Where Y is the model and y is the original data
This notation matches Excel's
Residual or Error: e(i) = Y(i) - y(i)
Minimise $\sum e(i)^2$ (the least-squares approach)
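As a minimal sketch of the least-squares criterion, the Python below computes the residuals e(i) = Y(i) - y(i) and their sum of squares for two candidate lines. The data values and the function name `sum_squared_errors` are made up for illustration; they are not the fuel-efficiency data from the chart.

```python
def sum_squared_errors(xs, ys, b1, b0):
    """Return the sum of e(i)^2, where e(i) = Y(i) - y(i) and Y(i) = b1*x(i) + b0."""
    total = 0.0
    for x, y in zip(xs, ys):
        model = b1 * x + b0      # Y(i), the model value
        residual = model - y     # e(i), the error for this point
        total += residual ** 2
    return total


# Hypothetical paired data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 8.0]

sse_a = sum_squared_errors(xs, ys, 1.9, 0.0)  # happens to be the least-squares line here
sse_b = sum_squared_errors(xs, ys, 1.5, 1.0)  # another candidate line
print(sse_a, sse_b)
```

The first line gives the smaller sum of squares, which is what "best fit" means under this criterion.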
Chart | Add trendline
[Figure: the trendline chart again (y = 0.0126x - 0.8763, $R^2 = 0.3427$), with a residual e(i) marked]
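Regression coefficients
SSE = Sum of the Squares of the Errors
$$\mathrm{SSE} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - b_1 x_i - b_0)^2$$
Choose b1 and b0 to minimise SSE
$$\frac{\partial(\mathrm{SSE})}{\partial b_0} = 0 = -2 \sum_{i=1}^{n} (y_i - b_1 x_i - b_0)$$
$$\frac{\partial(\mathrm{SSE})}{\partial b_1} = 0 = -2 \sum_{i=1}^{n} (y_i - b_1 x_i - b_0)\, x_i$$
Rearranging
$$n\, b_0 + b_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$$
$$b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$$
Thus:
$$b_1 = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$
$$b_0 = \bar{y} - b_1 \bar{x}$$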
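The closed-form result for b1 and b0 can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name `fit_line` and the four data points are hypothetical, not taken from the slides.

```python
def fit_line(xs, ys):
    """Least-squares slope b1 and intercept b0 for paired data [x(i), y(i)]."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)

    denominator = n * sum_xx - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    b1 = (n * sum_xy - sum_x * sum_y) / denominator
    b0 = sum_y / n - b1 * sum_x / n   # b0 = mean(y) - b1 * mean(x)
    return b1, b0


# Hypothetical paired data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 8.0]
b1, b0 = fit_line(xs, ys)
print(b1, b0)
```

For this data the sums are n = 4, sum of x = 10, sum of y = 19, sum of xy = 57 and sum of x squared = 30, so b1 = (228 - 190)/(120 - 100) = 1.9 and b0 = 4.75 - 1.9 * 2.5 = 0.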
Data Analysis | Regression
We can do the calculations by hand, or we can use Excel's Data Analysis ToolPak
Tools | Add-ins | Data Analysis
Once only to activate it
Tools | Data Analysis | Regression
Demonstration
Example
Tools | Data Analysis | Regression
This means that 34% of the variance in fuel consumption is explained by vehicle mass. The remaining 66% is due to other factors (e.g. driver behaviour).
Is the model any good?
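$R^2$ = proportion of the variance of the y data explained by the regression equation
$R^2 = \mathrm{SSR}/\mathrm{SST}$
SSR = explained variation; SSE = unexplained variation
Total Sum of Squares
$$\mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
Error Sum of Squares
$$\mathrm{SSE} = \sum_{i=1}^{n} e_i^2$$
Regression Sum of Squares
$$\mathrm{SSR} = \mathrm{SST} - \mathrm{SSE}$$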
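The three sums of squares and the resulting R squared can be sketched as follows. The function name `sums_of_squares`, the data and the fitted coefficients (b1 = 1.9, b0 = 0) are hypothetical values chosen for illustration.

```python
def sums_of_squares(xs, ys, b1, b0):
    """Return (SST, SSE, SSR, R^2) for the line Y = b1*x + b0."""
    n = len(ys)
    y_mean = sum(ys) / n
    sst = sum((y - y_mean) ** 2 for y in ys)                    # total variation
    sse = sum((b1 * x + b0 - y) ** 2 for x, y in zip(xs, ys))   # unexplained part
    ssr = sst - sse                                             # explained part
    return sst, sse, ssr, ssr / sst


# Hypothetical data and its least-squares line, for illustration only
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 8.0]
sst, sse, ssr, r2 = sums_of_squares(xs, ys, 1.9, 0.0)
print(sst, sse, ssr, r2)
```

Here SST = 18.75 and SSE = 0.7, so SSR = 18.05 and R squared is about 0.963. The identity SSR = SST - SSE holds when b1 and b0 are the least-squares values.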
Tools | Data Analysis | Regression
$R = \sqrt{R^2}$
Tools | Data Analysis | Regression
Adjusted R Square: compensates for the number of model parameters (in multiple linear regression).
Text page 587
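The slides do not give a formula for this correction. The sketch below uses the standard adjusted R squared, 1 - (1 - R^2)(n - 1)/(n - k), with k counting every fitted parameter including the intercept. The function name is hypothetical, and n = 20 is an inference from the residual degrees of freedom (n - k = 18 with k = 2) quoted later in the slides, not a figure stated directly.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for n data points and k fitted parameters (intercept included)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k)


# R^2 from the trendline chart; n = 20 is inferred, see the note above
adj = adjusted_r_squared(0.3427, n=20, k=2)
print(round(adj, 4))
```

With these inputs the adjusted value is about 0.306, slightly below the raw 0.3427, because each extra parameter is penalised.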
Tools | Data Analysis | Regression
Standard Error: standard deviation of the residuals (but divide by (n-2) rather than (n-1))
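A minimal sketch of that calculation, dividing the sum of squared residuals by (n-2) before taking the square root. The function name `residual_standard_error`, the data and the coefficients are hypothetical.

```python
import math


def residual_standard_error(xs, ys, b1, b0):
    """Standard deviation of the residuals, using (n - 2) in the denominator."""
    n = len(xs)
    if n <= 2:
        raise ValueError("need more than two data points")
    sse = sum((b1 * x + b0 - y) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (n - 2))


# Hypothetical data and its least-squares line, for illustration only
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 8.0]
se = residual_standard_error(xs, ys, 1.9, 0.0)
print(se)
```

The divisor is (n-2) because two parameters, b1 and b0, were estimated from the data.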
Questions?
Tools | Data Analysis | Regression
ANOVA = Analysis of Variance
Tools | Data Analysis | Regression
SSR, SSE and SST
Tools | Data Analysis | Regression
Regression df = k-1
Total df = n-1
Residual df = (n-1)-(k-1) = (n-k)
k = number of parameters, n = number of data points
Tools | Data Analysis | Regression
Regression MS = SSR/df1 (df1 = regression df)
Residual MS = SSE/df2 (df2 = residual df)
Tools | Data Analysis | Regression
F = Regression MS / Residual MS
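The degrees of freedom, mean squares and F statistic can be sketched together. The function name `anova_f` is hypothetical. The actual SSR and SSE from the Excel output are not reproduced in the slides, so the example scales SST to 1 and uses SSR = 0.3427 and SSE = 0.6573, which follows from R squared = SSR/SST; this leaves F unchanged. The value n = 20 is an inference from the slides' df1 = 1 and df2 = 18.

```python
def anova_f(ssr, sse, n, k):
    """Return (df1, df2, regression MS, residual MS, F) for k parameters and n points."""
    df1 = k - 1            # regression degrees of freedom
    df2 = n - k            # residual degrees of freedom
    regression_ms = ssr / df1
    residual_ms = sse / df2
    return df1, df2, regression_ms, residual_ms, regression_ms / residual_ms


# SST scaled to 1, so SSR = R^2 and SSE = 1 - R^2 (assumption, see the note above)
df1, df2, reg_ms, res_ms, f_stat = anova_f(ssr=0.3427, sse=0.6573, n=20, k=2)
print(df1, df2, round(f_stat, 2))
```

Under these assumptions F comes out at roughly 9.4.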
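Tools | Data Analysis | Regression
Significance F: probability of an F statistic at least this large, given df1=1 and df2=18, if there were no relationship.
A small value is evidence against “no relationship”; it is not itself the probability of no relationship.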
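As a rough sketch of where this probability comes from: when df1 = 1, an F statistic is the square of a t statistic with df2 degrees of freedom, so the tail probability can be found by numerically integrating the t density. The function names are hypothetical, the code covers only the df1 = 1 case, and the input F of about 9.385 is derived from R squared = 0.3427 with an inferred n = 20 rather than read from the Excel output. In practice a statistics library or Excel itself would normally be used.

```python
import math


def t_density(u, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + u * u / df) ** (-(df + 1) / 2)


def f_tail_probability_df1_one(f_stat, df2, steps=2000):
    """P(F >= f_stat) for df1 = 1, using Simpson's rule (steps must be even)."""
    t = math.sqrt(f_stat)              # F with df1 = 1 is t squared
    h = t / steps
    total = t_density(0.0, df2) + t_density(t, df2)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2     # Simpson weights: 4 for odd, 2 for even nodes
        total += weight * t_density(i * h, df2)
    area_zero_to_t = total * h / 3
    return 1.0 - 2.0 * area_zero_to_t  # both tails of the t distribution


p_value = f_tail_probability_df1_one(9.385, df2=18)
print(p_value)
```

The result falls between 0.005 and 0.01, so an F this large would be unusual if mass and fuel consumption were unrelated.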
Other regressions
Multilinear regression
Non-linear equations
Transform the variables, e.g. logs, powers, etc.
Use multilinear regression to determine the coefficients
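As one illustration of the transform idea, a power law y = a.x^b becomes linear after taking logs: ln(y) = ln(a) + b.ln(x). The sketch below fits that line with simple (single-variable) linear regression rather than multilinear regression, since only one transformed variable is involved. The function names and the synthetic data (generated exactly from a = 2, b = 1.5) are assumptions for illustration.

```python
import math


def fit_line(xs, ys):
    """Least-squares slope and intercept for paired data."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    b0 = sum_y / n - b1 * sum_x / n
    return b1, b0


def fit_power_law(xs, ys):
    """Fit y = a * x**b by regressing ln(y) on ln(x); needs positive x and y."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    b, ln_a = fit_line(log_x, log_y)   # slope is b, intercept is ln(a)
    return math.exp(ln_a), b


# Synthetic data generated from y = 2 * x**1.5, for illustration only
xs = [1.0, 2.0, 4.0, 8.0]
ys = [2.0 * x ** 1.5 for x in xs]
a, b = fit_power_law(xs, ys)
print(a, b)
```

Note that least squares is then minimised on the log scale, so the fit weights errors differently from a direct non-linear fit on the original data.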