Statistical Modelling
Models for explanation and prediction
• Response variable - y
• Explanatory variables - x's
– Can be continuous, categorical or a combination
• Simple random sample of n observations
• Seek relationship between y and x's
– Aim for a parsimonious model

Simple Linear Regression
Single explanatory variable
• Parameters
– Intercept
– Slope
• Errors
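A common way to write the model these bullets describe, as a sketch only: the symbols \beta_0 (intercept), \beta_1 (slope) and \varepsilon_i (errors) are assumptions here, since the slides name the terms only in words.

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \dots, n
```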
Assumptions
• Linearity
• Error terms
– Mean zero
– Constant variance
– Independence
– Often also assume normality, i.e. errors ~ N(0, σ²)

Fitting
Least Squares - find parameters to minimize error sum of squares
For normal errors this is equivalent to maximum likelihood
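A minimal sketch of the least squares idea, assuming the usual closed-form solution for a single explanatory variable. The function names and the toy data are hypothetical illustrations, not the mussel data used later.

```python
from statistics import mean


def least_squares_line(x, y):
    """Return (intercept, slope) minimising the error sum of squares."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def error_sum_of_squares(x, y, intercept, slope):
    """Sum of squared differences between observed and fitted values."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical toy data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = least_squares_line(x, y)
sse = error_sum_of_squares(x, y, intercept, slope)
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}, SSE = {sse:.4f}")
```

Moving either parameter away from these values increases the error sum of squares, which is what "least squares" means here.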
Linear Regression - Mussels
• Consider relationship between edible mass and shell mass
[Figure: Scatterplot of edible vs shell; x-axis shell (0 to 400), y-axis edible (0 to 50)]

Fitting the Least Squares Linear Regression Line
[Figure: plot with axis label "weight"; other surviving label: "clear"]
Regression Analysis: edible versus shell
The regression equation is
edible = 4.42 + 0.133 shell

Predictor      Coef   SE Coef      T      P
Constant     4.4234    0.8569   5.16  0.000
shell      0.132712  0.005768  23.01  0.000

S = 4.27444   R-Sq = 86.9%   R-Sq(adj) = 86.7%

Regression Output
Coef and SE Coef give each estimate and its variability
Test of significance given by t-value, e.g. for testing H0: slope = 0 vs H1: slope ≠ 0 use the t-value and associated p-value
Rule-of-thumb: t-values with absolute value greater than 2 are significant (roughly the 5% level)
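The T column can be checked by hand: each t-value is the coefficient divided by its standard error. A small sketch using the figures from the regression output; the variable names are illustrative, and the cutoff of 2 is the rule-of-thumb from the slide, not an exact critical value.

```python
# (Coef, SE Coef) pairs copied from the regression output
coefficients = {
    "Constant": (4.4234, 0.8569),
    "shell": (0.132712, 0.005768),
}

RULE_OF_THUMB = 2.0  # slide's rule-of-thumb threshold for |t|


def t_value(coef, se_coef):
    """t-statistic for testing that the parameter is zero."""
    return coef / se_coef


t_values = {name: t_value(c, se) for name, (c, se) in coefficients.items()}
significant = {name: abs(t) > RULE_OF_THUMB for name, t in t_values.items()}

for name, t in t_values.items():
    print(f"{name:>8}: t = {t:.2f}, |t| > 2: {significant[name]}")
```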
• Estimate of σ
S = 4.27444
based on 80 degrees of freedom
(= 82 obs – 2 estimated parameters)
• Proportion of variation explained
R-Sq = 86.9%
multiple R² coefficient – here squared correlation coefficient

Analysis of Variance
Source           DF       SS      MS       F      P
Regression        1   9671.1  9671.1  529.32  0.000
Residual Error   80   1461.7    18.3
Total            81  11132.8

F-test gives overall test of significance of regression
For simple regression equivalent to the t-test for the slope estimate
Residual Error MS gives error variance s²
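The summary statistics follow from the sums of squares by simple arithmetic. A sketch that recomputes them from the rounded table entries, so the results agree with the printed output only up to rounding; the variable names are illustrative.

```python
import math

# Values copied from the Analysis of Variance table
ss_regression = 9671.1
ss_residual = 1461.7
ss_total = 11132.8
df_regression, df_residual, df_total = 1, 80, 81

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual        # error variance s^2
s = math.sqrt(ms_residual)                     # estimate of sigma
f_statistic = ms_regression / ms_residual      # overall F-test
r_squared = ss_regression / ss_total           # proportion of variation explained
r_squared_adj = 1 - (ss_residual / df_residual) / (ss_total / df_total)

t_slope = 23.01  # t-value for the slope from the coefficient table

print(f"s = {s:.3f}, F = {f_statistic:.1f}")
print(f"R-Sq = {r_squared:.1%}, R-Sq(adj) = {r_squared_adj:.1%}")
print(f"t^2 for slope = {t_slope ** 2:.1f} (compare with F)")
```

The squared slope t-value and the F statistic agree up to rounding, which is the equivalence the slide refers to.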
Fitted Line Plot
edible = 4.423 + 0.1327 shell
[Figure: fitted line plot of edible (0 to 50) against shell (0 to 400); legend: S = 4.27444, R-Sq = 86.9%, R-Sq(adj) = 86.7%]

Model Checking
Key components for model checking are:
• Fitted values
• Residuals
– observed - fitted
Can be obtained as additional output.
Basic diagnostic plots can also be requested
Minitab automatically prints out a list of points with large residuals (R) or large influence (X)
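Fitted values and residuals can be sketched directly from the reported regression equation. The coefficients are those in the output; the (shell, edible) pairs are hypothetical stand-ins, since the raw mussel data are not reproduced in these slides.

```python
# Coefficients from the regression output
INTERCEPT = 4.4234
SLOPE = 0.132712


def fitted_value(shell):
    """Fitted edible mass for a given shell mass."""
    return INTERCEPT + SLOPE * shell


def residual(edible, shell):
    """Residual = observed - fitted."""
    return edible - fitted_value(shell)


# Hypothetical (shell, edible) observations, for illustration only
observations = [(50.0, 12.0), (150.0, 23.5), (300.0, 47.0)]

residuals = [residual(edible, shell) for shell, edible in observations]
for (shell, edible), res in zip(observations, residuals):
    print(f"shell = {shell:5.1f}  observed = {edible:5.1f}  "
          f"fitted = {fitted_value(shell):6.3f}  residual = {res:6.3f}")
```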
Basic Residual Plots
• Plot residuals vs x-variable
– Linearity
• Plot residuals vs fitted values
– Linearity
– Constant variance
• Normal probability plot of residuals
– Normality
Plot residuals against included variables and other variables of potential importance
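Residual plots are often drawn with standardized residuals. A minimal sketch, assuming the usual definition for simple regression (residual divided by s times the square root of one minus the leverage). The data are hypothetical, and the cutoff of 2 for "large" is an assumed convention, not something stated in the slides.

```python
import math
from statistics import mean


def standardized_residuals(x, y):
    """Return (fitted, residuals, standardized residuals) for a least squares line."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar

    fitted = [intercept + slope * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]

    # Residual standard deviation on n - 2 degrees of freedom
    s = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))
    # Leverage of each point in a simple linear regression
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    std_resid = [e / (s * math.sqrt(1 - h)) for e, h in zip(resid, leverage)]
    return fitted, resid, std_resid


# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 8.0]

fitted, resid, std_resid = standardized_residuals(x, y)
large = [i for i, r in enumerate(std_resid) if abs(r) > 2]  # assumed cutoff
print("standardized residuals:", [round(r, 2) for r in std_resid])
print("indices with |standardized residual| > 2:", large)
```

Plotting `std_resid` against `x` or `fitted` gives the linearity and constant-variance checks listed above.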
Residual Plots for edible
[Figure: four-panel residual plots. Panels: Normal Probability Plot of the Residuals (Percent vs Standardized Residual); Residuals Versus the Fitted Values (Standardized Residual vs Fitted Value); Histogram of the Residuals (Frequency vs Standardized Residual); Residuals Versus the Order of the Data (Standardized Residual vs Observation Order)]

Interval Estimates
• Parameter estimates: coefficient ± 2 * se of coefficient
• Fitted values – confidence interval
• Predicted values – prediction interval

New Obs  shell             Fit  SE Fit            95% CI            95% PI
      1  150            24.330   0.495  (23.344, 25.316)  (15.767, 32.894)
      1  400 (extreme)  57.508   1.661  (54.203, 60.813)  (48.382, 66.634) XX

What do we see?
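The two kinds of interval can be reconstructed from Fit, SE Fit and S. A sketch assuming the standard formulas: the confidence interval uses SE Fit alone, and the prediction interval adds the error variance. The critical value 1.990 is an assumption taken from t tables for 80 degrees of freedom, and the function name is illustrative.

```python
import math

S = 4.27444        # residual standard deviation from the output
T_CRIT = 1.990     # assumed 97.5% point of t on 80 df (from tables)


def interval_estimates(fit, se_fit, s=S, t_crit=T_CRIT):
    """Return (95% CI for the mean response, 95% PI for a new observation)."""
    ci_half = t_crit * se_fit
    pi_half = t_crit * math.sqrt(s ** 2 + se_fit ** 2)
    return (fit - ci_half, fit + ci_half), (fit - pi_half, fit + pi_half)


# Fit and SE Fit copied from the output for shell = 150 and shell = 400
ci_150, pi_150 = interval_estimates(24.330, 0.495)
ci_400, pi_400 = interval_estimates(57.508, 1.661)

# Rule-of-thumb interval for the slope: coefficient +/- 2 * se
slope, se_slope = 0.132712, 0.005768
slope_interval = (slope - 2 * se_slope, slope + 2 * se_slope)

print("shell = 150: CI", ci_150, "PI", pi_150)
print("shell = 400: CI", ci_400, "PI", pi_400)
print("slope interval:", slope_interval)
```

The prediction interval is much wider than the confidence interval because it includes the variability of a single new observation, not just the uncertainty in the fitted line.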