9. SIMPLE LINEAR REGRESSION AND CORRELATION

Contents
9.1  Regression and Correlation
9.2  Regression Model
9.3  Probabilistic Models
9.4  Fitting the Model: The Least-Squares Approach
9.5  The Least-Squares Lines
9.6  The Least-Squares Assumption
9.7  Model Assumptions of Simple Regression
9.8  Assessing the Utility of the Model: Making Inferences About the Slope
9.9  The Coefficient of Correlation
9.10 Calculating r²
9.11 Correlation Model
9.12 Correlation Coefficient
9.13 The Coefficient of Determination
9.14 Using the Model for Estimation and Prediction
9.1 Regression and Correlation
• Regression: helps ascertain the probable form of the relationship between variables and allows us to predict or estimate the value of one variable corresponding to a given value of another variable.
• Correlation: measures the strength of the relationship between variables.
9.2 Regression Model
Two variables, X and Y, are of interest, where
X = independent variable
Y = dependent variable

The model is

y = \beta_0 + \beta_1 x + \varepsilon

so that E(y) = β₀ + β₁x, where
ε = random error component
β₀ (beta zero) = y-intercept of the line
β₁ (beta one) = slope of the line; the amount of increase or decrease in the deterministic component of y for every one-unit increase or decrease in x.
Figure 9a. Regression Model
9.3 Probabilistic Models
9.3.1 General Form of Probabilistic Models
y = Deterministic component + Random error
Where y is the variable of interest. We always assume that the mean
value of the random error equals 0. This is equivalent to assuming
that the mean value of y, E(y), equals the deterministic component
of the model; that is,
E(y) = Deterministic component
9.3.2 A First-Order (Straight-Line) Probabilistic Model

y = \beta_0 + \beta_1 x + \varepsilon

where
y = dependent or response variable (variable to be modeled)
x = independent or predictor variable (variable used as a predictor of y)
E(y) = \beta_0 + \beta_1 x = deterministic component
ε (epsilon) = random error component
β₀ (beta zero) = y-intercept of the line, that is, the point at which the line intercepts or cuts through the y-axis (see Figure 9b below)
β₁ (beta one) = slope of the line, that is, the amount of increase (or decrease) in the deterministic component of y for every one-unit increase in x.
Figure 9b. The straight-line model
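To make the probabilistic model concrete, here is a minimal Python sketch that simulates data from a first-order model; the parameter values β₀ = 1, β₁ = 0.7, and σ = 0.5 are illustrative assumptions, not values from the text:

    import numpy as np

    rng = np.random.default_rng(seed=1)

    beta0, beta1, sigma = 1.0, 0.7, 0.5   # assumed illustrative parameters
    x = np.linspace(1, 5, 25)             # settings of the independent variable

    eps = rng.normal(0.0, sigma, size=x.size)  # random error with mean 0
    y = beta0 + beta1 * x + eps                # observed response

    # E(y) equals the deterministic component because E(eps) = 0
    print("E(y) at x = 3:", beta0 + beta1 * 3)
    print("sample mean of the errors:", eps.mean())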
9.4 Fitting the Model: The Least-Squares Approach
Table 9a. Reaction Time Versus Drug Percentage

Subject   Amount of Drug x (%)   Reaction Time y (seconds)
   1               1                        1
   2               2                        1
   3               3                        2
   4               4                        2
   5               5                        4
Figure 9c. (1) Scattergram for the data in Table 9a; (2) visual straight line fitted to the data.
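The scattergram is easy to reproduce; a minimal matplotlib sketch (the plot styling is an arbitrary choice):

    import matplotlib.pyplot as plt

    x = [1, 2, 3, 4, 5]   # amount of drug (%)
    y = [1, 1, 2, 2, 4]   # reaction time (seconds)

    plt.scatter(x, y)     # scattergram of the five subjects
    plt.xlabel("Amount of Drug x (%)")
    plt.ylabel("Reaction Time y (seconds)")
    plt.title("Scattergram for Table 9a")
    plt.show()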
9.5 The Least-Squares Lines
• The least-squares line is the line obtained by the method of least squares.

\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x

where
ŷ = predicted value on the vertical axis
x = value on the horizontal axis
β̂₀ = point where the line crosses the vertical axis
β̂₁ = the amount by which y changes for each one-unit change in x.
9.5.1 Definition of the Least-Squares Line
The least-squares line is the one that has the following two properties:
1. The sum of the errors (SE) equals 0.
2. The sum of squared errors (SSE) is smaller than that for any other straight-line model.
9.5.2 Formulas for the Least-Squares Estimates

Slope: \hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}}

y-intercept: \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

where

SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}

SS_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n}

n = sample size
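Applying these formulas to the Table 9a data gives SS_xy = 7, SS_xx = 10, and hence β̂₁ = 0.7 and β̂₀ = −0.1; a minimal Python sketch of the computation:

    import numpy as np

    x = np.array([1, 2, 3, 4, 5], dtype=float)  # amount of drug (%)
    y = np.array([1, 1, 2, 2, 4], dtype=float)  # reaction time (s)
    n = x.size

    SS_xy = np.sum(x * y) - x.sum() * y.sum() / n   # 7.0
    SS_xx = np.sum(x**2) - x.sum()**2 / n           # 10.0

    b1 = SS_xy / SS_xx             # slope estimate: 0.7
    b0 = y.mean() - b1 * x.mean()  # intercept estimate: -0.1

    print(f"least-squares line: y-hat = {b0:.1f} + {b1:.1f} x")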
Figure 9d. Scatter Diagram
The total deviation measures the vertical distance of an observed value y_i from the mean ȳ:

(y_i - \bar{y})

The explained deviation shows how much of the total deviation is accounted for when the regression line is fitted to the points:

(\hat{y}_i - \bar{y})

The unexplained deviation is the portion of the total deviation not accounted for by the regression line, that is, the vertical distance from the observed value to the fitted line:

(y_i - \hat{y}_i)

The three are related by

(y_i - \bar{y}) = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)
total deviation = explained deviation + unexplained deviation
The total sum of squares (SST) measures the total variation in the observed values of Y:

SST = \sum (y_i - \bar{y})^2

The explained sum of squares (SSR) measures the amount of the total variability in the observed values of Y that is accounted for by the linear relationship between the observed values of X and Y:

SSR = \sum (\hat{y}_i - \bar{y})^2

The unexplained sum of squares (SSE) measures the dispersion of the observed Y values about the regression line:

SSE = \sum (y_i - \hat{y}_i)^2
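The decomposition SST = SSR + SSE can be verified numerically for the Table 9a data (SST = 6.0, SSR = 4.9, SSE = 1.1), as in this short sketch:

    import numpy as np

    x = np.array([1, 2, 3, 4, 5], dtype=float)
    y = np.array([1, 1, 2, 2, 4], dtype=float)

    y_hat = -0.1 + 0.7 * x   # fitted values from the least-squares line

    SST = np.sum((y - y.mean())**2)      # total sum of squares: 6.0
    SSR = np.sum((y_hat - y.mean())**2)  # explained sum of squares: 4.9
    SSE = np.sum((y - y_hat)**2)         # unexplained sum of squares: 1.1

    print(SST, SSR + SSE)  # the two totals should agree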
9.6 The Least-Squares Assumption

Consider now a reasonable criterion for estimating α and β from data. The method of ordinary least squares (OLS) determines values of α and β (since these will be estimated from data, we will replace α and β with the Latin letters a and b) so that the sum of the squared vertical deviations (residuals) between the data and the fitted line,

Residuals = Data − Fit,

is less than the sum of the squared vertical deviations from any other straight line that could be fitted through the data:

minimize \sum (\text{Data} - \text{Fit})^2

A "vertical deviation" is the vertical distance from an observed point to the line. Each deviation in the sample is squared, and the least-squares line is defined to be the straight line that makes the sum of these squared deviations a minimum:

Data = a + bX + Residuals.
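A quick brute-force check that no nearby line beats the least-squares fit; np.polyfit supplies the OLS coefficients, and the perturbation offsets below are arbitrary assumptions:

    import numpy as np

    x = np.array([1, 2, 3, 4, 5], dtype=float)
    y = np.array([1, 1, 2, 2, 4], dtype=float)

    b, a = np.polyfit(x, y, deg=1)  # OLS slope b and intercept a

    def sum_sq_resid(a_, b_):
        """Sum of squared residuals for the line a_ + b_ * X."""
        return np.sum((y - (a_ + b_ * x))**2)

    print("OLS line:", sum_sq_resid(a, b))   # the minimum (1.1 here)
    for da, db in [(0.2, 0.0), (0.0, 0.1), (-0.1, -0.1)]:
        print("perturbed line:", sum_sq_resid(a + da, b + db))  # always larger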
Figure 9e (a) illustrates the regression relationship between two variables, Y and X. The arithmetic mean of the observed values of Y is denoted by ȳ. The vertical dashed lines represent the total deviations of each value y from the mean value ȳ. Part (b) of Figure 9e shows a linear least-squares regression line fitted to the observed points.

Figure 9e. The total variation of Y and the least-squares regression between Y and X. (a) Total variation. (b) Least-squares regression.
The total variation can be expressed in terms of (1) the variation explained by the regression and (2) a residual portion called the unexplained variation. Figure 9f (a) shows the explained variation, which is expressed by the vertical distance between any fitted (predicted) value and the mean, ŷᵢ − ȳ. The circumflex (^) over the y is used to represent fitted values determined by a model. Thus, it is also customary to write a = α̂ and b = β̂. Figure 9f (b) shows the unexplained or residual variation, the vertical distance between the observed values and the predicted values, (yᵢ − ŷᵢ).

Figure 9f. The explained and unexplained variation in least-squares regression. (a) Explained variation. (b) Unexplained variation.
9.7 Model Assumptions of Simple Regression

y = \beta_0 + \beta_1 x + \varepsilon
Assumption 1:
The mean of the probability distribution of ε is 0. That is, the average of the values of ε over an infinitely long series of experiments is 0 for each setting of the independent variable x. This assumption implies that the mean value of y, E(y), for a given value of x is E(y) = β₀ + β₁x.
Assumption 2:
The variance of the probability distribution of ε is constant for all settings of the independent variable x. For our straight-line model, this assumption means that the variance of ε is equal to a constant, say σ², for all values of x.
Assumption 3:
The probability distribution of ε is normal.
Assumption 4:
The values of ε associated with any two observed values of y are independent. That is, the value of ε associated with one value of y has no effect on the values of ε associated with other y values.
Figure 9g. The probability distribution of ε.
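The four assumptions are easy to mimic in a simulation. The sketch below (parameter values are illustrative assumptions) draws errors that are independent, normal, mean 0, and of constant variance, then checks the mean and spread of y at several settings of x:

    import numpy as np

    rng = np.random.default_rng(seed=42)

    beta0, beta1, sigma = -0.1, 0.7, 0.6   # assumed illustrative values
    x = np.repeat([1.0, 3.0, 5.0], 2000)   # many observations per x setting

    eps = rng.normal(0.0, sigma, size=x.size)  # independent N(0, sigma^2) errors
    y = beta0 + beta1 * x + eps

    for xp in [1.0, 3.0, 5.0]:
        sel = x == xp
        # mean of y near beta0 + beta1*xp; spread near sigma at every x
        print(xp, y[sel].mean().round(2), y[sel].std().round(2))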
9.8 Assessing the Utility of the Model: Making Inferences About the Slope

9.8.1 A Test of Model Utility: Simple Linear Regression

Test statistic: t = \frac{\hat{\beta}_1}{s / \sqrt{SS_{xx}}}

One-tailed test: H₀: β₁ = 0 versus Hₐ: β₁ < 0 (or β₁ > 0); reject H₀ if t < −t_α (or t > t_α).
Two-tailed test: H₀: β₁ = 0 versus Hₐ: β₁ ≠ 0; reject H₀ if |t| > t_{α/2}.

where t_α and t_{α/2} are based on (n − 2) degrees of freedom.
Assumptions: refer to the four assumptions about ε in Section 9.7.

Figure 9h. Rejection region and calculated t value for testing H₀: β₁ = 0 versus Hₐ: β₁ ≠ 0.
9.8.2 A 100(1 − α)% Confidence Interval for the Simple Linear Regression Slope β₁

\hat{\beta}_1 \pm t_{\alpha/2} \, s_{\hat{\beta}_1}

where the estimated standard error of β̂₁ is calculated by

s_{\hat{\beta}_1} = \frac{s}{\sqrt{SS_{xx}}}

and t_{α/2} is based on (n − 2) degrees of freedom.
Assumptions: refer to the four assumptions about ε in Section 9.7.
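Putting 9.8.1 and 9.8.2 together for the Table 9a data (a minimal sketch; scipy is used only for the t critical value):

    import numpy as np
    from scipy import stats

    x = np.array([1, 2, 3, 4, 5], dtype=float)
    y = np.array([1, 1, 2, 2, 4], dtype=float)
    n = x.size

    SS_xx = np.sum((x - x.mean())**2)                      # 10.0
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / SS_xx   # 0.7
    b0 = y.mean() - b1 * x.mean()                          # -0.1

    SSE = np.sum((y - (b0 + b1 * x))**2)   # 1.1
    s = np.sqrt(SSE / (n - 2))             # estimate of sigma
    se_b1 = s / np.sqrt(SS_xx)             # estimated standard error of b1

    t_stat = b1 / se_b1                    # about 3.66
    t_crit = stats.t.ppf(0.975, df=n - 2)  # 3.182 for 3 df

    print("t =", round(t_stat, 2), "-> reject H0?", abs(t_stat) > t_crit)
    print("95% CI for the slope:", (b1 - t_crit * se_b1, b1 + t_crit * se_b1))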
9.9 The Coefficient of Correlation
Definition: The Pearson product-moment coefficient of correlation, r, is a measure of the strength of the linear relationship between two variables x and y. It is computed (for a sample of n measurements on x and y) as follows:

r = \frac{SS_{xy}}{\sqrt{SS_{xx} \, SS_{yy}}}
Figure 9i. Values of r and their implications:
1) Positive r: y increases as x increases
2) r near zero: little or no relationship between y and x
3) Negative r: y decreases as x increases
4) r = 1: a perfect positive relationship between y and x
5) r = −1: a perfect negative relationship between y and x
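For the Table 9a data, r follows directly from the sums of squares and can be cross-checked against np.corrcoef:

    import numpy as np

    x = np.array([1, 2, 3, 4, 5], dtype=float)
    y = np.array([1, 1, 2, 2, 4], dtype=float)

    SS_xy = np.sum((x - x.mean()) * (y - y.mean()))   # 7.0
    SS_xx = np.sum((x - x.mean())**2)                 # 10.0
    SS_yy = np.sum((y - y.mean())**2)                 # 6.0

    r = SS_xy / np.sqrt(SS_xx * SS_yy)   # about 0.90: strong positive
    print(r, np.corrcoef(x, y)[0, 1])    # the two values should agree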
9.10 Calculating r²

r² is the square of the sample correlation coefficient:

r^2 = (r)^2

where r = the sample correlation.

Fig 9j. r² as a measure of closeness of fit of the sample regression line to the sample observations.
9.11 Correlation Model

• We have what is called the correlation model when Y and X are both random variables.
• Involving two variables implies a co-relationship between them.
• One variable is treated as dependent and the other as independent.
9.12 The Correlation Coefficient (ρ)

• Measures the strength of the linear relationship between X and Y.
• May assume any value between −1 and +1.
• If ρ = 1, there is perfect direct linear correlation.
• If ρ = −1, there is perfect inverse linear correlation.
9.13 The Coefficient of Determination

Figure 9k. A comparison of the sums of squares of deviations for two models.
9.13.1 Coefficient of Determination Definition

r^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}}

• It represents the proportion of the total sample variability around ȳ that is explained by the linear relationship between y and x. (In simple linear regression, it may also be computed as the square of the coefficient of correlation r.)
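For the Table 9a data this gives r² = 1 − 1.1/6 ≈ 0.82, so about 82% of the sample variation in reaction time is explained by the linear relationship with drug percentage; a quick check:

    import numpy as np

    x = np.array([1, 2, 3, 4, 5], dtype=float)
    y = np.array([1, 1, 2, 2, 4], dtype=float)

    y_hat = -0.1 + 0.7 * x              # fitted values
    SS_yy = np.sum((y - y.mean())**2)   # 6.0
    SSE = np.sum((y - y_hat)**2)        # 1.1

    r2 = 1 - SSE / SS_yy
    print(round(r2, 4), round(np.sqrt(r2), 4))  # 0.8167 and its square root, r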
9.14 Using the Model for Estimation and Prediction

9.14.1 Sampling Errors for the Estimator of the Mean of y and the Predictor of an Individual New Value of y

1. The standard deviation of the sampling distribution of the estimator ŷ of the mean of y at a specific value of x, say x_p, is

\sigma_{\hat{y}} = \sigma \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}

where σ is the standard deviation of the random error ε. We refer to σ_ŷ as the standard error of ŷ.
2. The standard deviation of the prediction error for the predictor ŷ of an individual new y value at a specific value of x is

\sigma_{(y - \hat{y})} = \sigma \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}

where σ is the standard deviation of the random error ε. We refer to σ_(y−ŷ) as the standard error of prediction.
9.14.2 A 100(1 − α)% Confidence Interval for the Mean Value of y at x = x_p

\hat{y} \pm t_{\alpha/2} \, s \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}

where t_{α/2} is based on (n − 2) degrees of freedom.
9.14.3 A 100(1 − α)% Prediction Interval for an Individual New Value of y at x = x_p

\hat{y} \pm t_{\alpha/2} \, s \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}

where t_{α/2} is based on (n − 2) degrees of freedom.
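For the Table 9a data at x_p = 4 (the value used in Figure 9l), a minimal sketch of both intervals:

    import numpy as np
    from scipy import stats

    x = np.array([1, 2, 3, 4, 5], dtype=float)
    y = np.array([1, 1, 2, 2, 4], dtype=float)
    n, xp = x.size, 4.0

    SS_xx = np.sum((x - x.mean())**2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / SS_xx
    b0 = y.mean() - b1 * x.mean()

    s = np.sqrt(np.sum((y - (b0 + b1 * x))**2) / (n - 2))
    y_hat = b0 + b1 * xp                    # point estimate at x = 4: 2.7
    t = stats.t.ppf(0.975, df=n - 2)        # 95% intervals, 3 df

    half_ci = t * s * np.sqrt(1/n + (xp - x.mean())**2 / SS_xx)
    half_pi = t * s * np.sqrt(1 + 1/n + (xp - x.mean())**2 / SS_xx)

    print("95% CI for mean of y:", (y_hat - half_ci, y_hat + half_ci))
    print("95% PI for new y:    ", (y_hat - half_pi, y_hat + half_pi))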
Figure 9l. A 95% confidence interval for the mean value of y and a prediction interval for an individual new value of y when x = 4.
Figure 9m. Error of estimating the mean value of y for a given value of x.
Figure 9n. Error of predicting a future value of y for a given value of x.
Figure 9o. Confidence intervals for mean values and prediction intervals for new values.