Correlation and Simple Linear Regression Analysis
Lecture Number 12
March 7, 2017
Outline of Lecture 12
1. Introduction
2. Scatter Plots for Two Quantitative Variables
3. Correlation Analysis
4. Calculating the Sample Correlation Coefficient r
5. Assumptions of Correlation Analysis
6. Simple Linear Regression Analysis
7. ANOVA for Simple Linear Regression Analysis
8. Inferences Concerning β1, the Slope of the Line
9. Measuring the Strength of the Relationship: Coefficient of Determination R²
Introduction
Biologists, and researchers in general, commonly record or observe
more than one variable from each sampling or experimental unit.
For example, a forester may measure tree height and dbh on each
tree, an aquaculturist may measure the length and weight of each fish
in an aquarium, a physiologist may record blood pressure and body
weight from experimental animals, or an ecologist may record the
abundance of a particular species of shrub and soil pH from a series of
plots during vegetation sampling.
When two variables are measured on a single experimental unit, the
resulting data are called bivariate data; when more than two random
variables are recorded from each unit, the data are called multivariate.
The focus for this lecture will be bivariate quantitative data.
Note that "bi" means "two"; thus bivariate data consist of pairs of
measurements or observations.
Introduction...cont’d
With such (bivariate) data, we may ask the following questions:
(1) Are the two variables related?
(2) If so, what is the strength of the relationship?
(3) What type of relationship exists?
(4) What kind of predictions can be made from the relationship?
The first two questions can be answered by carrying out a statistical
technique called Correlation Analysis while the last two can be
answered using Regression Analysis.
Correlation is a statistical method used to determine whether a
relationship between variables exists and the strength of the
relationship.
Regression is a statistical method used to describe the nature of the
relationship between variables, that is, positive or negative, linear or
nonlinear.
Introduction...cont’d
The relationship between variables may be simple or multiple.
In a simple relationship, there are two variables - an independent
variable, also called an explanatory variable or a predictor variable,
and a dependent variable, also called a response variable.
Simple relationships can also be positive or negative. A positive
relationship exists when both variables increase or decrease at the
same time.
In a negative relationship, as one variable increases, the other
variable decreases, and vice versa.
Thus, in simple correlation and regression studies, the researcher
collects data on two numerical or quantitative variables to see
whether a relationship exists between the variables.
Scatter Plots for Two Quantitative Variables
Consider a random sample of n observations of the form
(x1, y1), (x2, y2), ..., (xn, yn), where x is the independent variable and
y is the dependent variable, both being scalars.
A preliminary descriptive technique for determining the form of
relationship between x and y is the scatter diagram or scatter plot.
Each pair of data values is plotted as a point on this two-dimensional
graph (so that the graph takes the form of a plot on the (x, y) axes),
called a scatter plot.
Scatter Plot
A scatter plot is a graph of the ordered pairs (x, y) of numbers or values
consisting of the independent variable x and the dependent variable y.
A scatter plot is the two-dimensional extension of the dotplot we use
to graph one quantitative variable.
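As a quick illustration of how such a plot is produced in practice, here is a minimal sketch in Python using matplotlib (an assumed tool, not part of the lecture), with the absences-and-grades data that appears later in this lecture:

```python
# A minimal scatter plot sketch; matplotlib is assumed to be available.
import matplotlib.pyplot as plt

x = [6, 2, 15, 9, 12, 5, 8]       # independent variable: number of absences
y = [82, 86, 43, 74, 58, 90, 78]  # dependent variable: final grade (%)

plt.scatter(x, y)                 # one point per (x, y) pair
plt.xlabel("Number of absences (x)")
plt.ylabel("Final grade y (%)")
plt.show()
```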
Scatter Plots for Two Quantitative Variables
After the scatter plot is drawn, it should be analyzed to determine
which type of relationship, if any, exists.
You can describe the relationship between two variables, x and y,
using the patterns shown in the scatter plot:
(i) What type of pattern do you see? Is there a constant upward or
downward trend that follows a straight-line pattern? Is there a curved
pattern? Is there no pattern at all, but just a random scattering of
points?
(ii) How strong is the pattern? Do all of the points follow the pattern
exactly, or is the relationship only weakly visible?
(iii) Are there any unusual observations (outliers)? An outlier is a
point that is far from the cluster of the remaining points. Do the
points cluster into groups? If so, is there an explanation for the
observed groupings?
Some examples of scatter plots and the patterns of relationship
suggested by the data are shown in the figure below.
[Figure: example scatter plots illustrating patterns such as positive linear, negative linear, curvilinear, and no apparent relationship.]
Correlation Analysis
As earlier stated, correlation is used to determine if a relationship
exists between two quantitative variables.
A numerical measure used to determine whether two or more
variables are related and to determine the strength of the relationship
between or among the variables is called a correlation coefficient.
There are several types of correlation coefficients but our focus will be
on the Pearson product moment correlation coefficient (PPMC).
The correlation coefficient computed from the sample data measures the
strength and direction of a linear relationship between two variables. The
symbol for the sample correlation coefficient is r while the symbol for the
population correlation coefficient is ρ (Greek letter rho).
The range of the correlation coefficient is from −1 to +1.
If there is a strong positive linear relationship between the
variables, the value of r will be close to +1.
Correlation Analysis
If there is a strong negative linear relationship between the
variables, the value of r will be close to −1.
When there is no linear relationship between the variables or only
a weak relationship, the value of r will be close to 0.
Calculating the Sample Correlation Coefficient r
Let (x1, y1), (x2, y2), ..., (xn, yn) be a random sample (of size n) from
a bivariate normal distribution.
The estimator of ρ is the sample correlation coefficient, denoted
ρ̂ or r:

$$ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \, \sum_{i=1}^{n}(y_i - \bar{y})^2}} = \frac{s_{xy}}{s_x s_y} $$
The quantities in the denominator, i.e. sx and sy, are the standard
deviations for the variables x and y, respectively, which can be found
by using the statistics function on your calculator or the computing
formulas earlier discussed.
The new quantity in the numerator, sxy, is called the covariance
between x and y and is defined as:
$$ s_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1} = \frac{\sum_{i=1}^{n} x_i y_i - \frac{(\sum_{i=1}^{n} x_i)(\sum_{i=1}^{n} y_i)}{n}}{n-1} $$
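As a sketch of these definitions in code (plain Python; the function name sample_correlation is illustrative), r can be computed directly from the covariance and the two standard deviations:

```python
import math

def sample_correlation(x, y):
    """Pearson sample correlation r = s_xy / (s_x * s_y)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    # Sample covariance s_xy, divided by n - 1 as in the definition above.
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    # Sample standard deviations s_x and s_y.
    s_x = math.sqrt(sum((a - x_bar) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - y_bar) ** 2 for b in y) / (n - 1))
    return s_xy / (s_x * s_y)
```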
Calculating the Sample Correlation Coefficient r
When r > 0, values of y increase as the values of x increase, and the
data set (for the two variables) is said to be positively correlated.
When r < 0, values of y decrease as the values of x increase, and the
data set is said to be negatively correlated.
Example
(1) Construct a scatter plot for the data obtained in a study on the number of
absences and the final grades of seven randomly selected students from a
Biometry class.
(2) Compute the correlation coefficient for the data.
The data are shown below.
Note that r can easily be computed using the formula:
$$ r = \frac{n\sum_{i=1}^{n} x_i y_i - (\sum_{i=1}^{n} x_i)(\sum_{i=1}^{n} y_i)}{\sqrt{\left[n\sum_{i=1}^{n} x_i^2 - (\sum_{i=1}^{n} x_i)^2\right]\left[n\sum_{i=1}^{n} y_i^2 - (\sum_{i=1}^{n} y_i)^2\right]}} $$
Calculating the Sample Correlation Coefficient r
Student | Number of absences x | Final grade y (%)
A       | 6                    | 82
B       | 2                    | 86
C       | 15                   | 43
D       | 9                    | 74
E       | 12                   | 58
F       | 5                    | 90
G       | 8                    | 78
For the scatter plot,
(1) Draw and label the x and y axes using an appropriate scale, e.g. for %
you could use a 10% interval from 30% to 100%; while for number of
absences, from 0 to 15, with a 1 unit interval.
(2) Plot each point on the graph.
[Figure: scatter plot of final grade y (%) against number of absences x for the seven students, showing a downward linear trend.]
Calculating the Sample Correlation Coefficient r
For computation of r, proceed as follows:
(1) Make a table comprising at least 5 columns: for x, y, xy, x² and y².
(2) Compute the sums of the values of x, y, xy, x² and y², and place
these sums in the corresponding columns of the table.
Student |  x |   y |    xy |  x² |     y²
A       |  6 |  82 |   492 |  36 |  6,724
B       |  2 |  86 |   172 |   4 |  7,396
C       | 15 |  43 |   645 | 225 |  1,849
D       |  9 |  74 |   666 |  81 |  5,476
E       | 12 |  58 |   696 | 144 |  3,364
F       |  5 |  90 |   450 |  25 |  8,100
G       |  8 |  78 |   624 |  64 |  6,084
Σ       | 57 | 511 | 3,745 | 579 | 38,993
(3) Substitute in the formula and solve for r .
Calculating the Sample Correlation Coefficient r
r is calculated as:

$$ r = \frac{n\sum_{i=1}^{n} x_i y_i - (\sum_{i=1}^{n} x_i)(\sum_{i=1}^{n} y_i)}{\sqrt{\left[n\sum_{i=1}^{n} x_i^2 - (\sum_{i=1}^{n} x_i)^2\right]\left[n\sum_{i=1}^{n} y_i^2 - (\sum_{i=1}^{n} y_i)^2\right]}} = \frac{7(3{,}745) - (57)(511)}{\sqrt{\left[7(579) - 57^2\right]\left[7(38{,}993) - 511^2\right]}} = -0.944 $$
The value of r suggests a strong negative relationship between a
student’s final grade and the number of absences a student has. That
is, the more absences a student has, the lower is his or her grade.
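The arithmetic above is easy to check in code. A sketch using plain Python and the computational formula for r:

```python
import math

x = [6, 2, 15, 9, 12, 5, 8]       # number of absences
y = [82, 86, 43, 74, 58, 90, 78]  # final grade (%)
n = len(x)

num = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
den = math.sqrt((n * sum(a * a for a in x) - sum(x) ** 2) *
                (n * sum(b * b for b in y) - sum(y) ** 2))
print(round(num / den, 3))        # -0.944
```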
Note: We can conduct a test of hypothesis concerning the
correlation coefficient. In that case:
H0: ρ = 0 and Ha: ρ ≠ 0
We can also construct a confidence interval for ρ.
Assumptions of Correlation Analysis
The methods used to estimate and test a hypothesis on a population
correlation are based on assumptions that:
(1) the sample of data sets is a random sample from the population; and
(2) the measurements or observations have a bivariate normal
distribution in the population.
A bivariate normal distribution is a bell-shaped probability distribution
in two dimensions rather than one.
A bivariate normal distribution has the following features:
(1) the relationship between the two variables, say x and y , is linear;
(2) the cloud of points in a scatter plot of x and y has an elliptical shape;
and
(3) the frequency distributions of x and y separately are normal.
Calculating the Sample Correlation Coefficient r
Exercise
Consider the set of bivariate data given below:
x | 1   | 2   | 3   | 4   | 5   | 6
y | 5.6 | 4.6 | 4.5 | 3.7 | 3.2 | 2.7
(a) Draw a scatter plot to describe the data.
(b) Does there appear to be a relationship between x and y? If so, how
do you describe it?
(c) Calculate the correlation coefficient, r. Does the value of r confirm
your conclusions in part (b)? Explain.
Simple Linear Regression Analysis
If one of the two variables can be classified as the dependent variable
y and the other as the independent variable x, and if the data exhibit a
straight-line pattern (i.e. if the value of the correlation coefficient is
significant), it is possible to describe the relationship relating y to x
using a straight line given by the equation:

y = a + bx
The relationship can be shown as in the figure below.

[Figure: a straight line y = a + bx crossing the y-axis at a, with slope b.]
Simple Linear Regression Analysis
From the figure on the previous slide, we see that a is where the line
crosses or intersects the y-axis: a is called the y-intercept.
We can also see that for every one-unit increase in x, y increases by
an amount b.
The quantity b determines whether the line is increasing (b > 0),
decreasing (b < 0), or horizontal (b = 0) and is appropriately called
the slope of the line.
The scatter diagrams or plots we constructed showed that not all the
data values (x, y) fall on a straight line, but they do show a trend that
could be described as a linear pattern.
We can describe this trend by fitting a line as best we can through
the points.
Thus, given a scatter plot, you must be able to draw the line of
best fit.
Simple Linear Regression Analysis
Best fit means that the sum of the squares of the vertical distances
from each point to the line is at a minimum.
The reason you need a line of best fit is that the values of y will be
predicted from the values of x; hence, the closer the points are to the
line, the better the fit and the prediction will be.
Simple Linear Regression Analysis
The basic idea of simple linear regression is to use data to fit a
prediction line that relates a dependent variable y and a single
independent variable x.
Assuming linearity, we would like to write y as a linear function of x:
y = β0 + β1 x
However, according to such an equation, y is an exact linear function
of x; no room is left for the inevitable errors (deviation of actual y
values from their predicted values).
Therefore, corresponding to each y we introduce a random error term
εi and assume the model to be:
yi = β0 + β1 xi + εi ; i = 1, 2, ..., n.
In the above model, we assume the random variable y to be made up
of a predictable part (a linear function of x) and an unpredictable part
(the random error εi ).
Simple Linear Regression Analysis
The coefficients β0 and β1 are interpreted as the true, underlying
intercept and slope respectively.
The error term ε includes the effects of all other factors, known or
unknown, i.e. the combined effects of unpredictable and ignored
factors yield the random error terms ε.
In regression studies, the values of the independent variable (the xi
values) are usually taken as predetermined constants, so the only
source of randomness is the εi terms.
Thus, when we assume that the xi s are constants, the only random
portion of the model for yi is the random error term εi .
With these definitions, the formal assumptions of simple linear
regression analysis are as given below.
Simple Linear Regression Analysis
Assumptions of Simple Linear Regression Analysis
(1) The relation is, in fact, linear, so that the errors all have expected
value or mean zero: E(εi) = 0 for all i.
(2) The errors all have the same variance: Var(εi) = σ² for all i.
(3) The errors are independent of each other: Cov(εi, εj) = 0 for i ≠ j.
(4) The errors, εi, are all normally distributed.
Because we have assumed that E(εi) = 0, the expected value of y is
given by:

E(y) = β0 + β1x.
The estimator of E(y), denoted by ŷ, can be obtained by using
the estimators β̂0 and β̂1 of the parameters β0 and β1, respectively.
Then, the fitted regression line we are looking for is given by:

ŷ = β̂0 + β̂1x
Simple Linear Regression Analysis
The assumptions about the random error ε are illustrated in the figure
below for three fixed values of x, say x1, x2, and x3.

[Figure: identical normal distributions of y, each with variance σ², centered on the line E(y) = β0 + β1x at x1, x2, and x3.]
Simple Linear Regression Analysis
For observed values (xi, yi), we obtain the estimated value of yi as
ŷi = β̂0 + β̂1xi.
The deviation of observed yi from its predicted value ŷi, called the
i-th residual, is defined by:

εi = yi − ŷi = yi − (β̂0 + β̂1xi).

The residuals, or errors εi, are the vertical distances between the
observed and predicted values of the yi's.
The regression analysis problem is to find the best straight-line
prediction.
The most common criterion for ”best” is based on squared prediction
error.
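As a small sketch, the residuals (and the squared prediction error they produce) can be computed directly once a line has been fitted; the coefficients used here are the ones derived later in this lecture for the absences data:

```python
# Residuals e_i = y_i - (b0 + b1 * x_i) for a fitted line.
b0, b1 = 102.494, -3.622          # fitted intercept and slope (derived later)
x = [6, 2, 15, 9, 12, 5, 8]
y = [82, 86, 43, 74, 58, 90, 78]

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)   # ~183.3, the quantity OLS minimizes
```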
Simple Linear Regression Analysis
We find the equation of the prediction line - that is, the slope β̂1 and
intercept β̂0 that minimize the total squared prediction error, also
called the sum of squares for error (SSE) or the sum of squares of the
residuals, over all n data points.
The method that accomplishes this goal is called the least-squares
method or Ordinary Least Squares (OLS) because it chooses β̂0
and β̂1 to minimize the SSE:

$$ SSE = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left[ y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right]^2 $$
The least-squares estimates of slope and intercept are obtained as
follows:

$$ \hat{\beta}_1 = \frac{s_{xy}}{s_{xx}} \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} $$
Simple Linear Regression Analysis
$$ s_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{(\sum_{i=1}^{n} x_i)(\sum_{i=1}^{n} y_i)}{n} $$

$$ s_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{(\sum_{i=1}^{n} x_i)^2}{n} $$

Thus, sxy is the sum of x deviations times y deviations and sxx is the
sum of x deviations squared. (Note that, unlike the covariance defined
earlier, sxy here is not divided by n − 1; that factor cancels in the ratio
sxy/sxx, so β̂1 is the same either way.)
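These formulas translate directly into code. A minimal sketch (the function name fit_line is illustrative):

```python
def fit_line(x, y):
    """Least-squares slope and intercept via b1 = s_xy / s_xx."""
    n = len(x)
    s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    s_xx = sum(a * a for a in x) - sum(x) ** 2 / n
    b1 = s_xy / s_xx                    # slope
    b0 = sum(y) / n - b1 * sum(x) / n   # intercept: y_bar - b1 * x_bar
    return b0, b1
```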
Example
Find the equation of the regression line for the data in the example on
number of absences and final grades, and graph the line on the scatter plot.
Simple Linear Regression Analysis
The values needed for the equation are:

$$ n = 7,\; \sum x_i = 57,\; \sum y_i = 511,\; \sum x_i y_i = 3{,}745,\; \sum x_i^2 = 579,\; \sum y_i^2 = 38{,}993, $$
$$ \bar{x} = \tfrac{1}{n}\sum x_i = \tfrac{57}{7} = 8.1429 \quad \text{and} \quad \bar{y} = \tfrac{1}{n}\sum y_i = \tfrac{511}{7} = 73 $$
Substituting in the formulas, we get:

$$ \hat{\beta}_1 = \frac{s_{xy}}{s_{xx}} = \frac{n\sum x_i y_i - (\sum x_i)(\sum y_i)}{n\sum x_i^2 - (\sum x_i)^2} = \frac{7(3{,}745) - (57)(511)}{7(579) - 57^2} = -3.622 $$

$$ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 73 - (-3.622)(8.1429) = 102.494 $$

Hence, the equation of the regression line ŷ = β̂0 + β̂1x is:

ŷ = 102.494 − 3.622x
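Running the fit_line sketch from the previous section on these data reproduces the fitted coefficients, and a prediction is then a single evaluation of the line:

```python
b0, b1 = fit_line([6, 2, 15, 9, 12, 5, 8], [82, 86, 43, 74, 58, 90, 78])
print(round(b0, 3), round(b1, 3))  # 102.493 -3.622 (the slides' 102.494
                                   # comes from rounding the slope first)
print(round(b0 + b1 * 7, 2))       # 77.14: predicted grade for 7 absences
```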
Simple Linear Regression Analysis
The sign of the correlation coefficient and the sign of the slope of the
regression line will always be the same.
That is, if r is positive, then β̂1 will be positive; if r is negative, then
β̂1 will also be negative.
The reason is that the numerators of the two formulas are the same and
determine the signs of r and β̂1, while the denominators are always
positive!
When you graph the regression line, always select x values between
the smallest x data value and the largest x data value.
The regression line will always pass through the point whose x
coordinate is the mean of the x values and whose y coordinate is the
mean of the y values, that is, (x̄, ȳ ).
The regression line can be used to make predictions for the dependent
variable.
We can use the regression line to predict, for example, the final grade
for a student with 7 absences.
Simple Linear Regression Analysis
To make such a prediction, we substitute 7 for x in the equation, i.e.
ŷ = 102.494 − 3.622x = 102.494 − 3.622(7) = 77.14
Hence, a student who has 7 absences will have approximately 77.14%
as the final grade.
The magnitude of the change in one variable when the other variable
changes exactly 1 unit is called a marginal change.
The value of the slope β̂1 of the regression line equation represents the
marginal change.
For example, in the regression line we constructed, the slope is
−3.622, which means that for each additional absence, the value of y
(final grade) changes by −3.622 on average.
For valid predictions, the value of the correlation coefficient, r , must
be significant.
Simple Linear Regression Analysis
When r is not significantly different from 0, the best predictor of y is
the mean of the data values of y .
Extrapolation, or making predictions beyond the bounds of the data,
must be interpreted cautiously!
Remember that when predictions are made, they are based on present
conditions or on the premise that present trends will continue. This
assumption may or may not prove true in the future!
Note that: A scatter plot should be checked for outliers. An outlier is
a point that seems out of place when compared with the other points.
Some of these points can affect the equation of the regression line.
When this happens, the points are called influential points or
influential observations.
Points that are outliers in the x direction tend to be influential points
and judgement has to be made as to whether they should be included
in the final analysis.
ANOVA for Simple Linear Regression Analysis
We can use ANOVA and test of hypothesis to make inferences about
regression parameters for the simple regression model.
In a regression analysis, the response y is related to the independent
variable x.
Hence, the total variation in the response variable y is given by:

$$ SS_{Total} = s_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{(\sum_{i=1}^{n} y_i)^2}{n} $$
SSTotal is divided into two portions:
(1) SSR (sum of squares for regression) - measures the amount of variation
explained by using the regression line with the one independent variable x:

$$ SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 = \frac{(s_{xy})^2}{s_{xx}} $$
(2) SSE (sum of squares for error) - measures the ”residual” variation in
the data that is not explained by the independent variable x.
ANOVA for Simple Linear Regression Analysis
[Figure: partition of each deviation (yi − ȳ) into an explained part (ŷi − ȳ) and an unexplained residual part (yi − ŷi).]
ANOVA for Simple Linear Regression Analysis
Since SSTotal = SSR + SSE, we can complete the partition by
calculating:

$$ SSE = SS_{Total} - SSR = s_{yy} - \frac{(s_{xy})^2}{s_{xx}} $$
Remember from our previous discussions on CRD and RCBD that
each of the various sources of variation, when divided by the
appropriate degrees of freedom, provides an estimate for mean
squares: MS = SS/df.
The ANOVA table for simple linear regression is as shown below:

Source     | df    | SS                      | MS              | F
Regression | 1     | SSR = (sxy)²/sxx        | MSR = SSR/1     | MSR/MSE
Error      | n − 2 | SSE = syy − (sxy)²/sxx  | MSE = SSE/(n−2) |
Total      | n − 1 | syy                     |                 |
ANOVA for Simple Linear Regression Analysis
For the data on number of absences and final grade, we can compute
the quantities in the ANOVA table as follows:

$$ SS_{Total} = s_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{(\sum_{i=1}^{n} y_i)^2}{n} = 38{,}993 - \frac{(511)^2}{7} = 1{,}690 $$

$$ SSR = \frac{(s_{xy})^2}{s_{xx}} = \hat{\beta}_1 \cdot s_{xy} = (-3.622)(-416) = 1{,}506.752 $$

$$ SSE = SS_{Total} - SSR = 1{,}690 - 1{,}506.752 = 183.248 $$

$$ MSR = \frac{SSR}{1} = 1{,}506.752 \quad \text{and} \quad MSE = \frac{SSE}{n-2} = \frac{183.248}{5} = 36.6496 $$

$$ F = \frac{MSR}{MSE} = \frac{1{,}506.752}{36.6496} = 41.1124 $$
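A sketch of these ANOVA computations in code; small differences in the last decimals arise because the slides round β̂1 to −3.622 before forming SSR:

```python
x = [6, 2, 15, 9, 12, 5, 8]
y = [82, 86, 43, 74, 58, 90, 78]
n = len(x)

s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n  # -416.0
s_xx = sum(a * a for a in x) - sum(x) ** 2 / n                 # ~114.857
s_yy = sum(b * b for b in y) - sum(y) ** 2 / n                 # 1690.0

ss_total = s_yy
ssr = s_xy ** 2 / s_xx               # ~1506.71 (slides: 1506.752)
sse = ss_total - ssr                 # ~183.29  (slides: 183.248)
msr, mse = ssr / 1, sse / (n - 2)
f_stat = msr / mse
print(round(f_stat, 2))              # ~41.1
```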
ANOVA for Simple Linear Regression Analysis
The ANOVA table for our example is thus summarized as follows:

Source     | df | SS        | MS        | F
Regression | 1  | 1,506.752 | 1,506.752 | 41.1124
Error      | 5  | 183.248   | 36.6496   |
Total      | 6  | 1,690     |           |
The calculated F is then used to determine whether the regression
model constructed is significant.
The rejection region is; reject H0 if:
F > Fα (1, n − 2).
For our example, suppose we take α = 0.05, the critical F value is:
F0.05 (1, 5) = 6.608
Inferences Concerning β1, the Slope of the Line
In considering simple linear regression, we may ask two questions:
(1) Is the independent variable x useful in predicting the response variable
y?
(2) If so, how well does it work?
The first question is like asking: is the regression equation that uses
information provided by x substantially better than the simple
predictor ȳ that does not rely on x?
If the independent variable x is not useful in the population model,
y = β0 + β1 x + ε, then the value of y does not change for different
values of x.
The only way that this happens for all values of x is when the slope
β1 of the line of means equals 0.
This would indicate that the relationship between y and x is not
linear, so that the initial question about the usefulness of the
independent variable x can be restated as: Is there a linear
relationship between x and y ?
Inferences Concerning β1, the Slope of the Line
You can answer this question by using either a test of hypothesis or a
confidence interval for β1 .
Both of these procedures are based on the sampling distribution of
β̂1, the sample estimator of the slope β1.
It can be shown that, if the assumptions about the random error ε are
valid, then the estimator β̂1 has a normal distribution in repeated
sampling with mean E(β̂1) = β1 and standard error (SE) given by:

$$ SE_{\hat{\beta}_1} = \sqrt{\frac{\sigma^2}{s_{xx}}} $$
where, σ 2 is the variance of the random error ε.
Inferences Concerning β1, the Slope of the Line
Since the value of σ² is estimated with s² = MSE, you can base
inferences on the statistic

$$ t = \frac{\hat{\beta}_1 - \beta_1}{\sqrt{MSE/s_{xx}}} \quad \text{or, under } H_0: \beta_1 = 0, \quad t = \frac{\hat{\beta}_1 - 0}{\sqrt{MSE/s_{xx}}} $$

which has a t distribution with df = n − 2, the degrees of freedom
associated with MSE.
A summary of the statistical test for the slope β1 is outlined below.
The F test (given by F = MSR/MSE) obtained in the regression analysis
ANOVA table can be used as an equivalent test statistic for testing the
hypothesis H0: β1 = 0; in simple linear regression, F = t².
The F test, F = MSR/MSE, has df1 = 1 and df2 = n − 2. It is a more
general test of the usefulness of the model and can be used when the
model has more than one independent variable.
Inferences Concerning β1, the Slope of the Line
Test of Hypothesis Concerning the Slope of the Line
(1) H0: β1 = 0 and Ha: β1 ≠ 0 (two-tailed test), or Ha: β1 > 0 or
Ha: β1 < 0 (one-tailed tests).
(2) Test statistic: t = (β̂1 − 0)/√(MSE/sxx).
When the assumptions for simple linear regression are satisfied, the
test statistic will have a Student's t distribution with df = n − 2.
(3) Rejection region: reject H0 when t > tα (for Ha: β1 > 0) or
t < −tα (for Ha: β1 < 0) for the one-tailed tests, and when t > tα/2 or
t < −tα/2 for the two-tailed test. Alternatively, reject H0 when
p-value < α.
The values of tα and tα/2 are found in t-distribution tables, using
df = n − 2.
Note: A 100(1 − α)% confidence interval for β1 is found as:

β̂1 ± tα/2 √(MSE/sxx)
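A sketch of this test and the confidence interval in code, applied to the worked example; scipy is assumed to be available for the t quantiles:

```python
from scipy import stats

n = 7
mse = 36.6496                 # MSE from the ANOVA table
s_xx = 579 - 57 ** 2 / 7      # ~114.857
b1 = -3.622                   # fitted slope

se = (mse / s_xx) ** 0.5      # standard error of b1, ~0.565
t = (b1 - 0) / se             # ~ -6.41; note t**2 ~ 41.1, the ANOVA F
p = 2 * stats.t.sf(abs(t), df=n - 2)       # two-tailed p-value, < 0.05

t_crit = stats.t.ppf(0.975, df=n - 2)      # 2.571 for df = 5
ci = (b1 - t_crit * se, b1 + t_crit * se)  # ~(-5.07, -2.17)
```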
Inferences Concerning β1, the Slope of the Line
Example
Using the data for the number of absences and final grades,
(i) Test for the significance of the relationship between the two variables
using α = 0.05.
(ii) Construct a 95% confidence interval for the slope of the regression
line.
Solution
For testing the significance of the relationship between the two variables,
the test of hypothesis is as follows:
(1) H0: β1 = 0 and Ha: β1 ≠ 0
(2) α has been given as 0.05 and the test is two-tailed; the critical values
for t are: t0.025 (5) = ±2.571.
Inferences Concerning β1, the Slope of the Line
Solution...Cont’d
(3) Rejection region: Reject H0 if t < −2.571 or t > 2.571.
(4) Compute the test statistic. With MSE = 36.6496 and
sxx = 579 − 57²/7 = 114.857:

$$ t = \frac{\hat{\beta}_1 - 0}{\sqrt{MSE/s_{xx}}} = \frac{-3.622}{\sqrt{36.6496/114.857}} = \frac{-3.622}{0.565} \approx -6.41 $$

(Note that t² ≈ 41.1, matching the F statistic from the ANOVA table,
as it must in simple linear regression.)
(5) Make the decision. We reject H0 at α = 0.05 since −6.41 < −2.571.
(6) Conclusion: there is sufficient evidence of a significant linear
relationship between the final grades and the number of absences for the
Biometry class.
Inferences Concerning β1, the Slope of the Line
Solution...Cont’d
The 100(1 − α)% confidence interval for β1 is β̂1 ± tα/2 √(MSE/sxx):

= −3.622 ± 2.571(0.565)
= [−5.07, −2.17]
The resulting 95% confidence interval does not contain 0; thus we would
conclude that the true value of β1 is not 0, and thus would reject
H0: β1 = 0 in favour of Ha: β1 ≠ 0.
This conclusion is the same as the one we arrived at when we conducted
the test using the critical value approach.
Measuring the Strength of the Relationship: Coefficient of Determination R²
The strength of the relationship measures how well the regression
model fits the data.
We stated earlier that the correlation coefficient r can be used to
measure the strength of the relationship between two variables.
Closely related to r is the Coefficient of Determination R² - the
coefficient of determination is simply the square of the correlation
coefficient.
The coefficient of determination is the ratio of the explained
variation to the total variation and is denoted by either r² or R²:

$$ R^2 = \frac{\text{Explained Variation}}{\text{Total Variation}} = \frac{SSR}{SS_{Total}} = \frac{SS_{Total} - SSE}{SS_{Total}} $$
Thus, R² is a measure of the variation of the dependent variable that
is explained by the regression line and the independent variable.
R² is usually expressed as a percentage (%).
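A one-line check for the worked example (a sketch; in simple linear regression R² also equals r², so both routes agree up to rounding of the inputs):

```python
ssr, ss_total, r = 1506.752, 1690.0, -0.944
print(round(ssr / ss_total, 3))   # 0.892
print(round(r ** 2, 3))           # 0.891, agreeing up to rounding
```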
Measuring the Strength of the Relationship: Coefficient of Determination R²
For our example on number of absences and final grade,

$$ R^2 = \frac{SSR}{SS_{Total}} = \frac{1{,}506.752}{1{,}690} = 0.892; \text{ as a percentage, } 0.892 \times 100 = 89.2\%. $$
Thus, 89.2% of the variation in the final grades is explained by the
variation in the number of absences.
In other words, 89.2% of the variation in final grades for the Biometry
class is explained by the linear relationship between final grade and
number of absences.
The rest of the variation, i.e. 0.108 or 10.8%, is unexplained
variation; it is due to other factors.
Exercise
An experiment was designed to compare several different types of air
pollution monitors. The monitor was set up, and then exposed to different
concentrations of ozone, ranging between 15 and 230 parts per million
(ppm), for periods of 8 to 72 hours. Filters on the monitor were then
analyzed, and the amount (in micrograms) of sodium nitrate (NO3)
recorded by the monitor was measured. The results for one type of
monitor are given in the table below.
Ozone (ppm/hr) | 0.8  | 1.3  | 1.7  | 2.2  | 2.7   | 2.9
NO3 (µg)       | 2.44 | 5.21 | 6.07 | 8.98 | 10.82 | 12.16
Exercise...cont’d
(a) Calculate the correlation coefficient for the data and interpret.
(b) Find the least-squares regression line relating the monitor’s response
to the ozone concentration.
(c) Using the estimated regression line, estimate the amount of NO3
recorded for an ozone concentration of 2.0 ppm/hr.
(d) Construct an ANOVA table for simple linear regression for the data.
(e) Do the data provide sufficient evidence to indicate that there is a
linear relationship between the ozone concentration and the amount of
sodium nitrate detected? Test at α = 0.05.
(f) Construct a 95% confidence interval for the slope of the regression
line.
(g) Calculate R 2 . What does this value tell you about the effectiveness of
the linear regression analysis?