Uploaded by Liezl Galgo

Linear Regression and Correlation

When performing research studies, scientists often wish to know whether two
variables are related. If the variables are determined to be related, a scientist may
then wish to find an equation that can be used to model the relationship. For
instance, a geologist might want to know whether there is a relationship between the
duration of an eruption of a geyser and the time between eruptions. A first step in
this determination is to collect some data. Data involving two variables are called
bivariate data. Table 1 gives bivariate data showing the time between two eruptions
and the duration of the second eruption for 10 eruptions of the geyser Old Faithful.
Table 1.
Time between eruptions (in seconds), x:   272  227  237  238  203  270  218  226  250  245
Duration of eruption (in seconds), y:      89   79   83   82   81   85   78   81   85   79
Once the data are collected, a scatter diagram or scatter plot can be drawn,
as shown below in Figure 2.
Figure 2. Old Faithful (scatter plot of the data in Table 1).
Least-Squares Regression Line
Given a set of ordered pairs of bivariate data, the line that minimizes the sum of
the squares of the vertical deviations from each data point to the line is called
the least-squares regression line, or least-squares line, for that set of data.
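To make the quantity being minimized concrete, the following minimal Python sketch computes the sum of the squared vertical deviations for a candidate line y = ax + b, using the Old Faithful data from Table 1. The function name sum_squared_deviations and the two candidate lines are illustrative assumptions and do not come from the original text.

```python
# Old Faithful data from Table 1.
times = [272, 227, 237, 238, 203, 270, 218, 226, 250, 245]   # x, seconds
durations = [89, 79, 83, 82, 81, 85, 78, 81, 85, 79]         # y, seconds


def sum_squared_deviations(a, b, xs, ys):
    """Sum of squared vertical deviations of the points from y = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


# Two hypothetical candidate lines, chosen only for illustration.
flat_line = sum_squared_deviations(0.0, 82.2, times, durations)
tilted_line = sum_squared_deviations(0.1, 58.0, times, durations)

# The least-squares line is the choice of a and b that makes this value smallest.
print(flat_line, tilted_line)
```

For these two candidates, the tilted line gives the smaller sum, so it fits the data better in the least-squares sense.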
The equation of the least-squares line for the $n$ ordered pairs
$(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)$ is $\hat{y} = ax + b$, where

$$a = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$$

and

$$b = \bar{y} - a\bar{x}.$$
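The two formulas translate directly into code. The sketch below is one possible Python rendering, assuming the data arrive as two equal-length lists; the function name least_squares_line is a hypothetical choice made for this illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y = a*x + b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are equal; the slope is undefined")

    a = (n * sum_xy - sum_x * sum_y) / denominator
    b = sum_y / n - a * (sum_x / n)   # b = y-bar - a * x-bar
    return a, b


# Points that lie exactly on y = 2x + 1 should recover a = 2 and b = 1.
a, b = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

The guard on the denominator covers the one case where the formula for a breaks down: when every x value is the same, no slope can be determined.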
The equation of the least-squares line can be used to predict the value of one
variable when the value of the other variable is known.
To apply this formula to the data for Old Faithful, we first find the value of each
summation.
$$\sum x = 2386, \quad \sum y = 822, \quad \sum x^2 = 573{,}560, \quad \sum xy = 196{,}636$$
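These summations can be checked with a few lines of Python. The sketch below assumes the data are typed in exactly as listed in Table 1; the variable names are illustrative.

```python
# Old Faithful data from Table 1.
times = [272, 227, 237, 238, 203, 270, 218, 226, 250, 245]   # x, seconds
durations = [89, 79, 83, 82, 81, 85, 78, 81, 85, 79]         # y, seconds

n = len(times)
sum_x = sum(times)
sum_y = sum(durations)
sum_x2 = sum(x * x for x in times)
sum_xy = sum(x * y for x, y in zip(times, durations))

print(n, sum_x, sum_y, sum_x2, sum_xy)   # 10 2386 822 573560 196636
```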
Next, we use these values to find the value of $a$.

$$a = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \frac{10(196{,}636) - (2386)(822)}{10(573{,}560) - (2386)^2} = \frac{5068}{42{,}604} \approx 0.1189559666$$
We then find the values of $\bar{x}$ and $\bar{y}$,

$$\bar{x} = \frac{\sum x}{n} = \frac{2386}{10} = 238.6$$

and

$$\bar{y} = \frac{\sum y}{n} = \frac{822}{10} = 82.2$$

and use them to find the $y$-intercept, $b$.

$$b = \bar{y} - a\bar{x} \approx 82.2 - (0.1189559666)(238.6) \approx 53.81710637$$
The regression equation is $\hat{y} = 0.1189559666x + 53.81710637$.
The graph of the regression equation and a scatter plot of the data are shown below.
Figure: Old Faithful. Scatter plot of the data with the regression line y = 0.1189559666x + 53.81710637. Horizontal axis: seconds between eruptions. Vertical axis: length of the eruption.
We can now use the regression equation to estimate the duration of an eruption
given the time between eruptions.
For instance, if the time between two eruptions is 200 seconds, then the estimated
duration of the second eruption is

$$\hat{y} = 0.1189559666x + 53.81710637 = 0.1189559666(200) + 53.81710637 \approx 78.$$

Thus, the approximate duration of the eruption is 78 seconds. ∎
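The same estimate can be reproduced in code. This is a minimal sketch that hard-codes the rounded coefficients from the text; the names A, B, and predict_duration are assumptions made for the example.

```python
# Coefficients of the regression equation, rounded as in the text.
A = 0.1189559666   # slope a
B = 53.81710637    # y-intercept b


def predict_duration(seconds_between_eruptions):
    """Estimated eruption duration (seconds) from the regression equation."""
    return A * seconds_between_eruptions + B


estimate = predict_duration(200)
print(round(estimate))   # 78
```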
Linear Correlation Coefficient
The linear correlation coefficient $r$ measures the strength of a linear
relationship between two variables. The closer the value of $|r|$ is to 1, the
stronger the linear relationship between the variables.
For $n$ ordered pairs $(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)$,
the linear correlation coefficient $r$ is given by

$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2} \cdot \sqrt{n\sum y^2 - (\sum y)^2}}$$
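A direct Python rendering of this formula is sketched below and applied to the Old Faithful data from Table 1. The function name linear_correlation is hypothetical, and the sketch assumes neither variable is constant, since a constant variable would make the denominator zero.

```python
from math import sqrt


def linear_correlation(xs, ys):
    """Linear correlation coefficient r for paired data."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


# Old Faithful data from Table 1.
times = [272, 227, 237, 238, 203, 270, 218, 226, 250, 245]   # x, seconds
durations = [89, 79, 83, 82, 81, 85, 78, 81, 85, 79]         # y, seconds

r = linear_correlation(times, durations)
print(r)
```

The numerator is the same as the numerator of the slope a, so r always has the same sign as the slope of the least-squares line.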
The Correlation Coefficient (r)
The sample correlation coefficient (r) is a measure of the closeness of association of
the points in a scatter plot to a linear regression line based on those points, as in the
Old Faithful example above. Possible values of the correlation
coefficient range from -1 to +1, with -1 indicating a perfectly linear negative (that is,
inverse) correlation (sloping downward) and +1 indicating a perfectly linear positive
correlation (sloping upward).
Describing Correlation Coefficients
The table below provides some guidelines for how to describe the strength of
correlation coefficients, but these are just guidelines for description. Also, keep in
mind that even weak correlations can be statistically significant, as you will learn
shortly.
The four images below give an idea of how some correlation coefficients might look
on a scatter plot.
Example.
A random sample of 50-year-old men (n = 10) was taken. Weight, height, and systolic
blood pressure were measured, and BMI was computed. In this analysis, the
independent (or predictor) variable is BMI and the dependent (or outcome) variable
is systolic blood pressure (SBP).
Solution.
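$$\sum x = 272.9, \quad \sum y = 1{,}395, \quad \sum x^2 = 7{,}734.11, \quad \sum y^2 = 199{,}675, \quad \sum xy = 39{,}106.5$$

Substitute these values into the formula for $r$:

$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2} \cdot \sqrt{n\sum y^2 - (\sum y)^2}} = \frac{10(39{,}106.5) - (272.9)(1{,}395)}{\sqrt{10(7{,}734.11) - (272.9)^2} \cdot \sqrt{10(199{,}675) - (1{,}395)^2}} \approx 0.86$$

Since $r \approx 0.86$, there is a strong positive correlation between BMI and systolic blood pressure.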
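Because only the summary values are given for this example, the arithmetic can be checked by plugging them into the formula directly. The sketch below does exactly that; the variable names are illustrative.

```python
from math import sqrt

# Summary values given in the solution for the BMI and SBP example.
n = 10
sum_x, sum_y = 272.9, 1395
sum_x2, sum_y2 = 7734.11, 199675
sum_xy = 39106.5

numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
r = numerator / denominator

print(round(r, 2))   # 0.86
```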