Statistics 2014, Fall 2001

Ch 4 – Describing the Relation between Two Variables
Definition: When the values of two variables are measured for each member of a population or
sample, the resulting data is called bivariate.
When both variables are quantitative, we may represent the data set as a set of ordered pairs of
numbers, (x, y). The variable x is called the input (or independent) variable; the variable y is called the
response (or dependent) variable. We may examine the relationship between the two variables
graphically using a scatter diagram, or scatterplot.
Example: The following data set for a sample of 6 randomly selected middle-aged to elderly patients consists of
x = age of patient, and y = measured value of systolic blood pressure of patient. We expect that as
people age, their blood pressure will increase. We will examine the relationship between the two
variables.
Age, x      Systolic Blood Pressure, y
43          128
48          120
56          135
61          143
67          141
70          152
To construct a scatterplot of the data using the TI-83:
1) Choose STAT, EDIT. Name one column Age; name the other column SBP.
2) Enter the data into the two columns.
3) Choose WINDOW. Set Xmin to be slightly smaller than the smallest value of x. In this case, we
set Xmin = 40. Set Xmax to be slightly larger than the largest value of x. In this case, we set
Xmax = 72. Set Ymin to be slightly smaller than the smallest value of y; in this case, Ymin = 118.
Set Ymax to be slightly larger than the largest value of y; in this case, Ymax = 155. Set Xscl = 1, and
Yscl = 1.
4) Choose 2nd, STAT PLOT. Turn Plot 1 On. For Type, choose the first type, scatterplot. For Xlist,
enter the name of the x variable; for Ylist, enter the name of the y variable.
5) Hit the GRAPH key.
In this example, we see an increasing, linear trend relationship between age and systolic blood
pressure, as expected. If we want to see the coordinates of the data points, we use the TRACE key.
Linear Correlation
The purpose of linear correlation analysis is to measure the strength of the linear relationship between
x and y.
Note: If the relationship between the two does not appear to be linear, then linear correlation analysis
should not be done.
Types of relationships:
1) If there is an increasing linear trend relationship, so that larger values of x tend to be associated
with larger values of y, then we say that there is a positive correlation between x and y. There may be
a strong positive linear trend relationship, if the data points cluster closely around a straight line; or a
weak positive linear trend relationship, if the data points are not all close to a straight line.
2) If there is a decreasing linear trend relationship, so that larger values of x tend to be associated with
smaller values of y, then we say that there is a negative correlation between x and y. There may be a
strong negative linear trend relationship, if the data points cluster closely around a straight line; or a
weak negative linear trend relationship, if the data points are not all close to a straight line.
3) If there is no linear trend present, then we say that the correlation between x and y is zero. This
may happen if there is no relationship at all apparent between x and y, or if the relationship appears to
be curvilinear, rather than linear.
Definition: Pearson’s correlation coefficient, r, is a numerical measure of the strength (and direction)
of a linear relationship between two quantitative variables. The formula for the correlation coefficient
is
\[ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y}, \]
where \( s_x = \sqrt{\sum (x_i - \bar{x})^2 / (n - 1)} \) is the (sample) standard deviation for x, and
\( s_y = \sqrt{\sum (y_i - \bar{y})^2 / (n - 1)} \) is the (sample) standard deviation for y.
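As an illustration, the defining formula can be evaluated directly. The following is a minimal Python sketch, not part of the original notes: the function name pearson_r and the variable names are chosen here for illustration, and the data are the six age and blood pressure pairs from the example above.

```python
from statistics import mean, stdev

# Age (x) and systolic blood pressure (y) for the six patients in the example.
age = [43, 48, 56, 61, 67, 70]
sbp = [128, 120, 135, 143, 141, 152]

def pearson_r(x, y):
    """Pearson's r from the defining formula (illustrative helper)."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    # Numerator: sum of products of deviations from the two means.
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # stdev() is the sample standard deviation (it divides by n - 1).
    return cross / ((n - 1) * stdev(x) * stdev(y))

r = pearson_r(age, sbp)
print(round(r, 4))  # 0.8967
```

The value is close to 1, which matches the strong increasing trend seen in the scatterplot.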
Note: We may also calculate the value of r using the following sample statistics for the two variables:
1) \( \sum x_i \) is the sum of all of the x values;
2) \( \sum y_i \) is the sum of all of the y values;
3) \( \sum x_i^2 \) is the sum of all of the squared x values;
4) \( \sum y_i^2 \) is the sum of all of the squared y values;
5) \( \sum x_i y_i \) is the sum of the products of corresponding x and y values.
Then the correlation coefficient is
\[ r = \frac{\sum x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{\left[\sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2\right]\left[\sum y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2\right]}}. \]
Properties of r:
1) It is always true that -1 ≤ r ≤ 1.
2) If there is a perfect positive linear relationship between x and y, then r = 1; if there is a perfect
negative linear relationship between x and y, then r = -1.
3) If there is a positive linear trend relationship between x and y, then 0 < r < 1; if there is a negative
linear trend relationship between x and y, then -1 < r < 0.
4) If there is no linear relationship between x and y, then r = 0.
We will learn how to calculate r after discussing simple linear regression.
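In the meantime, the shortcut formula based on the five sums can be checked with a short Python sketch. This is an illustration only, not the calculator procedure covered later; the helper name r_from_sums is an assumption made here, and the data are the age and blood pressure pairs from the first example.

```python
from math import sqrt

# Age (x) and systolic blood pressure (y) from the first example.
age = [43, 48, 56, 61, 67, 70]
sbp = [128, 120, 135, 143, 141, 152]

def r_from_sums(x, y):
    """Pearson's r from the five sums (illustrative helper)."""
    n = len(x)
    sum_x = sum(x)                                  # sum of the x values
    sum_y = sum(y)                                  # sum of the y values
    sum_x2 = sum(xi * xi for xi in x)               # sum of squared x values
    sum_y2 = sum(yi * yi for yi in y)               # sum of squared y values
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))   # sum of products
    numerator = sum_xy - sum_x * sum_y / n
    denominator = sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator

r = r_from_sums(age, sbp)
print(round(r, 4))    # 0.8967
print(-1 <= r <= 1)   # True, as property 1 requires
```

Because r lies strictly between 0 and 1 here, the example falls under property 3: a positive linear trend that is not perfect.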
Linear Regression
When we do a scatterplot of bivariate numerical data, we are looking for trends which describe the
relationship between the two variables. In this course, we will be concerned only with linear trend
relationships. We say that there is a linear trend relationship between two variables if we can draw a
line on the scatterplot which will represent the relationship between the two variables, apart from
random error.
In the above example of data on age and systolic blood pressure, the scatterplot shows a linear
increasing trend between the two variables. We want to represent this trend as a line of best fit to the
data, or a regression line. This is a straight line which lies as close as possible to all of the data points
simultaneously.
The equation for a straight line is \( y = \beta_0 + \beta_1 x \), where the parameter \( \beta_0 \) is called the
intercept of the line (the y-coordinate of the point at which the line crosses the y axis), and \( \beta_1 \) is
called the slope of the line (the rate at which the line is rising or falling as we move to the right).
Since our two variables are not perfectly linearly related (the points do not lie exactly on a straight
line), we need to add a random error term to our equation: \( y = \beta_0 + \beta_1 x + \varepsilon \). Here the
quantity \( \varepsilon \) represents the amount by which a particular data point lies above or below the line.
We will find the estimates of the two parameters using the data and the method of least squares. This
method says that the line of best fit to the data may be found by making the sum of the squared vertical
distances between the data points and the line as small as possible (see pp. 165-166 of textbook).
When we do this, the parameter estimates obtained are:
\[ \hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}. \]
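The least squares estimates can also be computed directly from their formulas. Below is a minimal Python sketch applied to the age and blood pressure data from the earlier example; the function name least_squares and the variable names are illustrative choices, not part of the original notes.

```python
from statistics import mean

# Age (x) and systolic blood pressure (y) from the earlier example.
age = [43, 48, 56, 61, 67, 70]
sbp = [128, 120, 135, 143, 141, 152]

def least_squares(x, y):
    """Return (intercept, slope) of the least squares line (illustrative helper)."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of products of deviations, and sum of squared x deviations.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

b0, b1 = least_squares(age, sbp)
print(round(b0, 4), round(b1, 4))  # 81.0481 0.9644
```

The intercept is computed from the slope, so the slope has to be found first.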
To find the regression line using the TI-83:
1) Enter the two columns of data for the variables, using STAT, EDIT.
2) Choose STAT, CALC, 4:LinReg(ax + b).
3) Enter the name of the x variable, using 2nd, LIST; followed by a comma; followed by the name of
the y variable, using 2nd, LIST. Hit ENTER.
4) You will see the estimated slope (the value for a) and the estimated intercept (the value for b).
Example: We will go back to the previous example of the comparison of age with systolic blood
pressure for middle-aged and elderly people. The bivariate data set is given below:
Age, x      Systolic Blood Pressure, y
43          128
48          120
56          135
61          143
67          141
70          152
If we do this for the systolic blood pressure data, we find the regression line has the equation:
\[ \hat{y} = 81.0481 + 0.9644x. \]
We put the hat over the y here to signify that this is the predicted value of the systolic blood
pressure, not the actual value from the data set. Thus, for someone at age 56, the predicted value of
systolic blood pressure is \( \hat{y} = 81.0481 + 0.9644(56) = 135.0545 \).
If we look back at the data set, the SBP measurement for the person who was 56 years old was 135.
For someone at age 61, the predicted value of systolic blood pressure is
\( \hat{y} = 81.0481 + 0.9644(61) = 139.8765 \), while the measured value was 143. The first predicted
value is close to the actual data value; the second predicted value is not quite so close.
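The two predictions can be reproduced, along with the gap between each measured and predicted value, in a few lines of Python. This is a sketch for illustration: it uses the rounded coefficients quoted in the text, and the names B0, B1 and predict_sbp are assumptions introduced here.

```python
# Rounded regression coefficients quoted in the text.
B0 = 81.0481   # estimated intercept
B1 = 0.9644    # estimated slope (mm of Hg per year of age)

def predict_sbp(age):
    """Predicted systolic blood pressure at a given age (illustrative helper)."""
    return B0 + B1 * age

# Measured SBP values for the patients aged 56 and 61 in the data set.
observed = {56: 135, 61: 143}

results = {}
for age, actual in observed.items():
    predicted = predict_sbp(age)
    # Difference between the measured and the predicted value.
    results[age] = (predicted, actual - predicted)
    print(age, round(predicted, 4), round(actual - predicted, 4))
# 56 135.0545 -0.0545
# 61 139.8765 3.1235
```

A negative difference means the data point lies below the line; a positive one means it lies above.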
Finding the correlation coefficient using the TI-83
We can find the Pearson correlation coefficient for the (linear) relationship between two quantitative
variables using the linear regression function of the calculator.
1) Choose 2nd, CATALOG. Scroll down to DiagnosticOn. Hit ENTER twice.
2) Do the linear regression as before.
3) Now the output screen will show, in addition to the estimates for the regression slope and intercept,
the estimated Pearson correlation coefficient for the linear relationship between x and y.
For the blood pressure example, r = 0.8967, indicating a fairly strong positive linear relationship
between age and systolic blood pressure.
Interpreting predicted values from linear regression
1. The slope, \( \hat{\beta}_1 \), of the regression line is the predicted change in y per unit increase in x,
i.e., the rate of change of the dependent variable with respect to the independent variable. If the slope
is 0.9644 mm of Hg per year of age, then for each increase of age by one year, we predict that systolic
blood pressure will increase by 0.9644 mm of Hg.
2. For any given member of the population or sample, the actual value of y will differ somewhat from
the predicted value. The predicted value of y at a given value of x is an estimate of the average of the
y values for all population members who have that value of x.
3. The y intercept, \( \hat{\beta}_0 \), is the y-coordinate of the point where the line crosses the y-axis. In
other words, it is the predicted value of y when x = 0. In many data sets, this point on the regression
line is not meaningful. For example, what would be the predicted systolic blood pressure for someone
having age 0? This leads to a cautionary note about the use of linear regression for prediction. The line
of best fit to the data works for prediction of y, so long as the value of x is in the range of x-values of
the original data set. It is not advisable, for example, to use the regression equation in the blood
pressure example to try to predict values of the systolic blood pressure for people whose age is less
than 43 or greater than 70, since nobody in our sample was less than 43 or over 70. The equation should
be used to make predictions only about the population from which the sample is drawn, and only within
the sample domain of the independent variable.
4. The regression line will always pass through the centroid of the data, \( (\bar{x}, \bar{y}) \). In the blood
pressure example, the average age of our sample was 57.5 years. If we plug this value into our regression
equation, we get a predicted systolic blood pressure of \( \hat{y} = 81.0481 + 0.9644(57.5) = 136.5011 \),
which agrees, up to rounding of the coefficients, with the average systolic blood pressure for our
sample, 136.5.
5. If the sample was taken in 1998, do not expect the results to be valid for the same population in
2004. There may have been changes over the intervening years.
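Several of these points can be illustrated with a short Python sketch on the blood pressure data. It is an illustration under stated assumptions, not a prescribed procedure: the helper predict_sbp and its refusal to predict outside the sampled ages are choices made here to mirror the cautionary note in point 3.

```python
from statistics import mean

# Age (x) and systolic blood pressure (y) from the blood pressure example.
age = [43, 48, 56, 61, 67, 70]
sbp = [128, 120, 135, 143, 141, 152]

# Least squares estimates (unrounded).
x_bar, y_bar = mean(age), mean(sbp)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(age, sbp))
         / sum((x - x_bar) ** 2 for x in age))
intercept = y_bar - slope * x_bar

def predict_sbp(x):
    """Predict SBP, but only within the sampled age range (illustrative helper)."""
    if not (min(age) <= x <= max(age)):
        raise ValueError(f"age {x} is outside the sampled range "
                         f"{min(age)} to {max(age)}")
    return intercept + slope * x

# Point 1: one more year of age changes the prediction by the slope.
print(round(predict_sbp(61) - predict_sbp(60), 4))   # 0.9644

# Point 4: the line passes through the centroid (x_bar, y_bar).
print(round(predict_sbp(x_bar), 4), y_bar)           # 136.5 136.5

# Point 3: extrapolating to age 0 is refused.
try:
    predict_sbp(0)
except ValueError as err:
    print("refused:", err)
```

With unrounded coefficients the prediction at the average age equals the average blood pressure exactly; the small discrepancy in point 4 comes only from rounding the coefficients to four decimals.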
Let’s do a complete analysis of the relationship between two quantitative variables.
Example: The following data are the ages and asking prices for 19 used foreign compact cars.
Age, x (years)    Price, y ($100's)
3                 68
5                 52
3                 63
6                 24
4                 60
4                 60
6                 28
7                 36
2                 68
2                 64
6                 42
8                 22
5                 50
6                 36
5                 46
7                 36
4                 48
7                 20
5                 36
We want to examine the relationship between age and price, first through use of a scatterplot, then
through calculating the regression equation and r.
[Figure: scatterplot "Asking Price vs. Car Age", with Age on the horizontal axis and Price on the vertical axis.]
There appears to be a negative linear trend relationship between Age and Price, with older cars having
lower asking prices.
The equation for the line of best fit to the data is \( \hat{y} = 86.5068 - 8.2593x \), and the Pearson
correlation coefficient is r = -0.9054. Hence there is a strong negative linear trend relationship
between Age and Price.
For a 5-year-old car, the predicted asking price is \( \hat{y} = 86.5068 - (8.2593)(5) = 45.2103 \) (in
$100's), or $4521.03. There were four 5-year-old cars in the original data set, with asking prices of
$5200, $5000, $4600, and $3600. One of these cars has an asking price near the predicted value. The
average of the asking prices for the four cars is $4600, close to the predicted value.
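The whole analysis for the car data can be reproduced with a short Python sketch. It is offered as an illustration only; the variable names are chosen here, and the prediction deliberately uses the coefficients rounded to four decimals, as the text does, so that the dollar figure matches.

```python
from math import sqrt
from statistics import mean

# (age in years, asking price in $100's) for the 19 used cars.
cars = [(3, 68), (5, 52), (3, 63), (6, 24), (4, 60), (4, 60), (6, 28),
        (7, 36), (2, 68), (2, 64), (6, 42), (8, 22), (5, 50), (6, 36),
        (5, 46), (7, 36), (4, 48), (7, 20), (5, 36)]
ages = [a for a, _ in cars]
prices = [p for _, p in cars]

x_bar, y_bar = mean(ages), mean(prices)
s_xy = sum((a - x_bar) * (p - y_bar) for a, p in cars)
s_xx = sum((a - x_bar) ** 2 for a in ages)
s_yy = sum((p - y_bar) ** 2 for p in prices)

b1 = s_xy / s_xx               # estimated slope
b0 = y_bar - b1 * x_bar        # estimated intercept
r = s_xy / sqrt(s_xx * s_yy)   # Pearson correlation coefficient
print(round(b0, 4), round(b1, 4), round(r, 4))  # 86.5068 -8.2593 -0.9054

# Predicted price of a 5-year-old car, using the rounded coefficients.
pred = round(b0, 4) + round(b1, 4) * 5          # in $100's
print(f"${100 * pred:.2f}")                     # $4521.03

# Average asking price of the 5-year-old cars in the data set.
five_year_old = [p for a, p in cars if a == 5]
avg_five = 100 * mean(five_year_old)            # in dollars
print(f"${avg_five:.2f}")                       # $4600.00
```

Here r is the same quantity as in the defining formula, since \( (n - 1) s_x s_y \) equals the square root of the product of the two sums of squared deviations.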