DISPLAYING THE RELATIONSHIP
DEFINITIONS:
Studies are often conducted to attempt to show that some explanatory
variable “causes” the values of some response variable to occur.
The response or dependent variable is the response of interest, the
variable we want to predict, and is usually denoted by y.
The explanatory or independent variable attempts to explain the
response and is usually denoted by x.
A scatterplot shows the relationship between two quantitative variables x
and y. The values of the x variable are marked on the horizontal axis, and
the values of the y variable are marked on the vertical axis. Each pair of
observations (xi, yi) is represented as a point in the plot.
Two variables are said to be positively associated if, as x increases, the
values of y tend to increase. Two variables are said to be negatively
associated if, as x increases, the values of y tend to decrease.
When a scatterplot does not show a particular direction, neither positive
nor negative, we say that there is no linear association.
[Figure: Scatterplot of Final vs Midterm Scores. Horizontal axis: Midterm; vertical axis: Final. The 10th student is marked at (21, 38).]
Student   Midterm     Final
Number    Score (X)   Score (Y)
1         39          62
2         44          69
3         32          68
4         40          86
5         45          88.5
6         46          88.5
7         33          76
8         39          66.5
9         32.5        75
10        21          38
11        30          71
12        39          88
13        44          96.5
14        28.5        71.5
15        38          96
16        43          82.5
17        42          85
18        25.5        28
19        47          95
20        36          39
21        31.5        58
22        32          49
23        42          62
24        21          59
25        41          90
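The direction of the association in this data set can also be checked numerically. The following is a minimal Python sketch (not part of the original notes) that uses the 25 midterm and final scores from the table. The function name `direction_of_association` is made up for illustration. It looks at the sign of the sum of products of deviations from the means, which is positive when above-average x values tend to go with above-average y values.

```python
# Midterm (x) and final (y) scores for the 25 students in the table.
midterm = [39, 44, 32, 40, 45, 46, 33, 39, 32.5, 21, 30, 39, 44,
           28.5, 38, 43, 42, 25.5, 47, 36, 31.5, 32, 42, 21, 41]
final = [62, 69, 68, 86, 88.5, 88.5, 76, 66.5, 75, 38, 71, 88, 96.5,
         71.5, 96, 82.5, 85, 28, 95, 39, 58, 49, 62, 59, 90]


def direction_of_association(x, y):
    """Classify the direction from the sign of sum((x - x_bar) * (y - y_bar))."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if s_xy > 0:
        return "positive"
    if s_xy < 0:
        return "negative"
    return "none"


print(direction_of_association(midterm, final))  # positive
```

This only reports direction. Form, strength, and unusual points still need to be judged from the scatterplot itself.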
Let's Do It! 1
The data below was obtained in a study of age and systolic blood
pressure of six randomly selected subjects. Make a scatter plot to
examine the relationship between (x) = age and (y) = pressure.
Comment on the relationship with respect to form, direction,
strength, and any departures or unusual values.
Subject   Age (x)   Pressure (y)
A         43        128
B         48        120
C         56        135
D         61        143
E         67        141
F         70        152
Notes of Caution
1. An observed relationship between two variables does
not imply that there is some causal link between the
two variables.
For example, consider the following scatter-plot of IQ score versus shoe size:
[Figure: scatter-plot of IQ (vertical axis) versus Shoe Size (horizontal axis).]
As a person ages, their shoe size increases, as does their IQ. Although there
is a positive association, there is no causal link between the two variables
shoe size and IQ.
Most studies attempt to show that some explanatory variable "causes" the
values of the response to occur. While we can never positively determine
whether or not there is a distinct cause-and-effect relationship, we can assess
if there appears to be such a relationship.
2. A relationship between two variables can be influenced
by confounding variables.
Consider the following scatter-plot of the number of sport magazines read in a
month versus the height of the person:
[Figure: scatter-plot of Number of magazines read (vertical axis) versus Height (horizontal axis), with separate plotting symbols for women and men.]
Overall there appears to be a positive association between height and number
of magazines. However, within each gender there does not appear to be an
association. Gender is a confounding variable, and aggregating the data
across gender can result in misleading conclusions. Any study, especially an
observational study, has the potential to be wrongly interpreted because of
confounding variables.
3. Unusual data points (outliers) can mislead the
association, especially if the data set is small.
Consider the following scatter-plot of the percentage of people who speak
English versus population size.
[Figure: scatter-plot of Percent who speak English (vertical axis) versus Population Size (horizontal axis), with one point labeled Outlier.]
The eight points in the scatter-plot represent eight countries from Central
and South America selected at random. The outlier is Mexico City.
4. Sometimes a scatter plot, such as the one in the figure
below, shows a curvilinear relationship between the data.
Methods for curvilinear relationships are
beyond the scope of this course.
Simple Linear Regression
[Figure: Scatterplot of Final vs Midterm Scores, with two candidate lines labeled Line #1 and Line #2. Horizontal axis: Midterm; vertical axis: Final.]
So the question remains: how do we find a “best-fitting” line?
Equation of a Line
y=a+bx
where
b = slope - the amount y changes when x is increased by 1 unit.
a = y-intercept - the value of y when x is set equal to zero.
DEFINITION:
The least squares regression line, given by $\hat{y} = a + bx$, is the
line that makes the sum of the squared vertical deviations of the
data points from the line as small as possible. Performing the
regression is often stated as "regress y on x."
The least squares regression line for regressing final exam scores,
y, on midterm exam scores, x, is given by $\hat{y} = 7.5 + 1.75x$.
Estimated slope of b =1.75 tells us that for a 1-point increase on
the midterm we would expect, on average, an increase of 1.75
points on the final exam.
Estimated y -intercept of a =7.5 tells us that if someone were to
score 0 points on the midterm, we would predict they would get
7.5 points on the final exam.
Suppose a new student scores 40 points on the midterm. Based
on our model, what would be their predicted final exam score?
Plug the value of x = 40 into our estimated equation. The predicted
final exam score is $\hat{y} = 7.5 + 1.75(40) = 77.5$ points.
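The prediction step is simple enough to express in a few lines of code. The following Python sketch is an illustration only (the function name `predict_final` is not from the notes) and uses the fitted values a = 7.5 and b = 1.75 stated above.

```python
def predict_final(midterm_score, a=7.5, b=1.75):
    """Predicted final exam score from the fitted line y-hat = a + b * x."""
    return a + b * midterm_score


print(predict_final(40))  # 77.5
print(predict_final(0))   # 7.5, the estimated y-intercept
```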
Let's Do It! 2 13.2 Childhood Growth
The growth of children from early childhood through adolescence
generally follows a linear pattern. Data on the heights of female
Americans during childhood, from four to nine years old, were
compiled and the least squares regression line was obtained as
$\hat{y} = 80 + 6x$, where y is height in centimeters and x is age in
years. Note that 1 inch is equal to 2.54 centimeters.
(a) Interpret the value of the estimated slope
b = 6.
(b) Would interpretation of the value of the estimated y-intercept,
a = 80, make sense here? If yes, interpret it. If no,
explain why not.
(c) What would you predict the height to be for a female
American at 8 years old? Give your answer first in
centimeters then in inches.
(d) What would you predict the height to be for a female
American at 25 years old? Give your answer first in
centimeters then in inches.
(e) Why do you think your answer to part (d) was so inaccurate?
Calculating the Least Squares Regression Line
The Least Squares Regression Line
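The least squares regression line is given by $\hat{y} = a + bx$, where
slope: $b = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \dfrac{n \sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{n \sum x_i^2 - \left(\sum x_i\right)^2}$
y-intercept: $a = \bar{y} - b\bar{x}$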
Example
Test 1 versus Test 2—Obtaining the Regression Line “By Hand”
(a) Look at the relationship graphically with a scatter-plot to
confirm initially that a linear model seems appropriate.
(b) Calculate the estimated regression line by completing the
calculation table shown below.
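$b = \dfrac{n \sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{n \sum x_i^2 - \left(\sum x_i\right)^2} = \dfrac{5(884) - (60)(70)}{5(760) - (60)^2} = \dfrac{220}{200} = 1.1$
$a = \bar{y} - b\bar{x} = \dfrac{70}{5} - 1.1\left(\dfrac{60}{5}\right) = 0.8$
Least squares equation: $\hat{y} = 0.8 + 1.1x$.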
The slope of the line is b = 1.1.
This means that Test 2 scores are expected to go up by 1.1 points
on average for each additional point scored on Test 1.
A student who scored 15 points on Test 1 is predicted to score
$\hat{y} = 0.8 + 1.1(15) = 17.3$ points on Test 2.
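The hand calculation can be checked with a short Python sketch. This is an illustration rather than part of the original notes: the function name `least_squares_line` is made up, and the data are the five Test 1 and Test 2 scores used in this example. The function applies the computational formulas for b and a directly.

```python
test1 = [8, 10, 12, 14, 16]   # x = Test 1 scores
test2 = [9, 13, 14, 15, 19]   # y = Test 2 scores


def least_squares_line(x, y):
    """Return (a, b) for y-hat = a + b*x using the computational formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)   # a = y_bar - b * x_bar
    return a, b


a, b = least_squares_line(test1, test2)
print(round(a, 2), round(b, 2))   # 0.8 1.1
print(round(a + b * 15, 1))       # 17.3, predicted Test 2 score for x = 15
```

Rounding is applied only when printing, because floating-point arithmetic can leave a tiny error in the last decimal places.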
Test 1 versus Test 2—Obtaining the Regression Line Using
the TI Calculator
To obtain the least squares regression line using the TI
graphing calculator we would first need to enter the data.
L1 (x = Test 1)   L2 (y = Test 2)
8                 9
10                13
12                14
14                15
16                19
Enter the values of the quantitative variable x = Test 1 into L1 and enter the
corresponding values of the quantitative variable y = Test 2 into L2. To get the
least squares regression equation we use the following sequence of buttons
Your output screen should provide the least squares
regression equation as
y=a+bx
with the y-intercept of a=0.8 and the slope of b=1.1.
Caution: There are two linear regression options, namely
LinReg(ax+b) and LinReg(a+bx). We request the latter
option, which uses b to represent the slope.
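To connect the calculator output back to the definition of least squares, the Python sketch below (not part of the original notes) computes the sum of squared vertical deviations for the fitted line with a = 0.8 and b = 1.1, and for two other candidate lines. The two alternative lines are arbitrary choices made for illustration only.

```python
test1 = [8, 10, 12, 14, 16]   # L1: x = Test 1
test2 = [9, 13, 14, 15, 19]   # L2: y = Test 2


def sum_squared_deviations(a, b, x, y):
    """Sum of squared vertical deviations of the points from the line a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


fitted = sum_squared_deviations(0.8, 1.1, test1, test2)    # least squares line
other_1 = sum_squared_deviations(2.0, 1.0, test1, test2)   # arbitrary alternative
other_2 = sum_squared_deviations(0.0, 1.2, test1, test2)   # arbitrary alternative

print(round(fitted, 2), round(other_1, 2), round(other_2, 2))  # 3.6 4.0 4.8
```

Both alternatives pass close to the data, yet each gives a larger sum of squared deviations than the least squares line, as the definition requires.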
Let's Do It!
13.3 Oil-Change Data
The table below presents data on x = the number of oil changes
per year and y = the cost of repairs for a random sample of 10
cars of a certain make and model, from a given region.
(a) Make a scatter-plot of the points as a check for linearity and
outliers. Comment on your plot.
(b) Find the least squares regression line for regressing cost on
number of oil changes. Describe what the estimated y-intercept
and estimated slope represent.
(c) Use your least squares regression line to predict the cost of
car repairs for a car that had four oil changes.
Homework 11.3, page 546: 1, 13, 21, 23, 30, 32, 38, 60, 62, 65, 93, 97
11.4 STATISTICALLY SIGNIFICANT RELATIONSHIP?
Researchers must rely on data from only a sample in order to
assess if a relationship exists between two variables.
Even though a relationship may be apparent in the sample, it is
possible that it will not extend to the population.
Researchers use statistics to assess the significance of an observed
relationship by measuring the chance that a relationship as strong
or stronger would be observed, assuming there really is no
relationship in the population.
Think About It
Slope of Zero
Consider the equation of a line for relating two variables:
$\hat{y} = a + bx$.
Suppose the y-intercept, a, is equal to 10 and the slope, b, is equal
to 0.
What would be the value of the response, $\hat{y}$, if x were equal to 2?
What would be the value of the response, $\hat{y}$, if x were equal to 12?
What would be the value of the response, $\hat{y}$, for any value of x?
What would it mean if the slope for the regression line for a
population were equal to 0?
The hypothesis of interest in linear regression is:
Main Hypothesis: The slope of the linear regression line
using all of the population values is equal to 0 (i.e. the
linear relationship is insignificant).
Alternative Hypothesis: The slope of the linear regression
line using all of the population values is NOT 0 (i.e. the
linear relationship is significant).
After obtaining the regression line, one should test the significance
of the components of the line (mainly, the slope b). We should
remember that we are testing for a linear relationship between x
and y. It is possible that x may determine y nonlinearly.
The test of the main hypothesis above is called the Slope F-Test
of Significance.
Microsoft EXCEL output:
Regression of final exam score, y, on midterm exam score, x.
Lines 12 through 14 are used to assess the significance of the
respective coefficients.
Line 14 presents information about the slope. The p-value in Line 14
of 0.0001 is used to assess whether the estimated slope is statistically
significantly different from zero.
Line 13 presents information about the y-intercept. The p-value in Line 13
is used to test whether the y-intercept for the population linear
regression is equal to 0.
Example Service Time
A computer-repair technician recorded data on the number of
computers serviced and the amount of time to complete the
service for 11 randomly selected service visits.
[Figure: scatter plot of Service Time (vertical axis, 0 to 200) versus Number of Computers Serviced (horizontal axis, 1 to 8).]
x = number of computers serviced and y = time to complete the service
(a) What type of association, if any, does the scatter plot show?
Output from SPSS:
(b) Obtain the estimated linear regression equation: $\hat{y} = a + bx$
Note: One column of the output contains the values for b and a.
SPSS reports the estimated slope first, in the * row, and the
y-intercept information in the ** row.
$\hat{y} = 10.19 + 24.83x$
(c) Is there evidence of a significant (non-zero) linear
relationship between the number of computers serviced and
the service time? Explain.
Note: The p-value is nearly 0 (p-value < 0.00005), so the number of
computers serviced appears to be a significant linear predictor
of service time.
(d) Predict the number of minutes required for service when it is
reported that 5 PCs are down.
$\hat{y} = 10.19 + 24.83(5) \approx 134.3$ minutes
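Parts (c) and (d) follow a pattern that recurs in the exercises: compare the slope's p-value with a significance level, then use the fitted line to predict. The Python sketch below is illustrative only. The function names are made up, and the significance level of 0.05 is an assumption (the example itself does not state one). The coefficients 10.19 and 24.83 are the ones reported above.

```python
def is_significant(p_value, alpha=0.05):
    """Reject the hypothesis of zero slope when the p-value is below alpha.

    alpha = 0.05 is an assumed significance level, not one given in the example.
    """
    return p_value < alpha


def predict_service_time(n_computers, a=10.19, b=24.83):
    """Predicted service time in minutes from y-hat = a + b * x."""
    return a + b * n_computers


# The output reports a p-value below 0.00005, so any value in that range works here.
print(is_significant(0.00005))             # True
print(round(predict_service_time(5), 1))   # 134.3
```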
Let's Do It! Size of Homes and Selling Price
There are many factors that affect the selling price of a home. The
total dwelling size and the assessed value are just two factors.
Data were gathered on homes in a Milwaukee, Wisconsin,
neighborhood. Scatter-plots revealed a linear relationship between
the total dwelling size of a home in 100 square feet and its selling
price in dollars. Below is the regression output for the least
squares regression of selling price on total dwelling size.
Dep var: PRICE   N: 20   Multiple R: .913   Squared multiple R: .834
Adjusted squared multiple R: .825   Standard error of estimate: 3377.192

Variable   Coefficient   Std error   Std coef   Tolerance   T       P(2 tail)
CONSTANT   11947.010     4748.133    0.000      .           2.516   0.022
SIZE       2749.622      288.980     0.913      .100E+01    9.515   0.000

Analysis of Variance
Source       Sum-of-squares   DF   Mean-square   F-ratio   P
Regression   .103257E+10      1    .103257E+10   90.533    0.000
Residual     .205298E+09      18   .114054E+08
(a) How many homes were included in this study?
(b) Obtain the least squares regression line for predicting selling
price from the size of the home.
(c) Is there a significant (nonzero) linear relationship between
price and size? Explain.
(d) The total dwelling size for another home in this neighborhood
is 1620 square feet. Use the least squares regression
equation to estimate the selling price of this home.
11.7 CORRELATION: HOW STRONG IS THE LINEAR
RELATIONSHIP?
DEFINITION:
The sample correlation coefficient r measures the strength of
the linear relationship between two quantitative variables. It
describes the direction of the linear association and indicates how
close the points in a scatter-plot are to the least squares
regression line.
Features of the correlation coefficient.
1. Range: $-1 \le r \le 1$.
2. Sign: The sign of the correlation coefficient indicates the direction
of association: negative [-1, 0) or positive (0, +1].
3. Magnitude: The magnitude of the correlation coefficient indicates the
strength of the linear association. If the data follow a straight line,
r = +1 (if the slope is positive) or r = -1 (if the slope is negative),
indicating a perfect linear association. If r = 0 then there is no
linear association.
4. Measures Strength: The correlation only measures the strength of the
linear association.
5. Unit-less: The correlation is computed using standard scores of the two
variables. It has no unit of measure and the absolute value of r will not
change if the units of measurement for x or y are changed. The correlation
between x and y is the same as the correlation between y and x.
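The unit-less and symmetry properties can be illustrated numerically. The sketch below is written in Python and is not part of the original notes. It uses the standard computational formula for r and the five Test 1 and Test 2 scores from the earlier regression example. The rescaling factor of 10 is an arbitrary stand-in for a change of units.

```python
import math


def correlation(x, y):
    """Sample correlation coefficient r from the computational formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator


test1 = [8, 10, 12, 14, 16]
test2 = [9, 13, 14, 15, 19]

r = correlation(test1, test2)
r_swapped = correlation(test2, test1)                       # roles of x and y exchanged
r_rescaled = correlation([10 * xi for xi in test1], test2)  # x in different units
r_flipped = correlation([-xi for xi in test1], test2)       # direction of x reversed

print(math.isclose(r, r_swapped))    # True: same correlation either way round
print(math.isclose(r, r_rescaled))   # True: unchanged by a change of units
print(math.isclose(r, -r_flipped))   # True: only the sign changes
```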
Some Pictures....
[Figure: three scatter-plots of y versus x.]
Positive, moderate to strong linear association, r = 0.8.
Negative, weak linear association, r = -0.2.
A strong association, just not a linear one, r = 0.
Let's Do It! 1 13.8 Matching Graphs
The scatter-plot #1 to the right yields a regression line of
y = -2.6 + 1.1x and a correlation of r = 0.84.
Using this information as a base, match each of the four scatter-plots
below to the correct description of its regression line and
correlation coefficient. The scales on the axes of the scatter-plots
are the same.
How to Calculate the Correlation Coefficient r
Dep var: PERCENT   N: 18   Multiple R: .840   Squared multiple R: .706
Adjusted squared multiple R: .688   Standard error of estimate: 10.547

Variable   Coefficient   Std error   Std coef   Tolerance   T        P(2 tail)
CONSTANT   96.681        6.289       0.000      .           15.373   0.000
LENGTH     -5.970        0.963       -0.840     .100E+01    -6.201   0.000

Analysis of Variance
Source       Sum-of-squares   DF   Mean-square   F-ratio   P
Regression   4276.908         1    4276.908      38.448    0.00
Residual     1779.815         16   111.238
“Multiple R: 0.840” = absolute value of the correlation coefficient r.
The sign of r can be determined by looking at the sign of the slope,
which here is -5.970.
The correlation between length of putt and percentage of putts made is
$r = -0.84$.
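The rule for recovering r from regression output (take the magnitude from "Multiple R" and the sign from the slope) can be written as a one-line calculation. The Python sketch below is an illustration, not part of the original notes. It uses the two values from the putting output.

```python
import math

multiple_r = 0.840   # "Multiple R" from the output: the absolute value of r
slope = -5.970       # coefficient of LENGTH from the output

# copysign(magnitude, sign_source) returns the magnitude with the sign of the slope.
r = math.copysign(multiple_r, slope)
print(r)  # -0.84
```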
The formula:
$r = \dfrac{n \sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{n \sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n \sum y_i^2 - \left(\sum y_i\right)^2}}$
Example
Test 1 versus Test 2: Obtaining the Correlation Coefficient “By Hand”
We already have computed the summation quantities needed for
finding r, shown in the calculation table.
Completed Calculation Table
         x_i   y_i   x_i^2   x_i y_i   y_i^2
         8     9     64      72        81
         10    13    100     130       169
         12    14    144     168       196
         14    15    196     210       225
         16    19    256     304       361
Total:   60    70    760     884       1032

$r = \dfrac{n \sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{n \sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n \sum y_i^2 - \left(\sum y_i\right)^2}} = \dfrac{5(884) - (60)(70)}{\sqrt{5(760) - (60)^2}\,\sqrt{5(1032) - (70)^2}} = 0.965$
The large positive correlation coefficient and the scatter-plot
indicate a strong, positive, linear association between Test 1 and
Test 2 scores.
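The by-hand value of r can be reproduced directly from the column totals of the calculation table. The following Python sketch is an illustration only (it is not part of the original notes). It plugs the five totals into the formula for r.

```python
import math

# Totals from the completed calculation table (n = 5 students).
n = 5
sum_x, sum_y = 60, 70
sum_x2, sum_xy, sum_y2 = 760, 884, 1032

numerator = n * sum_xy - sum_x * sum_y               # 5(884) - (60)(70) = 220
denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)    # sqrt(200)
               * math.sqrt(n * sum_y2 - sum_y ** 2)) # sqrt(260)

r = numerator / denominator
print(round(r, 3))  # 0.965
```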
Let’s Do It! 2 Birth Rates
We gathered data from 1970 for twelve nations on the percentage
of women aged 14 or older who were economically active and the
crude birth rate. (We define the crude birth rate as the number of
births in a year per 1000 population.) We are interested in the
relationship between the crude birth rate (y) and the percentage of women
who were economically active (x).
a. Create the scatter-plot. Determine if there is a positive, negative, or no
association between x and y.
Nation         x    y
Algeria        2    48
Argentina      19   21
Denmark        34   14
E. Germany     40   11
Guatemala      8    41
India          12   37
Ireland        20   22
Jamaica        20   31
Japan          37   19
Philippines    19   42
USA            30   15
Soviet Union   46   18
b. Find the equation of the regression line. Interpret the slope.
c. Find the correlation coefficient r.
Obtaining the Correlation Using the TI
To get the regression line and the correlation coefficient using the
TI we first need to turn on the diagnostic option. If the x data is in
L1 and the y data is in L2, then the steps are as follows:
Let’s Do It! 3 Birth Rates
Check the value of r you obtained in activity 2 above using the TI calculator.
Let’s Do It! 4
Data on Milk Production
Milk samples were obtained from 14 Holstein-Friesian cows, and
each was analyzed to determine uric acid concentration (Y),
measured in mol/L. In addition to acid concentration, the total milk
production (X), measured in kg/day, was recorded for each cow.
The data was entered into a computer and the following regression
output was obtained.
(a) What is the equation of the least squares regression line?
(b) What is the correlation between x and y?
r = __________.
(c) We want to assess whether the linear relationship is significant by
testing the following hypothesis.
Main Hypothesis: The slope of the regression line equals 0.
Circle the p-value in the output that is used to test this hypothesis.
At the level of significance of 0.05, we would (circle one):
Accept H0
Reject H0
Can’t Tell
THE SQUARED CORRELATION $r^2$: WHAT DOES IT
TELL US?
r = correlation coefficient, gives the strength and the direction of
the linear relationship between two quantitative variables x and y;
$-1 \le r \le 1$.
Note that when we square r we get $0 \le r^2 \le 1$. The value of $r^2$
is the proportion of the variation in the dependent variable y that is
explained by the independent variable x.
The quantity $r^2$ is generally denoted in computer output as $R^2$, and
is often reported as a percent.
$r^2 = 0.75$ => about 75% of the variation in the response variable
y can be explained by the linear relationship between x and y.
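As a concrete check, $r^2$ can be recovered from the Analysis of Variance table of the putting output shown earlier in these notes. In simple linear regression, $r^2$ equals the regression sum of squares divided by the total sum of squares (regression plus residual). The Python sketch below is an illustration and not part of the original notes.

```python
# Sums of squares from the putting regression output (Dep var: PERCENT).
ss_regression = 4276.908
ss_residual = 1779.815

# Share of the total variation in y explained by the linear relationship with x.
r_squared = ss_regression / (ss_regression + ss_residual)
print(round(r_squared, 3))        # 0.706, matching "Squared multiple R: .706"

# Squaring the correlation r = -0.84 gives the same value after rounding.
print(round((-0.840) ** 2, 3))    # 0.706
```

So about 71% of the variation in the percentage of putts made is explained by the linear relationship with putt length.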
Homework page 546: 2, 3, 4, 14, 15, 22, 36, 37, 39