Summary Chapters 5 to 7

Chapters 5-7 Correlation/Linear
Regression
• Linear Relationships: If the explanatory
and response variables show a straight-line pattern,
then we say they follow a linear relationship.
• Curved relationships and clusters are
other forms to watch for.
• Direction: If the relationship has a clear
direction, we speak of either positive
association or negative association.
• Positive association: high values of the
two variables tend to occur together
• Negative association: high values of one
variable tend to occur with low values of
the other variable.
• Correlation is a number that measures
the strength and direction of a linear relationship
between two quantitative variables.
• Correlation is always between -1 and 1,
inclusive
• The sign of the correlation coefficient
indicates whether the association between
the variables is positive or negative
• Strong correlation: r is between 0.8 and 1
or between -1 and -0.8
• Moderate correlation: r is between 0.5
and 0.8 or between -0.8 and -0.5
• Weak correlation: r is between 0 and 0.5
or between -0.5 and 0
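The cutoffs above can be written as a small helper. This is only a sketch: the function name `describe_correlation` is made up for illustration, and placing the boundary values 0.5 and 0.8 in the higher category is an assumption, since the slide does not say where they belong.

```python
def describe_correlation(r: float) -> str:
    """Label r using the slide's cutoffs (0.5 and 0.8)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)
    # Assumption: exactly 0.8 counts as strong, exactly 0.5 as moderate
    if size >= 0.8:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    else:
        strength = "weak"
    # The sign of r gives the direction of the association
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "no linear association"
    return f"{strength}, {direction}"


for value in (0.96, 0.71, -0.67, -0.10):
    print(value, "->", describe_correlation(value))
```

These labels are rules of thumb for this course, not universal definitions.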
• Correlation does not distinguish between
X and Y
• Correlation is unitless
• Correlation measures the strength of linear
relationship between two quantitative
variables
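A short check of the first two properties, using the fat and calories values from the example in this deck. The helper name `correlation` and the conversion to kilojoules (which assumes the calories are food kilocalories) are illustrative choices, not part of the slides.

```python
from math import sqrt

def correlation(x, y):
    """Pearson correlation coefficient r for two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

fat = [19, 31, 34, 35, 39, 39, 43]
calories = [410, 580, 590, 570, 640, 680, 660]

r_xy = correlation(fat, calories)
r_yx = correlation(calories, fat)           # swap the roles of X and Y
kilojoules = [4.184 * c for c in calories]  # change of units (assumed kcal -> kJ)
r_kj = correlation(fat, kilojoules)

print(round(r_xy, 4), round(r_yx, 4), round(r_kj, 4))
```

All three printed values should agree, since swapping the variables or rescaling one of them by a positive factor leaves r unchanged.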
Choose the best description of the
scatter plot
A. Moderate, negative, linear association
B. Strong, curved association
C. Moderate, positive, linear association
D. Strong, negative, non-linear association
E. Weak, positive, linear association
Which of the following values is most likely to
represent the correlation coefficient for the data shown
in this scatterplot?
A. r = -0.67
B. r = -0.10
C. r = 0.71
D. r = 0.96
E. r = 1.00
Cautions about Correlation
• It should only be used
– To describe the relationship between 2
QUANTITATIVE variables
– When the association is “linear enough”
– When there are no outliers
• Correlation does NOT imply causation
A teacher at an elementary school measures the
heights of children on the playground and then makes a
scatter plot of the children’s heights and reading test
scores. The data meet the conditions for correlation so
she calculates r = .79. Which conclusion is most
accurate?
A. Being taller causes students to read better
B. Being shorter causes students to read better
C. Taller students tend to have better reading
scores
D. Shorter students tend to have better reading
scores
• Linear relationships are the easiest to understand and analyze
• Relationships are often linear
• Variables with a non-linear relationship can often be
made linear through an appropriate transformation
• Even when a relationship is non-linear, a linear model
may provide an accurate approximation over a limited
range of values.
• Strength: The strength of a linear relationship is
determined by how close the points in the scatterplot lie
to a straight line
Least Squares Regression Line Calculations
• Not all data fall on a straight line!
• Residual = Data – Model, or
• Residual = Observed y – Predicted y
Example
X = Fat    Y = Calories
19         410
31         580
34         590
35         570
39         640
39         680
43         660
[Figure: Scatterplot of C2 vs C1; C1 axis from 20 to 45, C2 axis from 400 to 700]
Fitted Line Plot
C2 = 211.0 + 11.06 C1
S = 27.3340, R-Sq = 92.3%, R-Sq(adj) = 90.7%
[Figure: fitted line plot of C2 vs C1]
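The fitted line and its summary numbers can be reproduced by hand from the fat and calories data. The sketch below uses the standard least squares formulas in plain Python and assumes C1 is fat and C2 is calories, which the slides do not state but which matches the reported equation.

```python
from math import sqrt

fat = [19, 31, 34, 35, 39, 39, 43]                 # C1 (assumed)
calories = [410, 580, 590, 570, 640, 680, 660]     # C2 (assumed)

n = len(fat)
mean_x = sum(fat) / n
mean_y = sum(calories) / n

# Sums of squares and cross-products about the means
sxx = sum((x - mean_x) ** 2 for x in fat)
syy = sum((y - mean_y) ** 2 for y in calories)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(fat, calories))

slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual = observed y - predicted y
predicted = [intercept + slope * x for x in fat]
residuals = [y - p for y, p in zip(calories, predicted)]

sse = sum(e ** 2 for e in residuals)
r_squared = 1 - sse / syy
s = sqrt(sse / (n - 2))          # residual standard deviation (Minitab's S)

print(f"calories = {intercept:.1f} + {slope:.2f} * fat")
print(f"S = {s:.4f}, R-Sq = {100 * r_squared:.1f}%")
```

The residuals from a least squares line always sum to zero, which is a quick way to check the arithmetic.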
• S = 27.3340, R-Sq = 92.3%, R-Sq(adj) = 90.7%
[Figure: Residual plot, residuals versus observation order (response is C1); residual axis from -3 to 3, observation order 1 to 7]
• Extrapolation: Using the model to predict
beyond the range of the data
• Outliers: Regression models are sensitive
to outliers
• Leverage: A data point whose x value is far
from the mean of the x values has high leverage
• A point with high leverage has the
potential to change the regression line.
• Influential: A point is influential if omitting it
from the analysis gives a very different
model.
• Influence depends on leverage and
residual
• Lurking variables: A variable that is not
included in the construction of the linear
model/study.
• Lurking variables may influence correlation
and regression models.
• Association is not causation!
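Leverage and influence can be explored numerically on the fat and calories data from the earlier example. The slides describe leverage only in words; the formula h = 1/n + (x - mean of x)^2 / Sxx used here is the usual textbook measure for a one-predictor line, brought in as an assumption. Influence is checked directly by refitting with each point left out.

```python
fat = [19, 31, 34, 35, 39, 39, 43]
calories = [410, 580, 590, 570, 640, 680, 660]

def fit_line(x, y):
    """Return (intercept, slope) of the least squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def leverages(x):
    """Leverage of each point: 1/n + (x_i - mean)^2 / Sxx."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    return [1 / n + (a - mean_x) ** 2 / sxx for a in x]

h = leverages(fat)
full_intercept, full_slope = fit_line(fat, calories)

# Refit with each point omitted and record the new slope
slopes_without = []
for i in range(len(fat)):
    x_rest = fat[:i] + fat[i + 1:]
    y_rest = calories[:i] + calories[i + 1:]
    slopes_without.append(fit_line(x_rest, y_rest)[1])

print(f"slope with all points: {full_slope:.2f}")
for i, (lev, slope) in enumerate(zip(h, slopes_without)):
    print(f"x={fat[i]}: leverage={lev:.3f}, slope without it={slope:.2f}")
```

The point with fat = 19 sits farthest from the mean of x, so it has the largest leverage, and omitting it changes the slope noticeably.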
Summary
• r is a number between -1 and 1
• r = 1 or r = -1 indicates a perfect
correlation case where all data points lie
on a straight line
• r > 0 indicates positive association
• r < 0 indicates negative association
• r value does not change when units of
measurement are changed (correlation
has no units!)
• Correlation treats X and Y symmetrically.
The correlation of X with Y is the same as
the correlation of Y with X
• Quantitative variable condition: Do not
apply correlation to categorical variables
• Correlation can be misleading if the
relationship is not linear
• Outliers distort correlation dramatically.
Report correlation with/without outliers.
More Examples for Checking the “Linear Enough” Condition
All four data sets have r = .82
In which case is a linear model appropriate?
[Figure: four scatterplots of Y against X, labeled A, B, C, and D]
A. Linear model appropriate; residual plot shows no pattern
B. Linear model not appropriate; clear pattern of residuals
C. Graph has an outlier; outlier is clear on the residual plot
D. Linear model not appropriate; clear pattern of residuals
[Figure: for each of A to D, a scatterplot of Y against X with its residual plot]
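A residual check like the ones described above can be sketched without a plot. The data here are made up on purpose (y = x squared) so that the relationship is curved by construction; they are not one of the four data sets on the slides.

```python
from math import sqrt

def fit_line(x, y):
    """Return (intercept, slope) of the least squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Illustrative curved data, not taken from the slides
x = list(range(1, 10))
y = [value ** 2 for value in x]

intercept, slope = fit_line(x, y)
residuals = [obs - (intercept + slope * xi) for xi, obs in zip(x, y)]

# Correlation is still high even though the pattern is curved
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
r = sxy / sqrt(sxx * syy)

# A run of same-sign residuals signals a pattern the line missed
signs = "".join("+" if e > 0 else "-" for e in residuals)
print(f"r = {r:.3f}")
print("residual signs in x order:", signs)
```

The residuals are positive at both ends and negative in the middle, which is the kind of clear pattern that rules out a linear model even when r is large.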
Calculating r with the TI-83/84
• The first time you do this:
– Press 2nd, CATALOG (above 0)
– Scroll down to DiagnosticOn
– Press ENTER, ENTER
– Read “Done”
– Your calculator will remember this setting
even when turned off
• Press STAT, ENTER
• If there are old values in L1:
– Highlight L1, press CLEAR, then ENTER
• If there are old values in L2:
– Highlight L2, press CLEAR, then ENTER
• Enter predictor (x) values in L1
• Enter response (y) values in L2
– Pairs must line up
– There must be the same number of predictor and
response values
• Press STAT, then the right arrow key (to CALC)
• Scroll down to LinReg(ax+b), press ENTER,
ENTER
• Read r at bottom of screen
Re-Expression with the TI-83/84
• Most common re-expressions are built in.
• To see what’s available, try
– STAT
– CALC
– Scroll down to see
• 5:QuadReg
• 6:CubicReg
• 9:LnReg
• 0:ExpReg
• A:PwrReg
Example
X: Age in months    Y: Height in inches
18                  29.9
19                  30.3
20                  30.7
21                  31
22                  31.38
23                  31.45
24                  31.9
• Linear Model: Height = 24.212 + .321 * Age
• Correlation: r = .992
Examples
• Age = 24 months, Observed Height = 31.9
• Predicted Height = 24.212 + .321 * 24 = 31.916
• Residual = 31.9 – 31.916 = -0.016
• Age = 20 years (20*12 = 240 months)
• Predicted Height = 24.212 + .321 * 240 ≈ 101.3 inches, about 8.4 ft!
• Residual = BIG!
• Beware of extrapolation!
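The age and height example can be reworked from the raw data. This sketch fits the line with the usual least squares formulas and keeps full precision, so its predictions differ slightly from the slide values, which use the rounded slope .321. The variable and function names are illustrative.

```python
from math import sqrt

age_months = [18, 19, 20, 21, 22, 23, 24]
height_in = [29.9, 30.3, 30.7, 31, 31.38, 31.45, 31.9]

n = len(age_months)
mean_x = sum(age_months) / n
mean_y = sum(height_in) / n
sxx = sum((x - mean_x) ** 2 for x in age_months)
syy = sum((y - mean_y) ** 2 for y in height_in)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(age_months, height_in))

b1 = sxy / sxx                 # slope, inches per month
b0 = mean_y - b1 * mean_x      # intercept, inches
r = sxy / sqrt(sxx * syy)

def predict_height(age):
    """Predicted height in inches for an age in months."""
    return b0 + b1 * age

# Inside the data range: a small residual
residual_24 = height_in[-1] - predict_height(24)

# Far outside the data range: 20 years = 240 months
extrapolated_ft = predict_height(240) / 12

print(f"Height = {b0:.3f} + {b1:.3f} * Age, r = {r:.3f}")
print(f"residual at 24 months: {residual_24:.3f} inches")
print(f"predicted height at 20 years: {extrapolated_ft:.1f} ft")
```

The model fits well between 18 and 24 months, but nothing in the data supports using it at 240 months.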
Example
4. Relationship between calories and sugar
content: A researcher tracked the sugar content
and calories of 15 baked goods and found the
following information:
Average sugar content: 7.0 grams
Standard deviation of sugar content: 4.4 grams
Average calories: 107.0
Standard deviation of calories: 19.5
Correlation between sugar content and calories:
0.564
Solution to Example
a) Find a linear model that describes this example:
b_{1} = r S_{y}/S_{x} = 0.564*19.5/4.4 ≈ 2.50 calories per
gram of sugar
b_{0} = mean of Y – b_{1} * mean of X = 107 – 2.50*7 = 89.5
Linear Model:
y = b_{0} + b_{1}x
y = 89.5 + 2.50x or better
calories = 89.5 + 2.50 * sugar
b) How many calories are there in a muffin with 6.5 grams
of sugar?
calories = 89.5 + 2.50 * 6.5 = 105.75
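The same calculation can be done from the summary statistics alone. The function name `line_from_summary` is hypothetical; the formulas are the ones used in the solution above. Because the code does not round the slope to 2.50 first, its results differ from the slide in the third decimal place.

```python
def line_from_summary(r, mean_x, sd_x, mean_y, sd_y):
    """Return (intercept, slope) of the regression line from summary statistics."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Values from the baked goods example
b0, b1 = line_from_summary(r=0.564, mean_x=7.0, sd_x=4.4, mean_y=107.0, sd_y=19.5)

sugar_grams = 6.5
predicted_calories = b0 + b1 * sugar_grams

print(f"calories = {b0:.1f} + {b1:.2f} * sugar")
print(f"predicted calories at {sugar_grams} g of sugar: {predicted_calories:.2f}")
```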
Chapters 5-7 Correlation/Linear Regression: Re-expressing Data
• Example: The data show the number of
academic journals published on the
Internet during the last decade.
Year    # of Journals
1991    27
1992    36
1993    45
1994    181
1995    306
1996    1093
1997    2459
[Figure: Number of Journals (0 to 3000) plotted against Year (1990 to 1998)]
• Re-express data to linearize:
Year    Log(# journals)
1991    1.431364
1992    1.556303
1993    1.653213
1994    2.257679
1995    2.485721
1996    3.03862
1997    3.390759
[Figure: Log(Number of Journals) plotted against Year (1990 to 1998)]
[Figure: Log(# of Journals) plotted against Year coded 0 to 8, with two series (Series1 and Series2)]
Chapter 10 Re-expressing Data
• Least Squares Regression Line has the
following equation:
Log(journals) = 1.22 + 0.346 * Year
where Year is counted from 1991 (1991 = 0, so 2000 = 9)
Problem:
How many journals will be published online
in year 2000?
Answer
Log(journals) = 1.22 + 0.346*9 = 4.334
journals = 10^(4.334) ≈ 21,577
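The re-expression and the prediction can be reproduced from the journal counts. This sketch assumes base-10 logs and codes Year as years since 1991, which is consistent with the tabulated log values and with the slide's use of 9 for the year 2000. It keeps unrounded coefficients, so the prediction comes out a little below the slide's figure.

```python
from math import log10

years = [1991, 1992, 1993, 1994, 1995, 1996, 1997]
journals = [27, 36, 45, 181, 306, 1093, 2459]

# Assumption: Year is coded as years since 1991 (1991 -> 0)
x = [year - 1991 for year in years]
log_journals = [log10(count) for count in journals]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(log_journals) / n
sxx = sum((a - mean_x) ** 2 for a in x)
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, log_journals))

b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

# Predict on the log scale, then undo the log with a power of 10
log_prediction = b0 + b1 * (2000 - 1991)
predicted_journals = 10 ** log_prediction

print(f"Log(journals) = {b0:.2f} + {b1:.3f} * Year")
print(f"predicted journals in 2000: about {predicted_journals:.0f}")
```

Year 2000 lies outside the 1991 to 1997 data, so this prediction is an extrapolation and should be read with caution.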
Why re-express data?
1. Make the distribution of a variable more
symmetric
2. Make the spread of several groups more alike,
even if their centers differ
3. Make the form of a scatterplot more nearly
linear
4. Make the scatter in a scatterplot spread out
more evenly rather than thickening at one end.
The Ladder of Powers:
Power 2: the square of the data values, y^2
– Try this for unimodal distributions that are skewed to
the left.
Power 1: No change at all
Power ½: the square root of the data values,
y^(1/2)
– Try this for counted data
Power 0: the logarithm of the data values, log(y)
– Try this for measurements that cannot be negative
– Especially those that grow by percentage increases
– Salaries and populations are good examples.
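One way to try the rungs of the ladder is to re-express the response and see which version is most nearly linear against x. The sketch below uses the journal counts from the earlier example; the helper names `reexpress` and `correlation` are made up for illustration, and comparing r values is only a rough guide that does not replace looking at the scatterplot and residuals.

```python
from math import log10, sqrt

def reexpress(values, power):
    """Apply one rung of the ladder of powers; power 0 means log10."""
    if power == 0:
        return [log10(v) for v in values]   # needs strictly positive values
    return [v ** power for v in values]

def correlation(x, y):
    """Pearson correlation coefficient r."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

years = [1991, 1992, 1993, 1994, 1995, 1996, 1997]
journals = [27, 36, 45, 181, 306, 1093, 2459]

r_raw = correlation(years, reexpress(journals, 1))     # power 1: no change
r_sqrt = correlation(years, reexpress(journals, 0.5))  # power 1/2: square root
r_log = correlation(years, reexpress(journals, 0))     # power 0: logarithm

print(f"power 1:   r = {r_raw:.3f}")
print(f"power 1/2: r = {r_sqrt:.3f}")
print(f"power 0:   r = {r_log:.3f}")
```

For these counts, which grow by roughly a constant percentage each year, the log rung gives the straightest relationship of the three.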