Analytical Methods I

Chapter 3: Examining Relationships
3.1 Scatterplots
3.2 Correlation
3.3 Least-Squares Regression
[Figure: scatterplot of Fabric Tenacity (lb/oz/yd^2) vs. Fiber Tenacity (g/den) with fitted line y = 3.9951x + 4.5711, R^2 = 0.9454]
1
Relationship Between Fiber Tenacity and Fabric Tenacity
Fiber Tenacity, g/den    Fabric Tenacity, lb/oz/yd^2
3.6                      19.0
3.9                      20.5
4.1                      20.8
4.3                      21.0
4.8                      23.0
5.0                      24.9
2
Variable Designations
• Which variable is the dependent variable?
– Our text uses the term response variable.
• Which variable is the independent variable?
– Explanatory variable
• Note: Sometimes we do not have a clear explanatory-response variable situation … we may just want to look at
the relationship between two variables.
• Problems 3.1 and 3.4, p. 123
3
Scatterplot 1: Relationship Between Fiber Tenacity and Fabric Tenacity
[Figure: scatterplot with Fabric Tenacity (lb/oz/yd^2) on the vertical axis and Fiber Tenacity (g/den) on the horizontal axis]
Note the placement of the response variable (vertical axis) and the explanatory variable (horizontal axis). Also note
the axis labels and plot title.
4
Problem 3.6, p. 125
• Type data into your calculator.
• Examining a scatterplot:
– Look for the overall pattern and striking
deviations from that pattern.
• Pay particular attention to outliers
– Look at form, direction, and strength of the
relationship.
5
Examining a Scatterplot, cont.
• Form
– Does the relationship appear to be linear?
• Direction
– Positively or negatively associated?
• Strength of Relationship
– How closely do the points follow a clear form?
– In the next section, we will discuss the correlation
coefficient as a numerical measure of strength of
relationship.
6
Scatterplot for 3.6
7
Problem 3.9, p. 129
8
Tips for Drawing Scatterplots
• p. 128
9
Adding a Categorical Variable to a Scatterplot
[Figure: Income (thousands of year-2000 dollars) vs. Year (67 = year 1967), with separate series for Black, Hispanic, White, and Asian]
10
Homework
• Reading: pp. 121-135
11
Practice
• Problems:
– 3.11 (p. 129)
– 3.12 (p. 132)
– 3.16 (p. 136)
12
Figure 3.6, p. 136
13
Which shows the strongest relationship?
[Figure: two scatterplots, one with axes spanning 30 to 60 (horizontal) and 800 to 1600 (vertical), the other spanning 0 to 120 (horizontal) and 200 to 2200 (vertical)]
14
The two plots represent the same data!
• Our eyes are not good at judging the
strength of a relationship.
– We need a method for quantifying the
relationship between two variables.
• The most common measure of relationship is
the Pearson Product Moment correlation
coefficient.
– We generally just say “correlation coefficient.”
15
Correlation Coefficient, r
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$
• The correlation, r, is an average of the products
of the standardized x-values and the
standardized y-values for each pair.
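The definition above can be sketched in a few lines of Python. This is a minimal illustration using the fiber and fabric tenacity pairs from the table earlier in these slides; the function name `correlation` is an assumption made for the example, and only the standard library is used.

```python
from statistics import mean, stdev

# Fiber/fabric tenacity pairs from the table earlier in these slides.
fiber = [3.6, 3.9, 4.1, 4.3, 4.8, 5.0]         # x, g/den
fabric = [19.0, 20.5, 20.8, 21.0, 23.0, 24.9]  # y, lb/oz/yd^2

def correlation(x, y):
    """Pearson r: sum of products of standardized values, divided by n - 1."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (n - 1 divisor)
    products = [((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
                for xi, yi in zip(x, y)]
    return sum(products) / (n - 1)

r = correlation(fiber, fabric)
print(f"r = {r:.3f}")
```

For these six pairs r comes out close to 0.97, a strong positive linear association. Swapping the roles of x and y gives the same value, since the formula treats the two variables symmetrically.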
16
Correlation Coefficient, r
• A correlation coefficient measures these characteristics of
the linear relationship between two variables, x and y.
– Direction of the relationship
• Positive or negative
– Degree of the relationship: How well do the data fit the
linear form being considered?
• A correlation of 1 or -1 represents a perfect linear fit.
• A correlation of 0 indicates no linear relationship.
17
Interpreting Correlation Coefficient, r
• Correlation Applet:
http://www.duxbury.com/authors/mcclellandg/tiein/johnson/correlation.htm
• Facts about correlation
– pp.143-144
• Correlation is not a complete description of two-variable data. We also need to report a complete
numerical summary (means and standard deviations,
5-number summary) of both x and y.
18
Exercise 3.25, p. 146
19
Outlier, or influential point?
• Let’s enter the data into our calculators and
calculate the correlation coefficient. The data
are in the middle two columns of Table 1.10, p.
59.
– r=?
• Now, remove the possible influential point.
What happens to r?
20
21
Exercises: Understanding Correlation
• Review “Facts about correlation,” pp. 143-144
• 3.34, 3.35, and 3.37, p. 149
• Reading: pp. 149-157
22
Relationship Between Winding Tension and Yarn Elongation
[Figure: scatterplot of Elongation (%) vs. Winding Tension (g) with fitted line y = -0.0759x + 9.4455, R^2 = 0.732]
23
Least Squares Regression
• Ultimately, we would like to predict elongation by using
a more practical measurement, winding tension.
– A regression line, also called a line of best fit, was
found.
• How was the line of best fit determined?
– Determine mathematically the vertical distance between the
line and each data point.
– The difference between the actual (y) value and the
predicted value is called a residual (or error).
$$\text{residual} = y_i - \hat{y}_i = \text{error } (e)$$
24
Least Squares Regression: Line of Best Fit
• This could be done for each data point. If we square
each residual and sum all of the squared residuals, we
have:
$$\sum e^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
• The best-fitting line is the line that has the smallest sum
of e^2 ... the least squares regression line! That is, the line
of best fit occurs when:
$$\sum e^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \text{minimum}$$
25
A Residual (Figure 3.11, p. 151)
26
Least-Squares Regression Line
• With the help of algebra and a little calculus, it can be
shown that this occurs when:
$$b = r \frac{s_y}{s_x}$$
$$a = \bar{y} - b\bar{x}$$
$$\hat{y} = a + bx$$
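As a rough check of these formulas, the sketch below computes the slope and intercept for the fiber and fabric tenacity table from earlier in these slides, then confirms that nudging the line in either direction increases the sum of squared residuals. The helper names `least_squares` and `sse` are assumptions made for the example. The coefficients it produces need not match the equation printed on the earlier plot, which may have been fit to unrounded data.

```python
from statistics import mean, stdev

fiber = [3.6, 3.9, 4.1, 4.3, 4.8, 5.0]         # x, g/den
fabric = [19.0, 20.5, 20.8, 21.0, 23.0, 24.9]  # y, lb/oz/yd^2

def least_squares(x, y):
    """Return (a, b) for y_hat = a + b*x using b = r*s_y/s_x, a = y_bar - b*x_bar."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    r = sum((xi - x_bar) * (yi - y_bar)
            for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b

def sse(a, b, x, y):
    """Sum of squared residuals for the line y_hat = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

a, b = least_squares(fiber, fabric)
best = sse(a, b, fiber, fabric)

# Any other line through the data has a larger sum of squared residuals.
for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    assert sse(a + da, b + db, fiber, fabric) > best

print(f"y_hat = {a:.2f} + {b:.2f}x")
```

The four perturbed lines are only a spot check, not a proof; the calculus argument mentioned on the slide is what guarantees the minimum.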
27
Exercise 3.12, p. 132
• Is there a relationship between lean body mass
and resting metabolic rate for females?
– Quantify this relationship.
• Find the line of best fit (the least-squares
regression, LSR).
• Use the LSR to predict the resting metabolic
rate for a woman with mass of 45 kg and for a
woman with mass of 59.5 kg.
28
Interpreting the Regression Model
• The slope of the regression line is important for
the interpretation of the data:
– The slope is the rate of change of the response
variable with a one unit change in the
explanatory variable.
• The intercept is the value of y-predicted when
x=0. It is statistically meaningful only when x
can actually take values close to zero.
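To make the slope and intercept concrete, here is a small sketch that uses the fitted line from the winding tension plot earlier in these slides (y = -0.0759x + 9.4455). The function name `predict_elongation` is hypothetical, chosen only for this example.

```python
# Coefficients read from the winding tension plot in these slides:
# elongation (%) = 9.4455 - 0.0759 * winding tension (g)
INTERCEPT = 9.4455
SLOPE = -0.0759

def predict_elongation(tension_g):
    """Predicted yarn elongation (%) at a given winding tension (g)."""
    return INTERCEPT + SLOPE * tension_g

# Slope: the change in predicted elongation for a one-gram increase in tension.
change_per_gram = predict_elongation(21) - predict_elongation(20)
print(f"change per 1 g of tension: {change_per_gram:.4f} percentage points")

# Intercept: the prediction at x = 0, which lies outside the plotted
# tension range (roughly 10 to 35 g), so it should not be over-interpreted.
print(f"prediction at 0 g: {predict_elongation(0):.4f} %")
print(f"prediction at 20 g: {predict_elongation(20):.4f} %")
```

Each additional gram of winding tension lowers the predicted elongation by about 0.076 percentage points. The intercept is just where the line crosses the vertical axis; no yarn in the plot was wound at a tension near zero.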
29
R^2: Coefficient of Determination
• Proportion of variability in one variable that can be
associated with (or predicted by) the variability of the
other variable.
r = 0.85, r^2 = 0.72
1 - r^2 = 0.28
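One way to see the "proportion of variability" reading is to compute r^2 as 1 - SSE/SST, where SSE is the left-over variation around the regression line and SST is the total variation in y. The sketch below does this for the fiber and fabric tenacity table from earlier in these slides; the function name `r_squared` is an assumption made for the example.

```python
from statistics import mean

fiber = [3.6, 3.9, 4.1, 4.3, 4.8, 5.0]         # x, g/den
fabric = [19.0, 20.5, 20.8, 21.0, 23.0, 24.9]  # y, lb/oz/yd^2

def r_squared(x, y):
    """Coefficient of determination as 1 - SSE/SST for the least-squares line."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # left-over variation
    sst = sum((yi - y_bar) ** 2 for yi in y)                     # total variation in y
    return 1 - sse / sst

r2 = r_squared(fiber, fabric)
print(f"explained share = {r2:.2f}, unexplained share = {1 - r2:.2f}")

# The slide's own arithmetic: r = 0.85 gives r^2 of about 0.72.
r_slide = 0.85
print(round(r_slide ** 2, 2), round(1 - r_slide ** 2, 2))
```

For the tenacity data, roughly 94% of the variation in fabric tenacity is associated with variation in fiber tenacity, leaving about 6% unexplained by the line.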
30
Exercise 3.45, p. 166
31
Exercise 3.45, p. 166
32
Residuals
• In regression, we see deviations by looking at
the scatter of points about the regression line.
The vertical distances from the points to the
least-squares regression line are as small as
possible, in the sense that they have the smallest
possible sum of squares.
• Because they represent “left-over” variation in
the response after fitting the regression line,
these distances are called residuals.
33
Examining the Residuals
• The residuals show how far the data fall from
our regression line, so examining the residuals
helps us to assess how well the line describes
the data.
– Residuals Plot
34
Residuals Plot
• Let’s construct a residuals plot, that is, a plot of
the residuals (vertical axis) against the explanatory variable (horizontal axis).
– pp. 174-175
• The residuals plot helps us to assess the fit of
the least squares regression line.
– We are looking for similar spread about the line
y=0 (why?) for all levels of the explanatory
variable.
35
Residuals Plot Interpretation, cont.
• A curved or other definite pattern shows an
underlying relationship that is not linear.
– Figure 3.19(b), p. 170
• Increasing or decreasing spread about the line
as x increases indicates that prediction of y will
be less accurate for smaller or larger x.
– Figure 3.19(c), p. 171
• Look for outliers!
36
Figures 3.19 (a-c), pp. 170-171
37
How to create a residuals plot
• Create regression model using your calculator.
• Create a column in your STAT menu for residuals.
Remember that a residual is the actual value minus
the predicted value:
$$\text{residual} = y - \hat{y}$$
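The same residuals column can be sketched outside the calculator. The example below fits the least-squares line to the fiber and fabric tenacity table from earlier in these slides and lists actual minus predicted for each point; the variable names are assumptions made for the example.

```python
from statistics import mean

fiber = [3.6, 3.9, 4.1, 4.3, 4.8, 5.0]         # x, g/den
fabric = [19.0, 20.5, 20.8, 21.0, 23.0, 24.9]  # y, lb/oz/yd^2

# Fit the least-squares line y_hat = a + b*x.
x_bar, y_bar = mean(fiber), mean(fabric)
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(fiber, fabric))
     / sum((xi - x_bar) ** 2 for xi in fiber))
a = y_bar - b * x_bar

# Residual = actual value minus predicted value.
predicted = [a + b * xi for xi in fiber]
residuals = [yi - y_hat for yi, y_hat in zip(fabric, predicted)]

for xi, e in zip(fiber, residuals):
    print(f"x = {xi:.1f}  residual = {e:+.3f}")
```

Plotting these residuals against x gives the residuals plot. For a least-squares line the residuals always sum to zero (up to rounding), which is why the plot is read relative to the horizontal line y = 0.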
38
Residuals Plot for 3.45
39
HW
• Read through end of chapter
• Problems:
– 3.42 and 3.43 (parts a and b only), p. 165
– 3.46, p. 173
• Chapter 3 Test on Monday
40
Regression Outliers and Influential Observations
• A regression outlier is an observation that lies outside
the overall pattern of the other observations.
• An observation is influential for a statistical calculation
if removing it would markedly change the result of the
calculation.
– Points that are outliers in the x direction of a scatterplot
are often influential for the least-squares regression line.
• Sometimes, however, the point is not influential when it
falls in line with the remaining data points.
– Note: An influential point may be an outlier in terms of
x, but we label it as “influential” if removing it
significantly influences the regression.
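The difference between an x-outlier that is influential and one that is not can be sketched numerically. The example below starts from the fiber and fabric tenacity table in these slides and adds one extra point at x = 8.0. Both extra points are hypothetical, invented purely for illustration: one falls in line with the existing pattern, the other does not.

```python
from statistics import mean

def slope(points):
    """Least-squares slope for a list of (x, y) pairs."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

# Tenacity pairs from the table in these slides.
data = [(3.6, 19.0), (3.9, 20.5), (4.1, 20.8),
        (4.3, 21.0), (4.8, 23.0), (5.0, 24.9)]

# Hypothetical extra observations, both far out in the x direction.
in_line = (8.0, 35.6)      # falls close to the existing linear pattern
off_pattern = (8.0, 22.0)  # falls well below the existing pattern

base = slope(data)
with_in_line = slope(data + [in_line])
with_off_pattern = slope(data + [off_pattern])

print(f"slope, original data:       {base:.2f}")
print(f"slope, plus in-line point:  {with_in_line:.2f}")
print(f"slope, plus off-pattern pt: {with_off_pattern:.2f}")
```

The in-line point barely moves the slope, so it is an outlier in x but not influential. The off-pattern point pulls the slope down sharply, so removing it would markedly change the regression.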
41
Practice Problems
• Problems:
– 3.56, p. 179
– 3.74, p. 188
– 3.76, p. 189
42
Preparing for the Test
• Re-read chapter.
– Know the terms, big concepts.
• Chapter Review, pp. 181-182
• Go back over example and HW problems.
• Study slides!
43