Uploaded by Leah Mcconnell

Lesson 1 Scatter Plots, Correlation, Causation, & Regression Lines

Lesson 1: Scatter Plots, Correlation, Causation, & Regression Lines
The Candy Grab Investigation
1. Measure the span of your hand to the nearest half centimeter. Hand span is the distance
from the tip of the thumb to the tip of the pinkie finger on your fully stretched-out hand.
2. One student at a time, go to the front of the class and use your dominant hand to grab as
many candies as possible from the container. You must grab the candies with your fingers
pointing down (no scooping!) and hold the candies for 2 seconds before counting them.
After counting, put the candy back into the container.
3. On the board, record your hand span and number of candies in the table.
4. While other students record their values on the board, copy the table onto a piece of
paper and make a graph. Begin by constructing a set of coordinate axes. Label the
horizontal axis “Hand span (cm)” and the vertical axis “Number of candies.” Choose an
appropriate scale for each axis and plot each point from your class data table as accurately
as you can on the graph.
5. What does the graph tell you about the relationship between hand span and number of
candies? Summarize your observations in a sentence or two.
Vocabulary
A ____________________________ variable measures an outcome of a study.
These variables are often called __________________________ variables.
An ___________________________ variable may help predict or explain changes in a
_______________________________ variable.
These variables are often called __________________________ variables.
Identify the explanatory variable and response variable for the following
relationships, if possible. Explain your reasoning.
A. For the candy activity you participated in.
B. The weight (in carats) and the price (in dollars) for a sample of diamonds.
C. The SAT math score and the SAT evidence-based reading and writing score for a
sample of students.
Vocabulary
A ________________________________________ is the best way to display the relationship between
two quantitative variables. The values of one variable appear on the horizontal axis, and
the values of the other variable appear on the vertical axis. Each individual in the data set
appears as a point on the graph.
Two variables have a _______________________________ association when above-average values
of one variable tend to accompany above-average values of the other variable and when
below-average values also tend to occur together.
Two variables have a _______________________________ association when above-average values
of one variable tend to accompany below-average values of the other variable.
There is ____________________________________ association between two variables if knowing the
value of one variable does not help us predict the value of the other variable.
How to describe a scatterplot:
 Direction: A scatterplot can show a positive association, negative association, or no
association.
 Form: A scatterplot can show a linear form or a nonlinear form. The form is linear if
the overall pattern follows a straight line. Otherwise, the form is nonlinear.
 Strength: A scatterplot can show a weak, moderate, or strong association. An
association is strong if the points don’t deviate much from the form identified. An
association is weak if the points deviate quite a bit from the form identified.
 Unusual features: Look for outliers that fall outside the overall pattern and distinct
clusters of points.
Describe each of the following.
A. The scatterplot created by the class in our activity.
B. The scatterplot shows the relationship between the duration (in minutes) of an
eruption and the interval of time until the next eruption (in minutes) of Old Faithful
during a particular month.
C. The scatterplot shows the relationship between the average income (gross
domestic product per person, in dollars) and fertility rate (number of children per
woman) in 187 countries.
For a linear association, mathematicians have defined a measure of the direction and
strength of the relationship. This measure is called the correlation coefficient and is
represented by r. Its value always falls between –1 and 1. The closer r is to –1, the
stronger the negative correlation. The closer r is to 1, the stronger the positive
correlation. Remember these limitations of r: Correlation does not imply
causation. The correlation is not resistant, so outliers can greatly change the value of r.
The correlation should only be used to describe linear relationships. Correlation ignores
the distinction between explanatory and response variables. The value of r does not have
units and is not affected by changes in the unit of measurement of either variable.
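Several of these properties can be checked numerically. The following is a minimal Python sketch and is not part of the original lesson: the `correlation` helper is one common way to compute r, and the height and step values are invented purely for illustration.

```python
from math import sqrt

def correlation(xs, ys):
    """Correlation coefficient r for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, invented for this sketch (not from the lesson).
heights_in = [60, 62, 65, 68, 70, 72]   # heights in inches
steps = [52, 50, 47, 46, 43, 41]        # steps to walk a hallway

r = correlation(heights_in, steps)

# Swapping the explanatory and response variables leaves r unchanged.
r_swapped = correlation(steps, heights_in)

# Changing units (inches to centimeters) leaves r unchanged.
heights_cm = [2.54 * h for h in heights_in]
r_cm = correlation(heights_cm, steps)

# One unusual point can change r a great deal (r is not resistant).
r_outlier = correlation(heights_in + [75], steps + [60])

print(f"r = {r:.3f}, swapped = {r_swapped:.3f}, in cm = {r_cm:.3f}")
print(f"with one outlier added: r = {r_outlier:.3f}")
```

Running the sketch with your own class data is a quick way to see which changes affect r and which do not.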
Example Here is the scatterplot showing the relationship between payrolls (in millions of
dollars) and wins for MLB teams in 2016. For these data, r = 0.613. Interpret the value of r.
A correlation measures the relationship between two variables (x and y).
When there is a CORRELATION we are saying that as one variable changes the other
also changes either positively or negatively. There is some type of "mutual activity" in
their change.
CAUSATION is a special correlation that means one variable directly caused a change in
the other variable.
Example: The correlation for the data shown is 𝑟 = 0.97. Does an increase in skiing
revenue cause more people to die by becoming tangled in their bedsheets?
Example: Most people love chocolate for its great taste. But does it also make you
smarter? A scatterplot like this one recently appeared in the New England Journal of
Medicine. The explanatory variable is the chocolate consumption per person for a sample
of countries. The response variable is the number of Nobel Prizes per 10 million residents
of that country.
A. If people in the United States started eating more chocolate, could we expect more
Nobel Prizes to be awarded to residents of the United States? Explain.
B. What effect does Switzerland have on the correlation? Explain.
Calculating Correlation
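The formula for correlation is:
\[ r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) \]
Here n is the number of individuals, \bar{x} and \bar{y} are the means of the two variables, and s_x and s_y are their standard deviations.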
Example A student wonders if tall women tend to date taller men than do short women.
She measures herself, her dormitory roommate, and the women in the adjoining dorm
rooms. Then she measures the next man each woman dates. Here are the data (heights in
inches):
Find the correlation. Check your work in the calculator.
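One way to check a hand calculation is to code the formula directly: standardize each value, multiply the paired z-scores, add the products, and divide by n − 1. The sketch below is an illustration only. The data table for this example is not reproduced here, so the heights in the code are placeholders invented for the sketch, and `correlation_from_z_scores` is a hypothetical helper name.

```python
from statistics import mean, stdev

def correlation_from_z_scores(xs, ys):
    """r = (1 / (n - 1)) * sum of (z-score of x) * (z-score of y)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    products = [
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(xs, ys)
    ]
    return sum(products) / (n - 1)

# Placeholder heights in inches, invented for this sketch.
women = [64, 65, 66, 67, 68]
men = [68, 70, 69, 72, 71]

r = correlation_from_z_scores(women, men)
print(f"r = {r:.3f}")
```

Replace the two lists with the heights from your own table to compare against the calculator's value of r.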
Example: The following scatterplot shows the height (in inches) and number of steps
needed for a random sample of 36 students to walk the length of a school hallway. The
correlation is r = −0.632.
A. Explain why it isn’t correct to say that the correlation is −0.632 steps per inch.
B. What would happen to the correlation if number of steps was used as the explanatory
variable and height was used as the response variable?
C. What would happen to the correlation if height was measured in centimeters instead
of inches? Explain.
A ______________________________ line is a line that describes how a response variable y changes as an
explanatory variable x changes. Regression lines are expressed in the form: y = ax + b, where a is the slope and b is the y-intercept.
Exercise #1: The low and high temperatures, in degrees Fahrenheit, were recorded on 10 days in the month of April to try
to establish a relationship between a day’s low temperature and its high temperature.
Low Temperature, x:   26  28  30  32  34  35  37  38  41  45
High Temperature, y:  49  50  57  54  60  58  64  66  63  72
(a) Enter the data into lists on your calculator and create a scatter plot using your graphing technology. What do
you notice about the data?
(b) Use your calculator to find the equation for the line of best fit. Round the slope of the line to the nearest
hundredth and the y-intercept to the nearest integer.
(c) Explain what the y-intercept of this model represents in terms of the low and high temperatures that are
being modeled in this problem.
(d) How would you interpret the slope of this model in terms of how the low and high temperatures change with
respect to each other?
(e) What is the correlation coefficient of the data? What does this indicate?
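If a calculator is not at hand, the line of best fit can also be computed from the usual least-squares formulas: the slope is the sum of the products of the deviations divided by the sum of the squared x deviations, and the line passes through the point of means. The following Python sketch assumes the calculator's linear regression command uses these same formulas; `fit_line` is a hypothetical helper name, and the data are the temperatures from the table in this exercise.

```python
def fit_line(xs, ys):
    """Least-squares slope and y-intercept for y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar  # the line passes through (x_bar, y_bar)
    return slope, intercept

# Low and high temperatures from the Exercise #1 table.
low = [26, 28, 30, 32, 34, 35, 37, 38, 41, 45]
high = [49, 50, 57, 54, 60, 58, 64, 66, 63, 72]

slope, intercept = fit_line(low, high)
# Rounded as part (b) asks: slope to the hundredth, intercept to the integer.
print(f"y = {slope:.2f}x + {intercept:.0f}")
```

Compare the printed equation with the one your calculator reports for part (b).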
Exercise #2: Generally, the fuel efficiency of a car changes with the weight of the car. A survey of some cars
with their weights and gas mileages is shown below.
Weight (1000’s of lbs):      3.7  4.5  3.2  5.1  6.8  4.9  4.8  5.5
Mileage (miles per gallon):  38   26   48   24   18   30   28   21
(a) Find the equation for the line of best fit using your calculator. Round both coefficients to the nearest tenth.
List what the variables x and y represent in this problem.
(b) Create a graph of the scatter plot for this data. Would you consider the correlation between weight and
mileage to be positive or negative? Explain.
(c) Which parameter of the linear model predicts whether the correlation is positive or negative? Use this
model to help explain your answer.
(d) If a car had a weight of 4,300 pounds, what would this model predict as its fuel efficiency? Round to the
nearest integer. Use appropriate units and make sense of your answer.
(e) If we wanted to purchase a car that got 40 miles to a gallon, what weight of car, to the nearest 100 pounds,
should we purchase? Solve algebraically.
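Parts (d) and (e) use the same model in two directions: substitute a weight to get a mileage, or set the mileage and solve y = ax + b for x, which gives x = (y − b) / a. The Python sketch below illustrates both steps under the assumption that the calculator fits an ordinary least-squares line; `fit_line` is a hypothetical helper, and the data come from the table in this exercise.

```python
def fit_line(xs, ys):
    """Least-squares slope and y-intercept for y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Weight in thousands of pounds and mileage in miles per gallon.
weight = [3.7, 4.5, 3.2, 5.1, 6.8, 4.9, 4.8, 5.5]
mileage = [38, 26, 48, 24, 18, 30, 28, 21]

a, b = fit_line(weight, mileage)
a, b = round(a, 1), round(b, 1)  # round both coefficients to the nearest tenth

# Forward use of the model: predicted mileage for a 4,300-pound car.
predicted_mpg = a * 4.3 + b

# Backward use of the model: solve 40 = a * x + b for x.
needed_weight = (40 - b) / a  # in thousands of pounds

print(f"y = {a}x + {b}")
print(f"predicted mileage at 4,300 lb: {predicted_mpg:.0f} mpg")
print(f"weight for 40 mpg: about {needed_weight * 1000:.0f} lb before rounding")
```

The sign of the slope printed in the first line also connects to parts (b) and (c).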
Exercise #3: A pediatrician would like to determine the relationship between the weight and age of infant girls.
The pediatrician studies 100 newborn girls and finds their average weight at the end of each 3-month interval.
Age (months):             0    3     6     9     12    15
Average Weight (pounds):  7.2  12.2  15.1  19.4  21.5  26.3
(a) Using the linear regression command on your calculator, find the equation of the best fit line.
(b) Use your calculator to determine the linear correlation coefficient. Round to the nearest
thousandth. How can you interpret this value in terms of the variation in weight due to age?
Exercise #4: Using the equation that your calculator produced in Exercise #3, predict the weight of a baby girl
after 10 months. Round your answer to the nearest tenth of a pound. The use of a model to predict outputs
when the input is within the range of the known data is called interpolation. Interpolation tends to be
fairly accurate.
Exercise #5: Using the equation that your calculator produced in Exercise #3, predict the weight of a baby girl
after 2 years. Round your answer to the nearest tenth of a pound. The use of a model to predict outputs when
the input is outside of the range of the known input data is called extrapolation. Models are most helpful
when they can be used to extrapolate, but extrapolated predictions tend to be less accurate than interpolated ones.
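The difference between the two kinds of prediction comes down to whether the input lies inside the range of the observed inputs. The Python sketch below makes that check explicit for the infant weight data from Exercise #3. It is an illustration only: `fit_line` and `predict` are hypothetical helper names, and the fit assumes an ordinary least-squares line like the one the calculator produces.

```python
def fit_line(xs, ys):
    """Least-squares slope and y-intercept for y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Age in months and average weight in pounds, from the Exercise #3 table.
ages = [0, 3, 6, 9, 12, 15]
weights = [7.2, 12.2, 15.1, 19.4, 21.5, 26.3]

slope, intercept = fit_line(ages, weights)

def predict(age_months):
    """Return the predicted weight and whether it is inter- or extrapolation."""
    inside = min(ages) <= age_months <= max(ages)
    kind = "interpolation" if inside else "extrapolation"
    return slope * age_months + intercept, kind

for age in (10, 24):  # 10 months and 2 years
    weight, kind = predict(age)
    print(f"{age} months: {weight:.1f} lb ({kind})")
```

The label says nothing about how far the prediction can be trusted; it only flags inputs that fall outside the data the line was fitted to.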
Exercise #6: Biologists are trying to create a least-squares regression equation (another name for best fit line)
relating the length of steelhead salmon to their weight. Seven salmon were measured and weighed with the data
given below.
Length (inches):  22    24    28    34     39     42     48
Weight (pounds):  3.43  4.46  7.08  14.21  22.19  31.22  35.67
(a) Determine the least-squares regression equation, in the form y = ax + b, for this data. Round all
coefficients to the nearest hundredth.
(b) Using your equation from part (a), determine the expected weight of a salmon that is 30
inches long.
(c) Using your equation from part (a), determine the expected weight of a salmon that is 52
inches long.
(d) In which part, (b) or (c), did you use interpolation and in which part did you use
extrapolation? Explain.
1. Bivariate Data Summary
Bivariate data – data that examines the relationship between two variables
• What individuals do the data describe?
• What are the variables and how are they measured?
• Are the variables quantitative or categorical?
Types of bivariate data
• Response variable – measures the outcome of a study
• Explanatory variable – attempts to explain (not cause) the response variable
To determine which is explanatory and which is response, think about which one seems to be a
possible explanation of the other.
If it is not obvious which one is explanatory and which is response, it may very well be that
it doesn’t matter.
Scatterplots – show the relationship between two quantitative variables measured on the same individuals.
The values of one variable appear on the horizontal (x) axis and the values of the other variable appear on the
vertical (y) axis. Each individual appears as a point in the plot. The explanatory variable is
placed on the x-axis and the response variable is placed on the y-axis. If there is no explanatory-response relationship, either variable can go on the horizontal axis.
Scatterplots must be labeled (both axes). The intervals on each axis must be uniform.
Interpreting scatterplots
• Look for the overall pattern and deviations from that pattern.
• Form – does the data appear linear or curved or have distinct clusters?
• Direction – the term used is association
• Positive association – low values of the explanatory variable accompany
low values of the response variable and
high values of the explanatory variable accompany
high values of the response variable
• Negative association – low values of the explanatory variable accompany
high values of the response variable and
high values of the explanatory variable accompany
low values of the response variable
• Strength of an association – how closely the points follow a clear form. Both
of the associations above are strongly linear.
• Outliers – points that fall outside the overall pattern of the relationship. As usual,
an outlier should be eliminated only if doing so can be justified.
© www.MasterMathMentor.com
Illegal to post on the Internet
Using the calculator:
Place the data in your lists: STAT, then EDIT
Set up your plots: 2nd, then Y=
Then: ZOOM, then 9 : ZoomStat. Be sure that you have no graphs in your Y= list.
Correlation – measures the direction and strength of the linear relationship between two quantitative variables.
The variable used to denote correlation is r.
Facts about correlation:
1. r is a number between -1 and 1 inclusive. Positive values of r indicate a positive association
between variables. Negative values of r indicate a negative association between variables.
2. Values of r near 1 or -1 mean a very strong association (the points are very close to forming a
straight line). Values of r near 0 mean a very weak association. If r is exactly 1 or -1, the
association is perfect (rarely happens).
3. Correlation makes no distinction between explanatory and response variables. It makes no
difference which variable is x and which is y when calculating r.
4. When calculating r, both variables must be quantitative.
5. r does not change when we change the units of measurement of x, y, or both.
6. Correlation measures the strength of a linear relationship between variables. It does not
describe relationships that are curved, no matter how strong they appear to be.
7. r is non-resistant: it is affected strongly by outliers.
8. In general, we will make this claim:
\[
\begin{cases}
|r| \ge 0.9 & \text{Very strong association} \\
0.7 \le |r| < 0.9 & \text{Fairly strong association} \\
0.5 \le |r| < 0.7 & \text{Moderately strong association} \\
0.2 \le |r| < 0.5 & \text{Fairly weak association} \\
|r| < 0.2 & \text{Very weak association}
\end{cases}
\]
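The guideline in fact 8 can be written as a short lookup. The Python sketch below is an illustration, not part of the original summary. It assumes the cutoffs apply to the size of r regardless of sign, since facts 1 and 2 treat values near -1 and near 1 as equally strong; `describe_strength` is a hypothetical helper name.

```python
def describe_strength(r):
    """Describe the strength of a linear association from its correlation r."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1 inclusive")
    size = abs(r)  # assumption: strength depends on the size of r, not its sign
    if size >= 0.9:
        return "very strong"
    if size >= 0.7:
        return "fairly strong"
    if size >= 0.5:
        return "moderately strong"
    if size >= 0.2:
        return "fairly weak"
    return "very weak"

for r in (0.97, 0.613, -0.632, 0.1):
    direction = "positive" if r > 0 else "negative"
    print(f"r = {r}: {describe_strength(r)} {direction} association")
```

The cutoffs are a rule of thumb for describing a scatterplot, so a value just above or below a boundary should still be checked against the plot itself.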
The formula for correlation:
\[ r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) \]
You are not responsible for this formula. We will find r using the calculator.
R1: Linear Regression Example
Name ______________________________
A diver is investigating a wreck under the water and has to come up to the surface slowly. Following is a chart
detailing his depth from the time he starts ascending.
time              depth (ft)
0 sec             240
30 sec            225
1 min             203
1 min 40 sec      189
2 min 20 sec      180
3 min             164
3 min 30 sec      155
4 min 40 sec      160
5 min 30 sec      130
6 min             125
6 min 30 sec      120
1. Draw a scatterplot which is suitable for predicting depth based on time. There is one point that lies outside
the pattern of the plot. Circle it.
2. Find the equation of the least squares line of the data. Comment on the strength of the association.
3. Eliminate the point you circled and recalculate the equation of the least squares line. Sketch the line on the
scatterplot above. Comment on the strength of the association compared to your answer with that circled
point included.
4. Using the model above (point removed), explain the meaning of the slope of the line.
5. Describe the relationship between time and depth using r² to make your description more precise.
6. Using the model above (point removed), predict the depth of the diver at the following times. Comment on
the confidence of your prediction.
a) 2 minutes, 50 seconds
b) 5 minutes
c) 7 minutes, 10 seconds
7. Find the difference between the observed depth at 1 min, 40 seconds, and the predicted depth.
8. Explain why a linear model might not be the best to describe this data.
9. In the diagram above, place a square around the point that has the largest residual. Place a triangle around the
point that has the second largest residual.
10. Using the linear model, how long would you predict it takes before the diver reaches the surface? In reality, do
you believe it is a longer or shorter duration? Explain.
11. What is the exact sum of the residuals?
12. One more piece of data is added. It is found that the diver is at 40 feet, 7 minutes into his ascent. Explain
the role of this new point and how it will affect the slope of the LSRL without actually calculating it.
13. New data is added, changing the mean time to 6 minutes and 10 seconds with standard
deviation 2 minutes and 5 seconds. The mean depth is now 115.5 feet with standard deviation 31.6 feet. If
the association is still negative and r² = .68, find the slope of the regression line of time versus depth.
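Several of the questions above rest on residuals (observed depth minus predicted depth) and on r². The Python sketch below shows how both can be computed for the diver data. It is an illustration under stated assumptions: times are converted to minutes, the line is an ordinary least-squares fit, the point set aside is assumed to be the 4 min 40 sec reading (where the recorded depth goes back up), and `fit_line` and `r_squared` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Least-squares slope and y-intercept for y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def r_squared(xs, ys):
    """Square of the correlation coefficient."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

# (seconds since the ascent began, depth in feet), from the table above.
data = [(0, 240), (30, 225), (60, 203), (100, 189), (140, 180), (180, 164),
        (210, 155), (280, 160), (330, 130), (360, 125), (390, 120)]

minutes = [seconds / 60 for seconds, _ in data]
depths = [depth for _, depth in data]

slope, intercept = fit_line(minutes, depths)
residuals = [d - (slope * t + intercept) for t, d in zip(minutes, depths)]
r2_all = r_squared(minutes, depths)

# Assumption: the point set aside is the 4 min 40 sec (280 second) reading.
kept = [(seconds / 60, depth) for seconds, depth in data if seconds != 280]
kept_minutes = [t for t, _ in kept]
kept_depths = [d for _, d in kept]
r2_without = r_squared(kept_minutes, kept_depths)

print(f"slope: {slope:.2f} feet per minute")
print(f"sum of residuals: {sum(residuals):.6f}")
print(f"r squared with all points: {r2_all:.3f}, without the point: {r2_without:.3f}")
```

Listing the individual residuals from the sketch is a useful cross-check on the points you marked with a square and a triangle.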