Section 4: Analyzing Bivariate Data with Fathom
Summary: Building from ideas introduced in Section 3, teachers continue to analyze
automobile data using Fathom to look for relationships between two quantitative
attributes. They use the concept of variation and deviations from the mean in each
univariate attribute to help conceptualize correlation and least squares regression as ways
of describing the relationship in the bivariate data and developing a linear model for
making predictions. The teachers will consider pedagogical issues concerning the
difficulties students may have in analyzing bivariate data and the benefits and drawbacks
for using conceptual underpinnings from univariate analysis to develop bivariate analysis
techniques.
Objectives:
Mathematical: Teachers will be able to
• use concepts and techniques used in univariate data analysis to understand how
two attributes co-vary and the techniques used to analyze two quantitative
variables in bivariate data analysis;
• describe how to create a linear model using the method of least squares;
• describe the strength of the relationship between two quantitative attributes using
the correlation coefficient;
• analyze the appropriateness of a linear function as a model for bivariate data
using a residual plot;
• use the coefficient of determination to explain the proportion of variation in the
response variable that can be attributed to the predictor variable by a least squares line;
• determine whether a linear function is an appropriate model for a set of data.
Technological: Teachers will be able to use Fathom to
• create box plots and scatterplots;
• plot functions and values on graphs;
• create a movable line and show squares;
• create a least-squares line of best fit;
• create a residual plot;
• enter formulas;
• use the summary table to compute statistical measures.
Pedagogical: Teachers will
• discuss the benefits and drawbacks related to different representations of data,
using dynamic files to conceptualize correlation and linear models;
• become familiar with some difficulties students have with creating and using a
least squares linear model;
• consider benefits and drawbacks of using tasks to assist students in reasoning
about how to use univariate data analysis techniques to develop a better
understanding of bivariate data analysis techniques.
Prerequisites: univariate analysis techniques, understanding mean and standard
deviation as measures of center and spread, dot plots and box plots
Vocabulary: covariation, bivariate data, correlation coefficient, residuals, Sum of
squares, Least Squares line, coefficient of determination, interpolation, extrapolation,
influential points
Learning to Teach Mathematics with Technology: An Integrated Approach
DRAFT MATERIALS DO NOT DISTRIBUTE
Modified 11/3/2006
Technology Files: 2006Vehicles.ftm, Correlation.ftm, Outlier.ftm
Emergency Technology Files: 2006VehiclesPart3.ftm, 2006VehiclesPart4.ftm,
2006VehiclesPart5.ftm, 2006VehiclesPart6.ftm
Required Materials: Fathom v. 2
Section 4: Analyzing Bivariate Data
While measures of center and spread provide us with important information about
the distribution of a single variable, often more than one variable is collected, and
relationships between two or more variables are examined. In Section 3, we used
two attributes in the 2006 Vehicle data to answer a question about which engine
type seemed to have the best City mpg performance. To answer that
question, we were using bivariate data with one quantitative attribute (City mpg)
and one qualitative attribute (Engine type). Bivariate data is the term used to
describe data that have two variables for each observation. In this lesson, we will
focus on ways to examine relationships between two quantitative attributes. When
examining two quantitative measures in a data set, our attention is on how the
measures co-vary. To help students conceptualize co-variation, we are going to
build on what they already know about partitioning data sets and using measures
of center and spread to describe univariate distributions.
Part 1: Examining Relationships Between Two Quantitative Variables
Open the 2006Vehicle.ftm file and clear all previous graphs.
Tech Tip: Unless one of your goals is to investigate the effects of changing data values, remember to choose Prevent Changing Values in Graphs from the Collection menu.
There are several quantitative attributes in this data set. When considering a
purchase of a new vehicle, a buyer is likely interested in the gas mileage for both
city and highway. When looking at the 2006 vehicle data, we can ask the
question:
Is there a relationship between City mpg and Hwy mpg for
this set of vehicles?
We can begin by examining the distributions of the City mpg and the Hwy mpg
on separate dot plots. Drag down two empty Graph objects and create dot
plots for each attribute.
Figure 4.1
We can utilize Fathom’s linked representations to informally investigate the
nature of the relationship between the two distributions by selecting a portion of
the cases in the City distribution and noticing the corresponding position of those
cases in the Hwy distribution.
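The linked-selection idea can also be sketched outside Fathom. The Python sketch below is only an illustration: the five vehicles and their mpg values are hypothetical (they are not taken from the 2006 Vehicles file), and `select_by_city` is a made-up helper name. It selects cases by their City mpg and then reads off the Hwy mpg of those same cases, which is the information the linked highlighting provides.

```python
# Hypothetical (City mpg, Hwy mpg) values for illustration only;
# they are NOT taken from the 2006 Vehicles file.
vehicles = [
    {"name": "A", "city": 18, "hwy": 25},
    {"name": "B", "city": 22, "hwy": 29},
    {"name": "C", "city": 30, "hwy": 38},
    {"name": "D", "city": 15, "hwy": 21},
    {"name": "E", "city": 27, "hwy": 34},
]


def select_by_city(cases, low, high):
    """Return the cases whose City mpg lies in [low, high],
    similar to dragging a selection rectangle on the City dot plot."""
    return [case for case in cases if low <= case["city"] <= high]


# Select the vehicles with the highest City mpg values.
selected = select_by_city(vehicles, 25, 35)

# The same cases carry their Hwy values with them, which is what
# the highlighted points in the Hwy dot plot would show.
selected_names = [case["name"] for case in selected]
selected_hwy = [case["hwy"] for case in selected]
print(selected_names, selected_hwy)
```

In this made-up data, the cases with high City mpg also sit at the high end of the Hwy values, which is the kind of pattern students are asked to look for.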
Tech Tip: To deselect cases in a graph, click on any “white space” in the graph.
To select a portion of cases in a graph,
1. in the graph window, click and drag to draw a dashed rectangle around a subset of points.
2. When the mouse is released, the selected points will appear red in the graph. Because the representations are all linked, the same cases will also be highlighted in all other open representations of this data.
Figure 4.2
Figure 4.3
FOCUS ON MATHEMATICS
M-Q1. What do you anticipate might be a reasonable relationship between the
City mpg and Hwy mpg?
M-Q2. Select the cases with the highest City mpg. Where are the corresponding
values for the Hwy mpg located in its dot plot? Identify the specific vehicles
represented by these points.
M-Q3. Change the graphs for the City and Hwy to be displayed as box plots.
Click on the lower whisker in one of the box plots and notice the location of the
highlighted cases in the other box plot. Repeat this several times by clicking on
different parts of a box plot. What do you notice?
M-Q4. Compare what you noticed in your exploration in M-Q3 to the prediction
you made in M-Q1.
FOCUS ON PEDAGOGY
P-Q1. Explain how it might be helpful to use the linked representations of two dot
plots or box plots to assist students in examining the covariation between two
quantitative attributes.
Having students look for the corresponding position of subsets of data in two
univariate distributions (e.g., see Figure 4.3) can help students notice initial
patterns and relationships between attributes and motivate the notion of how to
display the data to examine the bivariate distribution. We want to help students
transition to a representation in two dimensions, rather than representations like
dot plots and box plots that are only in one dimension. The two-dimensional space
commonly used to represent bivariate data is a scatterplot. Scatterplots are an
efficient way to examine whether there is an association between two quantitative
attributes. We are interested in examining whether there is an association between City
mpg and Hwy mpg. City and Hwy mpg are both outcome measures of a vehicle’s
performance, where one is not dependent on the other. We are interested in whether there
is a relationship between these attributes; however, unlike some situations, we are
not assuming that there is a cause-effect, independent-dependent relationship
between these two quantities. Thus, it does not matter which attribute we use as
our input (x) or predictor variable, and which one we use as the output (y) or
response variable. In creating a scatterplot, the attribute assigned as the predictor
variable is represented on the x-axis while the response variable is on the y-axis.
For our example, we will assign City mpg as the predictor variable and Hwy mpg
as the response variable.
FOCUS ON PEDAGOGY
P-Q2. Create several pairs of variables that can help students understand the
difference between an association between two variables that are independent-dependent
and one where both are outcome measures and would only have a predictor-response
relationship. Explain why each pair is either an independent-dependent or a
predictor-response example.
To help students transition from examining the two attributes as distributions in
one dimension, to inscribing the data in two dimensions, we are going to re-orient
the univariate distributions such that the distribution of the predictor variable
(City) is horizontal and the response variable (Hwy) is vertical.
To move an attribute from the x-axis to y-axis,
1. in the graph window, drag the attribute label (Hwy) from the x-axis and
drop it on the y-axis.
2. The graph will be redrawn with the distribution displayed along the y-axis.
Figure 4.4
Before continuing, change the window sizes and orient the two box plots as
shown in Figure 4.5 so as to leave room to add another graph object to
display the scatterplot.
Figure 4.5
The box plots for each attribute show how each distribution is partitioned by the
quartiles, with the second quartile (the median) represented by the line inside the
box. To analyze how the two attributes co-vary, we are going
to inscribe the data as a scatterplot where each case icon in the graph will
represent the ordered pair (City, Hwy) for that particular vehicle.
To create a scatterplot,
1. drag down an empty Graph object to the workspace.
2. Click and drag the attribute representing the predictor (or independent) variable to the x-axis.
3. Click and drag the attribute representing the response (or dependent) variable to the y-axis.
Figure 4.6
Adjust the scales of the three graphs until they are aligned (see Figure 4.7). It will
also be helpful to display the location of the mean for the City and Hwy on each
graph.
Tech Tip: To display the mean on a graph, select the graph window, then choose Plot Value (or Plot Function for the attribute on the y-axis) under the Graph menu. Use the formula mean(attribute_name) (e.g., mean(City)).
Figure 4.7
The display of the horizontal and vertical lines representing the mean values can
help students examine how each data point varies from the means. The lines can
also serve as a reference to notice the placement of data points in comparison to
the general trend of data points. When describing relationships between two
variables, we typically describe the form (linear, exponential, etc.), direction
(positive or negative), and strength (weak, moderate, or strong) of the general
trend and relationship.
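One way to make the role of the two mean lines concrete is to compute, for each data point, its deviation from each mean and note whether the point lies on the same side of both means. The sketch below is only an illustration under stated assumptions: the City and Hwy values are hypothetical (not from the 2006 Vehicles file) and `side_of_means` is a made-up helper name.

```python
from statistics import mean

# Hypothetical (City, Hwy) values for illustration only.
city = [18, 22, 30, 15, 27]
hwy = [25, 29, 38, 21, 34]

mean_city = mean(city)  # position of the vertical mean line
mean_hwy = mean(hwy)    # position of the horizontal mean line


def side_of_means(x, y):
    """Describe where a point lies relative to the two mean lines."""
    dx = x - mean_city  # deviation from the mean of the predictor
    dy = y - mean_hwy   # deviation from the mean of the response
    if dx * dy > 0:
        return "same side"       # above both means or below both means
    if dx * dy < 0:
        return "opposite sides"  # above one mean, below the other
    return "on a mean line"


sides = [side_of_means(x, y) for x, y in zip(city, hwy)]
print(mean_city, mean_hwy, sides)
```

When most points fall on the same side of both means (upper right or lower left of the intersection), the general trend is increasing; when most fall on opposite sides, the trend is decreasing.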
Tech Tip: Remember that you can click on data points or select a cluster of data points in a graph and see those points highlighted in all graphs and in the table.
FOCUS ON MATHEMATICS
M-Q5. Explain why you use the command Plot Value to display the mean City
mpg and use the command Plot Function to display the mean Hwy mpg in a
scatterplot.
M-Q6. Use form, direction and strength to describe the relationship between City
and Hwy mpg.
M-Q7. Describe the location of the data points in relation to the mean City and
mean Hwy mpg. What does this tell you about the general trend of the data?
M-Q8. Describe a typical City and Hwy mpg for this set of vehicles. Explain how
you determined what you would consider as “typical”.
FOCUS ON PEDAGOGY
P-Q3. How can displaying the means in a scatterplot help or hinder students’
ability to think about variation in bivariate data?
In the next part, we will more closely examine how to quantify the strength of a
linear relationship between two quantitative variables.
Part 2: Conceptualizing Correlation
Based on the scatterplot of City mpg versus Hwy mpg, there appears to be a
relationship between the two attributes and the data seems like it could be
modeled using a linear function with positive slope. Thus, there seems to be a
correlation between the two attributes, meaning that there seems to be a general
trend and an association between the variation in each attribute. A correlation
coefficient, usually represented by the letter r, is a measure used to describe the
strength and direction of a linear relationship between two variables. The most
common formula used in textbooks and technology tools is the Pearson Product
Moment correlation coefficient. There are several equivalent forms of the formula
used to compute this measure, all of which are based on how the data points vary
from the mean of the predictor (x) and response (y) variables. For example,
consider the formula below as one expression of the Pearson’s correlation
coefficient.
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}}$$
Notice that the expressions in both the numerator and denominator are based on
how a data value deviates from the mean, including squared deviations.
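To see how the deviations from each mean enter the calculation, the following sketch computes the Pearson correlation coefficient directly from the deviation form of the formula. It is a minimal illustration, not Fathom's implementation; the function name `pearson_r` and the sample values are assumptions made for the example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation computed from deviations from the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of products of paired deviations.
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the two sums of squared deviations.
    sum_sq_x = sum((x - x_bar) ** 2 for x in xs)
    sum_sq_y = sum((y - y_bar) ** 2 for y in ys)
    return numerator / sqrt(sum_sq_x * sum_sq_y)


# Made-up values that lie exactly on a line.
r_increasing = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
r_decreasing = pearson_r([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])
print(r_increasing, r_decreasing)
```

Each product in the numerator is positive when a point lies on the same side of both means and negative otherwise, so the sign of r reflects the direction of the trend.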
Open the file Correlation.ftm.
Tech Tip: Fathom software includes several pre-created documents that are useful for exploring various concepts. These can be found in the Sample Documents folder. The Correlation.ftm file is adapted from one of these sample documents.
To help visualize how the correlation coefficient is a measure of covariation and
the spread of the bivariate data, we are going to use this interactive diagram as
shown in Figure 4.8. In this diagram, we have a slider to control the value of r,
and are able to view how changes in r affect the spread of the data points in the
scatterplot. A visualization tool such as this can help students get a sense of how r
is a measure of the covariation between the variables.
Figure 4.8
FOCUS ON MATHEMATICS
M-Q9. Drag the slider for r and observe how the scatterplot changes. Based on
your exploration, fill in the blank in each of the following statements.
a. A correlation coefficient value of ______ suggests a perfect increasing
linear association between two variables.
b. A correlation coefficient value of ______ suggests a perfect decreasing
linear association between two variables.
c. A correlation coefficient value of ______ suggests there is no linear
association between the two variables.
Pedagogy Tip: Students can test their ability to estimate correlation values for a given scatterplot using the Java applet at http://www.stat.uiuc.edu/courses/stat100/java/GCApplet/GCAppletFrame.html
M-Q10. Use the slider to create a scatterplot that can help you estimate a value
for the correlation coefficient for the relationship between City mpg and Hwy
mpg.
FOCUS ON PEDAGOGY
P-Q4. Discuss the benefits and drawbacks of using this interactive diagram to
introduce correlation as a useful measure for describing the strength of a
relationship between two variables.
Generally, a correlation coefficient with absolute value greater than 0.8 is an
indicator of a strong linear relationship, while a correlation coefficient with
absolute value less than 0.5 is considered weak. The scale in Figure 4.9 can be
helpful in interpreting the r value. However, it is important not to use the
correlation coefficient as the sole indicator of the strength of a linear relationship.
Remember that the correlation coefficient is computed using means and
deviations from the mean, and we know how easily an outlier can affect a mean!
Figure 4.9
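The rule of thumb above can be written as a small function. This is only a sketch of the guideline stated in the text (absolute value above 0.8 is strong, below 0.5 is weak); labeling everything in between as "moderate" is an assumption made here, and `describe_strength` is a hypothetical name.

```python
def describe_strength(r):
    """Label the strength of a linear relationship from its correlation coefficient.

    Thresholds follow the guideline in the text; the "moderate" label for
    values between 0.5 and 0.8 is an assumption for this sketch.
    """
    if not -1 <= r <= 1:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    size = abs(r)  # strength ignores direction
    if size > 0.8:
        return "strong"
    if size < 0.5:
        return "weak"
    return "moderate"


print(describe_strength(0.9), describe_strength(-0.85),
      describe_strength(0.65), describe_strength(0.3))
```

A label like this describes only the size of r. It says nothing about outliers or curved patterns, which is why the scatterplot should be examined as well.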
Part 3: Using a Line to Describe a Relationship Between Two Quantitative Variables¹
Return to the 2006Vehicle.ftm file.
The scatterplot of Hwy vs. City mpg suggests that there may be a linear
relationship between the two variables. That is, if we know a vehicle’s City mpg,
then we can use a linear function rule to predict an approximate value for that
vehicle’s Hwy mpg. We can use Fathom to compute the correlation coefficient, r.
To compute r,
1. drag an empty Summary object to the workspace.
2. Click and drag a quantitative attribute label onto the Summary table. Once the cursor is over the table, a down arrow and a right arrow will appear. Drop the quantitative attribute below the down arrow.
3. Click and drag a second quantitative attribute label onto the Summary table to the right of the right arrow. The default measure that will be displayed is the correlation between the two attributes.
Figure 4.10
Figure 4.11
FOCUS ON MATHEMATICS
M-Q11. Compare the calculated correlation coefficient with the one you
estimated using the Correlation.ftm file.
M-Q12. What does the value of the correlation coefficient imply about the
relationship between City and Hwy mpg?
Since we have a high correlation value, it makes sense to try to use a linear
function to model the vehicle data. The model represents an estimate of the
response variable (often denoted as y) given a value for the predictor variable
(often denoted as x). This model could be thought of as a measure of center for
this bivariate distribution.
¹ The technology file “2006Vehicle_Part3” is available for students to use for Part 3 if they were
unable to complete Part 1 with the technology.
To create a movable line to fit bivariate data,
1. click on the graph and under the Graph menu, select Add Movable Line.
The graph of a line appears in the scatterplot with its equation at the
bottom of the graph window.
2. Dragging the line by its middle changes the intercept (translates the line)
while dragging by either end changes the slope (rotates the line).
FOCUS ON MATHEMATICS
M-Q13. Insert a movable line and adjust it so that you feel it best models the data.
Describe the method you used for determining where to place the line to model
the data.
M-Q14. Interpret the slope and y-intercept in the equation of your linear model.
M-Q15. We can use the equation of a line to estimate a value for the response
variable for an input for the predictor variable. For example, a Jeep Liberty gets
22 mpg in the City. If we think of a linear model for this data with an equation,
Hwy = 1.01*City+5, then we can use this equation to estimate a predicted Hwy
mpg when the City mpg is 22. Based on where you placed the moveable line, use
your equation to predict the Hwy mpg for a vehicle with a City mpg of 31.
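The worked example in M-Q15 amounts to evaluating a linear function. The sketch below uses the example equation given in the text, Hwy = 1.01*City + 5, with the City value of 22 mpg stated for the Jeep Liberty. The function name `predict_hwy` is made up for illustration, and a line placed by hand would have a different slope and intercept.

```python
def predict_hwy(city_mpg, slope=1.01, intercept=5.0):
    """Estimate Hwy mpg from City mpg using a linear model.

    The default slope and intercept are the example values from the text;
    substitute the values from your own movable line.
    """
    return slope * city_mpg + intercept


# Estimated Hwy mpg for a vehicle that gets 22 mpg in the city.
jeep_estimate = predict_hwy(22)
print(round(jeep_estimate, 2))
```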
FOCUS ON PEDAGOGY
P-Q5. How can the ability to overlay a moveable line on a scatterplot help or
hinder students’ understanding of the use of a linear equation to model a
relationship between two variables?
P-Q6. One of the difficulties in using learning activities such as this is that
students do not have confidence in their solution when their results differ from
fellow students or from a teacher. Think of two strategies that you could employ
to help students understand that differences in solutions are acceptable and
expected in the context of trying to estimate a linear model. How could you
capitalize on this difficulty should it arise?
When we try to create a linear model by visually inspecting a graph, it is unlikely
that two different people will generate the exact same line. If we have two or
more different lines, how do we determine which is really best? There are
different methods that one might use for creating a linear model and analyzing
how good a fit it is. With each method, the goal is to minimize the distance
each predicted value is from the actual data value.
Similar to how we examined deviations from the mean with univariate data, we can examine deviations from a line with bivariate data. One common method that is used for finding the best linear model is to minimize the deviations of the actual data points from the predicted values. Visually, these are the vertical distances between the actual data points and the line (Figure 4.12).
Figure 4.12
We are trying to minimize the vertical distance between an actual data value and the predicted value (output)
that would fall on the line for the same input value (x-coordinate) of the actual
data point. Thus, we are comparing the y-coordinate for an actual data point in the
collection to its predicted y-coordinate that would result from the linear function.
The difference, (y-coordinate of actual data point) − (y-coordinate of predicted
value), is called a residual.
Recall that for univariate data, we described deviation from the mean using variance and standard deviation (Part 4 of Section 3), which are both based on summing the squared deviations. We can use a similar method with our bivariate data where we sum the square of each residual. For bivariate data, this sum is called the Sum of squares (or residual sum of squares), and represents a measure of variation. A linear model that minimizes the sum of squares is called the Least Squares regression line.
Figure 4.13
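The residual and sum-of-squares calculations described above can be sketched in a few lines. This is an illustration only: the City and Hwy values are hypothetical (not from the 2006 Vehicles file), the two candidate lines are arbitrary choices, and the helper names are made up. The idea is the same as dragging a movable line and watching the displayed sum change.

```python
# Hypothetical (City, Hwy) values for illustration only.
city = [18, 22, 30, 15, 27]
hwy = [25, 29, 38, 21, 34]


def residuals(xs, ys, slope, intercept):
    """Actual y minus predicted y for each data point."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


def sum_of_squares(xs, ys, slope, intercept):
    """Sum of squared residuals for the line y = slope * x + intercept."""
    return sum(e ** 2 for e in residuals(xs, ys, slope, intercept))


# Two candidate lines, as if two people had each placed a movable line.
ss_a = sum_of_squares(city, hwy, slope=1.0, intercept=7.0)
ss_b = sum_of_squares(city, hwy, slope=1.01, intercept=5.0)
print(ss_a, ss_b)
```

For this made-up data the first line has the smaller sum of squares, so by this criterion it is the better of the two candidates. The least squares line is the one for which no other line gives a smaller sum.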
To visualize the Sum of squares,
1. click on the graph that has a moveable line and select Show Squares
from the Graph menu. Notice the sum of the squares is computed and
displayed below the equation of the line.
2. Adjusting the movable line will change the squares and the value of
the Sum of squares.
You should notice that there are tan and maroon squares shown in the scatterplot.
Since the mean(Hwy) was used with the Plot Function command, Fathom is also
displaying the squared deviations from the horizontal line. Before continuing, it
is important to remove the mean(Hwy) and mean(City) lines from the
scatterplot.
To remove a plotted value or function from a graph,
1. point to the plotted value or function you wish to remove and right-click with the mouse.
2. In the pop-up menu, choose Cut (or Clear) Formula. The graph of the value or function should be removed from the graph.
Figure 4.14
Figure 4.15 displays the moveable line that is being used to estimate a linear
model for estimating Hwy mpg given City mpg. The squares represent the
squared residuals for how much the line over or under predicts for each data
point.
Figure 4.15
FOCUS ON MATHEMATICS
M-Q16. Manipulate the movable line to explore whether it is possible to create a
line that is far from several points but still has a small sum of squares. Explain.
M-Q17. Compare and contrast the method of squaring residuals for bivariate data
with the calculation of the standard deviation in univariate data.
M-Q18. Adjust the movable line so as to minimize the sum of squares. Record
your new equation for the linear model. Use your new linear model to compare
the predicted Hwy mpg and the actual Hwy mpg for the following two vehicles:
a) Honda Accord Standard engine vehicle, and b) Toyota Prius Hybrid engine.
Does your line under-predict or over-predict for each of these vehicles? By
how much?
FOCUS ON PEDAGOGY
P-Q7. Describe the benefits and drawbacks of building on what students already
know about deviations from a mean with univariate data and standard deviation to
find a linear model by minimizing the sum of squared residuals.
Part 4: Visualizing the Residuals
In addition to finding the smallest Sum of squares, a plot of the residuals is
helpful in deciding whether your line is a good model for the data. A residual plot
displays the ordered pairs (x-value, y-value − predicted y-value).
To view a plot of the residuals,
1. click on a scatterplot that has a linear model displayed.
2. Under the Graph menu, choose Make Residual Plot.
3. The residual plot will be displayed at the bottom of the graph window.
Figure 4.16
In general, we want the residuals to be near zero, and the plotted points should be
randomly dispersed above and below the horizontal line y=0 and not reveal any
trends or patterns.
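A small numerical example may help show what a pattern in the residuals looks like. The sketch below is an illustration under stated assumptions: the data are made-up values that follow a curve (y = x²), the line is fitted by least squares, and the helper name `fit_least_squares` is hypothetical. The residual plot would display the pairs stored in `residual_points`.

```python
def fit_least_squares(xs, ys):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Made-up data that follow a curve rather than a line.
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

slope, intercept = fit_least_squares(xs, ys)

# Each point of the residual plot: (x-value, actual y minus predicted y).
residual_points = [(x, y - (slope * x + intercept)) for x, y in zip(xs, ys)]
print(slope, intercept, residual_points)
```

Here the residuals are positive at both ends and negative in the middle. That U-shaped pattern suggests a linear function is not an appropriate model, even though the correlation for these five points is high (about 0.96).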
FOCUS ON MATHEMATICS
M-Q19. Consider the residual plot for your linear model. If you continue to adjust
the moveable line, you should notice the residual plot update accordingly. What
does the residual plot reveal about the usefulness of your linear model for
predicting Hwy mpg for various vehicles?
M-Q20. A student placed their movable line in the scatterplot for the 2006
Vehicle data that resulted in the following residual plot.
Sketch the location of the predicted linear model based on the residual plot above
in the following graph.
FOCUS ON PEDAGOGY
P-Q8. Describe some of the conceptual difficulties students may have in
interpreting and using the residual plot. How will you help them understand the
residual plot and its usefulness in analyzing a linear model?
Part 5: The Least Squares Regression Line²
While we can use the techniques of minimizing the sum of squares and viewing
the residual plot to help find a really good linear model, technologies like Fathom
can easily compute a linear model using the Least Squares method.
Tech Tip: If a moveable line and residual plot are currently displayed when the Least Squares line is added to a graph, the residual plot will still be displaying the residuals for the moveable line, not the Least Squares line.
To find the least squares line,
1. click on a graph displaying a scatterplot.
2. Under the Graph menu, select Least-Squares Line. The Least-Squares
regression line computed by Fathom will appear on the graph, along with
the equation for this line and the value of r².
The square of the correlation coefficient, r², is called the coefficient of
determination and can be interpreted as the proportion of variation in the
response variable that can be attributed to the variation in the predictor variable
by the least squares line. As the value gets closer to 1, more of the variation is
explained by the predictor variable, and predictions for the response variable
tend to be more accurate. The difference between 1 and r² (1 − r²)
indicates the proportion of the variation in the response variable that is attributed
to other variables besides the predictor variable.
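The "proportion of variation" interpretation can be checked numerically. For a least squares line, r² equals 1 minus the ratio of the residual sum of squares to the total sum of squared deviations of the response from its mean. The sketch below illustrates this with made-up numbers; the function names are hypothetical and this is not a description of how Fathom computes the value.

```python
def fit_least_squares(xs, ys):
    """Return (slope, intercept) of the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def r_squared(xs, ys):
    """Coefficient of determination for the least squares line."""
    slope, intercept = fit_least_squares(xs, ys)
    y_bar = sum(ys) / len(ys)
    # Variation left over after using the line (sum of squared residuals).
    ss_residual = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # Total variation of the response about its mean.
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_residual / ss_total


# Made-up values for illustration only.
xs = [0, 1, 2, 3, 4]
ys = [0, 1, 4, 9, 16]
explained = r_squared(xs, ys)
unexplained = 1 - explained
print(round(explained, 3), round(unexplained, 3))
```

For these values about 92% of the variation in y is accounted for by the line and about 8% is not.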
FOCUS ON MATHEMATICS
M-Q21. Compare the function rule for the least squares linear model with the
function rule for your estimated linear model (your moveable line).
M-Q22. What is the predicted Hwy mpg for the Honda Accord Standard and the
Toyota Prius Hybrid using the least squares linear model? How do these compare
to the predicted Hwy mpg for both vehicles using your estimated linear model
found with the moveable line (refer to your solution of M-Q18)?
M-Q23. Interpret the coefficient of determination for the least squares line.
Now that we have the least squares line, we can remove the Movable line and
the Residual Plot from the graph.
To remove a movable line from a graph,
1. with the graph selected, under the Graph menu, choose Remove
Moveable Line.
2. The moveable line and its associated residual plot should be removed.
² The technology file “2006Vehicles_Part5.ftm” is available for students to use for Part 5 if they
were unable to complete through Part 4 with the technology.
The least squares line is computed using means and standard deviations for each
variable, and the correlation coefficient r. To help visualize the location of the
least squares line in relation to the mean mpg for both City and Hwy, use Plot
Value and Plot Function to display the mean(City) and mean(Hwy).
Figure 4.17
Algebraically, the slope and y-intercept of the least squares line are:
$$\text{slope} = r \frac{s_y}{s_x}$$
$$\text{y-intercept} = \bar{y} - \text{slope} \cdot \bar{x}$$
Thus the equation for the least squares line can be symbolically represented as
$$\hat{y} = r \frac{s_y}{s_x} x + \left( \bar{y} - r \frac{s_y}{s_x} \bar{x} \right)$$
or, in an alternative form,
$$\hat{y} = r \frac{s_y}{s_x} (x - \bar{x}) + \bar{y}$$
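The formulas for the slope and y-intercept can be tried out with a short sketch that builds the least squares line from the two means, the two standard deviations, and r. The data values and the function name `least_squares_from_summary` are assumptions made for illustration, and sample standard deviations are used for s_x and s_y.

```python
from statistics import mean, stdev


def least_squares_from_summary(xs, ys):
    """Slope and intercept computed from means, standard deviations, and r."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    # Correlation coefficient written in terms of the standard deviations.
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    slope = r * s_y / s_x
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Made-up values for illustration only.
xs = [0, 1, 2, 3, 4]
ys = [0, 1, 4, 9, 16]
slope, intercept = least_squares_from_summary(xs, ys)

# Value predicted by the line at the mean of the predictor.
predicted_at_mean_x = slope * mean(xs) + intercept
print(round(slope, 6), round(intercept, 6), round(predicted_at_mean_x, 6), mean(ys))
```

For this example the value predicted at the mean of x matches the mean of y, which agrees with the alternative form of the equation.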
When using a least squares line to make predictions, interpolation is the process
of predicting a response based on a value within the range of observed values of the
predictor variable. In answering previous questions, we used interpolation. Extrapolation
is the process of predicting based on a value outside the range of the original data
for which the least squares line was computed. For example, we would be
extrapolating if we used the least squares line to predict the Hwy mpg for a
vehicle with a City mpg of 5 or 80.
When finding a linear model we should take into account at least four different
factors in considering whether the line is a good model for the data: (1) the value
of the correlation coefficient and coefficient of determination, (2) how the data is
positioned in the scatterplot in relation to the linear model, (3) the residual plot,
and (4) the situation and whether the line makes sense for all data points. In many
cases, the domain of the linear model needs to be specified in order for it to fit the
situation and to avoid the potential dangers of extrapolation.
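The distinction between interpolation and extrapolation comes down to checking an input against the range of the observed predictor values. The sketch below is an illustration only: the City values are hypothetical (not from the 2006 Vehicles file) and `prediction_kind` is a made-up helper name.

```python
def prediction_kind(x, observed_xs):
    """Classify a prediction at x as interpolation or extrapolation."""
    lowest, highest = min(observed_xs), max(observed_xs)
    if lowest <= x <= highest:
        return "interpolation"
    return "extrapolation"


# Hypothetical City mpg values for illustration only.
city = [18, 22, 30, 15, 27]

kinds = {value: prediction_kind(value, city) for value in (22, 5, 80)}
print(kinds)
```

A check like this can be used to state the domain of a linear model explicitly before making predictions with it.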
FOCUS ON MATHEMATICS
M-Q24. The least squares regression line passes through the intersection of the
mean City mpg and the mean Hwy mpg. Will this always happen? Justify your
answer algebraically.
M-Q25. Do you believe the least squares line is a good model for the 2006
Vehicle City and Hwy data? Explain.
FOCUS ON PEDAGOGY
P-Q9. How could you assist students in thinking about the dangers of
extrapolating using the 2006 Vehicle data?
P-Q10. If technologies like Fathom, as well as others such as Excel and graphing
calculators, will compute and display the least squares line, would you choose to
show students the algebraic form for computing the least squares line? Defend
your position.
Part 6: Exploring Additional Attributes on a Scatterplot³
The analysis of the scatterplot and least squares line for the City and Hwy mpg
raises several issues. First, there are several vehicles, all of which are Hybrids, for
which the linear model grossly overestimates their Hwy mpg based on their City
mpg. Second, the r² value suggests that there may be other variables besides City
mpg that are contributing to the variation in the Hwy mpg.
Fathom can facilitate students analyzing data in a scatterplot in such a way as to
add a third variable, or dimension, to the analysis. This can help students visualize
the relationship between three variables, rather than only considering two.
Deselect the Least Squares line before continuing.
Tech Tip: To remove a legend attribute from a graph, choose Remove Legend Attribute under the Graph menu.
To overlay a legend attribute on a scatterplot,
1. Drag the attribute of interest to the center of the scatterplot.
2. If the attribute is quantitative, then each data point will be displayed along
a color gradient continuum. If the attribute is qualitative (categorical) then
each data point will be displayed using different shapes and colors. A key
will appear at the bottom of the graph.
FOCUS ON MATHEMATICS
M-Q26. Which quantitative and qualitative attributes in the 2006 vehicle data set
could be related to the City and Hwy mpg for a vehicle?
M-Q27. Explore overlaying the attribute Weight on the Hwy vs. City scatterplot.
Explain whether a vehicle’s weight seems to be related to the City and Hwy mpg.
FOCUS ON PEDAGOGY
P-Q11. Consider how the differences between the use of color to highlight
different attributes in Fathom and in TinkerPlots could affect students’ reasoning
about relationships between attributes.
³ The technology file “2006Vehicle_Part6.ftm” is available for students to use for Part 6 if they were unable to complete through Part 5 with the technology.
Earlier we noticed that many of the data points that did not seem to fit the general
trend between City and Hwy mpg were Hybrid vehicles. It may make sense, then,
to compute a least squares line for the data in each of the subcategories of engine
type: Hybrid, Diesel, and Standard. If we overlay a qualitative attribute on a
scatterplot and display a Least-Squares Line on it, Fathom will use the qualitative
attribute as a filter and compute a separate linear model for each subset of the
data defined by that attribute.
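The idea of fitting a separate least squares line for each category can be sketched outside Fathom as follows. This is a minimal Python illustration, not a description of how Fathom works internally; the function names and the (City, Hwy, Engine) rows are hypothetical and were made up so that each group lies exactly on its own line.

```python
from collections import defaultdict
from statistics import mean

def fit_line(points):
    """Least squares (slope, intercept) for a list of (x, y) pairs."""
    xs, ys = zip(*points)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def fit_by_group(rows):
    """Group (x, y, category) rows by category and fit one line per group."""
    groups = defaultdict(list)
    for x, y, category in rows:
        groups[category].append((x, y))
    return {category: fit_line(points) for category, points in groups.items()}

# Hypothetical (City mpg, Hwy mpg, Engine type) rows, NOT the 2006 Vehicle data
rows = [
    (18, 26, "Standard"), (20, 28, "Standard"), (22, 30, "Standard"), (24, 32, "Standard"),
    (48, 45, "Hybrid"), (52, 47, "Hybrid"), (60, 51, "Hybrid"),
    (30, 38, "Diesel"), (34, 41, "Diesel"), (38, 44, "Diesel"),
]

models = fit_by_group(rows)
for engine, (slope, intercept) in models.items():
    print(f"{engine}: y-hat = {slope:.2f} x + {intercept:.2f}")
```

Fitting one line to all ten rows and comparing it with the three group lines gives a sense of how much a single model can hide.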
FOCUS ON MATHEMATICS
M-Q28. Overlay the Engine type attribute onto the Hwy vs. City scatterplot.
Display the least squares line. Interpret the resulting least squares equations.
FOCUS ON PEDAGOGY
P-Q12. What are the benefits and drawbacks of having the ability in Fathom to
overlay a qualitative attribute as a filter for computing Least Squares linear
models for subsets of data?
As we have seen, overlaying an attribute in a scatterplot can allow students to
simultaneously consider three attributes and relationships among them. While at
first this may seem confusing to students, this feature can help them consider
relationships among more than two variables and realize that linear models that
only consider a relationship between two variables are often not sufficient in
explaining the phenomenon. In the case of the 2006 Vehicle data, we have learned
that the type of engine in a car appears to be a significant factor that affects the
relationship between a vehicle’s fuel economy when driving in a city and on the
highway.
Part 7: Exploring the Effects of Outliers on Correlation and the Least Squares Line
In the exploration of the 2006 vehicle data, we were able to dynamically control
the location of a movable line in the scatterplot. Now, we are going to reverse
our locus of control and capitalize on the ability to move data points in a
scatterplot and observe the effect on the measures of that data.
Open the file Outliers.ftm.
This file has 5 data points displayed in the table and a scatterplot. The correlation
between the two variables is also computed and displayed.
Figure 4.18
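The same kind of exploration can be approximated in code by recomputing the correlation after relocating one point. The Python sketch below is only an analogy to dragging a point in Fathom. The five points are hypothetical (four corner points and one movable point) and are not the values stored in Outliers.ftm.

```python
from statistics import mean

def correlation(points):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    xs, ys = zip(*points)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical fixed points (NOT the data in Outliers.ftm)
corners = [(2, 2), (2, 10), (10, 2), (10, 10)]

# Three hypothetical positions for the fifth, movable point
positions = [(6, 6), (20, 20), (20, -8)]
results = {}
for moved in positions:
    results[moved] = correlation(corners + [moved])
    print(moved, round(results[moved], 4))
```

Trying additional positions for the movable point, including ones near the mean of the other four, parallels the dragging activity in the questions that follow.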
FOCUS ON MATHEMATICS
M-Q29. The correlation coefficient is currently about r=0.02506. Explain why
this value makes sense for these 5 data points.
M-Q30. Add the Least Squares line to the graph. Describe the slope of the least
squares line with respect to the correlation. Why are they related in that way?
M-Q31. Drag the “center” point located at (6.1, 7.5) to the upper right corner and
then to the bottom left corner of the graph. Describe the effects on the correlation
coefficient and Least Squares line. Repeat by dragging this point to the bottom right
and top left corners.
M-Q32. Describe two translations of the “center” point that have little effect on
the slope of the least squares line.
M-Q33. By moving the five points, find at least three very different arrangements
of points that result in a corresponding least squares line that has a positive slope
and a correlation coefficient greater than 0.8. How are the three arrangements
different, and what does this difference imply about the relationship between the
location of the points on the scatterplot, the value of the correlation coefficient,
and the slope of the least squares line?
FOCUS ON PEDAGOGY
P-Q13. Often middle or high school students only consider the correlation
coefficient when modeling data. Explain why this information alone is not
sufficient when determining an appropriate mathematical model.
P-Q14. What are the benefits and drawbacks of changing the data points in a
graphical representation and observing the effects on the measures of correlation
and the least squares line?
The exploration in the Outliers.ftm file is yet another example of how a
technology tool can be used to create an interactive diagram. Such diagrams allow
students to engage in dynamic manipulations, observe effects of their activities,
and reflect on those effects to develop a more meaningful conception of a
mathematical idea. These types of diagrams can be used in a variety of settings,
such as: (1) individuals working alone at a computer, (2) small
groups of students working together with one computer, (3) small groups of
students working on individual computers but allowed to discuss their results as a
group, and (4) whole group discussion with the interactive diagram displayed
using a projector and students and teacher discussing the activities and the effects
of the activities together. When considering how you use such technology files in
your own classroom you will need to balance what your goals are for students’
learning with the time allotted and computers available.
SUGGESTED ASSIGNMENTS
H-Q1 (Mathematical)
In this problem you will be investigating relationships between Weight and City
mpg.
(a) First determine if there is a strong linear relationship between Weight and
City mpg. Consider which attribute should be the predictor variable and which
should be the response variable. Explain your results.
(b) Investigate the relationship between Weight and City mpg using a third
attribute, engine type, and describe what additional information about the data
this attribute provides.
(c) Remove the Engine type attribute from the scatterplot. Construct a plot of
Residuals versus Weight. Describe this plot and what it tells you about the
relationship between City mpg and Weight.
H-Q2 (Pedagogical)
In many classrooms, students and teachers use graphing calculators to enter data,
create scatterplots, and compute linear regression. Discuss the advantages and
disadvantages of using graphing calculators versus Fathom for helping
students understand these concepts and perform these procedures.
H-Q3 (Mathematical)
[Question about computing residual on graphing calculator and displaying a
residual plot. Give directions.]
H-Q4 (Pedagogical)
Create a task where a linear relationship exists between two variables, but a linear
model would be inappropriate, such as shoe size and scores on an achievement
test. Create a series of questions that would help students to see that, though the
scatterplot reveals a linear trend and the correlation coefficient is strong, it does
not make sense to predict one from the other.
H-Q5 (Mathematical)
The median-median line is another linear model that is more resistant to outliers
than the least squares line. Below are directions for creating the median-median
line without technology.
• To fit a median-median line to the points, divide the points into three
groups. Do this by taking the one-third of the points with the smallest
x-values, a middle group, and the one-third of the points with the largest
x-values. For the SAT data set of n=51, each one-third will include 17 data
points. If the number of points is not divisible
by three, extra points need to be assigned symmetrically. Thus, if there is
one extra point, it should be added to the middle group, and if there are
two extra points, add one each to the two outer groups. Even if it makes
equal allocation impossible, points with the same x-values must always be
placed in the same group.
• Consider each group of data separately and order the values of
both variables. Ignore the data pairings at this point.
• Now create a summary point (one for each group) for each portion
of the data by using the median x-value and the median y-value,
and combining them to create an ordered pair. We have three
summary points: for the leftmost data, for the middle group, and
for the right-hand group. These summary points may or may
not be actual data points.
• Now use the two outer summary points to determine the equation
of the line between them. This, in essence, will determine your
slope.
• Construct the line that is parallel to this line but one-third of the way
toward the middle summary point (adjusting the y-intercept). This is the
median-median line. Moving the line one-third of the way
toward the middle summary point gives each summary point equal
weight in determining the y-intercept. To do this:
i. Find the y-coordinate of the point on the line with the same
x-coordinate as the middle summary point (the predicted
value).
ii. Find the vertical distance between the middle summary
point and the line by subtracting y-values.
iii. Find the coordinates of the point P* one-third of the way
from the line to the middle summary point.
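The directions above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name and test data are hypothetical, and for simplicity the sketch does not enforce the rule that points with the same x-value must stay in the same group.

```python
from statistics import median

def median_median_line(points):
    """Return (slope, intercept) of the median-median line for (x, y) pairs."""
    if len(points) < 3:
        raise ValueError("need at least three points")
    pts = sorted(points)  # order by x-value
    n = len(pts)
    base, extra = divmod(n, 3)
    # One extra point goes to the middle group; two extras go to the outer groups
    outer = base + (1 if extra == 2 else 0)
    left, middle, right = pts[:outer], pts[outer:n - outer], pts[n - outer:]

    def summary(group):
        # Medians are taken separately, so the pairings are ignored here
        return median([x for x, _ in group]), median([y for _, y in group])

    (xl, yl), (xm, ym), (xr, yr) = summary(left), summary(middle), summary(right)
    slope = (yr - yl) / (xr - xl)          # from the two outer summary points
    outer_intercept = yl - slope * xl      # line through the outer summary points
    gap = ym - (slope * xm + outer_intercept)  # middle summary point minus line
    return slope, outer_intercept + gap / 3    # shift one-third of the way

# Hypothetical data on y = 2x + 1, with one extreme y-value at x = 9
data = [(x, 2 * x + 1) for x in range(1, 9)] + [(9, 100)]
slope, intercept = median_median_line(data)
print(slope, intercept)
```

Because each summary point uses medians, the single extreme value in this hypothetical data set does not change the resulting line, which is one way to see what "resistant to outliers" means.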
Fathom will compute the median-median line. For the Hwy vs. City mpg graph,
under the Graph menu, choose Median-Median line. Record the equation
and compare it with the least squares model. Record the similarities and
differences that you notice.