Chapter 10 Exploring Relationships Between Numerical Variables

Objectives
Students will be able to (SWBAT):
1) Create and examine scatterplots of data
2) Describe the direction, form, and strength of
an association
3) Use correlation to measure the strength of a
linear association
4) Test for correlation
• In recent years, golfers have been
driving the golf ball farther than
before, partly due to equipment, and
partly due to the golfers themselves.
• However, trying to hit the ball as far as
possible off the tee can result in less
accuracy, and a more challenging
second shot playing out of the rough.
• Each time golfers tee off, they must
decide whether to hit the ball as far
as possible, which leaves a shorter and
easier second shot but risks missing
the fairway, or to ease up a bit and
hit the ball down the middle even
though it won’t go as far.
• In order to examine situations like the one on
the previous slide, we need to expand beyond
what we have learned so far, which has only
focused on one variable at a time, such as
home runs, rushing yards, or points per game.
• We will now begin to investigate the
relationships between two numerical
variables.
• In order to begin analyzing this situation we’ll investigate
the association between average driving distance and
driving accuracy for the top 10 women golfers in the
LPGA in 2009.
• Average driving
distance is our
explanatory variable
and driving accuracy is
our response variable.
• We will see if average
driving distance can
help explain accuracy.
• In other words, we want to see if the explanatory variable can help
predict the value of the response variable.
• In order to display the relationship between two
numerical variables, we will create a scatterplot (this
is our only option).
• Some things to keep in mind when making
scatterplots:
– 1) it is essential to include labels and clear, consistent
scales on each axis
– 2) neither axis needs to start at 0
– 3) the explanatory variable will be plotted on the
horizontal axis and the response variable will be plotted on
the vertical axis
(Note: the response variable is usually the one we are
more interested in, either because we want to predict the
response for a particular value of the explanatory variable or
because we want to use the explanatory variable to explain
changes in the response variable.)
Steps to Make a Scatterplot
1) Label and scale the
horizontal (explanatory)
axis and vertical
(response) axis.
2) Draw a dot for each
player to represent the
ordered pair.
3) Finish the scatterplot
by drawing dots for the
remaining players.
Making a Scatterplot Using the TI-84
1) Enter the average
driving distance
values in L1 and the
driving accuracy
values into L2.
2) Open STAT PLOT by
pressing 2nd Y=. Turn
on Plot1 (make sure
all other plots are off).
Choose the first graph
type, and enter L1 for
Xlist and L2 for Ylist.
3) Press ZOOM and
select 9: ZoomStat
and press ENTER in
order to see the
scatterplot in a
nice window.
4) To see the values used by
the graphing calculator for
the scales, press WINDOW.
On this screen, you can
change the starting, ending or
increment values to
something more convenient
and then regraph.
Describing the Association Between Variables
• After constructing a scatterplot, there are
several important characteristics about the
association between the variables that should
be considered.
1. the direction of the association
2. the form of the association
3. the strength of the association
Describing the Direction of an Association
• Here is a scatterplot
showing the average
driving distance and
driving accuracy for the
top 146 money winners on
the LPGA tour in 2009.
• Based on the scatterplot, we can see there is a negative
association between average driving distance and driving
accuracy, meaning that players who drive the ball farther
typically hit a smaller percentage of fairways; those who don’t
hit the ball as far typically hit a higher percentage of fairways.
• Here is the relationship
between a golfer’s average
driving distance and her
percentage of greens in
regulation (GIR). A green in
regulation means reaching the
putting surface (the green) in
at least two shots under par.
• There is a positive association between average driving
distance and GIR. Typically, players who drive the ball
farther have a better chance to land on the putting
surface in at least two shots under par.
• It is important to remember that a positive or negative
association is describing the overall tendency of the data, NOT
an absolute relationship.
• Example: Even though players who drive the ball farther
typically hit a higher percentage of greens, this is not always
true.
• Lee drives the ball farther than Creamer, but Lee hits a lower
percentage of GIR. This is the opposite of what we’d expect.
• Sometimes, two variables
have no association.
• The scatterplot to the right
shows the relationship
between driving accuracy
and putting average.
• Also included are the mean
putting average and mean
driving accuracy.
• In this case, knowing a golfer’s driving accuracy tells us nothing
about how many putts she will average per round. Therefore, we
can say there is no association between driving accuracy and
putting average.
• To recap: if two variables have an association, then knowing the
value of one variable will help you predict the value of the other
variable. However, if there is no association, then knowing the
value of one will not help you predict the value of the other.
Describing the Form of an Association
• The form of an association can either be linear or
nonlinear.
• If the association is linear, then a line would be a
reasonable way to model the overall relationship. But
if an association is nonlinear, then a line won’t match
the pattern of the scatterplot very well.
• Examples:
Describing the Strength of an Association
• The strength of an association describes the
amount of scatter there is from the overall form
of the data. In other words, how closely the
points on the scatterplot conform to the linear (or
nonlinear) form.
• In a strong association, there isn’t much scatter
and predictions of the response variable will be
fairly precise.
Examples of positive, linear associations with different
amounts of strength.
• Negative linear associations can show the
same range of strengths.
• Here is an example of a strong negative linear
association:
• Describe the strength of these associations:
Strong
association
Very strong
association
Moderate
association
• Let’s return to the initial question: is it better to
drive the ball long or drive it straight?
• Let’s look at two different scatterplots: one comparing
average driving distance to scoring average, and one
comparing driving accuracy to scoring average.
• Negative relationship – golfers
who drive the ball farther tend to
have lower scores (this is a good
thing in golf!)
• Negative relationship – golfers who
hit the fairway more often tend to
have lower scores (again, lower
scores are good!)
• Both scatterplots show negative associations, but which one shows
the stronger relationship? In other words, which explanatory
variable, average driving distance or driving accuracy, is a more
reliable predictor of scoring average?
• Neither association is strong, but average driving distance as an
explanatory variable conforms more to a linear pattern than does
driving accuracy.
• Therefore, low scores in golf tend to be more strongly associated
with average driving distance than with driving accuracy.
Measuring Strength: Correlation
• Judging the strength of a relationship is
difficult to do simply by looking at a
scatterplot.
• What might seem strong to one person might
seem moderate to another.
• There is a numerical way to measure the
strength of a linear association in a
scatterplot, and that is called correlation.
• The correlation (r) is a measure of the
strength and direction of a linear association
between two numerical variables.
• The TI-84 will calculate correlation for us, so
we’ll start by talking about some of the
properties of correlation.
• Unfortunately, Park Place and Boardwalk are
not part of correlation’s properties.
Properties of the Correlation (r)
5) The value of r has no units and is not dependent on the
units used to measure the variables. This makes it an ideal
way to compare the strengths of the associations between
different sets of variables. This also means that changing
the units of the explanatory or response variable won’t
affect the correlation.
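To illustrate the point about units, here is a minimal Python sketch. The data values are hypothetical (they are not the LPGA data), and the helper name `correlation` is our own. It computes r from standardized scores, then converts the explanatory variable from yards to meters and computes r again.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation r: sum of products of standardized scores, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


# Hypothetical values for illustration only (not the LPGA data).
distance_yards = [245, 250, 255, 260, 265, 270]
accuracy_pct = [78, 75, 74, 70, 69, 65]

# Change the units of the explanatory variable (1 yard = 0.9144 m).
distance_meters = [d * 0.9144 for d in distance_yards]

r_yards = correlation(distance_yards, accuracy_pct)
r_meters = correlation(distance_meters, accuracy_pct)

# The two values agree up to floating-point rounding.
print(round(r_yards, 4), round(r_meters, 4))
```

Because each value is standardized before it is multiplied, the conversion factor cancels out, which is why the two printed correlations match.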
To give you a visual, here are some examples of different correlations,
using data from the 32 NFL teams in the 2008 regular season.
• Now it’s time for everyone’s favorite
game….Guess the Correlation!!!
• http://www.rossmanchance.com/applets/
Something to be aware of:
Correlation and association are NOT synonyms.
– Association is a more general word to describe the
relationship between any two variables, whether
numerical or categorical.
– Correlation is a specific measure of the strength
and direction of a linear association between two
numerical variables.
Calculating Correlation on the TI-84
1) Turn on the diagnostic feature. Press 2nd, then 0, to
enter the CATALOG. Scroll down to
DiagnosticOn. Press ENTER twice so it says
Done on the home screen.
2) Enter the top 10
LPGA data in L1 and
L2.
3) Press the STAT button, move to
the CALC menu, and scroll down
to select 8: LinReg(a+bx).
After choosing it, enter L1, L2 and
press ENTER. The last line of the
output gives the value of r. In this
case the correlation is r = -0.905.
• An applet on the book website also calculates
correlation. It is appropriately titled
“Correlation and Regression.”
• Let’s come back to this comparison, in which we initially said
average driving distance seems to be a better predictor of scoring
average than driving accuracy. We made this statement based off
of just looking at the scatterplots.
• It turns out that the correlation between average driving distance
and scoring average is r=-0.47, compared to r=-0.23 between driving
accuracy and scoring average.
• This result confirms our initial suspicion.
FYI: Statistics 101
• In a more traditional statistics course, you may
be asked to calculate the correlation by hand.
Here is the formula:
• The terms being multiplied in the numerator are
standardized scores for the x variable and the y
variable (they are then being summed).
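A common textbook form consistent with this description is sketched below. It is an assumption about notation rather than a copy of the original slide: x-bar and y-bar are the means, s_x and s_y are the sample standard deviations, and n is the number of (x, y) pairs.

```latex
r = \frac{\displaystyle\sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)}{n - 1}
```

Each factor in the numerator is a standardized score (z-score), so a pair contributes a positive product when both values are on the same side of their means and a negative product otherwise.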
• Be cautioned: Correlation does NOT imply
cause-and-effect.
• Just think of Happy Gilmore. He was able to
crush the ball off the tee, but he still finished
in last early in the movie because other facets
of his game were not polished. So even
though there is correlation between average
driving distance and scoring average, there is
no guarantee that increasing driving distance
will result in lower scores.
Influential Points
• What effect do you think unusual observations
have on correlation?
• Much like the mean and standard deviation,
correlation is not a resistant measure, and as such
it can be strongly affected by outliers. On a
scatterplot, an outlier might be a point that seems
out of place.
Here is a scatterplot showing
the number of stolen bases
and home runs for the nine
primary offensive players for
the 2009 Boston Red Sox.
• The points on the left side of the graph seem to have a
positive association, but Jacoby Ellsbury’s unusual value
makes the overall relationship between stolen bases and
home runs look negative. In fact, the correlation here is
r=-0.35.
• What would happen to the correlation if Ellsbury were
omitted?
• Here is the scatterplot
with Ellsbury omitted.
The correlation is now
r=0.21.
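The effect of a single unusual point can be sketched in Python. The (stolen bases, home runs) pairs below are hypothetical and are NOT the actual 2009 Red Sox values, so the correlations will not match the ones on the slides. The last pair plays the role of the unusual observation.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation r computed from sums of deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical (stolen bases, home runs) pairs, for illustration only.
players = [(0, 10), (1, 12), (2, 14), (2, 16), (5, 18),
           (7, 20), (13, 24), (20, 26), (70, 8)]  # last pair is the outlier


def r_for(pairs):
    """Split the pairs into x and y lists and return their correlation."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return correlation(xs, ys)


r_all = r_for(players)           # outlier included
r_without = r_for(players[:-1])  # outlier omitted

print(round(r_all, 2), round(r_without, 2))
```

With these made-up values, the sign of r flips from negative to positive once the unusual point is dropped, which is the same kind of change seen in the Red Sox example.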
• Be careful: It’s not a good idea to remove observations from a
data set without a good reason.
• If a value was recorded incorrectly, correct it and keep it in
the data set. However, if you don’t know the correct value,
then remove it.
• Any observations that do not belong should be excluded,
for example, rows that give totals or averages of all the
observations.
Testing the Correlation
• In sports, teams play exhibition games before
the regular season begins.
• Some people think a team’s PERFORMANCE in
exhibition games is a good predictor of how
well that team will do in the regular season.
Other people think PERFORMANCE in
exhibition games tells us nothing about the
future.
• Here is a scatterplot showing
the 2009 winning
percentages (WP) for the 30
MLB teams during spring
training and during the
regular season.
• The correlation is r=0.45. There seems to be a moderate,
positive, linear association between spring training
winning percentage and regular season winning
percentage.
• Let’s keep in mind this is based only on PERFORMANCES
for one season, so we need to account for RANDOM
CHANCE before making any firm conclusions. It is possible
that there really is no association.
• Some terminology:
– The true correlation between two variables (much like
ABILITY) exists only in theory. To find the true
correlation between spring training winning
percentage and regular season winning percentage,
we would need to repeat the 2009 spring training and
regular season for each team millions and millions of
times and then find the correlation.
– The observed correlation between two variables
(much like PERFORMANCE) is based on a limited
amount of data, such as one season. Because it is
based on a limited amount of data, the observed
correlation will vary from the true correlation due to
RANDOM CHANCE.
• To see if the observed data provide convincing
evidence that there really is a positive association
between winning percentage in spring training
and winning percentage in the regular season, we
will test the following hypotheses using the
correlation as our test statistic.
To simulate:
-Get 30 notecards (for the 30 teams)
-Write each of the 30 regular season winning
percentages on the notecards
-Shuffle
-Randomly pair one of the regular season
winning percentage cards to each of the spring
training winning percentages
-Calculate the correlation
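The notecard steps above can be sketched in Python. This is a rough illustration under stated assumptions: the winning percentages are hypothetical (8 made-up teams rather than the 30 MLB teams), the function name `shuffle_test` is our own, and the p-value is taken as the fraction of shuffled correlations at least as large as the observed one.

```python
import random
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation r computed from sums of deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def shuffle_test(spring, regular, trials=100, seed=None):
    """Shuffle the regular-season 'cards' and re-pair them with spring values."""
    rng = random.Random(seed)
    observed = correlation(spring, regular)
    cards = list(regular)  # one notecard per team
    simulated = []
    for _ in range(trials):
        rng.shuffle(cards)  # random re-pairing
        simulated.append(correlation(spring, cards))
    # Fraction of shuffles with a correlation at least as large as observed.
    p_value = sum(1 for r in simulated if r >= observed) / trials
    return observed, simulated, p_value


# Hypothetical winning percentages for 8 teams, for illustration only.
spring_wp = [0.35, 0.42, 0.48, 0.50, 0.55, 0.58, 0.62, 0.70]
regular_wp = [0.46, 0.40, 0.52, 0.47, 0.55, 0.49, 0.60, 0.57]

observed_r, simulated_rs, p_value = shuffle_test(
    spring_wp, regular_wp, trials=100, seed=1)
print(round(observed_r, 2), p_value)
```

Shuffling breaks any real link between the two variables, so the simulated correlations show what RANDOM CHANCE alone can produce. A small p-value means the observed correlation would be unusual if there were no association.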
• Here are the results for 100 trials of the
simulation:
• What is our p-value?
0%