Ch 8 PP - Lyndhurst School

Chapter 8
Standardized Scores and Normal
Distributions
Objectives
Students will be able to:
1) Understand what a standard score measures and how to
calculate it
2) Recognize when standardization can be used to compare
values
3) Use the normal distribution model to estimate observations
falling within certain standard deviations of the mean
4) Use the standard normal table to find the percentage of data
falling below a given value in a normal distribution
5) Use the standard normal table in reverse to determine
percentile values of a distribution
Fantasy Sports
• What are fantasy sports?
– In fantasy sports “owners” in a league draft teams of
actual players and use their real-life statistics to
measure the PERFORMANCE of each fantasy team.
– Example: In fantasy baseball each “owner” combines
the statistics of the players on his or her team using
variables such as batting average and home runs for
hitters and wins and strikeouts for pitchers.
– Each team is then ranked according to each variable.
• How do you determine which players you should
draft onto your fantasy team?
– Obviously you want good players, but it can be a bit
tricky because the variables used to measure the
players’ PERFORMANCES are on very different scales.
– For example, in 2009 Ichiro (Mariners) had a 0.352
batting average and Adam Dunn (Nationals) hit 38
home runs. Both are outstanding PERFORMANCES, but
which was better? And what about their
PERFORMANCES in other categories?
• To solve the problem of evaluating PERFORMANCES that
are measured on different scales, we need to learn how
to standardize these PERFORMANCES so they will be on
the same scale.
• Let’s look at how we can
do this…
Standardized Scores
This dotplot shows the batting averages of all MLB
players with at least 300 plate appearances in 2009.
Ichiro’s batting average is the red dot (0.352).
• This dotplot shows the number of home runs hit
by the same group of MLB players. Adam Dunn’s
home run total is the red dot (38).
• The mean batting average of all players in
the sample was .270.
• Ichiro’s batting average of .352 was .082 above
average.
• The average number of home runs for those
players was 15.3.
• Adam Dunn’s home run total of 38 was 22.7
home runs above average.
• Does this mean Dunn’s PERFORMANCE was
better because he was 22.7 above average, as
opposed to Ichiro being .082 above average?
• We should not compare 22.7 and .082 because
batting average and home runs are measured with
different units and the spread of their
distributions is very different.
• What we can do though is convert these values so
we use standard deviation as a common unit of
measure. This will allow us to make fair
comparisons.
• By doing this, we will be able to see how far each
PERFORMANCE is above its respective mean, in
terms of standard deviations.
• What we will do is measure how far the
PERFORMANCE is above the mean, and then
divide it by the standard deviation.
• The standard deviation of the batting average
distribution was 0.029. As previously
mentioned, Ichiro’s PERFORMANCE was .082
above the mean. .082/.029= 2.83 standard
deviations above the mean.
• The standard deviation of the home run
distribution was 9.9. As previously
mentioned, Dunn’s PERFORMANCE was 22.7
home runs above the mean. 22.7/9.9= 2.29
standard deviations above the mean.
• Now that we are on the same scale we can
compare.
• Ichiro’s PERFORMANCE was 2.83 standard
deviations above the mean. Dunn’s
PERFORMANCE was 2.29 standard deviations
above the mean. Which PERFORMANCE was
better?
• Ichiro’s PERFORMANCE was better since it was
farther above the mean, in standard deviations,
than Dunn’s PERFORMANCE.
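The comparison above can be sketched in a few lines of Python. This is only an illustration: the function name `z_score` is an assumption, not something from the slides, and the inputs are the means and standard deviations quoted above.

```python
def z_score(value, mean, sd):
    """Return how many standard deviations `value` is above (+) or below (-) `mean`."""
    return (value - mean) / sd

# Values quoted in the slides for the 2009 season.
ichiro_z = z_score(0.352, 0.270, 0.029)  # batting average
dunn_z = z_score(38, 15.3, 9.9)          # home runs

print(round(ichiro_z, 2))  # 2.83
print(round(dunn_z, 2))    # 2.29
```

Because both results are in standard deviation units, they can be compared directly even though the raw statistics use different scales.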
• Let’s formalize some of what we just talked
about…
• A standardized score or z-score measures how
many standard deviations a PERFORMANCE is
above or below the mean.
• When a z-score is positive, the PERFORMANCE
is above the mean; when a z-score is negative,
the PERFORMANCE is below the mean.
• When would it be good for a z-score to be
negative?
– One example is when looking at a quarterback’s
interceptions. A negative z-score would indicate that
the quarterback threw fewer interceptions than
average.
Hypothetical situation: Suppose Dave Carucci played high
school football and threw 0.5 interceptions per game.
Suppose the average interceptions per game for BCSL
National quarterbacks was 1, with a standard deviation
of 0.4. Calculate and interpret Carucci’s standardized
score for his interceptions per game.
• Calculation: z = (0.5 - 1)/0.4 = -1.25
• Interpretation: Carucci’s interceptions per
game PERFORMANCE was 1.25 standard
deviations below the mean.
• For this distribution, this is a good thing!
Using Standardized Scores to Compare
Across Eras
• In 1927 Babe Ruth set the MLB single season
home run record with 60 home runs. This record
has been broken 3 times:
– 1961 Roger Maris hit 61 home runs
– 1998 Mark McGwire hit 70 home runs
– 2001 Barry Bonds hit 73 home runs
• Is it fair to say Barry Bonds had the best
PERFORMANCE? Keep in mind the unit of
measure is the same.
• There are many differences amongst different
eras in baseball: the quality of batters, the
quality of pitchers, dimensions of ballparks,
equipment, possible use of PEDs, etc.
• As a result, in certain years it may have been
easier to hit a home run than other years.
• To make a fair comparison, we need to see
how these PERFORMANCES compare relative
to other hitters in the same era. Let’s find the
standardized scores for these record-setting
PERFORMANCES.
• All four PERFORMANCES are noteworthy (all
are more than 3 standard deviations from the
mean!). However, Babe Ruth’s appears to be
the most outstanding by a large margin,
relative to the players of his era.
• Even though Barry Bonds has his name in the
record books, these data suggest Babe Ruth
may still be considered the single-season
home run champ, relatively speaking.
Back to Fantasy…
• At the beginning of this chapter, we used
z-scores to compare Ichiro’s batting average
PERFORMANCE to Dunn’s home run
PERFORMANCE.
• Even though Ichiro’s PERFORMANCE was
better, it doesn’t necessarily mean he would
be a better player to draft on our fantasy
team.
• Typical fantasy baseball leagues use five
variables for hitters:
– Batting average, home runs, RBIs, stolen bases,
runs scored
• To estimate the fantasy value for a player, we
must measure the player’s PERFORMANCE in
each category and combine these
measurements in a reasonable way.
• To do this, find a player’s standardized score
for each variable and then add those scores
together.
• Overall, Ichiro’s sum of 4.39 was higher than
Dunn’s sum of 3.79, making him a slightly more
valuable fantasy player than Dunn in 2009.
• Here is the distribution of total fantasy values for
all 284 eligible players.
• The most valuable player was Pujols, with a value
of 11.4.
• Note: This approach of adding the z-scores will
only work when being above average is better
for each category.
• Example: If you used the number of strikeouts
for hitters as a variable, then being above
average is not a good thing.
• In situations like this, you should subtract the
z-scores for a category where being above
average is a bad thing.
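One way to sketch this add-and-subtract rule in Python is shown below. The function name `fantasy_value`, the category names, and the z-scores are all hypothetical, chosen only to illustrate the sign flip; they are not real player data.

```python
def fantasy_value(z_scores, lower_is_better=()):
    """Sum z-scores, subtracting any category where being above average is bad."""
    total = 0.0
    for category, z in z_scores.items():
        if category in lower_is_better:
            total -= z  # above average hurts, so subtract
        else:
            total += z  # above average helps, so add
    return total

# Hypothetical z-scores for an illustrative player (not real data).
player = {"batting_average": 1.0, "home_runs": 0.5, "strikeouts": 1.5}

print(fantasy_value(player, lower_is_better={"strikeouts"}))  # 0.0
```

Here the player's above-average strikeout total cancels out the value earned in the other two categories.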
• What would be considered an unusual z-score?
• Let’s look at the distribution of the z-scores for
the different hitting variables we have been
discussing…
Pg 271
• As you can see, it is fairly unusual to have a
PERFORMANCE that is more than 2 standard
deviations away from the mean and quite rare
to have one more than 3 standard deviations
from the mean, especially in a roughly
symmetric distribution.
The 68-95-99.7 Rule
• In general, when a distribution of
PERFORMANCES is roughly symmetric,
unimodal, and bell-shaped:
– Approximately 68% of the observations will be
within 1 standard deviation of the mean.
– Approximately 95% of the observations will be
within 2 standard deviations of the mean.
– Approximately 99.7% of the observations will be
within 3 standard deviations of the mean.
• Here is a visual summary: (pg 273)
• Note: Distributions that are skewed or
bimodal may not follow this rule very well.
• Here is an example using Barry Sanders’s rushing
yards per game for his career. (pg 273)
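The percentages in the 68-95-99.7 rule can be checked against a Normal model. The sketch below assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`; the helper name `proportion_within` is an assumption made for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def proportion_within(k):
    """Proportion of a Normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(100 * proportion_within(k), 1))
```

The printed percentages are close to 68, 95, and 99.7, which is where the rule gets its name. Real data that are only roughly bell-shaped will match these values only approximately.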
The Normal Distribution
• The Normal distribution is a mathematical
model that is often used to describe
distributions of data that are symmetric,
unimodal, and bell-shaped.
• The graph of a Normal distribution is called a
Normal curve.
• All Normal curves are symmetric, unimodal,
and bell-shaped.
• Here is a histogram of Peyton Manning’s passing yards
in his first 176 regular season games (1998-2008).
• A Normal curve is on top of the histogram.
• The bars of the histogram represent 100% of
Manning’s PERFORMANCES.
• Therefore, the area under the Normal curve also
needs to represent 100% of Manning’s
PERFORMANCES.
• This leads us to two important facts about the
Normal curve:
– 1) The total area under any Normal curve is equal to 1
(100%).
– 2) The expected proportion of PERFORMANCES
between two values is equal to the area under the
Normal curve between the same two values.
• Example: To estimate what proportion of
Manning’s PERFORMANCES were less than
180 yards, we would calculate the area under
the Normal curve to the left of 180, as
illustrated below.
Using the Normal Table
• To estimate what proportion of Manning’s
PERFORMANCES were below 180 yards, we
first have to standardize the value 180. This
distribution has a mean of 259 and a standard
deviation of 74.
• z = (180 - 259)/74 = -1.07
• A z-score of -1.07 means a PERFORMANCE of 180
passing yards is 1.07 standard deviations below
the mean.
• Since this PERFORMANCE isn’t exactly 1, 2, or 3
standard deviations below the mean, we cannot
use the 68-95-99.7 rule.
• Instead, we have to use the standard Normal
table which lists the proportion of
PERFORMANCES that are less than a given
standardized score in a Normal distribution.
• This table is in the back of your book.
• We have to look up the z-score of -1.07 on the
table.
• The table on the left has the negative scores.
• Since the score starts out as “-1.0”, go to the z
column and go down until you reach -1.0.
• Since there is a 7 in the hundredths place (.07), go
to the right until you reach the .07 column. You
should now be located in the spot that has -1.0 to
the left, and .07 on the top.
• This value should be .1423.
• The area under the standard Normal curve to the
left of a z-score of -1.07 is equal to 0.1423.
• This means we expect about 14.23% of Manning’s
PERFORMANCES to be less than 180 yards.
• Let’s try another one. Estimate what proportion
of Manning’s PERFORMANCES were below 350
yards. Remember, the mean is 259 yards and the
standard deviation is 74 yards.
• z = (350 - 259)/74 = 1.23
• Since the z-score is 1.23, look for 1.2 in the z
column and .03 on top.
• This should give us an area of .8907, meaning we
expect about 89.07% of Manning’s PERFORMANCES
to be below 350 yards.
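The two table lookups above can be reproduced in Python as a check. This sketch assumes `statistics.NormalDist` from the standard library (Python 3.8+). The helper name `table_area_below` is made up for this example, and it rounds the z-score to two decimals and the area to four decimals to mimic the printed table.

```python
from statistics import NormalDist

def table_area_below(value, mean, sd):
    """Mimic a table lookup: round z to two decimals, then take the area to its left."""
    z = round((value - mean) / sd, 2)
    area = round(NormalDist().cdf(z), 4)  # NormalDist() is the standard Normal
    return z, area

print(table_area_below(180, 259, 74))  # (-1.07, 0.1423)
print(table_area_below(350, 259, 74))  # (1.23, 0.8907)
```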
• Now let’s say we want to know what
proportion of games Manning had at least 290
passing yards. We will start this problem the
same way. First, get the z-score, then look up
that score on the standard Normal table.
• The z-score is (290 - 259)/74 = .42.
• This gives us an area of .6628 on the standard
Normal table.
• .6628 is the proportion of PERFORMANCES
that are less than a z-score of .42.
• However, we want the proportion of games
that are at least 290 passing yards (meaning
290 yards and above). What should we do?
• Since we know the area under the curve has
to be 1, we need to subtract .6628 from 1.
This will give us the proportion of
PERFORMANCES that are greater than a
z-score of .42.
• 1-.6628= .3372.
• This means that we expect about 33.72% of
Manning’s PERFORMANCES to be 290 yards or
more. Here’s an illustration:
• Let’s try another one. Find what proportion of
games Manning had at least 150 yards passing.
(mean 259 yards; SD 74 yards)
• Z-score: (150 - 259)/74 = -1.47.
• Area under the standard Normal curve to the left
of -1.47: .0708
• 1-.0708= .9292
• We expect about 92.92% of Manning’s PERFORMANCES
to be at least 150 yards.
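The "at least" calculations follow the same pattern: find the area to the left, then subtract it from 1. A minimal Python sketch is below, assuming `statistics.NormalDist` (Python 3.8+). The helper name `table_area_at_least` is hypothetical, and it rounds the same way the table does.

```python
from statistics import NormalDist

def table_area_at_least(value, mean, sd):
    """Area to the right of `value`, found as 1 minus the table area to its left."""
    z = round((value - mean) / sd, 2)
    area_below = round(NormalDist().cdf(z), 4)
    return round(1 - area_below, 4)

print(table_area_at_least(290, 259, 74))  # 0.3372
print(table_area_at_least(150, 259, 74))  # 0.9292
```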
• How would we go about finding a proportion that
was between two PERFORMANCES?
• Let’s say we want to find the proportion of
Manning’s PERFORMANCES that were between
180 yards and 290 yards. What do we do?!?!
• Here is a picture of what we want:
• Here is what we have to do:
– First, calculate the area to the left of 290. Then
calculate the area to the left of 180. Then subtract
those values. (still using mean 259, SD 74)
– Area to the left of 290 (z = .42): .6628
– Area to the left of 180 (z = -1.07): .1423
– .6628-.1423= .5205
– We expect about 52.05% of Manning’s PERFORMANCES
to be between 180 and 290 yards.
• Let’s try another. What proportion of
Manning’s PERFORMANCES were between
125 yards and 275 yards?
– Area to the left of 275 (z = (275 - 259)/74 = .22): .5871
– Area to the left of 125 (z = (125 - 259)/74 = -1.81): .0351
– .5871-.0351= .5520
– We expect about 55.2% of Manning’s PERFORMANCES
to be between 125 yards and 275 yards.
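Finding the area between two values is a subtraction of two left-hand areas. The sketch below assumes `statistics.NormalDist` (Python 3.8+), and the helper name `table_area_between` is made up for this example. It rounds each z-score and each area the way the table does. Without that rounding the answers differ slightly in the fourth decimal place.

```python
from statistics import NormalDist

def table_area_between(low, high, mean, sd):
    """Area between `low` and `high`, using table-style rounding at each step."""
    def area_below(value):
        z = round((value - mean) / sd, 2)
        return round(NormalDist().cdf(z), 4)
    return round(area_below(high) - area_below(low), 4)

print(table_area_between(180, 290, 259, 74))  # 0.5205
print(table_area_between(125, 275, 259, 74))  # 0.552
```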
Using the Normal Distribution in
Reverse
• In 2008, the distribution of batting averages for
MLB players with at least 300 plate appearances
was approximately Normal with a mean of 0.272
and a standard deviation of 0.027.
• Suppose a player gets a salary bonus if his batting
average is in the top 10% of all players. How well
must a player hit for his batting average to be in
the top 10%?
• We need to find the boundary between the lowest
90% of the distribution and the highest 10%.
• The boundary value is called the 90th percentile,
because 90% of the values fall below it.
• We know that the area under the curve to the left
of this boundary is .90. Therefore, we want to look
at the interior of the standard Normal table for a
proportion closest to 0.9000 and get the z-score
associated with this proportion.
• The closest value is 0.8997. This corresponds
to a z-score of 1.28. This means the 90th
percentile is 1.28 standard deviations above
the mean.
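• Now let’s find the batting average associated
with this z-score by solving z = (value - mean)/SD
for the value:
– batting average = 0.272 + 1.28(0.027) = 0.307 (rounded)
– A player must hit about .307 or better to be in the
top 10%.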
• Let’s try another one. In 1970, the rebounding
average for all ABA players with a minimum of 15
games played was 5.2 rebounds per game, with a
standard deviation of 1.6.
• Say Jackie Moon, owner of the Flint Tropics, wants
to release his center, Vakidis, if he finishes in the
bottom 15% of the league in rebounding.
• At or below what rebounding average would Vakidis
finish in the bottom 15% of the league in
rebounding?
• Start by looking up .1500 in the table and finding
the corresponding z-score.
– The closest value is .1492, which is a z-score of -1.04.
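• Now, substitute in and solve for the rebounding
average:
– -1.04 = (x - 5.2)/1.6, so x = 5.2 - 1.04(1.6) = 3.54 (rounded)
– Vakidis would be in the bottom 15% if he averages
about 3.5 rebounds per game or fewer.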
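Both reverse lookups can be checked in Python without a table. This sketch assumes `statistics.NormalDist` (Python 3.8+), whose `inv_cdf` method returns the value with a given area to its left. The variable names are assumptions made for this example.

```python
from statistics import NormalDist

# 2008 batting averages: mean 0.272, SD 0.027 (from the slides).
batting = NormalDist(mu=0.272, sigma=0.027)
cutoff_top_10 = batting.inv_cdf(0.90)      # 90th percentile

# 1970 ABA rebounding averages: mean 5.2, SD 1.6 (from the slides).
rebounds = NormalDist(mu=5.2, sigma=1.6)
cutoff_bottom_15 = rebounds.inv_cdf(0.15)  # 15th percentile

print(round(cutoff_top_10, 3))     # 0.307
print(round(cutoff_bottom_15, 2))  # 3.54
```

Because `inv_cdf` does not round the z-score to two decimals, its answers can differ slightly from the table method in later decimal places.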
Technology and the Normal
Distribution
• The TI-84 calculator can be your best friend
when it comes to performing calculations
involving Normal distributions.
• Let’s say the number of points Anne scores in
a basketball game is approximately Normally
distributed, with a mean of 13.1 and a
standard deviation of 4.2. Let’s use this
information to solve a few problems…
• In what proportion of games do you expect
Anne to score between 10 and 15 points?
– Let’s do this manually first, and then use the TI-84
calculator…
– Area to the left of 10 (z = (10 - 13.1)/4.2 = -.74): .2296
– Area to the left of 15 (z = (15 - 13.1)/4.2 = .45): .6736
– Proportion= .6736-.2296= .4440, or 44.4%
To do this on the TI-84, we use the normalcdf
command.
1) Press 2nd-DISTR (VARS key)
2) Select the second option: normalcdf
3) Area=normalcdf(lower boundary, upper
boundary, mean, standard deviation)
– normalcdf(10, 15, 13.1, 4.2); Press enter
– The result, about .444, matches the manual
calculation. Any small difference comes from rounding
the z-scores to two decimal places for the table.
• Try another one. In what proportion of Anne’s
games do you expect her to score between 9
and 18 points?
– normalcdf(9, 18, 13.1, 4.2)
– 71.38%
• If you are just looking for the area below a
certain value, use -9999 as your lower
boundary.
• If you are just looking for the area above a
certain value, use 9999 as your upper
boundary.
• Let’s try some of these…
• Remember Anne’s mean is 13.1 points, with a
standard deviation of 4.2
• What proportion of games can you expect
Anne to score more than 20 points?
– normalcdf(20, 9999, 13.1, 4.2)= .0502, or 5.02%
• What proportion of games can you expect
Anne to score less than 17 points?
– normalcdf(-9999, 17, 13.1, 4.2)= .8234, or 82.34%
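For readers without a TI-84, the normalcdf calls above can be approximated in Python. This is a sketch, not the calculator's implementation: it assumes `statistics.NormalDist` (Python 3.8+) and defines a hypothetical `normalcdf` function with the same argument order as the calculator command. It also borrows the slides' convention of -9999 and 9999 for open-ended boundaries.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mean, sd):
    """Area under the Normal(mean, sd) curve between `lower` and `upper`."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(upper) - dist.cdf(lower)

# Anne's points per game: mean 13.1, SD 4.2 (from the slides).
print(round(normalcdf(10, 15, 13.1, 4.2), 3))     # between 10 and 15
print(round(normalcdf(9, 18, 13.1, 4.2), 3))      # between 9 and 18
print(round(normalcdf(20, 9999, 13.1, 4.2), 3))   # more than 20
print(round(normalcdf(-9999, 17, 13.1, 4.2), 3))  # less than 17
```

The printed proportions should agree with the calculator results quoted above to about three decimal places.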
• On the iPad, myNormal Calculator can
perform similar operations.
• You need to select iPhone only when
searching for the app.
• When using the app, you need to perform the
extra step of calculating the z-scores.
Using the TI-84 to Calculate Percentiles
• To find percentiles in a Normal distribution, use
the invNorm command, found in 2nd-DISTR.
• Boundary=invNorm(area to the left of the boundary, mean,
standard deviation)
• Let’s find the 25th percentile of Anne’s distribution
of points scored.
– invNorm(.25, 13.1, 4.2)= 10.27.
– Meaning, in about 25% of her games, Anne will score
less than 10.27 points.
• Let’s try two more.
• Anne will score less than how many points in
about 60% of her games?
– invNorm(.60, 13.1, 4.2)= 14.16 points
• How many points will Anne need to score to
be in the upper 15% of points scored?
– invNorm(.85, 13.1, 4.2)= 17.45 points
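The invNorm results above can be checked in Python as well. This sketch assumes `statistics.NormalDist` (Python 3.8+). The wrapper name `inv_norm` is hypothetical and simply mirrors the calculator's argument order.

```python
from statistics import NormalDist

def inv_norm(area_left, mean, sd):
    """Value that has `area_left` of a Normal(mean, sd) distribution below it."""
    return NormalDist(mu=mean, sigma=sd).inv_cdf(area_left)

# Anne's points per game: mean 13.1, SD 4.2 (from the slides).
for area in (0.25, 0.60, 0.85):
    print(area, round(inv_norm(area, 13.1, 4.2), 2))
```

Note that the upper 15% corresponds to an area of 0.85 to the left, which is why 0.85 is passed in rather than 0.15.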