m2_1_density_normal

Density Curves
Normal Distribution
Area under the curve
Learning Objectives
By the end of this lecture, you should be able to:
– Describe what is meant by a density curve
– Identify normal (bell-shaped), skewed, bimodal, and uniform
distributions from a density curve
– Describe the most common type of distribution encountered in nature
– Estimate areas under a Normal density curve
Density Curve
• Density curve: When applied to a histogram, a density curve is a line
drawn as a smooth approximation to the graph.
• A density curve is a mathematical model of a distribution. That is, we do
not draw this curve by hand. We will allow our statistical software to do it
for us.
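To give a feel for what the software does behind the scenes, here is a minimal sketch in Python. It uses one common technique, a Gaussian kernel density estimate. That choice is an assumption: your statistical software may use a different method. The exam scores and the bandwidth value are made up for illustration.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Each observation contributes one small bell-shaped bump centred on itself.
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density

# Hypothetical exam scores (invented for this sketch, not from the lecture).
scores = [3, 4, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9, 9, 10, 11]
density = gaussian_kde(scores, bandwidth=1.0)  # bandwidth picked by eye (assumption)

# Approximate the total area under the curve with a simple Riemann sum from -5 to 20.
step = 0.01
grid = [-5 + i * step for i in range(2501)]
total_area = sum(density(x) * step for x in grid)
print(round(total_area, 3))
```

The total area should come out very close to 1 (that is, 100% of the observations). This is what lets us read areas under a density curve as percentages.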
Here we have a bell-shaped distribution. It gets this
name because when you draw a density curve over
the histogram, the curve is shaped like a bell. This,
then, is the famous “bell curve”.
The proper name for a bell-shaped distribution is the
normal distribution. You should know this term.
Density curves can be in any shape. It all depends on the distribution.
However, there are some shapes that we tend to see much more frequently
than others.
Here are some of the common distributions we’ve discussed. For each of
these histograms, a density curve has been drawn over it.
[Figures: histograms with density curves for left-skewed, right-skewed,
bimodal, and uniform distributions]
You should be able to identify a distribution by
looking at only the density curve, even
if the histogram is not visible. Try to identify
each of these three:
Right skewed
Bimodal
Uniform
Normal Distribution
If you took a large sample of people (or observations) and graphed any of the
following:
– Heights
– Corn yield per year in Indiana
– SAT (or ACT) Scores
– Blood pressure
– Age of graduate students at DePaul
– Weight of M&Ms per large package
– Etc., etc., etc.
You would see that they all result in a bell-shaped distribution. When looking at data,
the bell-shaped distribution is the most common distribution that shows up in the ‘real world’.
As a result, we give this distribution its own name: the Normal distribution.
And because this distribution is so common, we’re going to spend quite a lot of time
studying it and learning how to find out all kinds of statistics from it.
Example of a dataset that shows a Normal distribution
• One study looked at the gestation (pregnancy)
time of a group of women who were given
prenatal vitamins. After creating the histogram,
the following density curve was drawn. You can
see that it is a normal distribution.
• This tells us that the most common gestation
period was a range in the area of, say, 240-260
days.
• As you might expect, as you go further and
further out (i.e. longer and longer or shorter and
shorter gestation times), there are fewer and
fewer women.
• For example, as you might expect, while there
are some women who had gestation periods of
less than 210 days (or longer than, say, 290
days), they are relatively rare.
• That is a normal distribution: The majority of
people cluster around some value in the middle
(in this case, about 250 days), but as you go
higher and higher (or lower and lower) you find
relatively few observations.
[Figure: Normal density curve of gestation time (days); x-axis from 170 to 310]
The “Normal Curve”
• A density curve drawn over a Normal distribution is
called (not surprisingly) the Normal density curve (or
just the ‘normal curve’).
• Notice that while the density curve is exactly
symmetric, it does not perfectly outline the histogram.
• That is, a density curve is an idealized description of
the data. Still, even though the curve is higher than the
histogram at some points, and below it at others, the
mathematical model used to generate the density
curve will turn out to be very accurate for our
calculations.
Not all distributions are normal!
• While many datasets that we look at do follow a Normal distribution,
many other datasets do not.
– For example, income distribution is not Normal. (It is typically right-skewed).
– The age at which people are diagnosed with Inflammatory Bowel Disease is
typically bimodal.
“Normal” Curves
• Normal curves have the following properties:
– Symmetric
– Unimodal
– Bell-shaped
• Curves like this are called ‘Normal curves’ and the data
distributions they describe are called ‘Normal
distributions’
• The idea of a Normal curve does not imply that other
kinds of curves are somehow abnormal! It’s simply the
term that we use – and it is a term you must be
comfortable with!
How we use density curves
• One of the reasons we love density curves is that by estimating the area under the
curve, we can make various predictions and calculations about the population.
• Important: Be sure you understand, however, that the rules we are going to study
over the next few lectures apply only to density curves of Normal distributions.
These tools will not apply to density curves for, say, skewed distributions.
• Example: Suppose we take our sample of 25 women’s heights, plot them on a
histogram, and then create a density curve. If the density curve turns out to
show a Normal distribution, we can use it to make all kinds of
statistical estimates such as:
– What percentage of women in our population would be more than 6’ tall?
– What percentage of women are between 5’0 and 5’5?
– What is the likelihood of encountering women shorter than 4’6?
– What height marks the 90th percentile of women?
– Etc.
• However, in order to do all of this, we must learn how to calculate the area under
the density curve.
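As a preview of where this is heading, the height questions above could be answered with a few lines of Python once we have a Normal model. This is only a sketch: the mean of 64 inches and standard deviation of 2.5 inches are hypothetical values chosen for illustration, not numbers from this lecture. `NormalDist` comes from Python's standard `statistics` module.

```python
from statistics import NormalDist

# ASSUMPTION: women's heights follow a Normal distribution with
# mean 64 in and standard deviation 2.5 in (hypothetical values).
heights = NormalDist(mu=64.0, sigma=2.5)

# Area to the right of 6'0" (72 in): share of women taller than 6 feet.
taller_than_6ft = 1 - heights.cdf(72)

# Area between 5'0" (60 in) and 5'5" (65 in).
between_5ft0_and_5ft5 = heights.cdf(65) - heights.cdf(60)

# Area to the left of 4'6" (54 in).
shorter_than_4ft6 = heights.cdf(54)

# Height below which 90% of the area lies (the 90th percentile).
height_at_90th_percentile = heights.inv_cdf(0.90)

print(f"Taller than 6'0\":      {taller_than_6ft:.2%}")
print(f"Between 5'0\" and 5'5\": {between_5ft0_and_5ft5:.2%}")
print(f"Shorter than 4'6\":     {shorter_than_4ft6:.4%}")
print(f"90th percentile:       {height_at_90th_percentile:.1f} in")
```

Every one of these answers is an area under the curve: `cdf(x)` is the area to the left of `x`, and `inv_cdf` runs the question in reverse.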
Area under the curve
Here is an example of a histogram and density curve
showing the score results of a group of students on a
certain exam. Scores range from 0 to 12.
If we want to know the percentage of students who scored
below, say, 6 on this exam, we would need to find out the
area under the curve to the left of 6.0. It is shaded on this
diagram.
This percentage is somewhere in the neighborhood of
30%. Determining the exact percentage will be the subject
of an upcoming lecture.
Note: I hope it also makes sense that if the shaded area
tells us that 30% of students scored below 6, it stands to
reason that 70% of students scored higher than 6.
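The "30% below, so 70% above" reasoning can be checked numerically. The sketch below adds up thin slices under a Normal curve (the trapezoid rule) on each side of a score of 6. The model is an assumption: the mean of 7 matches the midpoint estimated from the graph, but the standard deviation of 2 is a guess made for illustration.

```python
from statistics import NormalDist

# ASSUMPTION: exam scores modelled as Normal with mean 7 and sd 2.
# The sd is a guess for illustration, not a value from the lecture.
exam = NormalDist(mu=7.0, sigma=2.0)

def area_under_curve(dist, lo, hi, steps=10_000):
    """Trapezoid-rule approximation of the area under dist.pdf between lo and hi."""
    width = (hi - lo) / steps
    total = 0.5 * (dist.pdf(lo) + dist.pdf(hi))
    for i in range(1, steps):
        total += dist.pdf(lo + i * width)
    return total * width

# The idealized curve has tails beyond the 0-12 score range, so we
# integrate out to 10 standard deviations on each side of the mean.
below_6 = area_under_curve(exam, -13.0, 6.0)
above_6 = area_under_curve(exam, 6.0, 27.0)

print(f"Below 6: {below_6:.1%}")
print(f"Above 6: {above_6:.1%}")
print(f"Total:   {below_6 + above_6:.1%}")
```

Under these assumptions the two areas land near 30% and 70%, and together they account for essentially all of the area under the curve.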
Mean of a Normal distribution
On a Normal density curve, the peak / midpoint /
midline is the mean. (Represented by the black line).
I hope you can see that the area to the left of the line
contains 50% of the area under the curve, while the
area to the right also contains 50% of the area under
the curve.
In terms of the graph seen here, if we estimate the
midpoint to be a score of 7, we can say that about
50% of the population scores below 7 and 50% scores
above.
Examples: Area under the curve
How would you determine the percentage of students
who scored greater than 10?
In this case, we would want to calculate the area under the
curve above the score of 10. It would probably be
somewhere around 5%. Again, we will learn how to
accurately determine this number in an upcoming lecture.
How would you determine the percentage of students
who scored between 6 and 8?
In this case, we would want to calculate the area under the
curve between those two numbers.
About 50% of students achieved a score higher than
_____ ?
Answer: Draw a line down the very center of the curve.
The area under the curve to the right of that line
represents 50% of students. That line is right about a
score of 7. So you could say that 50% of students scored
above 7 (and, of course, about 50% of students scored
below 7).
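The three questions above each correspond to a different piece of the area under the curve. Here is a rough sketch of how they might be computed, again assuming a Normal model with mean 7 and a guessed standard deviation of 2 (the standard deviation is not given in the lecture, so treat the results as illustrative only).

```python
from statistics import NormalDist

# ASSUMPTION: mean 7 (the midpoint read off the graph), sd 2 (a guess).
exam = NormalDist(mu=7.0, sigma=2.0)

# "Greater than 10": area to the right of 10.
above_10 = 1 - exam.cdf(10)

# "Between 6 and 8": area left of 8 minus area left of 6.
between_6_and_8 = exam.cdf(8) - exam.cdf(6)

# "50% scored higher than ___": the score with half the area on each side.
half_way_score = exam.inv_cdf(0.50)

print(f"Above 10:        {above_10:.1%}")
print(f"Between 6 and 8: {between_6_and_8:.1%}")
print(f"50% cutoff:      {half_way_score:.1f}")
```

Notice the pattern: "more than" uses the right-hand area, "between" subtracts one left-hand area from another, and "fill in the blank" questions go from an area back to a score.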
Estimate the area under the curve
While we will shortly learn how to estimate the area under a curve
pretty accurately, you must also be able to make some ballpark
‘guesstimates’.
Example: What percentage of students scored below 6 on this
exam?
Answer: On the graph, it would be the shaded area here. A
ballpark estimate would be somewhere around 30-40%.
Example: What percentage of students scored above 9?
Answer: A reasonable guess might be a number in the vicinity of
20-30%. Don’t worry about accuracy here, just focus on being in
the general area.
Example: What percentage of students scored less than 2?
Answer: A very low number! E.g., 1% would be a good guess.
Example: What percentage of students scored more than 7?
Answer: Since 7 is right around the midpoint, then the area under
the curve to the right of the midline is 50%.
I will ask you to do at least a couple of these estimations on
your quiz and/or exams. However, you will not have to be
super-accurate – you just need to be in the ballpark.
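One way to build intuition for these guesstimates is to simulate a large class and simply count. The sketch below draws 100,000 hypothetical scores from a Normal model with mean 7 and an assumed standard deviation of 2, then reports the fraction on each side of the cutoffs used above. Because the standard deviation is a guess, the results will not match the curve on the slide exactly; the point is only that "area under the curve" and "fraction of observations" are the same idea.

```python
import random

# ASSUMPTION: mean 7 and sd 2 are illustrative values, not from the lecture.
rng = random.Random(42)  # fixed seed so the run is repeatable
scores = [rng.gauss(7.0, 2.0) for _ in range(100_000)]

def fraction(values, condition):
    """Share of values for which condition(value) is true."""
    return sum(1 for v in values if condition(v)) / len(values)

below_6 = fraction(scores, lambda s: s < 6)
above_9 = fraction(scores, lambda s: s > 9)
below_2 = fraction(scores, lambda s: s < 2)
above_7 = fraction(scores, lambda s: s > 7)

print(f"Below 6: {below_6:.1%}")
print(f"Above 9: {above_9:.1%}")
print(f"Below 2: {below_2:.1%}")
print(f"Above 7: {above_7:.1%}")
```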
More practice estimating:
Example: About what percentage of women had a gestation longer
than 250 days?
Answer: About 50%.
Example: About what percentage of women had a gestation less than
210 days?
Answer: A reasonable guess would be a low-ish number such as 15%.
Example: About what percentage of women had a gestation less than
310 days?
Answer: A high number! E.g., 99% would be a good guess.
Example: About 30% of women had a gestation longer than _____?
Choose among the following: 210 days, 230 days, 250 days, 270 days.
Answer: The only reasonable option here would be 270 days.
Note the last question: Turning it around like that is a common
twist that stats people love to put on exams!
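The "turned around" gestation question goes from an area back to a value. A minimal sketch of that idea follows, assuming a Normal model with mean 250 days (the middle of the curve, as described earlier) and a hypothetical standard deviation of 30 days. The standard deviation is not given in the lecture, so the exact cutoff is illustrative only.

```python
from statistics import NormalDist

# ASSUMPTION: mean 250 days (from the lecture's curve), sd 30 days (hypothetical).
gestation = NormalDist(mu=250.0, sigma=30.0)

# If 30% of women are ABOVE the cutoff, then 70% of the area lies to its LEFT.
cutoff = gestation.inv_cdf(1 - 0.30)

# Pick the answer choice closest to the computed cutoff.
options = [210, 230, 250, 270]
closest = min(options, key=lambda days: abs(days - cutoff))

print(f"Computed cutoff: {cutoff:.1f} days")
print(f"Closest option:  {closest} days")
```

Even without software, the same logic works by eye: 30% is less than half, so the cutoff must sit to the right of the midpoint, and 270 is the only option there.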
[Figure: Normal density curve of gestation time (days); x-axis from 170 to 310]
Coming up…
• For the moment, we have been estimating the area
under the curve.
• Very soon, we will look at how to accurately
determine the area under a Normal density curve.
• Still, if you can’t estimate the answers to the
previous questions we have gone through then you
should absolutely not go on to the ‘number
crunching’. Make sure you get the concept down
before moving on.