Objectives 1.3
Density curves and Normal distributions
- Density curves
- Measuring center and spread for density curves
- Normal distributions
- The 68-95-99.7 (Empirical) rule
- Standardizing observations
- Calculating probabilities using the standard Normal Table (CIS Chapter 8, p 105 – mainly p114)
- Inverse Normal calculations
Additional reading:
http://onlinestatbook.com/2/normal_distribution/normal_distribution.html
Histogram and density curves
- As I have mentioned several times, underlying the histogram of the
  observations is the true (usually unknown) histogram of the population.
- If the data is a continuous numerical variable, it can take any value
  within a range (e.g. heights, weights, but not the number of M&Ms, as this
  is a discrete numerical variable). For continuous numerical variables, the
  underlying distribution is known as the density curve.
- Like the histogram, the density tells us which outcomes are more likely.
  Unlike the relative frequency histogram, the y-axis does not denote the
  chance/probability of an event; the area under the curve gives the chance.
- We usually do not know the true density curve, but we can often get a
  good estimate based on the data using statistical software.
- Lab practice: Load the calf data into Statcrunch. Try fitting a range of
  different well known density shapes to the data. On the third page of the
  histogram menu, there is an option called overlay density. This is a list
  of density shapes you can overlay on your histogram to see which best
  fits your data.
Density curves
A density curve is a mathematical model of a distribution.
The total area under the curve, by definition, is equal to 1.00, or 100%.
The area under the curve for a range of values is the proportion of all
observations for that range.
[Figure: histogram of a sample with the smoothed density curve describing theoretically the population.]
Calculation practice
- Make a sketch of a density plot of human heights with mean 67 inches and
  standard deviation 7 inches.
- What is the area below the entire curve?
- On this plot show the probability of a human height being less than 60
  inches.
- On this plot show the probability of a human height being greater than 75
  inches.
- On this plot show the probability of a human height lying between 60 and
  75 inches.
- On the next slide we give examples of different types of density curves.
  Match a variable to each plot.
Density curves come in any
imaginable shape.
Some are well known
mathematically and others aren’t.
Review: Median and mean of a density curve
The median of a density curve is the point that divides the area under
the curve in half.
The mean of a density curve is the balance point, at which the curve
would balance if it were made of solid material.
The median and mean are the same for a symmetric density curve.
The mean of a skewed curve is pulled in the direction of the long tail.
The normal family of density plots
- We now introduce a family of density functions which is extremely useful
  in statistics. It is called the normal distribution.
- There are various reasons why it is an important family:
  - Many variables (but NOT all) have a density which is close to a normal
    distribution. These include biological measurements, some types of exam
    scores, etc.
  - The normal distribution is a good approximation to the results of many
    types of chance outcomes (over the long run). For example, if you toss a
    coin many times, the distribution of the number of heads will look like
    a normal distribution (if tossed enough times). We come back to this
    later.
  - If we can assume a variable is normally distributed, it allows us to
    calculate probabilities easily (for example, given that your weight is
    normal, you can easily calculate your percentile from the mean and
    standard deviation).
  - The normal distribution forms the basis of statistical inference. For
    this reason you should become very familiar with all the normal
    calculations I do from now on, as we will be using these ideas
    throughout the course.
Definition: Normal density
Normal distributions are a family of symmetrical, bell-shaped density
curves defined by a mean µ (mu) and a standard deviation σ (sigma).
We denote a normal distribution by Normal(µ,σ) or N(µ,σ).
The formula for the density curve is somewhat complicated:
$$ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}} $$
e = 2.71828… the base of the natural logarithm
π = pi = 3.14159…
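As a rough illustration of the density formula, the sketch below evaluates the Normal(µ,σ) density in Python and checks numerically that the total area under the curve is close to 1. Python is not used in these slides; the function names, the midpoint-rule integration and the example values µ = 15, σ = 2 are assumptions chosen for illustration.

```python
import math

def normal_density(x, mu, sigma):
    """Height of the Normal(mu, sigma) density curve at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def area_under_curve(lo, hi, mu, sigma, steps=10_000):
    """Approximate the area under the density between lo and hi (midpoint rule)."""
    width = (hi - lo) / steps
    heights = (normal_density(lo + (i + 0.5) * width, mu, sigma) for i in range(steps))
    return sum(heights) * width

# Example values assumed for illustration: mean 15, standard deviation 2.
mu, sigma = 15, 2
# Ten standard deviations either side covers essentially the whole curve.
total_area = area_under_curve(mu - 10 * sigma, mu + 10 * sigma, mu, sigma)
print(round(total_area, 4))
```

The same `area_under_curve` helper can be given a narrower range to approximate the proportion of observations falling in that range.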
Examples of normal density curves
[Figure: three normal density curves.] Here, means are the same (µ = 15)
while standard deviations are different (σ = 2, 4, and 6).
[Figure: three normal density curves.] Here, means are different
(µ = 10, 15, and 20) while standard deviations are the same (σ = 3).
Statcrunch practice
- Making normal density plots in Statcrunch:
  Stat -> Calculators -> Select the normal distribution.
  Here you can choose the mean and standard deviation and calculate the
  area on the left or right.
- Load the calf data into Statcrunch and check whether the calf weights are
  close to normally distributed. Go to Graphics -> Histogram, select a
  weight variable (such as 8 weeks), and select relative frequency (in Type
  options). In overlay distrib choose normal density (don’t give a mean or
  sd and it will use the sample mean and standard deviation of the data to
  create the normal curve).
- You will see the relative frequency histogram and the best fitting normal
  density (with the same mean and standard deviation as the data)
  overlaying the histogram.
- Remember this is just a sample, so the fit won’t be perfect.
The 68-95-99.7% Rule for Normal Distributions
- About 68% of all observations are within 1 standard deviation (1×σ) of
  the mean (µ).
- About 95% of all observations are within 2×σ of the mean µ.
- Almost all (99.7%) observations are within 3×σ of the mean.
- Also called the empirical rule because it works approximately for data
  and many other distributions, e.g., typically 90%-99% of data are within
  two st. dev.’s of the mean.
[Figure: Normal(µ, σ) = Normal(64.5, 2.5) density curve with mean µ = 64.5 and standard deviation σ = 2.5; an inflection point is marked.]
Notation: µ (mu) is the mean of the idealized curve, while x̄ is the mean of a sample.
σ (sigma) is the standard deviation of the idealized curve, while s is the s.d. of a sample.
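The three percentages in the rule are rounded areas under the normal curve. A minimal sketch of how they can be checked, assuming Python's standard-library `statistics.NormalDist` (not a tool these slides use) in place of a table:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Area under the standard Normal curve between -k and +k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    # Prints roughly 0.6827, 0.9545 and 0.9973.
    print(k, round(area_within(k), 4))
```

Because every normal curve has the same shape after standardizing, these areas hold for any µ and σ.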
Do the weights of 8 week calves satisfy the empirical rule?
- Load the calf data into Statcrunch, and make a relative frequency
  histogram of the calf data (use a bin width of, say, 5). Now calculate
  the mean and standard deviation of the calf data and construct the
  intervals (one standard deviation from the mean, two standard deviations
  from the mean, three standard deviations from the mean), and count the
  proportion of calves in these intervals.
- This is just a small sample, but it would appear that calf weights
  'roughly' satisfy this rule.
- You are presented with an 8 week calf with weight 95 pounds. With
  µ = 142.6 and σ = 17, the calculation below shows that he is 2.8 standard
  deviations below the mean:
$$ z = \frac{95 - 142.6}{17} = -2.8 $$
- Assuming normality of weights, the chance of a healthy calf having such a
  low weight is very small. Maybe he is unwell?
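A small sketch of the same z-score calculation, together with the corresponding normal probability. It assumes Python's `statistics.NormalDist` as the calculator and takes the mean 142.6 and standard deviation 17 from the slide; the variable names are made up for illustration.

```python
from statistics import NormalDist

mu, sigma = 142.6, 17    # mean and standard deviation from the slide
weight = 95              # the calf in question

z = (weight - mu) / sigma
# Chance that a calf weighs 95 pounds or less, if weights really are normal.
chance_as_light = NormalDist(mu, sigma).cdf(weight)

print(round(z, 1))              # -2.8
print(round(chance_as_light, 4))
```

The printed probability is well under 1%, which is what "very small" means here.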
Z-scores and the normal density
- The z-scores (defined at the end of Chapter 3) for variables which are
  normally distributed are useful for calculating probabilities.
- Example: The heights of women tend to be normally distributed with mean
  64.5 inches and standard deviation 2.5 inches.
- Question: A woman has a height of 71 inches. Is she exceptionally tall?
- Answer: Calculate how far she is from the mean, taking the spread into
  account. This is the z-transform.
- The z-transform is z = (71 – 64.5)/2.5 = 2.6, so she is 2.6 standard
  deviations to the right of the mean. From the 68-95-99.7 rule we know,
  because heights are close to normally distributed, that roughly 5% of
  women are more than 2 standard deviations from the mean. Therefore she
  has to be in the top 2.5% (5% divided by two) of heights.
- She is tall, but what is the exact percentile? To answer this we need the
  area to the RIGHT of 71 on the density curve (the proportion of women who
  are taller); the percentile is one minus this area. The beauty of the
  z-score is that it can be used to calculate the percentile. The z-score
  of a normal variable has a well documented distribution.
The standard Normal distribution
Because all Normal distributions share the same shape properties, we
can standardize our data to transform any Normal(µ,σ) curve into the
standard normal curve: Normal(0,1).
[Figure: the Normal(64.5, 2.5) curve of height x is transformed into the Normal(0,1) curve of standardized height z (no units).]
For each x we calculate a new value, the z-score z = (x − µ)/σ.
The area under the normal curve
- The area between two points on a normal curve gives the chance of an
  event lying between those two points.
- The z-transform is a useful tool for:
  - Determining the number of standard deviations an observation is from
    the mean.
  - Calculating probabilities, if the data is normal.
- In the next few slides we explain how the z-transform can be used to
  calculate probabilities. However, an online area calculator is given here
  (see the normal calculator link at the end of the page):
  http://onlinestatbook.com/2/normal_distribution/areas_normal.html
  and also in Statcrunch (go to Stat -> Calculators -> Normal).
Calculation: Women’s heights, N(µ, σ) = N(64.5, 2.5)
Women’s heights follow the N(64.5", 2.5") distribution. What percent of
women are shorter than 71 inches tall?
mean µ = 64.5"
standard deviation σ = 2.5"
x (height) = 71"
[Figure: N(64.5, 2.5) density curve with the area to the left of x = 71" marked "Area = ???"; µ = 64.5" is labeled z = 0 and x = 67" is labeled z = 1.]
Always draw a picture of your problem!
We calculate z, the standardized value of x:
$$ z = \frac{x - \mu}{\sigma}, \qquad z = \frac{71 - 64.5}{2.5} = 2.6 $$
so x = 71" is 2.6 s.d. above the mean.
To find the percent of women who are shorter than 71 inches tall, we need to
find the area to the left of z = 2.6. For this, we must use a special table.
Using the standard Normal table
Table A gives the area under the standard Normal curve to the left of any z value.
- 0.0082 is the area under N(0,1) left of z = -2.40
- 0.0080 is the area under N(0,1) left of z = -2.41
- (…)
- 0.0069 is the area under N(0,1) left of z = -2.46
Percent of women shorter than 71"
For z = 1, the area under the standard Normal curve to the left of z is
84.13%.
For z = 2.6, the area under the standard Normal curve to the left of 2.6
is 99.53%.
Conclusion:
99.53% of women are shorter than 71".
By subtraction, 1 – 0.9953 = 0.0047, so 0.47% of women are taller than 71".
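The table lookup can be cross-checked numerically. The sketch below is an illustration only: it assumes Python's standard-library `statistics.NormalDist` as a stand-in for Table A, with the N(64.5, 2.5) model from the slide.

```python
from statistics import NormalDist

heights = NormalDist(mu=64.5, sigma=2.5)   # women's heights in inches

z = (71 - heights.mean) / heights.stdev    # standardized value of x = 71
shorter = heights.cdf(71)                  # area to the left of 71
taller = 1 - shorter                       # area to the right of 71

print(round(z, 1))         # 2.6
print(round(shorter, 4))   # about 0.9953
print(round(taller, 4))    # about 0.0047
```

Calling `cdf` on the unstandardized distribution gives the same area as standardizing first and looking up z in the N(0,1) table.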
Tips on using Table A
Because the Normal distribution is symmetric, there are 2 ways that
you can calculate the area under the standard Normal curve to the
right of a z value.
[Figure: standard Normal curve split at z = -2.33; area to the left = 0.0099, area to the right = 0.9901.]
area to right of z = area to left of –z
area to right of z = 1 – area to left of z
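Both routes to the right-hand area can be compared directly. This is a minimal sketch, assuming Python's `statistics.NormalDist` in place of Table A and using z = -2.33 as the example value:

```python
from statistics import NormalDist

standard_normal = NormalDist()   # Normal(0, 1)
z = -2.33

# Way 1: use symmetry, area right of z equals area left of -z.
right_by_symmetry = standard_normal.cdf(-z)
# Way 2: use the total area of 1.
right_by_subtraction = 1 - standard_normal.cdf(z)

print(round(right_by_symmetry, 4), round(right_by_subtraction, 4))
```

The two numbers agree (up to floating-point rounding), so either rule can be used, whichever is more convenient for the table at hand.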
Tips on using Table A
To calculate the area between two z-values, first get the area under
Normal(0,1) to the left of each z-value from Table A.
Then subtract the smaller area from the larger area.
A common mistake made by students is to subtract the z values instead of
subtracting the areas.
For z1 < z2, the area between z1 and z2 is
the area left of z2 minus the area left of z1.
Calculation Practice:
- Question: What is the chance of randomly selecting a female whose height
  lies between 60 and 70 inches?
- Answer: Calculate the z-transform corresponding to 60 and 70:
$$ z_1 = \frac{60 - 64.5}{2.5} = -1.8, \qquad z_2 = \frac{70 - 64.5}{2.5} = 2.2 $$
- Make a plot with these numbers on the x-axis.
- Using tables: z1 corresponds to the 3.6% percentile and z2 corresponds to
  the 98.6% percentile.
- The probability of a height being between 60 and 70 inches is
  (98.6 - 3.6) = 95%.
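The "subtract the areas, not the z-values" recipe can be sketched as a small helper. The function name `area_between` is made up for illustration, and Python's `statistics.NormalDist` is assumed in place of the tables.

```python
from statistics import NormalDist

def area_between(lo, hi, mu, sigma):
    """Chance that a Normal(mu, sigma) value lies between lo and hi."""
    standard_normal = NormalDist()
    z_lo = (lo - mu) / sigma
    z_hi = (hi - mu) / sigma
    # Subtract the areas to the left of each z-value, never the z-values themselves.
    return standard_normal.cdf(z_hi) - standard_normal.cdf(z_lo)

chance = area_between(60, 70, mu=64.5, sigma=2.5)
print(round(chance, 4))   # about 0.95
```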
More Calculations: Scores in SATs
One way to get admitted to A&M requires a score of at least 1300 on the
combined critical reading and mathematics SAT exams. The SAT scores for 2010
were approximately normal with mean 1016 and standard deviation 212.
What proportion of students taking the SAT in 2010 meet this requirement?
x = 1300, µ = 1016, σ = 212
$$ z = \frac{x - \mu}{\sigma} = \frac{1300 - 1016}{212} = \frac{284}{212} \approx 1.34 $$
Table A: area under N(0,1) to the left of z = 1.34 is 0.9099.
area right of 1300 = total area − area left of 1300 = 1 − 0.9099 = 0.0901
Approximately 9.0% of students scored at least 1300.
Side note: The actual data may contain students who scored exactly 1300.
However, the proportion of scores exactly equal to 1300 is zero for a normal
distribution.
Students are considered for (but not guaranteed) admission if they have a
combined (CR + M) SAT score of at least 1100.
What proportion of all students who took the SAT in 2010 meet (only) this
requirement? That is, what proportion have scores between 1100 and 1300?
x = 1100, µ = 1016, σ = 212
$$ z = \frac{x - \mu}{\sigma} = \frac{1100 - 1016}{212} = \frac{84}{212} \approx 0.40 $$
Table A: area under N(0,1) to the left of z = 0.40 is 0.6554.
area between 1100 and 1300 = area left of 1300 − area left of 1100
= 0.9099 − 0.6554 = 0.2545
Approximately 25.5% of students scored between 1100 and 1300.
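Table A only lists z to two decimal places, so a table answer can differ slightly from a calculator answer. The sketch below, which assumes Python's `statistics.NormalDist` as the calculator, computes the SAT proportions both ways; the variable names are hypothetical.

```python
from statistics import NormalDist

sat = NormalDist(mu=1016, sigma=212)   # 2010 SAT model from the slide
standard_normal = NormalDist()

# Proportion scoring at least 1300 (for a continuous curve, "at least" and
# "more than" give the same area).
at_least_1300 = 1 - sat.cdf(1300)

# Proportion between 1100 and 1300 without rounding z.
between_exact = sat.cdf(1300) - sat.cdf(1100)

# The same proportion with z rounded to two decimals, as when using Table A.
between_table = standard_normal.cdf(1.34) - standard_normal.cdf(0.40)

print(round(at_least_1300, 3), round(between_exact, 3), round(between_table, 3))
```

The rounded and unrounded answers differ only in the third decimal place, which is typical of the error introduced by a two-decimal z table.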
Comparing exams using percentiles
- There are various ways to gain entrance into A&M. We mentioned SATs on
  the previous slide, but there are ACTs too.
- The ACTs have a different scoring system to the SATs; ACT scores range
  from 1-36.
- How do we compare students who have taken different exams?
- The easiest way is by comparing their percentiles.
  - Example: student A is in the top 10% of SAT scores whereas student B is
    in the top 5% of ACT scores. Based on this information, student B did
    better in their exams. How do we obtain these percentiles?
- Example: Suppose that SAT and ACT scores are close to normally
  distributed. SAT scores are almost normally distributed with mean 1025
  and standard deviation 200, whereas ACT scores are almost (in reality
  this is not true, since the scores can only take integer values) normally
  distributed with mean 20 and standard deviation 5.
- Betty scores 1400 on her SATs, whereas Jon scores 31 on his ACT. Which
  student did 'better'?
Using z-scores to compare grades
- Answer: We first make a z-score for both Betty and Jon:
  - Betty’s z-score is z = (1400 - 1025)/200 = 1.875.
  - Jon’s z-score is z = (31 - 20)/5 = 2.2.
- Using the tables we see that Betty is at about the 97.0 percentile,
  whereas Jon is at the 98.6 percentile. So Jon did slightly better than
  Betty, since only 1.4% of students did better than Jon, whereas about
  3.0% of students did better than Betty.
- Equivalently, we can just compare z-transforms. Jon did slightly better as
  his grade is 2.2 standard deviations to the right of the mean, whereas
  Betty is 1.875 standard deviations to the right of the mean. Since
  2.2 > 1.875, Jon did better.
- We can also translate Jon’s grade into a SAT grade using the z-transform.
  Since Jon is 2.2 standard deviations from the mean, this means if he took
  the SAT he would be 2.2 standard deviations from the SAT mean. Thus Jon’s
  translated SAT grade = 1025 + 200 × 2.2 = 1465.
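The comparison above can be sketched in a few lines. This assumes Python's `statistics.NormalDist` as the calculator and the normal models stated on the slide; the variable names are made up for illustration.

```python
from statistics import NormalDist

sat = NormalDist(mu=1025, sigma=200)
act = NormalDist(mu=20, sigma=5)

betty_z = (1400 - sat.mean) / sat.stdev   # 1.875
jon_z = (31 - act.mean) / act.stdev       # 2.2

betty_percentile = sat.cdf(1400)
jon_percentile = act.cdf(31)

# Translate Jon's ACT score to the SAT scale: same z, SAT mean and s.d.
jon_as_sat = sat.mean + jon_z * sat.stdev

print(round(betty_percentile, 3), round(jon_percentile, 3), round(jon_as_sat))
```

Comparing z-scores and comparing percentiles always give the same ordering, because the normal cumulative area increases with z.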
Do these calculations make sense?
- In the previous question we compared Betty’s SAT score with Jon’s ACT
  score by comparing their percentiles (by making the z-transform).
- However, in all statistical analysis we need to take a step back and ask
  ourselves whether the calculations were meaningful. Let’s go through them
  step by step:
  - Comparing the percentiles for the grades in both exams is a reasonable
    thing to do. It gives us an idea of where each student stands with
    respect to the other students who took the exam.
  - The percentiles were calculated by first calculating the z-transform
    and then looking up the z-values in the normal tables.
  - This means we have assumed that the grades for both SATs and ACTs are
    normally distributed.
  - It can be argued that SAT scores are approximately normally
    distributed: although only integer scores are possible, the maximum
    score of 2100 is so large that a normal approximation is plausible
    (this can be checked by making a QQplot).
  - The assumption that ACT grades are normally distributed is clearly
    wrong. ACT grades are a discrete numerical variable which can only take
    integer values between 1 and 36. Therefore, using the normal
    distribution to calculate Jon’s ACT percentile is liable to give
    inaccurate probabilities.
Comparing normal approx. with the true probability
On the left is the true distribution of ACT grades.
By summing the heights of the blocks for scores less than or equal to 31,
we see that scoring 31 puts Jon at the 89% percentile. This is the true
probability.
Comparing this to the normal approximation of 98.6%, we see that the normal
approximation overestimated the percentile.
In statistical inference we often calculate probabilities under the
assumption of normality. We need to be mindful that these are
approximations and the probabilities may not be correct.
A calculation using the wrong distribution will give the wrong result.
Calculation Practice
- A farmer wants to enter either his cow or his pig for the heaviest animal
  competition. The winning animal is the heaviest animal in its category
  (cows or pigs). It is known that the weight of cows is approximately
  normally distributed with mean 280 pounds and standard deviation 20
  pounds (N(280,20)) and the weight of pigs is approximately normally
  distributed with mean 250 pounds and standard deviation 50 pounds
  (N(250,50)). His prize cow weighs 330 pounds and his prize pig weighs 310
  pounds.
- The contest only allows one animal per farmer. Which animal should he
  enter?
  - It makes sense to see how heavy the animal is relative to its species.
  - The z-score for the cow is (330-280)/20 = 2.5, that is, 2.5 standard
    deviations above the mean. This corresponds to about the 99.4%
    percentile.
  - The z-score for the pig is (310-250)/50 = 1.2, which corresponds to
    about the 88.5% percentile. Although the pig’s weight lies further from
    its mean in pounds (60 versus 50), there is a lot more variation in pig
    weight.
  - The farmer should enter the cow, since only about 0.6% of cows are
    heavier than her.
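A minimal sketch of the cow-versus-pig comparison, assuming Python's `statistics.NormalDist` in place of the tables; the helper name `percentile_of` is hypothetical.

```python
from statistics import NormalDist

def percentile_of(weight, mu, sigma):
    """Proportion of the Normal(mu, sigma) population lighter than `weight`."""
    return NormalDist(mu, sigma).cdf(weight)

cow = percentile_of(330, mu=280, sigma=20)   # z = 2.5
pig = percentile_of(310, mu=250, sigma=50)   # z = 1.2

# Enter whichever animal is more extreme relative to its own species.
best = "cow" if cow > pig else "pig"
print(round(cow, 4), round(pig, 4), best)
```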
Calculations based on the plot
- Suppose we want to calculate the chance of randomly selecting a calf
  whose weight at 0.5 weeks is less than 90 pounds. To calculate this
  chance, we need to know from what population we are randomly selecting
  the calf.
Probability calculations
- If we are selecting the calf from just our sample (the 44 calves that we
  were observing), then the chance is simply the sum of the heights of the
  relative frequency bins below 90, that is:
  - 0.341+0.159+0.068 = 0.568. Hence there is a 56.8% chance of selecting a
    calf from the sample of 44 which is less than 90 pounds (at week 0.5).
- On the other hand, if we are sampling from the population of calves, we
  need to use the density plot of the population to find the chance. If we
  believe the density plot of calf weights is close to a normal density
  plot, then we calculate the probability using z-transforms:
  - The mean and standard deviation of the density plot are 90.11 and 7.7
    respectively. Thus to calculate the probability we make a z-transform
    z = (90 - 90.11)/7.7 = -0.014. Looking this up in the tables gives a
    probability very close to 0.49. Hence, assuming the density plot of
    calf weights is normal, the chance of randomly selecting a calf from
    the population of calves which is less than 90 pounds is 49%.
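The two answers come from two different calculations, which the sketch below places side by side. It assumes Python's `statistics.NormalDist` for the population calculation and takes the bin heights, mean and standard deviation from the slide.

```python
from statistics import NormalDist

# Sample: add the relative frequencies of the histogram bins below 90 pounds.
bin_heights_below_90 = [0.341, 0.159, 0.068]
sample_chance = sum(bin_heights_below_90)

# Population: assume a normal density with the stated mean and s.d.
population = NormalDist(mu=90.11, sigma=7.7)
z = (90 - population.mean) / population.stdev
population_chance = population.cdf(90)

print(round(sample_chance, 3), round(z, 3), round(population_chance, 2))
```

The gap between the two numbers is a reminder that a small sample's histogram and the assumed population density need not agree closely.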
Inverse normal calculations
We may also want to find the observed range of values that correspond
to a given proportion/ area under the curve.
For that, we use Table A backward:
- We first find the desired area/proportion in the body of the table.
- We then read the corresponding z-value from the left column and top row.
For an area of 1.25% (0.0125) to the left of z,
the z-value is −2.24.
Example: Female heights
- Female heights tend to be normally distributed, N(64.5,2.5).
- Questions:
  - (a) How tall is a female at the 75% percentile?
  - (b) How tall is a female at the cut-off for the top 10%?
  - (c) How tall is a female at the cut-off for the bottom 2.5%?
- Answers:
  - (a) Look up 0.75 inside the z-table: z = 0.674. This means that someone
    at the 75% percentile is 0.674 standard deviations to the right of the
    mean. That person is 64.5 + 0.674×2.5 = 66.2 inches tall.
  - (b) Top 10% = 90% percentile. Look up 0.9 in the z-table: z = 1.28.
    Using the same argument as above, that person is 64.5 + 1.28×2.5 = 67.7
    inches tall.
  - (c) Look up 0.025 in the z-table: z = -1.96. Using the same argument as
    above, that person is 64.5 - 1.96×2.5 = 59.6 inches tall.
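Reading the table backward corresponds to an inverse-CDF call in software. The sketch below assumes Python's `statistics.NormalDist`, whose `inv_cdf` method returns the value with a given area to its left; the dictionary name `answers` is made up for illustration.

```python
from statistics import NormalDist

heights = NormalDist(mu=64.5, sigma=2.5)   # female heights in inches

# Area to the LEFT of the height asked for in each part.
areas = {"a": 0.75, "b": 0.90, "c": 0.025}
answers = {part: heights.inv_cdf(area) for part, area in areas.items()}

for part, height in answers.items():
    print(part, round(height, 1))
```

Note that part (b) uses 0.90, not 0.10: the top 10% starts where 90% of the area lies to the left.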
- Question: Construct an interval centered about the mean, where 95% of
  female heights lie.
  - Answer: Look up 2.5% and 97.5% in the z-tables: [-1.96, 1.96].
  - This means that 95% of heights will lie within 1.96 standard deviations
    (either way) of the mean.
  - Translating this into heights, 95% of heights lie in
    [64.5 - 1.96×2.5, 64.5 + 1.96×2.5] = [59.6, 69.4] inches.
- Observe and compare 1.96 standard deviations from the mean to the
  68-95-99.7% rule, which corresponds to 1 standard deviation, 2 standard
  deviations and 3 standard deviations from the mean.
  - The 2 standard deviations in the rule is in fact an approximation of
    1.96 standard deviations from the mean.
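The interval construction generalizes to any coverage level by splitting the leftover area equally between the two tails. A minimal sketch, assuming Python's `statistics.NormalDist`; the function name `central_interval` is hypothetical.

```python
from statistics import NormalDist

def central_interval(dist, coverage):
    """Interval centered on the mean containing `coverage` of the distribution."""
    tail = (1 - coverage) / 2          # area left over in each tail
    return dist.inv_cdf(tail), dist.inv_cdf(1 - tail)

low, high = central_interval(NormalDist(mu=64.5, sigma=2.5), 0.95)
z_multiplier = NormalDist().inv_cdf(0.975)

print(round(low, 1), round(high, 1))   # 59.6 69.4
print(round(z_multiplier, 2))          # 1.96
```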
Using QQplots to check normality of data
- By simply superimposing a normal distribution over a histogram it is very
  hard to see how close to normal the data is.
- Typically, to check for normality of data we make a QQplot.
- The idea behind the plot is similar to checking the 68-95-99.7% rule, but
  extended to all multiples of the standard deviation, not just 1, 2, and 3.
- The data is close to normality if the points lie along the x=y line.
- In the following few slides we consider a few examples:
QQplot of normal data
Observe that most points lie close to the x=y line. There are a few which lie far away.
QQplot of right skewed data
For right skewed data the QQplot has a U-type shape.
QQplot for left skewed data
The QQplot for left skewed data looks like an inverted U.
QQplot for uniform data
QQplots for uniform and thick tailed data (data whose tails are not much thinner than
the center) have an S shape.
QQplot for binary response data
In this data set, the response is either 0 or 1. The vertical lines
correspond to each of these responses.
QQplot of calf data
The horizontal lines we see are due to several weights having the same value (due to
rounding). E.g. the first horizontal line corresponds to 5 calves with the same weight.
The weights are not exactly normal, but they do not deviate massively from normality.
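To make the idea behind a QQplot concrete, the sketch below computes the points that such a plot would display: each sorted observation is paired with the value a fitted normal curve would put at the same rank. It is an illustration under stated assumptions: the weights are made-up numbers (not the calf data), the plotting position (i + 0.5)/n is one common convention, and Statcrunch may use a different one.

```python
from statistics import NormalDist, mean, stdev

def qq_points(data):
    """Pair each sorted observation with the normal quantile expected at its rank."""
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))   # normal curve with the sample mean and s.d.
    ordered = sorted(data)
    # (i + 0.5) / n is an assumed plotting position for the i-th smallest value.
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

# Hypothetical weights in pounds, invented purely for illustration.
weights = [88, 92, 79, 101, 95, 90, 85, 97, 90, 93]
points = qq_points(weights)

for expected, observed in points:
    print(round(expected, 1), observed)
```

If the data are close to normal, the two numbers in each pair are similar, which is what "points lie along the x=y line" means. Repeated observed values (such as the two 90s here) produce the flat runs described for the calf data.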
Accompanying problems associated with this Chapter
- Quiz 4
- Quiz 4 – part 2
- Homework 2