1.4 Lecture - College of Marin

8/26/2011
Looking at Data - Distributions
Density Curves and Normal Distributions
IPS Chapter 1.3
© 2009 W.H. Freeman and Company
Edited by Nikos Psomas
Objectives (IPS Chapter 1.3)
Density curves and Normal distributions
Density curves
Measuring center and spread for density curves
Normal distributions
The 68-95-99.7 rule
Standardizing observations
Using the standard Normal Table
Inverse Normal calculations
Normal quantile plots (Skip)
Density curves
A density curve is a mathematical model of a distribution.
The total area under the curve, by definition, is equal to 1, or 100%.
The area under the curve for a range of values is the proportion of all
observations for that range.
[Figure: histogram of a sample, with the smoothed density curve that theoretically describes the population.]
Density curves come in any
imaginable shape.
Some are well known
mathematically and others aren’t.
Median and mean of a density curve
The median of a density curve is the equal-areas point: the point that
divides the area under the curve in half.
The mean of a density curve is the balance point, at which the curve
would balance if it were made of solid material.
The median and mean are the same for a symmetric density curve.
The mean of a skewed curve is pulled in the direction of the long tail.
Normal distributions
Normal – or Gaussian – distributions are a family of symmetrical, bell-shaped density curves defined by a mean µ (mu) and a standard deviation σ (sigma): N(µ,σ).
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$$
e = 2.71828… The base of the natural logarithm
π = pi = 3.14159…
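As a rough illustration, the Normal density formula can be evaluated directly and cross-checked against Python's standard library. The function name `normal_pdf`, the example values of µ and σ, and the crude Riemann-sum check of the total area are assumptions made for this sketch, not part of the lecture.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x, written out from the formula."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 64.5, 2.5  # example values; any mu and any sigma > 0 would do

# The hand-written formula should agree with the library implementation.
by_formula = normal_pdf(66.0, mu, sigma)
by_library = NormalDist(mu, sigma).pdf(66.0)
print(by_formula, by_library)

# Crude Riemann sum from mu - 8*sigma to mu + 8*sigma:
# the total area under a density curve should be close to 1.
step = 0.001
n_steps = int(16 * sigma / step)
area = sum(normal_pdf(mu - 8 * sigma + i * step, mu, sigma)
           for i in range(n_steps)) * step
print(round(area, 4))
```

The area check is only numerical evidence that the curve is a valid density curve; it is not a proof.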
A family of density curves
Here, the means are the same (µ = 15) while the standard deviations are different (σ = 2, 4, and 6).
Here, the means are different (µ = 10, 15, and 20) while the standard deviations are the same (σ = 3).
[Figures: two panels of Normal density curves, each on a horizontal axis from 0 to 30.]
The 68-95-99.7% Rule for Normal Distributions
About 68% of all observations are within 1 standard deviation (σ) of the mean (µ).
About 95% of all observations are within 2σ of the mean µ.
Almost all (99.7%) observations are within 3σ of the mean.
[Figure: Normal curve N(µ, σ) = N(64.5, 2.5), with mean µ = 64.5 and standard deviation σ = 2.5; the inflection points of the curve lie 1σ on either side of the mean.]
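The three percentages in the rule can be checked numerically. The sketch below uses Python's `statistics.NormalDist`; the helper name `within` is an assumption made for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(0, 1)

def within(k):
    """Proportion of a Normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sigma: {100 * within(k):.2f}%")
```

Because the calculation is done on the standard Normal curve, the same proportions hold for every N(µ, σ), including the N(64.5, 2.5) example.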
Normal Distribution Percentages
below µ−3σ: .14%
µ−3σ to µ−2σ: 2.1%
µ−2σ to µ−1σ: 13.6%
µ−1σ to µ: 34%
µ to µ+1σ: 34%
µ+1σ to µ+2σ: 13.6%
µ+2σ to µ+3σ: 2.1%
above µ+3σ: .14%
Using the 68-95-99.7% Rule – Example
Heights of American men follow a normal distribution with mean 5ft 10in and standard
deviation 3in.
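What % of American men are taller than 5ft 7in but shorter than 6ft 4in?
34% + 34% + 13.6% = 81.6%
What % of American men are shorter than 6ft 1in?
.14% + 2.1% + 13.6% + 34% + 34% = 83.84%
[Figure: Normal curve with σ = 3”, axis marked at 5’1”, 5’4”, 5’7”, 5’10”, 6’1”, 6’4”, 6’7”, and areas .14%, 2.1%, 13.6%, 34%, 34%, 13.6%, 2.1%, .14% from left to right.]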
Height Distribution of American Men
There are about 100,000,000 adult men in America.
The table shows the expected number of American men in various height ranges.
4'7" - 4'10": 3,200
4'10" - 5'1": 135,000
5'1" - 5'4": 2,100,000
5'4" - 5'7": 13,600,000
5'7" - 5'10": 34,000,000
5'10" - 6'1": 34,000,000
6'1" - 6'4": 13,600,000
6'4" - 6'7": 2,100,000
6'7" - 6'10": 135,000
6'10" - 7'1": 3,200
7'1" - 7'4": 28
7'4" - 7'7": (not given)
[Figure: Normal curve with areas .14%, 2.1%, 13.6%, 34%, 34%, 13.6%, 2.1%, .14%.]
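The expected counts can be approximated by multiplying the population size by the area under the height curve for each range. This is a sketch assuming the N(70", 3") model for men's heights (5ft 10in mean, 3in standard deviation) and the 100,000,000 figure quoted on the slide; the helper name `expected_men` is hypothetical.

```python
from statistics import NormalDist

heights = NormalDist(mu=70, sigma=3)   # inches: 5'10" mean, 3" standard deviation
population = 100_000_000               # adult men, as quoted on the slide

def expected_men(low, high):
    """Expected number of men with height between low and high (inches)."""
    return population * (heights.cdf(high) - heights.cdf(low))

# Walk outward from the mean in 3-inch (1 sigma) bands on the tall side.
for k in range(6):
    low, high = 70 + 3 * k, 70 + 3 * (k + 1)
    print(f"{low} to {high} inches: about {expected_men(low, high):,.0f}")
```

The computed values are close to the table entries; small differences come from the rounded percentages used on the slide.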
Some famous tall guys!
S.D. above average: players (US population this tall)
3σ: Michael Jordan 6'6", Kobe Bryant 6'7" (130,000)
4σ: Larry Bird 6'9", Karl Malone 6'9" (3,200)
5σ: Shaquille O'Neal 7'1", Wilt Chamberlain 7'1", Kareem Abdul-Jabbar 7'2" (28)
6σ: Yao Ming 7'5" (2 in the world)
Normal Distribution Calculations
The 68-95-99.7% rule gives a good way to compute normal
distribution percentages for intervals with end points that are an
integer multiple of σ away from the mean µ.
What about intervals with end points that are not an integer multiple
of σ away from the mean µ?
What % of American men are shorter than 5ft 5in tall?
[Figure: Normal curve with σ = 3”, axis marked at 5’1”, 5’4”, 5’7”, 5’10”, 6’1”, 6’4”, 6’7”; the area to the left of 5’5” is marked “?”.]
Normal Distribution Percentages
Normal distribution percentages for any interval under the normal
curve can be computed using software, calculators, or tables.
A = % of men shorter than 5’ 8” tall
B = % of men taller than 5’ 8” but shorter than 6’ 3”
C = % of men taller than 6’ 3”
[Figure: Normal curve with σ = 3”, axis marked at 5’1”, 5’4”, 5’7”, 5’10”, 6’1”, 6’4”, 6’7”; areas A, B, and C are separated at 5’8” and 6’3”.]
Normal Calculations Using TI-83/84
Press 2nd VARS/DISTR
normalcdf(a, b, µ, σ)
= fraction (proportion, or %) of population values that are larger
than a but smaller than b
P[ a ≤ X ≤ b]
[Figure: Normal curve for X with standard deviation σ, with a, µ, and b marked on the horizontal axis.]
% of men taller than 5’ 8” but shorter than 6’ 3”
normalcdf(a, b, µ, σ)
= normalcdf(68, 75, 70, 3) = .69972 or 69.97%
This is also the probability that a randomly selected man will be taller than 5’ 8” but shorter than 6’ 3”.
[Figure: Normal curve with σ = 3”, a = 5’8”, b = 6’3”, axis marked at 5’1”, 5’4”, 5’7”, 5’10”, 6’1”, 6’4”, 6’7”.]
% of men shorter than 5’ 8”
normalcdf(a, b, µ, σ)
= normalcdf(0*, 68, 70, 3) = .25249 or 25.25%
*Note: Use a number for a that’s 5σ or more below the mean µ.
[Figure: Normal curve with σ = 3”, b = 5’8”, axis marked at 5’1”, 5’4”, 5’7”, 5’10”, 6’1”, 6’4”, 6’7”; the area to the left of b is the % of men shorter than 5’ 8”.]
% of men taller than 6’ 3”
normalcdf(a, b, µ, σ)
= normalcdf(75, 100*, 70, 3) = .04779 or 4.78%
*Note: Use a number for b that’s 5σ or more above the mean µ.
[Figure: Normal curve with σ = 3”, a = 6’3”, axis marked at 5’1”, 5’4”, 5’7”, 5’10”, 6’1”, 6’4”, 6’7”; the area to the right of a is the % of men taller than 6’ 3”.]
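The three calculator results can be cross-checked away from the TI-83/84. The Python function below is a stand-in written for this sketch (it reuses the name `normalcdf` for readability but is not the calculator's implementation); heights are in inches, so 5’8” = 68, 5’10” = 70, and 6’3” = 75.

```python
from statistics import NormalDist

def normalcdf(a, b, mu, sigma):
    """P[a <= X <= b] for X ~ N(mu, sigma); a stand-in for the calculator function."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(b) - dist.cdf(a)

between = normalcdf(68, 75, 70, 3)    # taller than 5'8" but shorter than 6'3"
shorter = normalcdf(0, 68, 70, 3)     # shorter than 5'8"; 0 is far more than 5 sigma below the mean
taller = normalcdf(75, 100, 70, 3)    # taller than 6'3"; 100 is 10 sigma above the mean

print(round(between, 5), round(shorter, 5), round(taller, 5))
print(round(between + shorter + taller, 5))  # the three areas should add up to about 1
```

The placeholder end points 0 and 100 work because the area more than 5σ from the mean is negligible at this level of rounding.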
Normal Calculations Using
STANDARD NORMAL TABLES
The standard Normal distribution
Because all Normal distributions share the same properties, we can
standardize data to transform any Normal curve N(µ,σ) into the
standard Normal curve N(0,1).
[Figure: N(64.5, 2.5) on the x axis => N(0,1) on the z axis (standardized height, no units).]
For each x we calculate a new value, z (called a z-score).
Standardizing: calculating z-scores
A z-score measures the number of standard deviations that a data
value x is from the mean µ.
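$$z = \frac{x - \mu}{\sigma}$$
When x is 1 standard deviation larger than the mean, then z = 1.
$$\text{for } x = \mu + \sigma: \quad z = \frac{\mu + \sigma - \mu}{\sigma} = \frac{\sigma}{\sigma} = 1$$
When x is 2 standard deviations larger than the mean, then z = 2.
$$\text{for } x = \mu + 2\sigma: \quad z = \frac{\mu + 2\sigma - \mu}{\sigma} = \frac{2\sigma}{\sigma} = 2$$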
When x is larger than the mean, z is positive.
When x is smaller than the mean, z is negative.
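A minimal sketch of standardizing in code, using the N(64.5, 2.5) curve from the earlier slide as example values; the function name `z_score` is an assumption for illustration.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma

mu, sigma = 64.5, 2.5

one_above = z_score(mu + sigma, mu, sigma)       # x is 1 sd above the mean
two_above = z_score(mu + 2 * sigma, mu, sigma)   # x is 2 sd above the mean
one_below = z_score(mu - sigma, mu, sigma)       # x is 1 sd below the mean

print(one_above, two_above, one_below)
```

The sign of the result carries the direction: positive above the mean, negative below it.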
Normal Distribution Tables
Example: If z = -2.57, the area under the curve below -2.57 is .0051, or 0.51%.
Example: If z = 1.02, the area under the curve below 1.02 is .8461, or 84.61%.
Ex. Women’s heights
Women’s heights follow the N(µ, σ) = N(64.5”, 2.5”) distribution. What percent of women are shorter than 67 inches tall (that’s 5’7”)?
mean µ = 64.5"
standard deviation σ = 2.5"
x (height) = 67"
[Figure: N(64.5, 2.5) curve with µ = 64.5” (z = 0) and x = 67” (z = 1) marked; the areas on either side of x are labeled “???”.]
We calculate z, the standardized value of x:
$$z = \frac{x - \mu}{\sigma} = \frac{67 - 64.5}{2.5} = \frac{2.5}{2.5} = 1$$
So x is 1 standard deviation from the mean.
Percent of women shorter than 67”
For z = 1, the area under the standard Normal curve to the left of z is 0.8413.
Conclusion: 84.13% of women are shorter than 67”. By subtraction, 1 - 0.8413 = 0.1587, so 15.87% of women are taller than 67".
[Figure: N(µ, σ) = N(64.5”, 2.5”) with µ = 64.5” and x = 67” (z = 1); area ≈ 0.84 to the left of x and area ≈ 0.16 to the right.]
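The women's heights example can be reproduced in two equivalent ways: standardize first and use the standard Normal curve (the quantity Table A tabulates), or ask the N(64.5, 2.5) curve directly. This is a sketch using Python's `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma, x = 64.5, 2.5, 67.0

# Route 1: standardize, then look up the area under N(0,1) to the left of z.
z = (x - mu) / sigma
area_via_z = NormalDist(0, 1).cdf(z)

# Route 2: compute the area under N(64.5, 2.5) to the left of x directly.
area_direct = NormalDist(mu, sigma).cdf(x)

print(z, round(area_via_z, 4), round(area_direct, 4))
print(round(1 - area_via_z, 4))  # proportion of women taller than 67 inches
```

Both routes give the same area, which is the point of standardizing: one table serves every Normal distribution.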
The National Collegiate Athletic Association (NCAA) requires Division I athletes to
score at least 820 on the combined math and verbal SAT exam to compete in their
first college year. The SAT scores of 2003 were approximately normal with mean
1026 and standard deviation 209.
What proportion of all students would be NCAA qualifiers (SAT ≥ 820)?
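x = 820, µ = 1026, σ = 209
$$z = \frac{x - \mu}{\sigma} = \frac{820 - 1026}{209} = \frac{-206}{209} \approx -0.99$$
Table A: the area under N(0,1) to the left of z = -0.99 is 0.1611, or approx. 16%.
The area to the right is 1 - 0.1611 = 0.8389, so ≈ 84% of all students would be NCAA qualifiers.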
Tips on using Table A
Because the Normal distribution is symmetrical, there are 2 ways that you can calculate the area under the standard Normal curve to the right of a z value:
area right of z = area left of -z
area right of z = 1 - area left of z
[Figure: for z = -2.33, the area to the left is 0.0099 and the area to the right is 0.9901.]
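Both ways of getting a right-hand area can be checked side by side. The sketch below uses `NormalDist(0, 1).cdf` as a stand-in for Table A (the name `phi` is an assumption), first for z = -2.33 and then for the NCAA qualifier question with z rounded to two decimals as in the table lookup.

```python
from statistics import NormalDist

phi = NormalDist(0, 1).cdf   # area under N(0,1) to the left of z, like Table A

z = -2.33
right_by_symmetry = phi(-z)         # area right of z = area left of -z
right_by_subtraction = 1 - phi(z)   # area right of z = 1 - area left of z
print(round(right_by_symmetry, 4), round(right_by_subtraction, 4))

# NCAA qualifiers: SAT >= 820 with mean 1026 and standard deviation 209.
ncaa_z = round((820 - 1026) / 209, 2)
qualifiers = 1 - phi(ncaa_z)
print(ncaa_z, round(qualifiers, 4))
```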
Tips on using Table A
To calculate the area between 2 z-values, first get the area under N(0,1) to the left for each z-value from Table A.
Then subtract the smaller area from the larger area.
area between z1 and z2 = area left of z1 – area left of z2 (where z1 is the larger z-value)
A common mistake made by students is to subtract the two z-values instead of the two areas. But the Normal curve is not uniform, so a difference in z is not an area.
The area under N(0,1) for a single value of z is zero.
(Try calculating the area to the left of z minus that same area!)
The NCAA defines a “partial qualifier” eligible to practice and receive an athletic
scholarship, but not to compete, with a combined SAT score of at least 720.
What proportion of all students who take the SAT would be partial qualifiers?
That is, what proportion have scores between 720 and 820?
x = 720, µ = 1026, σ = 209
$$z = \frac{x - \mu}{\sigma} = \frac{720 - 1026}{209} = \frac{-306}{209} \approx -1.46$$
Table A: the area under N(0,1) to the left of z = -1.46 is 0.0721, or approx. 7%.
area between 720 and 820 = area left of 820 - area left of 720 = 0.1611 - 0.0721 = 0.0890 ≈ 9%
About 9% of all students who take the SAT have scores between 720 and 820.
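The partial-qualifier calculation can be sketched as a difference of two left-hand areas. The z-values are rounded to two decimals to mimic a Table A lookup; `phi` and the other names are assumptions for illustration.

```python
from statistics import NormalDist

phi = NormalDist(0, 1).cdf   # area under N(0,1) to the left of z, like Table A
mu, sigma = 1026, 209

z_low = round((720 - mu) / sigma, 2)    # z for a score of 720
z_high = round((820 - mu) / sigma, 2)   # z for a score of 820

# Correct: subtract the smaller left-hand area from the larger one.
between = phi(z_high) - phi(z_low)

# The common mistake: subtracting the z-values themselves, which is not an area.
wrong = z_high - z_low

print(z_low, z_high, round(between, 4), round(wrong, 2))
```

The two results differ widely, which illustrates why areas, not z-values, must be subtracted.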
The cool thing about working with normally distributed data is that we can manipulate it, and then find answers to questions that involve comparing seemingly non-comparable distributions.
We do this by “standardizing” the data. All this involves is changing the scale so that the mean is now 0 and the standard deviation is 1. If you do this to different distributions, it makes them comparable.
$$z = \frac{x - \mu}{\sigma}$$
[Figure: standard Normal curve N(0,1).]
Ex. Gestation time in malnourished mothers
What is the effect of better maternal care on gestation time and preemies?
The goal is to obtain pregnancies 240 days (8 months) or longer.
What improvement did we get by adding better food?
[Figure: two Normal curves of gestation time (days) on an axis from 180 to 320: vitamins only (µ = 250, σ = 20) and vitamins and better food (µ = 266, σ = 15).]
Under each treatment, what percent of mothers failed to carry their babies at
least 240 days?
Vitamins only
x = 240, µ = 250, σ = 20
$$z = \frac{x - \mu}{\sigma} = \frac{240 - 250}{20} = \frac{-10}{20} = -0.5$$
(half a standard deviation)
Table A: the area under N(0,1) to the left of z = -0.5 is 0.3085.
Vitamins only: 30.85% of women would be expected to have gestation times shorter than 240 days.
[Figure: Normal curve of gestation time (days), axis marked at 190, 210, 230, 250, 270, 290, 310.]
Vitamins and better food
x = 240, µ = 266, σ = 15
$$z = \frac{x - \mu}{\sigma} = \frac{240 - 266}{15} = \frac{-26}{15} \approx -1.73$$
(almost 2 sd from mean)
Table A: the area under N(0,1) to the left of z = -1.73 is 0.0418.
Vitamins and better food: 4.18% of women would be expected to have gestation times shorter than 240 days.
[Figure: Normal curve of gestation time (days), axis marked at 221, 236, 251, 266, 281, 296, 311.]
Compared to vitamin supplements alone, vitamins and better food resulted in a much
smaller percentage of women with pregnancy terms below 8 months (4% vs. 31%).
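The comparison of the two treatments can be sketched in a few lines. The dictionary layout and names are assumptions for illustration; the means and standard deviations are the ones given on the slides.

```python
from statistics import NormalDist

treatments = {
    "Vitamins only": NormalDist(250, 20),
    "Vitamins and better food": NormalDist(266, 15),
}

threshold = 240  # days; the goal is pregnancies of 240 days or longer
short = {}       # proportion of pregnancies shorter than the threshold
for name, dist in treatments.items():
    z = (threshold - dist.mean) / dist.stdev
    short[name] = dist.cdf(threshold)
    print(f"{name}: z = {z:.2f}, below 240 days = {100 * short[name]:.1f}%")
```

Because the code does not round z to two decimals before looking up the area, the second percentage comes out slightly below the Table A value of 4.18%; the conclusion (roughly 4% versus 31%) is unchanged.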
Inverse Normal Distribution Calculations
These calculations deal with computing percentiles of the normal distribution.
Examples –
How tall should an American man be to fall in the lower 25% of the men’s height distribution?
A university admits students who place in the top 20% of the SAT score distribution. How high an SAT score must a college candidate have to be eligible for admittance to this university?
Inverse Normal Calculations Using TI-83/84
Press 2nd VARS/DISTR
Finding Percentiles Using the TI-83/84
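Percentile (x) = invNorm(p, µ, σ)
[Figure: Normal curve for X with standard deviation σ, axis marked at µ−3σ, µ−2σ, µ−1σ, µ, µ+1σ, µ+2σ, µ+3σ; p is the area to the left of x.]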
25th Percentile of Men’s Heights
invNorm(p, µ, σ) = invNorm(0.25, 70, 3) = 67.9765” ≈ 5’ 8”
25th percentile = 5’8”: 25% of men are shorter than 5’ 8”.
[Figure: Normal curve with σ = 3”, axis marked at 5’1”, 5’4”, 5’7”, 5’10”, 6’1”, 6’4”, 6’7”; the lower 25% lies to the left of 5’8”.]
Top 20% of the SAT distribution
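invNorm(p, µ, σ) = invNorm(0.80, 1200, 210) = 1376.740 ≈ 1377
80th percentile = 1377: the top 20% of SAT scores are above 1377.
[Figure: Normal curve with σ = 210, axis marked at 570, 780, 990, 1200, 1410, 1620, 1830; 80% of the area lies below the cutoff and 20% above.]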
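Both percentile examples can be cross-checked in Python. The function `inv_norm` below is a stand-in written for this sketch around `statistics.NormalDist.inv_cdf`; like the calculator call, it takes p as the area to the left of the percentile.

```python
from statistics import NormalDist

def inv_norm(p, mu, sigma):
    """Value x such that a proportion p of N(mu, sigma) lies below x."""
    return NormalDist(mu, sigma).inv_cdf(p)

height_25th = inv_norm(0.25, 70, 3)     # 25th percentile of men's heights, in inches
sat_80th = inv_norm(0.80, 1200, 210)    # cutoff for the top 20% of SAT scores

print(round(height_25th, 4), round(sat_80th, 2))
```

Note that the top 20% is entered as p = 0.80, because the input is always the area below the cutoff.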
Inverse normal calculations using Normal tables
To find the range of values that corresponds to a given proportion/area under the curve:
1. Find the desired area/proportion in the body of the table.
2. Read the corresponding z-value from the left column and top row.
3. To find the percentile (x), use the formula x = µ + (σ*z).
Example: The z-value that has an area of 1.25% (0.0125) to its left is -2.24.
Vitamins and better food
How long are the longest 75% of pregnancies when mothers with malnutrition are
given vitamins and better food?
µ = 266, σ = 15, upper area = 75%, lower area = 25%, x = ?
Remember that Table A gives the area to the left of z. Thus, we need to search for the lower 25% in Table A in order to get z.
Table A: the z-value for the lower area 25% under N(0,1) is about -0.67.
$$z = \frac{x - \mu}{\sigma} \iff x = \mu + (z \cdot \sigma)$$
x = 266 + (−0.67 * 15)
x = 255.95 ≈ 256
[Figure: Normal curve of gestation time (days), axis marked at 221, 236, 251, 266, 281, 296, 311; the upper 75% lies to the right of x.]
The 75% longest pregnancies in this group are about 256 days or longer.
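The table-based steps can be mirrored in code: convert the upper area to a lower area, find z, then un-standardize with x = µ + zσ. This sketch uses `inv_cdf` on N(0,1) in place of the reverse Table A lookup; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 266, 15
upper_area = 0.75

# Table A works with areas to the left, so switch to the lower area first.
lower_area = 1 - upper_area

# z-value with 25% of the standard Normal curve to its left (about -0.67).
z = NormalDist(0, 1).inv_cdf(lower_area)

# Un-standardize: x = mu + z * sigma.
x = mu + z * sigma
print(round(z, 4), round(x, 2))
```

Without rounding z to -0.67, x comes out marginally below the slide's 255.95, and it still rounds to 256 days.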
Five-Number Summary & Boxplots for Normal Distributions
Q1 = 25th percentile
Med = Mean
Q3 = 75th percentile
Min = Q1 – 1.5*IQR
Max = Q3 + 1.5*IQR
                    Min   Q1    Med   Q3    Max
µ = 250, σ = 20     196   237   250   263   304
µ = 266, σ = 15     226   256   266   276   306
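The table values can be reproduced from the definitions on the slide (Q1 and Q3 as the 25th and 75th percentiles, Med = Mean, and Min/Max at 1.5*IQR beyond the quartiles). The function name `normal_five_number` is an assumption for this sketch.

```python
from statistics import NormalDist

def normal_five_number(mu, sigma):
    """(Min, Q1, Med, Q3, Max) for N(mu, sigma), with Min/Max at 1.5*IQR past the quartiles."""
    dist = NormalDist(mu, sigma)
    q1, q3 = dist.inv_cdf(0.25), dist.inv_cdf(0.75)
    iqr = q3 - q1
    return (q1 - 1.5 * iqr, q1, mu, q3, q3 + 1.5 * iqr)

summaries = {}
for mu, sigma in ((250, 20), (266, 15)):
    summaries[(mu, sigma)] = [round(v) for v in normal_five_number(mu, sigma)]
    print(mu, sigma, summaries[(mu, sigma)])
```

Here Min and Max are the boxplot fences rather than true extremes, since a Normal curve has no smallest or largest value.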
Normal Calculations Using Excel
NORMDIST(x, µ, σ, 1)
Normal Calculations Using Excel
NORMSDIST(z)
Inverse Normal Calculations Using Excel
NORMINV(p, µ, σ)
Inverse Standard Normal Calculations Using Excel
NORMSINV(p)
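For readers without Excel, the four spreadsheet functions have close counterparts in Python's `statistics.NormalDist`. The lowercase wrapper names below are hypothetical, chosen only to mirror the Excel names; they are a sketch, not Excel's implementation.

```python
from statistics import NormalDist

def normdist(x, mu, sigma):
    """Area to the left of x under N(mu, sigma), like NORMDIST(x, µ, σ, 1)."""
    return NormalDist(mu, sigma).cdf(x)

def normsdist(z):
    """Area to the left of z under N(0,1), like NORMSDIST(z)."""
    return NormalDist(0, 1).cdf(z)

def norminv(p, mu, sigma):
    """Value with area p to its left under N(mu, sigma), like NORMINV(p, µ, σ)."""
    return NormalDist(mu, sigma).inv_cdf(p)

def normsinv(p):
    """z-value with area p to its left under N(0,1), like NORMSINV(p)."""
    return NormalDist(0, 1).inv_cdf(p)

print(round(normdist(67, 64.5, 2.5), 4), round(normsdist(1.0), 4))
print(round(norminv(0.25, 70, 3), 4), round(normsinv(0.25), 4))
```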
Lesson Summary
Key Concepts
Density curves & Properties
Mean & Median points
Mean & SD (population versus sample mean & standard deviation)
Normal density curves
68-95-99.7 Rule
Z-scores
Standard Normal distribution
Normal quantile plot
Skills Learned
Computing z-scores
Normal distribution calculations
Computing proportions by finding areas under a normal curve
Computing normal distribution percentiles
Heights of Fortune 500 CEOs
A survey of Fortune 500 CEO height in 2005 revealed that they were
on average 6 ft 0 in (1.83 m) tall, which is approximately 2–3 inches
(5.1–7.6 cm) taller than the average American man. 30% were
6 ft 2 in (1.88 m) tall or more; in comparison only 3.9% of the overall
United States population is of this height.[11] Similar surveys have
uncovered that less than 3% of CEOs were below 5 ft 7 in (1.70 m)
or taller than 6 ft 2 in (1.88 m) in height. Ninety percent of CEOs are
of above average height.[12]
Dating and marriage
Heightism is also a factor in dating preferences. For some people,
height is the major factor in sexual attractiveness.
The greater reproductive success of taller men is attested to by studies
indicating that taller men are more likely to be married and to have
more children, except in societies with severe gender imbalances
caused by war.[17][18] Quantitative studies of woman-for-men personal
advertisements have shown strong preference for tall men, with a large
percentage indicating that a man significantly below average height was
unacceptable.[19]
Conversely, studies have shown that women of below average height
are more likely to be married and have children than women of above
average height. Some reasons which have been suggested for this
situation include earlier fertility of shorter women, and that a shorter
woman makes her mate feel taller in comparison and therefore more
masculine.[20]
It is unclear and debated as to the extent to which such preferences are
innate or are the function of a society in which height discrimination
impacts on socio-economic status. Certainly, much is always made in
newspapers and magazines of celebrity couples with a notable height
difference, especially where a man is shorter than his wife (for example,
Jamie Cullum, 5 inches (13 cm) shorter at 5 ft 6 in (1.68 m) than Sophie
Dahl, though the difference is often exaggerated).