8/26/2011

Looking at Data - Distributions
Density Curves and Normal Distributions
IPS Chapter 1.3
© 2009 W.H. Freeman and Company
Edited by Nikos Psomas

Objectives (IPS Chapter 1.3)
Density curves and Normal distributions
  Density curves
  Measuring center and spread for density curves
  Normal distributions
  The 68-95-99.7 rule
  Standardizing observations
  Using the standard Normal Table
  Inverse Normal calculations
  Normal quantile plots (Skip)

Density curves
A density curve is a mathematical model of a distribution. The total area under the curve, by definition, is equal to 1, or 100%. The area under the curve for a range of values is the proportion of all observations that fall in that range.
[Figure: histogram of a sample with the smoothed density curve that describes the population theoretically.]
Density curves come in any imaginable shape. Some are well known mathematically and others aren't.

Median and mean of a density curve
The median of a density curve is the equal-areas point: the point that divides the area under the curve in half.
The mean of a density curve is the balance point, at which the curve would balance if it were made of solid material.
The median and mean are the same for a symmetric density curve. The mean of a skewed curve is pulled in the direction of the long tail.

Normal distributions
Normal (or Gaussian) distributions are a family of symmetrical, bell-shaped density curves defined by a mean µ (mu) and a standard deviation σ (sigma): N(µ, σ).

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$$

e = 2.71828… (the base of the natural logarithm)
π = pi = 3.14159…

A family of density curves
Here, means are the same (µ = 15) while standard deviations are different (σ = 2, 4, and 6).
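As a rough illustration of the density formula for N(µ, σ), the sketch below evaluates it for the three curves with µ = 15 and σ = 2, 4, and 6, and checks numerically that the total area under each curve is close to 1. The function names `normal_pdf` and `area_under_curve` are made up for this example, and the midpoint rule is only one simple way to approximate the area.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the N(mu, sigma) density curve at x (sigma = standard deviation)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def area_under_curve(mu, sigma, lo, hi, steps=10_000):
    """Approximate the area under the density on [lo, hi] with the midpoint rule."""
    width = (hi - lo) / steps
    heights = (normal_pdf(lo + (i + 0.5) * width, mu, sigma) for i in range(steps))
    return sum(heights) * width

# Same mean, different standard deviations: a larger sigma gives a lower, wider curve.
for sigma in (2, 4, 6):
    peak = normal_pdf(15, 15, sigma)
    area = area_under_curve(15, sigma, 15 - 8 * sigma, 15 + 8 * sigma)
    print(f"sigma={sigma}: peak height={peak:.4f}, total area={area:.4f}")
```

The peak height shrinks as σ grows because the same total area of 1 is spread over a wider range of values.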
[Figure: three Normal curves with µ = 15 and σ = 2, 4, and 6; horizontal axis from 0 to 30.]

Here, means are different (µ = 10, 15, and 20) while standard deviations are the same (σ = 3).
[Figure: three Normal curves with σ = 3 and µ = 10, 15, and 20; horizontal axis from 0 to 30.]

The 68-95-99.7% Rule for Normal Distributions
About 68% of all observations are within 1 standard deviation (σ) of the mean (µ).
About 95% of all observations are within 2σ of the mean µ.
Almost all (99.7%) observations are within 3σ of the mean.
[Figure: Normal curve N(µ, σ) = N(64.5, 2.5), with mean µ = 64.5 and standard deviation σ = 2.5; the inflection points lie 1σ from the mean.]

Normal Distribution Percentages

Interval            Percent of observations
below µ−3σ          .14%
µ−3σ to µ−2σ        2.1%
µ−2σ to µ−1σ        13.6%
µ−1σ to µ           34%
µ to µ+1σ           34%
µ+1σ to µ+2σ        13.6%
µ+2σ to µ+3σ        2.1%
above µ+3σ          .14%

Using the 68-95-99.7% Rule – Example
Heights of American men follow a normal distribution with mean 5 ft 10 in and standard deviation 3 in (σ = 3").
What % of American men are taller than 5 ft 7 in but shorter than 6 ft 4 in? That is the interval from µ−1σ to µ+2σ: 34% + 34% + 13.6% = 81.6%.
What % of American men are shorter than 6 ft 1 in? That is everything below µ+1σ: .14% + 2.1% + 13.6% + 34% + 34% = 83.84%.
[Figure: the same percentages on a height axis marked 5'1", 5'4", 5'7", 5'10", 6'1", 6'4", 6'7".]

Height Distribution of American Men
There are about 100,000,000 adult men in America. The table shows the expected number of American men in various height ranges.

Height range      Expected number of men
4'7" - 4'10"      3,200
4'10" - 5'1"      135,000
5'1" - 5'4"       2,100,000
5'4" - 5'7"       13,600,000
5'7" - 5'10"      34,000,000
5'10" - 6'1"      34,000,000
6'1" - 6'4"       13,600,000
6'4" - 6'7"       2,100,000
6'7" - 6'10"      135,000
6'10" - 7'1"      3,200
7'1" - 7'4"       28
7'4" - 7'7"       (not shown)

Some famous tall guys!

S.D. above average   Players                                                                   US population this tall
3σ                   Michael Jordan 6'6", Kobe Bryant 6'7"                                     130,000
4σ                   Larry Bird 6'9", Karl Malone 6'9"                                         3,200
5σ                   Shaquille O'Neal 7'1", Wilt Chamberlain 7'1", Kareem Abdul-Jabbar 7'2"    28
6σ                   Yao Ming 7'5"                                                             2 in the world

Normal Distribution Calculations
The 68-95-99.7% rule gives a good way to compute normal distribution percentages for intervals whose end points are an integer multiple of σ away from the mean µ.
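The percentages in the 68-95-99.7% rule can be checked numerically. The sketch below relies on a standard identity that the slides do not state: for a Normal distribution, the proportion of observations within k standard deviations of the mean equals erf(k/√2), where erf is the error function from Python's `math` module. The helper name `within_k_sigma` is chosen only for this example.

```python
import math

def within_k_sigma(k):
    """Proportion of a Normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma of the mean: {100 * within_k_sigma(k):.2f}%")
```

The result does not depend on µ or σ, which is why a single rule covers every Normal distribution.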
What about intervals with end points that are not an integer multiple of σ away from the mean µ?
What % of American men are shorter than 5 ft 5 in tall?
[Figure: Normal curve of men's heights (σ = 3"), axis from 5'1" to 6'7", with the area to the left of 5'5" marked "?".]

Normal Distribution Percentages
Normal distribution percentages for any interval under the normal curve can be computed using software, calculators, or tables.
A = % of men shorter than 5'8" tall
B = % of men taller than 5'8" but shorter than 6'3"
C = % of men taller than 6'3"
[Figure: Normal curve of men's heights (σ = 3"), axis from 5'1" to 6'7", split at 5'8" and 6'3" into regions A, B, and C.]

Normal Calculations Using TI-83/84
Press 2nd VARS/DISTR.
normalcdf(a, b, µ, σ) = fraction (proportion, or %) of population values that are larger than a but smaller than b, that is, P[a ≤ X ≤ b].
[Figure: Normal curve with mean µ and standard deviation σ; the area between a and b is shaded.]

% of men taller than 5'8" but shorter than 6'3"
normalcdf(a, b, µ, σ) = normalcdf(68, 75, 70, 3) = .69972 or 69.97%
This is also the probability that a randomly selected man will be taller than 5'8" but shorter than 6'3".
[Figure: area between a = 5'8" and b = 6'3" shaded; σ = 3".]

% of men shorter than 5'8"
normalcdf(a, b, µ, σ) = normalcdf(0*, 68, 70, 3) = .25249 or 25.25%
Note*: Use a number for a that's 5σ or more below the mean µ.
[Figure: area to the left of b = 5'8" shaded; σ = 3".]

% of men taller than 6'3"
normalcdf(a, b, µ, σ) = normalcdf(75, 100*, 70, 3) = .04779 or 4.78%
Note*: Use a number for b that's 5σ or more above the mean µ.
[Figure: area to the right of a = 6'3" shaded; σ = 3".]

Normal Calculations Using STANDARD NORMAL TABLES

The standard Normal distribution
Because all Normal distributions share the same properties, we can standardize data to transform any Normal curve N(µ,σ) into the standard Normal curve N(0,1).
[Figure: the N(64.5, 2.5) curve over x is mapped to the N(0,1) curve over z, the standardized height (no units).]
For each x we calculate a new value, z (called a z-score).
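The TI-83/84 normalcdf(a, b, µ, σ) calculations above can be approximated in software. The sketch below is not the calculator's own algorithm; it assumes the usual expression of the standard Normal cumulative area through the error function, and the Python names `phi` and `normalcdf` are chosen here only to mirror the calculator command.

```python
import math

def phi(z):
    """Area under the standard Normal curve N(0,1) to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def normalcdf(a, b, mu, sigma):
    """Proportion of N(mu, sigma) values between a and b."""
    return phi((b - mu) / sigma) - phi((a - mu) / sigma)

# Men's heights in inches: mu = 70 (5'10"), sigma = 3.
between = normalcdf(68, 75, 70, 3)   # taller than 5'8" but shorter than 6'3"
shorter = normalcdf(0, 68, 70, 3)    # lower bound far below the mean, as in the slide note
taller = normalcdf(75, 100, 70, 3)   # upper bound far above the mean, as in the slide note

print(f"between 5'8\" and 6'3\": {between:.5f}")  # slides report .69972
print(f"shorter than 5'8\":     {shorter:.5f}")   # slides report .25249
print(f"taller than 6'3\":      {taller:.5f}")    # slides report .04779
```

Because the three intervals cover essentially the whole curve, the three proportions should add up to very nearly 1.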
Standardizing: calculating z-scores
A z-score measures the number of standard deviations that a data value x is from the mean µ.

$$z = \frac{x-\mu}{\sigma}$$

When x is 1 standard deviation larger than the mean, then z = 1:

$$\text{for } x = \mu + \sigma, \quad z = \frac{\mu + \sigma - \mu}{\sigma} = \frac{\sigma}{\sigma} = 1$$

When x is 2 standard deviations larger than the mean, then z = 2:

$$\text{for } x = \mu + 2\sigma, \quad z = \frac{\mu + 2\sigma - \mu}{\sigma} = \frac{2\sigma}{\sigma} = 2$$

When x is larger than the mean, z is positive. When x is smaller than the mean, z is negative.

Normal Distribution Tables
Example: If z = -2.57, the area under the curve below -2.57 = .0051 or 0.51%.
Example: If z = 1.02, the area under the curve below 1.02 = .8461 or 84.61%.

Ex. Women's heights
Women's heights follow the N(µ, σ) = N(64.5", 2.5") distribution. What percent of women are shorter than 67 inches tall (that's 5'7")?
mean µ = 64.5"
standard deviation σ = 2.5"
x (height) = 67"
We calculate z, the standardized value of x:

$$z = \frac{x-\mu}{\sigma} = \frac{67 - 64.5}{2.5} = \frac{2.5}{2.5} = 1$$

so x is 1 standard deviation from the mean.
[Figure: N(64.5, 2.5) curve with the unknown area to the left of x = 67" (z = 1) shaded; µ = 64.5" corresponds to z = 0.]

Percent of women shorter than 67"
For z = 1, the area under the standard Normal curve to the left of z is 0.8413.
Conclusion: 84.13% of women are shorter than 67". By subtraction, 1 - 0.8413 = 0.1587, or 15.87% of women are taller than 67".
[Figure: N(64.5", 2.5") curve split at x = 67" (z = 1); area ≈ 0.84 to the left and area ≈ 0.16 to the right.]

The National Collegiate Athletic Association (NCAA) requires Division I athletes to score at least 820 on the combined math and verbal SAT exam to compete in their first college year. The SAT scores of 2003 were approximately normal with mean 1026 and standard deviation 209. What proportion of all students would be NCAA qualifiers (SAT ≥ 820)?
x = 820
µ = 1026
σ = 209

$$z = \frac{x-\mu}{\sigma} = \frac{820 - 1026}{209} = \frac{-206}{209} \approx -0.99$$

Table A: area under N(0,1) to the left of z = -.99 is 0.1611 or approx. 16%.
The area to the right of z = -.99 is 1 - 0.1611 ≈ 84%, so about 84% of all students would be NCAA qualifiers.

Tips on using Table A
Because the Normal distribution is symmetrical, there are 2 ways that you can calculate the area under the standard Normal curve to the right of a z value:
area right of z = area left of -z
area right of z = 1 - area left of z
[Figure: standard Normal curve split at z = -2.33; area = 0.0099 to the left and area = 0.9901 to the right.]

Tips on using Table A
To calculate the area between 2 z-values, first get the area under N(0,1) to the left of each z-value from Table A. Then subtract the smaller area from the larger area:
area between z1 and z2 = area left of z1 - area left of z2 (where z1 is the larger z-value)
A common mistake made by students is to subtract the two z-values instead of the two areas. That does not work, because the Normal curve is not uniform.
The area under N(0,1) for a single value of z is zero. (Try calculating the area to the left of z minus that same area!)

The NCAA defines a "partial qualifier", eligible to practice and receive an athletic scholarship but not to compete, as a student with a combined SAT score of at least 720. What proportion of all students who take the SAT would be partial qualifiers? That is, what proportion have scores between 720 and 820?
x = 720
µ = 1026
σ = 209

$$z = \frac{x-\mu}{\sigma} = \frac{720 - 1026}{209} = \frac{-306}{209} \approx -1.46$$

Table A: area under N(0,1) to the left of z = -1.46 is 0.0721 or approx. 7%.
area between 720 and 820 = area left of 820 - area left of 720 = 0.1611 - 0.0721 = 0.0890 ≈ 9%
About 9% of all students who take the SAT have scores between 720 and 820.

The cool thing about working with normally distributed data is that we can manipulate it, and then find answers to questions that involve comparing seemingly non-comparable distributions. We do this by "standardizing" the data. All this involves is changing the scale so that the mean is now 0 and the standard deviation is 1. If you do this to different distributions, it makes them comparable.

$$z = \frac{x-\mu}{\sigma}, \quad \text{which follows } N(0,1)$$

Ex. Gestation time in malnourished mothers
What is the effect of better maternal care on gestation time and preemies? The goal is to obtain pregnancies 240 days (8 months) or longer.
What improvement did we get by adding better food?
[Figure: two Normal curves of gestation time (days), axis from 180 to 320. Vitamins only: µ = 250, σ = 20. Vitamins and better food: µ = 266, σ = 15.]

Under each treatment, what percent of mothers failed to carry their babies at least 240 days?

Vitamins only
µ = 250, σ = 20, x = 240

$$z = \frac{x-\mu}{\sigma} = \frac{240 - 250}{20} = \frac{-10}{20} = -0.5$$

(half a standard deviation below the mean)
Table A: area under N(0,1) to the left of z = -0.5 is 0.3085.
Vitamins only: 30.85% of women would be expected to have gestation times shorter than 240 days.
[Figure: N(250, 20) curve of gestation time (days), axis from 190 to 310, with the area to the left of 240 shaded.]

Vitamins and better food
µ = 266, σ = 15, x = 240

$$z = \frac{x-\mu}{\sigma} = \frac{240 - 266}{15} = \frac{-26}{15} \approx -1.73$$

(almost 2 standard deviations below the mean)
Table A: area under N(0,1) to the left of z = -1.73 is 0.0418.
Vitamins and better food: 4.18% of women would be expected to have gestation times shorter than 240 days.
[Figure: N(266, 15) curve of gestation time (days), axis from 221 to 311, with the area to the left of 240 shaded.]

Compared to vitamin supplements alone, vitamins and better food resulted in a much smaller percentage of women with pregnancy terms below 8 months (4% vs. 31%).

Inverse Normal Distribution Calculations
Inverse Normal distribution calculations deal with computing percentiles of the normal distribution.
Examples:
Below what height does an American man fall in the lower 25% of the men's height distribution?
A university admits students who place in the top 20% of the SAT score distribution. How high an SAT score must a college candidate have to be eligible for admittance to this university?
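The two gestation-time percentages worked out earlier with Table A can be reproduced by standardizing x = 240 under each treatment and then finding the area to the left of the resulting z-score. This is a minimal sketch: the names `z_score`, `phi`, and `treatments` are made up for the example, and the error function stands in for the table lookup.

```python
import math

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

def phi(z):
    """Area under N(0,1) to the left of z (what Table A tabulates)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# (mu, sigma) of gestation time in days for each treatment, taken from the slides.
treatments = {
    "vitamins only": (250, 20),
    "vitamins and better food": (266, 15),
}

results = {}
for name, (mu, sigma) in treatments.items():
    z = z_score(240, mu, sigma)
    results[name] = phi(z)
    print(f"{name}: z = {z:.2f}, proportion below 240 days = {results[name]:.4f}")
```

Small differences from the Table A answers are expected, because the table is entered with z rounded to two decimal places (for example -1.73 rather than -26/15).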
Inverse Normal Calculations Using TI-83/84
Press 2nd VARS/DISTR.

Finding Percentiles Using the TI-83/84
Percentile (x) = invNorm(p, µ, σ), where p is the area under the curve to the left of x.
[Figure: Normal curve with axis from µ−3σ to µ+3σ; the area p to the left of x is shaded.]

25th Percentile of Men's Heights
invNorm(p, µ, σ) = invNorm(0.25, 70, 3) = 67.9765" ≈ 5'8"
25th percentile = 5'8": 25% of men are shorter than 5'8".
[Figure: Normal curve of men's heights (σ = 3"), axis from 5'1" to 6'7", with the lowest 25% shaded.]

Top 20% of the SAT distribution
invNorm(p, µ, σ) = invNorm(0.80, 1200, 210) = 1376.740 ≈ 1377
80th percentile = 1377: the top 20% of SAT scores lie above it.
[Figure: Normal curve of SAT scores (σ = 210), axis from 570 to 1830, split at 1377 into the lower 80% and the top 20%.]

Inverse normal calculations using Normal tables
To find the value x that corresponds to a given proportion/area under the curve:
1. Find the desired area/proportion in the body of the table.
2. Read the corresponding z-value from the left column and top row.
3. To find the percentile (x), use the formula x = µ + (σ*z).
Example: The z value that has an area of 1.25% (0.0125) to its left is -2.24.

Vitamins and better food
How long are the longest 75% of pregnancies when mothers with malnutrition are given vitamins and better food?
µ = 266
σ = 15
upper area = 75%
lower area = 25%
x = ?
Remember that Table A gives the area to the left of z. Thus, we need to search for the lower 25% in Table A in order to get z.
Table A: the z value for the lower area 25% under N(0,1) is about -0.67.

$$z = \frac{x-\mu}{\sigma} \iff x = \mu + z\sigma$$

x = 266 + (-0.67 * 15) = 255.95 ≈ 256
[Figure: N(266, 15) curve of gestation time (days), axis from 221 to 311, with the upper 75% shaded.]
The 75% longest pregnancies in this group are about 256 days or longer.
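The inverse Normal calculations above can be sketched in Python with `statistics.NormalDist` from the standard library (available from Python 3.8), whose `inv_cdf(p)` method returns the value with area p to its left. Using it as a stand-in for invNorm is an assumption of this example, not something the slides prescribe, and the wrapper name `inv_norm` is hypothetical.

```python
from statistics import NormalDist

def inv_norm(p, mu, sigma):
    """Value x of N(mu, sigma) with area p to its left (the p-th quantile)."""
    return NormalDist(mu, sigma).inv_cdf(p)

height_25th = inv_norm(0.25, 70, 3)       # slides: invNorm(0.25, 70, 3) = 67.9765
sat_80th = inv_norm(0.80, 1200, 210)      # slides: invNorm(0.80, 1200, 210) = 1376.740
gestation_25th = inv_norm(0.25, 266, 15)  # slides, via Table A with z = -0.67: about 256

print(f"25th percentile of men's heights: {height_25th:.4f} inches")
print(f"80th percentile of SAT scores:    {sat_80th:.3f}")
print(f"25th percentile of gestation:     {gestation_25th:.2f} days")
```

The gestation result differs slightly from the Table A answer of 255.95 because the table lookup rounds z to -0.67.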
Five-Number Summary & Boxplots for Normal Distributions
Q1 = 25th percentile
Med = Mean
Q3 = 75th percentile
Min = Q1 - 1.5*IQR
Max = Q3 + 1.5*IQR

Distribution        Min   Q1    Med   Q3    Max
µ = 250, σ = 20     196   237   250   263   304
µ = 266, σ = 15     226   256   266   276   306

Normal Calculations Using Excel
NORMDIST(x, µ, σ, 1) (area to the left of x under N(µ, σ))
NORMSDIST(z) (area to the left of z under N(0,1))

Inverse Normal Calculations Using Excel
NORMINV(p, µ, σ) (value of N(µ, σ) with area p to its left)

Inverse Standard Normal Calculations Using Excel
NORMSINV(p) (z-value with area p to its left)

Lesson Summary
Key Concepts
  Density curves & properties
  Mean & median points
  Mean & SD (population versus sample mean & standard deviation)
  Normal density curves
  68-95-99.7 rule
  Z-scores
  Standard Normal distribution
  Normal quantile plot
Skills Learned
  Computing z-scores
  Normal distribution calculations
  Computing proportions by finding areas under a normal curve
  Computing normal distribution percentiles

Heights of Fortune 500 CEOs
A survey of Fortune 500 CEO height in 2005 revealed that they were on average 6 ft 0 in (1.83 m) tall, which is approximately 2–3 inches (5.1–7.6 cm) taller than the average American man. 30% were 6 ft 2 in (1.88 m) tall or more; in comparison only 3.9% of the overall United States population is of this height.[11] Similar surveys have uncovered that less than 3% of CEOs were below 5 ft 7 in (1.70 m) in height. Ninety percent of CEOs are of above average height.[12]

Dating and marriage
Heightism is also a factor in dating preferences. For some people, height is the major factor in sexual attractiveness.
The greater reproductive success of taller men is attested to by studies indicating that taller men are more likely to be married and to have more children, except in societies with severe gender imbalances caused by war.[17][18] Quantitative studies of woman-for-men personal advertisements have shown strong preference for tall men, with a large percentage indicating that a man significantly below average height was unacceptable.[19] Conversely, studies have shown that women of below average height are more likely to be married and have children than women of above average height. Some reasons which have been suggested for this situation include earlier fertility of shorter women, and that a shorter woman makes her mate feel taller in comparison and therefore more masculine.[20] It is unclear and debated as to the extent to which such preferences are innate or are the function of a society in which height discrimination impacts on socio-economic status. Certainly, much is always made in newspapers and magazines of celebrity couples with a notable height difference, especially where a man is shorter than his wife (for example, Jamie Cullum, 5 inches (13 cm) shorter at 5 ft 6 in (1.68 m) than Sophie Dahl, though the difference is often exaggerated).