Review Exercises
Normal Approximation to Data
Chapter 5, FPP, p. 93-96
Dr. McGahagan
Problem 1. Test scores and the normal approximation.
Given: Mean = 50, SD = 10
A 1.25 SD interval around the mean = 50 +/- 12.5 = 37.5 to 62.5
Using the normal tables from A-1, look up a standard score of 1.25. The
corresponding area from - 1.25 to + 1.25 is 78.87 percent.
A sorted list should make the counting easier:
Data = (29 36 37 37
39 41 44 47 47 48 49 49 50 50 52 52 53 54 56 58 59 62
64 65 72)
Eighteen of the 25 values are within the bounds: 18 / 25 = 0.72, or 72 percent of the data, somewhat less than the
78.87 percent expected from the normal curve.
Use EcLS by first defining the data: (bind data (list 29 ... 72)) as above.
(stats data) will confirm the text assertions about the mean and SD.
(density-plot data) shows that the data is roughly normal, so the normal tables will give a fair idea of what
percentages to expect.
(normal-area -1.25 1.25) was the command used to generate the graph on the right.
(anorm 1.25) is a computer shortcut to getting the table value without the graph.
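For readers without EcLS, the same check can be sketched in Python. This is only an illustration using the standard library, not the handout's own tool; `pstdev` is used on the assumption that the SD here divides by N, as in the text.

```python
from statistics import NormalDist, mean, pstdev

data = [29, 36, 37, 37, 39, 41, 44, 47, 47, 48, 49, 49, 50, 50,
        52, 52, 53, 54, 56, 58, 59, 62, 64, 65, 72]

avg = mean(data)      # average of the list
sd = pstdev(data)     # population SD (divides by N)

# Interval of 1.25 SDs on either side of the mean
lo, hi = avg - 1.25 * sd, avg + 1.25 * sd

# Count the values inside the interval and convert to a percentage
inside = sum(lo <= x <= hi for x in data)
observed_pct = 100 * inside / len(data)

# Area under the standard normal curve between -1.25 and +1.25
z = NormalDist()
expected_pct = 100 * (z.cdf(1.25) - z.cdf(-1.25))

print(f"mean = {avg}, SD = {sd}")
print(f"observed: {inside} of {len(data)} values, {observed_pct:.0f} percent")
print(f"normal curve: {expected_pct:.2f} percent")
```

The observed percentage comes from counting; the expected percentage comes from the curve, which is the comparison the problem asks for.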
Problem 2. Computer printout of standardized test scores
First 10 entries: (-6.2 3.5 1.2 -0.13 4.3 -5.1 -7.2 -11.3 1.8 6.3)
Is the data surprising? Was the standardization procedure botched?
The fact that some numbers are negative is NOT surprising: we would subtract the mean of (say) 75
from all the scores before standardizing, so scores below 75 would have a negative standard score.
The fact that the sum of these scores is negative (- 12.83) is more surprising, but we have only a few
values out of 100 test scores, and the sample may not be representative.
But having so many numbers which are greater than 3 in absolute value (seven of the ten) is VERY surprising if the data
are even remotely normal. In a normal distribution, we would expect only about 3 observations out of a thousand to be
more than 3 SDs away from the mean in either direction, and would be amazed to find ANY more than 6 SDs away from
the mean. Hence, if the scores were normally distributed, it is almost certain that the computer program was in
error.
Longer explanation:
To find the percentage area under the normal curve less than a given number, we ask for the value of the
cumulative distribution function up to a desired point:
First set the computer to show 10 decimals (we'll need them in a bit):
> (normal-cdf -3) = .001349898. Take the reciprocal to see that about 1 in 741 observations from a normal
distribution will fall to the left of -3 standard units; another 1 of those 741 observations will fall to the right of +3.
> (normal-cdf -6) = 0.0000000010. Reciprocal = 1,013,594,692, so about one in a billion observations falls to the left of
a standardized value of -6; another one falls to the right of +6. We have FOUR observations in that category
(-6.2, -7.2, -11.3 and 6.3), which strains credibility.
> (/ 1.0 (normal-cdf -11)) = 5,233,794,723,805,674,230,000,000,000. One in that many observations (about one in
5 octillion) can be expected to fall 11 SDs below the mean.
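The tail areas above can be reproduced outside EcLS. The sketch below is a Python stand-in, not the handout's `normal-cdf` itself; the helper name `normal_cdf` is chosen here for illustration. It uses `math.erfc` because the more obvious `1 + erf(...)` form rounds to zero this far out in the tail.

```python
import math

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    # erfc keeps precision in the far tails, where 1 + erf(...) would round to 0
    return 0.5 * math.erfc(-z / math.sqrt(2))

for cutoff in (-3, -6, -11):
    p = normal_cdf(cutoff)
    # The reciprocal reads as "one observation in this many"
    print(f"area left of {cutoff}: {p:.4e}, about 1 in {1 / p:,.0f}")
```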
What if the data are not normal? The Russian mathematician Pafnuty Chebyshev showed that WHATEVER the
distribution of data, the chance of being further in either direction than k standard deviations from the mean is at most 1 / k^2; that is,
the chance of being more than 6 SDs away from the mean is at most 1/36 = 0.028 or about 3 percent; the chance of being more than 11
SDs away from the mean is at most 1/121 or about 0.8 percent. In a list of 100 scores, then, no more than 2 can be more than 6 SDs away
from the mean, and we already have FOUR such scores among the first 10 entries. So even if the data are not normal, I think that
the computer program has a problem.
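One further check does not rely on the normal curve at all. It assumes the scores were standardized with the mean and SD of the full list of 100, and that the SD divides by N; both are assumptions, not something the printout states. Under those assumptions the standardized list has mean 0 and SD 1, so the squares of all 100 standard scores must add up to exactly 100. A rough Python sketch:

```python
# First ten entries of the printout, as given in the problem
first_ten = [-6.2, 3.5, 1.2, -0.13, 4.3, -5.1, -7.2, -11.3, 1.8, 6.3]
n_scores = 100  # size of the full list (assumed standardized as a whole)

# If the list has mean 0 and SD 1, the squared scores sum to n_scores
budget = n_scores
used_by_first_ten = sum(z * z for z in first_ten)

print(f"sum of squares allowed for all {n_scores} scores: {budget}")
print(f"sum of squares of the first ten alone: {used_by_first_ten:.2f}")
print("consistent with SD = 1?", used_by_first_ten <= budget)
```

Squares cannot be negative, so the remaining 90 scores cannot bring the total back down.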
Problem 3. SAT scores -- Verbal
Part a. In 1967, verbal SAT scores were distributed normally with mean = 466 and SD 110.
To find the percentage of students scoring above 600,
(1) standardize the score:
Z = (600 - 466) / 110 = 134 / 110 = 1.2182, or approximately 1.20 (closest value given in the table)
(2) look up the area from -1.20 to +1.20 in table A-1. Area = 76.99
(3) Find the tail area above 1.20 = 11.50
Procedure: (100 - 76.99) = 23.01 gives the TWO tail area outside the center; divide by 2 to get the
one tail area, which is the percentage scoring above 600.
Part b. In 1994, verbal SAT scores had mean 423 and SD 110. The Z-score for a score of 600 is
Z = (600 - 423) / 110 = 1.6091; look up the area for 1.60 in the tables, and you find that
Central area = 89.04; two-tail area = 100 - 89.04 = 10.96; one tail area = 5.48 percent.
11.5 percent of SAT verbal scores were above 600 in 1967; only 5.5 percent were above 600 in 1994.
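The two tail areas can be checked without the printed table. The sketch below is an illustration in Python rather than the handout's procedure: it computes the tail once at the rounded Z used for the table lookup (1.2 and 1.6) and once at the unrounded Z, so the effect of rounding is visible.

```python
from statistics import NormalDist

curve = NormalDist()   # standard normal curve
cutoff, sd = 600, 110  # score of interest and SD, from the problem

results = {}
for year, mean in ((1967, 466), (1994, 423)):
    z = (cutoff - mean) / sd          # standardize the score
    z_table = round(z, 1)             # value looked up in the table
    tail_table = 100 * (1 - curve.cdf(z_table))
    tail_exact = 100 * (1 - curve.cdf(z))
    results[year] = (tail_table, tail_exact)
    print(f"{year}: Z = {z:.4f}; tail at Z = {z_table}: {tail_table:.2f} percent; "
          f"tail at unrounded Z: {tail_exact:.2f} percent")
```

Rounding Z down to the nearest table entry slightly overstates the tail, but not enough to change the comparison between the two years.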
ASSIGNED. Problem 4. SAT scores -- Math
In 1994, men's math SAT scores were distributed with mean of 500 and SD of 120.
Women's math SAT scores had mean 460 and SD of 120.
Find the percentage of (a) men and (b) women scoring above 660.
Problem 5. Fill-in referring to normal curve and histogram, heights of men (mean = 69, SD = 3).
Percent of men with heights between 66 and 72 inches is equal to the area between (a) -1 and
(b) +1 under the (c) normal curve. This percentage is approximately equal to the area between
(d) 66 inches and (e) 72 inches under the (f) histogram.
Problem 6. Is the curve normal ?
LSAT at one law school had mean score of 169 and SD of 9; the highest score was 178.
Was the curve normal? Note that ONLY ONE SD will get you up to the highest score;
with a normal curve, we would expect about 32 percent of the data to lie outside the 1 SD range, and half of that, or
about 16 percent of the scores, to be HIGHER than 178.
The data is more tightly packed about the mean than would be the case with a normal distribution;
the density of the histogram would be higher, and the distribution more pointed than a normal distribution.
We call this excess "pointiness" leptokurtosis. For the difference between lepto- and platy- kurtosis, you can
consult this sketch by W. E. Gosset (the "Student" in the Student's t distribution)
Source: "Student", "Errors of Routine Analysis", Biometrika, v.19, No. 1/2, July 1927, p. 160.
Note that we would expect leptokurtosis if students self-select by ability when applying to law schools.
Note also a slight error in the diagram: leptokurtic distributions typically have THICKER tails than the normal distribution,
and the kangaroo tails look thinner than they should.
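The share of a normal curve lying above the highest observed score in Problem 6 can be computed directly. This is a small Python sketch of that one calculation, using the mean, SD and top score given in the problem:

```python
from statistics import NormalDist

mean, sd, top_score = 169, 9, 178

# How many SDs above the mean is the highest score?
z = (top_score - mean) / sd

# Percent of a normal curve lying above that point
pct_above = 100 * (1 - NormalDist().cdf(z))

print(f"highest score is {z:.1f} SD above the mean")
print(f"a normal curve would put {pct_above:.2f} percent of scores above it")
```

Since no score is actually above 178, the observed percentage is zero, which is the mismatch the problem turns on.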
ASSIGNED. Problem 7. Finding percentiles.
Explain your procedure for this problem carefully.
Assume the math SAT for applicants to a school had a mean of 500 and SD of 100 and followed a normal curve.
Part a. A score of 350 was at the ____ percentile of the distribution.
Part b. To be at the 75th percentile of the distribution a student would need a score of ____
Question 8. True/False
a. True. Adding 7 to each number of a list adds 7 N to the sum of the N numbers; when we divide by N,
we will be left with an average that is larger by 7 than the original average.
b. False. Adding 7 to the list will make both each number and the mean larger by 7, so subtracting the
mean from each number leaves the deviation unchanged.
c. True. Doubling each entry on the list doubles the average; N is no larger, but the sum of numbers is
twice what it was.
d. True. Example: x = ( 4 11) Average = 7.5, SD = 3.5; y = ( 8 22) Average = 15; SD = 7.
When we calculate the variance (mean squared deviation), each term in the numerator has the form:
(square (X - Xbar)).
If we double all terms, they will be: (square (2X - 2Xbar)) = (square (2 (X - Xbar))) = 4 (square (X - Xbar))
Factor out the 4 from all terms, and we have found that the variance of the new series is 4 times that of the old.
But when we calculate the SD, we take the square root of the variance, so the SD of the new series is twice that of the old.
e. True. Changing the sign of each number in a list changes the sign of the average.
Example: x = (10 20), average is 15; y = (-10 -20), average is minus 15.
f. False. Changing the sign of each number in a list DOES NOT change the SD.
Changing the sign means multiplying each number by -1; when we calculate the SD, we will be
squaring the minus 1.
Example: (bind x (rnd 10)) = (34 3 34 83 59 79 82 84 50 4)
(mean x) = 51.2 (sd x) = 30.0227
(bind y (* -1 x)) = (-34 -3 -34 -83 -59 -79 -82 -84 -50 -4)
(mean y) = -51.2 (sd y) = 30.0227
The (rnd 10) command means "generate 10 random integers between 0 and 100"
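The claims in Question 8 can all be checked on the same list. The Python sketch below reuses the ten numbers from the EcLS example above; `pstdev` is used on the assumption that the SD divides by N, which matches the value 30.0227 reported there.

```python
from statistics import mean, pstdev

x = [34, 3, 34, 83, 59, 79, 82, 84, 50, 4]

# Three transformations of the same list
variants = {
    "original": x,
    "add 7": [v + 7 for v in x],
    "double": [2 * v for v in x],
    "change sign": [-v for v in x],
}

# Mean and population SD for each transformed list
summary = {name: (mean(vals), pstdev(vals)) for name, vals in variants.items()}

for name, (m, s) in summary.items():
    print(f"{name:12s} mean = {m:8.2f}   SD = {s:8.4f}")
```

Adding a constant shifts the mean and leaves the SD alone; multiplying by a constant scales the mean by that constant and the SD by its absolute value.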
Question 9. More True and False.
a. False. Mean and median are close only if the distribution is SYMMETRICAL.
Example: x = ( 1 2 3 4 90) Mean = 20, Median = 3.
b. False. Half of the list is not necessarily "below average"; in the previous example, 4 out of 5 numbers
were below the mean of 20.
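The counter-example for parts a and b is small enough to verify by hand, but a short Python sketch makes the two counts explicit:

```python
from statistics import mean, median

x = [1, 2, 3, 4, 90]

avg = mean(x)
mid = median(x)

# How many entries fall below the average?
below_average = sum(v < avg for v in x)

print(f"mean = {avg}, median = {mid}")
print(f"{below_average} of {len(x)} entries are below the mean")
```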
c. False. Histograms of a sample of data, however large, will follow the normal curve only if the
underlying distribution is normal.
Example: generate a sample of 50 non-normal variates by:
(bind x (rchisq 50 3)) [random draws from the chi-squared distribution with 3 degrees of
freedom; don't worry about the meaning of "degrees of freedom" yet]
(hist x)
Repeat, substituting (bind x (rchisq 5000 3)). Is the distribution more normal?
Of course, if the data generating function were rnorm rather than rchisq, the histogram would very likely look
more normal if we had 5000 observations rather than 50.
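The same experiment can be sketched in Python, with a skewness figure standing in for eyeballing the histogram. Two things here are assumptions rather than part of the handout: chi-squared draws are produced with `random.gammavariate(df / 2, 2)` (a chi-squared variable with df degrees of freedom is a gamma variable with shape df/2 and scale 2), and the helper names `chi_squared_sample` and `skewness` are made up for this illustration.

```python
import random
from statistics import mean, pstdev

def chi_squared_sample(n, df, rng):
    """Draw n chi-squared variates with df degrees of freedom."""
    # chi-squared(df) is a gamma distribution with shape df/2 and scale 2
    return [rng.gammavariate(df / 2, 2) for _ in range(n)]

def skewness(values):
    """Average cubed standard score; close to 0 for a symmetric histogram."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

rng = random.Random(0)  # fixed seed so the run is repeatable
small = chi_squared_sample(50, 3, rng)
large = chi_squared_sample(5000, 3, rng)

print(f"n = 50:   skewness = {skewness(small):.2f}")
print(f"n = 5000: skewness = {skewness(large):.2f}")
```

A larger sample gives a smoother picture of the underlying distribution, but the long right tail does not go away.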
d. False. One counter-example is sufficient to demonstrate falsity.
List A = ( 40 40 60 60) has mean = 50, SD = 10, and NO numbers between 40 and 60.
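List B = (45 55 (50 + x) (50 - x)) has mean = 50 for any x; choosing x = sqrt(175), about 13.23, gives SD = 10, with only half of the numbers between 40 and 60.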
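Both lists can be checked numerically. In the Python sketch below, List A is taken as given. For List B, the value x = sqrt(175) is an assumption made here: it is the value that gives the list an SD of 10, since (25 + 25 + 2x^2) / 4 = 100 requires x^2 = 175.

```python
import math
from statistics import mean, pstdev

def strictly_between(values, lo, hi):
    """Count the entries strictly inside the interval (lo, hi)."""
    return sum(lo < v < hi for v in values)

list_a = [40, 40, 60, 60]

# Assumed completion of List B: pick x so that the SD comes out to 10
x = math.sqrt(175)
list_b = [45, 55, 50 + x, 50 - x]

for name, values in (("A", list_a), ("B", list_b)):
    inside = strictly_between(values, 40, 60)
    print(f"List {name}: mean = {mean(values):.2f}, SD = {pstdev(values):.2f}, "
          f"{inside} of {len(values)} strictly between 40 and 60")
```

Neither list comes close to the 68 percent that a normal curve puts within one SD of the mean.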
Problem 10. Percentages in a skewed distribution are not normal.
Income distribution in 1992: mean: $ 35,000 SD: $ 23,000.
The percentage of incomes above the mean is likely to be smaller than with the normal distribution,
where it would be 50 percent; go with the 40 percent figure.
Only if the income distribution were LEFT SKEWED rather than right skewed would it be possible for
the percentage above the mean to be greater than 50 percent.
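To get a feel for the size of the effect, one can assume a specific right-skewed shape. The sketch below assumes incomes follow a lognormal distribution with the stated mean and SD. That shape is purely an illustrative assumption; nothing in the problem says incomes are lognormal, and the figure it produces is only a rough guide.

```python
import math
from statistics import NormalDist

mean_income, sd_income = 35_000, 23_000

# Lognormal parameters (mu, sigma of log-income) matching the mean and SD
sigma2 = math.log(1 + (sd_income / mean_income) ** 2)
sigma = math.sqrt(sigma2)
mu = math.log(mean_income) - sigma2 / 2

# Median of a lognormal is exp(mu); it sits below the mean when skewed right
median_income = math.exp(mu)

# P(income > mean) = P(log-income > log(mean)), a normal tail area
z = (math.log(mean_income) - mu) / sigma
pct_above_mean = 100 * (1 - NormalDist().cdf(z))

print(f"median under this assumption: {median_income:,.0f}")
print(f"percent of incomes above the mean: {pct_above_mean:.1f}")
```

Under this assumed shape the median falls below the mean, so fewer than half of incomes lie above the mean.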
Problem 11. Statistics at Berkeley.
Knowing more about Statistics 2 would help: is Statistics 1 a prerequisite ? Or is Stat 2 an alternative
designed for students who placed out of Stat 1 on an exam?
If Stat 1 were a prerequisite, every student would have taken at least 1 course, with a very few students who had taken
another course or two pulling the average up to 1.1. The right-skewed distribution (i) would definitely be the one
to choose.
If Stat 1 were not prerequisite, the right-skewed distribution is still the likely choice: there is no
possibility of taking a negative number of math courses, so a long left tail as in (iii) could be ruled out, and one
would expect a few students who have taken a large number of courses, making symmetry unlikely.
Problem 12. Census "households" and "families"
Households include both single-person households and multiple-person households. Obviously, the
income of a multiple-person household stands a good chance of being higher than that of a single person. A bit less
obviously, single persons are more likely to be young, and hence more likely to be earning starting salaries.
The figures from the Census Bureau, from Income, Poverty and Health Insurance Coverage in the
United States, 2007 (Aug. 2008, accessible from www.census.gov/hhes/www/income/income.html, Table 1, p. 7), are:

                       Median income    90 percent CI    Number of households (000)
All households            50,233          +/- 230              116,783
Family households         62,359          +/- 322               77,873
Married couples           72,785          +/- 528               58,370
Non-family                38,910          +/- 260               38,910
The mean income of all households was 67,609 (with standard error of 236) -- Table A-1, p.31.