Statistics 515
Statistical Methods I
First Examination (with Answers)
February 21, 2002
E. A. Pena's Class

NAME _____________________________________ SCORE ______________

Part A [25 points] (Numerical Summary Measures): Iron status in athletes is important because of the central role of this mineral in the synthesis of hemoglobin and enzymes fundamental to energy production. The following seven observations are the hemoglobin levels (g/dl) of female alpine skiers.

Table 1: Unarranged data set for the hemoglobin level of seven female alpine skiers.

    14.6  14.3  15.1  12.7  11.8  13.4  13.8

For this sample data set,

1. Compute the sample mean.
Answer: Sample Mean = 95.7/7 = 13.67

2. Compute the sample median.
Answer: First arrange the data: 11.8, 12.7, 13.4, 13.8, 14.3, 14.6, 15.1. Median = 13.8

3. Compute the first quartile.
Answer: First Quartile: Either the average of 12.7 and 13.4, which is 13.05, or you may just take 12.7 as the first quartile.

4. Compute the sample variance.
Answer: Sample Variance = [1316.19 - (95.7)^2/7]/(7-1) = 1.306

5. Compute the sample standard deviation.
Answer: Sample Standard Deviation = Square Root of 1.306 = 1.14

Part B [30 points] (Data Organization and Interpretations): In the August 29, 2001 issue of The State, Columbia's daily newspaper, SAT scores for South Carolina's 86 school districts for the years 1998-2001 were reported. The variables in this data set are:

    SAT98 = school district SAT score for 1998.
    SAT99 = school district SAT score for 1999.
    SAT00 = school district SAT score for 2000.
    SAT01 = school district SAT score for 2001.

Using Minitab, the following numerical summary measures and graphical displays were obtained for this data set.

Figure 1. The following frequency histogram is that associated with the SAT scores for Year 2001 for the 86 school districts.
[Figure 1 image not reproduced. Title: Frequency Histogram of 2001 SAT Scores of 86 South Carolina School Districts. Vertical axis: Frequency (0 to 20). Horizontal axis: SAT01 (720 to 1120, in steps of 40).]

Figure 2: Comparative boxplots of the SAT scores for the 86 school districts for years 1998-2001.

[Figure 2 image not reproduced. Title: Comparative BoxPlots of the SAT Scores for 86 of South Carolina's School Districts for 1998-2001. Vertical axis: SAT Score (700 to 1100). Boxplots shown for SAT98, SAT99, SAT00, SAT01.]

Table 2: The following are numerical summary measures for the SAT Scores for the 86 districts for each of the years 1998-2001.

    Variable    N     Mean   Median   StDev   Minimum   Maximum       Q1       Q3
    SAT98      86   918.43   926.50   68.83    741.00   1051.00   879.25   969.50
    SAT99      86   911.38   921.50   75.85    731.00   1049.00   857.25   969.00
    SAT00      86   922.91   938.00   77.19    730.00   1056.00   882.50   979.25
    SAT01      86   934.49   944.50   74.43    753.00   1063.00   903.00   988.25

On the basis of the information in Table 2, Figure 1, and Figure 2, answer the following questions pertaining to the SAT scores of South Carolina's school districts.

1. By examining Figure 1, describe the shape of the distribution of the SAT scores for the 86 school districts for Year 2001. (That is, would you describe the shape as symmetric, left-skewed, or right-skewed?) Is your answer in agreement with the relationship between the mean and median [which you could obtain from Table 2] for SAT01?
Answer: The distribution is left-skewed. This is consistent with the observation that the sample mean (934.49) is smaller than the median (944.50), because extreme values on the left pull the mean down.

2. From Figure 1, how many out of the 86 school districts got SAT scores of at most 800 points?
Answer: From the histogram, the number is 1 + 5 = 6.

3. Using information in Table 2, what value will "balance" or serve as the "center of gravity" of the distribution of the SAT01 scores?
Answer: The center of gravity coincides with the sample mean, so this is 934.49.
The median need not balance the distribution; it only divides it into two equal parts.

4. Using Table 2, which value divides the SAT01 scores into a 25:75 split?
Answer: The quantity that splits the data set into a 25:75 split is Q1 = 903.

5. From Table 2, the mean and standard deviation for the SAT01 scores are 934.49 and 74.43, respectively. If you are to use the empirical rule, what percentage of the 86 school districts would you expect to have scores between 934.49 - 2(74.43) = 785.63 and 934.49 + 2(74.43) = 1083.35?
Answer: Since this interval extends 2 standard deviations on either side of the mean, the empirical rule says that approximately 95% of all observations will fall in the interval. If one uses Chebyshev's rule instead, then we could claim that at least 75% of all observations will fall in this interval.

6. By referring to Figure 2 (Comparative Boxplots) and Table 2 (Numerical Summary Measures), make a comparison of the SAT scores of South Carolina school districts for the years 1998 to 2001. In particular, could you conclude that the SAT scores have improved from 1998 to 2001 for the 86 South Carolina school districts? Provide a brief discussion.
Answer: Looking at the box plots and the values of the means and medians, there seems to be a slight increase in the SAT scores over the 4-year period. Whether the increase is significant remains to be seen.

Part C [30 points] (Basic Probability): Below is a two-way table of 31510 suicides committed in 1993, categorized by the sex of the victim and the method used. ("Hanging" also includes suffocation.)

Table 3: A two-way table of suicides classified according to sex of victim and the method used.

    Method \ Sex of Victim     Male   Female    TOTAL
    Firearms                  16381     2559    18940
    Poison                     3569     2110     5679
    Hanging                    3824      803     4627
    Other                      1641      623     2264
    TOTAL                     25415     6095    31510

Consider the experiment of choosing one suicide victim among the 31510 suicides committed in 1993 as depicted in Table 3.
For this experiment, the method of suicide used and the sex of the victim will be observed. Let A be the event that the victim used firearms to commit suicide, and B be the event that the victim is female.

1. What is P(A)?
Answer: 18940/31510 = .6011

2. What is P(B)?
Answer: 6095/31510 = .1934

3. Find P(A or B).
Answer: (18940 + 6095 - 2559)/31510 = .7133

4. Find P(A and B).
Answer: 2559/31510 = .0812. Note that you cannot multiply P(A) and P(B) since we do NOT know that they are independent.

5. Find P(B|A).
Answer: 2559/18940 = .1351

6. Are events A and B independent events?
Answer: Since P(B|A) is not equal to P(B), A and B are dependent.

Part D [10 points] (Probability Updating): In a genetic setting, either a parent is a carrier or is not a carrier of some trait (for example, the trait of "being smart"). If the parent is a carrier, then the conditional probability that an offspring will have the trait is 0.75; while if the parent is not a carrier, then the conditional probability that an offspring will have the trait is 0.25. Assume that the prior probability that the parent is a carrier of the trait is 0.30. Suppose that this parent has one offspring. [HINT: Would help to draw a tree diagram!]

1. What is the probability that the offspring will have the trait?
Answer: P(trait) = P(carrier and trait) + P(not a carrier and trait) = (.3)(.75) + (.7)(.25) = .225 + .175 = .40

2. Given that the offspring possesses the trait, what is the conditional probability that the parent is a carrier of the trait?
Answer: P(carrier|trait) = P(carrier and trait)/P(trait) = (.3)(.75)/(.4) = .5625

Part E [10 points]. A random variable X takes values 1, 4, 5 according to the following probability function:

    x                   1     4     5
    p(x) = P(X = x)    .5    .3    .2

1. Compute the (population) mean, μ, of X.
Answer: Mean = (1)(.5) + (4)(.3) + (5)(.2) = 2.7

2. Compute the (population) variance, σ^2, of X.
Answer: Variance = (1-2.7)^2(.5) + (4-2.7)^2(.3) + (5-2.7)^2(.2) = 3.01

Part F [15 points].
On the basis of past examinations, the probability that a student will pass the First Examination in a Stat 515 class is 0.90. Furthermore, the performances of the students in the class can be considered independent of each other. Suppose that there are 20 students in a Stat 515 class who will take the First Examination. Denote by X the number of students out of these 20 students who will pass the First Examination.

1. Explain why it is reasonable to assume that the distribution of X is binomial with parameters n = 20 and p = .90.
Answer: The binomial distribution is appropriate since there are 20 trials, each with two possible outcomes, the trials are independent, the probability of "pass" per trial remains the same at .90, and X denotes the number of "passes" in the 20 trials.

2. What are the mean and standard deviation of X?
Answer: Mean = np = (20)(.9) = 18
Variance = n(p)(1-p) = (20)(.9)(1-.9) = 1.8
Standard Deviation = Square Root of 1.8 = 1.34

3. Using the binomial table that is provided, determine P{15 ≤ X ≤ 18}.
Answer: P(15 ≤ X ≤ 18) = P(X ≤ 18) - P(X ≤ 14) = .608 - .011 = .597.

Some Formulas That May Be Useful

$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$

$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} X_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} X_i \right)^2 \right]$

M = value that divides arranged data into two equal parts
Q1 = Divides arranged data into 25:75 split
Q3 = Divides arranged data into 75:25 split

P(A or B) = P(A) + P(B) - P(A and B)
P(B|A) = P(A and B)/P(A)
P(B) = P(A)P(B|A) + P(A^c)P(B|A^c)
P(A|B) = P(A)P(B|A)/P(B)
P(A and B) = P(A)P(B) if A and B are independent

n! = (n)(n-1)(n-2)...(2)(1) with 0! = 1

${}_nC_r = \binom{n}{r} = \frac{n!}{r!\,(n-r)!}$

$\mu = \sum x \, p(x)$ ; $\sigma^2 = \sum (x - \mu)^2 \, p(x)$

$p(x) = \binom{n}{x} p^x (1-p)^{n-x}$, x = 0, 1, 2, …, n

$\mu = np$ and $\sigma^2 = np(1-p)$
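
As a cross-check, the numerical answers to Parts A, D, E, and F can be re-derived from the formulas above. The following is a minimal Python sketch and is not part of the original examination. The variable and function names (binom_pmf, binom_cdf, and so on) are illustrative choices, and the binomial cumulative probabilities are computed directly from the formula for p(x) rather than read from the provided table. The interval probability in Part F is evaluated as P(X ≤ 18) - P(X ≤ 14), the reading that reproduces the printed values .608 and .011.

```python
from math import comb, sqrt

# Part A: hemoglobin levels (g/dl) from Table 1.
data = [14.6, 14.3, 15.1, 12.7, 11.8, 13.4, 13.8]
n = len(data)
total = sum(data)                      # sum of X_i
total_sq = sum(x * x for x in data)    # sum of X_i^2
sample_mean = total / n
# Computational form of S^2 from the formula sheet.
sample_var = (total_sq - total ** 2 / n) / (n - 1)
sample_sd = sqrt(sample_var)

# Part D: total probability, then Bayes' rule.
prior_carrier = 0.30
p_trait = prior_carrier * 0.75 + (1 - prior_carrier) * 0.25
p_carrier_given_trait = prior_carrier * 0.75 / p_trait

# Part E: mean and variance of a discrete random variable.
pmf = {1: 0.5, 4: 0.3, 5: 0.2}
mu = sum(x * p for x, p in pmf.items())
sigma2 = sum((x - mu) ** 2 * p for x, p in pmf.items())


# Part F: binomial probabilities with n = 20, p = 0.90.
def binom_pmf(x, n_trials, p):
    """P(X = x) for a binomial(n_trials, p) random variable."""
    return comb(n_trials, x) * p ** x * (1 - p) ** (n_trials - x)


def binom_cdf(k, n_trials, p):
    """P(X <= k), obtained by summing the pmf from 0 to k."""
    return sum(binom_pmf(x, n_trials, p) for x in range(k + 1))


binom_mean = 20 * 0.9
binom_sd = sqrt(20 * 0.9 * (1 - 0.9))
cdf_18 = binom_cdf(18, 20, 0.9)        # P(X <= 18)
cdf_14 = binom_cdf(14, 20, 0.9)        # P(X <= 14)
p_15_to_18 = cdf_18 - cdf_14           # P(15 <= X <= 18)

print(f"Part A: mean={sample_mean:.2f} var={sample_var:.3f} sd={sample_sd:.2f}")
print(f"Part D: P(trait)={p_trait:.2f} P(carrier|trait)={p_carrier_given_trait:.4f}")
print(f"Part E: mu={mu:.2f} sigma^2={sigma2:.2f}")
print(f"Part F: mean={binom_mean:.0f} sd={binom_sd:.2f} P(15<=X<=18)={p_15_to_18:.3f}")
```

Summing the pmf directly keeps the sketch free of third-party dependencies; a statistics library would serve equally well for the binomial part.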