Sihyun Kim
MAT 120.7688
February 16, 2006

Statistics Project
Data Set 8: New York City Marathon Finishers

1. Describe the data you selected and explain why they are important.

The data depict a sample of 150 runners who were selected from the population of 29,733 runners who finished the New York City Marathon in a recent year. The data show the time and rank in which these runners completed the marathon. Furthermore, the data list the sex and age of each runner in the sample. Because the data consist of a large sample of runners, a careful analysis of the statistics may help us determine whether patterns exist among New York City Marathon runners. For example: does gender substantially affect the likelihood of finishing faster than others? Using the investigative techniques that we have encountered so far in our Statistics course, we may be able to answer such questions and draw inferences concerning not just the sample of 150 runners, but the population of 29,733 New York City Marathon finishers as a whole. Only through a better understanding of the individuals who participate in this sporting event can we truly appreciate the incredible feat of running 42.195 km nonstop.

2. Using SPSS, compute descriptive statistics on your data. That is, you will find mean, median, mode, quartiles, variance, standard deviation, maximum, and minimum, range and mid-range.

Frequencies of Marathon Runners

Statistics

                          Order           Gender   Age         Time (sec)
N Valid                   150             150      150         150
N Missing                 0               0        0           0
Mean                      14309.53                 38.87       15878.84
Std. Error of Mean        705.693                  .828        256.725
Median                    13279.50(a)              37.73(a)    15322.00(a)
Mode                      130(b)                   30(b)       9631(b)
Std. Deviation            8642.936                 10.144      3144.226
Variance                  74700337.674             102.895     9886157.974
Skewness                  .016                     .443        .645
Std. Error of Skewness    .198                     .198        .198
Kurtosis                  -1.175                   -.377       .551
Std. Error of Kurtosis    .394                     .394        .394
Range                     28915                    49          16267
Minimum                   130                      19          9631
Maximum                   29045                    68          25898
Sum                       2146429                  5830        2381826
Percentile 25             7093.00(c)               31.00(c)    13854.00(c)
Percentile 50             13279.50                 37.73       15322.00
Percentile 75             21017.00                 46.25       17397.00

a Calculated from grouped data.
b Multiple modes exist. The smallest value is shown.
c Percentiles are calculated from grouped data.

Descriptive Statistics Between Genders

Case Processing Summary: Time (sec)

Gender    Valid N   Percent    Missing N   Percent    Total N   Percent
M         111       100.0%     0           .0%        111       100.0%
F         39        100.0%     0           .0%        39        100.0%

Descriptives: Time (sec)

                                    Gender M                    Gender F
                                    Statistic     Std. Error    Statistic     Std. Error
Mean                                15415.23      288.236       17198.33      497.545
95% CI for Mean, Lower Bound        14844.02                    16191.11
95% CI for Mean, Upper Bound        15986.45                    18205.56
5% Trimmed Mean                     15297.23                    17008.16
Median                              14942.00                    16792.00
Variance                            9221888.290                 9654503.123
Std. Deviation                      3036.756                    3107.170
Minimum                             9631                        12047
Maximum                             24384                       25898
Range                               14753                       13851
Interquartile Range                 3844.00                     3503.00
Skewness                            .582          .229          .984          .378
Kurtosis                            .214          .455          1.277         .741

3. Explain in your words what each of the above statistics mean, in general. Also explain what it means as related to your project. Compare the different measures of central tendency, measures of dispersion and explain what they tell you about your individual data.

Mean- The mean is the measure of center in a set of values that is found by adding the values and dividing the total by the number of values. The mean is generally considered the most important of all numerical measurements used to describe data.
The sample mean of the time needed by male New York City Marathon runners was 15415.23 seconds (4 hours 16 minutes 55 seconds), while the sample mean for female runners was 17198.33 seconds (4 hours 46 minutes 38 seconds). In other words, the average male runner in the sample finished the event about 30 minutes before the average female runner.

Median- The median of a data set is the measure of center that is found by choosing the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude. The median is especially useful when a set of values contains exceptional values, because such values can dramatically affect the mean but have little effect on the median. The sample median of the time needed by male New York City Marathon runners was 14942.00 seconds (4 hours 9 minutes 2 seconds), while the sample median for female runners was 16792.00 seconds (4 hours 39 minutes 52 seconds). In other words, about 50% of the male runners in the sample were faster than the male runner who finished the marathon in 14942.00 seconds, and the remaining 50% were slower than he was. Likewise, about 50% of the female runners in the sample were faster than the female runner who finished the marathon in 16792.00 seconds, and the remaining 50% were slower than she was. In addition, we can observe that the runner whose time is the male sample median finished the event about 31 minutes earlier than the runner whose time is the female sample median.

Mode- The mode of a data set is the value that occurs most frequently. The mode is the only measure of center that can be used with nominal data, and it is informative only when some values repeat. Therefore, the mode tells us little about the data under consideration in this statistical analysis.
The finishing time of each marathon runner in the sample is unique, and therefore no time appears more than once in the data. (SPSS accordingly reports that multiple modes exist and shows only the smallest value, 9631.)

Quartiles- Just as the median divides the data into two equal parts, the three quartiles, denoted by Q1 (first quartile), Q2 (second quartile), and Q3 (third quartile), divide the sorted values into four equal parts. The first quartile separates the bottom 25% of the sorted values from the top 75%; the second quartile, which is the same as the median, separates the bottom 50% from the top 50%; and the third quartile separates the bottom 75% from the top 25%. Quartiles are especially helpful in the analysis of the data in question. For the male New York City Marathon runners, the quartiles of the time needed to complete the event are as follows: Q1 = 12907 seconds, Q2 = 14942 seconds, and Q3 = 16977 seconds. For the female runners, the quartiles are as follows: Q1 = 14711 seconds, Q2 = 16792 seconds, and Q3 = 18874 seconds. (These first and third quartiles lie about 0.67 standard deviations on either side of the median, the spacing expected under a normal distribution, so the interquartile ranges they imply differ somewhat from the values of 3844.00 and 3503.00 seconds reported by SPSS.) We shall see later in the boxplot analysis that these statistics give us a quick but firm understanding of how the distribution of finishing times differs between males and females.

Standard Deviation- The standard deviation of a set of sample values is a measure of variation of values about the mean. Values close together will yield a small standard deviation, whereas values spread farther apart will yield a larger standard deviation. The standard deviation is the measure of variation that is generally considered the most important and useful in statistical analysis. The standard deviation of the time needed to complete the New York City Marathon for male runners is 3036.76 seconds.
Because the mean is 15415.23 seconds, we can interpret this statistic as follows: if the times are approximately bell-shaped, times within the interval 15415.23 ± 3036.76 seconds (within 1 standard deviation of the mean) make up about 68% of the observations in the sample, and times within the interval 15415.23 ± 6073.52 seconds (within 2 standard deviations of the mean) make up about 95% of the observations in the sample. The standard deviation of the time needed to complete the New York City Marathon for female runners is 3107.17 seconds. Because the mean is 17198.33 seconds, we can interpret this statistic as follows: times within the interval 17198.33 ± 3107.17 seconds (within 1 standard deviation of the mean) make up about 68% of the observations in the sample, and times within the interval 17198.33 ± 6214.34 seconds (within 2 standard deviations of the mean) make up about 95% of the observations in the sample.

Variance- The variance of a set of values is a measure of variation equal to the square of the standard deviation. The usefulness of variance is especially clear when we deal with sampling distributions: the sample variance is an unbiased estimator of the population variance, so it is likely to yield good results when we use a sample statistic to estimate a population parameter. The sample standard deviation, on the other hand, is a biased estimator that tends to underestimate the population standard deviation, especially when the sample size is relatively small. The variances of the time needed to complete the New York City Marathon for male and female runners are 9221888.29 and 9654503.12 seconds squared, respectively. Because we are using a sample to make estimates about the population of 29,733 runners who finished the New York City Marathon, the variance serves as a better statistic than the standard deviation for estimating population parameters.

Maximum and Minimum- The maximum is the highest observed value in a given set of data. Conversely, the minimum is the lowest observed value in a given set of data.
Both values are extremes of a given data set. Together, the maximum and the minimum show us examples of observations that are very unusual in a set of values. The minimum time (in the sample) needed to complete the New York City Marathon for male runners was 9631 seconds (2 hours 40 minutes 31 seconds), whereas the maximum time (in the sample) was 24384 seconds (6 hours 46 minutes 24 seconds). These values lie far from the mean of 15415.23 seconds (4 hours 16 minutes 55 seconds) and, therefore, certainly do not represent the time needed by the typical male runner to finish the marathon. Similarly, the minimum time (in the sample) for female runners was 12047 seconds (3 hours 20 minutes 47 seconds), whereas the maximum time (in the sample) was 25898 seconds (7 hours 11 minutes 38 seconds). Again, these values certainly do not represent the time needed by the typical female runner to finish the marathon.

Range- The range of a set of data is the difference between the highest value and the lowest value. Although the range is very easy to compute, it is not as useful as the other measures of variation because it depends only on the extremes of a given data set. The range of the time needed to complete the New York City Marathon for male runners was 14753 seconds. This means that the slowest male in the sample finished 4 hours 5 minutes 53 seconds after the fastest male in the sample. Likewise, the range for female runners was 13851 seconds. In other words, the slowest female in the sample finished 3 hours 50 minutes 51 seconds after the fastest female in the sample.

Midrange- The midrange is the measure of center that is the midway value between the highest and lowest values in a given data set.
It is calculated by dividing the sum of the maximum and the minimum by two. Although the midrange is rarely used (like the range, it depends only on the extremes of a given data set), it has some advantages: (1) it is easily computed; (2) it helps to reinforce the important concept that there are many ways to define the center of a data set. The midrange of the times needed by male New York City Marathon runners to complete the event was 17007.5 seconds, whereas the midrange for female runners was 18972.5 seconds. These statistics give us yet another way to interpret the data in question: a male runner who required about 17007.5 seconds to complete the marathon would have finished exactly halfway in time between the fastest and the slowest men in the sample, both of whom required very unusual periods of time to accomplish the very same thing. Likewise, a female runner who required about 18972.5 seconds would have finished exactly halfway in time between the fastest and the slowest women in the sample.

4. Construct as many of the graphical representations of your data as possible. Discuss the results as they pertain to your set of data.

[Figure: Histogram: Frequency of Time (Male Marathon Runners)]
[Figure: Histogram: Frequency of Time (Female Marathon Runners)]

Through observation of the histograms that depict the frequency distribution of the time needed by male and female runners to complete the New York City Marathon, we can see that the times for both sexes are approximately normally distributed. More specifically, the distribution of the data is, for the most part, symmetrical. Indeed, the mean and median for both males (mean = 15415.23 seconds, median = 14942.00 seconds) and females (mean = 17198.33 seconds, median = 16792.00 seconds) are quite close to each other.
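One quick way to check the symmetry claim numerically is Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation. The short Python sketch below applies it to the summary statistics in the SPSS Descriptives output. The coefficient itself is not part of the SPSS output and the function name `pearson_skew` is only illustrative; it is introduced here as a rough supplementary check, not as the method used in this project.

```python
# Summary statistics copied from the SPSS "Descriptives" output for Time (sec).
summary = {
    "M": {"mean": 15415.23, "median": 14942.00, "sd": 3036.756},
    "F": {"mean": 17198.33, "median": 16792.00, "sd": 3107.170},
}


def pearson_skew(mean, median, sd):
    """Pearson's second skewness coefficient: 3 * (mean - median) / sd.

    This is an added rough check (an assumption of this sketch), not a value
    reported by SPSS. Zero indicates symmetry; positive values indicate that
    the mean lies above the median (a longer right tail).
    """
    return 3.0 * (mean - median) / sd


# One coefficient per gender.
skew = {gender: pearson_skew(**stats) for gender, stats in summary.items()}

for gender, value in skew.items():
    print(f"{gender}: Pearson skewness = {value:.3f}")
```

Both coefficients come out positive but well below 1, which agrees in sign with the SPSS skewness values (.582 for men and .984 for women). This suggests that the distributions are roughly symmetric with a mild right skew, so "approximately normal" is a reasonable but not exact description.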
Boxplot Analysis of Male and Female Marathon Runners

[Figure: boxplots of Time (sec), axis from 0 to 30000, by Gender (M: N = 111; F: N = 39); points labeled 147, 148, 149, and 150 are marked as outliers]

Boxplot analysis is one of the most useful tools for gaining a quick but firm understanding of the differences in the distribution of the time needed by males and females to complete the New York City Marathon. Through the boxplot analysis, we can draw the following conclusions: (1) The sample median of the time needed by male runners to complete the New York City Marathon is smaller than that of female runners. (2) The range of the time needed by male runners is slightly bigger than that of female runners. In other words, the times of the male runners varied slightly more than those of their counterparts. However, if we exclude from our analysis the four outliers that the male and female samples have, the difference in range between the two genders becomes even more apparent. Disregarding these four extreme observations, we can see that men overall appear to vary much more than women do in their ability to finish the marathon. (3) The interquartile range of the time needed by female runners to complete the marathon is smaller than that of the male runners. What this means is that the middle 50% of the times of the female runners are clustered closer to the median than are those of the male runners.

Scatter Plot Analysis: The Effect of Age on Order

[Figure: scatter plot of Order against Age]

Although the main focus of this project is on the effect of gender on the time needed to complete the New York City Marathon, I wished to see whether age, too, significantly affects the likelihood of a runner finishing faster than his or her competitors. Through this scatter plot, we can see that the sample does not provide concrete evidence to support this hypothesis: there seems to be no correlation between age and the order in which a runner finished the marathon.

6. Find a confidence interval for the population mean and write a short paragraph about your results.

The SPSS-generated descriptive statistics show us that the 95% confidence interval for the mean time needed by male runners to complete the New York City Marathon is 14844.02 < μ < 15986.45. What this means is that we are 95% confident that the interval from 14844.02 to 15986.45 seconds actually contains the true value of the mean time needed by the male runners of this particular New York City Marathon. The SPSS-generated descriptive statistics also show us that the 95% confidence interval for the mean time needed by female runners is 16191.11 < μ < 18205.56. Again, we can be 95% confident that this interval actually contains the true value of the population mean. Through the confidence interval, we can make a reasonable estimate of the mean time needed by the female population of runners to complete this particular New York City Marathon.

7. Make a claim about the data. What is the null and alternate hypothesis?

According to MarathonGuide.com, a website devoted to the sport, the average time needed by American male runners to complete a marathon was 4 hours 41 minutes 32 seconds (16832 seconds), whereas the average time needed by female runners was 5 hours 1 minute 6 seconds (18066 seconds). Furthermore, the standard deviation for men was 1:01:03 (3663 seconds), whereas the standard deviation for women was 1:06:59 (4019 seconds) [1]. Because the website gathered the statistics from the analysis of all recorded marathon finishing times in the United States in 2005, the data can be treated as describing the population of U.S. marathon runners as a whole (thereby incorporating New York City Marathon runners and many more). Now, the New York City Marathon, along with the Boston Marathon, is one of the most popular professional marathons in the world.
For this reason, it seems safe to assume that the New York City Marathon attracts some of the world’s most skilled marathon runners. It seems highly probable that the mean running times of both male and female participants in the event are lower than those of the average male and female marathoners in the United States. To test this claim, we should test the following two hypotheses:

(1) The mean time needed by male runners of the New York City Marathon to complete the competition is less than 16832 seconds.
H0 (null hypothesis): μ = 16832
HA (alternative hypothesis): μ < 16832

(2) The mean time needed by female runners of the New York City Marathon to complete the competition is less than 18066 seconds.
H0 (null hypothesis): μ = 18066
HA (alternative hypothesis): μ < 18066

[1] MarathonGuide.com, “USA Marathoning: 2005 Overview,” [http://www.marathonguide.com/features/Articles/2005RecapOverview.cfm#FinishingTimes], 2006.

8. Explain the test you will use and say why.

We are given a rare situation in which we know the standard deviation of the population being analyzed. We already know that the standard deviations of the time needed by the typical American male and female marathoners to finish a race are 3663 and 4019 seconds, respectively. As a result, a hypothesis test that involves the normal distribution will suffice to support the claim that the outstanding performance of the average New York City Marathon runner is not due to random chance, which in turn would allow us to infer that the marathon attracts exceptional individuals. Because we already know the values of σ, we do not have to use the Student t distribution to test the hypotheses.

Conclusion

The analysis of the sample of 150 runners who were selected from the population of 29,733 runners who finished the New York City Marathon in a recent year seems to suggest that men are generally faster than women are.
The mean and median of the time needed to complete the race were lower for men than they were for women. Interestingly enough, however, the women's times tended to vary less than the men's did: the boxplot analysis of the distribution of time for men and women seems to suggest that the times needed by women to complete the marathon were clustered closer to the center than were those needed by men. Nevertheless, we cannot come to a conclusion so hastily. After all, only 39 of the 150 runners who were selected were women. Because women represent only 26% of the sample, we must be careful in generalizing about whether gender can significantly affect the time needed by a runner to finish the marathon.

Bibliography

MarathonGuide.com. “USA Marathoning: 2005 Overview.” [http://www.marathonguide.com/features/Articles/2005RecapOverview.cfm#FinishingTimes]. 2006.
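
Supplement to items 7 and 8: the left-tailed test with a known population standard deviation that is described there can be sketched in Python as shown below. This is only an illustrative sketch under the report's own assumptions: the hypothesized means (16832 and 18066 seconds) and the standard deviations (3663 and 4019 seconds) are the MarathonGuide.com figures as stated in the hypotheses, and they are assumed to apply to New York City Marathon runners. The function name `left_tailed_z_test` is hypothetical, and the report itself does not state a significance level.

```python
from math import sqrt
from statistics import NormalDist


def left_tailed_z_test(sample_mean, mu0, sigma, n):
    """Return (z, p_value) for H0: mu = mu0 versus HA: mu < mu0.

    sigma is treated as the known population standard deviation, which is
    the assumption made in item 8 of the report.
    """
    standard_error = sigma / sqrt(n)
    z = (sample_mean - mu0) / standard_error
    # Left-tailed p-value: area under the standard normal curve to the left of z.
    p_value = NormalDist().cdf(z)
    return z, p_value


# Sample means and sizes come from the SPSS Descriptives output;
# mu0 and sigma are the values stated in the hypotheses (assumed, not verified).
z_m, p_m = left_tailed_z_test(sample_mean=15415.23, mu0=16832, sigma=3663, n=111)
z_f, p_f = left_tailed_z_test(sample_mean=17198.33, mu0=18066, sigma=4019, n=39)

print(f"Men:   z = {z_m:.2f}, p = {p_m:.5f}")
print(f"Women: z = {z_f:.2f}, p = {p_f:.5f}")
```

With these inputs the test statistic is roughly -4.1 for men and roughly -1.3 for women. Whether each null hypothesis is rejected depends on the significance level chosen, and the result is only as reliable as the assumption that the national standard deviations describe New York City Marathon runners.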