New York City Marathon

Sihyun Kim
MAT 120.7688
February 16, 2006
Statistics Project
Data Set 8: New York City Marathon Finishers.
1. Describe the data you selected and explain why they are important.
The data depict a sample of 150 runners who were selected from the population of 29,733
runners who finished the New York City Marathon in a recent year. The data show the time and
rank in which these runners were able to complete the marathon. Furthermore, the data list the
sex of each runner in the sample. Because the data consist of a large sample of runners, a careful
analysis of the statistics may help us determine whether patterns exist among New York City
Marathon runners. For example: Does gender substantially affect how quickly a runner finishes
relative to others? Through the investigative techniques that we have thus far
encountered in our Statistics course, we may be able to answer such questions and
make inferences concerning not just the sample of 150 runners, but the general population of
29,733 New York City Marathon runners as a whole. Only through a better understanding of the
individuals who participate in the sporting event can we truly appreciate the incredible
feat of running 42.195 km nonstop.
2. Using SPSS, compute descriptive statistics on your data. That is, you will find mean, median,
mode, quartiles, variance, standard deviation, maximum, and minimum, range and mid-range.
Please see the following page
Frequencies of Marathon Runners

Statistics
                                 Order         Age      Gender    Time (sec)
N Valid                            150         150         150           150
N Missing                            0           0           0             0
Mean                          14309.53       38.87                  15878.84
Std. Error of Mean             705.693        .828                   256.725
Median                     13279.50(a)    37.73(a)               15322.00(a)
Mode                            130(b)       30(b)                   9631(b)
Std. Deviation                8642.936      10.144                  3144.226
Variance                  74700337.674     102.895               9886157.974
Skewness                          .016        .443                      .645
Std. Error of Skewness            .198        .198                      .198
Kurtosis                        -1.175       -.377                      .551
Std. Error of Kurtosis            .394        .394                      .394
Range                            28915          49                     16267
Minimum                            130          19                      9631
Maximum                          29045          68                     25898
Sum                            2146429        5830                   2381826
Percentiles 25              7093.00(c)    31.00(c)               13854.00(c)
Percentiles 50                13279.50       37.73                  15322.00
Percentiles 75                21017.00       46.25                  17397.00

a Calculated from grouped data.
b Multiple modes exist. The smallest value is shown.
c Percentiles are calculated from grouped data.
Descriptive Statistics Between Genders

Case Processing Summary: Time (sec)
                    Valid               Missing              Total
Gender          N     Percent        N     Percent        N     Percent
M             111      100.0%        0         .0%      111      100.0%
F              39      100.0%        0         .0%       39      100.0%
Descriptives: Time (sec) by Gender
                                            M                          F
                                    Statistic   Std. Error     Statistic   Std. Error
Mean                                 15415.23      288.236      17198.33      497.545
95% Confidence Interval for Mean
  Lower Bound                        14844.02                   16191.11
  Upper Bound                        15986.45                   18205.56
5% Trimmed Mean                      15297.23                   17008.16
Median                               14942.00                   16792.00
Variance                          9221888.290                9654503.123
Std. Deviation                       3036.756                   3107.170
Minimum                                  9631                      12047
Maximum                                 24384                      25898
Range                                   14753                      13851
Interquartile Range                   3844.00                    3503.00
Skewness                                 .582         .229          .984         .378
Kurtosis                                 .214         .455         1.277         .741
3. Explain in your words what each of the above statistics mean, in general. Also explain what it
means as related to your project. Compare the different measures of central tendency, measures
of dispersion and explain what they tell you about your individual data.
Mean- The mean is the measure of center in a set of values that is found by adding the values
and dividing the total by the number of values. The mean is generally considered the
most important of all numerical measurements used to describe data.
The sample mean of time needed by male New York City Marathon runners was
15415.23 seconds (4 hours 16 minutes 55 seconds), while the sample mean of time needed by
female New York City Marathon runners was 17198.33 seconds (4 hours 46 minutes 38
seconds). In other words, the typical male New York City Marathon runner was able to finish the
event about 30 minutes before the average female Marathon runner.
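The conversions above can be checked with a few lines of Python. This is only a minimal sketch: the helper name `to_hms` is hypothetical, and the two means are the values reported in the SPSS output.

```python
def to_hms(seconds: float) -> tuple[int, int, int]:
    """Split a duration in seconds into whole hours, minutes, and seconds."""
    total = int(seconds)  # drop the fractional second
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return hours, minutes, secs


# Sample means (seconds) reported in the SPSS output.
male_mean = 15415.23
female_mean = 17198.33

print(to_hms(male_mean))    # (4, 16, 55)
print(to_hms(female_mean))  # (4, 46, 38)

# Gap between the two sample means, in minutes.
gap_minutes = (female_mean - male_mean) / 60
print(round(gap_minutes, 1))  # 29.7
```

The gap of roughly 29.7 minutes is what the text rounds to "about 30 minutes".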
Median- The median of a data set is the measure of center in a set of values that is found by
choosing the middle value when the original data values are arranged in order of increasing (or
decreasing) magnitude. The usefulness of the median is especially clear in the case that a set of
values contain exceptional values (in such a case, the mean will be dramatically affected).
The sample median of time needed by male New York City Marathon runners was
14942.00 seconds (4 hours 9 minutes 2 seconds), while the sample median of time needed by
female New York City Marathon runners was 16792.00 seconds (4 hours 39 minutes 52
seconds). In other words, half of the male runners in the sample finished faster than
14942.00 seconds, and the other half finished slower. Likewise, half of the female
runners in the sample finished faster than 16792.00 seconds, and the other half finished
slower. In addition, we can observe that the male sample median is about 31 minutes
lower than the female sample median.
Mode- The mode of a data set is the value that occurs most frequently. It is the only measure of
center that can be used with nominal data, and it is informative only when values repeat.
Therefore, the mode is not informative for the data that are in consideration for this
statistical analysis. The time needed by each marathon runner in the sample is unique, and,
therefore, no time appears more than once in the data. This is why SPSS reports that multiple
modes exist and shows only the smallest value, 9631 seconds.
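A small Python sketch shows why the mode carries no information when every value is unique. The four times used here are only the minimum and maximum times reported for each gender, standing in for the full sample, which is not reproduced in this report.

```python
from statistics import multimode

# Stand-in data: the four extreme finishing times (seconds) from the SPSS output.
times = [9631, 24384, 12047, 25898]

# When no value repeats, every value is tied for "most frequent".
modes = multimode(times)
print(modes)

# SPSS shows only the smallest of the tied modes.
print(min(modes))  # 9631
```

With all values tied, the reported mode is simply the minimum.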
Quartiles- Just as the median divides the data into two equal parts, the three quartiles, denoted
by Q1 (first quartile), Q2 (second quartile), and Q3 (third quartile) divide the sorted values into
four equal parts. The first quartile separates the bottom 25% of the sorted values from the top
75%, the second quartile, which is the same as the median, separates the bottom 50% of the
sorted values from the top 50%, and the third quartile separates the bottom 75% of the sorted
values from the top 25%.
Quartiles are especially helpful in the analysis of the data in question. For the male New
York City Marathon runners, the quartiles of time needed to complete the event are as follows:
Q1 = 12907 seconds, Q2 = 14942 seconds, and Q3 = 16977 seconds. For the female New York
City Marathon runners, the quartiles of time needed to complete the event are as follows: Q1 =
14711 seconds, Q2 = 16792 seconds, and Q3 = 18874 seconds. Note that these Q1 and Q3 values
are consistent with a normal approximation (median ± 0.67 standard deviations) and therefore
do not reproduce the interquartile ranges of 3844 and 3503 seconds reported by SPSS. We shall
see later on in the box plot analysis that these statistics can give us a quick, but firm,
understanding of the overall distribution of time needed to complete the marathon between
males and females.
Standard Deviation- The standard deviation of a set of sample values is a measure of variation
of values about the mean. Values close together will yield a small standard deviation, whereas
values spread farther apart will yield a larger standard deviation. The standard deviation is the
measure of variation that is generally considered the most important and useful in statistical
analysis.
The standard deviation of time needed to complete the New York City Marathon for male
runners is 3036.76 seconds. Because the mean is 15415.23 seconds, we can interpret this
statistic as follows: if the times are approximately normally distributed, times within the
interval 15415.23 ± 3036.76 seconds (within 1 standard deviation of the mean) make up about
68% of the observations in the sample, and times within the interval 15415.23 ± 6073.52
seconds (within 2 standard deviations of the mean) make up about 95% of the observations in
the sample.
The standard deviation of time needed to complete the New York City Marathon for
female runners is 3107.17 seconds. Because the mean is 17198.33 seconds, we can interpret this
statistic as follows: times within the interval 17198.33 ± 3107.17 seconds (within 1
standard deviation of the mean) make up about 68% of the observations in the sample, and
times within the interval 17198.33 ± 6214.34 seconds (within 2 standard deviations of the
mean) make up about 95% of the observations in the sample.
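The interval endpoints implied by this interpretation can be worked out as in the sketch below. The helper name `sd_interval` is hypothetical. The 68% and 95% figures are the usual rule of thumb for roughly normal data, not something the code verifies.

```python
def sd_interval(mean: float, sd: float, k: int) -> tuple[float, float]:
    """Return the interval mean ± k standard deviations."""
    return mean - k * sd, mean + k * sd


# Sample means and standard deviations (seconds) reported in the SPSS output.
groups = {"M": (15415.23, 3036.76), "F": (17198.33, 3107.17)}

intervals = {
    sex: {k: sd_interval(mean, sd, k) for k in (1, 2)}
    for sex, (mean, sd) in groups.items()
}

for sex, by_k in intervals.items():
    for k, (low, high) in by_k.items():
        print(f"{sex}: within {k} SD -> {low:.2f} to {high:.2f} seconds")
```

For example, the 2-standard-deviation interval for men runs from about 9341.71 to 21488.75 seconds.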
Variance- The variance of a set of values is a measure of variation equal to the square of the
standard deviation. The usefulness of variance is especially clear when we deal with sampling
distributions: the sample variance is an unbiased estimator of the population variance, so it
is likely to yield good results when we use a sample statistic to estimate a population
parameter. The sample standard deviation, on the other hand, tends not to target the
population parameter in cases in which the sample sizes are relatively small.
The variances of the time needed to complete the New York City Marathon for male and
female runners are 9221888.29 and 9654503.12 seconds squared respectively. Because we are
using the sample to make estimations about the general population of the 29,733 runners who
finished the New York City Marathon, the variance will serve as a better statistic than the
standard deviation for estimating population parameters.
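As a quick consistency check, the reported variances should equal the squares of the reported standard deviations, up to rounding. The sketch below uses only the values printed in the SPSS output. The dictionary layout is just an illustrative choice.

```python
# Standard deviations (seconds) and variances (seconds squared) from the SPSS output.
reported = {
    "M": {"sd": 3036.756, "variance": 9221888.290},
    "F": {"sd": 3107.170, "variance": 9654503.123},
}

differences = {}
for sex, stats in reported.items():
    squared_sd = stats["sd"] ** 2
    # Small gaps are expected because the printed SD is rounded to 3 decimals.
    differences[sex] = abs(squared_sd - stats["variance"])
    print(f"{sex}: sd^2 = {squared_sd:.1f}, reported variance = {stats['variance']:.1f}")
```

The small remaining differences come from the rounding of the printed standard deviations.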
Maximum and Minimum- The maximum is the highest observed value in a given set of data.
Conversely, the minimum represents the lowest observed value in a given set of data. Both
values depict extreme values in a given data set. Together, the maximum and the minimum can
show us examples of observations that are very unusual in a set of values.
The minimum time (in the sample) needed to complete the New York City Marathon for
male runners was 9631 seconds (2 hours 40 minutes 31 seconds), whereas the maximum time (in
the sample) needed was 24384 seconds (6 hours 46 minutes 24 seconds). These values deviate
significantly from the mean of 15415.23 seconds (4 hours 16 minutes 55 seconds), and,
therefore, certainly do not represent the time needed by the typical male runner to finish the
marathon. Similarly, the minimum time (in the sample) needed to complete the New York City
Marathon for female runners was 12047 seconds (3 hours 20 minutes 47 seconds), whereas the
maximum time (in the sample) needed was 25898 seconds (7 hours 11 minutes 38 seconds).
Again, these values certainly do not represent the time needed by the typical female runner to
finish the marathon.
Range- The range of a set of data is the difference between the highest value and the lowest
value. Although the range is very easy to compute, it is not as useful as the other measures of
variation due to the fact that it depends on only the extremes of a given data set.
The range of time needed to complete the New York City Marathon for male runners was
14753 seconds. What this means is that the slowest male in the sample finished 4 hours 5
minutes 53 seconds after the fastest male in the sample did. Likewise, the range of time
needed to complete the New York City Marathon for female runners was 13851 seconds. In other
words, the slowest female in the sample finished 3 hours 50 minutes 51 seconds after the
fastest female in the sample did.
Midrange - The midrange is the measure of center that is the midway value between the highest
and lowest values in a given data set. It is calculated by dividing the sum of the maximum and
the minimum by two. Although the midrange is rarely used (like the range, it is also dependent
on the extremes of a given data set), it has some advantages: (1) it is easily computed; (2) it helps
to reinforce the important concept that there are many ways to define the center of a data set.
The midrange of the times needed by male New York City Marathon runners to complete
the event was 17007.5 seconds, whereas the midrange for female runners was 18972.5 seconds.
These statistics give us yet another way to interpret the data in question: a male runner who
required about 17007.5 seconds to complete the marathon would have finished exactly halfway,
in time, between the fastest and the slowest men in the sample, both of whom required very
unusual periods of time to accomplish the very same thing. Likewise, a female runner who
required about 18972.5 seconds would have finished halfway between the fastest and the
slowest women in the sample.
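Both the range and the midrange depend only on the two extremes, which the following sketch makes explicit. The function name `range_and_midrange` is hypothetical. The minimum and maximum times are those reported in the SPSS output.

```python
def range_and_midrange(minimum: float, maximum: float) -> tuple[float, float]:
    """Return (range, midrange) computed from the two extreme values."""
    return maximum - minimum, (minimum + maximum) / 2


# Minimum and maximum finishing times (seconds) from the SPSS output.
male = range_and_midrange(9631, 24384)
female = range_and_midrange(12047, 25898)

print(male)    # (14753, 17007.5)
print(female)  # (13851, 18972.5)
```

Because no other observation enters the calculation, a single unusual runner can shift both statistics considerably.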
4. Construct as many of the graphical representations of your data as possible. Discuss the results
as they pertain to your set of data.
Histogram: Frequency of Time (Male Marathon Runners)
Histogram: Frequency of Time (Female Marathon Runners)
Through observations of the histograms that depict the frequency distribution of time needed by
male and female runners to complete the New York City Marathon, we can see that the time for
both sexes is approximately normally distributed. More specifically, the distribution of data is,
for the most part, symmetrical, with a slight skew to the right: the mean is somewhat larger
than the median for both males (mean = 15415.23 seconds, median = 14942.00 seconds) and
females (mean = 17198.33 seconds, median = 16792.00 seconds), but the two measures are
quite close to each other.
Boxplot Analysis of Male and Female Marathon Runners
[Boxplot: Time (sec), axis from 0 to 30000, by Gender (M: N = 111; F: N = 39); cases 147, 148, 149, and 150 are marked as outliers]
Boxplot analysis is one of the most useful tools that can help us gain a quick, but very
firm, understanding of the differences in distribution of time that is needed by males and females
to complete the New York City Marathon. Through the boxplot analysis, we can make the
following conclusions:
• The sample median of time needed by male runners to complete the New York City
Marathon is smaller than that of female runners.
• The range of time needed by male runners to complete the New York City Marathon is
slightly larger than that of female runners. In other words, the time needed by male runners
varied slightly more than the time needed by their counterparts. Moreover, if we exclude from
our analysis the four outliers that the male and female samples have, the difference in range
between the two genders becomes even more apparent. By disregarding the four extreme
observations in the given data set, we can see that men overall tend to vary much more than
do women in their abilities to finish the marathon.
• The interquartile range of the time needed by female runners to complete the marathon is
smaller than that of the male runners. What this means is that the middle 50% of the times
needed by the female runners to complete the marathon are clustered closer to the median than
are those of the male runners.
Scatter Plot Analysis : The Effect of Age on Order
Although the main focus of this project is on the effect of gender on time needed to
complete the New York City Marathon, I wished to see if age too significantly affects the
likelihood of a runner finishing faster than his or her competitors. Through this scatter plot, we
can see that the sample does not provide concrete evidence to support this hypothesis: there
seems to be no correlation between age and the order in which a runner finished the marathon.
6. Find a confidence interval for the population mean and write a short paragraph about your
results.
The SPSS-generated descriptive statistics show us that the 95% confidence interval for
the mean time needed by male runners to complete the New York City Marathon is
14844.02 < μ < 15986.45. What this means is that we are 95% confident that the interval from
14844.02 to 15986.45 seconds actually does contain the true value of the average time needed by
the male runners of this particular New York City Marathon.
The SPSS-generated descriptive statistics also show us that the 95% confidence interval
for the mean time needed by female runners to complete the New York City Marathon is
16191.11 < μ < 18205.56. Again, we can be 95% confident that this interval actually does
contain the true value of the population mean. Through the confidence interval, we can make a
reasonable estimation of what the average time needed by the female population of runners to
complete this particular New York City Marathon was.
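The reported intervals can be approximately reproduced from the sample mean, standard deviation, and sample size. The sketch below assumes that SPSS builds the interval as mean ± t × (s / √n). The two critical values are assumptions taken from a standard Student t table (two-sided 95%, 110 and 38 degrees of freedom), and the function name is hypothetical.

```python
from math import sqrt


def mean_confidence_interval(mean: float, sd: float, n: int, t_critical: float):
    """Return (lower, upper) for mean ± t_critical * sd / sqrt(n)."""
    standard_error = sd / sqrt(n)
    margin = t_critical * standard_error
    return mean - margin, mean + margin


# Assumed two-sided 95% critical values from a Student t table.
T_CRIT_DF_110 = 1.982  # n = 111 male runners
T_CRIT_DF_38 = 2.024   # n = 39 female runners

male_ci = mean_confidence_interval(15415.23, 3036.756, 111, T_CRIT_DF_110)
female_ci = mean_confidence_interval(17198.33, 3107.170, 39, T_CRIT_DF_38)

print(f"M: {male_ci[0]:.2f} to {male_ci[1]:.2f} seconds")
print(f"F: {female_ci[0]:.2f} to {female_ci[1]:.2f} seconds")
```

The results agree with the SPSS bounds to within a fraction of a second. The remaining gap comes from rounding the critical values to three decimals.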
7. Make a claim about the data. What is the null and alternate hypothesis?
According to MarathonGuide.com, a website that is devoted to anything relevant to the
sport, the average time needed by American male runners to complete a marathon was 4 hours
41 minutes 32 seconds (16832 seconds), whereas the average time needed by female runners was
5 hours 1 minute 6 seconds (18066 seconds). Furthermore, the standard deviation for men was
1:01:03 (3663 seconds), whereas the standard deviation for women was 1:06:59 (4019 seconds).1
Because the website gathered the statistics from the analysis of all recorded marathon finishing
times in the United States in 2005, the data can be treated as describing the population of U.S.
marathon runners as a whole (which includes New York City Marathon runners and many
more).
Now, the New York City Marathon, along with the Boston Marathon, is one of the most
popular professional marathons in the world. For this reason, it seems safe to assume that the
New York City Marathon attracts some of the world’s most skilled marathon runners. It seems
highly probable that the mean running times of both male and female participants in the event are
lower than those of the average male and female marathoners in the United States. To test this
claim, we should test the following two hypotheses:
(1) The mean time needed by male runners of New York City Marathon to complete the
competition is less than 16832 seconds.
H0 (null hypothesis): μ = 16832
HA (alternative hypothesis): μ < 16832
1 MarathonGuide.com, “USA Marathoning: 2005 Overview,”
[http://www.marathonguide.com/features/Articles/2005RecapOverview.cfm#FinishingTimes], 2006.
(2) The mean time needed by female runners of New York City Marathon to complete the
competition is less than 18066 seconds.
H0 (null hypothesis): μ = 18066
HA (alternative hypothesis): μ < 18066
8. Explain the test you will use and say why.
We are given a rare situation in which we know the standard deviation of the population
being analyzed. We already know that the standard deviations of the time needed for the typical
American male and female marathoners to finish a race are 3663 and 4019 seconds respectively,
and we assume that the same values apply to New York City Marathon runners.
As a result, a hypothesis test that involves the normal distribution will suffice to support the
claim that the outstanding performance of the average New York City Marathon runner is not
due to random chance, which would in turn allow us to infer that the marathon attracts
exceptional individuals. Because we already know the values of σ, we do not have to use
the Student t distribution to test the hypothesis.
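The test statistic for such a test is z = (x̄ − μ0) / (σ / √n), and the left-tailed P-value is the area under the standard normal curve to the left of z. The following sketch shows how these could be computed from the figures quoted above. The function name is hypothetical, and the hypothesized means and standard deviations are the national values exactly as stated in the hypotheses.

```python
from math import sqrt
from statistics import NormalDist


def left_tailed_z_test(sample_mean: float, mu0: float, sigma: float, n: int):
    """Return (z, p_value) for H0: mu = mu0 against HA: mu < mu0, sigma known."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    p_value = NormalDist().cdf(z)  # area to the left of z under N(0, 1)
    return z, p_value


# Sample means and sizes from the SPSS output; mu0 and sigma from the hypotheses.
male_z, male_p = left_tailed_z_test(15415.23, 16832, 3663, 111)
female_z, female_p = left_tailed_z_test(17198.33, 18066, 4019, 39)

print(f"M: z = {male_z:.3f}, P-value = {male_p:.5f}")
print(f"F: z = {female_z:.3f}, P-value = {female_p:.5f}")
```

Each P-value would then be compared with whatever significance level is chosen for the test.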
Conclusion
The analysis of the sample of 150 runners who were selected from the population of
29,733 runners who finished the New York City Marathon in a recent year seems to suggest that
men are generally faster than women are. The mean and median of time needed to complete the
race were lower for men than they were for women. Interestingly enough, however, women tended
to vary less than men did in the Marathon: the box plot analysis of the distribution of time
between men and women seems to suggest that the times needed by women to complete the
marathon were clustered closer to the median than were those needed by men. Nevertheless, we
cannot so hastily come to a conclusion. After all, only 39 out of the 150 runners who were
selected were women. Because women represent only 26% of the sample, we must be careful in
our generalization of whether gender can significantly affect the time needed by a runner to
finish the marathon.
Bibliography
MarathonGuide.com. “USA Marathoning: 2005 Overview.”
[http://www.marathonguide.com/features/Articles/2005RecapOverview.cfm#FinishingTimes]. 2006.