Solution: Summarizing and Describing Data – Use Height_Exam data set
1. Use Minitab to make a histogram (Graph > Histogram) of the variable Height. Copy and paste the histogram below.
[Figure: Histogram of Height. Horizontal axis: Height, ticks 60 to 76. Vertical axis: ticks 0 to 14.]
2. How would you describe the shape of the histogram?
The histogram shows a clear bell-shaped distribution for the variable Height.
3. Based on the shape of the histogram, would you expect the mean and median to be somewhat similar or quite different?
Since the data is bell shaped, we would expect the mean and median to be relatively similar.
4. Use Minitab to find the mean and median (Stat > Basic Statistics > Display Descriptive Statistics and do NOT change any of the default statistics selected) of the variable Height. What are the mean and median values and do they agree with what you expected in question 3?
Descriptive Statistics: Height
Variable   N  N*    Mean  SE Mean  StDev  Minimum      Q1  Median      Q3  Maximum
Height    64   0  66.938    0.465  3.719   57.983  64.195  66.898  70.010   76.176
As expected, the mean (66.938) and median (66.898) are very similar.
5. According to the Empirical Rule (also called the 68-95-99.7 Rule), for bell shaped data we would expect about 68% of all observations to fall within one standard deviation of the mean; about 95% to fall within two standard deviations of the mean; and about all (99.7%) to fall within three standard deviations of the mean. There are a total of 64 height values in the data set.
Using the standard deviation (refer back to the Minitab output you produced for question 4), apply the Empirical Rule to the Height data by completing the table below. [For simplicity, round the mean to the nearest whole number and the SD to one decimal place. Also, you may want to copy/paste the Height data into Excel to sort. This will make counting the values within the range easier.]
| | 68% | 95% | 99.7% |
|---|---|---|---|
| Replace mean and SD with values | 67 ± 3.7 | 67 ± 2*3.7 | 67 ± 3*3.7 |
| Compute the range (lower and upper values) from prior column | 63.3 to 70.7 | 59.6 to 74.4 | 55.9 to 78.1 |
| The number of Height values we would expect to find in this range | 68% of 64 = 43.5 | 95% of 64 = 60.8 | 99.7% of 64 = 63.8 |
| The actual number of Height values that fell in this range | 42 | 60 | 64 |
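The ranges and expected counts in the table can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name `empirical_rule_table` is made up for illustration, and the actual counts (42, 60, 64) still have to come from sorting the Height data, which is not reproduced here.

```python
def empirical_rule_table(mean, sd, n):
    """Return one row per Empirical Rule band: (k, lower, upper, expected count)."""
    rows = []
    # k standard deviations paired with the share of data the rule predicts
    for k, share in ((1, 0.68), (2, 0.95), (3, 0.997)):
        lower = round(mean - k * sd, 1)
        upper = round(mean + k * sd, 1)
        expected = round(share * n, 2)
        rows.append((k, lower, upper, expected))
    return rows


# Rounded values from the Minitab output: mean 67, SD 3.7, n = 64
height_rows = empirical_rule_table(mean=67, sd=3.7, n=64)
for k, lower, upper, expected in height_rows:
    print(f"within {k} SD: {lower} to {upper}, expect about {expected} of 64 values")
```

Calling the same function with `mean=88, sd=11, n=32` gives the corresponding ranges for the Exam data.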
6. From question 4, what is the five number summary for the variable Height?
Minimum: 57.983 Q1: 64.195 Median: 66.898 Q3: 70.01 Maximum: 76.176
7. As this summary relates to the variable Height, what is the interpretation of the first and third quartiles?
For the first quartile, Q1, we would expect about 25% of the heights to fall at or below 64.195 inches, and for the third quartile, Q3, we would expect about 75% of the heights to fall at or below 70.01 inches.
8. How well does the Height data comply with these quartiles? That is, of the total number of height values, how many would we expect to fall at or below these quartiles compared to how many actually fell at or below these quartiles?
| Quartile | First (Q1) | Third (Q3) |
|---|---|---|
| How many we expected | 16 | 48 |
| How many we observed | 16 | 48 |
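The expected counts follow directly from the definition of the quartiles. A minimal sketch of that arithmetic, assuming only n = 64 from the Minitab output (the variable names are illustrative):

```python
n = 64  # number of Height values, from the Minitab output

# By definition, about 25% of values lie at or below Q1 and about 75% at or below Q3.
expected_at_or_below = {"Q1": 0.25 * n, "Q3": 0.75 * n}

for quartile, count in expected_at_or_below.items():
    print(f"{quartile}: expect about {count:.0f} of {n} values at or below")
```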
9. Repeat the histogram for the variable Exam. Copy and paste the histogram below and describe the shape and the relationship between the mean and median.
[Figure: Histogram of Exam. Horizontal axis: Exam, ticks 60 to 100. Vertical axis: ticks 0 to 14.]
Notice how the "bulk" of the data gathers to the right and is pulled to the left. The graph is skewed left (or negatively skewed). The mean would be less than the median in such a distribution.
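To see why a left skew pulls the mean below the median, here is a small sketch using a made-up set of scores. These eight values are hypothetical and are NOT the Exam data; they only mimic a shape with most values high and a few low ones in the left tail.

```python
from statistics import mean, median

# Hypothetical left-skewed scores (not the actual Exam data):
# most values cluster near the top, with a short tail of low scores.
scores = [60, 75, 85, 90, 92, 95, 96, 98]

skew_mean = mean(scores)      # dragged down by the low scores in the tail
skew_median = median(scores)  # depends only on the middle of the sorted list

print(f"mean = {skew_mean}, median = {skew_median}")
print("mean < median:", skew_mean < skew_median)
```

The few low scores lower the mean but leave the median nearly untouched, which is the pattern described for the Exam histogram.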
10. Now let’s apply the Empirical Rule to a data set that is not symmetrical. For the Exam data the mean is 88 with 11 as the standard deviation and there are 32 observations. Using this information, complete the following table. [For simplicity, round the mean and SD to one decimal place. Also, you may want to copy/paste the Exam data into Excel to sort. This will make counting the values within the range easier.]
| | 68% | 95% | 99.7% |
|---|---|---|---|
| Replace mean and SD with values | 88 ± 11 | 88 ± 2*11 | 88 ± 3*11 |
| Compute the range (lower and upper values) from prior column | 77 to 99 | 66 to 110 | 55 to 121 |
| The number of Exam values we would expect to find in this range | 68% of 32 = 21.76 | 95% of 32 = 30.4 | 99.7% of 32 = 31.9 |
| The actual number of Exam values that fell in this range | 23 | 31 | 32 |
Notice that the part of the Empirical Rule that is most affected (though not terribly) is that of the 68%. For 95% and 99.7% the rule is quite accurate. This is commonly what occurs for skewed distributions.
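One way to compare the two data sets is to convert the actual counts into percentages and set them beside the nominal 68%, 95%, and 99.7%. The sketch below uses only the counts reported in the two tables; the helper name `observed_shares` is made up for illustration.

```python
NOMINAL = (68.0, 95.0, 99.7)  # Empirical Rule percentages


def observed_shares(observed_counts, n):
    """Convert counts within 1, 2, and 3 SDs of the mean into percentages of n."""
    return [100 * count / n for count in observed_counts]


# Actual counts taken from the completed tables above
height_shares = observed_shares([42, 60, 64], n=64)
exam_shares = observed_shares([23, 31, 32], n=32)

for name, shares in (("Height", height_shares), ("Exam", exam_shares)):
    for nominal, actual in zip(NOMINAL, shares):
        print(f"{name}: rule says {nominal}%, observed {actual:.1f}% "
              f"(difference {actual - nominal:+.1f} points)")
```

For both variables the largest gap from the rule appears in the one-standard-deviation band.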
11. Outliers and side-by-side boxplots. Outliers are "extreme observations" for a set of data, but how does one determine what is extreme? Boxplots help us identify such observations, and side-by-side boxplots are very useful when we want to display quantitative data across levels of a categorical variable, e.g. heights by gender. Create the side-by-side boxplots for Height by Gender. Click Graph > Boxplot > With Groups. Select Height for the "Graph Variables" and Gender for the "Categorical Variables" and click OK. Copy and paste your graph below and answer the following questions related to the graph.
[Figure: Boxplot of Height by Gender (Female, Male). Vertical axis: ticks 60 to 76.]
A. The '*' symbol in the Female boxplot represents an outlier. By placing your mouse over this symbol in Minitab you can determine what the outlier value is and the data row in which it is located. What is the value and data row?
Row = 63, Height = 57.98
B. How was this observation determined to be an outlier? To do this, read the online notes for "Finding Outliers Using IQR" and apply this technique to demonstrate why this observation would be considered an outlier [NOTE: after reading the online notes you can then place your mouse over the "box" part of the boxplot in Minitab to get the needed data].
Q1= 62.8557 IQR = 3.189
Q1 - 1.5*IQR = 62.8557 - 1.5*3.189 = 62.8557 - 4.7835 = 58.07
Since the Height of 57.98 in row 63 is less than 58.07 it is considered an outlier.
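The same check can be written as a short Python sketch. It uses the Q1 and IQR values read from the Female boxplot above; the function name `lower_fence` is illustrative, and the multiplier 1.5 is the one used in the calculation above.

```python
def lower_fence(q1, iqr, k=1.5):
    """Lower outlier boundary: observations below Q1 - k*IQR are flagged."""
    return q1 - k * iqr


# Values read from the Female boxplot in Minitab
q1_female = 62.8557
iqr_female = 3.189

fence = lower_fence(q1_female, iqr_female)
height_row_63 = 57.98
is_outlier = height_row_63 < fence

print(f"lower fence = {fence:.4f}")
print(f"57.98 flagged as outlier: {is_outlier}")
```

The matching upper boundary would be Q3 + 1.5*IQR; no Female height exceeded it in the boxplot, so only the lower boundary matters here.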
C. In comparing the two boxplots, how would you describe the center and spread of the heights by gender? More specifically, which gender has the greater mean and median height, and what about spread, specifically the variance and standard deviation?
The mean and median are clearly greater for the males. The mean and median values, respectively, are: Males: 69.59 and 69.86; Females: 64.29 and 64.44.
The spread, on the other hand, is very similar. The variances and standard deviations (recall the latter is the square root of the variance!) are:
Males: 7.191 and 2.682
Females: 6.447 and 2.539
This concept of "equal variances" will be discussed in later topics.
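As a quick check of the relationship between the two measures of spread, the sketch below takes the square root of each reported variance and compares it with the reported standard deviation. The dictionary names are illustrative; the numbers are the ones given above.

```python
import math

# Variances and standard deviations as reported above
variances = {"Males": 7.191, "Females": 6.447}
reported_sds = {"Males": 2.682, "Females": 2.539}

# The standard deviation is the square root of the variance
computed_sds = {group: math.sqrt(var) for group, var in variances.items()}

for group, sd in computed_sds.items():
    print(f"{group}: sqrt({variances[group]}) = {sd:.3f}, reported {reported_sds[group]}")
```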