ACTIVITY SET 1 Jan

Displaying Data - Solutions
1.1 Suppose a medical researcher compares the average blood pressures of women who take oral
contraceptives to the blood pressures of women who do not.
a. Is blood pressure a categorical variable or a quantitative variable?
Quantitative; we would measure blood pressure. If we categorized the values (e.g. low, medium, high)
then this would be a categorical variable.
b. Is oral contraceptive use (or not) a categorical variable or a quantitative variable?
Categorical; data produced from yes/no questions are categorical.
c. What variables that affect blood pressure might confuse the comparison of average blood pressures for
users and nonusers? That is, what factors affecting blood pressure might differ for users and nonusers.
Explain.
Answers will vary. One possibility is age. Age affects blood pressure, and there may also be age differences
between users and non-users of oral contraceptives.
1.2 A statistics class at UC Davis was asked, “About how many hours do you watch television per week?”
A five-number summary of the responses from 173 students follows.
Median      6
Quartiles   2      12.5
Extremes    0      100
a. What were the median hours of weekly television watching? In the context of this situation, write a
sentence that interprets the median.
Median = 6. About 50% of the class watched 6 or fewer hours per week.
b. Give the value that completes the following sentence. About 1/4 of the students watch less than ___
hours of television per week.
2 (lower quartile, Q1) since Q1 represents the first quartile which is the value where 25% of the data falls
at or below.
c. Give the value that completes the following sentence. About 1/4 of the students watch more than ___
hours of television per week.
12.5 (upper quartile, Q3) since Q3 represents the third quartile which is the value where 75% of the data
falls at or below, and conversely 25% watch at or above.
d. What is an interval that describes the middle 1/2 of the students’ television watching amounts?
2 to 12.5 (between the quartiles Q1 and Q3)
e. The mean for these data is 8.9 hours per week.
How do you think the mean is calculated? Sum of all values divided by total number of values
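Why do you think it is larger than the median in this instance? There are probably one or more high outliers (the maximum is 100 hours). Large values pull the mean up but barely affect the median.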
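The pull of an outlier on the mean can be illustrated with a short Python sketch. The hours below are hypothetical, made up for illustration, and are not the class responses; only the extreme value of 100 echoes the summary above.

```python
from statistics import mean, median

# Hypothetical weekly TV hours (NOT the actual class data).
hours = [0, 2, 4, 6, 8, 12, 14]

# Same values plus one extreme response, like the maximum of 100 above.
hours_with_outlier = hours + [100]

# The median barely moves, but the mean is pulled toward the outlier.
print("without outlier:", median(hours), mean(hours))
print("with outlier:   ", median(hours_with_outlier), mean(hours_with_outlier))
```

With one extreme value added, the mean ends up well above the median, which is the same pattern as the mean of 8.9 against the median of 6 in this problem.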
1.3 In ANGEL on the Lessons page, access the Datasets folder. Within this folder, click on the link for
the data set named U.S. Smoking (Minitab file). This should cause a program named Minitab to open,
with the data in place. The data are estimates of the percentage of adults who smoke in each state of the
U.S. (and also District of Columbia).
a. In Minitab, use Graph>Stem-and-Leaf to create a stemplot of the percents that smoke in the 50 states
and Washington D.C. In the dialog box, double click on the name of the second column to enter it as the
variable you want to plot.
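Stemplot (columns: cumulative count, stem, leaves; stems are tens and leaves are ones, so 1 | 2 is 12%):
   1   1  2
   1   1
   2   1  6
   5   1  999
  15   2  0001111111
 (18)  2  222222222233333333
  18   2  44444
  13   2  6666666777
   3   2  89
   1   3
   1   3  2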
About where do most states fall, in terms of percent smoking? In the 22 to 23% smoking range
About what is the lowest percent in the dataset? From the stemplot it is 12%, corresponding to the 12.7% for
UTAH
About what is the highest percent in the dataset? 32% corresponding to the 32.6% for KENTUCKY
What do you notice about the values in the worksheet and the values displayed in the stemplot? The
values in the stemplot have been truncated. That is, the decimal place has been dropped from the data in
the worksheet when creating the stemplot.
How would you describe the shape of this data? Roughly symmetric
b. Use Stat>Basic Statistics>Display Descriptive Statistics to determine summary statistics for the
percents. As in part a, double click the name of the second column to list it as the variable we’re
analyzing. Inspect the output to find these values:
Mean Percent = 23.353
Standard Deviation = 3.327
Median percent = 23.100
lower quartile (denoted by Q1) = 21.500
upper quartile (denoted by Q3) = 26.000
c. Write a sentence that interprets the median in the context of this situation.
One interpretation: The smoking percent is 23.1% or less in about one-half of the states (or conversely,
the smoking percent is 23.1% or higher in about one-half of the states)
d. What value completes the following sentence? In about 1/4 of the states, the percent that smokes is less
than 21.5 (which is Q1).
e. What interval includes the middle 1/2 of the values of the state smoking percentages?
21.5 to 26% (between the Q1 and Q3)
f. Use Calc>Calculator to manipulate the data in column 2. In the Store Result window, type
‘Plus10’. In the Expression window, double click on the name of the second column and use the
calculator pad to add 10 (click the ‘+’, then 1, then 0), and click OK. Repeat this step, but in the Store
Result window enter ‘Times10’ and in the Expression window change the ‘+’ to ‘*’. Again find the
Descriptive Statistics to get the mean and standard deviation for the original data (column 2) as well as
the new data in columns 3 and 4. For ease, enter all three variables into the Variables box at once.
What do you notice about the changes in the mean and standard deviation from the original to the new
data?
Variable            Mean      StDev
Percentage of Ad    23.353    3.327
Plus 10             33.353    3.327
Times 10            233.53    33.27
Note that when you add a constant to every value (in this case 10), the mean increases by that constant but the
standard deviation remains the same. This is because adding a constant in effect only shifts
the location of the center (mean) by the constant; the spread of the values is unchanged. But when you multiply all values by a
constant (in this case 10), both the mean and the standard deviation are multiplied by that constant.
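The same shift-and-scale behavior can be checked outside Minitab. The following is a minimal Python sketch, assuming a small stand-in list of five values borrowed from the summaries above rather than the full 51-row column; the names plus10 and times10 simply mirror the new Minitab columns.

```python
from statistics import mean, stdev

# Stand-in values only (NOT the full U.S. Smoking data set).
percents = [12.7, 21.5, 23.1, 26.0, 32.6]

plus10 = [x + 10 for x in percents]    # like the 'Plus10' column
times10 = [x * 10 for x in percents]   # like the 'Times10' column

# Adding 10 shifts the mean and leaves the standard deviation alone;
# multiplying by 10 multiplies both the mean and the standard deviation by 10.
for name, col in [("original", percents), ("plus10", plus10), ("times10", times10)]:
    print(f"{name:9s} mean = {mean(col):8.3f}   stdev = {stdev(col):7.3f}")
```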
1.4 Car and truck speeds at a particular location have approximately a bell-shaped distribution with mean = 65
mph and standard deviation = 5 mph. By the Empirical Rule, since the mean here is 65 and s = 5, about 68% of the observations will fall
between 65 +/- 5. Continuing, for 95% this will be the mean +/- 2s and for 99.7% this will be the mean
+/- 3s.
a. About 68% of cars and trucks travel between ___60____ and ___70____ at this location. Since mean here
is 65 and s = 5, then 68% of the observations will fall between 65 +/- 5.
b. About 95% of cars and trucks travel between ___55___ and ____75____at this location. [Mean +/- 2s]
c. About 99.7% of cars and trucks travel between___50___ and ____80____at this location. [Mean +/- 3s]
d. A z-score is a measure of how many standard deviations a value is from the mean. Later in the course, we
will see that it is an important measure of the size of a value.
The formula is Z = (Observed Value - Mean) / Standard deviation.
Determine a z-score for a vehicle speed of 72 mph. Plugging the numbers into the formula:
Z = (72 - 65) / 5 = 1.4
e. What vehicle speed has a z-score = −1? Said another way, what vehicle speed is one standard deviation
below the mean? (You will need to do some algebra to solve for Observed Value)
Solving the z-score formula for the observed value (OV) gives:
OV = Mean + (S * Z). Using this equation to answer the question, we plug in:
OV = 65 + (5)*(-1) = 60. So a speed of 60 mph has a z-score of -1.
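The arithmetic in parts a through e can be reproduced with a few lines of Python. This is only a sketch: the function names empirical_interval, z_score, and observed_value are made up for illustration, and the mean and standard deviation are the ones given in the problem.

```python
MEAN = 65.0  # mph, from the problem statement
SD = 5.0     # mph, from the problem statement

def empirical_interval(k):
    """Return (low, high) for mean +/- k standard deviations."""
    return (MEAN - k * SD, MEAN + k * SD)

def z_score(observed):
    """Number of standard deviations the observed value lies from the mean."""
    return (observed - MEAN) / SD

def observed_value(z):
    """Invert the z-score formula: OV = Mean + S * Z."""
    return MEAN + SD * z

# Parts a-c: about 68%, 95%, and 99.7% of speeds fall in these intervals.
for k, pct in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    print(pct, empirical_interval(k))

print(z_score(72))         # part d
print(observed_value(-1))  # part e
```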
1.5 Open the Class Survey (Minitab File) data file from the Datasets folder in ANGEL on the Lessons
page. These data are from a survey given to students in my Stat200 courses last semester. You are a researcher
and want to use this class survey data to research how PSU undergraduate students compare to these national
averages.
a. The purpose of most statistical studies is to use the sample data to generalize to a larger group. What do you
think are the weaknesses of using this class survey data for generalizing to all PSU undergraduate students?
In order to generalize to a larger group, i.e. all PSU undergraduate students, your sample group needs
to be representative of this larger group. That is, the makeup of your sample group should reflect
that of the larger group. For instance, does our class reflect the undergraduates as a whole with
regard to the percentage of females and racial composition? The best way to obtain a representative sample would be
through random sampling.
b. (Importance of checking data.) Compute the Descriptive Statistics for SATM (C16) and SATV (C17).
Note the minimum and maximum value for each. (REMEMBER: Stat > Basic Statistics > Display Descriptive
Statistics. Enter SATM and SATV together into the Variables window.)
i. From the output, what does the * represent? The * represents missing data and the N*
represents the number of missing data points for that variable. That is, 10 students
did not provide an answer to SATM and 11 students did not answer SATV.
ii. How many students answered the question regarding their SATM and SATV scores?
SATM 216
SATV 215
c. Now find the Descriptive Statistics for SATM and SATV by Gender (Repeat what you did for part b but
now enter Gender in the By Variable window) and use the output to answer the following:
Group          Q1        Q3        IQR
Female SATM    537.50    650.00    112.50
Female SATV    530.00    620.00     90.00
Male SATM      570.00    670.00    100.00
Male SATV      525.00    645.00    120.00
d. Using the 5-number summary, a data point is considered an outlier on a boxplot if it is either larger than
Q3 + (1.5*IQR), or smaller than Q1 - (1.5*IQR). Calculate and identify any outliers for the Female group.
SATM: Calculate the value of Q3 + (1.5*IQR) = 650 + 168.75 = 818.75
SATM: Calculate the value of Q1 - (1.5*IQR) = 537.50 - 168.75 = 368.75
SATV: Calculate the value of Q3 + (1.5*IQR) = 620 + 135 = 755
SATV: Calculate the value of Q1 - (1.5*IQR) = 530 - 135 = 395
f. Based on the Descriptive Statistics you calculated by Gender, answer the following:

How do the SAT scores from our survey compare across gender? Do you believe that any
differences are significant? That is, do you think these differences are large enough that statistically
they are different? For both the Math and Verbal sections of the SAT, the males
reported higher mean scores. However, since both of the female means are within the
IQR of the males, I doubt that the difference is significant. Again, we will study later this
semester how to determine if these differences are statistically significant or just represent
small, but not noteworthy, differences.
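The 1.5*IQR cutoffs from part d can be computed with a small helper. This is a sketch only: the name tukey_fences is made up for illustration, and the quartiles are the Female values from the part c output.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier cutoffs: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return (q1 - k * iqr, q3 + k * iqr)

# Female quartiles (Q1, Q3) from the part c output.
female = {"SATM": (537.50, 650.00), "SATV": (530.00, 620.00)}

for name, (q1, q3) in female.items():
    low, high = tukey_fences(q1, q3)
    print(f"{name}: outlier if below {low} or above {high}")
```

Any Female score outside these cutoffs would be drawn as a * on the boxplot; comparing them with the minimum and maximum in the Minitab output shows whether any such points exist.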
1.6 Staying with the Class Survey (Minitab File). In column C20 Book Cost are the responses to how
much students expected to pay for books that semester.
a. Use Graph>Histogram click on Simple, and then enter Book Cost in the Variables box to draw a
histogram. Use the mouse to identify in the graph the characteristics of the various bars in the histogram.
Do this to complete the following sentences.
The most frequently reported amount spent was between 275 and 325. Of the 226 students, 60 students
said they spent that much.
The second most frequently reported amount spent was between 375 and 425.
b. Using Minitab, draw a boxplot of the Book Cost (Use Graph>Boxplot, select “Simple” and then enter
Book Cost in the variable window). The boxplot provides a graph of the 5-number summary for a set of
data. By placing the mouse pointer over the “box” a pop-up will appear displaying part of the 5-number
summary.

What does the * represent in a boxplot? Each * identifies an outlier based on the 1.5*IQR rule.

How many * are there for the variable Book Cost? 12

What are the outlier values? [place your mouse over the * to see the value]
100, 110, 140, 600, 800

What is the 5-number summary for Book Cost?
Minimum = 100
Q1 = 300
Median = 350
Q3 = 400
Maximum = 800

The shape of the data represented by the box plot can be determined by the location of
the median bar in the box and by comparing the length of the “whiskers” – the two lines
that extend from either end of the box. If the median is in the center and the whiskers are
of roughly equal length then the data is symmetrical. If the median is near the bottom of
the box and the top whisker is longer, then the distribution is said to be skewed to the
right or positively skewed. If the median is near the top of the box and the bottom
whisker is longer, then skewed to the left or negatively skewed. What is the shape of
Book Cost based on the boxplot? Does this concur with how you would interpret the
histogram?
The shape is symmetrical and yes the shapes are similar. Keep in mind that the larger
the number of observations in your data set the less influenced the shape becomes by a
few outliers.
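The Book Cost answers above can be cross-checked from the 5-number summary alone. The following Python sketch applies the 1.5*IQR rule to the reported outlier values and measures where the median sits inside the box; the variable names are chosen for illustration, and the whisker lengths cannot be checked this way because they depend on the raw data.

```python
# Five-number summary for Book Cost, as reported above.
minimum, q1, med, q3, maximum = 100, 300, 350, 400, 800

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Outlier values read off the boxplot in the answer above.
reported = [100, 110, 140, 600, 800]
flagged = [x for x in reported if x < lower_fence or x > upper_fence]

# Median position inside the box: 0.5 means exactly in the center.
median_position = (med - q1) / iqr

print("fences:", lower_fence, upper_fence)
print("values outside the fences:", flagged)
print("median position in box:", median_position)
```

All five reported values fall outside the fences, and the median sits in the center of the box, which agrees with describing the shape as symmetrical.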