Homework #3 Solutions

PHCO 0504 – Introduction to Biostatistics
Homework #3 - Solutions
This homework will not be collected. However, please do it conscientiously as the
material in this homework will be included on the midterm. Homework solutions will be
posted on Monday. If you do not get to the homework before Monday, I would strongly
encourage you to do the homework before looking at the solutions. You will not learn the
material as well if you look at the solutions without trying to complete the homework
problems first!
1. The following are the ages of 60 patients seen in the emergency room of a hospital on
a Friday night. (Same data set as question 2 from Homework #2.)
36 45 36 22 32 12 23 45 38 21 54 64 55 35 43 45 10 44 56 39 37
34 55 45 60 53 22 46 51 48 15 50 13 52 41 27 31 32 42 60 39 38
21 28 26 51 63 46 43 41 52 22 59 34 38 28 45 51 11 77
a. Do you expect the empirical rule to apply to this set of data? Why or why
not? (You may need to recall information generated for your last set of
homework.)
Since the mean and the median are not equal, the distribution of this set of
data is not symmetric, so the empirical rule will not hold exactly. The
empirical rule applies only to data with an approximately normal (symmetric,
bell-shaped) distribution.
(The point of this question is to know how to distinguish whether a
distribution looks “normal.”)
However, since there is a slight bell shape to the curve (albeit a little skewed),
we might expect the empirical rule to provide a very rough approximation. In
general, the farther the distribution is away from “normal,” the worse the
approximation by the empirical rule will be.
b. Using the value of the standard deviation, calculate the interval given by the
sample mean, plus or minus one standard deviation. According to the
empirical rule, approximately what percent of the data should fall within this
interval? What percent actually do fall within this interval?
The interval given by the sample mean, plus or minus one standard deviation
is (39.70 – 14.63, 39.70 + 14.63) = (25.07, 54.33).
According to the empirical rule, approximately 68% of the data should fall
within this interval.
To answer the last question, we need to count how many observations fall
within the interval above (ordering the data from smallest to largest makes
this easier), and then divide the count by the total number of observations,
which is 60 in this problem. There are 40 observations within this interval,
so the actual percent of the sample that falls within this interval is 40/60 =
66.67%.
In general, PLEASE do not round off these types of numbers to whole
integers! As we discussed in class, it doesn’t necessarily make sense to do so!
For example, a mean of integer numbers (1, 2, 3,…) is typically NOT an
integer! If you obtain a “1” with probability of ½ and a “2” with a
probability of ½, the mean is NOT a 1 or a 2; the mean is 1½=(1+2)/2.
Also, note in this example that the integer ages presented in the data set are
actually rounded-off numbers. Only for a single moment in one’s lifetime is
someone exactly 16 years old.
c. Repeat part b, using plus or minus two standard deviations.
The interval given by the sample mean, plus or minus two standard deviations
is (39.70 – 14.63 2, 39.70 + 14.63 2) = (10.44, 68.96).
According to the empirical rule, approximately 95% of the data should fall
within this interval.
There are 58 observations within this interval, so the actual percent of the
sample that falls within this interval is 58/60 = 96.67%.
Please see the caution about rounding off to integer numbers above.
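The counts in parts b and c can be checked with a short script. This is only a sketch using Python's standard library; the helper name share_within is made up for illustration, and the sample standard deviation (n - 1 in the denominator) is assumed to be the one quoted above.

```python
from statistics import mean, stdev

# Ages of the 60 emergency room patients from problem 1.
ages = [
    36, 45, 36, 22, 32, 12, 23, 45, 38, 21, 54, 64, 55, 35, 43, 45, 10, 44, 56, 39, 37,
    34, 55, 45, 60, 53, 22, 46, 51, 48, 15, 50, 13, 52, 41, 27, 31, 32, 42, 60, 39, 38,
    21, 28, 26, 51, 63, 46, 43, 41, 52, 22, 59, 34, 38, 28, 45, 51, 11, 77,
]

def share_within(data, k):
    """Return (low, high, count) for the interval mean ± k sample standard deviations."""
    m, s = mean(data), stdev(data)  # stdev uses the n - 1 denominator
    low, high = m - k * s, m + k * s
    count = sum(low < x < high for x in data)
    return low, high, count

for k in (1, 2):
    low, high, count = share_within(ages, k)
    share = count / len(ages)
    print(f"k={k}: ({low:.2f}, {high:.2f}) contains {count}/{len(ages)} = {share:.2%}")
```

The percentages are printed with two decimals rather than rounded to whole numbers, in line with the caution about rounding above.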
d. Calculate and interpret a 95% confidence interval for the mean age of patients
seen in the emergency room. (Note: The validity of this interval is based on
the Central Limit Theorem, requiring a “large” sample size. With 60 patients
it is reasonable to say that we have a large sample. However, it is possible
that this is not really an independent, random sample. This is particularly true
if we collected this information from all subjects at an ER on a particular
Friday night and we are interested in all Friday nights over the course of a
year. In this case, the assumption that we have an independent, random
sample is violated.)
A 95% confidence interval for the mean age of patients seen in the emergency
room is (39.70 – 1.96*14.63/sqrt(60), 39.70 + 1.96*14.63/sqrt(60)) =
(35.998, 43.402). Because the assumption of an independent, random sample is
violated, this sample cannot be taken to represent the general population. We
can only make inferences about this particular sample. I am 95% confident that
the mean age of patients seen in the emergency room of a particular hospital on
a particular Friday night falls between 35.998 and 43.402.
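The arithmetic behind this interval can be reproduced as follows. This is a minimal sketch; the function name z_interval is an illustrative choice, and it simply applies mean ± 1.96 * SD / sqrt(n) with the rounded summary values quoted above.

```python
from math import sqrt

def z_interval(sample_mean, sd, n, z=1.96):
    """Large-sample confidence interval: sample_mean ± z * sd / sqrt(n)."""
    margin = z * sd / sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Summary values from problem 1: mean 39.70, standard deviation 14.63, n = 60.
low, high = z_interval(39.70, 14.63, 60)
print(f"95% CI for the mean age: ({low:.3f}, {high:.3f})")
```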
2. The National Health and Nutrition Examination Survey of 1976-80, based on a
random sample of men from the general U.S. population, found that the mean serum
cholesterol level for U.S. males aged 20-74 years was 211. The standard deviation
was approximately 90.
Consider a new experiment in which a sample of size 50 is drawn from the
population of all US males between 20 and 74 years old.
a. Which numbers (whether available or unavailable) would be used to graph the
population distribution for this example?
The serum cholesterol levels for all U.S. males aged 20-74 years.
On the x-axis would be the serum cholesterol levels; on the y-axis would be
the relative frequencies.
b. Which numbers would be used to graph the sample distribution for this
example?
Serum cholesterol level for the 50 U.S. males aged 20-74 years from the
current sample. If we were to use a histogram, the x-axis would be divided
into intervals representing the serum cholesterol levels; the y-axis would
represent the counts or relative frequencies in each interval.
c. With the observations available, can we graph the sampling distribution?
Why or why not?
We cannot graph the sampling distribution because we have only one sample, and
therefore only one sample mean. We would need many sets of data from repeated
experiments. Recall that the sampling distribution is the distribution of all
possible values of the statistic, in this case all possible values of the
sample mean.
d. What is the mean of the sampling distribution? What is the standard error?
Interpret the value for the standard error. What is the approximate shape of
the sampling distribution?
The mean for the sampling distribution has the same value as the population
mean, 211, because the sampling distribution is the distribution of the sample
mean, which is an unbiased estimator of the population mean.
The standard error is the population standard deviation divided by the square
root of the sample size, that is, 90/sqrt(50) = 12.7279.
The standard error tells us how far, on average, the sample mean from a sample
of size 50 is expected to fall from the population mean.
The approximate shape of the sampling distribution would be bell-shaped,
Gaussian (normal) distribution, due to the Central Limit Theorem.
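Although the sampling distribution cannot be graphed from a single sample, it can be approximated by simulating the repeated experiments. The sketch below is illustrative only: it assumes that individual cholesterol levels follow a normal distribution with mean 211 and standard deviation 90 (the problem does not state the population shape), and the number of replicates and the seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import mean, stdev

POP_MEAN, POP_SD, N = 211, 90, 50  # values given in problem 2
REPLICATES = 2000                  # assumption: number of simulated experiments

rng = random.Random(0)             # fixed seed so the run is reproducible

# Assumption: individual cholesterol levels are drawn from a normal population.
# Each replicate draws a sample of size N and records its sample mean.
sample_means = [
    mean(rng.gauss(POP_MEAN, POP_SD) for _ in range(N))
    for _ in range(REPLICATES)
]

theoretical_se = POP_SD / sqrt(N)
print(f"theoretical standard error = {theoretical_se:.4f}")
print(f"mean of simulated means    = {mean(sample_means):.2f}")
print(f"SD of simulated means      = {stdev(sample_means):.2f}")
```

The mean of the simulated sample means should land close to 211, and their standard deviation close to 90/sqrt(50); a histogram of sample_means would show the bell shape described above.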
3. The study cited in the previous problem also looked at a subpopulation, men aged
20-24 years. The average reported serum cholesterol level was 180. The standard
deviation was approximately 43.
a. Explain why you might expect the standard deviation for this target
population to be so much smaller than that of the target population in the
previous problem.
The subpopulation of men aged 20-24 years includes a group of men that
are much more similar to each other than the group of men between 20
and 74 years old. Because they are more similar, there will be smaller
differences between them and the spread of serum cholesterol levels will
be much smaller.
b. Suppose you randomly selected and sampled a particular man who was 23
years old and measured his serum cholesterol level to be 287.5. Assuming
that the serum cholesterol levels are distributed according to a Gaussian
(normal) distribution, using normal probabilities, describe where this man
falls in relation to the rest of the population. (How many standard
deviations is he above the mean? Does his value fall within the same
interval that 68% of the population falls? 95% of the population? Is this
man unusual with respect to his cholesterol level?)
The z-score is (287.5 – 180)/43 = 2.5.
So, his serum cholesterol level is 2.5 standard deviations above the mean.
From the normal table, the probability of falling within 2.5 standard
deviations of the mean is 98.76%. His value therefore falls outside both the
interval that contains 68% of the population (within one standard deviation
of the mean) and the interval that contains 95% of the population (within
about two standard deviations). This man is unusual with respect to his
cholesterol level.
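The table lookup can be checked with the standard library's NormalDist class. This is a sketch under the same normality assumption stated in the question; the variable names are illustrative.

```python
from statistics import NormalDist

# Values from problem 3b: observed level 287.5, mean 180, standard deviation 43.
z = (287.5 - 180) / 43

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
within = std_normal.cdf(z) - std_normal.cdf(-z)  # P(-z < Z < z)
above = 1 - std_normal.cdf(z)                    # P(Z > z)

print(f"z-score = {z}")
print(f"P(within {z} SDs of the mean) = {within:.4f}")
print(f"P(a value this high or higher) = {above:.4f}")
```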
c. Suppose that you draw a simple random sample of size 60 from
the subpopulation of men aged 20 to 24 years. Find the probability that
the sample mean serum cholesterol level will be between 170 and 190.
We can write the probability as P(170 < Ȳ < 190), where Ȳ is the sample
mean. After standardizing it, we have
P((170 – mean)/SEM < (Ȳ – mean)/SEM < (190 – mean)/SEM)
= P(-1.8014 < Z < 1.8014), (SEM = 43/sqrt(60) = 5.551276, mean = 180)
where Z follows a standard normal distribution.
Then, we can find the probability from the normal table, which is
approximately 92.81%.
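The same probability can be computed directly rather than read from a table. The sketch below assumes, as in the solution, that the sample mean is normally distributed with mean 180 and standard deviation equal to the SEM. Because it uses the unrounded z-value, its result may differ from the table value of 92.81% in the last digit or two.

```python
from math import sqrt
from statistics import NormalDist

# Values from problem 3: population mean 180, standard deviation 43, n = 60.
mean_chol, sd_chol, n = 180, 43, 60

sem = sd_chol / sqrt(n)  # standard error of the mean
sampling_dist = NormalDist(mu=mean_chol, sigma=sem)

# P(170 < sample mean < 190) under the normal sampling distribution.
prob = sampling_dist.cdf(190) - sampling_dist.cdf(170)

print(f"SEM = {sem:.6f}")
print(f"z = ±{(190 - mean_chol) / sem:.4f}")
print(f"P(170 < sample mean < 190) = {prob:.4f}")
```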
4. Suppose that in an independent experiment, you sample 60 men aged 20-24 who
regularly work out at a local gym. You find that the sample mean and standard
deviation for this sample are 165 and 40, respectively.
a. Calculate the 95% confidence interval for the population mean based on this
mean and standard deviation (165 and 40).
Use a z-interval. We have SEM = 40 / sqrt(60) = 5.1640, and the critical
z-value is 1.96 for a 95% confidence interval. So the 95% confidence interval
for the population mean is 165 ± 1.96*5.1640 = (154.8786, 175.1214).
b. Which population does this sample represent?
This sample represents the population of men (who work out regularly) aged
20-24.
c. Give a formal interpretation of the confidence interval (I am 95% confident
that …).
I am 95% confident that the mean serum cholesterol level for men aged 20-24
who regularly work out at a local gym falls between 154.8786 and 175.1214.
d. Make a conclusion using the information given above about whether the mean
of the current population is the same as the national average for the men aged
20 to 24 as determined by the National Health and Nutrition Examination
Survey.
Since the national average of 180 reported in the study is not contained in
the 95% confidence interval here, we conclude that the mean of the current
population is not the same as (is lower than) the national average for men
of the same age.
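The interval from part a and the comparison in part d can be checked together. This is a minimal sketch with illustrative variable names; it only tests whether 180 lies inside the interval and is not a formal hypothesis test.

```python
from math import sqrt

# Values from problem 4: sample mean 165, sample standard deviation 40, n = 60.
sample_mean, sample_sd, n = 165, 40, 60
national_mean = 180  # national average for men aged 20-24 from problem 3

sem = sample_sd / sqrt(n)
margin = 1.96 * sem
low, high = sample_mean - margin, sample_mean + margin

# True only if the national average lies inside the confidence interval.
inside = low <= national_mean <= high

print(f"SEM = {sem:.4f}")
print(f"95% CI = ({low:.4f}, {high:.4f})")
print(f"national mean inside interval: {inside}")
```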