CHAPTER 7: SAMPLE PROBLEMS FOR HOMEWORK, CLASS OR EXAMS
These problems are designed to be done without access to a computer, but they may require a
calculator.
1. You are reading a research paper that describes a study of Y = mean class score on arithmetic
exam versus X = number of arithmetic homework problems per week assigned in the class. Since
the data set is small, with only n = 10 classes, you perform the regression on a scientific
calculator. The screen output states:
Y = ax + b
a = 1.4
b = 2.1
r = 0.71
a. Interpret the coefficients, and describe the relationship in simple language.
b. Is there significant evidence of a linear relationship, using $\alpha = 5\%$?
c. Give a 95% confidence interval for the expected change in Y, if one additional homework
problem is assigned per week. (Hint: solve for the t statistic for the slope.)
2. You are reading a research paper that describes a regression of Y = fruit sugar content on the
independent variable X = low temperature at harvest time in degrees Celsius. The data consisted
of a sample of 16 observations. The authors provide the following summary information:
Estimated regression parameters
Intercept    14.26
X            -0.35
MSE = 0.11
a. In simple language, describe the relationship between fruit sugar content and low temperature.
b. Give a point estimate of the expected fruit sugar content when the low temperature is 10C.
c. Assume that the values of X in the data had a sample mean of 12C and a sample standard
deviation of 4C. Give a 95% confidence interval for the expected fruit sugar content when the
low temperature is 10C.
3. In a sample of 25 observations, the sample correlation between systolic blood pressure and
body mass index is 0.29. Is there significant evidence of a linear relationship between these two
variables, using $\alpha = 5\%$?
4. You are researching survival time for liver transplant patients, with particular emphasis on
its possible relation with age. In an article in a medical journal, you find the following
statement:
The correlation between age and log(survival time) was -0.65 (p value = 0.094).
a. Describe the apparent relationship between age and log(survival time) in simple language.
b. What is the most likely reason for stating the correlation in terms of the logarithms rather than
the survival times?
c. The presence of the p-value implies a hypothesis was tested. What hypothesis was most likely
tested, and what conclusion would you reach if you use $\alpha = 5\%$?
d. How could a correlation coefficient with such a large absolute value have such a high p
value?
5. A research article states that
The fitted regression of Y = ln(asphalt strength) on X = ln(sand content) was
$\hat{Y} = 3.42 - 0.49X$, MSE $= 0.16$
Assume this regression came from a sample of 15 observations where the sample mean of X was
ln(20). What can you say about the typical value of asphalt strengths when sand content is 20?
Use a 95% confidence interval, and be specific with regard to the parameter for which you are
giving an interval.
6. A utility company is attempting to predict summer daily demand for electricity using the
two-day-ahead weather forecast for the day's high. (Using the two-day-ahead forecast gives the
utility company time to prepare.) A regression is done using records for 40 randomly selected summer
days during the past two years. Demand is in megawatts. Part of the printout is shown below.
ANOVA
Source   SS        df    MS      F
Model    191.00    ___   _____   ____
Error    _______   ___   _____
Total    500.00

Parameter estimates
            Estimate   Std. error
Intercept   -12.320    2.4631
High_temp     0.350    0.0149
a. Fill in the blanks in the printout.
b. Give a 95% confidence interval for the difference in expected electricity consumption for two
days, one of which is 5 degrees hotter than the other day.
c. Consider two days with exactly the same predicted high temperature. Write the symbolic
expression (in terms of $\beta_0$, $\beta_1$ and $\varepsilon$) for the difference in their actual electricity consumption.
Knowing what we have discussed regarding independent random variables, use the information
above to give a rough 95% confidence interval for the amount by which these two days could
differ in their actual electricity consumption. [This is a stretch-your-brain problem; do as much
as you can.]
7. You are modeling vehicle carbon emissions as a function of ethanol content of fuel. Each of the
residual plots below is for some transform of the dependent and independent variables. For each, say what
is the most obvious violation of the assumptions, and suggest the remedy (e.g. non-normal residuals,
transform Y). If the model appears acceptable, say so.
A. [Residual plot against Ethanol not reproduced; vertical axis runs from about -7 to 5.]
B. [Residual plot against Ethanol not reproduced; vertical axis runs from about -2 to 2.]
C. [Residual plot against Ethanol not reproduced; vertical axis runs from about -3 to 4.]
8. You have data on a random sample of 15 fifth-grade classes where Y = mean score on standardized
reading test and X = percent of students whose parents are not native speakers of English. Here is a
portion of the regression printout:
$\hat{Y} = 32.4 - 0.07X$, MSE $= 5.09$, $\bar{x} = 15.0$, $s_X = 5.0$
Give a 95% confidence interval for the mean score in an individual class where 20 percent of the students
have parents who are not native speakers of English.
SOLUTIONS
1. a. As number of homework problems increases, the mean score tends to increase. For each extra
homework problem per week, the class mean is expected to increase by 1.4 points. If 0 homework
problems are assigned, the class mean is expected to be 2.1.
b. $F = \frac{8(0.71)^2}{1 - 0.71^2} = 8.13$ with 1 and 8 df. The critical value at $\alpha = 5\%$ is 5.32. There is significant
evidence of a relationship.
c. $t = \frac{1.4}{\text{std. error}} = \sqrt{8.13} \Rightarrow \text{std. error} = 0.491$
Confidence interval: $1.4 \pm 2.306(0.491) = 1.4 \pm 1.13 = (0.27, 2.53)$
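The hint in Problem 1c can be sketched in a few lines of Python. This is only an illustration, assuming simple linear regression, where the slope's t statistic equals $r\sqrt{n-2}/\sqrt{1-r^2}$. The function name `slope_ci_from_r` is made up for this sketch, and the critical value 2.306 is the tabled value used in the solution above.

```python
import math

def slope_ci_from_r(slope, r, n, t_crit):
    """Recover the slope's standard error from r and n, then build a CI.

    Assumes simple linear regression, where
    t = slope / std_error = r * sqrt(n - 2) / sqrt(1 - r^2).
    """
    t_stat = abs(r) * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    std_error = abs(slope) / t_stat
    half_width = t_crit * std_error
    return std_error, (slope - half_width, slope + half_width)

# Values from Problem 1; t_crit = 2.306 is the tabled value for 8 df.
se, (lo, hi) = slope_ci_from_r(slope=1.4, r=0.71, n=10, t_crit=2.306)
print(f"std. error = {se:.3f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

The printed values should agree with the hand calculation up to rounding.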
2. a. As the low temperature increases (that is, weather is warmer), the fruit sugar content declines.
b. 10.76
c. $10.76 \pm 2.1448\sqrt{0.11\left(\frac{1}{16} + \frac{4}{15(4^2)}\right)} = 10.76 \pm 0.20 = (10.56, 10.96)$
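As a check on the arithmetic in Problem 2c, here is a minimal Python sketch of the confidence interval for the expected response at a given X. It assumes the sum of squared deviations of X can be recovered as $(n-1)s_X^2$. The function name `mean_response_ci` is hypothetical, and 2.1448 is the tabled t value for 14 df used in the solution.

```python
import math

def mean_response_ci(y_hat, mse, n, x0, x_bar, s_x, t_crit):
    """Confidence interval for the expected response at X = x0."""
    s_xx = (n - 1) * s_x ** 2  # sum of squared deviations of X
    std_error = math.sqrt(mse * (1 / n + (x0 - x_bar) ** 2 / s_xx))
    half_width = t_crit * std_error
    return y_hat - half_width, y_hat + half_width

# Values from Problem 2: point estimate at a low temperature of 10 degrees.
y_hat = 14.26 - 0.35 * 10
lo, hi = mean_response_ci(y_hat, mse=0.11, n=16, x0=10, x_bar=12, s_x=4,
                          t_crit=2.1448)
print(f"point estimate = {y_hat:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```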
3. $F = \frac{23(0.29^2)}{1 - 0.29^2} = 2.11$ with 1 and 23 df. The critical value at $\alpha = 5\%$ is 4.28. There is no
significant evidence of a relationship between systolic blood pressure and body mass index.
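Solutions 1b and 3 both use the identity $F = (n-2)r^2/(1-r^2)$ with 1 and $n-2$ df. The short Python sketch below applies it to both problems. The helper name `f_from_r` is an assumption of this sketch, and the critical values are the tabled ones quoted in the solutions.

```python
def f_from_r(r, n):
    """F statistic for a linear relationship, computed from r and n."""
    return (n - 2) * r ** 2 / (1 - r ** 2)

# (r, n, tabled 5% critical value) as quoted in Solutions 1b and 3.
cases = {"Problem 1": (0.71, 10, 5.32), "Problem 3": (0.29, 25, 4.28)}

for label, (r, n, f_crit) in cases.items():
    f_stat = f_from_r(r, n)
    verdict = "significant" if f_stat > f_crit else "not significant"
    print(f"{label}: F = {f_stat:.2f} on 1 and {n - 2} df ({verdict})")
```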
4. a. As age increases, log(survival time) tends to decline.
b. The relationship of age and survival time is most likely nonlinear.
c. $H_0: \rho = 0$, where $\rho$ is the population correlation between age and log(survival time). Since the p
value is 0.094, we cannot reject $H_0$. There is no significant evidence of a linear relation between age and
log(survival time).
d. The sample must have been small; otherwise F would have been large.
5. Must remember to transform Sand Content = 20 to X = ln(20).
Point estimate for mean ln(asphalt strength) is 3.42 – 0.49*ln(20) = 1.952. Since value of X for which the
prediction is desired is at the sample mean, do not need the sums of squares for the X values.
Conf. Interval is $1.952 \pm 2.1604\sqrt{0.16(1/15)} = 1.952 \pm 0.223 = (1.729, 2.175)$
Note that this is a confidence interval for the mean of ln(asphalt strength) when Sand Content is 20. To
obtain a statement about typical Asphalt Strength, must exponentiate.
With confidence 95%, median Asphalt Strength is between 5.64 and 8.80 when Sand Content is 20.
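The back-transformation in Problem 5 can be sketched in Python as follows. This is an illustration only. It relies on the problem's statement that the sample mean of X is ln(20), so the standard error reduces to $\sqrt{MSE/n}$. The variable names are made up for this sketch, and 2.1604 is the tabled t value for 13 df used in the solution.

```python
import math

n, mse, t_crit = 15, 0.16, 2.1604  # t_crit: tabled value for 13 df

# Work on the log scale first: X = ln(sand content).
x0 = math.log(20)
log_estimate = 3.42 - 0.49 * x0

# X equals the sample mean of X, so the (x0 - x_bar)^2 term vanishes.
half_width = t_crit * math.sqrt(mse / n)
log_lo, log_hi = log_estimate - half_width, log_estimate + half_width

# Exponentiate the endpoints to get an interval for the median strength.
lo, hi = math.exp(log_lo), math.exp(log_hi)
print(f"log scale: ({log_lo:.3f}, {log_hi:.3f})")
print(f"original scale: ({lo:.2f}, {hi:.2f})")
```

Because the code carries full precision through the exponentiation, the last digit of an endpoint may differ from a hand calculation that rounds the log-scale interval first.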
6. a. With one predictor, the model has 1 df; with n = 40, the error has 40 - 2 = 38 df.
Source   SS        df    MS       F
Model    191.00     1    191.00   23.49
Error    309.00    38    8.132
Total    500.00
b. $5(0.350 \pm 2.03(0.0149)) = (1.6, 1.9)$
With confidence 95%, the expected difference is between 1.6 and 1.9.
c. For Day 1, $Y_1 = \beta_0 + \beta_1 X + \varepsilon_1$, and for Day 2, $Y_2 = \beta_0 + \beta_1 X + \varepsilon_2$. The difference in their actual
consumption is $\varepsilon_1 - \varepsilon_2$. Since the errors are independent with variance $\sigma^2$, the difference has mean 0
and variance $2\sigma^2$. Hence, a rough confidence interval for the difference would be
$\pm 2\sqrt{2\sigma^2} \approx \pm 2\sqrt{2(8.132)} = \pm 8.07$
7. A. nonconstant variance, transform the Y variable
B. Looks good
C. Nonlinearity, but no strong change in variance. Transform the X variable.
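8. Point estimate: $32.4 - 0.07(20) = 31.0$
Confidence interval (for an individual class, hence the leading 1 under the square root):
$31.0 \pm 2.1604\sqrt{5.09\left(1 + \frac{1}{15} + \frac{25}{14(5^2)}\right)} = 31.0 \pm 5.20 = (25.8, 36.2)$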
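To show why the interval in Problem 8 is so wide, the Python sketch below computes the half-width for an individual class next to the half-width for the expected score over all such classes. It is a rough illustration, assuming the sum of squared deviations of X equals $(n-1)s_X^2$. The function name `interval_half_widths` is hypothetical, and 2.1604 is the tabled t value for 13 df used in the solution.

```python
import math

def interval_half_widths(mse, n, x0, x_bar, s_x, t_crit):
    """Return (half-width for the mean, half-width for an individual)."""
    s_xx = (n - 1) * s_x ** 2                  # sum of squared deviations of X
    shared = 1 / n + (x0 - x_bar) ** 2 / s_xx  # term common to both intervals
    mean_half = t_crit * math.sqrt(mse * shared)
    individual_half = t_crit * math.sqrt(mse * (1 + shared))
    return mean_half, individual_half

# Values from Problem 8, at 20 percent non-native-speaker parents.
y_hat = 32.4 - 0.07 * 20
mean_half, individual_half = interval_half_widths(
    mse=5.09, n=15, x0=20, x_bar=15.0, s_x=5.0, t_crit=2.1604)
print(f"individual class: {y_hat:.1f} +/- {individual_half:.2f}")
print(f"expected score:   {y_hat:.1f} +/- {mean_half:.2f}")
```

The extra 1 inside the square root accounts for the variability of a single class around the regression line, which dominates the half-width here.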