Linear Regression Models
Statistics 4205/5205 — Fall 2020
Assignment 1
Reading:
By Monday, September 14, read Chapter 1 and Appendix A.2 of Applied Linear Regression,
fourth edition, by Sanford Weisberg, and Chapters 0–1 of Weisberg’s Computing Primer for
Applied Linear Regression Using R.
For Wednesday, September 16, read Chapter 2 and Appendices A.3–A.4 of the textbook, and
Chapter 2 of the R primer.
Homework 1:
The following problems are due by the end of the day on Wednesday, September 23.
1. The data file wblake gives Age in years, Length in mm, and Scale radius in mm for
n = 439 smallmouth bass measured in West Bearskin Lake in Northeastern Minnesota in
1991.
(a) Convert the continuous variable Age into a new categorical variable AgeClass as
follows:
> wblake$AgeClass <- as.factor(wblake$Age)
The dataframe wblake will now have four variables, the three original variables plus
AgeClass. Use the command summary(wblake), and explain what all the numbers
mean.
(b) Compute the means and standard deviations of Length and Scale for each of the
eight subpopulations (by Age) of smallmouth bass.
You will want to use the command tapply; type help(tapply) to learn about it.
(c) Draw a graph of average length by age in years and compare with the scatterplot of
Length versus Age.
(d) Draw a graph of the standard deviations of length by age. If the variance function is
constant, this graph should be a null plot. Summarize the information.
(e) Draw the plot
> plot(Length ~ AgeClass, data=wblake)
and explain what it shows.
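As a hedged illustration of the commands named in Problem 1, the sketch below applies as.factor and tapply to a small made-up data frame. The data frame toy and all of its values are hypothetical and are not taken from wblake; only the function calls carry over.

```r
# Hypothetical toy data (NOT the wblake data): ages and lengths for six fish.
toy <- data.frame(
  Age    = c(1, 1, 2, 2, 3, 3),
  Length = c(60, 70, 100, 120, 150, 170)
)

# Treat Age as a categorical variable, as in part (a).
toy$AgeClass <- as.factor(toy$Age)

# tapply(X, INDEX, FUN) applies FUN to X separately within each level of INDEX.
mean_len <- tapply(toy$Length, toy$AgeClass, mean)
sd_len   <- tapply(toy$Length, toy$AgeClass, sd)

# The names of the result are the factor levels; convert them back to numbers
# so the group summaries can be plotted against age.
ages <- as.numeric(names(mean_len))
plot(ages, mean_len, type = "b", xlab = "Age (years)", ylab = "Mean length")
plot(ages, sd_len, type = "b", xlab = "Age (years)", ylab = "SD of length")
```

The same pattern should work on wblake once it is loaded, with wblake$Length or wblake$Scale in place of toy$Length.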
2. The data file UN11 contains several variables, including ppgdp, the gross national
product per person in U.S. dollars, and fertility, the birth rate per 1000 females, both
from the year 2009. The data are for 199 localities, mostly UN member countries, but
also other areas such as Hong Kong that are not independent countries. We will study the
dependence of fertility on ppgdp.
(a) Identify the predictor and the response.
(b) Draw the scatterplot of fertility on the vertical axis versus ppgdp on the horizontal
axis, and summarize the information in this graph. Does a straight-line mean function
seem to be plausible for a summary of this graph?
(c) Draw the scatterplot of log(fertility) versus log(ppgdp), using natural logarithms.
Does the simple linear regression model seem plausible for a summary of this graph?
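The effect of the log transformation in part (c) can be sketched on made-up numbers. The data frame toy and its columns x and y below are hypothetical stand-ins, not values from UN11; the point is only that log() can be applied inside a plotting formula and that log() in R is the natural logarithm by default.

```r
# Hypothetical toy data (NOT the UN11 data): a strongly right-skewed
# predictor x and a response y that decreases as x grows.
toy <- data.frame(
  x = c(100, 300, 1000, 3000, 10000, 30000),
  y = c(6.5, 5.0, 3.8, 2.9, 2.2, 1.7)
)

# Original scale: most points bunch up near small x.
plot(y ~ x, data = toy)

# Log scale: log() can be applied to either side of the formula.
plot(log(y) ~ log(x), data = toy)

# Correlations on each scale, as a rough numerical companion to the plots.
r_raw <- cor(toy$x, toy$y)
r_log <- cor(log(toy$x), log(toy$y))
```

For these made-up values the log-scale points fall much closer to a straight line than the raw-scale points do; whether that also holds for UN11 is what the problem asks you to judge from the graphs.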
3. The data file oldfaith gives information about eruptions of Old Faithful Geyser during
October 1980. Variables are the Duration in seconds of the current eruption, and the
Interval, the time in minutes to the next eruption. The park service uses data like these
to obtain a prediction equation for the time to the next eruption.
Draw the relevant summary graph for predicting interval from duration, and summarize
your results.
4. The data file Mitchell gives average soil temperature in degrees C at 20 cm depth in
Mitchell, Nebraska, for 17 years beginning January 1976.
Draw the relevant summary graph to study the dependence of soil temperature on month
number, and summarize your results.
5. Can Southern California’s water supply in future years be predicted from past data? One
factor affecting water availability is stream runoff. If runoff could be predicted, engineers,
planners, and policy makers could do their jobs more efficiently. The data file water
contains 43 years’ worth of precipitation measurements taken at six sites in the Sierra
Nevada mountains (labeled APMAM, APSAB, APSLAKE, OPBPC, OPRC, and OPSLAKE) and stream
runoff volume at a site near Bishop, California, labeled BSAAM.
Draw the scatterplot matrix for these data and summarize the information available from
these plots.
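A scatterplot matrix of the kind requested in Problem 5 can be drawn with the base R function pairs. The sketch below uses simulated data; the data frame toy and the column names siteA, siteB, siteC, and runoff are hypothetical and do not correspond to the variables in the water file.

```r
# Hypothetical simulated data (NOT the water data): three made-up
# precipitation columns and a made-up runoff response.
set.seed(1)
toy <- data.frame(
  siteA = rnorm(20, mean = 10, sd = 2),
  siteB = rnorm(20, mean = 15, sd = 3)
)
toy$siteC  <- toy$siteB + rnorm(20, sd = 1)      # built to track siteB
toy$runoff <- 5 * toy$siteB + rnorm(20, sd = 4)  # built to depend on siteB

# pairs() draws every pairwise scatterplot of the columns in one grid.
pairs(toy)

# The correlation matrix summarizes the same pairs numerically.
cors <- cor(toy)
```

Reading across a row of the grid shows one variable on the vertical axis against each of the others, which is a convenient way to look for groups of predictors that move together.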