Linear Regression Models
Statistics 4205/5205 — Fall 2020

Assignment 1

Reading: By Monday, September 14, read Chapter 1 and Appendix A.2 of Applied Linear Regression, fourth edition, by Sanford Weisberg, and Chapters 0–1 of Weisberg’s Computing Primer for Applied Linear Regression Using R. For Wednesday, September 16, read Chapter 2 and Appendices A.3–A.4 of the textbook, and Chapter 2 of the R primer.

Homework 1: The following problems are due by the end of the day on Wednesday, September 23.

1. The data file wblake gives Age in years, Length in mm, and Scale radius in mm for n = 439 smallmouth bass measured in West Bearskin Lake in Northeastern Minnesota in 1991.
   (a) Convert the continuous variable Age into a new categorical variable AgeClass as follows:
       > wblake$AgeClass <- as.factor(wblake$Age)
       The dataframe wblake will now have four variables: the three original variables plus AgeClass. Use the command summary(wblake), and explain what all the numbers mean.
   (b) Compute the means and standard deviations of Length and Scale for each of the eight subpopulations (by Age) of smallmouth bass. You will want to use the command tapply; type help(tapply) to learn about it.
   (c) Draw a graph of average length by age in years and compare with the scatterplot of Length versus Age.
   (d) Draw a graph of the standard deviations of length by age. If the variance function is constant, this graph should be a null plot. Summarize the information.
   (e) Draw the plot
       > plot(Length ~ AgeClass, data=wblake)
       and explain what it shows.

2. The data in the file UN11 contain several variables, including ppgdp, the gross national product per person in U.S. dollars, and fertility, the birth rate per 1000 females, both from the year 2009. The data are for 199 localities, mostly UN member countries, but also other areas such as Hong Kong that are not independent countries. We will study the dependence of fertility on ppgdp.
   (a) Identify the predictor and the response.
   (b) Draw the scatterplot of fertility on the vertical axis versus ppgdp on the horizontal axis, and summarize the information in this graph. Does a straight-line mean function seem to be plausible for a summary of this graph?
   (c) Draw the scatterplot of log(fertility) versus log(ppgdp), using natural logarithms. Does the simple linear regression model seem plausible for a summary of this graph?

3. The data file oldfaith gives information about eruptions of Old Faithful Geyser during October 1980. Variables are the Duration in seconds of the current eruption, and the Interval, the time in minutes to the next eruption. The park service uses data like these to obtain a prediction equation for the time to the next eruption. Draw the relevant summary graph for predicting interval from duration, and summarize your results.

4. The data file Mitchell gives average soil temperature in degrees C at 20 cm depth in Mitchell, Nebraska, for 17 years beginning January 1976. Draw the relevant summary graph to study the dependence of soil temperature on month number, and summarize your results.

5. Can Southern California’s water supply in future years be predicted from past data? One factor affecting water availability is stream runoff. If runoff could be predicted, engineers, planners, and policy makers could do their jobs more efficiently. The data file water contains 43 years’ worth of precipitation measurements taken at six sites in the Sierra Nevada mountains (labeled APMAM, APSAB, APSLAKE, OPBPC, OPRC, and OPSLAKE) and stream runoff volume at a site near Bishop, California, labeled BSAAM. Draw the scatterplot matrix for these data and summarize the information available from these plots.
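
A note on tapply for Problem 1(b): the following is a minimal sketch on a made-up data frame, not the wblake data. The names toy, group, and y and all of the values are hypothetical, chosen only to show how the call is structured; help(tapply) remains the reference for the details.

```r
# Toy data frame; every value here is made up for illustration.
toy <- data.frame(
  group = factor(c("a", "a", "b", "b", "b")),
  y     = c(1, 3, 2, 4, 6)
)

# tapply(X, INDEX, FUN) applies FUN to the values of X
# separately within each level of the grouping factor INDEX.
group_means <- tapply(toy$y, toy$group, mean)
group_sds   <- tapply(toy$y, toy$group, sd)

print(group_means)
print(group_sds)
```

The same pattern should carry over to a data file from the assignment, with the measured variable as the first argument and the grouping variable as the second.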