F71SC3 Statistics 3 2008 - Project 2: Probability and Data Analysis

Your report for this project must be returned to the Actuarial Mathematics box outside Room EM 1.16 by 4pm on Monday 2nd June 2008. Add an appropriate cover sheet and make sure to include your name on the submission.

There are two questions, worth a total of 30 marks. It should be possible to produce a reasonable report in fewer than 1000 words, together with any graphs or tables that may be needed.

1. The R data frame lead is found on the module web page, and is made available in R by using the function dget as described there. It contains measurements, for a sample of 10 competitive cyclists, of the number of hours per week spent training, and of measured blood lead levels (μmol/L). It is hypothesised that increased time spent training results in increased uptake of lead from surrounding traffic. Of interest, therefore, is the dependence of the distribution of the response variable level on the explanatory variable hours, so we wish to consider an appropriate statistical model for this dependence. We consider only linear models of the form

level = a + b × hours + ε

where ε is a residual random variable, which is assumed to have mean 0 and to be independent and identically distributed across observations.

(a) Plot level (vertical axis) against hours and comment on the observed dependence.

(b) Use least squares regression to fit a model of the form above. Give the estimates of a and b and their standard errors. Also add the fitted line to the plot constructed above. Use both your numerical and graphical information to comment on the usefulness of the model.

(c) It is speculated that the fitting procedure for the model may be unduly influenced by the data for cyclist 10, who has an unusually high blood lead level. Refit the model using least squares, but omitting the data for cyclist 10 from the fitting procedure.
Again give the estimates of the coefficients a and b and their standard errors, and also add the fitted line to the plot already constructed. Comment briefly.

(d) If the data for cyclist 10 had not been available, what would then have been your assessment of the usefulness of the model? How legitimate do you consider it to be to omit the data for cyclist 10?

Note: You should use residual plots, standard errors and an appropriate correlation coefficient to help you decide how well any straight line model fits.

2. The R data frame january can be found on the course webpages. This is made available using the function dget as in Question 1. It records the maximum January temperatures (degrees F), over the period 1931-1960, for 40 selected cities in the US, together with the latitude, longitude, and altitude (feet) of each. Use

> january = dget("january.R")

and remember to attach the data frame. We treat temperature as the response variable and are interested in modelling its dependence on the explanatory variables latitude and longitude (for this exercise we do not consider altitude). In doing so it is important to use our physical knowledge of how temperature might be expected to depend on these variables, as well as the data given.

(a) Give plots of temperature against latitude and temperature against longitude. In each case, comment on any dependence observable in the scatter plot and calculate the product-moment correlation coefficient. Identify (by name) and discuss briefly any anomalous observations.

[Note: You can label points on a plot by the following method:

> identify(temperature~latitude, labels=row.names(january))

Then move your mouse over the plot until you reach the point you wish to identify. Click the LEFT mouse button and the name will appear. Repeat as desired. To get out of this mode, press the RIGHT mouse button and select "stop".]

Also construct and give a plot of longitude against latitude, and comment.
(b) Now consider further the dependence of temperature on latitude. Use least squares regression to construct a linear relationship

temperature = a + b × latitude + ε

(where ε is as in Question 1). Give the fitted values of the coefficients a and b, and display a further plot of temperature against latitude showing the fitted line. Also construct and give a plot of the residuals from the fitted model against latitude. Discuss the success of the fitted model.

(c) Discuss briefly whether it might be a good idea to exclude from further analysis the observation corresponding to Juneau, Alaska (observation 3).

(d) Refit the linear relationship, omitting from the analysis the observation corresponding to Juneau, Alaska. Again give the (new) fitted values of the coefficients a and b, and again display a further plot of temperature against latitude showing the refitted line. For example:

> regress2 = lm(temperature[-3]~latitude[-3]), etc.

Construct a plot of the residuals from the refitted model against latitude. By comparing these plots with those obtained for the earlier analysis, comment on the improvement, or otherwise, in the fitted relationship.

(e) Now give a plot of the residuals from the (new) fit of the model against the second explanatory variable longitude. Comment on the observed pattern and on the likely dependence on longitude of the original response variable temperature.

(f) Would you expect either of the fitted models to be valid outside the continental USA?

Note on report writing. Your report should be word processed. It should include all relevant figures and tables and should be written in good, clear, and concise English. In general your statistics calculator should be invisible, so, for example, vast amounts of R code should not be given.
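
Appendix: a sketch of the fit-and-refit workflow. Both questions follow the same pattern: fit a straight line by least squares, read off the estimates and standard errors, then refit with one observation omitted and compare. The following is a minimal sketch of that pattern only, not a model answer. The data frame toy, its columns x and y, and every value in it are made up for illustration and are not the lead or january data; the object names fit_all, fit_drop, r_all and r_drop are likewise hypothetical.

```r
# Made-up illustrative data (NOT the course data sets).
# Row 10 is deliberately given an unusually large response.
toy <- data.frame(
  x = c(2, 4, 5, 7, 8, 10, 11, 13, 14, 16),
  y = c(0.12, 0.15, 0.14, 0.19, 0.21, 0.22, 0.26, 0.27, 0.30, 0.60)
)

# Least squares fit of y = a + b * x + e using all rows.
fit_all <- lm(y ~ x, data = toy)

# Refit with row 10 omitted from the fitting procedure.
fit_drop <- lm(y ~ x, data = toy[-10, ])

# Estimates of a and b and their standard errors are the first two
# columns of the coefficient table returned by summary().
coef_all  <- summary(fit_all)$coefficients[, c("Estimate", "Std. Error")]
coef_drop <- summary(fit_drop)$coefficients[, c("Estimate", "Std. Error")]
print(coef_all)
print(coef_drop)

# Product-moment (Pearson) correlation, with and without row 10.
r_all  <- cor(toy$x, toy$y)
r_drop <- cor(toy$x[-10], toy$y[-10])

# Plots are drawn only in an interactive session.
if (interactive()) {
  # Scatter plot with both fitted lines (solid: all rows, dashed: row 10 omitted).
  plot(y ~ x, data = toy)
  abline(fit_all, lty = 1)
  abline(fit_drop, lty = 2)

  # Residuals from the full fit against the explanatory variable.
  plot(toy$x, resid(fit_all), xlab = "x", ylab = "residual")
  abline(h = 0, lty = 3)
}
```

Passing toy[-10, ] as the data argument leaves the original data frame untouched, so the full scatter plot can still show the omitted point alongside both fitted lines.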