Project 2

F71SC3
Statistics 3
2008 - Project 2: Probability and Data Analysis
Your report for this project must be returned to the Actuarial Mathematics box outside
Room EM 1.16 by 4pm on Monday 2nd June 2008. Add an appropriate cover sheet and
make sure to include your name on the submission. There are two questions, worth a total
of 30 marks. It should be possible to produce a reasonable report in less than 1000
words, together with any graphs or tables that may be needed.
1. The R data frame lead is found on the module web page, and is made available in R by
using the function dget as described there. It contains measurements, for a sample of 10
competitive cyclists, of the number of hours per week spent training, and of measured
blood lead levels (µmol/L). It is hypothesised that increased time spent training results in
increased uptake in lead from surrounding traffic. Thus of interest is the dependence of the
distribution of the response variable level on the explanatory variable hours, so we wish to
consider an appropriate statistical model for this dependence. We consider only linear
models of the form
level = a + b × hours + ε
where ε is a residual random variable, which is assumed to have mean 0 and to be
independent and identically distributed between observations.
(a) Plot level (vertical axis) against hours and comment on the observed dependence.
(b) Use least squares regression to fit a model of the form above. Give the estimates of a
and b and their standard errors. Add also the fitted relation to the plot constructed above.
Use both your numerical and graphical information to comment on the usefulness of the
model.
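A minimal sketch of this kind of fit, using made-up numbers rather than the lead data. The data frame `toy` and all of its values are hypothetical; `lm`, `summary`, `coef`, `plot` and `abline` are standard R functions.

```r
# Hypothetical data, NOT the lead data frame: 10 made-up (hours, level) pairs.
toy <- data.frame(
  hours = c(4, 6, 7, 9, 10, 12, 13, 15, 16, 18),
  level = c(0.21, 0.25, 0.24, 0.30, 0.29, 0.35, 0.33, 0.40, 0.38, 0.45)
)

# Least squares fit of level = a + b * hours + error.
fit <- lm(level ~ hours, data = toy)

# Rows are (Intercept) = a and hours = b; the two columns kept here
# are the estimate and its standard error.
est <- coef(summary(fit))[, c("Estimate", "Std. Error")]
print(est)

# Scatter plot with the fitted line added.
plot(level ~ hours, data = toy, xlab = "hours", ylab = "level")
abline(fit)
```

Passing the fitted object straight to `abline` draws the line from its intercept and slope, so the coefficients do not need to be copied by hand.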
(c) It is speculated that the fitting procedure for the model may be unduly influenced by
the data for cyclist 10, who has an unusually high blood lead level. Refit the model
using least squares but omitting the data for cyclist 10 from the fitting procedure. Again
give the estimates of the coefficients a and b and their standard errors, and add also the
fitted line to the plot already constructed. Comment briefly.
(d) If the data for cyclist 10 had not been available, what would then have been your
assessment of the usefulness of the model? How legitimate do you consider it to be to
omit the data for cyclist 10?
Note: You should use residual plots, standard errors and an appropriate correlation
coefficient to help you decide how well any straight line model fits.
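The diagnostics mentioned in the note might be computed along the following lines. This is a sketch on made-up data (the vectors `hours` and `level` below are hypothetical, not the lead measurements), and it assumes the Pearson product moment coefficient is the appropriate correlation measure.

```r
# Hypothetical data (made up for illustration, not the lead data).
hours <- c(4, 6, 7, 9, 10, 12, 13, 15, 16, 18)
level <- c(0.21, 0.25, 0.24, 0.30, 0.29, 0.35, 0.33, 0.40, 0.38, 0.45)
fit <- lm(level ~ hours)

# Residuals against the explanatory variable; a patternless band
# around zero is what a well-fitting straight line would give.
res <- resid(fit)
plot(hours, res, xlab = "hours", ylab = "residual")
abline(h = 0, lty = 2)

# Product moment (Pearson) correlation coefficient.
r <- cor(hours, level)

# For simple linear regression, r^2 equals the model's R-squared.
r2 <- summary(fit)$r.squared
```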
2. The R data frame “january” can be found on the course webpages. This is made
available using the function dget as in Question 1. It records the maximum January
temperatures (degrees F), over the period 1931-1960, for 40 selected cities in the US,
together with the latitude, longitude, and altitude (feet) of each.
Use > january = dget("january.R") and remember to attach the data frame with attach(january).
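To illustrate how `dget` and `attach` work together, the following sketch writes a small stand-in data frame with `dput` and reads it back. The data frame `toy`, its city names and its values are hypothetical, and a temporary file takes the place of the course file.

```r
# Hypothetical stand-in for the course file: write a small data frame
# with dput, then read it back with dget.
toy <- data.frame(
  temperature = c(44, 38, 35),
  latitude    = c(35, 40, 44),
  row.names   = c("CityA", "CityB", "CityC")
)
path <- tempfile(fileext = ".R")
dput(toy, file = path)

restored <- dget(path)

# attach() puts the columns on the search path, so `temperature`
# can be used without the `restored$` prefix.
attach(restored)
mean_temp <- mean(temperature)
detach(restored)
```

Calling `detach` once the analysis is finished avoids leaving stale copies of the columns on the search path.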
We treat temperature as the response variable and are interested in modelling its
dependence on the explanatory variables latitude and longitude (for this exercise we do
not consider altitude). In doing so it is important to use our physical knowledge of how
temperature might be expected to depend on these variables, as well as the data given.
(a) Give plots of temperature against latitude and temperature against longitude. In each
case, comment on any dependence observable in the scatter plot and calculate the
Product Moment Correlation Coefficient. Identify (by name) and discuss briefly any
anomalous observations.
[Note: You can label points on a plot by the following method:
>identify(temperature~latitude,labels=row.names(january))
Then, move your mouse over the plot until you reach the point you wish to identify. Click
the (left) mouse button and the name will appear. Repeat as desired. To get out of this
mode you need to press the RIGHT mouse button and select “stop”].
Construct and give also a plot of longitude against latitude and comment.
(b) Now consider further the dependence of temperature on latitude. Use least squares
regression to construct a linear relationship
temperature = a + b × latitude + ε
(where ε is as in Question 1).
Give the fitted values of the coefficients a and b, and display a further plot of temperature
against latitude showing the fitted line. Construct and give also a plot of the residuals from
the fitted model against latitude. Discuss the success of the fitted model.
(c) Discuss briefly whether it might be a good idea to exclude from further analysis the
observation corresponding to Juneau, Alaska (observation 3).
(d) Refit the linear relationship omitting from the analysis the observation corresponding to
Juneau, Alaska. Again give the (new) fitted values of the coefficients a and b, and again
display a further plot of temperature against latitude showing the refitted line.
>regress2=lm(temperature[-3]~latitude[-3]),
etc.
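As a sketch of what the negative index does, the example below refits a line with the third observation removed in two equivalent ways. The data frame `d` and its values are hypothetical (row 3 is deliberately unusual); they are not the january data.

```r
# Hypothetical data: the third row is deliberately unusual.
d <- data.frame(
  latitude    = c(30, 33, 58, 36, 40, 42, 45, 47),
  temperature = c(62, 58, 30, 52, 44, 40, 33, 30)
)

# Index-based form, as in the command above: [-3] drops element 3.
fit_a <- lm(d$temperature[-3] ~ d$latitude[-3])

# Equivalent form: drop row 3 from the data frame before fitting.
fit_b <- lm(temperature ~ latitude, data = d[-3, ])

# Same intercept and slope; only the coefficient labels differ.
same <- isTRUE(all.equal(unname(coef(fit_a)), unname(coef(fit_b))))

# Residuals from the refit against latitude (7 points, row 3 excluded).
plot(d$latitude[-3], resid(fit_b), xlab = "latitude", ylab = "residual")
abline(h = 0, lty = 2)
```

Whichever form is used, the residual plot must drop the same observation from the explanatory variable, otherwise the two vectors have different lengths.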
Construct a plot of the residuals from the refitted model against latitude. By comparing
these plots with those obtained for the earlier analysis, comment on the improvement, or
otherwise, in the fitted relationship.
(e) Now give a plot of the residuals from the (new) fit of the model against the second
explanatory variable longitude. Comment on the observed pattern and on the likely
dependence on longitude of the original response variable temperature.
(f) Would you expect either of the fitted models to be valid outside the continental USA?
Note on Report writing. Your report should be word processed. It should include all
relevant figures and tables and should be written in good, clear, and concise English. In
general your statistics calculator should be invisible, so, for example, vast amounts of R
code should not be given.