Correlation and Regression Minitab Lab 2 Exercises

Datasets for these examples can be accessed at:
http://personal.strath.ac.uk/david.young/SQA/
Question 1
The file Foetus contains the data on the age and length of 84 foetuses as described in
the lectures.
(i) Produce descriptive statistics to summarise the variables age and length and
comment on the distribution of each.
(ii) Produce a scatterplot with age on the x-axis and length on the y-axis. Edit the
graph to have a suitable title and axis labels and comment on the relationship
between the age and length of typically developing foetuses.
(iii) Compute the correlation coefficient between age and length. What can be concluded about the relationship between age and length?
Hint: Use Stat > Basic Statistics > Correlation. Remember to use the help
function for more information.
(iv) Compute the least squares linear regression line which would model the length
of a foetus in terms of the age.
Hint: Use Stat > Regression > Fitted Line Plot.
Using this option produces a plot of the original data with the least squares linear
regression line superimposed.
(v) Compute the coefficient of determination for the regression model. How can this
be interpreted in terms of the fitted model?
Hint: There is no separate option in Minitab to compute this. You will see
it in the top right corner of the fitted line plot in (iv). Note that for simple
linear regression, the coefficient of determination is just the correlation coefficient
squared.
(vi) Use the fitted model to predict the length of a foetus at 85 days and the length
of a foetus at 120 days. For each prediction, comment on two factors from the
exercises above which would indicate the predictions are accurate.
Hint: Use Stat > Regression > Regression > Fit Regression Model to
fit the least squares linear regression line. Length goes in the Responses box
and Age is a Continuous Predictor. This stores the regression equation in
Minitab and it is now possible to use the package for predictions with Stat >
Regression > Regression > Predict by inserting the ages in the appropriate
box.
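For readers who want to see the calculations behind the Minitab menus used in Question 1, the following is a minimal Python sketch of the Pearson correlation, the least squares line and the coefficient of determination. The data values are made up for illustration and are not the Foetus dataset, and the function names (pearson_r, least_squares, r_squared) are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def least_squares(x, y):
    """Return (intercept, slope) of the least squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

def r_squared(x, y):
    """Coefficient of determination, computed as 1 - SSE/SST."""
    a, b = least_squares(x, y)
    my = sum(y) / len(y)
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    return 1 - sse / sst

# Made-up illustrative values (NOT the Foetus dataset)
age = [50, 60, 70, 80, 90]
length = [3.4, 4.5, 5.9, 6.8, 8.1]

intercept, slope = least_squares(age, length)
r = pearson_r(age, length)
print(f"length = {intercept:.3f} + {slope:.4f} age")
print(f"r = {r:.3f}, r squared = {r ** 2:.3f}, R-sq = {r_squared(age, length):.3f}")
```

The last line prints the squared correlation next to R-sq computed from the residuals, so the two can be compared directly, as the hint to (v) suggests.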
Question 2
The Cucumbers dataset contains randomly collected data on growing season precipitation
and cucumber yield. This data is available at:
http://www.physicalgeography.net/fundamentals/3h.html
It is reasonable to suggest that the amount of water received on a field during the
growing season will influence the yield of cucumbers growing on it.
(i) Produce a scatter plot with precipitation on the x-axis and cucumber yield on
the y-axis.
(ii) Compute the correlation between these two variables and comment on the result.
Question 3
Please note that a similar question will be used in the live unit assessment.
The Olympics dataset contains the gold medal performances in the men’s 100 metres.
(i) Produce a scatterplot of year and 100m times with each variable plotted on the
appropriate x and y-axes.
(ii) Comment on the relationship between the variables seen in the scatterplot in (i)
and point out anything of interest in the graph with regard to missing values
and/or outliers.
(iii) Compute the correlation between year and 100m times and write a sentence to
interpret the result.
(iv) Compute the least squares linear regression line between year and time and interpret the coefficients in the context of the problem.
(v) Predict the results of the 100m at the 2012 Olympics. Comment on the accuracy
of this prediction.
Outline Solutions
Question 1
(i) Both age and length are roughly normally distributed (check the histograms),
and the appropriate measures of location and spread for each are the mean and
standard deviation:
Descriptive Statistics: age, length

Variable   N    Mean   StDev  Minimum  Maximum
age       84   70.94   14.18    45.00   100.00
length    84   5.854   1.729    2.449    9.487
(ii) The scatterplot is shown in Figure 1. The graph can be edited in Minitab by
right clicking on it and choosing the appropriate options.
Figure 1: Relationship between age and length of a foetus
(iii) The correlation is 0.988:
Correlation: age, length
Pearson correlation of age and length = 0.988
P-Value = 0.000
(iv) Figure 2 shows the least squares linear regression line superimposed on the original data.
Figure 2: Fitted linear model
(v) The coefficient of determination is 97.6% (see Figure 2). This indicates that
97.6% of the variability in lengths can be explained by the age of the foetus.
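As a quick arithmetic check of the relationship noted in the hint to (v), squaring the reported correlation should reproduce the coefficient of determination. A minimal Python sketch, using only the value of r reported in (iii):

```python
# Correlation between age and length, as reported in the solution to (iii)
r = 0.988

# For simple linear regression, R-sq equals the squared correlation
r_sq_percent = 100 * r ** 2
print(f"R-sq = {r_sq_percent:.1f}%")
```

Because r is itself rounded to three decimal places, the result can only be expected to agree with the Minitab figure to about one decimal place.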
(vi) The output for the predictions is shown below:
Prediction for length

Regression Equation
length = -2.691 + 0.12046 age

Variable  Setting
age            85

    Fit     SE Fit              95% CI              95% PI
7.54783  0.0409156  (7.46643, 7.62922)  (7.01333, 8.08232)

Variable  Setting
age           120

    Fit    SE Fit              95% CI              95% PI
11.7638  0.104889  (11.5551, 11.9724)  (11.1958, 12.3317)  XX

XX denotes an extremely unusual point relative to predictor levels
used to fit the model.
At 85 days the predicted length is 7.55mm. At 120 days the predicted length is
11.76mm. Since the linear model is a good fit, the prediction at 85 days should
be fairly accurate since this lies within the range of data used to compute the
regression line. The prediction at 120 days is outwith the range of data (note
that Minitab gives a warning message with this prediction), and unless the linear
relationship continues to hold beyond 100 days, the accuracy of this prediction is questionable.
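The two point predictions can be reproduced by hand from the regression equation. The Python sketch below does this and also flags ages outside the observed range of 45 to 100 days taken from the descriptive statistics in (i). The coefficients are the rounded values printed by Minitab, so the fits agree with the output only to two or three decimal places. The function name predict_length is hypothetical, and the simple range check is a stand-in for illustration, not the rule Minitab uses to print XX.

```python
# Rounded coefficients from the Minitab regression equation above
INTERCEPT = -2.691
SLOPE = 0.12046

# Minimum and maximum age in the data, from the descriptive statistics in (i)
AGE_MIN, AGE_MAX = 45.0, 100.0

def predict_length(age):
    """Return the fitted length and a flag that is True when age is extrapolated."""
    fit = INTERCEPT + SLOPE * age
    extrapolated = not (AGE_MIN <= age <= AGE_MAX)
    return fit, extrapolated

for age in (85, 120):
    fit, extrapolated = predict_length(age)
    note = "outwith the data range" if extrapolated else "within the data range"
    print(f"age {age}: predicted length {fit:.2f} ({note})")
```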
Question 2
(i) The scatterplot is shown in Figure 3.
Figure 3: Relationship between cucumber yield and rainfall
(ii) The correlation is 0.871. This indicates a fairly strong, positive linear relationship
between rainfall and cucumber yield, i.e. increased precipitation is associated with
an increase in cucumber yield.
Question 3
(i) The scatterplot is shown in Figure 4.
Figure 4: Scatterplot of 100m results over time
(ii) There is a clear, negative linear relationship between the two variables. There is
missing data during the war years.
(iii) The correlation coefficient is -0.901. This indicates a strong, negative linear
relationship between year and the 100m times.
(iv) The least squares linear regression line is:
Regression Equation
Time_100m (seconds) = 36.42 - 0.01333 Year
The intercept is 36.4 seconds, which is the predicted time to run the 100m at year 0; this is
meaningless in the context of this problem. The slope parameter of -0.013 seconds per year
is the estimated average reduction in the winning 100m time each year.
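Since the Olympics are normally held every four years, it can be easier to read the slope per Games rather than per year. A minimal Python sketch of that conversion, using the rounded slope from the regression equation (the variable names are illustrative):

```python
SLOPE_PER_YEAR = -0.01333   # seconds per year, from the regression equation
YEARS_BETWEEN_GAMES = 4     # the Olympics are normally held every four years

# Estimated average change in the winning time from one Games to the next
change_per_games = SLOPE_PER_YEAR * YEARS_BETWEEN_GAMES
print(f"Estimated change per Games: {change_per_games:.3f} seconds")
```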
(v) The predicted time in 2012 is 9.59 seconds.
Prediction for Time_100m (seconds)

Regression Equation
Time_100m (seconds) = 36.42 - 0.01333 Year

Variable  Setting
Year         2012

    Fit     SE Fit              95% CI              95% PI
9.59471  0.0886048  (9.41223, 9.77720)  (9.08114, 10.1083)
This should be an accurate prediction since the linear regression model is a good
fit. However, it should be interpreted with caution since 2012 is outwith the range of data
used for the model. The true result can be found at
http://www.olympic.org/olympic-results/london-2012/athletics/100m-m
and lies within the 95% prediction interval.
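The prediction and the interval check can be sketched in Python as follows. The coefficients are the rounded values from the regression equation, so the fit comes out at about 9.60 rather than Minitab's 9.59471, which is computed from unrounded coefficients. The observed time of 9.63 seconds is an assumption entered by hand and should be confirmed against the results page linked above. The function names are hypothetical.

```python
# Rounded coefficients from the Minitab regression equation
INTERCEPT = 36.42
SLOPE = -0.01333

# 95% prediction interval for 2012, copied from the Minitab output
PI_LOWER, PI_UPPER = 9.08114, 10.1083

def predict_time(year):
    """Fitted winning time in seconds for a given year."""
    return INTERCEPT + SLOPE * year

def in_interval(value, lower, upper):
    """True when value lies between lower and upper, inclusive."""
    return lower <= value <= upper

fit_2012 = predict_time(2012)
print(f"Fit from rounded coefficients: {fit_2012:.2f} seconds")

# Assumption: winning time at London 2012; verify against the linked results page
observed_2012 = 9.63
print("Observed time inside 95% PI:", in_interval(observed_2012, PI_LOWER, PI_UPPER))
```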