Correlation and Regression Minitab Lab 2 Exercises

Datasets for these examples can be accessed at: http://personal.strath.ac.uk/david.young/SQA/

Question 1

The file Foetus contains the data on the age and length of 84 foetuses as described in the lectures.

(i) Produce descriptive statistics to summarise the variables age and length and comment on the distribution of each.

(ii) Produce a scatterplot with age on the x-axis and length on the y-axis. Edit the graph to have a suitable title and axis labels and comment on the relationship between the age and length of typically developing foetuses.

(iii) Compute the correlation coefficient between age and length. What can be concluded about the relationship between age and length?
Hint: Use Stat > Basic Statistics > Correlation. Remember to use the help function for more information.

(iv) Compute the least squares linear regression line which would model the length of a foetus in terms of the age.
Hint: Use Stat > Regression > Fitted Line Plot. Using this option produces a plot of the original data with the least squares linear regression line superimposed.

(v) Compute the coefficient of determination for the regression model. How can this be interpreted in terms of the fitted model?
Hint: There is no separate option in Minitab to compute this. You will see it in the top right corner of the fitted line plot in (iv). Note that for simple linear regression, the coefficient of determination is just the correlation coefficient squared.

(vi) Use the fitted model to predict the length of a foetus at 85 days and the length of a foetus at 120 days. For each prediction, comment on two factors from the exercises above which would indicate whether the prediction is accurate.
Hint: Use Stat > Regression > Regression > Fit Regression Model to fit the least squares linear regression line. Length goes in the Responses box and Age is a Continuous Predictor.
This stores the regression equation in Minitab, and the package can then be used for predictions with Stat > Regression > Regression > Predict by inserting the ages in the appropriate box.

Question 2

The Cucumbers dataset contains randomly collected data on growing season precipitation and cucumber yield. These data are available at: http://www.physicalgeography.net/fundamentals/3h.html
It is reasonable to suggest that the amount of water received on a field during the growing season will influence the yield of cucumbers growing on it.

(i) Produce a scatterplot with precipitation on the x-axis and cucumber yield on the y-axis.

(ii) Compute the correlation between these two variables and comment on the result.

Question 3

Please note that a similar question will be used in the live unit assessment.

The dataset Olympics contains the gold medal performances in the men’s 100 metres.

(i) Produce a scatterplot of year and 100m times with each variable plotted on the appropriate axis.

(ii) Comment on the relationship between the variables seen in the scatterplot in (i) and point out anything of interest in the graph with regard to missing values and/or outliers.

(iii) Compute the correlation between year and 100m times and write a sentence to interpret the result.

(iv) Compute the least squares linear regression line between year and time and interpret the coefficients in the context of the problem.

(v) Predict the result of the 100m at the 2012 Olympics. Comment on the accuracy of this prediction.

Outline Solutions

Question 1

(i) Both age and length are roughly normally distributed (check the histograms), and the appropriate measures of location and spread for each are the mean and standard deviation:

Descriptive Statistics: age, length

Variable   N    Mean   StDev  Minimum  Maximum
age       84   70.94   14.18    45.00   100.00
length    84   5.854   1.729    2.449    9.487

(ii) The scatterplot is shown in Figure 1.
The graph can be edited in Minitab by right-clicking on it and choosing the appropriate options.

Figure 1: Relationship between age and length of a foetus

(iii) The correlation is 0.988:

Correlation: age, length
Pearson correlation of age and length = 0.988
P-Value = 0.000

(iv) Figure 2 shows the least squares linear regression line superimposed on the original data.

Figure 2: Fitted linear model

(v) The coefficient of determination is 97.6% (see Figure 2). This indicates that 97.6% of the variability in length can be explained by the age of the foetus.

(vi) The output for the predictions is shown below:

Prediction for length

Regression Equation
length = -2.691 + 0.12046 age

Variable  Setting
age            85

Fit      SE Fit     95% CI              95% PI
7.54783  0.0409156  (7.46643, 7.62922)  (7.01333, 8.08232)

Variable  Setting
age           120

Fit      SE Fit    95% CI              95% PI
11.7638  0.104889  (11.5551, 11.9724)  (11.1958, 12.3317)  XX

XX denotes an extremely unusual point relative to predictor levels used to fit the model.

At 85 days the predicted length is 7.55mm. At 120 days the predicted length is 11.76mm. Since the linear model is a good fit and 85 days lies within the range of data used to compute the regression line, the prediction at 85 days should be fairly accurate. The prediction at 120 days is outwith the range of data (note that Minitab gives a warning message with this prediction), and unless the linear relationship continues beyond 100 days, the accuracy of this prediction is questionable.

Question 2

(i) The scatterplot is shown in Figure 3.

Figure 3: Relationship between cucumber yield and rainfall

(ii) The correlation is 0.871. This indicates a fairly strong, positive linear relationship between rainfall and cucumber yield, i.e. increased precipitation is associated with an increase in cucumber yield.

Question 3

(i) The scatterplot is shown in Figure 4.

Figure 4: Scatterplot of 100m results over time

(ii) There is a clear, negative linear relationship between the two variables.
Data are missing for the war years, when no Games were held.

(iii) The correlation coefficient is -0.901. This indicates a strong, negative linear relationship between year and the 100m times.

(iv) The least squares linear regression line is:

Regression Equation
Time_100m (seconds) = 36.42 - 0.01333 Year

The intercept is 36.4 seconds, which is the predicted time to run the 100m in year 0 and is meaningless in the context of this problem. The slope of -0.013 seconds per year is the average reduction in the winning 100m time each year, i.e. about 0.05 seconds per four-year Olympic cycle.

(v) The predicted time in 2012 is 9.59 seconds.

Prediction for Time_100m (seconds)

Regression Equation
Time_100m (seconds) = 36.42 - 0.01333 Year

Variable  Setting
Year         2012

Fit      SE Fit     95% CI              95% PI
9.59471  0.0886048  (9.41223, 9.77720)  (9.08114, 10.1083)

Since the linear regression model is a good fit, this should be a reasonably accurate prediction. However, it should be interpreted with caution since 2012 is outwith the range of data used for the model. The true result can be found at http://www.olympic.org/olympic-results/london-2012/athletics/100m-m and lies within the 95% prediction interval.
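
The fitted values and the coefficient of determination quoted in the solutions can be checked by hand from the printed regression equations. The following is a minimal Python sketch of that arithmetic, offered only as a cross-check and not as part of the Minitab workflow. The function names are illustrative assumptions, and the coefficients are the rounded values shown in the output, so the results may differ slightly from Minitab's fits.

```python
# Hand check of the Minitab predictions, using the ROUNDED coefficients
# printed in the regression equations (an assumption: Minitab itself works
# with unrounded coefficients, so the third decimal place may differ).

# Observed age range in days, taken from the descriptive statistics.
AGE_MIN, AGE_MAX = 45.0, 100.0


def predict_length(age_days):
    """Question 1: length = -2.691 + 0.12046 * age."""
    return -2.691 + 0.12046 * age_days


def predict_time(year):
    """Question 3: Time_100m (seconds) = 36.42 - 0.01333 * Year."""
    return 36.42 - 0.01333 * year


def is_extrapolation(x, lo, hi):
    """True when x lies outside the range of data used to fit the model."""
    return not (lo <= x <= hi)


def r_squared_percent(r):
    """Coefficient of determination (%) for simple linear regression: r squared."""
    return 100 * r ** 2


# Question 1 (vi): predictions at 85 and 120 days, flagging extrapolation.
for age in (85, 120):
    print(age, round(predict_length(age), 3), is_extrapolation(age, AGE_MIN, AGE_MAX))

# Question 3 (v): predicted winning time in 2012.
print(2012, round(predict_time(2012), 3))

# Question 1 (v): squaring the correlation of 0.988.
print(round(r_squared_percent(0.988), 1))
```

With the rounded coefficients, the hand-calculated predictions agree with the Minitab fits to within about 0.01, and the extrapolation flag for 120 days corresponds to the XX warning in the Question 1 output.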