Practical 3 CORRELATION and REGRESSION

This second part will consider illustrations of
a) how to assess the population correlation between two variables of interest and
b) how to model the dependence of a response variable on an explanatory variable through simple linear regression.

The specific contexts dealt with in this practical are:
1: Home and School Behaviour;
2: Bronchodilator Use in Asthmatic Children;
3: Edible Mass of Horse Mussels.

The demonstrator will go through contexts 1 and 2 with you in the practical and you should record all relevant material on the worksheets provided. You will be expected to analyse the data in context 3 by yourself and write up a report on the analysis.

Practical 3 Context 1 : Home and School Behaviour

Source: Ann Laybourn (Centre for Child Research).

Context: As part of a large scale study on the position of ONLY children in Society a sample of 234 only children were assessed on their behaviour at age 16 both at home (by their mother) and at school (by their ‘guidance’ teacher). Basically this involved each child being assessed by a questionnaire on items of behaviour such as worry, irritability, bullying, fingernail biting, etc. Each questionnaire resulted in a behaviour score for each child at home and at school. (High values on this scale were indicative of ‘bad’ behaviour while low values are ‘good’ - at least to this society’s norms!)

Question: In general is there evidence of any link/correlation between a child’s HOME BEHAVIOUR SCORE and his/her SCHOOL BEHAVIOUR SCORE?

Data: This data is held in a Minitab worksheet called ‘BEHAVE’. The column entitled ‘HOME’ gives a child’s home behaviour score while the column entitled ‘SCHOOL’ gives the corresponding school behaviour score.

Practical 3 Context 2 : Bronchodilator use in Asthmatic Children

Source: Allison Ferguson (Child Health, QMH).

Context: Recent studies of asthmatic children have suggested that the frequency of use of bronchodilators/inhalers/puffers is not well related to the severity of asthma for a child at any particular time. In a study of this at the Queen Mother’s Hospital 22 pre-school asthmatic children were assessed over a 2 month period as to
a) the typical number of daily puffs of the inhaler they used and
b) the typical severity of their asthma over this time.

Question: How, if at all, is the typical number of daily puffs of a pre-school asthmatic child DEPENDENT on the severity of his/her asthma? Use this relationship to predict the average number of puffs used by a future child with a severity score of 3.

Data: This data is held in a Minitab Worksheet called ‘BRONCHO’ with the typical number of puffs of each child in the column ‘PUFFS’ and the severity score (with 0 being NO asthma and 10 being SEVERE asthma) in the column ‘SEVERITY’.

Practical 3 Context 3 : Horse Mussels

Source: Marlborough Sounds, New Zealand

Context: In a study of commercial mussel production measurements were taken on a sample of horse mussels. Interest lay in obtaining a relationship to predict the edible mass of the mussels and in understanding how this was affected by the mussel shape.

Question: Assuming these are a representative sample of horse mussels, what is the relationship between edible mass of the mussel and the length? What is the predicted edible mass for a mussel with length 200? Be sure to give both a point and an interval estimate. Consider also the relationship between edible mussel mass and the width of the mussel. Which of the measurements, length or width, gives the best prediction of edible mussel mass? Be sure to use diagnostic plots to check your model assumptions. Do you have any reservations about your analyses? Can you suggest a possible simple remedy? (There is no need to report on any further analyses that you may do.)

Data: The data is in a Minitab worksheet called ‘mussels’.
The column EDIBLE contains the edible mass of the mussel (in grams), HEIGHT, WIDTH and LENGTH are measurements on the shell (in mm) and SHELL gives the weight of the shell (in grams).

REPORT WRITING
For this context you should submit a report, pasting all relevant output into your report.

Practical 3 WORKSHEET FOR BEHAVIOUR PROBLEM - 1

Assuming this is a representative sample of 16 year-old ‘ONLY children’ in Scotland, interest lies in assessing the correlation coefficient of such a population.

As always plot the data first to obtain

[Figure: Relationship of Home and School Behaviour. Home Behaviour Score (vertical axis) against School Behaviour Score (horizontal axis). High values denote 'poor' behaviour.]

From this plot there is certainly NO suggestion of a strong, if any, relationship between home behaviour and school behaviour of 16 year-old ‘ONLY children’.

To formally assess the strength of (or lack of) such a relationship obtain the SAMPLE correlation coefficient (using the CORR command on the two score columns), which is _______.

This is certainly not far from zero (i.e. NO relationship at all) and the next step is to test the null hypothesis that in fact the population correlation is equal to zero. Since the p-value is _______ than 0.05 there _______ a significant relationship between home and school behaviour scores of 16 year-old only children.

Perhaps this seems somewhat surprising in the light of the plot above but remember there are 234 observations and if one produces an interval estimate of the magnitude of the correlation by

%CORRCI ‘HOME’ ‘SCHOOL’

one obtains an approximate 95% confidence interval for the population correlation coefficient of home and school behaviour scores of _______ to _______, i.e. this range is entirely positive but very close to 0 and really very far from 1.

So, at best, there is a mild relationship (a maximum population correlation of 0.3) between the two scores.
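As a cross-check outside Minitab, the interval estimate for a population correlation can be sketched with Fisher's z-transform, one common large-sample approximation. Whether the %CORRCI macro uses exactly this method is an assumption, and the value `r_example` below is a hypothetical placeholder, not the sample correlation from the ‘BEHAVE’ worksheet; only the sample size of 234 comes from the context above.

```python
import math
from statistics import NormalDist


def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for a population correlation.

    Uses Fisher's z-transform: atanh(r) is roughly Normal with
    standard error 1 / sqrt(n - 3) when n is reasonably large.
    """
    z = math.atanh(r)                                # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)                      # approximate standard error
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)   # about 1.96 for 95%
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


# HYPOTHETICAL sample correlation, for illustration only.
# Replace it with the value you recorded on the worksheet.
r_example = 0.2
n = 234  # number of only children in the study

lower, upper = fisher_ci(r_example, n)
print(f"approximate 95% CI: ({lower:.3f}, {upper:.3f})")
```

With a sample this large the interval is fairly narrow, which is why a correlation that looks negligible on the plot can still be significantly different from zero.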
Conclusion: There is a significant but non-substantial positive correlation between the home and school behaviour scores of 16 year old only children in Scotland.

Note: From the plot, it looks as though at least one of the scores may not be Normally distributed. In this situation it may be worth calculating a rank correlation coefficient to assess the strength of the relationship. In Minitab, this is done by

RANK C1 C11
RANK C2 C12
CORR C11 C12

giving a value of _______, which is similar to the ‘normal correlation’ above, confirming the earlier conclusions.

Practical 3 WORKSHEET FOR BRONCHODILATOR PROBLEM : 2

Here the response variable is the No. of Puffs and the explanatory variable is the Severity of Asthma.

First plot the data (with the response on the vertical axis) to obtain

[Figure: The Use of Bronchodilators. Daily No. of Puffs (vertical axis) against Asthma Severity Score (horizontal axis).]

There is a clear, direct, roughly linear relationship with a reasonable amount of variability about such a line.

To quantify this relationship use the SAMPLE of data to estimate the true but unknown underlying linear relationship by

REGR ‘PUFFS’ 1 ‘SEVERITY’

to obtain

The regression equation is
PUFFS = 0.172 + 1.30 SEVERITY

Predictor     Coef     Stdev    t-ratio      p
Constant    0.1722    0.4150       0.42  0.683
SEVERITY    1.3046    0.1493       8.74  0.000

s = 1.199    R-sq = 79.2%    R-sq(adj) = 78.2%

Analysis of Variance
SOURCE       DF       SS       MS      F      p
Regression    1   109.74   109.74  76.33  0.000
Error        20    28.75     1.44
Total        21   138.49

Unusual Observations
Obs.  SEVERITY   PUFFS    Fit  Stdev.Fit  Residual  St.Resid
 10       6.00  10.383  8.000      0.624     2.384     2.33R

R denotes an obs. with a large st. resid.

This basically tells you that the estimate of

The Average No. of Puffs = _______ + _______ * Severity of Asthma

with a variability (about this average) corresponding to a standard deviation of _______ puffs at each level of severity of asthma.

Also worth noting from the output above is the R - SQUARED VALUE of _______.

This tells us that _______ of the variability in the No. of puffs used by a child daily can be explained by its dependence on the severity of the child’s asthma (in a linear model).

Clearly the relationship between no. of puffs and severity is substantial as can be seen by constructing an interval estimate for the true but unknown slope of such a relationship in the population of all sufferers. This is of the form

estimate ± 2 * estimated standard error of the estimate

which is 1.305 ± 2 * 0.149 and hence is _______ to _______.

This is completely positive so the slope (and hence the population correlation coefficient) is significantly greater than zero.

Note that the Minitab output does not give this CI but it does give a p-value for the test of zero slope. From the output, this is p = 0.000. Hence we can reject the null hypothesis of zero slope as the p-value is less than 0.05. This gives the same conclusion as the CI.

A prediction interval for the number of puffs used by a child with a severity score of 3 can be obtained from the PREDICT subcommand as follows:

REGR ‘PUFFS’ 1 ‘SEVERITY’ ;
PREDICT 3.

In addition to the earlier output, this gives

  Fit  Stdev.Fit          95% CI            95% PI
4.086      0.283  (3.496, 4.676)    (1.516, 6.656)

Thus, such a child would be very likely to use between _______ and _______ puffs per day with a best estimate of _______ puffs.

The validity of the assumption of linearity can be checked by a residual plot (a plot of the residuals against the fitted values). For these data this gives:

[Figure: residuals (resids, vertical axis) against fitted values (fits, horizontal axis).]

If the straight line is an adequate fit to the data, this plot should show a random scatter of points with no pattern. Does the assumption of linearity seem reasonable for these data?

Finally, to present the results of the analysis to the Paediatricians at the hospital, produce a plot of the fitted line with confidence and prediction bands to obtain

[Figure: Regression plot of Puffs (vertical axis) against Symptoms (horizontal axis). Y = 0.172248 + 1.30459X, R-Squared = 0.792, with 95.0% Confidence Bands and 95.0% Prediction Bands.]

(Note: The annotation on your graph may differ slightly from this.)

This provides, in the outside dotted lines, Prediction bands of the likely no. of puffs for each level of severity of a child’s asthma. (Note: The confidence bands, which are not of particular interest in this example, could be omitted by omitting the ‘CI’ subcommand.)

For example, from the graph, a child with a severity score of 3 is likely to use between about 1.5 and 6.5 puffs daily (as seen earlier from the ‘PREDICT’ subcommand).

Practical 3 MULTIPLE REGRESSION – Variable Selection

This practical will consider some applications of Multiple Regression. In particular we will look at the use of stepwise procedures for selection of explanatory variables to include in the model when there is a large number of explanatory variables.

The specific contexts dealt with in this practical are:
4: Possum morphometric measurements;
5: Predicting the Weight of a Horse’s Heart.

Context 5 : Predicting the Weight of a Horse’s Heart

Context: 46 terminally ill horses had a number of ultrasound measurements made on their hearts which were weighed post-mortem. The following ultrasound measurements were made:
thickness of the outer wall of the heart during systole (the pumping phase);
thickness of the outer wall during diastole (the recovery phase);
thickness of the inner wall during systole;
thickness of the inner wall during diastole.

Question: Which combination of these variables is most useful in predicting the weight of the heart?
Using an appropriate regression model, obtain a prediction interval for the weight of the heart for a horse with ultrasound measurements as follows:
siw = 4.0, diw = 3.0, sow = 3.5, dow = 3.5.

Data: The data are stored in a Minitab worksheet HORSE4 with columns
C1: SIW (systole inner wall)
C2: DIW (diastole inner wall)
C3: SOW (systole outer wall)
C4: DOW (diastole outer wall)
C5: WT (weight of horse’s heart; kg)

Context 4 : Mountain Possum Measurements

Context: Various morphometric measurements were made on captured possums.

Question: Which combination of the other measurement variables is most useful in predicting the total length of the possum? For your selected model, is there any gender difference? Try adding in a dummy variable for the gender categorisation.

Data: The data are stored in a Minitab worksheet possum.

REPORT WRITING
For this context you should submit a report, pasting all relevant output into your report. For guidance see the analysis below for Context 5.

Practical 3 WORKSHEET FOR HORSE PROBLEM - 5

1. Examining the relationship

Firstly examine the relationships amongst the variables using plots and correlation coefficients. A matrix plot is best obtained from the pull-down menus by

Graph > Matrix Plot

In the dialog box, select ‘siw’, ‘diw’, ‘sow’, ‘dow’, ‘wt’ into the ‘Graph variables’ box. Then click on ‘Options’ and specify ‘lower left’ in the ‘Matrix display’ and ‘Boundary’ under ‘Variable label placement’. (Add a title if you wish, under ‘Annotation’.)

This gives

[Figure: Matrix Plot of Horses Heart Data (lower left panels for siw, diw, sow, dow, wt).]

Record your subjective impression below.

Sample correlation coefficients are also useful in studying the relationships among the variables. The sample correlation matrix is obtained by

CORR C1-C5

giving:

        siw    diw    sow    dow
diw   0.909
sow   0.825  0.772
dow   0.756  0.699  0.908
wt    0.778  0.811  0.779  0.686

2. Using Stepwise Regression

Since the 4 explanatory variables are highly correlated with each other as well as with the weight of the heart, it is unlikely that they will all be required in a multiple regression model.

Stepwise regression can be used to help identify the ‘best’ model involving the smallest number of explanatory variables. To do this, type the command

STEP ‘wt’ on C1-C4

giving

F-to-Enter: 4.00    F-to-Remove: 4.00
Response is wt on 4 predictors, with N = 46

Step            1        2
Constant   -1.062   -1.495

diw          1.37     0.88
T-Value      9.20     4.07

sow                   0.56
T-Value               2.95

S           0.665    0.613
R-Sq        65.78    71.55
More? (Yes, No, Subcommand, or Help)
SUBC>

The best single explanatory variable is _______, which explains _______ of the variability in the weights.

The best explanatory variable in addition to diw is _______. Both of these are useful in addition to each other because _______ and together they explain _______ of the variability in the weights.

The stepwise process stops at this point because: _______

Note that Minitab’s stepwise output ends with the ‘SUBC>’ prompt. This allows you to modify the stepwise process by entering or removing explanatory variables from the model. It is often of interest (but not necessary) to see how the process would continue if the restriction on entering only ‘significant’ explanatories were removed. To do this, type

FENTER = 0.

at the ‘SUBC>’ prompt. This gives:

Step            3        4
Constant   -1.455   -1.433

diw          0.88     0.92
T-Value      4.03     2.72

sow          0.72     0.73
T-Value      2.19     2.13

dow         -0.25    -0.25
T-Value     -0.58    -0.57

siw                  -0.05
T-Value              -0.17

S           0.618    0.625
R-Sq        71.78    71.80
More? (Yes, No, Subcommand, or Help)
SUBC>

As you can see, the model with all 4 explanatories has an R-squared value which is only 0.25% higher than that for the model with 2 explanatories and this small increase is achieved at the cost of adding another 2 explanatories to the model. Since neither of these 2 explanatories was significant in addition to ‘diw’ and ‘sow’, the model with two explanatories is the best one to use in practice.

To escape from the STEPWISE subcommands, type ‘NO’.

3. Checking and using the ‘best’ model

The REGRESS command in Minitab can now be used for this ‘best’ model to obtain:
the standard errors of the parameters (for constructing CIs if required)
p values for hypothesis tests for the parameters
prediction intervals.

To obtain the required prediction interval, type

REGRESS ‘wt’ 2 ‘diw’ ‘sow’;
PREDICT 3.0, 3.5.

This gives:

The regression equation is
wt = - 1.49 + 0.880 diw + 0.561 sow

Predictor      Coef    Stdev   t-ratio      p
Constant    -1.4948   0.3728     -4.01  0.000
diw          0.8797   0.2164      4.07  0.000
sow          0.5612   0.1901      2.95  0.005

s = 0.6133    R-sq = 71.5%    R-sq(adj) = 70.2%

Analysis of Variance
SOURCE       DF       SS       MS      F      p
Regression    2   40.672   20.336  54.07  0.000
Error        43   16.173    0.376
Total        45   56.845

SOURCE   DF   SEQ SS
diw       1   37.395
sow       1    3.277

Unusual Observations
Obs.   diw      wt     Fit  Stdev.Fit  Residual  St.Resid
 44   2.20  4.0100  2.1803     0.1210    1.8297     3.04R
 45   2.10  2.9700  2.8219     0.3537    0.1481     0.30 X

R denotes an obs. with a large st. resid.
X denotes an obs. whose X value gives it large influence.

   Fit  Stdev.Fit           95% C.I.            95% P.I.
3.1085     0.1236  (2.8593, 3.3578)    (1.8466, 4.3705)

Thus, a future horse with diw = 3.0 and sow = 3.5 (ignoring the values given for the other explanatories not included in the model) is very likely to have a heart weighing between _______ and _______ with a best estimate of _______.
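The link between the printed confidence interval, the prediction interval and the R-sq values can be checked by hand. The sketch below is an illustration in Python rather than Minitab, using only numbers copied from the output for the model with ‘diw’ and ‘sow’. It assumes the usual textbook formulas (prediction half-width = multiplier * sqrt(s^2 + Stdev.Fit^2), R-sq = SS Regression / SS Total), and it backs the t multiplier out of the printed C.I. rather than looking it up in tables; the variable names are made up for the example.

```python
import math

# Values copied from the Minitab output for wt on diw and sow.
fit = 3.1085        # fitted value at diw = 3.0, sow = 3.5
stdev_fit = 0.1236  # standard error of the fitted value (Stdev.Fit)
s = 0.6133          # residual standard deviation
ci_upper = 3.3578   # upper limit of the printed 95% C.I.

# Back out the t multiplier used by Minitab from the printed C.I.
# (assumption: C.I. = fit +/- multiplier * Stdev.Fit).
t_mult = (ci_upper - fit) / stdev_fit


def prediction_interval(fit, stdev_fit, s, t_mult):
    """Prediction interval for one new observation.

    The extra s**2 term accounts for the scatter of an individual
    heart weight about the regression surface.
    """
    half_width = t_mult * math.sqrt(s ** 2 + stdev_fit ** 2)
    return fit - half_width, fit + half_width


pi_lower, pi_upper = prediction_interval(fit, stdev_fit, s, t_mult)

# R-sq and adjusted R-sq from the Analysis of Variance table.
ss_reg, ss_err, ss_tot = 40.672, 16.173, 56.845
df_err, df_tot = 43, 45
r_sq = ss_reg / ss_tot
r_sq_adj = 1 - (ss_err / df_err) / (ss_tot / df_tot)

print(f"t multiplier: {t_mult:.3f}")
print(f"95% P.I.:     ({pi_lower:.4f}, {pi_upper:.4f})")
print(f"R-sq:         {100 * r_sq:.1f}%   R-sq(adj): {100 * r_sq_adj:.1f}%")
```

Because the inputs are rounded values from the printout, the results should agree with the Minitab figures only up to rounding in the last decimal place.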
As in Practical P, the simplest way to produce all the appropriate residual plots for assumption checking is to run the multiple regression through the pull-down menus as follows:

Stat > Regression > Regression

and select ‘wt’ into the ‘Response’ box, with ‘diw’, ‘sow’ in the ‘Predictors’ box.

Under ‘Graphs’, check the buttons for
standardised residuals
Normal plot of residuals
Residuals vs fits
and select ‘diw’, ‘sow’ into the ‘Residuals vs variables’ box.

The requested plots are then produced in separate graph windows.

(You need not work through this again since the final model is identical to that in Practical P.)

[Figures (response is wt, Standardized Residual on each vertical axis): Normal Probability Plot of the Residuals (against Normal Score); Residuals Versus the Fitted Values (against Fitted Value); Residuals Versus diw; Residuals Versus sow.]

In the probability plot (plot 1) the relationship is reasonably linear, so that the assumption of Normality is reasonable.

There are no obvious patterns in plots 2, 3 or 4 to suggest problems with the assumptions of linearity and constant variance.