Context 1 : Home and School Behaviour

Practical 3
CORRELATION and REGRESSION
This second part will consider illustrations of
a) how to assess the population correlation between two variables of interest
and
b) how to model the dependence of a response variable on an explanatory variable through simple linear regression.
The specific contexts dealt with in this practical are:
1: Home and School Behaviour;
2: Bronchodilator Use in Asthmatic Children;
3: Edible Mass of Horse Mussels.
The demonstrator will go through contexts 1 and 2 with you in the practical and you should record all relevant material on the worksheets provided.
You will be expected to analyse the data in context 3 by yourself and write up a report on the analysis.

Context 1 : Home and School Behaviour
Source:
Ann Laybourn (Centre for Child Research).
Context:
As part of a large scale study on the position of ONLY children in Society a sample of 234 only children were assessed on their behaviour at age 16 both at home (by their mother) and at school (by their ‘guidance’ teacher). Basically this involved each child being assessed by a questionnaire on items of behaviour such as worry, irritability, bullying, fingernail biting, etc.
Each questionnaire resulted in a behaviour score for each child at home and at school.
(High values on this scale were indicative of ‘bad’ behaviour while low values are ‘good’ - at least to this society’s norms!).
Question:
In general is there evidence of any link/correlation between a child’s HOME BEHAVIOUR SCORE and his/her SCHOOL BEHAVIOUR SCORE?
Data:
This data is held in a Minitab worksheet called ‘BEHAVE’.
The column entitled ‘HOME’ gives a child’s home behaviour score while the column entitled ‘SCHOOL’ gives the corresponding school behaviour score.

Context 2 : Bronchodilator use in Asthmatic Children
Source:
Allison Ferguson (Child Health, QMH).
Context:
Recent studies of asthmatic children have suggested that the frequency of use of bronchodilators/inhalers/puffers is not well related to the severity of asthma for a child at any particular time.
In a study of this at the Queen Mother’s Hospital 22 pre-school asthmatic children were assessed over a 2 month period as to
a) the typical number of daily puffs of the inhaler they used
and
b) the typical severity of their asthma over this time.
Question:
How, if at all, is the typical number of daily puffs of a pre-school asthmatic child DEPENDENT on the severity of his/her asthma?
Use this relationship to predict the average number of puffs used by a future child with a severity score of 3.
Data:
This data is held in a Minitab Worksheet called ‘BRONCHO’ with the typical number of puffs of each child in the column ‘PUFFS’ and the severity score (with 0 being NO asthma and 10 being SEVERE asthma) in the column ‘SEVERITY’.

Context 3 : Horse Mussels
Source:
Marlborough Sounds, New Zealand
Context:
In a study of commercial mussel production measurements were taken on a sample of horse mussels. Interest was on obtaining a relationship to predict the edible mass of the mussels and understanding how this was affected by the mussel shape.
Question:
Assuming these are a representative sample of horse mussels, what is the relationship between edible mass of the mussel and the length?
What is the predicted edible mass for a mussel with length 200? Be sure to give both a point estimate and an interval estimate.
Question:
Consider also the relationship between edible mussel mass and the width of the mussel.
Which of the measurements, length or width, gives the best prediction of edible mussel mass?
Be sure to use diagnostic plots to check your model assumptions.
Do you have any reservations about your analyses?
Can you suggest a possible simple remedy? (There is no need to report on any further analyses that you may do.)
Data:
The data is in a Minitab worksheet called mussels.
The column EDIBLE contains the edible mass of the mussel (in grams), HEIGHT, WIDTH and LENGTH are measurements on the shell (in mm) and SHELL gives the weight of the shell (in grams).
REPORT WRITING
For this context you should submit a report, pasting all relevant output into your report.

WORKSHEET FOR BEHAVIOUR PROBLEM - 1
Assuming this is a representative sample of 16 year-old ‘ONLY children’ in Scotland, interest lies in assessing the correlation coefficient of such a population.
As always plot the data first to obtain
[Figure: Relationship of Home and School Behaviour. Home Behaviour Score (vertical axis) against School Behaviour Score (horizontal axis). High values denote 'poor' behaviour.]
From this plot there is certainly NO suggestion of a strong, if any, relationship between home behaviour and school behaviour of 16 year-old ‘ONLY children’.
To formally assess the strength of (or lack of) such a relationship obtain the SAMPLE correlation coefficient by
_______
which is
_______
This is certainly not far from zero (i.e. NO relationship at all) and the next step is to test the null hypothesis that in fact the population correlation is equal to zero.
Since the p-value is _______ than 0.05 there _______ a significant relationship between home and school behaviour scores of 16 year-old only children.
Perhaps this seems somewhat surprising in the light of the plot above but remember there are 234 observations and if one produces an interval estimate of the magnitude of the correlation by
%CORRCI ‘HOME’ ‘SCHOOL’
one obtains an approximate 95% confidence interval for the population correlation coefficient of home and school behaviour scores of
_______ to _______
i.e. this range is entirely positive but very close to 0 and really very far from 1.
So, at best, there is a mild relationship (a maximum population correlation of 0.3) between the two scores.
Conclusion:
There is a significant but non-substantial positive correlation between the
home and school behaviour scores of 16 year old only children in Scotland.
Note:
From the plot, it looks as though at least one of the scores may not be
Normally distributed. In this situation it may be worth calculating a rank
correlation coefficient to assess the strength of the relationship.
In Minitab, this is done by
RANK C1 C11
RANK C2 C12
CORR C11 C12
giving a value of _______, which is similar to the ‘normal correlation’ above, confirming the earlier conclusions.
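The internals of the %CORRCI macro are not shown in this handout. One standard way to get an approximate 95% interval for a population correlation is Fisher's z transformation, and the sketch below assumes that approach. The value `r_example` is a made-up placeholder, not the worksheet answer; only n = 234 comes from the context.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate CI for a population correlation via Fisher's z transform."""
    z = math.atanh(r)                    # transform the sample correlation
    se = 1.0 / math.sqrt(n - 3)          # approximate standard error on the z scale
    lo, hi = z - z_crit * se, z + z_crit * se
    return math.tanh(lo), math.tanh(hi)  # back-transform to the correlation scale

def t_for_zero_corr(r, n):
    """t statistic (n - 2 df) for the null hypothesis of zero population correlation."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Hypothetical sample correlation, chosen only for illustration
r_example = 0.20
n = 234  # number of only children in the study

low, high = fisher_ci(r_example, n)
print(f"approx 95% CI for the correlation: {low:.3f} to {high:.3f}")
print(f"t statistic for H0: correlation = 0: {t_for_zero_corr(r_example, n):.2f}")
```

With a sample this large even a small sample correlation gives an interval that excludes zero, which is why a relationship can be 'significant' without being substantial.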

WORKSHEET FOR BRONCHODILATOR PROBLEM : 2
Here the response variable is the No. of Puffs and the explanatory variable is the Severity of Asthma.
First plot the data (with the response on the vertical axis) to obtain
[Figure: The Use of Bronchodilators. Daily No. of Puffs (vertical axis) against Asthma Severity Score (horizontal axis).]
There is a clear, direct, roughly linear relationship with a reasonable amount of variability about such a line.
To quantify this relationship use the SAMPLE of data to estimate the true but unknown underlying linear relationship by
REGR ‘PUFFS’ 1 ‘SEVERITY’
to obtain

The regression equation is
PUFFS = 0.172 + 1.30 SEVERITY

Predictor      Coef    Stdev   t-ratio       p
Constant     0.1722   0.4150      0.42   0.683
SEVERITY     1.3046   0.1493      8.74   0.000

s = 1.199    R-sq = 79.2%    R-sq(adj) = 78.2%

Analysis of Variance
SOURCE        DF       SS       MS       F       p
Regression     1   109.74   109.74   76.33   0.000
Error         20    28.75     1.44
Total         21   138.49

Unusual Observations
Obs.  SEVERITY    PUFFS     Fit   Stdev.Fit   Residual   St.Resid
 10       6.00   10.383   8.000       0.624      2.384      2.33R

R denotes an obs. with a large st. resid.

This basically tells you that the estimate of
The Average No. of Puffs = _______ + _______ * Severity of Asthma
with a variability (about this average) corresponding to a standard deviation of _______ puffs at each level of severity of asthma.
Also worth noting from the output above is the R - SQUARED VALUE of _______
This tells us that _______ of the variability in the No. of puffs used by a child daily can be explained by its dependence on the severity of the child’s asthma (in a linear model).
Clearly the relationship between no. of puffs and severity is substantial as can be seen by constructing an interval estimate for the true but unknown slope of such a relationship in the population of all sufferers. This is of the form
estimate ± 2 * estimated standard error of the estimate
which is here
1.305 ± 2 * 0.149
and hence is
_______ to _______
This is completely positive so the slope (and hence the population correlation coefficient) is significantly greater than zero.
Note that the Minitab output does not give this CI but it does give a p-value for the test of zero slope. From the output, this is _______. Hence we can reject the null hypothesis of zero slope as the p-value is less than 0.05. This gives the same conclusion as the CI.
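The summary figures in the regression output are linked by simple arithmetic, which the following sketch retraces. All input numbers are copied from the Minitab output for the bronchodilator data; the variable and function names are illustrative only.

```python
import math

# Values copied from the Minitab regression output (PUFFS on SEVERITY)
coef_slope, se_slope = 1.3046, 0.1493
ss_regression, ss_error, ss_total = 109.74, 28.75, 138.49
df_error = 20

def rough_ci(estimate, se, multiplier=2.0):
    """Interval of the form estimate +/- multiplier * standard error."""
    return estimate - multiplier * se, estimate + multiplier * se

t_ratio = coef_slope / se_slope         # the t-ratio column for SEVERITY
r_squared = ss_regression / ss_total    # proportion of variability explained
s = math.sqrt(ss_error / df_error)      # residual standard deviation
slope_lo, slope_hi = rough_ci(coef_slope, se_slope)

print(f"t-ratio for slope : {t_ratio:.2f}")
print(f"R-squared         : {100 * r_squared:.1f}%")
print(f"s                 : {s:.3f}")
print(f"rough 95% CI      : {slope_lo:.3f} to {slope_hi:.3f}")
```

The multiplier 2 is the handout's rule of thumb; the exact 97.5% point of a t distribution on 20 degrees of freedom is slightly larger.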
A prediction interval for the number of puffs used by a child with a severity score of 3 can be obtained from the PREDICT subcommand as follows:
REGR ‘PUFFS’ 1 ‘SEVERITY’ ;
PREDICT 3.
In addition to the earlier output, this gives

  Fit   Stdev.Fit           95% CI           95% PI
4.086       0.283   (3.496, 4.676)   (1.516, 6.656)

Thus, such a child would be very likely to use between _______ and _______ puffs per day with a best estimate of _______ puffs.
The validity of the assumption of linearity can be checked by a residual plot (a plot of the residuals against the fitted values). This is obtained by
_______
This gives:
[Figure: residual plot. resids (vertical axis) against fits (horizontal axis).]
If the straight line is an adequate fit to the data, this plot should show a random scatter of points with no pattern.
Does the assumption of linearity seem reasonable for these data?
Finally, to present the results of the analysis to the Paediatricians at the hospital, use the command
_______
to obtain
[Figure: Regression Fit. Puffs (vertical axis) against Symptoms (horizontal axis). Y = 0.172248 + 1.30459X, R-Squared = 0.792, with 95.0% Confidence Bands and 95.0% Prediction Bands.]
(Note: The annotation on your graph may differ slightly from this.)
This provides, in the outside dotted lines, Prediction bands of the likely no. of puffs for each level of severity of a child’s asthma. (Note: The confidence bands, which are not of particular interest in this example, could be omitted by omitting the ‘CI’ subcommand.)
For example, from the graph, a child with a severity score of 3 is likely to use between about 1.5 and 6.5 puffs daily (as seen earlier from the ‘PREDICT’ subcommand).
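The confidence and prediction intervals from the PREDICT subcommand can be retraced by hand. The sketch below uses the usual simple-regression formulas with the fit, Stdev.Fit and s from the output. The critical value 2.086 (97.5% point of t on 20 degrees of freedom) is taken from standard t tables and is an assumption here, since it is not printed in the output.

```python
import math

def intervals(fit, se_fit, s, t_crit):
    """Return (confidence interval, prediction interval) at one x value."""
    ci_half = t_crit * se_fit                          # uncertainty in the mean response
    pi_half = t_crit * math.sqrt(s ** 2 + se_fit ** 2)  # adds child-to-child variability
    return (fit - ci_half, fit + ci_half), (fit - pi_half, fit + pi_half)

# Fitted value at SEVERITY = 3 from the regression equation in the output
fit_at_3 = 0.1722 + 1.3046 * 3

T_CRIT_20DF = 2.086  # from t tables (assumed, not shown in the Minitab output)
conf_int, pred_int = intervals(fit_at_3, se_fit=0.283, s=1.199, t_crit=T_CRIT_20DF)

print(f"fit    : {fit_at_3:.3f}")
print(f"95% CI : ({conf_int[0]:.3f}, {conf_int[1]:.3f})")
print(f"95% PI : ({pred_int[0]:.3f}, {pred_int[1]:.3f})")
```

The prediction interval is much wider than the confidence interval because it describes a single future child rather than the average over all children with that severity score.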

MULTIPLE REGRESSION – Variable Selection
This practical will consider some applications of Multiple Regression. In particular we will look at the use of stepwise procedures for selection of explanatory variables to include in the model when there is a large number of explanatory variables.
The specific contexts dealt with in this practical are:
4: Possum morphometric measurements;
5: Predicting the Weight of a Horse’s Heart.

Context 4 : Mountain Possum Measurements
Context:
Various morphometric measurements were made on captured possums.
Question:
Which combination of the other measurement variables is most useful in predicting the total length of the possum?
For your selected model, is there any gender difference? Try adding in a dummy variable for the gender categorisation.
Data:
The data are stored in a Minitab worksheet.
possum
REPORT WRITING
For this context you should submit a report, pasting all relevant output into your report. For guidance see the analysis below for Context 5.

Context 5 : Predicting the Weight of a Horse’s Heart
Context:
46 terminally ill horses had a number of ultrasound measurements made on their hearts which were weighed post-mortem. The following ultrasound measurements were made:
thickness of the outer wall of the heart during systole (the pumping phase);
thickness of the outer wall during diastole (the recovery phase);
thickness of the inner wall during systole;
thickness of the inner wall during diastole.
Question:
Which combination of these variables is most useful in predicting the weight of the heart?
Using an appropriate regression model, obtain a prediction interval for the weight of the heart for a horse with ultrasound measurements as follows
siw = 4.0, diw = 3.0, sow = 3.5, dow = 3.5.
Data:
The data are stored in a Minitab worksheet.
HORSE4
with columns
C1: SIW (systole inner wall)
C2: DIW (diastole inner wall)
C3: SOW (systole outer wall)
C4: DOW (diastole outer wall)
C5: WT (weight of horse’s heart; kg)

WORKSHEET FOR HORSE PROBLEM - 5
1. Examining the relationship
Firstly examine the relationships amongst the variables using plots and correlation coefficients. A matrix plot is best obtained from the pull-down menus by
Graph > Matrix Plot
In the dialog box, select ‘siw’, ‘diw’, ‘sow’, ‘dow’, ‘wt’ into the ‘Graph variables’ box.
Then click on ‘Options’ and specify ‘lower left’ in the ‘Matrix display’ and ‘Boundary’ under ‘Variable label placement’. (Add a title if you wish, under ‘Annotation’.)
This gives
[Figure: Matrix Plot of Horses Heart Data. Lower-left matrix of scatter plots of siw, diw, sow, dow and wt.]
Record your subjective impression below.

Sample correlation coefficients are also useful in studying the relationships among the variables. The sample correlation matrix is obtained by
CORR C1-C5
giving:

         siw     diw     sow     dow
diw    0.909
sow    0.825   0.772
dow    0.756   0.699   0.908
wt     0.778   0.811   0.779   0.686

2. Using Stepwise Regression
Since the 4 explanatory variables are highly correlated with each other as well as with the weight of the heart, it is unlikely that they will all be required in a multiple regression model.
Stepwise regression can be used to help identify the ‘best’ model involving the smallest number of explanatory variables.
To do this, type the command
STEP ‘wt’ on C1-C4
giving

F-to-Enter:    4.00    F-to-Remove:    4.00

Response is wt on 4 predictors, with N = 46

Step            1        2
Constant   -1.062   -1.495

diw          1.37     0.88
T-Value      9.20     4.07

sow                   0.56
T-Value               2.95

S           0.665    0.613
R-Sq        65.78    71.55
More? (Yes, No, Subcommand, or Help)
SUBC>

The best single explanatory variable is _______, which explains _______ of the variability in the weights.
The best explanatory variable in addition to diw is _______. Both of these are useful in addition to each other because _______ and together they explain _______ of the variability in the weights.
The stepwise process stops at this point because:
_______
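The first two stepwise steps can be retraced from the sample correlation matrix alone, which may help show what STEP is doing. The sketch below is an illustration under stated assumptions and not Minitab's own algorithm: it uses the standard formula for R-squared with two predictors and a partial F statistic compared with the F-to-Enter value of 4.00. The function names are hypothetical, and results differ slightly from Minitab's because the correlations are rounded to three decimals.

```python
# Sample correlations copied from the CORR C1-C5 output
r_with_wt = {"siw": 0.778, "diw": 0.811, "sow": 0.779, "dow": 0.686}
r_between = {
    frozenset(("siw", "diw")): 0.909,
    frozenset(("siw", "sow")): 0.825,
    frozenset(("diw", "sow")): 0.772,
    frozenset(("siw", "dow")): 0.756,
    frozenset(("diw", "dow")): 0.699,
    frozenset(("sow", "dow")): 0.908,
}
N = 46            # number of horses
F_TO_ENTER = 4.0  # threshold shown at the top of the stepwise output

def r2_one(x):
    """R-squared for a model with the single predictor x."""
    return r_with_wt[x] ** 2

def r2_two(x1, x2):
    """R-squared for a model with two predictors, from correlations only."""
    r1, r2 = r_with_wt[x1], r_with_wt[x2]
    r12 = r_between[frozenset((x1, x2))]
    return (r1 ** 2 + r2 ** 2 - 2 * r1 * r2 * r12) / (1 - r12 ** 2)

def f_to_enter(r2_old, r2_new, n, n_predictors_new):
    """Partial F for the gain in R-squared from adding one predictor."""
    return (r2_new - r2_old) / ((1 - r2_new) / (n - n_predictors_new - 1))

# Step 1: the predictor most strongly correlated with wt
first = max(r_with_wt, key=r2_one)
f_first = f_to_enter(0.0, r2_one(first), N, 1)

# Step 2: the predictor that raises R-squared most when added to the first
candidates = [x for x in r_with_wt if x != first]
second = max(candidates, key=lambda x: r2_two(first, x))
f_second = f_to_enter(r2_one(first), r2_two(first, second), N, 2)

print(f"step 1: {first}, R-sq = {100 * r2_one(first):.1f}%, F = {f_first:.1f}")
print(f"step 2: add {second}, R-sq = {100 * r2_two(first, second):.1f}%, F = {f_second:.1f}")
```

Each partial F is approximately the square of the T-Value that Minitab prints for the variable entering at that step, so comparing F with 4.00 is roughly the same as asking whether the T-Value exceeds 2 in magnitude.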
Note that Minitab’s stepwise output ends with the ‘SUBC>’ prompt. This allows you to modify the stepwise process by entering or removing explanatory variables from the model.
It is often of interest (but not necessary) to see how the process would continue if the restriction on entering only ‘significant’ explanatories were removed. To do this, type, at the ‘SUBC>’ prompt
FENTER = 0.
This gives:

Step            3        4
Constant   -1.455   -1.433

diw          0.88     0.92
T-Value      4.03     2.72

sow          0.72     0.73
T-Value      2.19     2.13

dow         -0.25    -0.25
T-Value     -0.58    -0.57

siw                  -0.05
T-Value              -0.17

S           0.618    0.625
R-Sq        71.78    71.80
More? (Yes, No, Subcommand, or Help)
SUBC>

As you can see, the model with all 4 explanatories has an R² value which is only 0.25% higher than that for the model with 2 explanatories and this small increase is achieved at the cost of adding another 2 explanatories to the model. However, since neither of these 2 explanatories were significant in addition to ‘diw’ and ‘sow’ at step 3, this model with two explanatories is the best one to use in practice.
To escape from the STEPWISE subcommands, type ‘NO’.

3. Checking and using the ‘best’ model
The REGRESS command in Minitab can now be used for this ‘best’ model to obtain:
the standard errors of the parameters (for constructing CIs if required)
p values for hypothesis tests for the parameters
prediction intervals.
To obtain the required prediction interval, type
REGRESS ‘wt’ 2 ‘diw’ ‘sow’;
PREDICT 3.0, 3.5.
This gives:

The regression equation is
wt = - 1.49 + 0.880 diw + 0.561 sow

Predictor       Coef    Stdev   t-ratio       p
Constant     -1.4948   0.3728     -4.01   0.000
diw           0.8797   0.2164      4.07   0.000
sow           0.5612   0.1901      2.95   0.005

s = 0.6133    R-sq = 71.5%    R-sq(adj) = 70.2%

Analysis of Variance
SOURCE        DF       SS       MS       F       p
Regression     2   40.672   20.336   54.07   0.000
Error         43   16.173    0.376
Total         45   56.845

SOURCE   DF   SEQ SS
diw       1   37.395
sow       1    3.277

Unusual Observations
Obs.    diw       wt      Fit   Stdev.Fit   Residual   St.Resid
 44    2.20   4.0100   2.1803      0.1210     1.8297      3.04R
 45    2.10   2.9700   2.8219      0.3537     0.1481      0.30 X

R denotes an obs. with a large st. resid.
X denotes an obs. whose X value gives it large influence.

   Fit   Stdev.Fit             95% C.I.             95% P.I.
3.1085      0.1236   (2.8593, 3.3578)   (1.8466, 4.3705)

Thus, a future horse with diw = 3.0 and sow = 3.5 (ignoring the values given for the other explanatories not included in the model) is very likely to have a heart weighing between _______ and _______ with a best estimate of _______.
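The fitted value, prediction interval and adjusted R-squared for the two-variable model can be checked by hand from the REGRESS output. The sketch below copies its inputs from that output; the critical value 2.017 (approximately the 97.5% point of t on 43 degrees of freedom) comes from t tables and is an assumption, as are the names used.

```python
import math

# Coefficients and summary values copied from the REGRESS output
coef = {"const": -1.4948, "diw": 0.8797, "sow": 0.5612}
s, se_fit = 0.6133, 0.1236
ss_error, ss_total = 16.173, 56.845
n, p = 46, 2            # horses, explanatory variables in the model
T_CRIT_43DF = 2.017     # approximate value from t tables (assumed)

def predict_wt(diw, sow):
    """Point prediction of heart weight (kg) from the fitted equation."""
    return coef["const"] + coef["diw"] * diw + coef["sow"] * sow

fit = predict_wt(3.0, 3.5)

# Prediction interval: fit +/- t * sqrt(s^2 + (Stdev.Fit)^2)
half_width = T_CRIT_43DF * math.sqrt(s ** 2 + se_fit ** 2)
pred_int = (fit - half_width, fit + half_width)

# Adjusted R-squared compares mean squares rather than sums of squares
r2_adj = 1 - (ss_error / (n - p - 1)) / (ss_total / (n - 1))

print(f"fit       : {fit:.4f}")
print(f"95% PI    : ({pred_int[0]:.3f}, {pred_int[1]:.3f})")
print(f"R-sq(adj) : {100 * r2_adj:.1f}%")
```

Note that siw and dow play no part in the prediction because they are not in the selected model.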
As in Practical P, the simplest way to produce all the appropriate residual plots for assumption checking is to run the multiple regression through the pull-down menus as follows:
Stat > Regression > Regression
and select ‘wt’ into the ‘Response’ box, with ‘diw’, ‘sow’ in the ‘Predictors’ box.
Under ‘Graphs’, check the buttons for
standardised residuals
Normal plot of residuals
Residuals vs fits
and select ‘diw’, ‘sow’ into the ‘Residuals vs variables’ box.
The requested plots are then produced in separate graph windows.
(You need not work through this again since the final model is identical to that in Practical P.)
[Figure: four residual plots (response is wt), each with Standardized Residual on the vertical axis. Plot 1: Normal Probability Plot of the Residuals (against Normal Score). Plot 2: Residuals Versus the Fitted Values (against Fitted Value). Plot 3: Residuals Versus diw. Plot 4: Residuals Versus sow.]
In the probability plot (plot 1) the relationship is reasonably linear, so that the assumption of Normality is reasonable.
There are no obvious patterns in plots 2, 3 or 4 to suggest problems with the assumptions of linearity and constant variance.
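The standardised residuals used in these plots can be reproduced from the columns of the ‘Unusual Observations’ table. The sketch below assumes the usual definition, in which the leverage of an observation is (Stdev.Fit / s) squared and the residual is divided by s times the square root of one minus the leverage. The function name is illustrative.

```python
import math

S = 0.6133  # residual standard deviation from the REGRESS output

def standardized_residual(residual, se_fit, s=S):
    """Residual divided by its estimated standard deviation."""
    leverage = (se_fit / s) ** 2   # how unusual the x values of the observation are
    return residual / (s * math.sqrt(1.0 - leverage))

# Residual and Stdev.Fit copied from the Unusual Observations table
sr_44 = standardized_residual(residual=1.8297, se_fit=0.1210)
sr_45 = standardized_residual(residual=0.1481, se_fit=0.3537)

print(f"obs 44: standardised residual = {sr_44:.2f}")
print(f"obs 45: standardised residual = {sr_45:.2f}")
```

Observation 44 is flagged R because its standardised residual is large, while observation 45 is flagged X because its Stdev.Fit, and hence its leverage, is large even though its residual is small.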