Correlation and Simple Linear Regression

advertisement
Correlation and Simple
Linear Regression
Pearson’s Product Moment Correlation
(sample correlation r estimates population correlation r)
• Measures the strength of linear association
between two numeric variables X and Y.
• The correlation (r) is a unit-less quantity.
• -1 < r < 1
• If r < 0 then there is a negative association
between X and Y, i.e. as X increases Y generally
decreases
• If r > 0 then there is a positive association
between X and Y, i.e. as X increases Y generally
increases
Pearson’s Product Moment Correlation
(sample correlation r estimates population correlation r)
• The close r is to 0 the weaker the
linear association between X and Y.
•
Sample Correlation (r)
Adjective scale for
the sample
correlation
coefficient ( r ).
Pearson’s Product Moment
Correlation ( r )
Some examples of various positive and negative correlations.
Pearson’s Product Moment Correlation
(sample correlation r estimates population correlation r)
• Never calculate a correlation
coefficient without plotting the data!
• Correlation is NOT causation!
• Beware of influential points.
r = .11
r = .72
Pearson’s Product Moment Correlation
(sample correlation r estimates population correlation r)
• Never calculate a correlation
coefficient without plotting the data!
• Correlation is NOT causation!
• Beware of influential points.
r = .11
r = .72
Pearson’s Product Moment Correlation
(sample correlation r estimates population correlation r)
• Beware of outliers
r = .86
r = .66
Examples
where no form
of correlation
coefficient is
appropriate
Pearson’s Product Moment
Correlation ( r )
• The formula (and equivalencies)
n
r
 x
i 1
n
  xi
i 1
i
 x  y i  y 
 x
2
n
  yi
 y
2
i 1
n
 xi  x  y i  y 
1





n  1 i 1  s x  s y 
n
1

( z  score xi )( z  score y i )

n  1 i 1
Pearson’s Product Moment Correlation ( r )
1 n
z  score xi z  score yi 
r

n  1 i 1
zx , zy > 0
zx < 0 & z y > 0
y
zx > 0 & z y < 0
zx , zy < 0
x
Pearson’s Product Moment Correlation ( r )
1 n
z  score xi z  score yi 
r

n  1 i 1
zx < 0 & z y > 0
zx , zy > 0
y
zx , z y < 0
zx > 0 & z y < 0
x
Nonlinear Relationships
Pulmonary Artery Pressure vs. Relative Flow
Velocity Change
Not all relationships are
linear. In cases where there
is clear evidence of a
nonlinear relationship DO
NOT use Pearson’s Product
Moment Correlation ( r ) to
summarize the strength
of the relationship between
Y and X.
Clearly artery pressure (Y)
and change in flow velocity
(X) are nonlinearly related.
Testing for Significant Correlation
Testing Population Correlation (r)
Ho : r  0
H a : r  0 or ( or )
t
r n2
1 r
2
~ t - dist df  n  2
Most software packages will conduct this test for any
correlation you are interested in. When looking at multiple
correlations be sure consider “Bonferroni correcting”.
Other measures of correlation/association
• Spearman’s Rank Correlation ( rs ) – can be used
with ordinal data where the levels are in some
sense equidistant. Can also be used when
relationships are nonlinear but monotonic. It
involves ranking the x’s and y’s and finding
Pearson’s correlation based on the ranks.
• Kendall’s Tau (a, b, & c) – is used when X and Y
are both ordinal and they need not be “equidistant”.
a – does not adjust for ties
b – adjusts for ties, X and Y must have same levels
c – adjust for ties, X and Y don’t have same levels
Data for Kendall’s Tau-a, b, or c could summarized using a contingency table.
Other measures of correlation/association
• Biserial Correlation – correlation when X is
continuous and Y is naturally dichotomous
(e.g. male/female, smoker/non-smoker) or
a created (e.g. Age < 18/Age > 18).
• Polyserial Correlation – correlation when X
is continuous and Y is ordinal.
• JMP/SPSS compute Spearman’s and
Kendall’s tau however the others require
specialized software.
Examples: NC Births Data
Age of Father vs. Age of Mother
The Pearson Product
Moment Correlation
(r = .7543, p < .0001)
suggests a fairly strong
correlation between
father’s age and mother’s
age.
The line represents the
mean age of the father
given the mother’s age.
Examples: NC Births Data
Birth weight (g) and Gestational Age (wks)
Smooth Estimate of
Is the relationship between
the birth weight and
gestational age linear?
E(Birth Weight|Gest. Age)
Hard to say, consider adding a
smoothing spline to the plot.
The smoothing spline helps us
see the relationship between
the mean birth weight and
gestational age.
The notation we use for the
mean of Y given X is E(Y|X).
Here we have…
E(Birth Weight|Gest. Age)
Examples: NC Births Data
Birth weight (g) and Gestational Age (wks)
The Pearson Product Moment
Correlation (r = .5515,
p < .0001) suggests a
moderate correlation between
gestational age and birth
weight. However the smooth
curve estimate of the mean
birth weight suggests the
relationship is not linear, thus
the Pearson correlation may
not be appropriate.
We could consider using
Spearman’s rank correlation
instead (rs = .3056,
p < .0001).
Examples: NC Births Data
Birth Weight vs. Smoking Status
The biserial correlation
between smoking
status and birth weight
is (rbs = -.1175, p = .0002)
suggesting a weak
negative association
between smoking and
birth weight.
The biserial correlation is found by computing Pearson’s
correlation between birth weight and smoking status, where
smoking status is coded as 0 = no, 1 = yes and treated as a
numeric quantity.
Medicare Survey Data – General Health at
Baseline & Follow-up
Association between baseline & follow-up general
health (revisited)
Kendall’s Tau can be
used measure the
degree of
association between
two ordinal
variables, here
(t .5933, p < .0001)
Simple Linear Regression
• Regression refers to the estimation of the
mean of a response (Y) given information
about single predictor X in case of simple
regression or multiple X’s in the case of
multiple regression.
• We denote this mean as E(Y|X) in the
case simple regression and
E(Y|X1,X2,…,Xp) in the case multiple
regression with p potential predictors.
Simple Linear Regression
• In the case of simple linear regression we
assume that the mean can modeled using
function that is a linear function of
unknown parameters, e.g.
E (Y | X )   o  1 X
or
E (Y | X )   o  1 X   2 X 2
or
E (Y | X )   o  1 ln( X )
These are all example of simple
linear regression models. When
many people think of simple
linear regression they think of the
first mean function because it is
the equation of a line. This
model will be our primary focus.
Simple Linear Regression Model
The regression model…
y  mean  random scatter
y  E (Y | X )  error
observed y = trend + scatter
E(Y|X) + SD(Y|X)
E(Y|X)
E(Y|X) - SD(Y|X)
The data model
yi  E ( yi | xi )  ei
The fitted model
yi  Eˆ ( yi | xi )  eˆi  yˆ i  eˆi
Estimated mean function
Residual or
random error
Assumptions
1. The assumed functional form for the mean function
E(Y|X) is correct, e.g. we might assume a line for
the mean function, E(Y|X) = o1X.
2. The random scatter, i.e. ei’s “random errors”
We assume that ei ~ N(0,s2) and independent.
This means for a given X, Y is normal with
constant variation for all X.
3. Error standard deviation (s )
– Std. dev of random process producing the “errors” ei
– Governs amount of scatter about the mean E(Y|X)
• big s  lots of scatter, s    no scatter
Assumptions (cont’d)
Note: Normality is required for
inference, i.e. t-tests, F-tests,
and CI’s for model parameters
and predictions.
Gestational Age and Weight
In a study of premature infants researchers
looked at their gestational age weeks and
weight and the following data was gathered:
Simple Linear Regression
Example 1: Gestational Age (weeks) and Weight of
Premature Babies (g)
E(Weight|Age) = o  1Age
( xi , yˆ i )
( xi , yˆ i )
= -1404.36 + 80.06Age
r = .80
eˆi  yi  yˆ i
( xi , y i )
The regression equation is
estimated from the data by
minimizing the squared
vertical distances from the
data points to the fitted line.
These vertical distances are
called residuals. Some of
the residuals eˆ i have been
added to the plot.
The fitted values are the points that lie on the line or each of the xi
values. They represent the estimated mean for that value of x.
Fitting a line by least squares
Which line?
??
??
??
??
??
Fitting a line by least squares
(a) The data
(b) Which line?
• Choose line with smallest sum of squared
prediction errors
• i.e. smallest Residual Sum of Squares, RSS
??
Fitting a line by least squares
Least-squares line
i th data point
(xi , yi )
The prediction errors
Place
on any line
(residuals)
Choose line to
Minimize S(yi 
yi
)2
Prediction
error
i.e. with smallest
sum of squared
prediction errors
x1 x 2 . .
xi
. . .
i.e. with the smallest sum of squared
lengths of the “error” arrows
xn
Method of Least Squares
To estimate the regression line we choose
This requires calculus but the solutions are easy to
express in terms of standard summary statistics
Next we look at the estimated coefficients and their
interpretation.
The Regression Line
i.e. y = mx + b
^
0 = Intercept
=
-value at x = 0
Interpretable only if x = 0 is a
value of particular interest.
^
1 w units
^
0
^
1 = Slope
w units
= Change in for every
unit increase in x
0
x
Always interpretable !
Interpretation of Coefficients and the
Estimated Regression Equation
grams
What
are the units on the
intercept and slope parameters?
grams/week
How do we interpret the estimate values?
o  y-intercept, the mean of Y when X = 0, usually of no interest
unless 0 is reasonable value for X. Here X = 0 is meaningless.
1 = slope, the change in the mean of Y when X increases by 1.
Here we estimate the mean weight increases 86 grams/week.
Interpretation of Coefficients and the
Estimated Regression Equation
Estimating the mean weight as a function of the
gestational age.
Use the equation to estimate the mean weight of
infants born with a gestational age of 30 weeks.
Use the equation to estimate the mean weight of
infants born with a gestational age of 42 weeks.
This is beyond the range of the data as none of infants in these data were
full term or longer.
F-test in Regression ANOVA Summary
Source
Regression
Residual
Total
Sum
of Sqs
df
k
n-k-1
Mean SS
F-statistic
MSReg F0=
MSReg
/ MSE
p-val.
P( F  F0)
MSE
n-1
Note: F0 = MSReg /MSE
Tests
H0: The regression is NOT useful
i.e., H0: 1 = 0 (and 2 = 0 and …. k = 0 if we are
performing multiple regression)
• Almost always significant
(P-value almost always small)
• Very rare that an investigator’s intuition is so bad that none of
her or his explanatory variables have any predictive value
F-test in Regression ANOVA Summary
Fo = 67.49, p-value < .0001
Thus we conclude that the regression is useful
and that gestational age helps explain the
variation in the observed birth weights.
Summarizing the Fit
R2 = proportion of variation
explained by the regression of
Y on X. Here this proportion is
.6398 or 63.98%. This will be
discussed in next few slides.
Estimate of residual or error variance (s)
This is also called Root Mean Square Error (RMSE)
Towards “Percent Variation Explained”
y
Actual y-observations
Shows the variation
in the y's
x
Fitted or predicted values
y
Shows the variation
in the ’s
x
Percent of Variation Explained
x
y
x
In a situation where we had a perfectly fitting model,
we would get this much variation in the y’s
transmitted from the variation in the x’s
Percent of Variation Explained
x
y
x
Our data has slightly more variation in the y’s than that.
Percent of Variation Explained
x
y
x
Variation in the ’s:
This amount of variation can
be “explained” as transmitted
from the variation in the x’s
We see some additional
variation in the y-values here.
The excess (residual
variation) is not explained by
the model.
R-squared: Percent variation Explained
(R2 is also called the “Coefficient of determination”)
• When expressed as a percentage,
R2 is “percent variation explained ”
• It is the percentage of variation in the y-values that
the model can explain from the variation in the xvalues.
Summarizing the Fit
R2 = 63.98% of the
variation in the birth
weights can be
explained by the
regression on the
gestational age.
Inference for Model Parameters
Testing Parameters (j)
These both apply for
multiple regression as
well.
Confidence Interval for j
Inference for Model Parameters
Testing Parameters (j)
We have strong
evidence that the
slope is not 0 and
hence conclude that
gestational age is a
statistically significant
predictor.
Inference for Model Parameters
Confidence Interval for j
We estimate that infants in this population gained between
64.85 g and 107.27 g per week of gestation with 95%
confidence.
Estimating E(Y|X) and Predicting
for Response Values for Individuals
• We can construct CI’s for the mean of Y for
a given value of X or for an individual with a
given value of X. For the latter case we
refer to the interval as a prediction interval
(PI).
• Both intervals we use the same point
estimate from the regression equation as
the center of the confidence interval.
Estimating E(Y|X) and Predicting
for Response Values for Individuals
Confidence Interval for the
mean birth weight of
infants in this population
with a gestational age of
34 weeks.
Prediction Interval for the
birth weight of value of an
infant with a gestational
age of 30 weeks.
Estimating E(Y|X) and Predicting for Response
Values for Individuals
Gestational Age = 30 weeks
We estimate the mean birth weight of infants born with a gestational age of
30 weeks is between 1113.78 g and 1241.19 g. (CI)
We estimate that 95% of all infants born with a gestational age of 30 weeks
will have a birth weight between 795.89 g and 1559.08 g. (PI)
Gestational Age = 34 weeks
We estimate that the mean birth weight of infants born with a gestational
age of 34 weeks is between 1435.76 g and 1607.67 g. (CI)
We estimate that 95% of all infants born with a gestational age of 34 weeks
will have a birth weight between 1135.79 g and 1907.66 g. (PI)
Checking Assumptions
1. The assumed functional form for the mean function
E(Y|X) is correct, e.g. we might assume a line for
the mean function, E(Y|X) = o1X.
Plot of residuals vs. fitted values
Lack of fit tests (when available)
2. The errors are independent, normally distributed,
with constant variance, i.e. ei ~ N(0,s2).
Plot of residuals vs. fitted values
Normal quantile plot of the residuals
Checking Assumptions
Residuals
Residuals vs. Fitted Values
0
This is the ideal plot that suggests no violation of model
assumptions. The variation is constant and there is no
evidence of lack of fit.
Checking Assumptions
Systolic vs. Diastolic Blood
Pressure – relationship looks
linear with constant variation
throughout.
Residuals vs. Fitted Values –
residuals exhibit no discernible
trend or pattern and the variation is
clearly constant throughout. This
residual plot looks ideal with the
possible exception of a few mild
outliers.
Checking Assumptions
FEV vs. Height (in.) – there is
Residuals vs. Fitted Values
some evidence of curvature in the
scatter plot.
Suggests the mean function is
not linear and the error variance
is not constant.
Checking Assumptions
CO2 vs. Time – monthly carbon
Residuals vs. Fitted Values
dioxide readings at Mauna Loa
volcano in Hawaii. Again some
curvature is evident.
Suggests the mean function is
not linear.
Checking Assumptions
Residuals vs. Fitted Values
A closer inspection of suggests the mean function is not linear
AND more importantly the residuals are NOT independent as
they are clearly related to one another. There is clearly some
type of repeating seasonal pattern in these data that needs to
be accounted for in the modeling process.
Checking Assumptions
For the birth weight and gestational age
regression example the plot of the residuals
vs. the fitted values shows no problems
model inadequacies.
Checking Assumptions
To check normality we simply construct a
normal quantile plot of the residuals.
This is the normal
quantile plot for the
residuals from the birth
weight and gestational
age study. This plot
suggests approx.
normality for the error
distribution.
Identifying Outliers
To check for outliers determine the value of 2*RMSE, i.e. 2sˆ .
Any observations outside these bands are potential outliers
and should be investigated further to determine whether or
not they adversely affect the model.
 2sˆ
0
Example 2: Evaluating two-dimensional
echocardiography (2D ECHO)
A study evaluated 2D ECHO for the
assessment of left ventricular diastolic filling.
Data was collected on half-filling fraction
(1/2FF) for 27 patients as determined by
both 2D ECHO and angiography using
diagnostic cardiac catheterization.
Q: Do these data provide evidence that 1/2FF
obtained from angiography and 2D ECHO
are not equivalent?
Example 2: Evaluating two-dimensional
echocardiography (2D ECHO)
r = .8432
(p < .0001)
Regression Estimate
Angio = .077 + .857 2D ECHO
What should o and 1 if the methods are
equivalent?
Example 2: Evaluating two-dimensional
echocardiography (2D ECHO)
R-square = 71%
RMSE = .0835
Regression is useful, p < .0001
There is no evidence of lack of fit
(p = .2963)
Intercept (o) is not significantly
different from 0 (p = .2515)
Slope (1) is significantly different from
0 (p < .0001)
Example 2: Evaluating two-dimensional
echocardiography (2D ECHO)
If equivalent the intercept (o) should be 0
and the slope (1) should be 1, we can test
these using the t-test.
INTERCEPT
H o : o  0
SLOPE
H o : 1  1
H a : o  0
H a : 1  1
.077  0
t
 1.17
.066
P (| t | 1.17)  .2515
.857  1
t
 1.31
.109
P (| t | 1.31)  .2018
Example 2: Evaluating two-dimensional
echocardiography (2D ECHO)
Checking assumptions
Residuals vs. Fitted Values
Normal quantile plot of residuals
this plot suggest no model
violations although there is one
mild outlier.
normality looks good again with the
exception of the mild outlier.
Example 2: Evaluating two-dimensional
echocardiography (2D ECHO)
Because we fail to reject the null hypothesis for
both parameters we have no evidence
against equality of the two methods for
determining the half-filling fraction (1/2FF).
What other methods could be used to answer
the question of interest?
• Dependent samples t-test (paired t-test)
• Wilcoxon signed rank test
• Sign test (definitely the worst choice)
Summary
• Type of correlation measure depends on
- data types of X and Y
- nature of the relationship (linear?)
• Simple Linear Regression
- estimate the E(Y|X)
- E(Y|X) need not a be line
- be sure to check assumptions
- variety of inferential methods
• If assumptions are violated we need to
change model or transform X and/or Y.
Download