Regression Instruction

Linear correlation and linear regression + summary of tests
Week 12 Regression Theory
Recall: Covariance
$$\operatorname{cov}(x,y) = \frac{\sum_{i=1}^{n}(x_i - \bar{X})(y_i - \bar{Y})}{n-1}$$
Interpreting Covariance
• cov(X,Y) > 0: X and Y are positively correlated
• cov(X,Y) < 0: X and Y are inversely correlated
• cov(X,Y) = 0: X and Y are uncorrelated (independence implies zero covariance, but zero covariance does not by itself imply independence)
Correlation coefficient
• Pearson's Correlation Coefficient is standardized covariance (unitless):

$$r = \frac{\operatorname{cov}(x,y)}{\sqrt{\operatorname{var}(x)\operatorname{var}(y)}}$$
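A minimal NumPy sketch (added for illustration, with made-up data) that computes the covariance and Pearson's r exactly as defined above:

```python
import numpy as np

# Hypothetical example data
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

n = len(x)
# Covariance: sum of products of deviations, divided by n - 1
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
# Pearson's r: covariance standardized by the two standard deviations
r = cov_xy / np.sqrt(np.var(x, ddof=1) * np.var(y, ddof=1))

print(cov_xy, r)
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in Pearson r should agree
```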
Recall dice problem…
Var(x) = 2.916666
Var(y) = 5.83333
Cov(x,y) = 2.91666

$$r = \frac{2.91666}{\sqrt{2.91666 \times 5.8333}} = \frac{1}{\sqrt{2}} \approx .707$$

R² = "Coefficient of Determination" = SSexplained/TSS

$$R^2 = .707^2 = .5$$
• Interpretation of R²: 50% of the total variation in the sum of the two dice is explained by the roll on the first die. Makes perfect intuitive sense!
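A quick simulation (a sketch added here, not part of the original example) confirms the dice result empirically:

```python
import numpy as np

rng = np.random.default_rng(0)
d1 = rng.integers(1, 7, size=1_000_000)  # first die (X)
d2 = rng.integers(1, 7, size=1_000_000)  # second die
s = d1 + d2                              # sum of the two dice (Y)

r = np.corrcoef(d1, s)[0, 1]
print(r, r**2)  # ~0.707 and ~0.5, matching 1/sqrt(2) and R^2 = .5
```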
Correlation
• Measures the relative strength of the linear
relationship between two variables
• Unit-less
• Ranges between –1 and 1
• The closer to –1, the stronger the negative linear relationship
• The closer to 1, the stronger the positive linear relationship
• The closer to 0, the weaker the linear relationship
Scatter Plots of Data with Various Correlation Coefficients
[Scatter plots of Y vs. X illustrating r = -1, r = -.6, r = 0, r = +.3, r = +1, and a curvilinear pattern with r = 0]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Linear Correlation
Linear relationships vs. curvilinear relationships
[Scatter plots of Y vs. X contrasting linear relationships with curvilinear relationships]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Linear Correlation
Strong relationships vs. weak relationships
[Scatter plots of Y vs. X contrasting strong linear relationships with weak ones]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Linear Correlation
No relationship
[Scatter plots of Y vs. X showing no relationship]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Linear regression
http://www.math.csusb.edu/faculty/stanton/m262/regress/regress.html
In correlation, the two variables are treated as equals. In regression, one variable is considered the independent (=predictor) variable (X) and the other the dependent (=outcome) variable (Y).
What is “Linear”?
• Remember this: Y = mX + B?
  (m = slope, B = intercept)
What’s Slope?
A slope of 2 means that every 1-unit change in X
yields a 2-unit change in Y.
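For example, if Y = 2X + 5, then X = 3 gives Y = 11 and X = 4 gives Y = 13: each 1-unit step in X moves Y by 2 units.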
Simple linear regression
[Example scatter plot: Love of Math vs. math SAT score; P = .22, not significant]
The linear regression model: Love of Math = 5 (intercept) + .01 (slope) × math SAT score
Prediction
If you know something about X, this knowledge helps you
predict something about Y. (Sound familiar?…sound
like conditional probabilities?)
Linear Regression Model
Y's are modeled…

$$Y_i = 100 \cdot X_i + \text{random error}_i$$

The 100·X_i part is fixed (it falls exactly on the line); the random error follows a normal distribution.
Assumptions (or the fine print)
• Linear regression assumes that…
– 1. The relationship between X and Y is linear
– 2. Y is distributed normally at each value of X
– 3. The variance of Y at every value of X is the same
(homogeneity of variances)
• Why? The math requires it: the mathematical process is called "least squares" because it fits the regression line by minimizing the squared errors from the line (mathematically easy, but not general; it relies on the above assumptions).
Expected value of y at the level of x, $x_i$:

$$\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$$
Residual

$$e_i = y_i - \hat{y}_i = y_i - (\hat{\alpha} + \hat{\beta} x_i)$$

We fit the regression coefficients such that the sum of the squared residuals is minimized (least squares regression), as sketched below.
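A minimal sketch (with simulated data) of the closed-form least squares fit, using the fact that the fitted slope is cov(x, y)/var(x):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 5 + 0.8 * x + rng.normal(scale=2.0, size=100)  # true intercept 5, slope 0.8

# Closed-form least squares estimates
beta_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha_hat = y.mean() - beta_hat * x.mean()

y_hat = alpha_hat + beta_hat * x   # predicted values
e = y - y_hat                      # residuals
print(alpha_hat, beta_hat)         # ~5 and ~0.8
print(np.sum(e**2))                # the quantity least squares minimizes
```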
Residual
Residual = observed value – predicted value
[Example plot: predicted value = 33.5 weeks]
Residual Analysis: check assumptions
$$e_i = Y_i - \hat{Y}_i$$
• The residual for observation i, ei, is the difference between
its observed and predicted value
• Check the assumptions of regression by examining the
residuals
– Examine for linearity assumption
– Examine for constant variance for all levels of X (homoscedasticity)
– Evaluate normal distribution assumption
– Evaluate independence assumption
• Graphical analysis of residuals
– Can plot residuals vs. X (see the sketch below)
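A sketch of that graphical check on simulated data, assuming matplotlib is available:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=200)
y = 5 + 0.8 * x + rng.normal(scale=2.0, size=200)

beta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha = y.mean() - beta * x.mean()
e = y - (alpha + beta * x)  # residuals

# Residuals vs. X: look for curvature (non-linearity), fanning
# (non-constant variance), or systematic drift (non-independence)
plt.scatter(x, e, s=10)
plt.axhline(0, color="gray")
plt.xlabel("X")
plt.ylabel("residual")
plt.show()
```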
Residual Analysis for Linearity
[Paired plots of Y vs. x and residuals vs. x contrasting a not-linear case with a linear case]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Residual Analysis for Homoscedasticity
[Paired plots of Y vs. x and residuals vs. x contrasting non-constant variance with constant variance]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall

Residual Analysis for Independence
[Plots of residuals vs. X contrasting not-independent residuals with independent residuals]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
As a linear regression…
• The intercept represents the mean value in the even-day group. It is significantly different from 0, so the average English SAT score is not 0.
• The slope represents the difference in means between the odd-day and even-day groups. The difference is significant.
Parameter    Estimate       Standard Error   t Value   Pr > |t|
Intercept    657.5000000    23.66105065      27.79     <.0001
OddDay        81.7307692    32.81197359       2.49     0.0204
Multiple Linear Regression
• More than one predictor…
=  + 1*X + 2 *W + 3 *Z
Each regression coefficient is the amount of change in
the outcome variable that would be expected per
one-unit change of the predictor, if all other variables
in the model were held constant.
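A minimal sketch of a multiple regression fit with NumPy (simulated data; the predictor names follow the equation above):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
XWZ = rng.normal(size=(n, 3))                   # three predictors: X, W, Z
beta_true = np.array([1.0, -2.0, 0.5])
y = 10 + XWZ @ beta_true + rng.normal(size=n)   # alpha = 10, plus noise

A = np.column_stack([np.ones(n), XWZ])          # design matrix with intercept
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # ~[10, 1, -2, 0.5]; each slope holds the other predictors constant
```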
ANOVA is linear regression!
A categorical variable with more than two groups:
E.g.: groups 1, 2, and 3 (mutually exclusive)
=  (=value for group 1) + 1*(1 if in group 2) + 2
*(1 if in group 3)
This is called "dummy coding," where multiple binary variables are created to represent being in each category (or not) of a categorical variable.
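A sketch of dummy coding for three groups (made-up group means), showing that the fitted intercept recovers the group-1 mean and the betas recover the differences from it:

```python
import numpy as np

rng = np.random.default_rng(4)
group = rng.integers(1, 4, size=300)          # group labels 1, 2, 3
means = {1: 50.0, 2: 55.0, 3: 42.0}           # hypothetical group means
y = np.array([means[g] for g in group]) + rng.normal(size=300)

# Dummy coding: group 1 is the reference; one 0/1 column per other group
D = np.column_stack([np.ones(300),
                     (group == 2).astype(float),
                     (group == 3).astype(float)])
coef, *_ = np.linalg.lstsq(D, y, rcond=None)
print(coef)  # ~[50, 5, -8]: intercept = group-1 mean; betas = differences
```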
Other types of multivariate regression
• Multiple linear regression is for normally distributed outcomes
• Logistic regression is for binary outcomes
• Cox proportional hazards regression is used when time-to-event is the outcome
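A brief sketch contrasting the first two, assuming the statsmodels package (not referenced in the original slides) is available:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.normal(size=200)
X = sm.add_constant(x)                 # adds the intercept column

# Continuous outcome -> ordinary least squares (linear regression)
y_cont = 2.0 + 1.5 * x + rng.normal(size=200)
print(sm.OLS(y_cont, X).fit().params)  # ~[2.0, 1.5]

# Binary outcome -> logistic regression (coefficients on the log-odds scale)
p = 1 / (1 + np.exp(-(0.5 + 1.0 * x)))
y_bin = rng.binomial(1, p)
print(sm.Logit(y_bin, X).fit(disp=0).params)  # ~[0.5, 1.0]
```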
Overview of statistical tests
The following table gives the appropriate choice of a
statistical test or measure of association for various
types of data (outcome variables and predictor
variables) by study design.
e.g., blood pressure (continuous outcome) = pounds (continuous predictor) + age (continuous predictor) + treatment 1/0 (binary predictor)
Alternative summary: statistics for various types of outcome data

Are the observations independent or correlated?

Outcome variable: Continuous (e.g. pain scale, cognitive function)
– Independent: Ttest; ANOVA; Linear correlation; Linear regression
– Correlated: Paired ttest; Repeated-measures ANOVA; Mixed models/GEE modeling
– Assumptions: Outcome is normally distributed (important for small samples). Outcome and predictor have a linear relationship.

Outcome variable: Binary or categorical (e.g. fracture yes/no)
– Independent: Difference in proportions; Relative risks; Chi-square test; Logistic regression
– Correlated: McNemar's test; Conditional logistic regression; GEE modeling
– Assumptions: Chi-square test assumes sufficient numbers in each cell (>=5).

Outcome variable: Time-to-event (e.g. time to fracture)
– Independent: Kaplan-Meier statistics; Cox regression
– Correlated: n/a
– Assumptions: Cox regression assumes proportional hazards between groups.
Continuous outcome (means)

Are the observations independent or correlated?

Outcome variable: Continuous (e.g. pain scale, cognitive function)

Independent:
– Ttest: compares means between two independent groups
– ANOVA: compares means between more than two independent groups
– Pearson's correlation coefficient (linear correlation): shows linear correlation between two continuous variables
– Linear regression: multivariate regression technique used when the outcome is continuous; gives slopes

Correlated:
– Paired ttest: compares means between two related groups (e.g., the same subjects before and after)
– Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
– Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time

Alternatives if the normality assumption is violated (and small sample size), non-parametric statistics:
– Wilcoxon sign-rank test: non-parametric alternative to the paired ttest
– Wilcoxon sum-rank test (=Mann-Whitney U test): non-parametric alternative to the ttest
– Kruskal-Wallis test: non-parametric alternative to ANOVA
– Spearman rank correlation coefficient: non-parametric alternative to Pearson's correlation coefficient
Binary or categorical outcomes (proportions)

Are the observations correlated?

Outcome variable: Binary or categorical (e.g. fracture, yes/no)

Independent:
– Chi-square test: compares proportions between more than two groups
– Relative risks: odds ratios or risk ratios
– Logistic regression: multivariate technique used when outcome is binary; gives multivariate-adjusted odds ratios

Correlated:
– McNemar's chi-square test: compares binary outcome between correlated groups (e.g., before and after)
– Conditional logistic regression: multivariate regression technique for a binary outcome when groups are correlated (e.g., matched data)
– GEE modeling: multivariate regression technique for a binary outcome when groups are correlated (e.g., repeated measures)

Alternatives to the chi-square test if sparse cells:
– Fisher's exact test: compares proportions between independent groups when there are sparse data (some cells <5)
– McNemar's exact test: compares proportions between correlated groups when there are sparse data (some cells <5)
Time-to-event outcome (survival data)

Are the observation groups independent or correlated?

Outcome variable: Time-to-event (e.g., time to fracture)

Independent:
– Kaplan-Meier statistics: estimates survival functions for each group (usually displayed graphically); compares survival functions with the log-rank test
– Cox regression: multivariate technique for time-to-event data; gives multivariate-adjusted hazard ratios

Correlated: n/a (already over time)

Modifications to Cox regression if proportional hazards is violated: time-dependent predictors or time-dependent hazard ratios (tricky!)