Linear Regression and Correlation

Fitted Regression Line
[Figure: scatter plot with fitted regression line; Y = Weight (g), 80–200 g, vs. Length (cm), 54–70 cm]
Equation of the Regression Line
Y  b0  b1 X
Least squares regression line of Y on X
b1 
 ( xi  x )( yi  y )
( xi  x ) 2

b0  y  b1x
Regression Calculations
Plotting the regression line
Residuals
 Using the fitted line, it is possible to obtain an estimate of the y coordinate:
ŷi = b0 + b1·xi
The "error" in the fit is termed the "residual error":
yi − ŷi
[Figure: Weight (g) vs. Length (cm) scatter with the fitted line; a residual is marked as the vertical distance from a data point to the line]
Residual Standard Deviation
( yi  yˆ )

n2
sY | X
2
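A short sketch of the residual standard deviation, using the definition above on hypothetical data (the fit itself is delegated to np.polyfit):

```python
# Residuals and the residual standard deviation s(Y|X), df = n - 2.
# Data values are hypothetical.
import numpy as np

x = np.array([54.0, 57.0, 60.0, 63.0, 66.0, 69.0])
y = np.array([85.0, 100.0, 120.0, 138.0, 160.0, 183.0])

b1, b0 = np.polyfit(x, y, 1)       # slope, intercept
y_hat = b0 + b1 * x                # fitted values
residuals = y - y_hat              # residual errors y_i - yhat_i
s_yx = np.sqrt(np.sum(residuals ** 2) / (len(x) - 2))
print(s_yx)
```

Note that least-squares residuals always sum to (numerically) zero, which is a quick sanity check on the fit.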
[Figure: Weight (g) vs. Length (cm) scatter with the fitted regression line]
Residuals from example
Other ways to evaluate residuals
Lag plots: plot residuals vs. a time delay of the residuals, to look for temporal structure.
Look for skew in the residuals.
Kurtosis in the residuals: error not distributed "normally".
[Figure: "Model Residuals: constrained" and "Model Residuals: freely moving"; each condition compares the residual distributions of the pairwise model against the independent model]
Parametric Interpretation of regression: linear
models
Conditional Populations and Conditional
Distributions
A conditional population of Y values associated
with a fixed, or given, value of X.
A conditional distribution is the distribution of
values within the conditional population above
 Y|X  Population mean Y value for a given X
 Y | X  Population SD of Y value for a given X
The linear model
Assumptions:
Linearity
Constant standard deviation
Y  Y|X 
Y|X   0  1 X
Y   0  1 X 
Statistical inference concerning model parameters
 You can make statistical inferences on the model parameters themselves:
b0 estimates β0
b1 estimates β1
s(Y|X) estimates σ(Y|X)
Standard error of slope
 95% Confidence interval for β1:
b1 ± t0.025 · SE(b1)
where
SE(b1) = s(Y|X) / √Σ(xi − x̄)²
Hypothesis testing: is the slope
significantly different from zero?
1 = 0
Using the test statistic:
b1
ts 
SE b1
df=n-2
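The standard error, confidence interval, and test statistic above can be sketched as follows (hypothetical data; SciPy supplies the t critical value):

```python
# SE of the slope, 95% CI for beta1, and the t-test of H0: beta1 = 0.
import numpy as np
from scipy import stats

x = np.array([54.0, 57.0, 60.0, 63.0, 66.0, 69.0])
y = np.array([85.0, 101.0, 118.0, 140.0, 161.0, 181.0])  # hypothetical

n = len(x)
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
s_yx = np.sqrt(np.sum(resid ** 2) / (n - 2))           # residual SD
se_b1 = s_yx / np.sqrt(np.sum((x - x.mean()) ** 2))    # SE of slope

t_crit = stats.t.ppf(0.975, df=n - 2)                  # t_{0.025}, df = n - 2
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)        # 95% CI for beta1
t_s = b1 / se_b1                                       # test statistic
print(ci, t_s)
```

scipy.stats.linregress(x, y) reports the same slope and standard error, which makes a convenient check.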
Coefficient of Determination
 r², or coefficient of determination: how much of the variance in the data is accounted for by the linear model.
r² = 1 − Σ(yi − ŷi)² / Σ(yi − ȳ)²
A line with a high r² "captures" most of the data variance.
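A sketch of r² from its definition, on hypothetical data:

```python
# r^2 = 1 - SS_residual / SS_total for a fitted line.
import numpy as np

x = np.array([54.0, 57.0, 60.0, 63.0, 66.0, 69.0])
y = np.array([85.0, 101.0, 118.0, 140.0, 161.0, 181.0])  # hypothetical

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(r2)
```

For simple linear regression, r² equals the squared correlation coefficient between x and y.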
Correlation Coefficient
r is symmetric under exchange of x and y.
b1 = r · (sY / sX)
r = Σ(xi − x̄)(yi − ȳ) / √( Σ(xi − x̄)² · Σ(yi − ȳ)² )
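A sketch of r and its relation to the slope, on hypothetical data:

```python
# Correlation coefficient r from its definition, and the identity
# b1 = r * sY / sX relating it to the regression slope.
import numpy as np

x = np.array([54.0, 57.0, 60.0, 63.0, 66.0, 69.0])
y = np.array([85.0, 101.0, 118.0, 140.0, 161.0, 181.0])  # hypothetical

dx, dy = x - x.mean(), y - y.mean()
r = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))
b1 = r * y.std(ddof=1) / x.std(ddof=1)   # slope recovered from r
print(r, b1)
```

Unlike the slope, r is unchanged if x and y are swapped.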
What’s this?
It adjusts r² to compensate for the fact that adding even uncorrelated variables to the regression improves r².
Statistical inference on correlations
Like the slope, one can define a t-statistic for correlation coefficients:
ts = b1 / SE(b1) = r · √( (n − 2) / (1 − r²) )
Consider the following "Spike Triggered Averages":
[Figure: spike-triggered average traces for several cells; voltage (scales from ±500 μV to ±10 mV) vs. time (−5 to 15 ms)]
STA example
r = 0.25. Is this correlation significant?
ts = b1 / SE(b1) = r · √( (n − 2) / (1 − r²) )
N = 446: ts = 0.25 · √(444 / (1 − 0.25²)) ≈ 5.44, with df = 444, so the correlation is significant.
[Figure: scatter plot of the correlated data]
When is Linear Regression Inadequate?
Curvilinearity
Outliers
Influential points
Curvilinearity
[Figure: data with a curvilinear trend (x ≈ 60–70), which a straight line fits poorly]
Outliers
 Can reduce correlations and unduly influence the regression line
 You can "throw out" some clear outliers
 There are a variety of tests to use. Example: Grubbs' test:
Z = |value − mean| / SD
 Look up the critical Z value in a table
 Is your Z value larger? Then the difference is significant and the data point can be discarded.
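A Grubbs-style check can be sketched as below; the data are made up, and the critical value (≈ 2.02 for n = 7 at α = 0.05) is taken from a published table:

```python
# Grubbs-style outlier check: Z = |x_i - mean| / SD for the most
# extreme point, compared against a tabulated critical value.
import numpy as np

data = np.array([120.0, 118.0, 123.0, 121.0, 119.0, 122.0, 160.0])  # hypothetical

z = np.abs(data - data.mean()) / data.std(ddof=1)
suspect = data[np.argmax(z)]   # candidate outlier (most extreme point)
z_max = z.max()

z_critical = 2.02  # critical value for n = 7, alpha = 0.05 (from a table)
if z_max > z_critical:
    print(f"{suspect} is a significant outlier (Z = {z_max:.2f})")
```

Here 160.0 exceeds the critical value and would be a candidate for discarding.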
Influential points
Points that have a lot of influence on the regressed model.
Not really an outlier, as its residual is small.
[Figure: regression example with an influential point; y ≈ 50–300, x ≈ 50–105]
Conditions for inference
 Design conditions
 Random subsampling model: for each x observed, y is viewed as
randomly chosen from the distribution of Y values for that X
 Bivariate random sampling: each observed (x,y) pair must be
independent of the others. Experimental structure must not
include pairing, blocking, or an internal hierarchy.
 Conditions on parameters
μ(Y|X) = β0 + β1·X
σ(Y|X) is not a function of X
 Conditions concerning population distributions
Same SD for all levels of X
Independent observations
Normal distribution of Y for each fixed X
Random samples
Error Bars on Coefficients of Model
MANOVA and ANCOVA
MANOVA
Multivariate Analysis of Variance
Developed as a theoretical construct by
S.S. Wilks in 1932
 Key to assessing differences in groups
across multiple metric dependent
variables, based on a set of categorical
(non-metric) variables acting as
independent variables.
MANOVA vs ANOVA
 ANOVA
Y1 = X1 + X2 + X3 + ... + Xn
(one metric DV; non-metric IVs)
 MANOVA
Y1 + Y2 + ... + Yn = X1 + X2 + X3 + ... + Xn
(multiple metric DVs; non-metric IVs)
ANOVA Refresher
Source    SS             df     MS                     F
Between   SS(B)          k−1    MS(B) = SS(B)/(k−1)    F = MS(B)/MS(W)
Within    SS(W)          N−k    MS(W) = SS(W)/(N−k)
Total     SS(B)+SS(W)    N−1

Reject the null hypothesis if the test statistic is greater than the critical F value with k−1 numerator and N−k denominator degrees of freedom. If you reject the null, at least one of the group means is different.
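The table above can be sketched in code; the three groups are made up:

```python
# One-way ANOVA: between/within sums of squares, mean squares, F.
import numpy as np
from scipy import stats

groups = [np.array([4.0, 5.0, 6.0]),
          np.array([7.0, 8.0, 9.0]),
          np.array([4.0, 6.0, 8.0])]   # hypothetical data

k = len(groups)                        # number of groups
N = sum(len(g) for g in groups)        # total observations
grand_mean = np.concatenate(groups).mean()

ss_b = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)  # SS(B)
ss_w = sum(np.sum((g - g.mean()) ** 2) for g in groups)            # SS(W)
ms_b = ss_b / (k - 1)                  # MS(B) = SS(B)/(k-1)
ms_w = ss_w / (N - k)                  # MS(W) = SS(W)/(N-k)
F = ms_b / ms_w
print(F)
```

scipy.stats.f_oneway(*groups) returns the same F along with its p-value.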
MANOVA Guidelines
Assumptions the same as ANOVA
Additional condition of multivariate
normality
all variables and all combinations of the
variables are normally distributed
Assumes equal covariance matrices
(standard deviations between variables
should be similar)
Example
 The first group receives technical dietary information interactively from an on-line website. Group 2 receives the same information from a nurse practitioner, while group 3 receives the information from a video tape made by the same nurse practitioner.
 Users rate the instruction based on usefulness, difficulty, and importance.
 Note: three indexing independent variables and three metric dependent variables.
Hypotheses
H0: There is no difference among the treatment groups (on-line learners, oral learners, and visual learners).
HA: There is a difference.
Order of operations
MANOVA Output 2
Individual ANOVAs not significant
MANOVA output
Overall multivariate effect is significant
Post hoc tests to find the culprit
Once more, with feeling: ANCOVA
Analysis of covariance
Hybrid of regression analysis and ANOVA
style methods
Suppose you have pre-existing differences between subjects.
Suppose two experimental conditions, A and B; you could test half your subjects with AB (A then B) and the other half with BA, using a repeated-measures design.
Why use?
 Suppose there exists a particular variable that *explains*
some of what’s going on in the dependent variable in an
ANOVA style experiment.
 Removing the effects of that variable can help you determine if the categorical difference is "real" or simply depends on this variable.
 In a repeated measures design, suppose the following
situation: sequencing effects, where performing A first
impacts outcomes in B.
Example: A and B represent different learning
methodologies.
 ANCOVA can compensate for systematic biases among
samples (if sorting produces unintentional correlations in
the data).
Example
Results
Second Example
How do the amount spent on groceries and the amount one intends to spend depend on a subject's sex?
H0: no dependence
Two analyses:
MANOVA to look at the dependence
ANCOVA to determine whether there is significant covariance between intended spending and actual spending
MANOVA
Results
ANCOVA
ANCOVA Results
So if you remove the amount the subjects intend to spend from the equation, there is no significant difference in spending. The spending difference is not a result of "impulse buys", it seems.
Principal Component Analysis
Say you have time series data, characterized by multiple channels or trials. Is there a set of factors underlying the data that explains it (is there a simpler explanation for the observed behavior)?
In other words, can you infer the quantities that are supplying variance to the observed data, rather than testing *whether* known factors supply the variance?
Example: 8 channels of recorded
EMG activity
[Figure: 8 channels of recorded EMG activity; time axis 0–2500]
PCA works by “rotating” the data
(considering a time series as a spatial
vector) to a “position” in the abstract space
that minimizes covariance.
Don’t worry about what this means.
Note how a single component explains almost all of the variance in the 8 EMGs recorded.
Next step would be to correlate
these components with some
other parameter in the experiment.
[Figure: neural firing rates plotted alongside the largest PC; time axis 0–2500]
Some additional uses:
Say you have a very large data set, but believe there are some common features uniting that data set.
Use a PCA-type analysis to identify those common features.
Retain only the most important components to describe a "reduced" data set.
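An SVD-based PCA sketch in the spirit of the EMG example; the channels below are simulated stand-ins, built from one shared underlying factor plus independent noise:

```python
# PCA via SVD: center each channel, decompose, and inspect how much
# variance the leading component captures. Channels are simulated.
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 2500, 500)
common = np.sin(2 * np.pi * t / 1000)                   # one shared factor
channels = np.outer(rng.uniform(0.5, 2.0, 8), common)   # 8 channels driven by it
channels += 0.05 * rng.normal(size=channels.shape)      # small independent noise

X = channels - channels.mean(axis=1, keepdims=True)     # center each channel
U, s, Vt = np.linalg.svd(X, full_matrices=False)
var_explained = s ** 2 / np.sum(s ** 2)
print(var_explained[0])  # fraction of variance captured by the first PC
```

Because all 8 channels share a single factor, the first component absorbs nearly all of the variance; keeping only the leading rows of Vt gives the "reduced" description.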