class-20

VECTOR PROJECTIONS
[Figure: projection of vector $y$ onto vector $x$; the perpendicular from $y$ meets $x$ at a 90° angle. The length of the projection is]
$$L_{y,x} = \frac{y \cdot x}{|x|}$$
MATRIX OPERATION: INVERSE MATRIX
Important for solving a set of linear equations is the matrix operation that
defines the inverse of a matrix.
$X^{-1}$: inverse matrix of $X$
$X^{-1} X = I$
where I is the identity matrix:
all entries on the diagonal are 1,
all others 0
$$I = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \qquad \text{(here for a } 3 \times 3 \text{ matrix)}$$
MATRIX OPERATION: INVERSE MATRIX
Important for solving a set of linear equations is the matrix operation that
defines the inverse of a matrix.
$X^{-1}$: inverse matrix of $X$
$X^{-1} X = I$, where $I$ is the identity matrix
Not all matrices have an inverse matrix,
and there is no simple rule for how to calculate the entries of an inverse matrix!
We skip the formal mathematical aspects and note here only the important facts:
For symmetric square matrices like covariance matrices or correlation matrices,
the inverse exists (provided the matrix is non-singular, i.e. has full rank).
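In R, the inverse of such a (non-singular) square matrix can be computed with the built-in solve() function. A minimal sketch with a small made-up covariance-like matrix:

A <- matrix(c(2.0, 0.5,
              0.5, 1.0), nrow = 2, byrow = TRUE)   # made-up symmetric 2 x 2 matrix
Ainv <- solve(A)        # solve() without a right-hand side returns the inverse of A
Ainv %*% A              # reproduces the 2 x 2 identity matrix (up to rounding)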
SUMMARY
Simple Linear Regression
Principal Component Analysis
SUMMARY
2-dimensional sample space:
Simple Linear Regression:
Minimizes the summed squared errors
(measured in the vertical direction between
the fitted regression line and the observed data points)
Principal Component Analysis:
Finds the direction of the vector that
maximizes the variance of the data
projected onto this vector.
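A minimal R sketch (with made-up 2-dimensional data) contrasting the two approaches: lm() minimizes the vertical squared errors, while the first eigenvector from prcomp() points in the direction of maximum variance:

set.seed(1)
x <- rnorm(100)
y <- 0.8 * x + rnorm(100, sd = 0.5)   # made-up linearly related data

fit <- lm(y ~ x)                      # simple linear regression
coef(fit)[2]                          # regression slope (vertical least squares)

pca <- prcomp(cbind(x, y))            # principal component analysis
pca$rotation[, 1]                     # 1st eigenvector: direction of maximum variance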
REGRESSION ANALYSIS IN R
Simple linear regression in R:
the function res<-lm( y ~ x )
calculates the linear regression line
It returns a number of useful additional
statistical measures of the quality of the
regression line.
Regression line using res$fitted
Residuals (errors) res$residuals
Remember: we assumed that the errors
are uncorrelated with the 'predictor'
variable x. It is recommended to check
that the errors themselves do NOT have an
organized structure when plotted over x.
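A minimal sketch of this check (assuming vectors x and y are already in the workspace):

res <- lm(y ~ x)            # fit the simple linear regression
plot(x, y)                  # observed data
abline(res)                 # fitted regression line
plot(x, res$residuals)      # residuals over x: should show no organized structure
abline(h = 0, lty = 2)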
Histogram of residuals (errors) hist(res$residuals)
Remember: we assumed that the errors
are uncorrelated with the 'predictor'
variable x. It is also recommended to check
whether the errors follow a Gaussian
(bell-shaped) distribution.
Note: the function fgauss() is defined in myfunctions.R [call source("scripts/myfunctions.R")].
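The course helper fgauss() (from myfunctions.R) overlays a Gaussian curve on the histogram; it is not reproduced here. A base-R sketch of the same check:

hist(res$residuals, freq = FALSE)              # histogram of the residuals
qqnorm(res$residuals); qqline(res$residuals)   # points close to the line suggest Gaussian errors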
LINEAR REGRESSION STATISTICS
When applying linear regression, a number of test statistics are
calculated in R’s lm() function.
Slope of the regression line: the regression parameter (slope).
Statistical significance: the smaller the p-value, the higher the significance of the linear relationship (slope > 0).
Correlation coefficient between the fitted y-values and the observed y-values.
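Where these statistics can be found in R (a sketch; res is the lm() fit from above):

summary(res)                 # coefficient table with slope, standard error, t value, p-value
coef(res)[2]                 # slope of the regression line
cor(res$fitted, y)           # correlation coefficient between fitted and observed y-values
summary(res)$r.squared       # its square, reported as R-squared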
LINEAR REGRESSION:
USE LINEAR REGRESSION WITH CAUTION!
The sample space is important!
If you only observed x and y in a
limited range or a subdomain of the
sample space, extrapolation can give misleading results.
Outliers can have a large effect
and suggest a linear relationship
where there is none!
The influence of single outlier observations can be tested.
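One standard way to test the influence of single observations in R is Cook's distance (a sketch, using the lm() fit res from above):

d <- cooks.distance(res)    # influence of each single observation on the fitted line
plot(d, type = "h")         # large values flag potentially influential outliers
which(d > 4 / length(d))    # a common rule-of-thumb cutoff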
MULTIPLE LINEAR REGRESSION
Predictand (e.g. Albany Airport temperature anomalies)
Random error (noise)
Predictors:
e.g. temperatures from nearby stations,
or indices of large-scale climate modes such as the El Niño–Southern Oscillation and the North Atlantic Oscillation,
or prescribed time-dependent functions such as a linear trend, periodic oscillations, polynomials.
Source: http://reliawiki.org/index.php/Multiple_Linear_Regression_Analysis
(figures retrieved April 2014)
MULTIPLE LINEAR REGRESSION
Write a set of linear equations, one for each observation in the sample (e.g. one for each year of temperature observations).
Source: http://reliawiki.org/index.php/Multiple_Linear_Regression_Analysis
(figures retrieved April 2014)
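Written out (a standard form, with $n$ observations and $k$ predictor columns; the first column of $X$ is often a column of ones representing the intercept):
$$\begin{aligned}
y_1 &= \beta_1 x_{11} + \beta_2 x_{12} + \dots + \beta_k x_{1k} + \varepsilon_1\\
y_2 &= \beta_1 x_{21} + \beta_2 x_{22} + \dots + \beta_k x_{2k} + \varepsilon_2\\
&\;\;\vdots\\
y_n &= \beta_1 x_{n1} + \beta_2 x_{n2} + \dots + \beta_k x_{nk} + \varepsilon_n
\end{aligned}$$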
MULTIPLE LINEAR REGRESSION
Or, in short matrix notation:
$$y = X\beta + \varepsilon$$
Source: http://reliawiki.org/index.php/Multiple_Linear_Regression_Analysis
(figures retrieved April 2014)
MULTIPLE LINEAR REGRESSION
$$y = X\beta + \varepsilon$$
Size of the vectors / matrices: $y$ is $n \times 1$, $X$ is $n \times k$, $\beta$ is $k \times 1$, $\varepsilon$ is $n \times 1$.
The mathematical problem we need to solve is:
Given all the observations of the predictand (stored in vector $y$) and the
predictor variables (stored in matrix $X$), we want to simultaneously find a
proper scaling factor for each predictor variable, such that the fitted
values minimize the sum of the squared errors.
Source: http://reliawiki.org/index.php/Multiple_Linear_Regression_Analysis
(figures retrieved April 2014)
MULTIPLE LINEAR REGRESSION
$$y = X\beta + \varepsilon, \qquad \hat{y} = X\hat{\beta}$$
$$\hat{\beta} = (X^T X)^{-1} X^T y$$
Size of the vectors / matrices: $\hat{\beta}$ is $k \times 1$; $X^T X$ is $(k \times n)(n \times k) = k \times k$; $X^T y$ is $(k \times n)(n \times 1) = k \times 1$.
In $X^T X$ we find the covariance matrix (scaled by n) of the predictor variables; the '$-1$' indicates another fundamentally important matrix operation: the inverse of a matrix.
$X^T y$ is the covariance (scaled by n) of all predictors with the predictand.
Source: http://reliawiki.org/index.php/Multiple_Linear_Regression_Analysis
(figures retrieved April 2014)
MULTIPLE LINEAR REGRESSION
The resulting $k \times 1$ matrix
(i.e. vector) contains a
proper scaling factor
for each predictor.
In other words: multiple linear
regression is a weighted sum
of the predictors (after conversion
into units of the predictand y).
$$y = X\beta + \varepsilon, \qquad \hat{y} = X\hat{\beta}$$
$$\hat{\beta} = (X^T X)^{-1} X^T y \qquad (k \times 1)$$
Source: http://reliawiki.org/index.php/Multiple_Linear_Regression_Analysis
(figures retrieved April 2014)
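A minimal R sketch of this formula (made-up data with two predictors), compared against the lm() result:

set.seed(2)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)                                     # made-up predictors
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.3)   # made-up predictand

X <- cbind(1, x1, x2)                              # n x k predictor matrix (1st column: intercept)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)          # beta = (X^T X)^{-1} X^T y
beta_hat
coef(lm(y ~ x1 + x2))                              # the same scaling factors from lm()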
EXAMPLE MULTIPLE LINEAR REGRESSION
WITH 2 PREDICTORS
The scatter cloud shows a linear
dependence of the values in y
along the two predictor
dimensions x1 and x2.
TIPS FOR MULTIPLE LINEAR REGRESSION (MLR)
- General rule: work with as few predictors as possible (every time you add a new predictor, you increase the risk of over-fitting the model).
- Observe how well the fitted values $\hat{y}$ and the observed values $y$ match (correlation).
- Choose predictors that provide independent information about the predictand.
- The problem of collinearity: if the predictors are all highly correlated with each other, then the MLR can become very ambiguous (because it gets harder to accurately calculate the inverse of the covariance matrix); see the sketch after this list.
- Last but not least: the regression coefficients from the MLR are not 'unique'. If you add or remove one predictor, all regression coefficients can change.
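A quick sketch for checking how correlated the predictors are before fitting (using the made-up predictors x1 and x2 from the example above):

cor(cbind(x1, x2))     # correlation matrix of the predictors: values near +/-1 signal collinearity
kappa(cbind(x1, x2))   # condition number of the predictor matrix: very large values signal an ill-conditioned (ambiguous) problem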
PRINCIPAL COMPONENT ANALYSIS
Global Sea Surface Temperatures from voluntary ship observations.
Colors show the percentage of months with at least one observation in a 2 by 2 degree grid box.
From a paper in Annual Review of Marine Science (2010).
PRINCIPAL COMPONENT ANALYSIS
Global Sea Surface Temperatures: climatology 1982-2008.
Red areas mark regions with the highest SST variability.
PRINCIPAL COMPONENT ANALYSIS
Global Sea Surface Temperatures
Principal Component Analysis (PCA)
(Empirical Orthogonal Functions (EOF))
The first leading eigenvector:
The eigenvectors now form a geographic pattern. Grid boxes with high
positive values and grid boxes with large negative values covary out of phase
(negative correlation). Green regions show small variations in this
eigenvector #1.
The Principal Component is a time series
showing the temporal evolution of the SST
variations. This mode is associated
with the El Niño–Southern Oscillation.
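A hedged sketch of how such an analysis can be set up in R with prcomp(), assuming a matrix sst (an illustrative name, not from the source) with one row per time step and one column per grid box:

pca  <- prcomp(sst, center = TRUE, scale. = FALSE)   # PCA / EOF analysis of the SST anomalies
eof1 <- pca$rotation[, 1]   # 1st eigenvector (EOF): geographic pattern, one value per grid box
pc1  <- pca$x[, 1]          # 1st principal component: time series of this mode
plot(pc1, type = "l")       # temporal evolution of the leading SST mode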