VECTOR PROJECTIONS

[Figure: projection of the vector y onto the vector x; the projection length L_y,x is obtained by dropping a perpendicular (90°) from y onto x.]

MATRIX OPERATION: INVERSE MATRIX

Important for solving a set of linear equations is the matrix operation that defines the inverse of a matrix.

X⁻¹ : inverse matrix of X
X⁻¹ X = I

where I is the identity matrix: all entries on the diagonal are 1, all others 0.

    I = 1 0 0
        0 1 0
        0 0 1      (here for a 3 x 3 matrix)

Not all matrices have an inverse matrix, and there is no simple rule for how to calculate the entries of an inverse matrix! We skip the formal mathematical aspects and note here only the important facts: for symmetric square matrices like covariance matrices or correlation matrices the inverse exists (provided no variable is an exact linear combination of the others).

SUMMARY: SIMPLE LINEAR REGRESSION AND PRINCIPAL COMPONENT ANALYSIS

In a 2-dimensional sample space:
- Simple Linear Regression: minimizes the summed squared errors (measured in the vertical direction between the fitted regression line and the observed data points).
- Principal Component Analysis: finds the direction of the vector that maximizes the variance of the data projected onto this vector.

REGRESSION ANALYSIS IN R

Simple linear regression in R: the function call res <- lm(y ~ x) calculates the linear regression line. It returns a number of useful additional statistical measures of the quality of the regression line.
- Fitted regression line: res$fitted
- Residuals (errors): res$residuals
Remember: we assumed that the errors are uncorrelated with the ‘predictor’ variable x. It is recommended to check that the errors themselves do NOT show an organized structure when plotted over x.
Histogram of the residuals (errors): hist(res$residuals). It is also recommended to check whether the errors follow a Gaussian (bell-shaped) distribution. Note: the function fgauss() is defined in myfunctions.R (call source("scripts/myfunctions.R") first). A worked example sketch follows after the cautionary notes below.

LINEAR REGRESSION STATISTICS

When applying linear regression, a number of test statistics are calculated by R's lm() function:
- Regression parameter: the slope of the regression line.
- Statistical significance (p-value): the smaller the value, the higher the significance of the linear relationship (slope different from 0).
- Correlation coefficient between the fitted y-values and the observed y-values.

LINEAR REGRESSION: USE LINEAR REGRESSION WITH CAUTION!

- Outliers can have a large effect and suggest a linear relationship where there is none! The influence of single outlier observations can be tested.
- The sample space is important! If you observed x and y only in a limited range or a subdomain of the sample space, extrapolation can give misleading results.
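A minimal worked sketch of the simple regression workflow described above, assuming x and y are numeric vectors already loaded in the R session (variable names and plot choices are illustrative, not from the original slides; fgauss() is omitted because its definition is only available in myfunctions.R):

    # fit the simple linear regression line
    res <- lm(y ~ x)
    summary(res)               # slope, p-value, R-squared

    # scatter plot with the fitted regression line
    plot(x, y)
    abline(res, col = "red")   # draw the fitted line

    # residual checks
    plot(x, res$residuals)     # should show no organized structure over x
    abline(h = 0, lty = 2)
    hist(res$residuals)        # should look roughly Gaussian (bell-shaped)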
MULTIPLE LINEAR REGRESSION

The regression equation relates:
- Predictand (e.g. Albany Airport temperature anomalies)
- Predictors, e.g.: temperatures from nearby stations; or indices of large-scale climate modes like the El Niño Southern Oscillation and the North Atlantic Oscillation; or prescribed time-dependent functions like a linear trend, periodic oscillations, polynomials
- Random error (noise)

MULTIPLE LINEAR REGRESSION

Write a set of linear equations, one for each observation in the sample (e.g. one for each year of temperature observations):

    y_i = β_1 x_i,1 + β_2 x_i,2 + … + β_k x_i,k + ε_i,   i = 1, …, n

Or, in short matrix notation:

    y = Xβ + ε        sizes: y is n x 1, X is n x k, β is k x 1, ε is n x 1

The mathematical problem we need to solve is: given all the observations of the predictand (stored in the vector y) and the predictor variables (stored in the matrix X), we want to find simultaneously a proper scaling factor for each predictor variable, such that the fitted (estimated) values minimize the sum of the squared errors:

    ŷ = Xβ
    β = (XᵀX)⁻¹ Xᵀy   sizes: XᵀX is (k x n)(n x k) = k x k, Xᵀy is (k x n)(n x 1) = k x 1, β is k x 1

In XᵀX we find the covariance matrix (scaled by n) of the predictor variables, and Xᵀy is the covariance (scaled by n) of all predictors with the predictand. The ‘−1’ indicates another fundamentally important matrix operation: the inverse of a matrix. (An R sketch of this formula is given after the tips below.)

The resulting k x 1 matrix (i.e. a vector) contains a proper scaling factor for each predictor. In other words: multiple linear regression is a weighted sum of the predictors (after conversion into the units of the predictand y).

Source: http://reliawiki.org/index.php/Multiple_Linear_Regression_Analysis (figures retrieved April 2014)

EXAMPLE: MULTIPLE LINEAR REGRESSION WITH 2 PREDICTORS

The scatter cloud shows a linear dependence of the values in y along the two predictor dimensions x1 and x2.

TIPS FOR MULTIPLE LINEAR REGRESSION (MLR)

- General rule: work with as few predictors as possible (every time you add a new predictor you increase the risk of over-fitting the model).
- Observe how well the fitted values ŷ and the observed values y match (correlation).
- Choose predictors that provide independent information about the predictand.
- The problem of collinearity: if the predictors are all highly correlated among each other, then the MLR can become very ambiguous (because it gets harder to calculate the inverse of the covariance matrix accurately).
- Last but not least: the regression coefficients from the MLR are not ‘unique’. If you add or remove one predictor, all regression coefficients can change.
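A small sketch of how the matrix formula β = (XᵀX)⁻¹ Xᵀy relates to R's lm() with two predictors; x1, x2 and y are assumed to be numeric vectors of equal length, and the names are illustrative only:

    # multiple linear regression with two predictors via lm()
    mres <- lm(y ~ x1 + x2)
    summary(mres)                    # coefficients, significance, R-squared
    cor(fitted(mres), y)             # how well fitted and observed values match

    # the same coefficients from the normal equations: beta = (X'X)^-1 X'y
    X <- cbind(1, x1, x2)            # n x k design matrix (first column = intercept)
    beta <- solve(t(X) %*% X) %*% t(X) %*% y
    beta                             # should agree with coef(mres)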
PRINCIPAL COMPONENT ANALYSIS

- Global Sea Surface Temperatures: from voluntary ship observations; colors show the percentage of months with at least one observation in a 2 by 2 degree grid box. (From a paper in Annual Review of Marine Science, 2010.)

PRINCIPAL COMPONENT ANALYSIS

- Global Sea Surface Temperatures: climatology 1982-2008. Red areas mark the regions with the highest SST variability.

PRINCIPAL COMPONENT ANALYSIS

- Global Sea Surface Temperatures: Principal Component Analysis (PCA), also known as Empirical Orthogonal Functions (EOF).

The first (leading) eigenvector: the eigenvectors now form a geographic pattern. Grid cells with high positive values and grid cells with large negative values are covarying out of phase (negative correlation). Green regions show small variations in this eigenvector #1.

The principal component is a time series showing the temporal evolution of the SST variations. This mode is associated with the El Niño - Southern Oscillation.
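A minimal sketch of a PCA/EOF calculation in R, assuming sst is an anomaly matrix with one row per time step and one column per grid cell (the variable name and data layout are assumptions for illustration, not the data set shown in the slides):

    # PCA (EOF analysis) of an anomaly matrix (rows = time, columns = grid cells)
    pca <- prcomp(sst, center = TRUE, scale. = FALSE)

    eof1 <- pca$rotation[, 1]        # eigenvector #1: a geographic pattern
    pc1  <- pca$x[, 1]               # principal component #1: a time series

    summary(pca)                     # fraction of variance explained by each mode
    plot(pc1, type = "l")            # temporal evolution of the leading mode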