Regression Analysis
Claudia Angelini, Istituto per le Applicazioni del Calcolo "M. Picone", Napoli, Italy
© 2018 Elsevier Inc. All rights reserved.

Nomenclature
$Y = (Y_1, Y_2, \dots, Y_n)^T$ denotes the $n \times 1$ vector of responses
$\varepsilon = (\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n)^T$ denotes the $n \times 1$ vector of errors/noise
$X = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$ denotes the $n \times (p+1)$ matrix of regressors (i.e., the design matrix)
$x_i = (x_{i1}, \dots, x_{ip})^T$ (or $x_i = (1, x_{i1}, \dots, x_{ip})^T$) denotes the $p \times 1$ (or $(p+1) \times 1$) vector of covariates observed on the $i$th statistical unit
$X_k = (x_{1k}, \dots, x_{nk})^T$ denotes the $n \times 1$ vector of observations for the $k$th covariate
$\beta = (\beta_0, \beta_1, \dots, \beta_p)^T$ denotes the $(p+1) \times 1$ vector of unknown regression coefficients
$\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p)^T$ denotes the $(p+1) \times 1$ vector of estimated regression coefficients
$\hat{Y} = (\hat{Y}_1, \hat{Y}_2, \dots, \hat{Y}_n)^T$ denotes the $n \times 1$ vector of fitted responses
$H = X(X^T X)^{-1} X^T$ denotes the $n \times n$ hat-matrix
$E(\cdot)$ denotes the expected value
$T_{n-p-1}$ denotes a Student distribution with $n-p-1$ degrees of freedom
$t_{n-p-1,\alpha}$ denotes the $(1-\alpha)$ quantile of a $T_{n-p-1}$ distribution
$\|\beta\|_2^2 = \sum_{i=1}^{p} \beta_i^2$ denotes the (squared) $\ell_2$ vector norm
$\|\beta\|_1 = \sum_{i=1}^{p} |\beta_i|$ denotes the $\ell_1$ vector norm

Introduction

Regression analysis is a well-known statistical learning technique used to infer the relationship between a dependent variable Y and p independent variables $X = [X_1|\dots|X_p]$. The dependent variable Y is also known as the response variable or outcome, and the variables $X_k$ ($k = 1,\dots,p$) as predictors, explanatory variables, or covariates. More precisely, regression analysis aims to estimate the mathematical relation f(·) explaining Y in terms of X, Y = f(X), using the observations $(x_i, Y_i)$, $i = 1,\dots,n$, collected on n observed statistical units. If Y describes a univariate random variable, the regression is said to be univariate; otherwise it is referred to as multivariate regression. If Y depends on only one variable x (i.e., p = 1), the regression is said to be simple; otherwise (i.e., p > 1) the regression is said to be multiple, see Abramovich and Ritov (2013), Casella and Berger (2001), Faraway (2004), Fahrmeir et al. (2013), Jobson (1991), Rao (2002), Sen and Srivastava (1990).

For the sake of brevity, in this chapter we limit our attention to univariate regression (simple and multiple), so that $Y = (Y_1, Y_2, \dots, Y_n)^T$ represents the vector of observed outcomes and $X = (x_1, \dots, x_n)^T$ represents the design matrix of observed covariates, where $x_i = (x_{i1}, \dots, x_{ip})^T$ (or $x_i = (1, x_{i1}, \dots, x_{ip})^T$). In this setting $X = [X_1|\dots|X_p]$ is a p-dimensional variable ($p \geq 1$).

Regression analysis techniques can be organized into two main categories: non-parametric and parametric. The first category contains those techniques that do not assume a particular form for f(·); the second includes those techniques that assume the relationship f(·) is known up to a fixed number of parameters $\beta$ that need to be estimated from the observed data. When the relation between the explanatory variables, X, and the parameters, $\beta$, is linear, the model is known as the linear regression model. The linear regression model is one of the oldest and most studied topics in statistics and is the type of regression most used in applications.
For example, regression analysis can be used to investigate how a certain phenotype (e.g., blood pressure) depends on a series of clinical parameters (e.g., cholesterol level, age, diet, and others), or how gene expression depends on a set of transcription factors that can up/down regulate the transcriptional level, and so on. Despite the fact that linear models are simple and easy to handle mathematically, they often provide an adequate and interpretable estimate of the relationship between X and Y.

Technically speaking, the linear regression model assumes the response Y to be a continuous variable defined on the real scale, and each observation is modeled as

$Y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i = x_i^T \beta + \varepsilon_i, \quad i = 1,\dots,n$

where $\beta = (\beta_0, \beta_1, \dots, \beta_p)^T$ is a vector of unknown parameters called regression coefficients and $\varepsilon$ represents the error or noise term that accounts for the randomness of the measured data, or the residual variability not explained by X. The regression coefficients $\beta$ can be estimated by fitting the observed data using the least squares approach. Under the Gauss-Markov conditions (i.e., the $\varepsilon_i$ are assumed to be independent and identically distributed random variables with zero mean and finite variance $\sigma^2$), the ordinary least squares estimate $\hat{\beta}$ is guaranteed to be the best linear unbiased estimator (BLUE). Moreover, under the further assumption that $\varepsilon \sim N(0, \sigma^2 I_n)$, $\hat{\beta}$ allows statistical inference to be carried out on the model, as described later. The validity of both the Gauss-Markov conditions and the normal distribution of the error term is known as the "white noise" condition. In this context, the linear regression model is also known as the regression of the "mean", since it models the conditional expectation of Y given X, $E(Y|X) = X\beta$, estimated by $\hat{Y} = X\hat{\beta}$.

The linear regression model not only allows estimating the regression coefficients $\beta$ as $\hat{\beta}$ (and hence quantifying the strength of the relationship between Y and each of the p explanatory variables when the remaining p−1 are fixed), but also selecting those variables that have no relationship with Y (when the remaining ones are fixed), as well as identifying which subsets of explanatory variables have to be considered in order to explain the response Y sufficiently well. These tasks can be carried out by testing the significance of each individual regression coefficient when the others are fixed, by removing the coefficients that are not significant and re-fitting the linear model, and/or by using model selection approaches. Moreover, the linear regression model can also be used for prediction. For this purpose, given the estimated values, $\hat{\beta}$, it is possible to predict the response, $\hat{Y}_0 = x_0^T \hat{\beta}$, corresponding to any novel value $x_0$, and to estimate the uncertainty of such a prediction. The uncertainty depends on the type of prediction one wants to make. In fact, it is possible to compute two types of confidence intervals: one for the expectation of a predicted value at a given point $x_0$, and one for a future generic observation at a given point $x_0$.
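To make the setup concrete, the following minimal sketch (using NumPy, with simulated data whose sample size, coefficients, and noise level are arbitrary choices made purely for illustration) fits the model above by ordinary least squares and predicts the response at a new point $x_0$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate n observations from Y = b0 + b1*x1 + b2*x2 + noise.
n = 100
beta_true = np.array([1.0, 2.0, -0.5])                    # b0, b1, b2
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # design matrix with intercept column
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Ordinary least squares: the coefficients minimizing the sum of squared residuals.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated coefficients:", np.round(beta_hat, 3))

# Prediction at a new point x0 = (1, x01, x02).
x0 = np.array([1.0, 0.5, -1.0])
print("predicted response at x0:", round(float(x0 @ beta_hat), 3))
```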
As the number, p, of explanatory variables increases, the least squares approach suffers from a series of problems, such as lack of prediction accuracy and difficulty of interpretation. To address these problems, it is desirable to have a model with only a small number of "important" variables, which is able to provide a good explanation of the outcome and good generalization at the price of sacrificing some details. Model selection consists in identifying which subsets of explanatory variables have to be "selected" to sufficiently explain the response Y, making a compromise referred to as the bias-variance trade-off. This is equivalent to choosing between competing linear regression models (i.e., with different combinations of variables). On the one hand, including too few variables leads to a so-called "underfit" of the data, characterized by poor prediction performance with high bias and low variance. On the other hand, selecting too many variables gives rise to a so-called "overfit" of the data, characterized by poor prediction performance with low bias and high variance. Stepwise linear regression is an attempt to address this problem (Miller, 2002; Sen and Srivastava, 1990), constituting a specific example of subset regression analysis.

Although model selection can be used in the classical regression context, it is one of the most effective tools in high dimensional data analysis. Classical regression deals with the case n ≥ p, where n denotes the number of independent observations (i.e., the sample size) and p the number of variables. Nowadays, in many applications, especially in biomedical science, high-throughput assays are capable of measuring from thousands to hundreds of thousands of variables on a single statistical unit. Therefore, one often has to deal with the case p ≫ n. In such a case, ordinary least squares cannot be applied, and other types of approaches (for example, including the use of a penalization function) such as Ridge regression, Lasso, or Elastic net regression (Hastie et al., 2009; James et al., 2013; Tibshirani, 1996, 2011) have to be used for estimating the regression coefficients. In particular, the Lasso is very effective since it also performs variable selection and has opened the new framework of high-dimensional regression (Bühlmann and van de Geer, 2011; Hastie et al., 2015). Model selection and high dimensional data analysis are strongly connected, and they might also benefit from dimension reduction techniques, such as principal component analysis, or feature selection.

In the classical framework, Y is treated as a random variable, while the X are considered fixed; hence, depending on the distribution of Y, different types of regression models can be defined. With X fixed, the assumptions on the distribution of Y are elicited through the distribution of the error term $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)^T$. As mentioned above, classical linear regression requires the error term to satisfy the Gauss-Markov conditions and to be normally distributed. However, when the error term is not normally distributed, linear regression might not be appropriate. Generalized linear models (GLM) constitute a generalization of classical linear regression that allows the response variable Y to have an error distribution other than normal (McCullagh and Nelder, 1989). GLM generalize linear regression by allowing the linear model to be related to the response variable via a link function, and by allowing the magnitude of the variance of each measurement to be a function of its predicted value. In this way GLM represent a wide framework that includes linear regression, logistic regression, Poisson regression, multinomial regression, etc.
In this framework the regression coefficients can be estimated using the maximum likelihood approach, often solved by iteratively reweighted least squares algorithms.

In the following, we briefly summarize the key concepts and definitions related to linear regression, moving from the simple linear model to the multiple linear model. In particular, we discuss the Gauss-Markov conditions and the properties of the least squares estimate. We discuss the concepts of model selection and also provide suggestions on how to handle outliers and deviations from the standard assumptions. Then, we discuss modern forms of regression, such as Ridge regression, Lasso and Elastic Net, which are based on penalization terms and are particularly useful when the dimension of the variable space, p, increases. We conclude by extending the linear regression concepts to the Generalized Linear Models (GLM).

Background/Fundamentals

In the following we first introduce the main concepts and formulae for simple linear regression (p = 1), then we extend the regression model to the case p > 1. We note that, for p = 1, simple linear regression fits the "best" straight line through the observed data points, while for p > 1 it fits the "best" hyper-plane across the observed data points. Moreover, while Y has to be a quantitative variable, the $X_k$ can be either quantitative or categorical. However, categorical covariates have to be transformed into a series of dummy variables using indicators.

Simple Linear Regression

As mentioned before, simple linear regression (Altman and Krzywinski, 2015; Casella and Berger, 2001) is a statistical model used to study the relationship f(·) between two (quantitative) variables x and Y, where Y represents the response variable and x the explanatory variable (i.e., x is one particular $X_k$), assuming that the mathematical relation can be described by a straight line. In particular, the aim of simple linear regression is to estimate the straight line that best fits the observed data, by estimating its coefficients (i.e., the intercept and the slope), quantifying the uncertainty of such estimates, testing the significance of the relationship, and finally using the estimated line to predict the value of Y from the observed values of x. Typically the x variable is treated as fixed, and the aim is to estimate the relationship between the mean of Y and x (i.e., $E[Y|x]$, where E denotes the conditional expectation of Y given x).

More formally, to fix the notation and introduce the general concepts, assume we have collected n observations $(x_i, Y_i)$, $i = 1, \dots, n$. A simple linear regression model can be written as

$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \dots, n$

where the $\varepsilon_i$ are independent and identically distributed random variables (with zero mean and finite variance $\sigma^2$) and $\beta_0 + \beta_1 x$ represents a straight line, $\beta_0$ being the intercept and $\beta_1$ the slope. In particular, when $\beta_1 > 0$, x and Y vary together; when $\beta_1 < 0$, x and Y vary in opposite directions.

Let $Y = (Y_1, Y_2, \dots, Y_n)^T$ and $\varepsilon = (\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n)^T$ denote the vectors of the observed outcomes and the noise, respectively, let

$X = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{pmatrix}^T$

denote the matrix of regressors (usually called the design matrix), and let $\beta = (\beta_0, \beta_1)^T$ denote the vector of unknown coefficients; then the regression problem can be rewritten in matrix form as

$Y = X\beta + \varepsilon$

The aim is to estimate the coefficients $\beta = (\beta_0, \beta_1)^T$ that provide the "best" fit for the observed data.
This "best" fit is often achieved by using the ordinary least squares approach, i.e., by finding the coefficients that minimize the sum of the squared residuals, or in mathematical terms by solving the following problem

$\operatorname*{argmin}_{\beta_0, \beta_1} \sum_{i=1}^{n} \varepsilon_i^2 = \operatorname*{argmin}_{\beta_0, \beta_1} \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 x_i)^2$

After straightforward calculations, it is possible to prove that

$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}$

where $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ denote the sample means of Y and x, respectively. The estimate $\hat{\beta}_1$ can be rewritten as

$\hat{\beta}_1 = \frac{s_{xY}}{s_{xx}} = r_{xY}\sqrt{\frac{s_{YY}}{s_{xx}}}$

where $s_{xx}$ and $s_{YY}$ denote the sample variances of x and Y, respectively, and $r_{xY}$ and $s_{xY}$ denote the sample correlation coefficient and the sample covariance between x and Y, respectively. Given the estimated parameters $\hat{\beta}_0$, $\hat{\beta}_1$, it is possible to estimate the response, $\hat{Y}_i$, as

$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i, \quad i = 1, \dots, n$

The least squares approach provides a good estimate of the unknown parameters if the so-called Gauss-Markov conditions are satisfied (i.e., (i) $E(\varepsilon_i) = 0$ for all $i = 1,\dots,n$; (ii) $E(\varepsilon_i^2) = \sigma^2$ for all $i = 1,\dots,n$; and (iii) $E(\varepsilon_i \varepsilon_j) = 0$ for all $i \neq j$; where $E(\cdot)$ denotes the expected value). In particular, according to the Gauss-Markov theorem, if such conditions are satisfied, the least squares approach provides the best linear unbiased estimator (BLUE) of the parameters, i.e., the estimate with the lowest variance among all linear unbiased estimators. Furthermore, if we also assume that $\varepsilon_i \sim N(0, \sigma^2)$, then the least squares solution provides the best estimator among all unbiased estimators, and it is also the maximum likelihood estimator. Additionally, under these ("white noise") assumptions it is possible to prove that

$\frac{\hat{\beta}_i - \beta_i}{s_{\hat{\beta}_i}} \sim T_{n-2}, \quad i = 0, 1$

where $s_{\hat{\beta}_i}$ is the standard error of the estimator $\hat{\beta}_i$ (for the slope, $s_{\hat{\beta}_1} = \sqrt{\frac{\frac{1}{n-2}\sum_{i=1}^{n} \hat{\varepsilon}_i^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$), and $T_{n-2}$ denotes the Student t-distribution with n−2 degrees of freedom. On the basis of this result, the $100(1-\alpha)\%$ confidence interval for the coefficient $\beta_i$ is given by

$\beta_i \in \left[\hat{\beta}_i - s_{\hat{\beta}_i}\, t_{n-2,\alpha/2},\; \hat{\beta}_i + s_{\hat{\beta}_i}\, t_{n-2,\alpha/2}\right], \quad i = 0, 1$

where $t_{n-2,\alpha/2}$ denotes the $(1-\alpha/2)$ quantile of $T_{n-2}$. Moreover, the significance of each coefficient can be evaluated by testing the null hypothesis $H_{0,i}: \beta_i = 0$ versus the alternative $H_{1,i}: \beta_i \neq 0$. To perform such a test, one can use the test statistic $\hat{\beta}_i / s_{\hat{\beta}_i} \sim T_{n-2}$, $i = 0, 1$ (under the null hypothesis).

Finally, the coefficient of determination, $R^2$, is commonly used to evaluate the goodness of fit of a simple linear regression model. It is defined as

$R^2 = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2} = 1 - \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$

We have that $R^2 \in [0, 1]$, where 0 (or a value close to 0) indicates that the model explains none (or little) of the variability of the response data around its mean, and 1 (or a value close to 1) indicates that the model explains all (or most) of that variability. Moreover, if $R^2 = 1$ the observed data lie exactly on the regression line (perfect fit).
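The formulae above translate directly into code. The following sketch, assuming simulated data and using NumPy with SciPy only for the reference t-distribution, computes the least squares estimates, the standard error and t-test for the slope, a 95% confidence interval, and $R^2$ for a simple linear regression.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Simulated data from a straight line with Gaussian noise (arbitrary true values).
n = 50
x = rng.uniform(0, 10, size=n)
y = 1.5 + 0.8 * x + rng.normal(scale=1.0, size=n)

xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)   # slope estimate
b0 = ybar - b1 * xbar                                            # intercept estimate

y_hat = b0 + b1 * x
resid = y - y_hat
sigma2_hat = np.sum(resid ** 2) / (n - 2)                        # residual variance estimate
se_b1 = np.sqrt(sigma2_hat / np.sum((x - xbar) ** 2))            # standard error of the slope

# t statistic and two-sided p-value for H0: beta1 = 0.
t_stat = b1 / se_b1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

# 95% confidence interval for the slope.
t_crit = stats.t.ppf(0.975, df=n - 2)
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

# Coefficient of determination.
r2 = 1 - np.sum(resid ** 2) / np.sum((y - ybar) ** 2)
print(f"b0={b0:.3f}  b1={b1:.3f}  t={t_stat:.2f}  p={p_value:.2g}  "
      f"CI=({ci[0]:.3f}, {ci[1]:.3f})  R^2={r2:.3f}")
```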
Note that $R^2 = r^2$, where r denotes the Pearson correlation coefficient, which can be computed as

$r = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\,\sqrt{n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}}$

Ordinary least squares regression is the most common approach to defining the "best" straight line that fits the data. However, there are other regression methods that can be used in place of ordinary least squares, such as least absolute deviations (i.e., the line that minimizes the sum of the absolute values of the residuals),

$\operatorname*{argmin}_{\beta_0, \beta_1} \sum_{i=1}^{n} |\varepsilon_i| = \operatorname*{argmin}_{\beta_0, \beta_1} \sum_{i=1}^{n} |Y_i - \beta_0 - \beta_1 x_i|$

see Sen and Srivastava (1990) for a more detailed discussion.

Multiple Linear Regression

Multiple linear regression generalizes the above concepts and formulae to the case where more predictors are used (Abramovich and Ritov, 2013; Casella and Berger, 2001; Faraway, 2004; Fahrmeir et al., 2013; Jobson, 1991; Krzywinski and Altman, 2015; Rao, 2002; Sen and Srivastava, 1990). Let $Y = (Y_1, Y_2, \dots, Y_n)^T$ and $\varepsilon = (\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n)^T$ denote the vectors of the observed outcomes and the noise, respectively, let

$X = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$

be the design matrix, and let $\beta = (\beta_0, \beta_1, \dots, \beta_p)^T$ be the vector of unknown coefficients; then the regression problem can be rewritten in matrix form as

$Y = X\beta + \varepsilon$

The least squares estimate of $\beta$ can be obtained by solving the following problem

$\operatorname*{argmin}_{\beta_0, \beta_1, \dots, \beta_p} \sum_{i=1}^{n} \varepsilon_i^2 = \operatorname*{argmin}_{\beta_0, \beta_1, \dots, \beta_p} \sum_{i=1}^{n} \left(Y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \dots - \beta_p x_{ip}\right)^2 = \operatorname*{argmin}_{\beta}\, (Y - X\beta)^T (Y - X\beta)$

It can be proved that if $X^T X$ is a non-singular matrix, and $(X^T X)^{-1}$ denotes its inverse, then there exists a unique solution given by

$\hat{\beta} = (X^T X)^{-1} X^T Y$

If $X^T X$ is singular, the solution is not unique, but it can still be computed in terms of the pseudo-inverse matrix $(X^T X)^{+}$ as follows

$\hat{\beta} = (X^T X)^{+} X^T Y$

Once the estimate $\hat{\beta}$ has been computed, it is possible to estimate $\hat{Y}$ as

$\hat{Y} = X\hat{\beta} = X(X^T X)^{-1} X^T Y = HY$

where $H = X(X^T X)^{-1} X^T$ is called the hat-matrix or projection matrix. From a geometrical point of view, in an n-dimensional Euclidean space, $\hat{Y}$ can be seen as the orthogonal projection of Y onto the subspace generated by the columns of X, and the vector of residuals, defined as $e = Y - \hat{Y}$, is orthogonal to the subspace generated by the columns of X.

Moreover, under the assumption that the Gauss-Markov conditions hold, the least squares estimate, $\hat{\beta}$, is the BLUE, and an unbiased estimate of the variance is given by

$\hat{\sigma}^2 = \frac{1}{n-p-1} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$

Analogously to simple linear regression, the coefficient of determination, $R^2 = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$, can be used to evaluate the goodness of fit ($R^2 \in [0, 1]$). However, the value of $R^2$ increases as p increases; therefore, it is not a useful measure for selecting parsimonious models. The adjusted $R^2$, $R_a^2$, is defined as

$R_a^2 = 1 - \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2/(n-p-1)}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2/(n-1)} = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)}$

which adjusts for the number of explanatory variables in the model.
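As an illustration of the closed-form solution and of the role of the hat-matrix, the following sketch (simulated data, NumPy only; all numerical values are arbitrary) computes $\hat{\beta}$, the fitted values $HY$, the residuals, and both $R^2$ and $R_a^2$.

```python
import numpy as np

rng = np.random.default_rng(3)

n, p = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design matrix with intercept
beta_true = np.array([0.5, 1.0, -2.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.7, size=n)

XtX_inv = np.linalg.inv(X.T @ X)       # assumes X^T X is non-singular
beta_hat = XtX_inv @ X.T @ y           # least squares estimate
H = X @ XtX_inv @ X.T                  # hat (projection) matrix
y_hat = H @ y                          # fitted values
resid = y - y_hat                      # residuals, orthogonal to the columns of X

sse = resid @ resid
sst = np.sum((y - y.mean()) ** 2)
r2 = 1 - sse / sst
r2_adj = 1 - (sse / (n - p - 1)) / (sst / (n - 1))
sigma2_hat = sse / (n - p - 1)         # unbiased estimate of the error variance

print("beta_hat:", np.round(beta_hat, 3))
print(f"R^2={r2:.3f}  adjusted R^2={r2_adj:.3f}  sigma^2_hat={sigma2_hat:.3f}")
print("residuals orthogonal to X columns:", np.allclose(X.T @ resid, 0))
```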
Overall, we have the following partition of the errors: SST = SSR + SSE, where $SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$ denotes the total sum of squares, $SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$ the sum of squares explained by the regression, and $SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$ the residual sum of squares due to the error term.

Gauss-Markov Conditions and Diagnostics

Linear regression and the ordinary least squares approach are based on the so-called Gauss-Markov conditions:

1) $E(\varepsilon_i) = 0$ (zero-mean error)
2) $E(\varepsilon_i^2) = \sigma^2$ for all $i = 1,\dots,n$ (homoscedastic error)
3) $E(\varepsilon_i \varepsilon_j) = 0$ for all $i \neq j$ (independent errors)

These conditions guarantee that the least squares estimate $\hat{\beta} = (X^T X)^{-1} X^T Y$ is the best linear unbiased estimator (BLUE) of the coefficients (Gauss-Markov theorem; Jobson, 1991; Sen and Srivastava, 1990). When using linear regression functions it is important to verify the validity of such assumptions using so-called residual plots (Altman and Krzywinski, 2016b). Moreover, in order to perform inference on the significance of the regression coefficients and prediction of future outcomes, it is necessary that $\varepsilon \sim N(0, \sigma^2 I_n)$ also holds. This condition can be verified using the Shapiro-Wilk test or the Lilliefors test; see Razali and Wah (2011) or the Chapter Statistical Inference Techniques in this volume.

Tests of significance and confidence regions

One of the most important aspects of regression is that it is possible to make inference on the estimated model; this means it is possible to test the significance of each coefficient (i.e., $H_0: \beta_j = 0$ vs $H_1: \beta_j \neq 0$), assuming the others are fixed, and to build confidence intervals. In particular, under the "white noise" assumptions, it is possible to prove that

$\hat{\beta} \sim N\left(\beta,\, (X^T X)^{-1}\sigma^2\right) \qquad \text{and} \qquad (n-p-1)\,\hat{\sigma}^2 \sim \sigma^2 \chi^2_{n-p-1}$

where $\chi^2_{n-p-1}$ denotes a chi-squared distribution with n−p−1 degrees of freedom. Moreover, the two estimates are statistically independent. Hence, it is possible to show that (under the null hypothesis)

$\frac{\hat{\beta}_j - \beta_j}{s_{\hat{\beta}_j}} \sim T_{n-p-1}$

follows a Student distribution $T_{n-p-1}$ with n−p−1 degrees of freedom. These results can be used to decide whether the $\beta_j$ are significant (where $s_{\hat{\beta}_j}$ denotes the standard deviation of the estimated coefficient, i.e., $s_{\hat{\beta}_j} = \sqrt{\operatorname{var}(\hat{\beta}_j)}$). Moreover, the $100(1-\alpha)\%$ confidence interval for the regression coefficient $\beta_j$ is given by $\hat{\beta}_j \pm s_{\hat{\beta}_j}\, t_{n-p-1,\alpha/2}$, where $t_{n-p-1,\alpha/2}$ is the corresponding quantile of the T distribution with n−p−1 degrees of freedom. More details about significance testing, including the case of contrasts between coefficients, can be found in Sen and Srivastava (1990) and Jobson (1991).

Inference for the model

The overall goodness of fit of the model can be assessed by testing the null hypothesis $H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0$ using the F-test

$F = \frac{SSR/p}{SSE/(n-p-1)}$

which has a Fisher distribution with p and n−p−1 degrees of freedom.

Confidence interval (CI) for the expectation of a predicted value at $x_0$

Let $x_0 = (x_{00}, x_{01}, \dots, x_{0p})^T$ be a novel observation of the independent variables X, and let $y_0$ be the unobserved response. Then, the predicted response is $\hat{y}_0 = x_0^T \hat{\beta}$, and the $100(1-\alpha)\%$ confidence interval for its expected value is

$\hat{y}_0 \pm t_{n-p-1,\alpha/2}\sqrt{\hat{\sigma}^2\, x_0^T (X^T X)^{-1} x_0}$

CI for a future observation at $x_0$

Under the same settings, the $100(1-\alpha)\%$ confidence interval for the predicted response is

$\hat{y}_0 \pm t_{n-p-1,\alpha/2}\sqrt{\hat{\sigma}^2\left(1 + x_0^T (X^T X)^{-1} x_0\right)}$
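The inference machinery above can be reproduced in a few lines. The following sketch, assuming simulated data and SciPy only for the reference t and F distributions, computes the coefficient t-tests, the overall F-test, and the two intervals at a new point $x_0$ (all numerical values are arbitrary choices for the example).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Simulated data: one of the three predictors has no effect on Y.
n, p = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([0.5, 1.0, 0.0, -1.5])
y = X @ beta_true + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
y_hat = X @ beta_hat
resid = y - y_hat
df = n - p - 1
sigma2_hat = resid @ resid / df

# t-tests for H0: beta_j = 0 (each coefficient, the others held fixed).
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))
t_stats = beta_hat / se
p_vals = 2 * stats.t.sf(np.abs(t_stats), df=df)
print("t statistics:", np.round(t_stats, 2), " p-values:", np.round(p_vals, 4))

# Overall F-test of H0: beta_1 = ... = beta_p = 0.
ssr = np.sum((y_hat - y.mean()) ** 2)
sse = resid @ resid
F = (ssr / p) / (sse / df)
print(f"F = {F:.2f},  p = {stats.f.sf(F, p, df):.3g}")

# CI for E(Y | x0) and prediction interval for a future observation at x0.
x0 = np.array([1.0, 0.3, -0.7, 1.2])
y0_hat = x0 @ beta_hat
q = x0 @ XtX_inv @ x0
t_crit = stats.t.ppf(0.975, df=df)
half_mean = t_crit * np.sqrt(sigma2_hat * q)
half_pred = t_crit * np.sqrt(sigma2_hat * (1 + q))
print(f"prediction {y0_hat:.2f},  CI for the mean +/- {half_mean:.2f},  "
      f"prediction interval +/- {half_pred:.2f}")
```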
Subset Linear Regression

In many cases, the number of observed variables, X, is large, and one seeks a regression model with a smaller number of "important" variables in order to gain explanatory power and to address the so-called bias-variance trade-off. In these cases a small subset of the original variables must be selected, and the regression hyper-plane fit using that subset. In principle, one can choose a goodness criterion and compare all potential models using that criterion. However, if p denotes the number of observed variables, there are $2^p$ potential models to compare; therefore, for large p such an approach becomes unfeasible. Stepwise linear regression methods allow one to select a subset of variables by adding or removing one variable at a time and re-fitting the model. In this way, it is possible to choose a "good" model by fitting only a limited number of potential models. Mallows' Cp, AIC (Akaike's information criterion), and BIC (Bayesian information criterion) are widely used criteria for choosing "good" models. More details can be found in Sen and Srivastava (1990), Jobson (1991), Miller (2002), and Lever et al. (2016).

Regression Analysis in Practice

How to Report the Results of a Linear Regression Analysis and Diagnostic Plots

After fitting a linear regression model one has to report the estimates of the regression coefficients and their significance. However, this is not sufficient for evaluating the overall quality of the fit. The $R^2$ (or the $R_a^2$) is a typical measure of quality that should be reported, as well as the overall significance, F, of the regression model. Moreover, results have to be accompanied by a series of diagnostic plots and statistics on the residuals that can be used to validate the Gauss-Markov conditions. Such analyses include normal quantile plots of the residuals and residual-versus-fitted plots. The key aspect to keep in mind is that, when fitting a linear regression model with any software, the result will be the "best" hyper-plane or the "best" straight line, regardless of whether the model is appropriate. For example, in the case of simple linear regression the best line will be returned even though the data are not linear. Therefore, to avoid false conclusions, all of the above-mentioned points have to be considered. The leverage, DFFITS, and Cook's distance of the observations are other measures that can be used to assess the quality and robustness of the fit; see Sen and Srivastava (1990), Jobson (1991), and Altman and Krzywinski (2016b) for more details.
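A possible illustration of these diagnostics, on simulated data with one artificially contaminated observation, is sketched below; it checks the normality of the residuals with the Shapiro-Wilk test and computes the leverage and Cook's distance from the hat-matrix (the data and contamination are hypothetical and chosen only to make the influential point visible).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

n, p = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)
y[0] += 8.0                                  # contaminate one observation on purpose

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T                        # hat matrix
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
k = p + 1                                    # number of estimated coefficients
sigma2_hat = resid @ resid / (n - k)

# Normality of the residuals (Shapiro-Wilk test).
W, p_norm = stats.shapiro(resid)
print(f"Shapiro-Wilk: W = {W:.3f}, p = {p_norm:.3g}")

# Leverage (diagonal of H) and Cook's distance: large values flag influential points.
leverage = np.diag(H)
cooks_d = (resid ** 2 / (k * sigma2_hat)) * leverage / (1 - leverage) ** 2
worst = int(np.argmax(cooks_d))
print(f"largest Cook's distance at observation {worst}: "
      f"D = {cooks_d[worst]:.3f}, leverage = {leverage[worst]:.3f}")
```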
In cases where the assumptions are violated, one can use data transformations to mitigate the deviation from the assumptions, or use more sophisticated regression models, such as generalized linear models, non-linear models, and non-parametric regression approaches, depending on the type of assumptions one can place on the data.

Outliers and Influential Observations

The presence of outliers can be connected to potential problems that may inflate the apparent validity of a regression model (Sen and Srivastava, 1990; Jobson, 1991; Altman and Krzywinski, 2016a). In particular, an outlier is a point that does not follow the general trend of the rest of the data. Hence it can show an extreme value for the response Y, for any of the predictors $X_k$, or for both. Its presence may be due either to measurement error or to the presence of a sub-population that does not follow the expected distribution. One might observe one or a few outliers when fitting a regression model. An influential point is an outlier that strongly affects the estimates of the regression coefficients. Anscombe's quartet (Anscombe, 1973) is a typical example used to illustrate how influential points can distort conclusions. Roughly speaking, to measure the influence of an outlier, one can compare the regression coefficients computed with and without the outlier. More formal approaches for detecting outliers and influential points are described in Sen and Srivastava (1990), Jobson (1991), and Altman and Krzywinski (2016a).

Transformations

As already mentioned, when the Gauss-Markov conditions are not satisfied, ordinary least squares cannot be used. However, depending on the deviation from the assumptions, either a transformation can be applied in order to match the assumptions, or other regression approaches can be used. More generally, transformations can be used on the data for wider purposes, for example for centering and standardizing observations when the variables have different magnitudes. Other widely used transformations are the logarithmic and square-root transforms, which can help in linearizing relationships. Additionally, in the context of linear regression, variance-stabilizing transformations can be used to accommodate heteroscedasticity, and normalizing transformations to better match the assumption of a normal distribution. Although several transformations have been proposed in the literature, there is no standard approach, and an intensive use of transformations can be questionable. The reader is referred to Sen and Srivastava (1990) for more details.

Beyond ordinary least squares approaches

As previously stated, linear regression is also known as the regression of the "mean", since the conditional mean of Y given X is modeled as a linear combination of the regressors, X, plus an error term. In this case, the least squares approach constitutes the BLUE when the Gauss-Markov conditions hold. However, other types of criteria, such as quantile regression, have been proposed (McKean, 2004; Koenker, 2005; Maronna et al., 2006). In particular, quantile regression aims at estimating either the conditional median or other conditional quantiles of the response variable. In this case, the estimates are obtained using linear programming algorithms that solve the corresponding minimization problem. Such regression models are more robust than least squares with respect to the presence of outliers.
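As a sketch of how such estimates can be computed, the following example fits the conditional median (i.e., the least absolute deviations line) by recasting the problem as a linear program and solving it with SciPy. The data are simulated with heavy-tailed noise, and the formulation shown is one standard way of writing the problem under these assumptions, not the only one.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(6)

# Simulated data with heavy-tailed noise, where the median fit is more robust than OLS.
n = 80
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.standard_t(df=2, size=n)
X = np.column_stack([np.ones(n), x])
k = X.shape[1]

# LAD (median) regression as a linear program:
# minimize sum(u_i) subject to -u_i <= y_i - x_i^T b <= u_i, with b free and u >= 0.
c = np.concatenate([np.zeros(k), np.ones(n)])
A_ub = np.block([[-X, -np.eye(n)],
                 [ X, -np.eye(n)]])
b_ub = np.concatenate([-y, y])
bounds = [(None, None)] * k + [(0, None)] * n
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")

beta_lad = res.x[:k]
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print("LAD (median) estimate:", np.round(beta_lad, 3))
print("OLS estimate:         ", np.round(beta_ols, 3))
```

For quantiles other than the median, the objective weights positive and negative residuals asymmetrically, but the linear programming structure is the same.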
Advanced Approaches

High Dimensional Regression (p ≫ n)

Classical linear regression, as described above, requires the matrix $X^T X$ to be invertible; this implies that n ≥ p, where n denotes the number of independent observations (i.e., the sample size) and p the number of variables. Current advances in science and technology allow the measurement of thousands or millions of variables on the same statistical unit. Therefore, one often has to deal with the case p ≫ n, where ordinary least squares cannot be applied and other types of approaches must be used.

The solution in such cases is to use penalized approaches such as Ridge regression, Lasso regression, Elastic net, or others (Hastie et al., 2009; James et al., 2013; Tibshirani, 1996, 2011; Zou and Hastie, 2005), where a penalty term, $P(\beta)$, is added to the fitting criterion, as follows

$(Y - X\beta)^T (Y - X\beta) + \lambda P(\beta)$

where $\lambda$ is the so-called regularization parameter, which controls the trade-off between bias and variance. In practice, by sacrificing the unbiasedness of ordinary least squares, one can achieve better generalization and flexibility. Different penalty terms lead to different regression methods (Bühlmann and van de Geer, 2011; Hastie et al., 2015), also known as regularization techniques. The regularization parameter, $\lambda$, is usually selected using criteria such as cross-validation. Other possible approaches include the use of dimension reduction techniques, such as principal component analysis, or feature selection, as described in the Chapter Dimension Reduction Techniques in this volume.

Ridge regression

The estimates of the regression coefficients are obtained by solving the following minimization problem

$\hat{\beta}^{Ridge} = \operatorname*{argmin}_{\beta}\left[\sum_{i=1}^{n}\left(Y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \dots - \beta_p x_{ip}\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2\right] = \operatorname*{argmin}_{\beta}\left[(Y - X\beta)^T (Y - X\beta) + \lambda\|\beta\|_2^2\right]$

where $\lambda > 0$ is a suitable regularization parameter. Since the minimization problem is convex, it is easy to show that it has a unique solution, which can be computed in closed form as

$\hat{\beta}^{Ridge} = (X^T X + \lambda I)^{-1} X^T Y$

This corresponds to shrinking the ordinary least squares coefficients by an amount that is controlled by $\lambda$. Ridge regression can be used when $X^T X$ is singular or quasi-singular and ordinary least squares does not provide a unique solution; in such circumstances $X^T X + \lambda I$ is still invertible. Note that as $\lambda \to 0$ the penalty plays a "minor" role, so $\hat{\beta}^{Ridge}$ tends to $\hat{\beta}$ (i.e., the coefficients obtained by ordinary least squares), while as $\lambda \to +\infty$, $\hat{\beta}^{Ridge}$ tends to zero (i.e., to the so-called intercept-only model). Different values of $\lambda$ provide a trade-off between bias and variance (larger $\lambda$ increases the bias but reduces the variance), and a suitable value of $\lambda$ can be chosen from the data by cross-validation. Although ridge regression can be used for fitting high dimensional data more effectively than the ordinary least squares approach, it has some limitations, such as a large bias toward zero for large regression coefficients and a lack of interpretability of the regression solution: "unimportant" coefficients are shrunk towards zero, but they remain in the model instead of being set exactly to zero. As a consequence, ridge regression does not perform model selection.
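A minimal sketch of the ridge closed-form estimator in a p > n setting (simulated data, NumPy only; the regularization values and the omission of an intercept are arbitrary simplifications) shows how increasing $\lambda$ shrinks the estimated coefficients even though ordinary least squares has no unique solution here.

```python
import numpy as np

rng = np.random.default_rng(8)

# A p > n setting: X^T X is singular, but X^T X + lambda*I is invertible.
n, p = 50, 100
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]        # only a few non-zero coefficients
y = X @ beta_true + rng.normal(scale=0.5, size=n)

def ridge(X, y, lam):
    """Closed-form ridge estimate (X^T X + lambda I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in (0.1, 1.0, 10.0, 100.0):
    b = ridge(X, y, lam)
    print(f"lambda={lam:6.1f}  ||beta_hat||_2 = {np.linalg.norm(b):6.3f}")
```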
Lasso regression

The estimates of the regression coefficients are obtained by solving the following minimization problem

$\hat{\beta}^{Lasso} = \operatorname*{argmin}_{\beta}\left[\sum_{i=1}^{n}\left(Y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \dots - \beta_p x_{ip}\right)^2 + \lambda\sum_{j=1}^{p}|\beta_j|\right] = \operatorname*{argmin}_{\beta}\left[(Y - X\beta)^T (Y - X\beta) + \lambda\|\beta\|_1\right]$

where $\lambda > 0$ is a suitable regularization parameter. The underlying idea of the lasso is that it seeks a sparse solution, meaning that it sets some regression coefficients exactly equal to 0; as a consequence, the lasso also performs model selection. The larger the value of $\lambda$, the more coefficients are set to zero. Unfortunately, the solution of the lasso minimization problem is not available in closed form; however, it can be obtained using convex minimization approaches, and several algorithms, such as least-angle regression (LARS) (Efron et al., 2004), have been proposed to fit the model efficiently. Analogously to ridge regression, a suitable value of $\lambda$ can be chosen from the data by cross-validation. Overall, the lasso has opened a new framework of so-called high-dimensional regression models, and several generalizations have been proposed (Bühlmann and van de Geer, 2011; Hastie et al., 2015) to overcome its limitations and to extend the original idea to different regression contexts.

Elastic net regression

The estimates of the regression coefficients are obtained by solving the following minimization problem

$\hat{\beta}^{Elastic\ net} = \operatorname*{argmin}_{\beta}\left[\sum_{i=1}^{n}\left(Y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \dots - \beta_p x_{ip}\right)^2 + \lambda_1\sum_{j=1}^{p}\beta_j^2 + \lambda_2\sum_{j=1}^{p}|\beta_j|\right] = \operatorname*{argmin}_{\beta}\left[(Y - X\beta)^T (Y - X\beta) + \lambda_1\|\beta\|_2^2 + \lambda_2\|\beta\|_1\right]$

where $\lambda_1 > 0$ and $\lambda_2 > 0$ are suitable regularization parameters (Zou and Hastie, 2005). Note that for $\lambda_1 = 0$ one obtains lasso regression, while the case $\lambda_2 = 0$ corresponds to ridge regression. Different combinations of $\lambda_1$ and $\lambda_2$ compromise between shrinking and selecting coefficients. Indeed, the elastic net was designed to overcome some of the limitations of both ridge and lasso regression, since the quadratic penalty term shrinks the regression coefficients toward zero, while the absolute penalty term acts as model selection by keeping or removing regression coefficients. For example, the elastic net is more robust than the lasso when correlated predictors are present in the model: when there is a group of highly correlated variables, the lasso usually selects only one variable from the group (ignoring the others), whereas the elastic net allows more variables to be selected. Moreover, when p > n the lasso can select at most n variables before saturating, whereas, thanks to the quadratic penalty, the elastic net can select a larger number of variables. At the same time, "unimportant" regression coefficients can still be set exactly to zero, so that variable selection is performed.

Other Types of Regressions

Generalized linear models (GLM) are a well-known generalization of the above-described linear model. GLM allow the dependent variable, Y, to be generated by any distribution f(·) belonging to the exponential family, which includes the normal, binomial, Poisson, and gamma distributions, among many others. Therefore GLM constitute a general framework in which to handle different types of relationships. The model assumes that the mean of Y depends on X by means of a link function, g(·),

$E(Y) = \mu = g^{-1}(X\beta)$

where $E(\cdot)$ denotes expectation and g(·) the link function (an invertible, continuous and differentiable function). In practice, $g(\mu) = X\beta$, so there is a linear relationship between X and a function of the mean of Y. Moreover, in this context the variance is also a function of the mean, $\operatorname{Var}(Y) = V(\mu) = V(g^{-1}(X\beta))$. The linear regression framework is recovered by choosing the identity as the link function; Poisson regression corresponds to choosing log(·) as the link function, binomial (logistic) regression to choosing the logit, and so on. The unknown regression coefficients $\beta$ are typically estimated with maximum likelihood, maximum quasi-likelihood, or Bayesian techniques. Inference can then be carried out in a similar way as for linear regression.
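As an illustration of how GLM coefficients can be estimated by iteratively reweighted least squares, the following sketch fits a logistic regression (binomial family, logit link) from scratch on simulated data; the sample size, true coefficients, number of iterations, and stopping rule are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated binary outcomes from a logistic model (logit link).
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([-0.5, 1.0, 2.0])
p_true = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = rng.binomial(1, p_true)

# Iteratively reweighted least squares (Fisher scoring) for logistic regression.
beta = np.zeros(X.shape[1])
for _ in range(25):
    eta = X @ beta                      # linear predictor
    mu = 1.0 / (1.0 + np.exp(-eta))     # mean, via the inverse logit link
    W = mu * (1.0 - mu)                 # variance function, used as IRLS weights
    z = eta + (y - mu) / W              # working response
    beta_new = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    if np.max(np.abs(beta_new - beta)) < 1e-8:
        beta = beta_new
        break
    beta = beta_new

print("true coefficients:     ", beta_true)
print("estimated coefficients:", np.round(beta, 3))
```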
A formal description and mathematical treatment of GLM can be found in McCullagh and Nelder (1989) and Madsen and Thyregod (2011). GLM are very important for biomedical applications since they include logistic and Poisson regression, which are often used in biomedical science to model binary outcomes and count data, respectively. Recently, penalized regression approaches, such as those described in the section on high dimensional regression, have been extended to generalized linear models; in this context, the regularization is achieved by penalizing the log-likelihood function, see Bühlmann and van de Geer (2011) and Hastie et al. (2015) for more details.

Closing Remarks

Regression constitutes one of the most relevant frameworks of modern statistical inference, with many applications to the analysis of biomedical data, since it allows one to study the relationship between a dependent variable Y and a series of p independent variables $X = [X_1|\dots|X_p]$ from a set of independent observations $(x_i, Y_i)$, $i = 1,\dots,n$. When the relationship is linear in the coefficients one has linear regression. Despite its simplicity, linear regression allows testing the significance of the coefficients, estimating the uncertainty, and predicting novel outcomes. The Gauss-Markov conditions guarantee that the least squares estimator has the BLUE property, and the normality of the residuals allows inference to be carried out. When p > n, classical linear regression cannot be applied, and penalized approaches have to be used; ridge, lasso and elastic net regression are the best known approaches in the context of high dimensional data analysis (Bühlmann and van de Geer, 2011; Hastie et al., 2015). When the white-noise conditions are violated, data transformations or other types of models have to be considered. Analogously, when the relationship is not linear, other approaches or non-parametric models should be used.

References

Abramovich, F., Ritov, Y., 2013. Statistical Theory: A Concise Introduction. Chapman & Hall/CRC.
Altman, N., Krzywinski, M., 2015. Simple linear regression. Nature Methods 12 (11), 999–1000.
Altman, N., Krzywinski, M., 2016a. Analyzing outliers: Influential or nuisance? Nature Methods 13 (4), 281–282.
Altman, N., Krzywinski, M., 2016b. Regression diagnostics. Nature Methods 13 (5), 385–386.
Anscombe, F.J., 1973. Graphs in statistical analysis. American Statistician 27 (1), 17–21.
Bühlmann, P., van de Geer, S., 2011. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics.
Casella, G., Berger, R., 2001. Statistical Inference, second ed. Duxbury.
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., 2004. Least angle regression. Annals of Statistics 32 (2), 407–499.
Faraway, J.J., 2004. Linear Models with R. Chapman & Hall/CRC.
Fahrmeir, L., Kneib, T., Lang, S., Marx, B., 2013. Regression: Models, Methods and Applications. Springer.
Hastie, T., Tibshirani, R., Friedman, J., 2009. The Elements of Statistical Learning, second ed. Springer Series in Statistics.
Hastie, T., Tibshirani, R., Wainwright, M., 2015. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press.
James, G., Witten, D., Hastie, T., Tibshirani, R., 2013. An Introduction to Statistical Learning: With Applications in R. Springer.
Jobson, J.D., 1991. Applied Multivariate Data Analysis. Volume I: Regression and Experimental Design. Springer Texts in Statistics.
Koenker, R., 2005. Quantile Regression. Cambridge University Press.
Krzywinski, M., Altman, N., 2015. Multiple linear regression. Nature Methods 12 (12), 1103–1104.
Lever, J., Krzywinski, M., Altman, N., 2016. Model selection and overfitting. Nature Methods 13 (9), 703–704.
Madsen, H., Thyregod, P., 2011. Introduction to General and Generalized Linear Models. Chapman & Hall/CRC.
Maronna, R., Martin, D., Yohai, V., 2006. Robust Statistics: Theory and Methods. Wiley.
McCullagh, P., Nelder, J., 1989. Generalized Linear Models, second ed. Boca Raton, FL: Chapman and Hall/CRC.
McKean, J.W., 2004. Robust analysis of linear models. Statistical Science 19 (4), 562–570.
Miller, A., 2002. Subset Selection in Regression. Chapman and Hall/CRC.
Rao, C.R., 2002. Linear Statistical Inference and its Applications, Wiley Series in Probability and Statistics, second ed. New York: Wiley.
Razali, N.M., Wah, Y.B., 2011. Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests. Journal of Statistical Modeling and Analytics 2 (1), 21–33.
Sen, A., Srivastava, M., 1990. Regression Analysis: Theory, Methods and Applications. Springer Texts in Statistics.
Tibshirani, R., 1996. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society B 58 (1), 267–288.
Tibshirani, R., 2011. Regression shrinkage and selection via the lasso: A retrospective. Journal of the Royal Statistical Society B 73 (3), 273–282.
Zou, H., Hastie, T., 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society B 67 (2), 301–320.