
Regression Analysis

Claudia Angelini, Istituto per le Applicazioni del Calcolo “M. Picone”, Napoli, Italy
© 2018 Elsevier Inc. All rights reserved.
Nomenclature
$Y = (Y_1, Y_2, \ldots, Y_n)^T$ denotes the $n \times 1$ vector of responses
$\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_n)^T$ denotes the $n \times 1$ vector of errors/noise
$X = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$ denotes the $n \times (p+1)$ matrix of regressors (i.e., the design matrix)
$x_i = (x_{i1}, \ldots, x_{ip})^T$ (or $x_i = (1, x_{i1}, \ldots, x_{ip})^T$) denotes the $p \times 1$ (or $(p+1) \times 1$) vector of covariates observed on the ith statistical unit
$X_k = (x_{1k}, \ldots, x_{nk})^T$ denotes the $n \times 1$ vector of observations for the kth covariate
$\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$ denotes the $(p+1) \times 1$ vector of unknown regression coefficients
$\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p)^T$ denotes the $(p+1) \times 1$ vector of estimated regression coefficients
$\hat{Y} = (\hat{Y}_1, \hat{Y}_2, \ldots, \hat{Y}_n)^T$ denotes the $n \times 1$ vector of predicted responses
$H = X(X^T X)^{-1} X^T$ denotes the $n \times n$ hat-matrix
$E(\cdot)$ denotes the expected value
$T_{n-p-1}$ denotes a Student t-distribution with $n-p-1$ degrees of freedom
$t_{n-p-1,\alpha}$ denotes the $(1-\alpha)$ quantile of a $T_{n-p-1}$ distribution
$\|\beta\|_2^2 = \sum_{i=1}^{p} \beta_i^2$ denotes the (squared) $\ell_2$ vector norm
$\|\beta\|_1 = \sum_{i=1}^{p} |\beta_i|$ denotes the $\ell_1$ vector norm
Introduction
Regression analysis is a well-known statistical learning technique used to infer the relationship between a dependent variable Y and p independent variables $X = [X_1 | \ldots | X_p]$. The dependent variable Y is also known as the response variable or outcome, and the variables $X_k$ ($k = 1, \ldots, p$) as predictors, explanatory variables, or covariates. More precisely, regression analysis aims to estimate the mathematical relation f() for explaining Y in terms of X as Y = f(X), using the observations $(x_i, Y_i), i = 1, \ldots, n$, collected on n observed statistical units. If Y describes a univariate random variable the regression is said to be univariate; otherwise it is referred to as multivariate regression. If Y depends on only one variable x (i.e., p = 1), the regression is said to be simple; otherwise (i.e., p > 1), the regression is said to be multiple, see (Abramovich and Ritov, 2013; Casella and Berger, 2001; Faraway, 2004; Fahrmeir et al., 2013; Jobson, 1991; Rao, 2002; Sen and Srivastava, 1990).
For the sake of brevity, in this chapter we limit our attention to univariate regression (simple and multiple), so that $Y = (Y_1, Y_2, \ldots, Y_n)^T$ represents the vector of observed outcomes, and $X = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix}$ represents the design matrix of observed covariates, where $x_i = (x_{i1}, \ldots, x_{ip})^T$ (or $x_i = (1, x_{i1}, \ldots, x_{ip})^T$). In this setting $X = [X_1 | \ldots | X_p]$ is a p-dimensional variable ($p \geq 1$).
Regression analysis techniques can be organized into two main categories: non-parametric and parametric. The first category
contains those techniques that do not assume a particular form for f(); while the second category includes those techniques based
on the assumption of knowing the relationship f() up to a fixed number of parameters b that need to be estimated from the
observed data. When the relation between the explanatory variables, X, and the parameters, $\beta$, is linear, the model is known as a
Linear Regression Model.
The Linear Regression Model is one of the oldest and most studied topics in statistics and is the type of regression most used in
applications. For example, regression analysis can be used for investigating how a certain phenotype (e.g., blood pressure) depends
on a series of clinical parameters (e.g., cholesterol level, age, diet, and others) or how gene expression depends on a set of
transcription factors that can up/down regulate the transcriptional level, and so on. Despite the fact that linear models are simple
and easy to handle mathematically, they often provide an adequate and interpretable estimate of the relationship between X and
Y. Technically speaking, the linear regression model assumes the response Y to be a continuous variable defined on the real scale
and each observation is modeled as
$$Y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \epsilon_i = x_i^T \beta + \epsilon_i, \quad i = 1, \ldots, n$$
where $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$ is a vector of unknown parameters called regression coefficients, and $\epsilon$ represents the error or noise term that accounts for the randomness of the measured data, i.e., the residual variability not explained by X. The regression coefficients $\beta$ can be estimated by fitting the observed data using the least squares approach. Under the Gauss-Markov conditions (i.e., the $\epsilon_i$ are assumed to be independent and identically distributed random variables, with zero mean and finite variance $\sigma^2$), the ordinary least squares estimate $\hat{\beta}$ is guaranteed to be the best linear unbiased estimator (BLUE). Moreover, under the further assumption that $\epsilon \sim N(0, \sigma^2 I_n)$, $\hat{\beta}$ allows statistical inference to be carried out on the model, as described later. The validity of both the
Gauss-Markov conditions and the normal distribution of the error term are known as the "white noise" conditions. In this context, the linear regression model is also known as the regression of the "mean", since it models the conditional expectation of Y given X as $\hat{Y} = E(Y|X) = X^T\hat{\beta}$, where E(Y|X) denotes the conditional expected value of Y for fixed values of the regressors X.
The linear regression model not only allows estimating the regression coefficients $\beta$ as $\hat{\beta}$ (and hence quantifying the strength of the relationship between Y and each of the p explanatory variables when the remaining p-1 are fixed), but also selecting those variables that have no relationship with Y (when the remaining ones are fixed), as well as identifying which subsets of explanatory variables have to be considered in order to explain the response Y sufficiently well. These tasks can be carried out by testing the significance of each individual regression coefficient when the others are fixed, by removing the coefficients that are not significant and re-fitting the linear model, and/or by using model selection approaches. Moreover, the linear regression model can also be used for prediction. For this purpose, given the estimated values, $\hat{\beta}$, it is possible to predict the response, $\hat{Y}_0 = x_0^T \hat{\beta}$, corresponding to any novel value $x_0$, and to estimate the uncertainty of such a prediction. The uncertainty depends on the type of prediction one wants to make. In fact, it is possible to compute two types of confidence intervals: one for the expectation of a predicted value at a given point $x_0$, and one for a future generic observation at a given point $x_0$.
As the number, p, of explanatory variables increases, the least squares approach suffers from a series of problems, such as lack of
prediction accuracy and difficulty of interpretation. To address these problems, it is desirable to have a model with only a small
number of “important” variables, which is able to provide a good explanation of the outcome and good generalization at the price
of sacrificing some details. Model selection consists in identifying which subsets of explanatory variables have to be "selected" to explain the response Y sufficiently well, making a compromise referred to as the bias-variance trade-off. This is equivalent to choosing between competing linear regression models (i.e., with different combinations of variables). On the one hand, one has to consider that including too few variables leads to a so-called "underfit" of the data, characterized by poor prediction performance with high bias and low variance. On the other hand, selecting too many variables gives rise to a so-called "overfit" of the data, characterized by poor prediction performance with low bias and high variance. Stepwise linear regression is an attempt to address this problem (Miller, 2002; Sen and Srivastava, 1990), constituting a specific example of subset regression analysis. Although model selection can be used in the classical regression context, it is one of the most effective tools in high dimensional data analysis.
Classical regression deals with the case $n \geq p$, where n denotes the number of independent observations (i.e., the sample size) and p the number of variables. Nowadays, in many applications, especially in biomedical science, high-throughput assays are capable of measuring from thousands to hundreds of thousands of variables on a single statistical unit. Therefore, one often has to deal with the case $p \gg n$. In such a case, ordinary least squares cannot be applied, and other types of approaches (for example, including the use of a penalization function) such as Ridge regression, Lasso or Elastic net regression (Hastie et al., 2009; James et al., 2013; Tibshirani, 1996, 2011) have to be used for estimating the regression coefficients. In particular, Lasso is very effective since it also performs variable selection, and it has opened the new framework of high-dimensional regression (Bühlmann and van de Geer, 2011; Hastie et al., 2015). Model selection and high dimensional data analysis are strongly connected, and they might also benefit from dimension reduction techniques such as principal component analysis, or from feature selection.
In the classical framework, Y is treated as a random variable, while X is considered fixed; hence, depending on the distribution of Y, different types of regression models can be defined. With X fixed, the assumptions on the distribution of Y are elicited through the distribution of the error term $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T$. As mentioned above, classical linear regression requires the error term to satisfy the Gauss-Markov conditions and to be normally distributed. However, when the error term is not normally distributed, linear regression might not be appropriate. Generalized linear models (GLM) constitute a generalization of classical linear regression that allows the response variable Y to have an error distribution other than normal (McCullagh and Nelder, 1989). In this way, GLM generalize linear regression by allowing the linear model to be related to the response variable via a link function, and by allowing the magnitude of the variance of each measurement to be a function of its predicted value. GLM thus represent a wide framework that includes linear regression, logistic regression, Poisson regression, multinomial regression, etc. In this framework the regression coefficients can be estimated using the maximum likelihood approach, often solved by iteratively reweighted least squares algorithms.
In the following, we briefly summarize the key concepts and definitions related to linear regression, moving from the simple
linear model, to the multiple linear model. In particular, we discuss the Gauss-Markov conditions and the properties of the least
squares estimate. We discuss the concepts of model selection and also provide suggestions on how to handle outliers and deviations from the standard assumptions. Then, we discuss modern forms of regression, such as Ridge regression, Lasso and Elastic
Net, which are based on penalization terms and are particularly useful when the dimension of the variable space, p, increases. We
conclude by extending the linear regression concepts to the Generalized Linear Models (GLM).
Background/Fundamentals
In the following we first introduce the main concepts and formulae for simple linear regression (p = 1), then we extend the regression model to the case p > 1. We note that, when p = 1, simple linear regression fits the "best" straight line through the observed data points, while for p > 1 it fits the "best" hyper-plane through the observed data points. Moreover, while Y has to be a quantitative variable, the $X_k$ can be either quantitative or categorical. However, categorical covariates have to be transformed into a series of dummy variables using indicators, as sketched below.
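As an illustration, the following minimal Python sketch shows one common way of creating such indicator variables with pandas; the data frame and its column names are hypothetical and only serve to show the mechanics.

```python
# A minimal sketch of dummy (indicator) coding for a categorical covariate,
# assuming a hypothetical data frame with one numeric and one categorical column.
import pandas as pd

df = pd.DataFrame({
    "age":  [34, 51, 42, 60],
    "diet": ["standard", "vegetarian", "standard", "vegan"],  # categorical covariate
})

# drop_first=True removes one reference level, avoiding perfect collinearity
# with the intercept column of the design matrix.
X = pd.get_dummies(df, columns=["diet"], drop_first=True)
print(X)
```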
Simple Linear Regression
As mentioned before, simple linear regression (Altman and Krzywinski, 2015; Casella and Berger, 2001) is a statistical model used
to study the relationship f() between two (quantitative) variables x and Y, where Y represents the response variable and x the
explanatory variable (i.e., x is one particular Xk), assuming that the mathematical relation can be described by a straight line. In
particular, the aim of simple linear regression is estimating the straight line that best fits the observed data, by estimating its
coefficients (i.e., the intercept and the slope), quantifying the uncertainty of such estimates, testing the significance of the relationship, and finally using the estimated line to predict the value of Y from the observed values of x. Typically the x variable is treated as fixed, and the aim is to estimate the relationship between the mean of Y and x (i.e., $E[Y|x]$, where E denotes the conditional expectation of Y given x).
More formally, to fix the notation and introduce the general concepts, assume we have collected n observations $(x_i, Y_i), i = 1, \ldots, n$. A simple linear regression model can be written as
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, \ldots, n$$
where the $\epsilon_i$ are independent and identically distributed random variables (with zero mean and finite variance $\sigma^2$) and $\beta_0 + \beta_1 x$ represents a straight line, $\beta_0$ being the intercept and $\beta_1$ the slope. In particular, when $\beta_1 > 0$, x and Y vary together; when $\beta_1 < 0$, x and Y vary in opposite directions.
Let $Y = (Y_1, Y_2, \ldots, Y_n)^T$ and $\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_n)^T$ denote the vectors of the observed outcomes and the noise, respectively, $X = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{pmatrix}^T = (\mathbf{1}_n \;\; x)$ the matrix of regressors (usually called the design matrix), and $\beta = (\beta_0, \beta_1)^T$ the vector of unknown coefficients; then the regression problem can be rewritten in matrix form as
$$Y = X\beta + \epsilon$$
The aim is to estimate the coefficients $\beta = (\beta_0, \beta_1)^T$ that provide the "best" fit for the observed data. This "best" fit is often achieved by using the ordinary least-squares approach, i.e., by finding the coefficients that minimize the sum of the squared residuals, or in mathematical terms by solving the following problem
$$\operatorname*{argmin}_{\beta_0, \beta_1} \sum_{i=1}^n \epsilon_i^2 = \operatorname*{argmin}_{\beta_0, \beta_1} \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 x_i)^2$$
After straightforward calculations, it is possible to prove that
$$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}$$
where $\bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ denote the sample means of Y and x, respectively. The estimate $\hat{\beta}_1$ can be rewritten as
$$\hat{\beta}_1 = \frac{s_{xY}}{s_{xx}} = r_{xY}\sqrt{\frac{s_{YY}}{s_{xx}}}$$
where $s_{xx}$ and $s_{YY}$ denote the sample variances of x and Y, respectively; $r_{xY}$ and $s_{xY}$ denote the sample correlation coefficient and the sample covariance between x and Y, respectively.
Given the estimated parameters $\hat{\beta}_0$, $\hat{\beta}_1$, it is possible to estimate the response, $\hat{Y}_i$, as
$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i, \quad i = 1, \ldots, n$$
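As an illustration, a minimal Python/numpy sketch of these closed-form estimates on synthetic data might look as follows (the simulated model and variable names are purely illustrative).

```python
# A minimal numpy sketch of the closed-form ordinary least squares estimates
# for simple linear regression (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)   # true intercept 2.0, slope 0.5

x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar
y_hat = beta0_hat + beta1_hat * x                    # fitted values

print(beta0_hat, beta1_hat)
```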
The least squares approach provides a good estimate of the unknown parameters if the so-called Gauss-Markov conditions are satisfied (i.e., (i) $E(\epsilon_i) = 0$, $\forall i = 1, \ldots, n$; (ii) $E(\epsilon_i^2) = \sigma^2$, $\forall i = 1, \ldots, n$; and (iii) $E(\epsilon_i \epsilon_j) = 0$, $\forall i \neq j$, where $E(\cdot)$ denotes the expected value). In particular, according to the Gauss-Markov theorem, if such conditions are satisfied, the least squares approach provides the best linear unbiased estimator (BLUE) of the parameters, i.e., the estimate with the lowest variance among all linear unbiased estimators. Furthermore, if we also assume that $\epsilon_i \sim N(0, \sigma^2)$, then the least squares solution provides the best estimator among all unbiased estimators, as well as the maximum likelihood estimator. Additionally, under such assumptions ("white noise" assumptions) it is possible to prove that
$$\frac{\hat{\beta}_i - \beta_i}{s_{\hat{\beta}_i}} \sim T_{n-2}, \quad i = 0, 1$$
where $s_{\hat{\beta}_i}$ is the standard error of the estimator $\hat{\beta}_i$ (for the slope, $s_{\hat{\beta}_1} = \sqrt{\frac{\frac{1}{n-2}\sum_{i=1}^n \hat{\epsilon}_i^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}$), and $T_{n-2}$ denotes the Student t-distribution with $n-2$ degrees of freedom. On the basis of this result, the 100(1-$\alpha$)% confidence interval for the coefficient $\beta_i$ is given by
$$\beta_i \in \left[\hat{\beta}_i - s_{\hat{\beta}_i}\, t_{n-2,\alpha/2},\; \hat{\beta}_i + s_{\hat{\beta}_i}\, t_{n-2,\alpha/2}\right], \quad i = 0, 1$$
where $t_{n-2,\alpha/2}$ denotes the $(1-\alpha/2)$ quantile of $T_{n-2}$.
Moreover, the significance of each coefficient can be evaluated by testing the null hypothesis $H_{0,i}: \beta_i = 0$ versus the alternative $H_{1,i}: \beta_i \neq 0$. To perform such a test, it is possible to use the test statistic $\frac{\hat{\beta}_i}{s_{\hat{\beta}_i}} \sim T_{n-2}$, $i = 0, 1$.
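The following sketch illustrates these standard error, confidence interval and t-test computations on synthetic data, using scipy only for the Student quantiles; the choice $\alpha = 0.05$ is arbitrary.

```python
# A sketch of the t-based inference formulas above for simple linear regression.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar
resid = y - (b0 + b1 * x)

# standard error of the slope: sqrt( (1/(n-2)) * sum(resid^2) / sum((x - x_bar)^2) )
s_b1 = np.sqrt(np.sum(resid ** 2) / (n - 2) / np.sum((x - x_bar) ** 2))

alpha = 0.05
t_quant = stats.t.ppf(1 - alpha / 2, df=n - 2)       # (1 - alpha/2) quantile of T_{n-2}
ci_b1 = (b1 - t_quant * s_b1, b1 + t_quant * s_b1)   # confidence interval for the slope

t_stat = b1 / s_b1                                   # test of H0: beta_1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(ci_b1, t_stat, p_value)
```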
Finally, the coefficient of determination, R², is commonly used to evaluate the goodness of fit of a simple linear regression model. It is defined as follows
$$R^2 = \frac{\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2} = 1 - \frac{\sum_{i=1}^n (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2}$$
We have that $R^2 \in [0, 1]$, where 0 (or a value close to 0) indicates that the model explains none (or little) of the variability of the response data around its mean, while 1 (or a value close to 1) indicates that the model explains all (or most) of that variability. Moreover, if $R^2 = 1$ the observed data lie exactly on the regression line (perfect fit). Note that $R^2 = r^2$, where r denotes the Pearson correlation coefficient, which can be computed as
$$r = \frac{n\sum_{i=1}^n x_i y_i - \sum_{i=1}^n x_i \sum_{i=1}^n y_i}{\sqrt{n\sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2}\sqrt{n\sum_{i=1}^n y_i^2 - \left(\sum_{i=1}^n y_i\right)^2}}$$
Ordinary least squares regression is the most common approach to defining the "best" straight line that fits the data. However, there are other regression methods that can be used in place of ordinary least squares, such as least absolute deviations (i.e., the line that minimizes the sum of the absolute values of the residuals)
$$\operatorname*{argmin}_{\beta_0, \beta_1} \sum_{i=1}^n |\epsilon_i| = \operatorname*{argmin}_{\beta_0, \beta_1} \sum_{i=1}^n |Y_i - \beta_0 - \beta_1 x_i|$$
see (Sen and Srivastava, 1990) for a more detailed discussion.
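For illustration, the least absolute deviations criterion can be minimized numerically with a generic optimizer; the sketch below uses scipy's Nelder-Mead routine on synthetic data with heavy-tailed noise (dedicated LAD or quantile-regression solvers would normally be preferred).

```python
# A sketch of least absolute deviations fitting via a generic optimizer.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=60)
y = 1.0 + 0.8 * x + rng.standard_t(df=2, size=60)    # heavy-tailed noise with outliers

def sum_abs_residuals(beta):
    b0, b1 = beta
    return np.sum(np.abs(y - b0 - b1 * x))

# Nelder-Mead handles the non-smooth objective without derivatives.
fit = minimize(sum_abs_residuals, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
print(fit.x)   # LAD estimates of (intercept, slope)
```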
Multiple Linear Regression
Multiple linear regression generalizes the above mentioned concepts and formulae to the case where more predictors are used
(Abramovich and Ritov, 2013; Casella and Berger, 2001; Faraway, 2004; Fahrmeir et al., 2013; Jobson, 1991; Krzywinski and
Altman, 2015; Rao, 2002; Sen and Srivastava, 1990).
Let $Y = (Y_1, Y_2, \ldots, Y_n)^T$ and $\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_n)^T$ denote the vectors of the observed outcomes and the noise, respectively,
$$X = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$$
the design matrix, and $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$ the vector of unknown coefficients; then the regression problem can be rewritten in matrix form as
$$Y = X\beta + \epsilon$$
The least squares estimate of $\beta$ can be obtained by solving the following problem
$$\operatorname*{argmin}_{\beta_0, \beta_1, \ldots, \beta_p} \sum_{i=1}^n \epsilon_i^2 = \operatorname*{argmin}_{\beta_0, \beta_1, \ldots, \beta_p} \sum_{i=1}^n \left(Y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \ldots - \beta_p x_{ip}\right)^2 = \operatorname*{argmin}_{\beta}\, (Y - X\beta)^T (Y - X\beta)$$
It can be proved that if $X^T X$ is a non-singular $(p+1) \times (p+1)$ matrix, and $(X^T X)^{-1}$ denotes its inverse matrix, then there exists a unique solution given by
$$\hat{\beta} = (X^T X)^{-1} X^T Y$$
If $X^T X$ is singular, the solution is not unique, but it can still be computed in terms of the pseudo-inverse matrix $(X^T X)^{+}$ as follows
$$\hat{\beta} = (X^T X)^{+} X^T Y$$
Once the estimate $\hat{\beta}$ has been computed, it is possible to estimate $\hat{Y}$ as
$$\hat{Y} = X\hat{\beta} = X(X^T X)^{-1} X^T Y = HY$$
where $H = X(X^T X)^{-1} X^T$ is called the hat-matrix or projection matrix. From a geometrical point of view, in an n-dimensional Euclidean space, $\hat{Y}$ can be seen as the orthogonal projection of Y onto the subspace generated by the columns of X, and the vector of residuals, defined as $e = Y - \hat{Y}$, is orthogonal to that subspace.
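A minimal numpy sketch of these matrix formulas (pseudo-inverse, hat matrix, orthogonality of the residuals) on synthetic data could read as follows.

```python
# OLS via the (pseudo-)inverse and the hat matrix, with p = 3 covariates plus intercept.
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 3
Xcov = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), Xcov])              # n x (p+1) design matrix
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.pinv(X.T @ X) @ X.T @ y         # works even if X^T X is singular
H = X @ np.linalg.pinv(X.T @ X) @ X.T                # hat / projection matrix
y_hat = H @ y                                        # fitted values, equal to X @ beta_hat
residuals = y - y_hat                                # orthogonal to the columns of X
print(beta_hat, np.allclose(X.T @ residuals, 0))
```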
Moreover, under the assumption that the Gauss-Markov conditions hold, the least squares estimate, $\hat{\beta}$, is the BLUE, and an unbiased estimate of the variance is given by
$$\hat{\sigma}^2 = \frac{1}{n-p-1}\sum_{i=1}^n \left(Y_i - \hat{Y}_i\right)^2$$
Analogously to simple linear regression, the coefficient of determination, $R^2 = \frac{\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2}$, can be used to evaluate the goodness of fit ($R^2 \in [0,1]$). However, the value of R² increases as p increases. Therefore, it is not a useful measure for selecting parsimonious models. The adjusted R², $R^2_a$, is defined as
$$R^2_a = 1 - \frac{\sum_{i=1}^n (Y_i - \hat{Y}_i)^2 / (n-p-1)}{\sum_{i=1}^n (Y_i - \bar{Y})^2 / (n-1)} = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)}$$
which adjusts for the number of explanatory variables in the model.
Overall, we have the following partition of the errors: SST = SSR + SSE, where $SST = \sum_{i=1}^n (Y_i - \bar{Y})^2$ denotes the total sum of squares, $SSR = \sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2$ the sum of squares explained by the regression, and $SSE = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2$ the residual sum of squares due to the error term.
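The decomposition and the two goodness-of-fit measures can be verified numerically; the sketch below uses synthetic data and numpy's least squares routine.

```python
# SST = SSR + SSE decomposition, R^2 and adjusted R^2 for a multiple regression fit.
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta_hat

sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)   # sum of squares explained by the regression
sse = np.sum((y - y_hat) ** 2)          # residual sum of squares

r2 = 1 - sse / sst
r2_adj = 1 - (sse / (n - p - 1)) / (sst / (n - 1))
print(np.isclose(sst, ssr + sse), r2, r2_adj)
```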
Gauss-Markov Conditions and Diagnostics
Linear regression and the ordinary least squares approach are based on the so-called Gauss-Markov conditions
1) $E(\epsilon_i) = 0$, $\forall i = 1, \ldots, n$ (zero-mean error)
2) $E(\epsilon_i^2) = \sigma^2$, $\forall i = 1, \ldots, n$ (homoscedastic error)
3) $E(\epsilon_i \epsilon_j) = 0$, $\forall i \neq j$ (uncorrelated errors)
which guarantee that the least squares estimate $\hat{\beta} = (X^T X)^{-1} X^T Y$ is the best linear unbiased estimator (BLUE) of the coefficients (Gauss-Markov theorem; Jobson, 1991; Sen and Srivastava, 1990).
When using linear regression functions it is important to verify the validity of these assumptions using so-called residual plots (Altman and Krzywinski, 2016b). Moreover, in order to perform inference on the significance of the regression coefficients and prediction of future outcomes, it is necessary that $\epsilon \sim N(0, \sigma^2 I_n)$ also holds. This condition can be verified using the Shapiro-Wilk test or the Lilliefors test, see (Razali and Wah, 2011) or the Chapter Statistical Inference Techniques in this volume.
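As an illustration, the sketch below produces a fitted-versus-residual plot and applies the Shapiro-Wilk test to the residuals; it assumes matplotlib and scipy are available and uses synthetic data.

```python
# Basic residual diagnostics: fitted-versus-residual plot and Shapiro-Wilk test.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(5)
n = 80
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.7 * x + rng.normal(scale=1.0, size=n)

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat

w_stat, p_value = stats.shapiro(resid)    # H0: residuals are normally distributed
print(f"Shapiro-Wilk p-value: {p_value:.3f}")

plt.scatter(X @ beta_hat, resid)          # look for patterns or funnel shapes
plt.axhline(0.0, color="grey")
plt.xlabel("fitted values"); plt.ylabel("residuals")
plt.show()
```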
Tests of significance and confidence regions
One of the most important aspects of regression is that it is possible to make inference on the estimated model; this means it is possible to test the significance of each coefficient (i.e., $H_0: \beta_j = 0$ vs $H_1: \beta_j \neq 0$), assuming the others are fixed, and to build confidence intervals. In particular, under the "white noise" assumptions, it is possible to prove that
$$\hat{\beta} \sim N\left(\beta,\, (X^T X)^{-1}\sigma^2\right)$$
and
$$(n-p-1)\,\hat{\sigma}^2 \sim \sigma^2 \chi^2_{n-p-1}$$
where $\chi^2_{n-p-1}$ denotes a chi-squared distribution with $n-p-1$ degrees of freedom. Moreover, the two estimates are statistically independent. Hence, it is possible to show that (under the null hypothesis)
$$\frac{\hat{\beta}_j - \beta_j}{s_{\hat{\beta}_j}} \sim T_{n-p-1}$$
follows a Student distribution $T_{n-p-1}$ with $n-p-1$ degrees of freedom. These results can be used to decide whether the $\beta_j$ are significant (where $s_{\hat{\beta}_j}$ denotes the standard deviation of the estimated coefficient, i.e., $s_{\hat{\beta}_j} = \sqrt{\operatorname{var}(\hat{\beta}_j)}$).
Moreover, the 100(1-$\alpha$)% confidence interval for the regression coefficient $\beta_j$ is given by $\hat{\beta}_j \pm s_{\hat{\beta}_j}\, t_{n-p-1,\alpha/2}$, where $t_{n-p-1,\alpha/2}$ is the corresponding quantile of the T distribution with $n-p-1$ degrees of freedom. More details about significance testing, including the case of contrasts between coefficients, can be found in Sen and Srivastava (1990), Jobson (1991).
Inference for the model
The overall goodness of fit of the model can be assessed by testing the null hypothesis $H_0: \beta_1 = \beta_2 = \ldots = \beta_p = 0$ using the F-test
$$F = \frac{SSR/p}{SSE/(n-p-1)}$$
which, under the null hypothesis, has a Fisher distribution with p and $n-p-1$ degrees of freedom.
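In practice these quantities are rarely computed by hand; the following sketch shows how they are reported by statsmodels (one possible choice of software) on a synthetic example.

```python
# Coefficient t-tests, confidence intervals and the overall F-test with statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n, p = 100, 3
Z = rng.normal(size=(n, p))
y = 1.0 + 2.0 * Z[:, 0] - 1.0 * Z[:, 1] + rng.normal(scale=0.5, size=n)  # Z[:, 2] is irrelevant

X = sm.add_constant(Z)                   # prepend the intercept column
results = sm.OLS(y, X).fit()

print(results.params)                    # estimated coefficients beta_hat
print(results.bse)                       # standard errors s_beta_hat
print(results.pvalues)                   # t-test p-values for H0: beta_j = 0
print(results.conf_int(alpha=0.05))      # 95% confidence intervals
print(results.fvalue, results.f_pvalue)  # overall F-test of H0: beta_1 = ... = beta_p = 0
print(results.rsquared_adj)              # adjusted R^2
```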
Confidence interval (CI) for the expectation of a predicted value at x0
Let $x_0 = (x_{00}, x_{01}, \ldots, x_{0p})^T$ be a novel observation of the independent variables X (with $x_{00} = 1$ for the intercept), and let $y_0$ be the unobserved response. Then the predicted response is $\hat{y}_0 = x_0^T \hat{\beta}$ and the 100(1-$\alpha$)% confidence interval for its expected value is
$$\hat{y}_0 \pm t_{n-p-1,\alpha/2}\, \sqrt{\hat{\sigma}^2\, x_0^T (X^T X)^{-1} x_0}$$
CI for a future observation at x0
Under the same settings, the 100(1-$\alpha$)% confidence interval for a future observation at $x_0$ is
$$\hat{y}_0 \pm t_{n-p-1,\alpha/2}\, \sqrt{\hat{\sigma}^2\left(1 + x_0^T (X^T X)^{-1} x_0\right)}$$
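The two intervals can be computed directly from the matrix quantities defined above; the sketch below does so on synthetic data for an arbitrary new point x0.

```python
# Confidence interval for E[y0] and prediction interval for a future y0 at a new point x0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, p = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=0.8, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - p - 1)

x0 = np.array([1.0, 0.2, -1.5])                       # new point, intercept included
y0_hat = x0 @ beta_hat
t_q = stats.t.ppf(0.975, df=n - p - 1)                # alpha = 0.05

half_mean = t_q * np.sqrt(sigma2_hat * x0 @ XtX_inv @ x0)        # CI for E[y0]
half_pred = t_q * np.sqrt(sigma2_hat * (1 + x0 @ XtX_inv @ x0))  # CI for a future y0
print((y0_hat - half_mean, y0_hat + half_mean))
print((y0_hat - half_pred, y0_hat + half_pred))
```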
Subset Linear Regression
In many cases, the number of observed variables, X, is large and one seeks a regression model with a smaller number of "important" variables in order to gain explanatory power and to address the so-called bias-variance trade-off. In these cases a small subset of the original variables must be selected, and the regression hyper-plane fitted using that subset. In principle, one can choose a goodness criterion and compare all potential models using such a criterion. However, if p denotes the number of observed variables, there are 2^p potential models to compare. Therefore, for large p such an approach becomes unfeasible. Stepwise linear regression methods allow one to select a subset of variables by adding or removing one variable at a time and re-fitting the model. In this way, it is possible to choose a "good" model by fitting only a limited number of potential models. Mallows' Cp, AIC (Akaike's information criterion), and BIC (Bayesian information criterion) are widely used for choosing "good" models. More details can be found in Sen and Srivastava (1990), Jobson (1991), Miller (2002), Lever et al. (2016).
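As an illustration, the sketch below implements a simple forward stepwise search driven by AIC using statsmodels; it is a greedy heuristic on synthetic data, not an exhaustive comparison of all 2^p models.

```python
# Forward stepwise variable selection driven by AIC (greedy, not exhaustive).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n, p = 120, 6
Z = rng.normal(size=(n, p))
y = 2.0 + 1.5 * Z[:, 0] - 1.0 * Z[:, 3] + rng.normal(scale=1.0, size=n)

selected, remaining = [], list(range(p))
best_aic = sm.OLS(y, np.ones((n, 1))).fit().aic        # intercept-only model

while remaining:
    # try adding each remaining variable and keep the one with the lowest AIC
    aics = {j: sm.OLS(y, sm.add_constant(Z[:, selected + [j]])).fit().aic
            for j in remaining}
    j_best = min(aics, key=aics.get)
    if aics[j_best] >= best_aic:
        break                                          # no further improvement
    best_aic = aics[j_best]
    selected.append(j_best)
    remaining.remove(j_best)

print("selected columns:", selected, "AIC:", best_aic)
```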
Regression Analysis in Practice
How to Report the Results of a Linear Regression Analysis and Diagnostic Plots
After fitting a linear regression model one has to report the estimates of the regression coefficients and their significance. However,
this is not sufficient for evaluating the overall significance of the fit. The R2 (or the R2a ) are typical measures of quality that should
be reported, as well as the overall significance, F, of the regression model. Moreover, results have to be accompanied by a series of
diagnostic plots and statistics on residuals that can be used to validate the Gauss-Markov conditions. Such analyses include normal quantile-quantile (Q-Q) plots of the residuals and fitted-versus-residual plots. The key aspect to keep in mind is that, when fitting a linear regression model using any software, the result will be the "best" hyper-plane or the "best" straight line even when the model is inappropriate. For example, in the case of simple linear regression the best line will be returned even though the data are not linear. Therefore, to avoid false conclusions, all of the above-mentioned points have to be considered. Leverage, DFFITS, and Cook's distance of the observations are other measures that can be used to assess the quality and robustness of the fit, see (Sen and Srivastava, 1990; Jobson, 1991; Altman and Krzywinski, 2016b) for more details. In cases where the assumptions are
violated, one can use data transformations to mitigate the deviation from the assumptions, or use more sophisticated regression
models, such as generalized linear models, non-linear models, and non-parametric regression approaches depending on the type
of assumptions one can place on the data.
Outliers and Influential Observations
The presence of outliers can be connected to potential problems that may undermine the validity of a regression model (Sen and Srivastava, 1990; Jobson, 1991; Altman and Krzywinski, 2016a). In particular, an outlier is a point that does not follow the general trend of the rest of the data. Hence it can show an extreme value for the response Y, for any of the predictors Xk, or both. Its presence may be due either to measurement error, or to the presence of a sub-population that does not follow the expected distribution. One might observe one or a few outliers when fitting a regression model. An influential point is an outlier that strongly affects the estimates of the regression coefficients. Anscombe's quartet (Anscombe, 1973) is a typical example used to illustrate how influential points can distort conclusions. Roughly speaking, to measure the influence of an outlier, one can
compute the regression coefficients with and without the outlier. More formal approaches for detecting outliers and influential
points are described in Sen and Srivastava (1990), Jobson (1991), Altman and Krzywinski (2016a).
Transformations
As already mentioned, when the Gauss-Markov conditions are not satisfied, ordinary least squares cannot be used. However, depending on the deviation from the assumptions, either a transformation can be applied in order to meet the assumptions, or other regression approaches can be used. More generally, transformations can be applied to the data for wider purposes, for example for centering and standardizing observations when the variables have different magnitudes. Other widely used transformations are
logarithmic and square-root transforms that can help in linearizing relationships. Additionally, in the context of linear regression,
variance-stabilizing transformations can be used to accommodate heteroscedasticity, and normalizing transformations to better
match the assumption of normal distribution. Although several transformations have been proposed in the literature, there is no
standard approach. Moreover, an intensive use of transformations might be questionable. The reader is referred to (Sen and
Srivastava, 1990) for more details.
Beyond ordinary least squares approaches
As previously stated, linear regression is also known as the regression of the "mean", since the conditional mean of Y given X is modeled as a linear combination of the regressors, X, plus an error term. In this case, the least squares approach constitutes the BLUE
when the Gauss-Markov conditions hold. However, other types of criteria, such as quantile regression, have been proposed
(McKean, 2004; Koenker, 2005; Maronna et al., 2006). In particular, quantile regression aims at estimating either the conditional
median or other conditional quantiles of the response variable. In this case, the estimates are obtained using linear programming
optimization algorithms that solve the corresponding minimization problem. Such types of regression models are more robust
than least squares with respect to the presence of outliers.
Advanced Approaches
High Dimensional Regression (p ≫ n)
Classical linear regression, as described above, requires the matrix $X^T X$ to be invertible; this implies that $n \geq p$, where n denotes the number of independent observations (i.e., the sample size) and p the number of variables. Current advances in science and technology allow the measurement of thousands or millions of variables on the same statistical unit. Therefore, one often has to deal with the case $p \gg n$, where ordinary least squares cannot be applied, and other types of approaches must be used. The solution in such cases is to use penalized approaches such as Ridge regression, Lasso regression, Elastic net, or others (Hastie et al., 2009; James et al., 2013; Tibshirani, 1996, 2011; Zou and Hastie, 2005), where a penalty term, $P(\beta)$, is added to the fitting criterion, as follows
$$(Y - X\beta)^T (Y - X\beta) + \lambda\, P(\beta)$$
where $\lambda$ is the so-called regularization parameter, which controls the trade-off between bias and variance. In practice, by sacrificing the unbiasedness of ordinary least squares, one can achieve generalization and flexibility. Different penalty terms lead to different regression methods (Bühlmann and van de Geer, 2011; Hastie et al., 2015), also known as regularization techniques. The regularization parameter, $\lambda$, is usually selected using criteria such as cross-validation.
Other possible approaches include the use of dimension reduction techniques such as principal component analysis or feature selection, as described in the Chapter Dimension Reduction Techniques in this volume.
Ridge regression
The estimates of the regression coefficients are obtained by solving the following minimization problem
$$\hat{\beta}^{Ridge} = \operatorname*{argmin}_{\beta}\left[\sum_{i=1}^n \left(Y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \ldots - \beta_p x_{ip}\right)^2 + \lambda \sum_{j=1}^p \beta_j^2\right] = \operatorname*{argmin}_{\beta}\left[(Y - X\beta)^T (Y - X\beta) + \lambda \|\beta\|_2^2\right]$$
where $\lambda > 0$ is a suitable regularization parameter. Since the minimization problem is convex, it is easy to show that it has a unique solution, which can be computed in closed form as
$$\hat{\beta}^{Ridge} = \left(X^T X + \lambda I\right)^{-1} X^T Y$$
which corresponds to shrinking the ordinary least squares coefficients by an amount that is controlled by $\lambda$. Ridge regression can be used when $X^T X$ is singular or quasi-singular and ordinary least squares does not provide a unique solution. In fact, in such circumstances $X^T X + \lambda I$ is still invertible. Note that, as $\lambda \to 0$ the penalty plays a "minor" role, thus $\hat{\beta}^{Ridge}$ tends to $\hat{\beta}$ (i.e., the coefficients obtained by using ordinary least squares); when $\lambda \to +\infty$, $\hat{\beta}^{Ridge}$ tends to zero (i.e., to the so-called intercept-only model). Different values of $\lambda$ provide a trade-off between bias and variance (larger $\lambda$ increases the bias, but reduces the variance). A suitable value of $\lambda$ can be chosen from the data by cross-validation. Although ridge regression can fit high dimensional data more effectively than the ordinary least squares approach, it has some limitations, such as a large bias toward zero for large regression coefficients, and a lack of interpretability of the solution: "unimportant" coefficients are shrunk towards zero, but they remain in the model instead of being set exactly to zero. As a consequence, ridge regression does not perform model selection.
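For illustration, the closed-form ridge solution can be computed directly; the sketch below centres the data so that no intercept term is penalized (a common convention, here an implementation choice) and compares the norms of the ridge and OLS coefficients.

```python
# Closed-form ridge regression on centred synthetic data.
import numpy as np

rng = np.random.default_rng(9)
n, p = 50, 20
X = rng.normal(size=(n, p))
beta_true = np.concatenate([np.array([3.0, -2.0, 1.5]), np.zeros(p - 3)])
y = X @ beta_true + rng.normal(scale=1.0, size=n)

Xc = X - X.mean(axis=0)            # centre columns so the intercept drops out
yc = y - y.mean()

lam = 1.0                          # regularization parameter, normally tuned by cross-validation
beta_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
beta_ols = np.linalg.lstsq(Xc, yc, rcond=None)[0]

# ridge coefficients are shrunk towards zero relative to OLS
print(np.linalg.norm(beta_ridge), np.linalg.norm(beta_ols))
```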
Lasso regression
The estimates of the regression coefficients are obtained by solving the following minimization problem
$$\hat{\beta}^{Lasso} = \operatorname*{argmin}_{\beta}\left[\sum_{i=1}^n \left(Y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \ldots - \beta_p x_{ip}\right)^2 + \lambda \sum_{j=1}^p |\beta_j|\right] = \operatorname*{argmin}_{\beta}\left[(Y - X\beta)^T (Y - X\beta) + \lambda \|\beta\|_1\right]$$
where $\lambda > 0$ is a suitable regularization parameter. The underlying idea of the lasso approach is that it seeks a sparse solution, meaning that it will set some regression coefficients exactly equal to 0. As a consequence, lasso also performs model selection. The larger the value of $\lambda$, the more coefficients are set to zero. Unfortunately, the solution of the lasso minimization problem is not available in closed form; however, it can be obtained using convex minimization approaches, and several algorithms have been
proposed, such as least-angle regression (LARS) (Efron et al., 2004), to efficiently fit the model. Analogously to ridge regression, a suitable value of $\lambda$ can be chosen from the data by cross-validation.
Overall, lasso has opened a new framework in the so-called high-dimensional regression models and several generalizations have
been proposed (Bühlmann and van de Geer, 2011; Hastie et al., 2015) to overcome its limitations and to extend the original idea
to different regression contexts.
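In practice the lasso is fitted with specialized software; the sketch below uses scikit-learn on synthetic sparse data, with the penalty (called alpha in scikit-learn) chosen by cross-validation.

```python
# Lasso fitting with scikit-learn on synthetic sparse data with p >> n.
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

rng = np.random.default_rng(10)
n, p = 80, 200                      # p >> n
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]
y = X @ beta_true + rng.normal(scale=0.5, size=n)

lasso_cv = LassoCV(cv=5).fit(X, y)              # selects the penalty by cross-validation
lasso = Lasso(alpha=lasso_cv.alpha_).fit(X, y)

print("chosen alpha:", lasso_cv.alpha_)
print("non-zero coefficients:", np.sum(lasso.coef_ != 0))  # sparse solution
```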
Elastic net regression
The estimates of the regression coefficients are obtained by solving the following minimization problem
$$\hat{\beta}^{Elastic\ net} = \operatorname*{argmin}_{\beta}\left[\sum_{i=1}^n \left(Y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \ldots - \beta_p x_{ip}\right)^2 + \lambda_1 \sum_{j=1}^p \beta_j^2 + \lambda_2 \sum_{j=1}^p |\beta_j|\right] = \operatorname*{argmin}_{\beta}\left[(Y - X\beta)^T (Y - X\beta) + \lambda_1 \|\beta\|_2^2 + \lambda_2 \|\beta\|_1\right]$$
where $\lambda_1 > 0$ and $\lambda_2 > 0$ are suitable regularization parameters (Zou and Hastie, 2005). Note that for $\lambda_1 = 0$ one obtains lasso regression, while the case $\lambda_2 = 0$ corresponds to ridge regression. Different combinations of $\lambda_1$ and $\lambda_2$ compromise between shrinking and selecting coefficients. Indeed, elastic net was designed to overcome some of the limitations of both ridge regression and lasso regression, since the quadratic penalty term shrinks the regression coefficients toward zero, while the absolute penalty term acts as model selection by keeping or killing regression coefficients. For example, elastic net is more robust than lasso when correlated predictors are present in the model. In fact, when there is a group of highly correlated variables, lasso usually selects only one variable from the group (ignoring the others), whereas elastic net allows more variables to be selected. Moreover, when p > n lasso can select at most n variables before saturating, whereas, thanks to the quadratic penalty, elastic net can select a larger number of variables in the model. At the same time, "unimportant" regression coefficients can still be set exactly to zero, so that variable selection is performed.
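A corresponding scikit-learn sketch for the elastic net is given below; note that scikit-learn parameterizes the penalty through a single alpha and an l1_ratio rather than through two separate parameters λ1 and λ2 as in the formula above.

```python
# Elastic net with scikit-learn on synthetic data containing correlated predictors.
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(11)
n, p = 80, 200
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)   # a pair of highly correlated predictors
beta_true = np.zeros(p)
beta_true[:4] = [2.0, 2.0, -1.5, 1.0]
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# scikit-learn combines the two penalties through alpha and l1_ratio,
# both chosen here by cross-validation.
enet = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5).fit(X, y)
print("chosen l1_ratio:", enet.l1_ratio_, "alpha:", enet.alpha_)
print("non-zero coefficients:", np.sum(enet.coef_ != 0))
```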
Other Types of Regressions
Generalized linear models (GLM) are a well-known generalization of the above-described linear model. GLM allow the dependent
variable, Y, to be generated by any distribution f() belonging to the exponential family. The exponential family includes normal,
binomial, Poisson, and gamma distribution among many others. Therefore GLM constitute a general framework in which to
handle different types of relationships.
The model assumes that the mean of Y depends on X by means of a link function, g(),
$$E(Y) = \mu = g^{-1}(X\beta)$$
where E() denotes expectation and g() the link function (an invertible, continuous and differentiable function). In practice, $g(\mu) = X\beta$, so there is a linear relationship between X and a function of the mean of Y. Moreover, in this context the variance is also a function of the mean, $Var(Y) = V(\mu) = V(g^{-1}(X\beta))$.
In this context, the linear regression framework is recovered by choosing the identity as the link function, Poisson regression corresponds to choosing the log link, binomial (logistic) regression to choosing the logit link, and so on.
The unknown regression coefficients $\beta$ are typically estimated with maximum likelihood, maximum quasi-likelihood, or
Bayesian techniques. Inference can then be carried out in a similar way as for linear regression. A formal description and
mathematical treatment of GLM can be found in McCullagh and Nelder (1989), Madsen and Thyregod (2011). GLM are very
important for biomedical applications since they include logistic and Poisson regression, which are often used in biomedical
science to model binary outcomes or count data, respectively.
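As an illustration, the sketch below fits a logistic regression (a GLM with Binomial family and logit link) with statsmodels on synthetic binary data; a Poisson regression for count data would only change the family.

```python
# Logistic regression as a GLM (Binomial family, default logit link) with statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(12)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
lin_pred = -0.5 + 1.2 * x1 - 0.8 * x2
prob = 1.0 / (1.0 + np.exp(-lin_pred))                # inverse logit link
y = rng.binomial(1, prob)                             # binary outcomes

X = sm.add_constant(np.column_stack([x1, x2]))
logit_fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(logit_fit.params)                               # estimated coefficients
print(logit_fit.pvalues)                              # Wald tests on the coefficients

# A Poisson regression for count data would use sm.families.Poisson() instead.
```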
Recently, penalized regression approaches, such as those described in the section on high dimensional regression, have been extended to generalized linear models. In this context, the regularization is achieved by penalizing the log-likelihood function; see
(Bühlmann and van de Geer, 2011; Hastie et al., 2015) for more details.
Closing Remarks
Regression constitutes one of the most relevant frameworks of modern statistical inference with many applications to the analysis
of biomedical data, since it allows one to study the relationship between a dependent variable Y and a series of p independent
variables $X = [X_1 | \ldots | X_p]$, from a set of independent observations $(x_i, Y_i), i = 1, \ldots, n$. When the relationship is linear in the coefficients, one has linear regression. Despite its simplicity, linear regression allows testing the significance of the coefficients,
estimating the uncertainty and predicting novel outcomes. The Gauss-Markov conditions guarantee that the least squares estimator
has the BLUE property, and the normality of the residuals allows carrying out inference. When p4n, classical linear regression
cannot be applied, and penalized approaches have to be used. Ridge, lasso and elastic net regression are the most well known
approaches in the context of high dimensional data analysis (Bühlmann and van de Geer, 2011; Hastie et al., 2015). When the
white-noise conditions are violated, data transformations or other types of models have to be considered. Analogously, when the relationship is not linear, other approaches or non-parametric models should be used.
References
Abramovich, F., Ritov, Y., 2013. Statistical Theory: A Concise Introduction. Chapman & Hall/CRC.
Altman, N., Krzywinski, M., 2015. Simple linear regression. Nature Methods 12 (11), 999–1000.
Altman, N., Krzywinski, M., 2016a. Analyzing outliers: Influential or nuisance? Nature Methods 13 (4), 281–282.
Altman, N., Krzywinski, M., 2016b. Regression diagnostics. Nature Methods 13 (5), 385–386.
Anscombe, F.J., 1973. Graphs in statistical analysis. American Statistician 27 (1), 17–21.
Bühlmann, P., van de Geer, S., 2011. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer series in statistics.
Casella, G., Berger, R., 2001. Statistical Inference, second ed. Duxbury.
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., 2004. Least angle regression. Annals of Statistics 32 (2), 407–499.
Faraway, J.J., 2004. Linear Models with R. Chapman & Hall/CRC.
Fahrmeir, L., Kneib, T., Lang, S., Marx, B., 2013. Regression: Models, Methods and Applications. Springer.
Hastie, T., Tibshirani, R., Friedman, J., 2009. The Elements of Statistical Learning, second ed. Springer series in Statistics.
Hastie, T., Tibshirani, R., Wainwright, M., 2015. Statistical Learning with sparsity: The Lasso and Generalizations. CRC Press.
James, G., Witten, D., Hastie, T., Tibshirani, R., 2013. An Introduction to Statistical Learning: With Applications in R. Springer.
Jobson, J.D., 1991. Applied Multivariate Data Analysis. Volume I: Regression and Experimental Design. Springer Texts in Statistics.
Koenker, R., 2005. Quantile Regression. Cambridge University Press.
Krzywinski, M., Altman, N., 2015. Multiple linear regression. Nature Methods 12 (12), 1103–1104.
Lever, J., Krzywinski, M., Altman, N., 2016. Model selection and overfitting. Nature Methods 13 (9), 703–704.
Madsen, H., Thyregod, P., 2011. Introduction to General and Generalized Linear Models. Chapman & Hall/CRC.
Maronna, R., Martin, D., Yohai, V., 2006. Robust Statistics: Theory and Methods. Wiley.
McCullagh, P., Nelder, J., 1989. Generalized Linear Models, second ed. Boca Raton, FL: Chapman and Hall/CRC.
McKean, J.W., 2004. Robust analysis of linear models. Statistical Science 19 (4), 562–570.
Miller, A., 2002. Subset Selection in Regression. Chapman and Hall/CRC.
Rao, C.R., 2002. Linear Statistical Inference and its Applications, Wiley series in probability and statistics, second ed. New York: Wiley.
Razali, N.M., Wah, Y.B., 2011. Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests. Journal of Statistical Modeling and Analytics 2
(1), 21–33.
Sen, A., Srivastava, M., 1990. Regression Analysis: Theory, Methods and Applications. Springer Texts in Statistics.
Tibshirani, R., 1996. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society B 58 (1), 267–288.
Tibshirani, R., 2011. Regression shrinkage and selection via the lasso: A retrospective. Journal of the Royal Statistical Society B 73 (3), 273–282.
Zou, H., Hastie, T., 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society B 67 (2), 301–320.