Beyond OLS: A non-technical guide to econometric methods [1]
The workhorse method in quantitative social science has long been Ordinary Least Squares (OLS). Its
popularity is due to several reasons. The first is that given a set of assumptions, which we will come
back to later, it yields the best linear unbiased estimators (BLUE), where “best” refers to the
estimator with the lowest variance. Second, the estimators based on OLS are relatively easy to
compute. Third, they are relatively straightforward to interpret. With the advent of powerful
statistical software packages, the second argument is no longer as crucial as it once was. Therefore,
with the recognition that the restrictive assumptions underlying the OLS-estimators as BLUE often do
not hold, social scientists are increasingly using more complicated techniques when analyzing data. In
this short paper, I will first review the crucial assumptions underlying the OLS-as-BLUE result. Then, I
will look at some more advanced techniques used by social scientists, describe them in non-technical
language, and explain why they are preferable to OLS under particular circumstances. This is only a
cursory text aimed at providing an initial presentation of these techniques and the rationale underlying
them. Those who are interested in learning more about these techniques, or who are contemplating
using them in their own analysis, can consult, for example (in increasing order of technical
difficulty), Kennedy (2003), Gujarati (2003) or Greene (2003). At some point, learning more about
these techniques requires a certain degree of mathematical knowledge. However, I believe it is
possible to comprehend the core of these techniques without resorting to overly technical or
mathematical language. My aim is to give the readers enough understanding to be able to
read and understand empirical social science papers and books that use these techniques, and
maybe even be able to critically assess such works. Readers are assumed to have basic knowledge of
OLS and elementary statistics.
When is OLS BLUE?
Peter Kennedy (2003:48-9) lists the five main assumptions underlying the so-called Classical Linear
Regression model, under which OLS gives the best linear unbiased estimators. Let me re-list these
assumptions and the possible violations as identified by Kennedy:
A1) The dependent variable can be calculated as a linear function of a set of specific independent
variables and an error term.
This assumption is crucial if the estimators are to be interpreted as decent “guesses” of the effects of
the independent variables on the dependent variable. Violations of this assumption lead to what is
generally known as “specification errors”. One should always approach quantitative empirical studies in
the social sciences with the question “is the regression equation specified correctly?”
[1] Some of the material is taken from an earlier working paper (Knutsen, 2008).
One particular type of specification error is excluding relevant regressors. This is for example crucial
when investigating the effect of one particular independent variable, let’s say democracy, on a
dependent variable, let’s say growth. If one important variable, let’s say level of income, is missing
from the regression equation, one risks facing omitted variable bias. The estimated effect of
democracy can now be systematically over- or understated, because income level affects both
democracy and economic growth. The democracy coefficient will pick up some of the effect on
economic growth that is really due to income. Identifying all the right “control variables” is a crucial
task, and disputes over proper control variables can be found everywhere in the social sciences.
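The omitted variable problem described above can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the data are artificial, the coefficients (0.8, 0.5, 1.0) are invented for the illustration, and the helper `ols` is a hypothetical convenience function built on NumPy's least squares routine.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients for y = Xb + e (X must already contain a constant)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(0)
n = 5000

# Invented data-generating process: income raises both democracy and growth,
# and democracy has a true effect of 0.5 on growth.
income = rng.normal(size=n)
democracy = 0.8 * income + rng.normal(size=n)
growth = 0.5 * democracy + 1.0 * income + rng.normal(size=n)

const = np.ones(n)
b_short = ols(np.column_stack([const, democracy]), growth)         # income omitted
b_long = ols(np.column_stack([const, democracy, income]), growth)  # income controlled

print("democracy coefficient, income omitted:   ", round(b_short[1], 2))
print("democracy coefficient, income controlled:", round(b_long[1], 2))
```

In this artificial setup the regression without income attributes part of the income effect to democracy, so its democracy coefficient lies well above the one obtained when income is controlled for.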
Another variety of this specification error is including irrelevant controls. If one for example wants to
estimate the total, and not only the “direct”, effect of democracy on growth, one should not
include variables that are theoretically expected to be intermediate variables. That is, one should not
include variables through which democracy affects growth. One example could be a specific type of
policy, A. If one controls for policy A, one controls away the part of the effect of democracy on
growth that is due to democracies being more likely to push through policy A, and one therefore does
not estimate the total effect of democracy on growth.
Another specification error that can be committed is assuming a linear relationship when the
relationship really is non-linear. In many instances, variables are not related in a fashion that is close
to linearity. Transformations of variables can however often be made that allow an analyst to stay
within an OLS-based framework. If one suspects a U- or inverted U-shaped relationship between two
variables, one can enter the square of the independent variable into the regression model alongside
the variable itself. If one suspects that the effect of an increase in the independent variable is larger
at lower levels of the independent variable, one can log-transform the independent variable. The
effect of an independent variable might also be dependent upon the specific values taken by other
variables, or be different in different parts of the sample. Interaction terms and splitting the sample
into subsamples are two suggested ways to investigate such matters.
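The squared-term transformation mentioned above can be sketched as follows. The data and coefficients are invented for the illustration, and only the inverted U-shaped case is shown; the log-transformation follows the same logic of changing the regressor while keeping OLS.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Invented inverted-U relationship: y rises with x up to x = 5 and falls afterwards.
x = rng.uniform(0, 10, size=n)
y = 2.0 + 3.0 * x - 0.3 * x**2 + rng.normal(size=n)

# A purely linear specification averages the rising and falling parts together.
X_lin = np.column_stack([np.ones(n), x])
b_lin = np.linalg.lstsq(X_lin, y, rcond=None)[0]

# Adding the squared term keeps us inside OLS but lets the slope depend on x.
X_quad = np.column_stack([np.ones(n), x, x**2])
b_quad = np.linalg.lstsq(X_quad, y, rcond=None)[0]

# The estimated effect of x changes sign where the fitted parabola peaks.
turning_point = -b_quad[1] / (2 * b_quad[2])

print("linear slope:", round(b_lin[1], 2))
print("estimated turning point:", round(turning_point, 2))
```

The linear specification reports a slope close to zero even though x matters a great deal, while the quadratic specification recovers a turning point near the value of 5 built into the artificial data.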
A2) The expected value of the disturbance term is zero
If this property is violated, there will be bias in the intercept, because OLS as a procedure forces the
average of the error terms to be zero. Note that the assumption is about the underlying structure of
the world, whereas the OLS-procedure simply imposes a zero average on the estimated error terms.
A3) The disturbance terms have the same variance and are not correlated [2]
Violations of these two properties result in two problems that are very commonly associated with
quantitative, empirical studies, namely heteroskedasticity and autocorrelation. Heteroskedasticity
implies that the disturbances do not have the same variances, and the variances are often a function
of specific independent variables. The disturbance terms might for example vary more for poor
countries than for rich countries in a given model of economic growth. Autocorrelation implies that
the disturbance terms are systematically correlated in one way or another. Unexplained growth in
year t for a country is for example likely to be correlated with unexplained growth in year t+1. In these
cases, OLS is not BLUE, and more efficient (smaller variance in the coefficient estimates) techniques
can be utilized.
[2] Notice that “disturbance term” refers to the real world deviations from the real world linear relationship,
whereas “error term” refers to the estimated deviation from the estimated linear relationship.
A4) Observations on the independent variable can be considered fixed in repeated samples.
The strictest interpretation of this assumption is that values on the independent variables are
manipulated by the researcher, as in experiments. However, a less strict form of exogeneity of the
independent variables is sufficient for OLS to function properly: The independent variables should
not be affected by the dependent variable. In the language of causality, the dependent variable must
be an effect and not a cause of the independent variable. In many cases, two variables can be both
causes and effects of each other. If so, OLS regressions will give biased results. For example, if high
GDP-levels increase average investment rates and high investment rates increase GDP-levels, an OLS
equation where investment rate is the independent variable and GDP the dependent variable will
systematically overstate the effect of investment rates on GDP.
Another problem related to assumption A4 is measurement error. If there is an unsystematic
measurement error in the dependent variable, we will get unbiased OLS-estimates, but with larger
standard errors. If we however have unsystematic measurement errors in the independent variable,
we will get biased estimates. In a bivariate regression, such measurement errors will tend to draw
the coefficient towards zero, and this bias is known as attenuation bias.
Readers are encouraged to take their favorite social science research question and think about how
systematic measurement errors might bias estimated relationships.
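The two claims about unsystematic measurement error can be checked in a small simulation. Again this is only a sketch: the data are artificial, the true slope of 1.0 and the noise levels are invented, and `slope` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

def slope(x, y):
    """Bivariate OLS slope of y on x (with an intercept)."""
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

x_true = rng.normal(size=n)
y_true = 1.0 * x_true + rng.normal(size=n)   # true slope is 1.0

y_noisy = y_true + rng.normal(size=n)        # random error in the dependent variable
x_noisy = x_true + rng.normal(size=n)        # random error in the independent variable

b_clean = slope(x_true, y_true)
b_noisy_y = slope(x_true, y_noisy)   # still centered on the true slope, only less precise
b_noisy_x = slope(x_noisy, y_true)   # drawn towards zero: attenuation bias

print(round(b_clean, 2), round(b_noisy_y, 2), round(b_noisy_x, 2))
```

With measurement noise as large as the variation in the true independent variable, as assumed here, the estimated slope is roughly halved, whereas noise in the dependent variable leaves the slope close to its true value.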
A5) There are more observations than independent variables and there are no exact linear
relationships between independent variables
The violation of the first part of the assumption points to the “degrees of freedom” problem, and the
violation of the second to the “perfect multicollinearity” problem. I will not dig deep into these
important issues here, but for a treatment of how these problems might also affect qualitative social
science research, see King et al. (1994). Let me give one example of the degrees of freedom problem,
where we have more independent variables than observations: If we have two countries, where one
experienced revolution and the other did not, and these countries differ in degree of development,
institutional structure and cultural background, it is impossible to discern which of these three
variables was crucial to the existence of revolution in one country and its non-existence in the
other. If we face negative degrees of freedom or perfect multicollinearity, a software package will
refuse to calculate results. However, one should note that approximations to such situations will also
create problems for inference. Few degrees of freedom or high multicollinearity will tend to give large
standard errors for coefficient estimates, thereby reducing the chance of obtaining significant
coefficients. When it comes to multicollinearity, if for example a high level of literacy and a high
degree of urbanization are strongly correlated with each other and with the probability of
democratization, it is hard to discern what particular effects the two variables have on
democratization. See Kennedy (2003) for a nice treatment of these issues.
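The difference between perfect and merely high multicollinearity can be sketched numerically. The variables below are artificial stand-ins for literacy and urbanization, and `coef_std_errors` is a hypothetical helper that computes the classical OLS standard errors.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200

literacy = rng.normal(size=n)
urban_exact = 2.0 * literacy                        # exact linear function of literacy
urban_close = literacy + 0.05 * rng.normal(size=n)  # almost, but not exactly, collinear

def coef_std_errors(X, y):
    """Classical OLS standard errors: square root of the diagonal of s^2 (X'X)^-1."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    s2 = resid @ resid / (X.shape[0] - X.shape[1])
    return np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))

y = 1.0 * literacy + rng.normal(size=n)
const = np.ones(n)

# Perfect multicollinearity: three columns, but only two independent directions.
X_exact = np.column_stack([const, literacy, urban_exact])
rank_exact = np.linalg.matrix_rank(X_exact)

# High multicollinearity: estimation works, but the literacy coefficient becomes imprecise.
se_alone = coef_std_errors(np.column_stack([const, literacy]), y)[1]
se_close = coef_std_errors(np.column_stack([const, literacy, urban_close]), y)[1]

print("rank with exact collinearity:", rank_exact)
print("std. error of literacy alone vs. with near-collinear control:", se_alone, se_close)
```

In the exact case the design matrix loses a rank, so no unique set of coefficients exists. In the near-collinear case the coefficients can be computed, but the standard error on literacy is many times larger than when literacy enters alone.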
Pooled data: Combining cross-section and time-series information
As social scientists have increasingly recognized, restricting the amount of information one uses
when drawing inferences to cross sectional snapshots or cross-sectional averages over long time
periods is often unnecessary. Analysis of cross-sectional data is therefore often substituted by
analysis of data with a pooled cross-section time-series (PCSTS) structure. The different
cross-sectional units have observations on several time periods, and this vastly increases the amount of
information available for inference. We can then draw on information from both cross-sectional and
temporal variation when making inferences. However, when one incorporates a time-structure in the
data, the problem of autocorrelation of disturbance terms flies directly in the analyst’s face. OLS is
therefore no longer appropriate, and one must switch to PCSTS methods. However, even with a pure
cross-sectional data structure, OLS analysis often encounters the other problem related to A3),
namely heteroskedasticity. Readers might know that this problem can be reduced by applying
Weighted Least Squares (WLS) instead of OLS in cross-section studies. Fortunately, PCSTS methods
can also deal with heteroskedasticity.
OLS with Panel Corrected Standard Errors
There are several varieties of PCSTS, but I will focus only on one, namely OLS with Panel Corrected
Standard Errors (PCSE). According to Beck and Katz (1995), OLS with PCSE is the most appropriate version
of PCSTS for data sets with relatively many cross-sectional units and relatively short time-series. This
is the situation for most data sets used in political economic research, and I will therefore not dwell
on the other varieties of PCSTS. OLS with PCSE allows us to estimate coefficients even when we face
unbalanced panels (time series are not equally long for all cross section units).
Luckily for those reading or doing quantitative social science research, OLS with PCSE, as the name
indicates, builds on the familiar OLS-framework. Learning OLS with PCSE does therefore not require
too much effort. This method can be used in several software packages, for example STATA.
Essentially, the calculation of estimates is based on an OLS-procedure. However, the technique can
take into account that disturbances in period t can be autocorrelated (within panels or generally)
with the disturbances in period t-1. It can also take into account that disturbances might have
different variances (that is they are heteroskedastic) between different panels. It can also deal with
the problem of a disturbance term in one cross section unit at time t being correlated with the
disturbances in other cross section units at time t. This latter phenomenon is called
contemporaneous correlation.
Let me provide an example: If we run a model on economic growth, the procedure takes into account
that the disturbance term for Germany in year t (unexplained growth; maybe extraordinarily low
growth because of a recession) is correlated with the unexplained growth in Germany in year t-1 and
with unexplained growth in France in t. The procedure also incorporates the possibility that Germany
might have a lower variation in its disturbance term than France. These features of OLS with PCSE
mitigate the problems related to A3), which would plague “ordinary” OLS. Moreover, the
interpretation of the coefficients in OLS with PCSE is exactly the same as the interpretation of
OLS-coefficients: An increase in x1 by one unit, holding all other independent variables constant,
increases the predicted y by β1. Therefore, there is not much more to be said here about OLS with
PCSE. Interested readers are encouraged to check out Beck and Katz (1995) and Sayrs (1989).
Panel data methods, Fixed Effects and Random Effects
We now move on to two techniques that are much used in contemporary social science research,
namely Fixed- and Random Effects. These two techniques resemble each other closely, and we start
with Fixed Effects, which is easier to grasp. We remain in a panel data structure, with cross-section
units observed at many time points. The main objective is of course still to estimate or test the
effects of particular independent variables on the dependent variable. However, if we have different
assumptions about “how the world looks”, we will be obliged to use different methodologies. In OLS
with PCSE, it can be said, somewhat simplistically, that differences in X going together with observed
differences in Y were used for inference regardless of whether the differences were observed
along the time dimension within a unit or between two units at the same or different time points.
Let us concretize by assuming that we have a model where democracy affects economic growth,
which we want to investigate empirically. In the OLS with PCSE set-up, the fact that Afghanistan had
a low level of democracy and low economic growth in 1987 and that Norway had a high level of
democracy and a high level of growth in 2003 was used as information for inference. The same went
for information relying on comparisons of Norway in 1850 and 2003. We do of course control for
other variables, but the main point is that both cross-sectional and temporal variation is used as
information for drawing inferences. What if there are non-observed country specific factors, not
included in the regression analysis framework, that determine both the rate of growth and the
degree of democracy in a country? More generally, what if there are non-observed factors that are
specific to each cross section unit and that affect both the independent and dependent variables? In
that case, OLS with PCSE is inappropriate, since we should have controlled for such cross-section unit
specific effects.
If we believe that we have not identified all variables that are relevant to the analysis, and that we
therefore have such cross-section unit specific effects, one solution is to run a so-called Fixed Effects
regression. This analysis incorporates dummy-variables for all the cross-section units. Thereby, going
back to our example again, one will in Fixed Effects only infer the effect of democracy on economic
growth from investigating variation within nations along the time dimension as they become more or
less democratic. In this sense, Fixed Effects analysis is a very restrictive analysis, since it does not
allow us to infer anything about causal effects from cross-national variation. We still investigate the
effect of democracy on growth, but we are not allowed to use information from the
Afghanistan-Norway comparison, for example. The main benefit is that we reduce the possibility of
omitted variable bias. We take away the possibility that unidentified variables tied to stable national
features are driving our results, and thereby biasing our estimates.
In Fixed Effects, we can also incorporate dummies for different time periods, thereby reducing the
possibility of time-specific effects driving results. One can run Fixed Effects with dummies only on
cross-sections, only on time periods, or on both. The choice depends on our assumptions of the
workings of the world in our particular research question.
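The logic of Fixed Effects with cross-section dummies can be sketched on an artificial panel. Everything below is an assumption made for the illustration: the panel dimensions, the true effect of 0.5, and the helper functions `slope` and `demean_by_unit`. The sketch also uses the fact that including one dummy per unit gives the same slope as first subtracting each unit's own mean from every variable (the so-called within transformation).

```python
import numpy as np

rng = np.random.default_rng(4)
n_units, n_periods = 50, 20

# Invented panel: an unobserved, time-constant country factor raises both
# democracy and growth. The true effect of democracy on growth is 0.5.
unit = np.repeat(np.arange(n_units), n_periods)
country_effect = rng.normal(size=n_units)[unit]
democracy = country_effect + rng.normal(size=unit.size)
growth = 0.5 * democracy + 2.0 * country_effect + rng.normal(size=unit.size)

def slope(x, y):
    """Bivariate OLS slope of y on x (with an intercept)."""
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

def demean_by_unit(v):
    """Subtract each unit's own mean over time (the within transformation)."""
    unit_means = np.bincount(unit, weights=v) / np.bincount(unit)
    return v - unit_means[unit]

# Pooled OLS mixes cross-sectional and temporal variation.
b_pooled = slope(democracy, growth)

# Fixed Effects via demeaning: only variation within countries over time is used.
b_within = slope(demean_by_unit(democracy), demean_by_unit(growth))

# Fixed Effects via one dummy per country (no common intercept, to avoid collinearity).
dummies = (unit[:, None] == np.arange(n_units)[None, :]).astype(float)
b_dummy = np.linalg.lstsq(np.column_stack([democracy, dummies]), growth, rcond=None)[0][0]

print(round(b_pooled, 2), round(b_within, 2), round(b_dummy, 2))
```

In this setup the pooled estimate is pushed well above 0.5 by the unobserved country factor, while the two Fixed Effects routes agree with each other and land close to the true effect.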
However, we risk wasting a lot of information when using Fixed Effects to draw inferences. What if
the difference between growth rates in Afghanistan and Norway is partly due to the fact that
Norway is more democratic? In that case, on our quest to reduce omitted variable bias we risk
wasting valuable information. Fixed Effects risks “throwing the baby out with the bath water” (Beck
and Katz, 2001). This contributes to reduced efficiency in the Fixed Effects estimators. Since they are
not using all relevant information, the estimators tend to have larger standard errors than if we were
to use techniques that utilized more information. It is therefore more likely that we commit type II
errors, by not identifying relevant effects.
Fixed Effects analysis assumes that each cross-section unit has its own specific intercept (the specific
dummy variable-coefficient plus the common intercept) in the regression. Random Effects analysis
moderates this assumption. Random Effects, like Fixed Effects, creates a different intercept for each
cross-section unit, “but it interprets these differing intercepts in a novel way. This procedure views
the different intercepts as having been drawn from a bowl of possible intercepts, so they may be
interpreted as random … and treated as though they were part of the error term” (Kennedy,
2003:304). Under the assumption that the intercepts are truly randomly selected, which means that
they must be uncorrelated with the independent variables, Random Effects gives increased efficiency
to the estimates when compared to Fixed Effects. That is, the coefficients will have smaller standard
errors. However, Random Effects will be biased if the error term is correlated with any of the
independent variables.
This last segment can be hard to grasp. It is sufficient for the reader to know that Random Effects
also controls for country specific effects, but is more efficient than Fixed Effects under certain
assumptions (country specific effects are not correlated with independent variables). In practice, this
can often lead to Random Effects finding significant effects when Fixed Effects does not. However, if
the country specific effects are highly correlated with certain independent variables, Random Effects
might be biased, and one should use Fixed Effects. Both Fixed Effects and Random Effects can be
calculated in STATA, and there are several varieties of both techniques. The estimation procedure
can for example rely on GLS or Maximum Likelihood estimation procedures (if you want to know
more about these general estimation procedures, you can look them up in any econometric
textbook, but these differences are of little relevance to us here). One can also incorporate the
possibilities of heteroskedasticity or autocorrelation into the estimation procedure.
The endogeneity problem: What if Y affects X? 2SLS!
The issue of reverse causality permeates many studies in the social sciences, thereby violating A4).
In the study of the economic effects of democracy, for example, this problem comes to the
fore. Economic factors are likely to influence political organization, and a correlation between
democracy and economic growth cannot readily be attributed to the causal effect of democracy on
growth. Lagging the independent variable could be seen as one way to try to deal with the issue of
reverse causality, by exploiting the temporal sequence of cause and effect. However, there exist
other and more solid statistical solutions. One proposed solution, very often used in the econometric
literature, is to find so-called instrumental variables, or instruments, for endogenous independent
variables. There are two requirements for a variable to be a valid and “decent” instrument for an
endogenous independent variable. First, the instrument should be correlated with the independent
variable. If the correlation is low we will often face very large standard errors for the estimated
coefficients when using Instrumental Variable analysis. Second, an instrument should not be directly
related to the dependent variable. This means that the instrument should only be correlated with the
dependent variable through the independent variable it instruments for. The intuition behind the
procedure is that we only utilize the “exogenous” part of the variation in the independent variable
that is related to the exogenous instrument. We thereby get a better estimate of the causal effect
from the independent variable on the dependent. If this second condition is not satisfied, the
resulting estimates from the analysis will not be consistent.
[Figure 2: Causal structure underlying Instrumental Variable analysis]
A common technique based on the use of instrumental variables is Two Stage Least Squares (2SLS).
There can be more than one instrument incorporated in a 2SLS analysis. Let us exemplify such
analysis by once again turning to the question of whether democracy increases economic growth
rates. We now recognize the problem that democracy might be endogenous to growth, and we need
to find a proper exogenous instrument for democracy. The procedure followed is to first use OLS on
an equation where democracy (the endogenous independent variable in the original regression) is
the dependent variable, and the instrument(s) and the control variables from the original regression
are entered as right hand side variables. We then take the predicted, instead of the actual,
democracy values from this first regression and enter them into the original regression equation. We
do regression analysis in two stages, where democracy is the dependent variable in the first stage,
and economic growth is the dependent variable in the second stage. The instruments only enter into
the regression equation in the first stage, but the regular control variables are used in both stages.
The 2SLS procedure does not give us unbiased estimates, but it gives us consistent estimates.
Consistency implies that the estimator converges to the real value as the number of observations
increases (loosely put, the bias vanishes in large samples). This should make us wary of relying
on 2SLS in small samples. 2SLS can be used on cross-sectional data, but it also has panel data
versions. Both of these estimation procedures are incorporated in the STATA-package.
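The two stages described above can be written out by hand on artificial data. This is a sketch under stated assumptions: the instrument `z`, the control, and all coefficients are invented, and endogeneity is generated through an unobserved shock that moves both democracy and growth, which stands in for the reverse causality story. Note also that running the second stage as a plain OLS regression gives usable coefficients but not the correct standard errors; dedicated 2SLS routines adjust for the fact that the democracy values are predicted.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000

def ols(X, y):
    """OLS coefficients for y = Xb + e (X must already contain a constant)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Invented setup: the true effect of democracy on growth is 0.5.
z = rng.normal(size=n)         # instrument: shifts democracy, no direct effect on growth
control = rng.normal(size=n)   # regular control variable
shock = rng.normal(size=n)     # unobserved factor that makes democracy endogenous
democracy = 0.7 * z + 0.3 * control + shock + rng.normal(size=n)
growth = 0.5 * democracy + 0.4 * control - 1.0 * shock + rng.normal(size=n)

const = np.ones(n)

# Naive OLS: democracy is correlated with the disturbance, so the estimate is biased.
b_ols = ols(np.column_stack([const, democracy, control]), growth)

# Stage 1: regress democracy on the instrument and the control, keep the predictions.
Z = np.column_stack([const, z, control])
democracy_hat = Z @ ols(Z, democracy)

# Stage 2: replace actual democracy with predicted democracy in the growth equation.
b_2sls = ols(np.column_stack([const, democracy_hat, control]), growth)

print("OLS estimate: ", round(b_ols[1], 2))
print("2SLS estimate:", round(b_2sls[1], 2))
```

In this artificial example the OLS estimate is pulled far below the true effect, while the two-stage estimate lies close to 0.5 in a sample of this size.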
One “problem” with 2SLS is that we tend to get relatively large standard errors for the coefficient of
the endogenous independent variable, especially if the correlation with the instrument is low. It is
therefore often difficult to find significant 2SLS results. Another problem is that it is very difficult to
find truly exogenous instruments that are not directly related to the dependent variable. I encourage
the reader to look up Acemoglu et al. (2001), who use settler mortality among colonists in former
colonies as an instrument for institutional structure, in the study of how institutions affect economic
development. Their main point is that settler mortality levels in the colonial era have no direct link to
level of development today. Settler mortality is however related to institutional structure today,
since it affected the probability of colonizers settling down in colonies and building institutional
structures. These historical institutional structures, because of institutional inertia, affect the nature
of institutions today. The instrument is therefore correlated with the independent variable of
interest, institutions, but the instrument is not directly linked to the dependent variable,
development. It can therefore readily be used in a 2SLS framework, and the estimated effect found,
it is argued, is not open to the attack that the coefficient is due to development affecting institutions.
Non-linearity: Matching
Regression-based techniques assume linear effects, clearly stated in A1), and this could be
problematic when investigating particular social science research questions. Imposing a linearity
restriction might be too strong an assumption to make, thereby leading to too crude an estimation
procedure. Recently, there has been some interest in so-called matching-techniques among
researchers studying political-economic topics. Persson and Tabellini (2003) use matching in their
study of the economic effects of different forms of constitutions and electoral systems. Matching is a
so-called non-parametric estimation technique, where we relax assumptions of functional form. We
do not have to make an initial assumption on whether the relationship is linear or has any other
particular functional form, and we do not have to assume that the effect is independent of values on
contextual variables. As is so often the case in econometric work, however, relaxing strict
assumptions bears with it a cost in terms of reduced efficiency; that is, we tend to get relatively large
standard errors for the estimates. This is analogous to the situation when we move from OLS-based
PCSTS to the more robust but less efficient 2SLS technique (there is no free lunch, as economists like
to point out).
Matching-techniques draw on experimental logic: “The central idea in matching is to approach the
evaluation of causal effects as one would in an experiment. If we are willing to make a conditional
independence assumption, we can largely re-create the conditions of a randomized experiment, even
though we have access only to observational data” (Persson and Tabellini, 2003:138). The main
underlying idea is that we split our independent variables into two groups, the control variables, and
the treatment variable, whose effect we are interested in investigating. Further, we need to
dichotomize the treatment variable and assume so-called conditional independence; that is, we need
to assume that, given the values on the control variables, selection on the treatment variable is
uncorrelated with the dependent variable. If specific units self-select to a certain value on the
treatment variable, this will pose trouble for inference.
Matching is based on the underlying idea that we should compare the most similar units, for example
most similar countries. In this sense, the logic does not only reflect that of experiments, but also that
of the Most Similar Systems logic utilized in small-n studies in comparative politics (John Stuart Mill’s
“Method of Difference”). We make “local” comparisons over units that are relatively similar on all
variables but the treatment variable we are interested in investigating (for example political regime
dichotomized into democracy and dictatorship). Then we estimate the effect of the treatment variable
after we have looked at how the matched units differ on the dependent variable. As Persson and
Tabellini (2003:139) put it, we try to find “twins” or a “close set of close relatives” to each
observation, but these most similar countries need to differ on the treatment variable, for example
degree of democracy, as used in previous examples of studies on how democracy affects economic
growth. The estimated effect of democracy is computed for each of the pairwise comparisons made,
and we then go on and calculate an average of these effects to get our final (generalized) estimate.
“Matching allows us to draw inferences from local comparisons only: as we compare countries with
similar values of X [characteristics in terms of values on the control variables], we rely on
counterfactuals that are not very different from the factuals observed” (Persson and Tabellini,
2003:139). One example of a good pairwise match in the democracy-growth example might be Benin
and Togo. Both have relatively similar values on potential control variables such as colonizer (France),
location (West Africa), level of development etc. However, Benin can be classified as democratic, and
Togo as dictatorial on the treatment variable.
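The logic of pairwise matching can be sketched with a very simple nearest-neighbor procedure. This is not the estimator used by Persson and Tabellini or by any particular software package; it is a minimal illustration under invented assumptions: a single control variable `x`, one match per unit, matching with replacement, and a true treatment effect of 1.0.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2000

# Invented data: x is a control variable (say, level of development),
# d is the dichotomous treatment (1 = democracy), y is growth.
x = rng.uniform(0, 1, size=n)
d = (rng.uniform(size=n) < 0.2 + 0.6 * x).astype(int)  # treatment more likely at high x
# The outcome is non-linear in x; the true treatment effect is 1.0 for every unit.
y = 1.0 * d + np.sin(6 * x) + 0.3 * rng.normal(size=n)

treated = np.flatnonzero(d == 1)
untreated = np.flatnonzero(d == 0)

def nearest(i, pool):
    """Index of the unit in `pool` whose x is closest to that of unit i."""
    return pool[np.argmin(np.abs(x[pool] - x[i]))]

# For each unit, the most similar unit from the other group supplies the counterfactual.
effects = np.empty(n)
for i in range(n):
    if d[i] == 1:
        effects[i] = y[i] - y[nearest(i, untreated)]
    else:
        effects[i] = y[nearest(i, treated)] - y[i]

ate_matching = effects.mean()                            # average of the pairwise effects
naive_difference = y[d == 1].mean() - y[d == 0].mean()   # ignores differences in x

print("matching estimate:", round(ate_matching, 2))
print("naive difference: ", round(naive_difference, 2))
```

Because treated and untreated units differ systematically on x in this setup, the naive difference in means is clearly off target, whereas the average of the local comparisons comes close to the true effect without any assumption about how y depends on x.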
There are several different versions of the matching procedure, and there are specifications that
need to be made before one can start estimating. The literature on the use of matching in social
sciences is growing rapidly at this point in time, and I will not survey the different techniques, since
the point here is to show the general logic. Matching can for example be performed with and without
replacement. Replacement indicates that the same unit can be used as a match several times. Benin
can for example be used as a match for both Togo and Guinea. Packages used for matching
estimation can be downloaded from the internet and used in STATA. One example is the package
related to the “nnmatch” command.
One crucial specification that we must make initially is the number of “similar” experiences we want
to compare with when estimating effects from the treatment variable. We can use one or more than
one match for each unit. A second specification is whether we want to adjust for possible biases in
specific ways (Abadie and Imbens, 2002). A third is how we want to calculate the standard errors.
When it comes to the first, using several cases as matches of course increases the amount of
information one bases inferences upon, but we are then at risk of comparing a unit with other units
that are relatively dissimilar to it. When it comes to bias-adjustment, one bias-adjustment
procedure has been specified by Abadie and Imbens (2002), but there are several others available.
When it comes to standard errors, there is often good reason to believe that the disturbances are
heteroskedastic, and it is therefore in many instances recommended to use “robust standard errors”.
STATA calculates such robust standard errors by running through a second matching process, and in
this second stage matches are done between observations that have similar values on the treatment
variable. The resulting standard errors are heteroskedasticity-consistent.
A move from cross-sectional to pooled cross-sectional time-series data increases the amount of
information one can use when drawing inferences. Quantitative researchers in the social sciences
therefore increasingly use such data structures. One problem for students and researchers educated
only in OLS, or alternatively in WLS, is that these techniques run into serious problems under such a
data structure, with autocorrelation being a main scourge. Fortunately, there are techniques closely
resembling the logic of OLS that can be used to analyze such situations. One simple extension is
OLS with Panel Corrected Standard Errors. Estimators are here calculated on the basis of both
cross-sectional and temporal variation, with for example country-year being the unit of analysis. However,
if there are non-observed country-specific effects that strongly drive results, analysts are encouraged
to switch to Fixed Effects, or alternatively the more lenient Random Effects. Fixed Effects incorporates
dummies for each cross-section unit, and should be beloved by those who claim that each country is
so special that inter-country comparisons cannot be used for inference. This is however a very strong
claim, often bordering on nihilism (Beck and Katz, 2001), and Fixed Effects might therefore waste a lot
of valuable information.
Endogeneity is a general problem in the social sciences, and I sketched a procedure that is
constructed for dealing with endogenous independent variables, namely 2SLS. 2SLS yields consistent
estimates, but standard errors are generally very large. Moreover, finding proper instruments is a
very difficult task, and this could be one of the reasons 2SLS has not diffused more widely into
political science. Matching draws on experimental logic, and this type of analysis allows analysts to
avoid assuming linearity of effect. Every unit is compared with one or more similar units that differ on
the treatment variable of interest, and treatment effects are estimated and finally averaged into an
average treatment effect. These two latter techniques are arguably more complex than OLS, and in
order to understand them properly, interested readers are encouraged to dig into the literature on
these techniques. 2SLS is widely used in economics, and matching is more used in psychology,
medicine and biology. However, these techniques are superior to simpler techniques in several
situations, and my guess is that one will see more widespread use of 2SLS and matching in political
science in the years to come.
